Word disambiguation combining LDA and word embeddings
Topical search tries to take into account the occurence of search-term-related words to avoid matching irrelevant documents. This post lines out how LDA can be used to disambiguate a word-set obtained using word embeddings.
'Java' in google news. A mix of programming, coffee farmer and Indonesia news.
Most people are not aware of the massive ambiguity of human language. Even those who should know constantly underestimate it ;). So when doing a plain by-word search, results easily may contain 50% to 70% of unrelated (in the sense of "I didn't mean that Java") documents (see image above). In order to improve on that, search results can be sharpened by also requiring search term related words to occur inside matching documents. Unfortunately capturing topical "related words" of a given search word using "word embeddings" still suffers from ambiguity.
e.g. 'piano' and 'violin' will have a high similarity score, while 'piano' and 'organ' will have a comparable low similarity score as 'organ' is ambiguous (medicine and music).
The basic mechanics of word embeddings and why they work
"Word embeddings" (also the foundation of the word2vec, glove and similar algorithms) are a way to capture the meaning of a word by looking at its neighbourhood words.
Consider the following sentences:
I fed my pet.
I fed my dog.
I fed my cat.
There are two ways to extract relationship information :
- "pet", "dog" and "cat" are somehow similar as they are replacable in an identical context. (replacability)
- "fed" relates to "pet", "dog", "cat". (neighbourhood)
When running this kind of analysis on huge amounts of text (millions of documents), one obtains a (weighted) set of related words for each word analyzed. Results of word embedding analysis can be used to construct a vector space allowing for famous and often cited word-level topical algebra such as 'king-man=queen'.
Issue is, if a word has multiple, context-dependent meanings, those word vectors reflect that ambiguity resulting in inaccuracy of similarity/related-word metrics.
What's LDA ?
LDA stands for "Latent Dirichlet allocation" and is an algorithm to cluster words by evaluating the likeliness of words to occur in the same document (simplified). It then creates N (=user defined number of topics to find) weighted sets of topical related words.
The resulting word-sets are unlabeled (there is no name for a computed topic) and do not necessarily match human mind topics. E.g. if a lot of documents deal with politics and terrorism, it will combine "politics" and "terrorism" into a single topic.
Example of an arbitrary LDA topic (N=1000) created from wikipedia
object-oriented monty ide annotation petri mpi python compiler ee asp jar java php interface lang framework logic implementation module programming net component object independence method code library language ..
As one can see, this topic is related to "programming", but mixes up several subtopics into a single one (asp, python, java, php, petri networks).
Note that the programming language "python" was named after "monty python", seems this fact is mentioned often enough in the training set to make LDA correlate "monty" with "programming" :)
Applying disambiguation combining word embeddings and LDA
Let's take the topic 'Java' as an example:
"Java" can be an island (2 islands: one near amsterdam and the one of Indonesia), a beverage, a dance, a chicken, several towns, a programming language, and many more (see Wikipedia)
Below the related-word set for the topic 'Java' computed using word embeddings. Obviously two major topics compete here: "java programming language" and a topic related to the "java island" of Indonesia (in bold):
object-oriented sql iphone yunnan monty gui virtual-machine ide food-beverage east-timor free-software ipad bhutan mobile-device blackberry smartphone firefox sdk internet-explorer desktop window-phone cross-platform hyatt balinese annotation lombok schema unix caledonia vm batavia assembler web-page memoriam fox-sport am-main file-format papua alla new-caledonia north-coast source-software babar software-development petri gamelan johor dil east-indie mpi sumatra borneo bali sarawak programming-language timor ..
What we want is
- find which meaning is the most common / likely interpretation of 'Java' (in order to have a best guess to default to).
- disambiguate the topic "Java" into "Java/programming" and "Java/island" and "Java/coffee" ...
For this, I use LDA-topics obtained from the same training corpus. A lookup for java related LDA-topics delivers (shortened topic and word list):
0.99788508988368 malaysia indonesia malay singapore sultan java sarawak jakarta brunei state kuala-lumpur sabah chinese borneo penang malaya regency ..
0.9975599837332249 language object type code program java application example implementation programming use interface tool function class system ..
0.9925407925407925 singapore burma burmese myanmar thailand asia tan malaysia indonesia chinese lim shan southeast-asia asian china java thai yangon british ..
0.9849439775910365 data application database software system user server tool version support information service file web oracle product feature format ..
0.8894092219020173 device phone audio video user screen technology feature mobile display use software model application computer digital version player ..
0.8729598588442876 bit instruction memory program code value data register operation function example address type use set processor system pointer time ..
0.8341307814992025 dutch van netherland amsterdam holland belgium brussel het flemish jan antwerp der den hague flander van-der external utrecht rotterdam een ..
0.7989795918367347 ship sea expedition captain crew voyage vessel boat sailor coast men pirate island journey day port water ocean explorer board return year ..
0.7922971114167813 carrier japanese ship force attack fleet aircraft destroyer day operation plane harbor escort group cruiser war island enemy patrol american ..
0.7497263772345859 attack force japanese aircraft operation raid ship carrier destroyer british german day landing american fire battle convoy hit bomber troop ..
0.7453018792483006 moon earth planet sun orbit jupiter venu object surface system asteroid mercury time comet observation crater year neptune distance solar ..
0.7228714524207012 specie snake spider genus lizard reference new description name gecko pp external family viper natural-history reptile length range ..
0.630074126367808 beer drink sugar bottle brewery alcohol milk beverage brand product ale juice water cola flavor lemon fruit production brewing drinking ..
0.617198838896952 temple shrine place festival lord day goddess king year hindu worship image pagoda name statue god idol people ancient buddha century shiva ..
0.6105867870332375 restaurant food chef coffee mcdonald chocolate cooking kitchen cook recipe menu cookie bar cookbook meal cuisine pizza location external ..
Obviously there are some surprises in there (military operations probably related to 'java sea", religion, animals, planets [as java was created by 'Sun Microsystems' :)) ] etc.) .
In order to disambiguate, we cluster matching LDA topics using a similarity metric (e.g. tech, geographic, sea war, food).
To clean our initial multiple-topic word-set, we just take all words of the largest LDA topic cluster and apply it as a positive filter to the initial word-set.
When filtering the initial "Java" word-vector using the superset of all tech- and java-related LDA-topics we get:
object-oriented free-software sql virtual-machine ide mobile-device internet-explorer desktop window-phone cross-platform annotation schema vm assembler web-page file-format source-software software-development programming-language xml browser syntax compiler operating-system ee open-source android asp proprietary jar interface app framework programmer hardware logic processor tablet pl server software implementation database keyboard protocol feed module web programming index developer file client phone ..
Using this refined word-set to sharpen search results removes off-topic documents with high accuracy.
Finding the most common topical meaning
- for each word of the initial word set, find N best matching LDA topics
- take the overall most often matched LDA-topics from (1) and find similar LDA-topics using a similarity metric
- use these combined as positive filter on the word-set obtained from word-embeddings
Transferability to other languages
This algorithm transfers well, e.g. German:
after disambiguation, we get:
open-source compiler world-wide license quellcode template servers objektorientierte plugin kernel lego frameworks entwicklungsumgebung freie-software programmiersprache browser oracle programmiersprachen interface framework ...
Limits / possible improvements
- as disambiguation is performed after analysing word-embeddings, rare meanings of a word are washed out. Consequently only popular meanings are captured, rare ones get lost. This could be avoided by performing disambiguation upfront word-embedding analysis (basically splitting a single word into multiple ones upfront. E.g. 'Java (island)', 'Java (programming)', ...).
- Labeling LDA topics is a staff and time consuming task. At juptr.io we currently fall back to "most-common-topic" search as this does not require labeling with each model iteration (staff/time constraints).
- Highly abstract topics (e.g. "politics", "psychology") cannot be captured well using word embeddings. For those, using appropriate LDA-topics as a word-set instead of word-embedding based sets might deliver better results.
- Some topics change quickly (e.g. news related), so a cyclic recomputation of the underlying LDA and word embedding model can improve results for those.