Building a spell-checker with FastText word embeddings
Word vector representations with subword information are great for NLP modeling. But can we make lexical corrections using a trained embeddings space? Can its accuracy be high enough to beat Peter Norvig’s spell-corrector? Let’s find out!
Let’s first define some terminology.
What are word embeddings?
Word Embedding techniques have been an important factor behind recent advancements and successes in the field of Natural Language Processing. Word Embeddings provide a way for Machine Learning modelers to represent textual information as the input to ML algorithms. Simply put, they are a hashmap where the key is a language word, and the corresponding value is a vector of real numbers fed to the models in place of that word.
There are different kinds of word embeddings available out there that vary in the way they learn and transform a word to a vector. The word vector representations can be as simple as a hot-encoded vector, or they can be more complex (and more successful) representations that are trained on large corpus, take context into account, break the words into subword representations etc
What are subword embeddings?
Suppose you have a deep learning NLP model, say a chatbot, running in production. Your customers are directly interacting with the ML model. The model is being fed the vector values, such that each word from the customer is queried against a hashmap and the value corresponding to the word is the input vector. All of the keys in your hashmap represents the vocabulary, i.e. all the words you know. Now, how would you handle a case where the customer uses a word that is not already present in your vocabulary?
There are many ways to solve this out-of-vocabulary (OOV) problem. One popular approach is to split the words into “subword” units, and use those subwords to learn your hashmap during model training stage. At the time of inference, you would again divide each incoming word into smaller subword units, find the word vector corresponding to each subword unit using the hashmap, then aggregate each subword vector to get the vector representation for the complete word.
For example, let’s say we have the word
tiktok, the corresponding subwords could be
ktok etc. This is the character n-gram division for the input word, where n is the subword sequence length, and is fixed by the modeler.
Figure: example subword representations
What is FastText?
FastText is an open-source project from Facebook Research. It is a library for fast text-representations and classifications. It is written in C++ and supports multiprocessing. It can be used to train unsupervised word vectors and supervised classification tasks. For learning word-vectors it supports both skipgram and continuous bag-of-words approaches.
FastText supports subword representations such that each word can be represented as a bag of character n-grams in addition to the word itself. Incorporating finer (subword level) information is pretty good for handling rare words. You can read more about the FastText approach in their paper here .
For an example, let’s say you have a word “superman” in FastText trained word embeddings (“hashmap”). Let’s assume the hyperparameters minimum and maximum length of ngram was set to
4. Corresponding to this word, the hashmap would have the following keys:
Original word: superman
where “<” and “>” characters mark the start and end of a word respectively
The Task: Spell-Checking
In NLP, the learnt word embedding vectors not only have lexical representations for words, but the vector values also have semantically related positioning in the embedding space. We are going to try and build a spell-checker application based on FastText word vectors such that given a misspelled word, our task will be to find the word vector representation closest to the vector representation of that word in trained embedding space.
We will work based on this simple heuristic:
IF word exists in the Vocabulary
Do not change the word
Replace the word with the one closest to its sub-word representation
Dataset & Benchmark
For fun, let’s build and evaluate our spell-checker on the same training and testing data as this classic article: “How to write a Spelling Corrector” by Peter Norvig , Director of Research at Google. In this article, Norvig build a simple spelling corrector based on basic probability theory. Let’s see how does this FastText based approach hold up against it.
Installation for FastText is straightforward. FastText can be used as a command line tool or via Python client. Click here to access the latest installation instructions for both approaches.
Method 1: Using Pre-trained Word Vectors
FastText provides pretrained word vectors based on common-crawl and wikipedia datasets. The details and download instructions for the embeddings can be found here. For a quick experiment, let’s load the largest pretrained model available from FastText and use that to perform spelling-correction.
Download and unzip the trained vectors and binary model file.
There are two files inside the zip:
crawl-300d-2M-subword.vec: This file contains the number of words (2M) and the size of each word (vector dimensions; 300) in the first line. All of the following lines start with the word (or the subword) followed by the 300 real number values representing the learnt word vector.
crawl-300d-2M-subword.bin: This binary file is the exported model trained on the Common-Crawl dataset.
As mentioned in his article, Norvig used spell-testset1.txt and spell-testset2.txt as development and test set respectively, to evaluate the performance of his spelling-corrector. I’m going to load the pretrained FastText model and make predictions based on the heurisitic defined above. I’m also going to borrow some of the evaluation code from Norvig.
For the sake of brevity, I’m reducing the output log to just the metrics trace.
0% of 270 correct at 3 words per second.
0% of 400 correct at 4 words per second.
That didn’t go well! 😅 None of the corrections from the model were right. Let’s dig into the results. Here are a few snapshots from the output log:
word: sysem exists in the vocabulary. No correction required
word: controled exists in the vocabulary. No correction required
word: reffered exists in the vocabulary. No correction required
word: irrelavent exists in the vocabulary. No correction required
word: financialy exists in the vocabulary. No correction required
word: whould exists in the vocabulary. No correction required
found replacement: reequipped for word: reequired
found replacement: putput for word: oputput
found replacement: catecholate for word: colate
found replacement: yeahNow for word: yesars
found replacement: detale for word: segemnt
found replacement: <li><strong>Style for word: earlyest
Looks like there are a lot of garbage subwords in the pretrained vocabulary that directly matches our misspelled input. Also, the performed lexical corrections show that the model replaced misspelled input words with the closest semantic neighbor. However none of those neighbors are meaningful English words.
We can verify this by checking out the neighbors for random misspellings.
As you can see, misspelled words have misspelled semantic neighbors. And the top ones also seem syntactically related. This makes sense. But we were hoping that the subword approach would be able to prioritize the potentially correct lexical variants of the word, and not just it’s semantic neighbors.
wordNgrams(max length of word ngram) was set to
5during training. If we train our own word vectors, we could keep the
1, so that FastText trains with 0 neighbors (i.e. each word is considered a line on it’s own).
The pre-trained vectors do not seem to be sufficient for this task. Let’s try training our word embeddings and see what results we can get.
Method 2: Trained Our Own Word Vectors
For a fair comparison with Norvig’s spell-checker, let’s use the same training data that he used (big.txt).
The training data has around 128K lines, 1M words, 6.5M characters.
As discussed above, we should try with keeping the
wordNgrams hyperparameter to
1, and use the trained FastText model to perform spell-checking.
There are a couple of changes in this code. We are training the model with specific hyper-parameters, and the output traces will inform us what went wrong in our decisions. Note that FastText doesn’t provide an option to set seed in the above implementation, so the results may vary by 1%-2% on every execution.
This is a big improvement! 😄 We have almost the same accuracy as Norvig’s spell-corrector.
Here are Norvig’s results for comparison.
Looking at some of the edits in the output traces, I can see that the model output isn’t essentially incorrect, but the model is biased to certain edit-based operations.
Edited reffered to referred, but the correct word is: refered
Edited applogised to apologized, but the correct word is: apologised
Edited speeking to seeking, but the correct word is: speaking
Edited guidlines to guideline, but the correct word is: guidelines
Edited throut to throughout, but the correct word is: through
Edited nite to unite, but the correct word is: night
Edited oranised to organised, but the correct word is: organized
Edited thay to thy, but the correct word is: they
Edited stoped to stooped, but the correct word is: stopped
Edited upplied to supplied, but the correct word is: applied
Edited goegraphicaly to geographical, but the correct word is: geographically
As you can see our trained model is nicely producing dictionary-based words as output. It’s likely that with contextual training approaches and evaluations, along with more training data, we can come up with an even better approaches that would understand context in a full sentence and produce the correct word as the spell-checked output.
Let’s also compare the model with the pretrained model from method one.
The trained model’s outputs are much more sensible now and the topmost word is also the word that we are actually looking for.
Our model was able to match Peter Norvig’s spell-corrector. 😊 Looking at the failing cases, we realize that the model could potentially do even better with more training data and a contextual training and evaluation strategy.