Building a spell-checker with FastText word embeddings

Sumit Kumar included in category NLP

2020-07-18 2020-07-18 2238 words 10 minutes

Contents

Word vector representations with subword information are great for NLP modeling. But can we make lexical corrections using a trained embeddings space? Can its accuracy be high enough to beat Peter Norvig’s spell-corrector? Let’s find out!

Introduction

Let’s first define some terminology.

What are word embeddings?

Word Embedding techniques have been an important factor behind recent advancements and successes in the field of Natural Language Processing. Word Embeddings provide a way for Machine Learning modelers to represent textual information as the input to ML algorithms. Simply put, they are a hashmap where the key is a language word, and the corresponding value is a vector of real numbers fed to the models in place of that word.

There are different kinds of word embeddings available out there that vary in the way they learn and transform a word to a vector. The word vector representations can be as simple as a hot-encoded vector, or they can be more complex (and more successful) representations that are trained on large corpus, take context into account, break the words into subword representations etc

What are subword embeddings?

Suppose you have a deep learning NLP model, say a chatbot, running in production. Your customers are directly interacting with the ML model. The model is being fed the vector values, such that each word from the customer is queried against a hashmap and the value corresponding to the word is the input vector. All of the keys in your hashmap represents the vocabulary, i.e. all the words you know. Now, how would you handle a case where the customer uses a word that is not already present in your vocabulary?

There are many ways to solve this out-of-vocabulary (OOV) problem. One popular approach is to split the words into “subword” units, and use those subwords to learn your hashmap during model training stage. At the time of inference, you would again divide each incoming word into smaller subword units, find the word vector corresponding to each subword unit using the hashmap, then aggregate each subword vector to get the vector representation for the complete word.

For example, let’s say we have the word tiktok, the corresponding subwords could be tik, ikt, kto, tok, tikt, ikto, ktok etc. This is the character n-gram division for the input word, where n is the subword sequence length, and is fixed by the modeler.

Figure: example subword representations

What is FastText?

FastText is an open-source project from Facebook Research. It is a library for fast text-representations and classifications. It is written in C++ and supports multiprocessing. It can be used to train unsupervised word vectors and supervised classification tasks. For learning word-vectors it supports both skipgram and continuous bag-of-words approaches.

FastText supports subword representations such that each word can be represented as a bag of character n-grams in addition to the word itself. Incorporating finer (subword level) information is pretty good for handling rare words. You can read more about the FastText approach in their paper here .

For an example, let’s say you have a word “superman” in FastText trained word embeddings (“hashmap”). Let’s assume the hyperparameters minimum and maximum length of ngram was set to 4. Corresponding to this word, the hashmap would have the following keys:

Original word: superman

n-gram size	subword
4	<sup
4	supe
4	uper
4	perm
4	erma
4	rman
4	man>

where “<” and “>” characters mark the start and end of a word respectively

The Task: Spell-Checking

In NLP, the learnt word embedding vectors not only have lexical representations for words, but the vector values also have semantically related positioning in the embedding space. We are going to try and build a spell-checker application based on FastText word vectors such that given a misspelled word, our task will be to find the word vector representation closest to the vector representation of that word in trained embedding space.

We will work based on this simple heuristic:

heuristic

IF word exists in the Vocabulary

Do not change the word

ELSE

Replace the word with the one closest to its sub-word representation

Dataset & Benchmark

For fun, let’s build and evaluate our spell-checker on the same training and testing data as this classic article: “How to write a Spelling Corrector” by Peter Norvig , Director of Research at Google. In this article, Norvig build a simple spelling corrector based on basic probability theory. Let’s see how does this FastText based approach hold up against it.

Installing FastText

Installation for FastText is straightforward. FastText can be used as a command line tool or via Python client. Click here to access the latest installation instructions for both approaches.

Method 1: Using Pre-trained Word Vectors

FastText provides pretrained word vectors based on common-crawl and wikipedia datasets. The details and download instructions for the embeddings can be found here. For a quick experiment, let’s load the largest pretrained model available from FastText and use that to perform spelling-correction.

Download and unzip the trained vectors and binary model file.

-> wget https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M-subword.zip
-> unzip crawl-300d-2M-subword.zip

There are two files inside the zip:

crawl-300d-2M-subword.vec : This file contains the number of words (2M) and the size of each word (vector dimensions; 300) in the first line. All of the following lines start with the word (or the subword) followed by the 300 real number values representing the learnt word vector.
crawl-300d-2M-subword.bin: This binary file is the exported model trained on the Common-Crawl dataset.

As mentioned in his article, Norvig used spell-testset1.txt and spell-testset2.txt as development and test set respectively, to evaluate the performance of his spelling-corrector. I’m going to load the pretrained FastText model and make predictions based on the heurisitic defined above. I’m also going to borrow some of the evaluation code from Norvig.

import io
import fasttext

def load_vectors(fname):
    fin = io.open(fname, 'r', encoding='utf-8', newline='\n', errors='ignore')
    n, d = map(int, fin.readline().split())
    data = {}
    for line in fin:
        tokens = line.rstrip().split(' ')
        data[tokens[0]] = map(float, tokens[1:])
    return data

def spelltest(tests, model, vocab):
    "Run correction(wrong) on all (right, wrong) pairs; report results."
    import time
    start = time.clock()
    good, unknown = 0, 0
    n = len(tests)
    for right, wrong in tests:
        w = wrong
        if w in vocab:
            print('word: {} exists in the vocabulary. No correction required'.format(w))
        else:
            w_old = w
            w = model.get_nearest_neighbors(w, k=1)[0][1]
            print("found replacement: {} for word: {}".format(w, w_old))
        good += (w == right)
    dt = time.clock() - start
    print('{:.0%} of {} correct at {:.0f} words per second '
          .format(good / n, n, n / dt))

def Testset(lines):
    "Parse 'right: wrong1 wrong2' lines into [('right', 'wrong1'), ('right', 'wrong2')] pairs."
    return [(right, wrong)
            for (right, wrongs) in (line.split(':') for line in lines)
            for wrong in wrongs.split()]

if __name__ == "__main__":
    model = fasttext.load_model("crawl-300d-2M-subword.bin")
    vocab = load_vectors("crawl-300d-2M-subword.vec")
    
    spelltest(Testset(open('spell-testset1.txt')), model, vocab)
    spelltest(Testset(open('spell-testset2.txt')), model, vocab)

For the sake of brevity, I’m reducing the output log to just the metrics trace.

output

…
0% of 270 correct at 3 words per second.
…
0% of 400 correct at 4 words per second.

That didn’t go well! 😅 None of the corrections from the model were right. Let’s dig into the results. Here are a few snapshots from the output log:

output

…
word: sysem exists in the vocabulary. No correction required
word: controled exists in the vocabulary. No correction required
word: reffered exists in the vocabulary. No correction required
word: irrelavent exists in the vocabulary. No correction required
word: financialy exists in the vocabulary. No correction required
word: whould exists in the vocabulary. No correction required
…
found replacement: reequipped for word: reequired
found replacement: putput for word: oputput
found replacement: catecholate for word: colate
found replacement: yeahNow for word: yesars
found replacement: detale for word: segemnt
found replacement: <li><strong>Style for word: earlyest
…

Looks like there are a lot of garbage subwords in the pretrained vocabulary that directly matches our misspelled input. Also, the performed lexical corrections show that the model replaced misspelled input words with the closest semantic neighbor. However none of those neighbors are meaningful English words.

We can verify this by checking out the neighbors for random misspellings.

>>> model.get_nearest_neighbors("incorrct", k=10)
[(0.68821, 'corrct'), (0.67809, 'corrctly'), (0.63196, 'altadena'), (0.58083, 'huntersville', (0.579281, 'concorda'), (0.578464, 'bartlesville'), (0.576386, 'controlin'), (0.576119, 'variaton'), (0.574023, 'relavence'), (0.572117, 'concorrenza'))]

As you can see, misspelled words have misspelled semantic neighbors. And the top ones also seem syntactically related. This makes sense. But we were hoping that the subword approach would be able to prioritize the potentially correct lexical variants of the word, and not just it’s semantic neighbors.

why do we see bad output?

One reason for this behavior could be that the pretrained model was originally trained by FastText with a >1 neighborhood window. This FastText documentation page confirms the fact that the wordNgrams (max length of word ngram) was set to 5 during training. If we train our own word vectors, we could keep the wordNgrams hyperparameter to 1, so that FastText trains with 0 neighbors (i.e. each word is considered a line on it’s own).

The pre-trained vectors do not seem to be sufficient for this task. Let’s try training our word embeddings and see what results we can get.

Method 2: Trained Our Own Word Vectors

For a fair comparison with Norvig’s spell-checker, let’s use the same training data that he used (big.txt).

From Norvig's article:

… by counting the number of times each word appears in a text file of about a million words, big.txt. It is a concatenation of public domain book excerpts from Project Gutenberg and lists of most frequent words from Wiktionary and the British National Corpus.

-> wc big.txt
128457 1095695 6488666 big.txt

The training data has around 128K lines, 1M words, 6.5M characters.

As discussed above, we should try with keeping the wordNgrams hyperparameter to 1, and use the trained FastText model to perform spell-checking.

import io
import fasttext

def load_vectors(fname):
    fin = io.open(fname, 'r', encoding='utf-8', newline='\n', errors='ignore')
    n, d = map(int, fin.readline().split())
    data = {}
    for line in fin:
        tokens = line.rstrip().split(' ')
        data[tokens[0]] = map(float, tokens[1:])
    return data

def spelltest(tests, model):
    "Run correction(wrong) on all (right, wrong) pairs; report results."
    import time
    start = time.clock()
    good, unknown = 0, 0
    n = len(tests)
    for right, wrong in tests:
        w_old = wrong
        w = wrong
        if w in model.words:
            pass
        else:
            w = model.get_nearest_neighbors(w, k=1)[0][1]
        good += (w == right)
        if not (w == right):
            if w_old != w:
                print("Edited {} to {}, but the correct word is: {}".format(w_old, w, right))
    dt = time.clock() - start
    print('{:.0%} of {} correct at {:.0f} words per second '
          .format(good / n, n, n / dt))

def Testset(lines):
    "Parse 'right: wrong1 wrong2' lines into [('right', 'wrong1'), ('right', 'wrong2')] pairs."
    return [(right, wrong)
            for (right, wrongs) in (line.split(':') for line in lines)
            for wrong in wrongs.split()]

if __name__ == "__main__":
    model = fasttext.train_unsupervised('big.txt', wordNgrams=1, minn=1, maxn=2, dim=300, ws=8, neg=8, epoch=4, minCount=1, bucket=900000)
    
    spelltest(Testset(open('spell-testset1.txt')), model)
    spelltest(Testset(open('spell-testset2.txt')), model)

There are a couple of changes in this code. We are training the model with specific hyper-parameters, and the output traces will inform us what went wrong in our decisions. Note that FastText doesn’t provide an option to set seed in the above implementation, so the results may vary by 1%-2% on every execution.

-> python fasttext_trained.py
Read 1M words
Number of words: 81398
Number of labels: 0
Progress: 100.0% words/sec/thread:	20348 lr: 0.000000 avg. loss 1.943424 ETA:	0h 0m 0s
...
73% of 270 correct at 46 words per second.
...
69% of 400 correct at 48 words per second.

This is a big improvement! 😄 We have almost the same accuracy as Norvig’s spell-corrector.

Here are Norvig’s results for comparison.

75% of 270 correct at 41 words per second
68% of 400 correct at 35 words per second

Looking at some of the edits in the output traces, I can see that the model output isn’t essentially incorrect, but the model is biased to certain edit-based operations.

output

…
Edited reffered to referred, but the correct word is: refered
Edited applogised to apologized, but the correct word is: apologised
Edited speeking to seeking, but the correct word is: speaking
Edited guidlines to guideline, but the correct word is: guidelines
Edited throut to throughout, but the correct word is: through
Edited nite to unite, but the correct word is: night
Edited oranised to organised, but the correct word is: organized
Edited thay to thy, but the correct word is: they
Edited stoped to stooped, but the correct word is: stopped
Edited upplied to supplied, but the correct word is: applied
Edited goegraphicaly to geographical, but the correct word is: geographically
…

As you can see our trained model is nicely producing dictionary-based words as output. It’s likely that with contextual training approaches and evaluations, along with more training data, we can come up with an even better approaches that would understand context in a full sentence and produce the correct word as the spell-checked output.

Let’s also compare the model with the pretrained model from method one.

>>> model.get_nearest_neighbors("incorrct", k=10)
[(0.96987646, 'incorrect'), (0.937522232, 'correct'), (0.91315931, 'corrects'), (0.90868479, 'correcting', (0.89917653, 'corrective'), (0.89813524, 'contractor'), (0.89806741, 'correction'), (0.896661, 'disconcert'), (0.894903957, 'contradict'), (0.8944035768, 'concurrent'))]

The trained model’s outputs are much more sensible now and the topmost word is also the word that we are actually looking for.

Conclusion

Our model was able to match Peter Norvig’s spell-corrector. 😊 Looking at the failing cases, we realize that the model could potentially do even better with more training data and a contextual training and evaluation strategy.

Contents

Contents

Building a spell-checker with FastText word embeddings

Introduction

What are word embeddings?

What are subword embeddings?

What is FastText?

The Task: Spell-Checking

Dataset & Benchmark

Installing FastText

Method 1: Using Pre-trained Word Vectors

Method 2: Trained Our Own Word Vectors

Conclusion

Related Content

Be the First to Know

Contents

Building a spell-checker with FastText word embeddings

Introduction

What are word embeddings?

What are subword embeddings?

What is FastText?

The Task: Spell-Checking

Dataset & Benchmark

Installing FastText

Method 1: Using Pre-trained Word Vectors

Method 2: Trained Our Own Word Vectors

Conclusion

Related Content

Embedding Collapse in Recommender Systems: Causes, Consequences, and Solutions

Shallow Embedding Models for Heterogeneous Graphs

Shallow Embedding Models for Homogeneous Graphs

A Guide to Graph Representation Learning

Self-Supervised Contrastive Approaches for Video Representation Learning

Be the First to Know