Stemming and Lemming with NLTK
One of the first and most important things you will do when you start working with language data is to clean it. As data scientists we have many methods for cleaning data, like checking for nulls, class imbalance, or numerical outliers. Cleaning language data is a little different: it is often centered around reduction. There are many things you can do to clean language data, such as removing stop words and punctuation or case folding. This blog post, however, is going to focus on the processes of Lemmatization and Stemming. These two methods of cleaning often get mixed up because their outputs look so similar.
Stemming is the process of stripping the endings off words to reduce them to their root word, or stem, which is where the name Stemming comes from. Stemming reduces the number of characters that need to be evaluated by removing the extra affixes attached to the stem. This saves processing power and memory, and it allows the stemmed words to be grouped together for further analysis. The Natural Language Toolkit (NLTK) has a variety of stemmers that we will take a look at below. You can find how to import and set up NLTK below; you will need to do this before trying any of the stemmers and lemmers in this tutorial.
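If you have never used NLTK before, a minimal setup looks something like this; the specific resource downloads listed here are the ones the examples later in this post rely on:

```python
# Install the library first if needed: pip install nltk
import nltk

# One-time downloads of the resources used later in this post
nltk.download('punkt')                       # tokenizer models
nltk.download('wordnet')                     # lexical database behind the lemmatizer
nltk.download('averaged_perceptron_tagger')  # part of speech tagger
```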
One of the most popular and most effective stemmers is the PorterStemmer. The algorithm for this stemmer was written by Dr. Martin Porter in 1980. You can read more about Martin Porter on his personal page here. Below is an example of the PorterStemmer.
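The setup is just two short chunks, an import followed by the instantiation:

```python
from nltk.stem import PorterStemmer
```

```python
porter_stemmer = PorterStemmer()
```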
At the top of this code you will see the import statement on the first line of the first code chunk. It is important to note that you will need all of NLTK downloaded and imported first for this to work, as covered above. In the second chunk of code we have a variable named ‘porter_stemmer’. This is where we instantiate an instance of the PorterStemmer object.
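The exact sentence used isn’t critical, so the word list below is an assumption chosen to exercise several forms of ‘cook’, matching the behavior described next:

```python
words = ['the', 'cooks', 'are', 'cooking', 'cookies',
         'for', 'the', 'cookery', 'course']

for word in words:
    print(porter_stemmer.stem(word))

# Among other things: 'cooks' -> 'cook', 'cooking' -> 'cook',
# 'cookies' -> 'cooki', 'cookery' -> 'cookeri', 'course' -> 'cours'
```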
In this next chunk of code a sentence has been split up into a list of words. In the for loop, the stemmer takes each individual string within the list and strips it down to what the algorithm thinks is its most likely stem. As you can see, it is a bit of an odd sentence, but the repeated use of the word ‘cook’ in all of its forms demonstrates what the stemmer does. You can see that it catches the -‘s’ suffix on ‘cooks’, as well as taking both the suffixes -‘ie’ and -‘y’ down to -‘i’.
Even though this is often the standard go-to stemmer, it has a few flaws. Dr. Martin Porter later developed another, better stemmer called the Snowball Stemmer.
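A minimal sketch of the Snowball version, reusing the same assumed word list; the one API difference to note is that SnowballStemmer takes a language argument:

```python
from nltk.stem import SnowballStemmer

# Unlike PorterStemmer, SnowballStemmer supports several
# languages, so it takes a language argument
snowball_stemmer = SnowballStemmer('english')

for word in words:  # the same assumed word list as above
    print(snowball_stemmer.stem(word))
```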
As you can see, it works in just the same way as above, but with a few internal improvements.
Similar to stemming is lemming. It gets its name from bringing words down to their lemma, which is very similar to the root of a stemmed word. Lemmatizers are more powerful than stemmers because they use more information about a word’s context to produce a more accurate output.
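Here is a sketch of that, assuming the same sentence as before, this time kept as one string:

```python
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

wordnet_lemmatizer = WordNetLemmatizer()

# The input is a single string, so it has to be tokenized
# before the lemmatizer can work on it word by word
sentence = 'the cooks are cooking cookies for the cookery course'
tokens = word_tokenize(sentence)

for token in tokens:
    print(wordnet_lemmatizer.lemmatize(token))

# Among other things: 'cooks' -> 'cook', 'cookies' -> 'cooky',
# while 'course' and 'cooking' come back unchanged
```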
As you can see, the first thing we do here is instantiate a new WordNetLemmatizer object to use, just as we did for the two stemmers. Now, though, instead of a list of words we have the sentence as a single string. This is where the extra context comes in: the lemmatizer can use the place of a word within the sentence to determine more information about it. As part of that, you’ll notice we had to tokenize the sentence. You might also notice a few slight changes in our output compared to the two stemmers above. ‘Course’ remains untouched, as it is already its own lemma with no affixes of any kind attached, and the suffix -‘ie’ is truncated to -‘y’ instead of -‘i’. Another method that can be used to get better results and improve the lemmatization is part of speech tagging.
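One common pattern for this, a reconstruction rather than the only way to write it, is a small helper that maps the Penn Treebank tags produced by nltk.pos_tag onto the part of speech constants WordNet expects:

```python
from nltk import pos_tag
from nltk.corpus import wordnet

def get_wordnet_pos(treebank_tag):
    """Map a Penn Treebank tag onto a WordNet part of speech."""
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN  # the lemmatizer's default

# `tokens` and `wordnet_lemmatizer` carry over from the chunk above
for token, tag in pos_tag(tokens):
    print(wordnet_lemmatizer.lemmatize(token, get_wordnet_pos(tag)))

# 'cooking' is now tagged VBG (a gerund), so it lemmatizes to 'cook'
```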
The chunk above part of speech tags each tokenized word within our sentence and hands that tag to the lemmatizer. Part of speech tagging is an NLP technique where each word is labeled with its syntactic category based on where it falls in the sentence. This helps distinguish between two words that are virtually identical without any context. Take, for example, the word ‘cook’. On its own it could be the verb, as in “I cook dinner for myself”, or it could be the noun, as in “I am a cook”. These distinctions can be discerned from where the word falls within the sentence and the context clues surrounding it, and those clues allow the lemmer to do a better job. As you can see in the code example above, it properly identifies the gerund on ‘cooking’ and removes it.
You can find the full code for this tutorial on my GitHub.