Text Classification with Bag of Bigrams and TF-IDF
I wanted to expand on the Bag of Words method we learned in class by combining it with NLTK and some other NLP techniques. I found a dataset on Kaggle consisting of five million Yelp reviews. My goal was to use these reviews to build a text classification tool that labels a piece of text as either positive or negative.
Five million reviews is too large a dataset to play around with in Jupyter, so I had to trim it down. I decided that ten thousand reviews was a good starting point that wouldn’t tie up my system but would still have enough content to produce results. I split my data into five categories based on star rating. I cut all the three star reviews from my dataset because they can be ambiguous: they aren’t staunchly negative or positive. I also oversampled two and four star reviews to account for how extreme the most negative reviews can be and how neutral five star reviews can be; two star and four star reviews are more likely to be better examples of negative and positive language.
I made two text files, each containing five thousand reviews: a positive file with two thousand five star reviews and three thousand four star reviews, and a negative file with two thousand one star reviews and three thousand two star reviews. To start, I did some EDA to get basic stats on the data: the number of types, the number of tokens, and frequency distributions.
This particular dataset contains around 42k individual types and 1.3 million tokens.
The frequency distribution shows how many times each token appears in the text. Here you can see the ten most common tokens in this dataset.
We’ll clean most of these out later. This is a list of hapaxes, which are words that appear only once in the entire collection of text; in this case, once in 1.3 million tokens.
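Here is a minimal sketch of how these stats can be pulled with NLTK. The file name and the exact tokenization are assumptions for illustration, not necessarily what my notebook does:

```python
import nltk
from nltk import FreqDist

# nltk.download("punkt")  # run once if the tokenizer models are missing

# Assumed file name for the combined review text; adjust to your own paths.
with open("reviews.txt", encoding="utf-8") as f:
    raw_text = f.read()

# Tokenize and compute basic corpus statistics.
tokens = nltk.word_tokenize(raw_text)
types = set(tokens)
print(f"Tokens: {len(tokens):,}  Types: {len(types):,}")

# Frequency distribution: a count for every token in the corpus.
fdist = FreqDist(tokens)
print(fdist.most_common(10))   # ten most common tokens

# Hapaxes: tokens that appear exactly once in the whole corpus.
print(fdist.hapaxes()[:10])
```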
Before doing any kind of modeling we need to clean the data, and I used several common NLP techniques to do so. First, case folding: making everything lower case. Next, stripping out stop words. NLTK provides a stop word list, a set of words that add no content to the text and often belong to closed classes of parts of speech such as pronouns, determiners, and conjunctions. Next I stripped out the punctuation, which cuts down on a lot of noise, though this is an area where you need to be careful depending on what you’re doing. Punctuation adds a layer of semantic meaning; for example, ‘good.’, ‘good!’ and ‘good!!!’ can each be interpreted as a greater degree of positivity than the preceding example. Next I stemmed all the data, which involved importing the PorterStemmer from NLTK and instantiating it as an object. Stemming is the process of reducing a word to its root by removing derivational or inflectional suffixes. Then I removed words with a count of two or less, the reasoning being that words that show up rarely matter less, and this also cuts out a lot of spelling errors. Finally, I removed words with a length of less than two, which also add very little semantic meaning. A sketch of this pipeline is shown below.
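This is a minimal sketch of that cleaning pipeline, picking up the `tokens` list from the EDA sketch above; the helper function name is mine, not from the notebook:

```python
import string
from nltk import FreqDist
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# nltk.download("stopwords")  # run once if needed
stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

def clean(tokens):
    """Apply the cleaning steps described above to a list of tokens."""
    # Case folding
    tokens = [t.lower() for t in tokens]
    # Strip stop words
    tokens = [t for t in tokens if t not in stop_words]
    # Strip punctuation-only tokens
    tokens = [t for t in tokens if t not in string.punctuation]
    # Stem each token down to its root
    tokens = [stemmer.stem(t) for t in tokens]
    # Drop tokens with a length of less than two
    tokens = [t for t in tokens if len(t) >= 2]
    return tokens

cleaned = clean(tokens)

# Finally, drop words with a corpus-wide count of two or less.
counts = FreqDist(cleaned)
cleaned = [t for t in cleaned if counts[t] > 2]
```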
Here you can see the change in the data as it goes through each cleaning process.
This visualization compares the number of tokens and types at every iteration of the cleaning. As you can see, the three biggest drops came from removing stop words, removing punctuation, and removing words with a length of less than two.
After all this cleaning we essentially have what constitutes our bag of words. But the bag of words can be greatly improved by two things: bigrams and TF-IDF. Bigrams are two-unit chunks of text where the first half of each bigram is the second half of the preceding one. You can also use trigrams, or grams of any length, called n-grams. N-grams are especially relevant for languages like English, where negation is marked by placing the negator directly before the word it modifies. For example, a unigram would just capture (good), but a bigram can capture (not, good). Trigrams also come in handy for catching the occasional double negative, turning ‘it wasn’t not good’ into (nt, not, good). The other half of this formula is TF-IDF. TF stands for Term Frequency, the number of times a specific term appears in a document. IDF stands for Inverse Document Frequency, which balances the weights given to terms based on how many documents they appear in: it increases the weight on less common words and decreases the weight on more common words, on the reasoning that words that show up everywhere carry less meaning.
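To make the n-gram idea concrete, here is a small illustration with NLTK; the example sentences are mine:

```python
from nltk import word_tokenize, bigrams, trigrams

tokens = word_tokenize("the food was not good")
print(list(bigrams(tokens)))
# [('the', 'food'), ('food', 'was'), ('was', 'not'), ('not', 'good')]
# The bigram ('not', 'good') keeps the negation attached to the word it
# modifies, which counting 'not' and 'good' separately would lose.

tokens = word_tokenize("it wasn't not good")
print(list(trigrams(tokens)))
# [('it', 'was', "n't"), ('was', "n't", 'not'), ("n't", 'not', 'good')]
# The trigram ("n't", 'not', 'good') catches the double negative.
```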
Sklearn has a tool that calculates the TF-IDF scores for us. The TfidfVectorizer is instantiated with two parameters: analyzer, set to ‘word’ (the default), which dictates what unit the text is split into, and ngram_range. This is where our bigrams come in. Setting ngram_range to (1, 2) tells the vectorizer to create TF-IDF scores for both unigrams and bigrams. We then fit_transform the vectorizer on our data, which is the cleaned data column, where the fully cleaned tokens have been joined back together so each review can be passed in as a document rather than as tokenized terms. Finally, turn the result into an array and pop it back into a pandas data frame.
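A minimal sketch of those steps, assuming the cleaned, re-joined reviews live in a pandas column I’m calling df["clean_text"] (the column name is illustrative):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Score unigrams and bigrams of the cleaned review text.
vectorizer = TfidfVectorizer(analyzer="word", ngram_range=(1, 2))
tfidf_matrix = vectorizer.fit_transform(df["clean_text"])

# Turn the sparse matrix into an array and pop it back into a data frame.
# (Older scikit-learn versions use get_feature_names() instead.)
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(),
                        columns=vectorizer.get_feature_names_out())
```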
This is what our data frame looks like after the TF-IDF vectorizer: a huge array where each unigram and bigram is a feature with a score. It has 10k rows and 300k+ columns. This is why the data cleaning is so important for this step; without it this array would be almost double the size and very hard to work with. As you can also see, this data frame could use some more cleaning. There are accents and odd strings such as ‘aaa’ that could probably be removed if we really wanted to get this data squeaky clean.
So all that is left to do now is to train-test split and run the models!
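A sketch of that step, assuming y is the positive/negative label column built earlier; the particular split size, random seed, and set of models here are illustrative:

```python
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score

# Split the TF-IDF features and labels into train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    tfidf_matrix, y, test_size=0.25, random_state=42)

# Fit a few classifiers and compare their accuracy on the held-out set.
for model in (MultinomialNB(), BernoulliNB(), SGDClassifier()):
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    print(type(model).__name__, accuracy_score(y_test, preds))
```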
When I started this project I thought I might do some kind of ensemble method or bagging, but my models actually seem to be doing pretty well. There are several other models I might run on this dataset, such as k-nearest neighbors, random forest, SVC, LinearSVC, and NuSVC, to get a better sense of what does and doesn’t work well. Seeing as the Bernoulli Naive Bayes model did the worst, I might drop it from my set.
As you may have also noticed, I decided to time how long each Jupyter cell took to run each model. Some, like Multinomial Naive Bayes, only took 15 seconds, while others, like SGD, took over two minutes. I would have liked to run this on the full five million review set, but I think that might just be too much data for my available RAM and time. Since these classifiers took so long to compute, I decided to pickle them for later use.
Pickling is a process where you preserve things to use later; in Python this is actually serialization. Having these classifiers serialized means I can use them later without having to retrain each and every one. This can save you a lot of time and processing if you intend to use your classifiers more than once.
The pickling process involves writing to and reading from a file. You open a file, name it, dump your classifier inside, then close the file. To get it back, you open the file, read it, save the loaded pickle to a variable, and then close the file. An interesting next step for me would be to run my classifiers on some tagged three star reviews to see how well they sort out ambiguous data, and then to test them on a different kind of data such as tweets.
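A minimal sketch of that save-and-reload round trip; the file name is illustrative:

```python
import pickle

# Serialize a trained classifier to disk so it can be reused without retraining.
with open("multinomial_nb.pickle", "wb") as f:
    pickle.dump(model, f)

# Later, read the file back and load the classifier into a variable.
with open("multinomial_nb.pickle", "rb") as f:
    loaded_model = pickle.load(f)
```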
The Source code for my project can be found here:
https://github.com/edwardsrk/yelp_sentiment_analysis
Sources:
“Tf–idf.” Wikipedia, Wikimedia Foundation, 23 Feb. 2021, https://en.wikipedia.org/wiki/Tf%E2%80%93idf
“Bag-of-words model.” Wikipedia, Wikimedia Foundation, 3 Jan. 2021, https://en.wikipedia.org/wiki/Bag-of-words_model
Brownlee, Jason. “A Gentle Introduction to the Bag-of-Words Model.” Machine Learning Mastery, 7 Aug. 2019, https://machinelearningmastery.com/gentle-introduction-bag-words-model/
Huilgol, Purva. “BoW Model and TF-IDF for Creating Features from Text.” Analytics Vidhya, 23 Dec. 2020, https://www.analyticsvidhya.com/blog/2020/02/quick-introduction-bag-of-words-bow-tf-idf/
Bird, Steven, Edward Loper, and Ewan Klein. Natural Language Processing with Python. O’Reilly Media Inc., 2009.