Vectorizers

Rachel Edwards
4 min read · Jul 19, 2021


Transforming data into a more usable form is a frequent and important part of data science, and it comes in especially handy when dealing with language data. Language data can be difficult to process because it is qualitative rather than quantitative, and it is hard for computers to interpret because they rely on the rigidity of numbers and logic. Luckily, we have a few tools at our disposal for these tricky types of data. In this blog post I am going to go over several methods for converting language data into numbers so that we can build better models. One of the best ways to accomplish this is with vectorizers. Vectorizing a word or phrase means mapping it to a vector in a numeric space. Scikit-learn provides several kinds of vectorizers. We will look at three: the Tf-Idf vectorizer, the Count vectorizer, and the Hashing vectorizer.

The first vectorizer we will look at is the Tf-Idf vectorizer. Tf-Idf is the acronym for Term frequency - Inverse document frequency. Term frequency counts how often a term appears within each document you are going over. Inverse document frequency is designed to keep the more content-heavy words from being drowned out by common, less content-heavy words like stop words: it gives less weight to words that appear in many documents than to those that appear in few. Tf-Idf is the product of these two statistics. There are several ways to calculate them manually; you can read more about that, and view the mathematical formulas, in the scikit-learn documentation.

Let's take a look at the Tf-Idf vectorizer in code. The first step is to import it from scikit-learn.
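The import looks like this (the class lives in scikit-learn's `feature_extraction.text` module):

```python
# TfidfVectorizer is part of scikit-learn's text feature extraction tools
from sklearn.feature_extraction.text import TfidfVectorizer
```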

After you have imported the vectorizer, you must instantiate a vectorizer object and assign it to a variable. This object exposes parameters for customizing your vectorizer as well as methods for manipulating the data. Some of these are very useful during data cleaning, such as removing stop words, stripping accents, or case folding.
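A minimal sketch of that instantiation step; the particular settings here (lowercasing, Unicode accent stripping, English stop words) are illustrative choices, not the only ones available:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Example settings: case-fold text, strip accents, and drop English stop words.
# All three handle cleaning chores the vectorizer can do for you.
tfidf = TfidfVectorizer(lowercase=True,
                        strip_accents="unicode",
                        stop_words="english")
```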

As an example, take three short sentences, each paired with a cleaned version of the original.
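A minimal sketch of that cleaning step. The three sentences and the `clean` helper are made up for illustration; real cleaning pipelines vary:

```python
import re

# Three hypothetical example sentences
sentences = [
    "The cat sat on the mat.",
    "The dog chased the cat!",
    "A bird watched the dog.",
]

def clean(text):
    # Lowercase, then keep only letters and whitespace
    return re.sub(r"[^a-z\s]", "", text.lower())

cleaned = [clean(s) for s in sentences]
```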

We clean data before using vectorizers to keep the vocabulary small, so things run faster and more efficiently. Calling the fit_transform method on the data does the actual vectorizing: it learns the vocabulary and returns a sparse matrix, which can then be organized into a more readable form. Each column corresponds to a word and each row to a sentence, so every word gets one number per sentence. These numbers are not raw percentages but Tf-Idf weights (by default, scikit-learn also normalizes each row). This output, the Tf-Idf matrix, is what we would feed into our model.

There are many different vectorizers, and another is the Count vectorizer provided by scikit-learn. The Count vectorizer differs from the Tf-Idf vectorizer in a number of ways, the most important being how it computes the numbers in its vectors.

For each row, the Count vectorizer records how many times each word appears in that document: a zero means the word does not occur in the sentence, and a positive number is its count. This is a simpler vectorization than the Tf-Idf vectorizer. If you set the binary option, every nonzero count is clipped to one, which you might recognize as something close to one-hot encoding.
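A short sketch contrasting the two behaviours, using two made-up sentences; `binary=True` is a real CountVectorizer parameter that caps counts at one:

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["the cat sat on the mat",
             "the dog chased the cat"]

# Default behaviour: raw counts ("the" appears twice in each sentence)
counts = CountVectorizer().fit_transform(sentences).toarray()

# binary=True clips every count to 0/1, closer to one-hot encoding
binary = CountVectorizer(binary=True).fit_transform(sentences).toarray()
```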

The final vectorizer we’re going to look at is the hashing vectorizer.

The hashing vectorizer uses the hashing trick to create its sparse matrix: instead of storing a vocabulary, it hashes each word directly to a column index. This makes it fast and memory-efficient on large corpora, but the trade-off is that you cannot map columns back to words, and two different words can collide in the same column.
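A minimal sketch with two placeholder sentences; `n_features=2**10` is an arbitrary choice for the fixed output width:

```python
from sklearn.feature_extraction.text import HashingVectorizer

sentences = ["the cat sat on the mat",
             "the dog chased the cat"]

# n_features fixes the output width up front. No vocabulary is stored,
# so transform() works without fitting, but columns cannot be mapped
# back to words and unrelated words may collide in the same column.
hasher = HashingVectorizer(n_features=2**10)
X = hasher.transform(sentences)
```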
