Natural Language Toolkit: NLTK
The Natural Language Toolkit (NLTK) is a powerful package of tools and resources for computational linguistics and natural language processing. It contains over 50 corpora and lexical resources, along with tools for tokenization, stemming, parsing, and semantic reasoning. The toolkit was created in 2001 by Steven Bird and Edward Loper at the University of Pennsylvania, and they have been adding content and supporting it ever since, with the intention of continuing for the foreseeable future. NLTK is designed for natural language processing (NLP), which is defined as “a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data.” Natural language is language that developed naturally among humans over time as a means of communication. We also have constructed languages, such as Dothraki, Klingon, and Esperanto, and formal languages, which serve as the underpinning grammar of programming languages. Below are some of the modules NLTK provides and the kinds of things they let you do.
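As a minimal starting sketch (assuming NLTK is installed with `pip install nltk`, and that the tokenizer models download successfully), sentence and word tokenization look like this; the sample text is just an illustration:

```python
# Minimal NLTK starter: sentence and word tokenization.
# Assumes `pip install nltk`; the tokenizer models are a one-time download.
import nltk

nltk.download("punkt")  # tokenizer models (newer NLTK versions may also need "punkt_tab")

text = "NLTK makes natural language processing approachable. It is written in Python."

# Break the text into sentences, then into word-level tokens.
print(nltk.sent_tokenize(text))  # two sentences
print(nltk.word_tokenize(text))  # ['NLTK', 'makes', 'natural', ..., 'Python', '.']
```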
A big branch of NLP is grounded in corpus linguistics. A corpus is a collection of real-world language data. In the context of NLP, a corpus can be a machine-readable document such as an essay or a book, but it can also be transcriptions of real-world utterances collected for analysis. NLTK provides several corpora to work with: written works like “Emma” and “Moby Dick”, but also reviews, movie scripts, and government documents. All of these corpora can be used to train models for different areas of NLP. You might use reviews for sentiment analysis of positive and negative opinions, Shakespeare to create a sonnet-producing bot, or a combination of corpora for a spell checker.

Other useful tools in NLTK are tokenizers, stemmers, and lemmatizers. Tokenizing is the task of breaking an utterance up into tokens, some kind of smaller unit: words, phrases, or chunks. Natural language carries meaning on several levels, and breaking phrases down into smaller units lets us retrieve more of that meaning. Once things are broken down, we will sometimes use stemming or lemmatization. Both of these methods reduce individual words to their stem, the main morphological meaning, by paring off affixes. For example, the words ‘barked’ and ‘barking’ carry the inflectional suffixes ‘-ed’ and ‘-ing’ but share the same root word, ‘bark’.

Another big function in NLP is part-of-speech tagging. Each word has a particular part of speech, which can be identified by its suffix (which you might use a stemmer to find) or by its place in the syntax of a sentence. A part-of-speech tagger labels each word with its part of speech: noun, verb, adjective, and so on. Because syntax constrains which parts of speech can follow one another, a word's neighbors are a strong clue to its tag. Tagging a word in the context of an utterance, rather than as a standalone token, is important because it helps clear up some of the ambiguity natural language creates. Take the word ‘barks’: it could be either a verb or a noun, and the suffix alone offers no clue, but a part-of-speech tagger can differentiate the two readings based on the words that precede it: [(The:det) (dog:noun) (barks:verb)] versus [(The:det) (collection:noun) (of:prep) (tree:noun) (barks:noun)]. These are just casual examples of tagged data; actual taggers use 36+ part-of-speech tags.
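Here is a short sketch tying these pieces together, using NLTK's bundled Gutenberg corpus, the Porter stemmer, the WordNet lemmatizer, and the default part-of-speech tagger (assuming the named data packages download successfully; the exact tags you get can vary between tagger models):

```python
# Corpora, stemming/lemmatization, and POS tagging with NLTK.
import nltk
from nltk.corpus import gutenberg
from nltk.stem import PorterStemmer, WordNetLemmatizer

for pkg in ("gutenberg", "wordnet", "punkt", "averaged_perceptron_tagger"):
    nltk.download(pkg)

# Corpora: NLTK ships classic texts such as Austen's "Emma".
emma = gutenberg.words("austen-emma.txt")
print(len(emma))  # total number of tokens in "Emma"

# Stemming and lemmatization both pare affixes off a word to reach its base form.
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print(stemmer.stem("barking"))               # 'bark'
print(lemmatizer.lemmatize("barking", "v"))  # 'bark' (treated as a verb)

# POS tagging uses context: 'barks' should come out as a verb (VBZ) in the
# first sentence and as a plural noun (NNS) in the second.
print(nltk.pos_tag(nltk.word_tokenize("The dog barks")))
print(nltk.pos_tag(nltk.word_tokenize("The collection of tree barks")))
```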
We can use counts of tokens, types, words with a certain part of speech, and root words to create frequency distributions, calculate lexical diversity, and compute a handful of other simple statistics. Tokens are the individual objects in a text: words, punctuation, emoji. Types are the unique tokens, the set of tokens in a text. Lexical diversity measures how varied the vocabulary of a text is, typically as the ratio of types to tokens.
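A quick sketch of these statistics, computed here over “Moby Dick” from the Gutenberg corpus (the counts include punctuation, since the corpus reader returns every token):

```python
# Token counts, type counts, lexical diversity, and a frequency distribution.
import nltk
from nltk.corpus import gutenberg

nltk.download("gutenberg")

tokens = gutenberg.words("melville-moby_dick.txt")  # every word and punctuation mark
types = set(tokens)                                 # the unique tokens

# Lexical diversity as the ratio of types to tokens.
diversity = len(types) / len(tokens)
print(f"{len(tokens)} tokens, {len(types)} types, lexical diversity {diversity:.3f}")

# FreqDist counts how often each token occurs.
fdist = nltk.FreqDist(tokens)
print(fdist.most_common(5))  # the five most frequent tokens (mostly punctuation and stop words)
```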
These are fairly simple calculations, but NLTK goes into more complicated material as well, such as smoothing techniques for machine learning algorithms, building your own context-free grammars, and using Hidden Markov Models.

NLTK isn’t meant just for natural language processing; it also serves as a guide for new Python users. It teaches data structures, recursion, string transformations, operators, and more, and it is designed to be user-friendly enough that even beginners can access the information provided. A book by Steven Bird, Ewan Klein, and Edward Loper is available for purchase in print or online for free. The book provides extra insight and sets up tasks and tutorials to help readers work their way through the material.

NLTK does have a few competitors. Stanford CoreNLP performs many of the same tasks but is written for Java users: “CoreNLP is your one stop shop for natural language processing in Java! CoreNLP enables users to derive linguistic annotations for text, including token and sentence boundaries, parts of speech, named entities, numeric and time values, dependency and constituency parses, coreference, sentiment, quote attributions, and relations. CoreNLP currently supports 6 languages: Arabic, Chinese, English, French, German, and Spanish.” It was created by the Stanford NLP Group and is organized around an annotation pipeline. Another well-known NLP offering is Amazon Comprehend, which is more business-oriented, structured mostly around machine learning, and, unlike NLTK, is a paid service purchased through AWS.
Bird, Steven, Ewan Klein, and Edward Loper (2009). Natural Language Processing with Python. O’Reilly Media Inc.
“Natural Language Processing.” Wikipedia, Wikimedia Foundation, 9 Feb. 2021, en.wikipedia.org/wiki/Natural_language_processing.
Manning, Christopher D., Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky (2014). The Stanford CoreNLP Natural Language Processing Toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 55–60.