Natural Language Processing (NLP) is basically the combining of machine learning techniques with text and using math / statistics to get that text in a format that the machine learning algorithms can understand! π
NLP concerns itself with the interaction between natural human languages and computing devices. NLP is a major aspect of computational linguistics, and also falls within the realms of computer science and artificial intelligence.
As I got introduced to NLP, I came to the realization that the field is quite expanse and wide.
This will only highlight the basic concepts in this.
What new skills have you learned?
Natural Language Processing
One easily referable application of NLP is email classification since itβs relatable to many.
Using labeled these labeled ham(important mail) and spam examples, a machine learning model is trained to learn to discriminate between ham/spam automatically.
With this training, the model can then classify arbitrary unlabeled messages as ham or spam.
Some Components in NLP
Corpus
A collection of texts.
Feature engineering
Features to be used in training models. Domain knowledge on the data enhances your ability to engineer more features from it.
Normalization
Normalization generally refers to a series of related tasks meant to put all text on a level playing field: converting all text to the same case (upper or lower), removing punctuation, converting numbers to their word equivalents, and so on
TF-IDF
Term Frequency-Inverse Document Frequency, the tf-idf weight is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus
Stop words
Stop words are those words which are filtered out before further processing of text, since these words contribute little to overall meaning, given that they are generally the most common words in a language. For instance, βthe,β βand,β and βa,β
Stemming
This is the process of eliminating affixes (suffixed, prefixes, infixes, circumfixes) from a word in order to obtain a word stem.
Singing > Sing.
Lemmatization
lemmatization is related to stemming, differing in that lemmatization is able to capture canonical forms based on a wordβs lemma.
Women > Woman
Tokenization
This is a step which splits longer strings of text into smaller pieces, or tokens. Larger chunks of text can be tokenized into sentences, sentences can be tokenized into words, etc.
Bag of words
Bag of words is a particular representation model used to simplify the contents of a selection of text. The bag of words model omits grammar and word order, but is interested in the number of occurrences of words within the text
Levenshtein
This is the number of characters that must be deleted, inserted, or substituted in order to make a pair of strings equal
Sentiment analysis
Described as the process of evaluating and determining the sentiment captured in a selection of text, with sentiment defined as feeling or emotion.
Here is a notebook with email classification as either ham / spam
So that was the ninth week.. π