Week Nine - Natural Language Processing

Natural Language Processing (NLP) basically combines machine learning techniques with text, using math and statistics to get that text into a format that machine learning algorithms can understand! 📖

NLP concerns itself with the interaction between natural human languages and computing devices. NLP is a major aspect of computational linguistics, and also falls within the realms of computer science and artificial intelligence.

As I got introduced to NLP, I came to realize that the field is quite expansive.

This post will only highlight the basic concepts in the field.

What new skills have you learned?

Natural Language Processing

One easily relatable application of NLP is email classification, since most of us deal with it every day.

Using labeled examples of ham (important mail) and spam, a machine learning model is trained to discriminate between the two automatically.

With this training, the model can then classify arbitrary unlabeled messages as ham or spam.
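To make the train-then-classify idea concrete, here is a minimal sketch of a Naive Bayes classifier in plain Python. Naive Bayes is my own choice for illustration; it is not necessarily the model you would pick in practice. The four training messages are made up, and `train` and `classify` are hypothetical helper names.

```python
import math
from collections import Counter

# Tiny made-up training set; a real one would have thousands of messages.
TRAIN = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday with the team", "ham"),
]

def train(examples):
    """Count words per label and messages per label."""
    word_counts = {"ham": Counter(), "spam": Counter()}
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts

def classify(text, word_counts, label_counts):
    """Return the label with the highest log-probability (add-one smoothing)."""
    vocab = set(word_counts["ham"]) | set(word_counts["spam"])
    total_msgs = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        total_words = sum(word_counts[label].values())
        # Start from the prior: how common this label is in the training data.
        score = math.log(label_counts[label] / total_msgs)
        for word in text.lower().split():
            # Add-one smoothing keeps unseen words from zeroing the probability.
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

word_counts, label_counts = train(TRAIN)
print(classify("free prize", word_counts, label_counts))           # spam
print(classify("team meeting monday", word_counts, label_counts))  # ham
```

With so little training data the result is only a toy, but the flow is the same as in a real project: count from labeled examples, then score unlabeled messages.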

Some Components in NLP

Corpus

A collection of texts.

Feature engineering

Feature engineering is the process of creating the features used to train models. Domain knowledge about the data enhances your ability to engineer more features from it.

Normalization

Normalization generally refers to a series of related tasks meant to put all text on a level playing field: converting all text to the same case (upper or lower), removing punctuation, converting numbers to their word equivalents, and so on.
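A minimal sketch of two of those steps (lowercasing and punctuation removal), using only the standard library. The `normalize` function name is my own, and converting numbers to words is left out here.

```python
import string

def normalize(text):
    """Lowercase the text, drop ASCII punctuation, and collapse whitespace."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return " ".join(no_punct.split())

print(normalize("Hello, World!  This is NLP."))  # hello world this is nlp
```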

TF-IDF

Term Frequency-Inverse Document Frequency (TF-IDF) is a weight often used in information retrieval and text mining. It is a statistical measure of how important a word is to a document in a collection or corpus: the weight rises with how often the word appears in the document and falls with how many documents in the corpus contain it.
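Here is a small sketch of one common variant of the formula: term frequency is the word's count divided by the document length, and inverse document frequency is log(N / df), where N is the number of documents and df is how many contain the word. Libraries often use smoothed versions, so treat this as an illustration only. The three documents are made up.

```python
import math
from collections import Counter

# Made-up corpus, already tokenized into words.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs".split(),
]

def tf_idf(term, doc, docs):
    """TF-IDF of `term` in `doc`, relative to the whole corpus `docs`."""
    tf = Counter(doc)[term] / len(doc)       # how frequent in this document
    df = sum(1 for d in docs if term in d)   # how many documents contain it
    if df == 0:
        return 0.0                           # word never seen in the corpus
    idf = math.log(len(docs) / df)           # rarer words get a larger idf
    return tf * idf

# "cat" is rarer across the corpus than "the", so it scores higher
# even though "the" appears more often in the first document.
print(tf_idf("cat", docs[0], docs))
print(tf_idf("the", docs[0], docs))
```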

Stop words

Stop words are words that are filtered out before further processing of text, since they contribute little to overall meaning, given that they are generally the most common words in a language. For instance, “the,” “and,” and “a.”
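Filtering stop words can be sketched in a couple of lines. The `STOP_WORDS` set below is a tiny hand-picked sample for illustration; real stop word lists are much longer.

```python
# A tiny, hand-picked stop word list (an assumption for this example).
STOP_WORDS = {"the", "and", "a", "an", "is", "of", "to", "in"}

def remove_stop_words(tokens):
    """Keep only the tokens that are not in the stop word list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words("The cat and the dog sat in a box".split()))
# ['cat', 'dog', 'sat', 'box']
```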

Stemming

This is the process of eliminating affixes (suffixes, prefixes, infixes, circumfixes) from a word in order to obtain a word stem.

Singing > Sing
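As a rough illustration, here is a deliberately naive suffix stripper. It is not the Porter algorithm or any other real stemmer; the `SUFFIXES` tuple and the three-character minimum are my own assumptions, and it only handles a few English suffixes.

```python
# A few common English suffixes, checked in this order (illustrative only).
SUFFIXES = ("ing", "ed", "ly", "s")

def naive_stem(word):
    """Strip the first matching suffix, leaving at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        # The length check keeps short words like "sing" or "bus" untouched.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(naive_stem("Singing"))  # sing
print(naive_stem("running"))  # runn (crude: a real stemmer handles this better)
```

The second example shows why real stemmers need more rules than plain suffix stripping.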

Lemmatization

Lemmatization is related to stemming, but instead of stripping affixes it maps a word to its canonical dictionary form (its lemma), so it can also handle irregular forms.

Women > Woman
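Because irregular forms cannot be found by cutting off endings, lemmatizers rely on a vocabulary. The sketch below fakes that with a tiny hand-written lookup table. `LEMMAS` and `lemmatize` are hypothetical names, and a real lemmatizer would use a full dictionary plus part-of-speech information.

```python
# Tiny hand-written lookup table standing in for a real dictionary.
LEMMAS = {
    "women": "woman",
    "mice": "mouse",
    "children": "child",
    "geese": "goose",
    "went": "go",
}

def lemmatize(word):
    """Return the lemma if we know it, otherwise the lowercased word."""
    word = word.lower()
    return LEMMAS.get(word, word)

print(lemmatize("Women"))  # woman
print(lemmatize("went"))   # go
```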

Tokenization

This is a step which splits longer strings of text into smaller pieces, or tokens. Larger chunks of text can be tokenized into sentences, sentences can be tokenized into words, etc.
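Both levels of tokenization can be sketched with regular expressions. These patterns are simple assumptions of mine; they will trip on abbreviations like "Dr." and on non-English text.

```python
import re

def sentence_tokenize(text):
    """Split text into sentences after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def word_tokenize(sentence):
    """Split a sentence into words, keeping apostrophes inside words."""
    return re.findall(r"[A-Za-z0-9']+", sentence)

text = "NLP is fun. It's also hard!"
sentences = sentence_tokenize(text)
print(sentences)                    # ['NLP is fun.', "It's also hard!"]
print(word_tokenize(sentences[1]))  # ["It's", 'also', 'hard']
```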

Bag of words

Bag of words is a particular representation model used to simplify the contents of a selection of text. The bag of words model omits grammar and word order, but is interested in the number of occurrences of words within the text.
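A minimal sketch using `collections.Counter` from the standard library. The `bag_of_words` helper name is my own, and splitting on whitespace is a simplification.

```python
from collections import Counter

def bag_of_words(text):
    """Map each word to the number of times it occurs in the text."""
    return Counter(text.lower().split())

bag = bag_of_words("the dog chased the cat")
same = bag_of_words("the cat chased the dog")

print(bag["the"])   # 2
print(bag == same)  # True: word order is ignored, only counts matter
```

The two sentences mean different things, yet they produce the same bag, which is exactly the information this model throws away.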

Levenshtein

The Levenshtein distance is the minimum number of single-character deletions, insertions, or substitutions needed to make a pair of strings equal.
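The distance is usually computed with dynamic programming. Here is a sketch that keeps only two rows of the table at a time; the function and variable names are my own.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits to turn string a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]

print(levenshtein("kitten", "sitting"))  # 3
```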

Sentiment analysis

Sentiment analysis is the process of evaluating and determining the sentiment captured in a selection of text, with sentiment defined as feeling or emotion.
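One very simple way to approach this is to count words from a positive list and a negative list. The sketch below does just that. The two word sets are tiny and made up, punctuation is not handled, and real sentiment models are usually trained on labeled data instead.

```python
# Tiny made-up word lists (assumptions for this example).
POSITIVE = {"good", "great", "love", "happy", "excellent"}
NEGATIVE = {"bad", "terrible", "hate", "sad", "awful"}

def sentiment(text):
    """Label text by comparing counts of positive and negative words."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("I love this great phone"))        # positive
print(sentiment("What a terrible and sad ending"))  # negative
```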

Here is a notebook that classifies emails as either ham or spam.

So that was the ninth week.. 🔏