WHAT IS NATURAL LANGUAGE PROCESSING? INTRO TO NLP

Posted by amritabansal on December 12th, 2020

What is Natural Language Processing (NLP)?

NLP is a branch of machine learning that deals with understanding, analyzing, and generating the languages that humans naturally use to communicate, so that we can interface with computers in human language instead of machine language.

Applications of NLP

Common applications include machine translation, sentiment analysis, chatbots, question answering, and speech recognition.
NLP Pipeline

There are 3 stages of an NLP pipeline:

1. Text Processing
2. Feature Extraction
3. Modeling
This pipeline isn't always linear and may require additional steps.

Why do we need to process text?

To make the raw input text free of any constructs that are not required for the task at hand.
Text Processing

Data Cleaning

Here we remove special characters, HTML tags, etc. from the raw text, as they do not contain any information for the model to learn from and are irrelevant or noisy data. For example, we can download a page and then scrape just the table data we care about:

```python
import urllib.request

# Download the raw HTML of a Wikipedia article
url = 'https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2'
req = urllib.request.urlopen(url)
article = req.read().decode()

with open('ISO_3166-1_alpha-2.html', 'w') as fo:
    fo.write(article)
```

```python
from bs4 import BeautifulSoup

# Load article, turn it into soup and get the tables.
article = open('ISO_3166-1_alpha-2.html').read()
soup = BeautifulSoup(article, 'html.parser')
tables = soup.find_all('table', class_='sortable')

# Search through the tables for the one with the headings we want.
for table in tables:
    ths = table.find_all('th')
    headings = [th.text.strip() for th in ths]
    if headings[:5] == ['Code', 'Country name', 'Year', 'ccTLD', 'ISO 3166-2']:
        break

# Extract the columns we want and write to a semicolon-delimited text file.
with open('iso_3166-1_alpha-2_codes.txt', 'w') as fo:
    for tr in table.find_all('tr'):
        tds = tr.find_all('td')
        if not tds:
            continue
        code, country, year, ccTLD = [td.text.strip() for td in tds[:4]]
        # Wikipedia does something funny with country names containing
        # accented characters: extract the correct string form.
        if '!' in country:
            country = country[country.index('!')+1:]
        print('; '.join([code, country, year, ccTLD]), file=fo)
```

Data Normalization

Data normalization involves steps such as case normalization, punctuation removal, etc., so that the text is in a single, consistent format for the machine to learn from.

Case Normalization

Car, car, CAR -> they all mean the same thing. So, convert all text to lower case to bring it to a common case.

```python
# Convert text to lower case
text = text.lower()
print(text)
```

Punctuation Removal

Replace punctuation with spaces.

```python
import re

# Replace punctuation characters with spaces
text = re.sub(r"[^a-zA-Z0-9]", " ", text)
print(text)
```

Tokenization

Tokenization is the process of breaking up text documents into individual words called tokens.

```python
# Split text into tokens (words)
# Splits based on whitespace
words = text.split()
print(words)
```

So far, we have been using Python's built-in functions for these tasks.

Tokenization using NLTK

```python
import nltk
nltk.download('punkt')  # required for the tokenizer

from nltk.tokenize import word_tokenize, sent_tokenize

# Split text into words using NLTK
text = "Mr. Gyansetu graduated from IIT-Delhi. He later started an analytics firm called Lux, which catered to enterprise customers."
words = word_tokenize(text)
print(words)

## Output
## ['Mr.', 'Gyansetu', 'graduated', 'from', 'IIT-Delhi', '.', 'He', 'later', 'started', 'an', 'analytics', 'firm', 'called', 'Lux', ',', 'which', 'catered', 'to', 'enterprise', 'customers', '.']

# Split the same text into sentences
sentences = sent_tokenize(text)
print(sentences)

## Output
## ['Mr. Gyansetu graduated from IIT-Delhi.', 'He later started an analytics firm called Lux, which catered to enterprise customers.']
```

Stop Word Removal

Stop word removal means removing non-important words like 'a', 'is', 'the', 'and', 'an', 'are', 'me', 'i', etc. There is an in-built stopword list in NLTK which we can use to remove stop words from text documents.
However, this is not the definitive stopword list for every problem; we can also define our own set of stop words based on the domain.

```python
import nltk
nltk.download('stopwords')  # corpus of stopwords

# List of stop words
from nltk.corpus import stopwords
print(stopwords.words("english"))

## Output
## ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

# Remove stop words
words = [w for w in words if w not in stopwords.words("english")]
print(words)

## Output
## ['Mr.', 'Gyansetu', 'graduated', 'IIT-Delhi', '.', 'He', 'later', 'started', 'analytics', 'firm', 'called', 'Lux', ',', 'catered', 'enterprise', 'customers', '.']
```

Parts of Speech (POS) Tagging

Given a sentence, determine the POS tag for each word (e.g., NOUN, VERB, ADV, ADJ). You can use the inbuilt part-of-speech tagger provided in NLTK. There are other, more advanced forms of POS tagging that can learn sentence structures and tags from given data, including Hidden Markov Models (HMMs) and Recurrent Neural Networks (RNNs).

```python
import nltk
nltk.download('averaged_perceptron_tagger')

from nltk import pos_tag
from nltk.tokenize import word_tokenize

# Tag parts of speech (POS)
sentence = word_tokenize("I always lie down to tell a lie.")
print(pos_tag(sentence))

## Output
## [('I', 'PRP'),
##  ('always', 'RB'),
##  ('lie', 'VBP'),
##  ('down', 'RP'),
##  ('to', 'TO'),
##  ('tell', 'VB'),
##  ('a', 'DT'),
##  ('lie', 'NN'),
##  ('.', '.')]
```

Named Entity Recognition

In information extraction, a named entity is a real-world object, such as a person, location, organization, or product, that can be denoted with a proper name.

```python
import nltk
nltk.download('maxent_ne_chunker')
nltk.download('words')

from nltk import pos_tag, ne_chunk
from nltk.tokenize import word_tokenize

# Recognize named entities in a tagged sentence
out = ne_chunk(pos_tag(word_tokenize("Shalki joined Gyansetu Ltd. in Gurgaon.")))
print(out.__repr__())

## Output
## Tree('S', [Tree('PERSON', [('Shalki', 'NNP')]), ('joined', 'VBD'), Tree('PERSON', [('Gyansetu', 'NNP')]), ('Ltd.', 'NNP'), ('in', 'IN'), Tree('GPE', [('Gurgaon', 'NNP')]), ('.', '.')])
```

Stemming

Stemming is the process of reducing a word to its root form. Branching, branched, and branches can all be reduced to branch; caching, caches, and cached can all be reduced to cache.

```python
from nltk.stem.porter import PorterStemmer

# Reduce words to their stems
stemmed = [PorterStemmer().stem(w) for w in words]
print(stemmed)

## Output
## ['mr.', 'gyansetu', 'graduat', 'iit-delhi', '.', 'he', 'later', 'start', 'analyt', 'firm', 'call', 'lux', ',', 'cater', 'enterpris', 'custom', '.']
```

Feature Extraction

Feature extraction is a way of extracting feature vectors from the text after the text processing step, so that they can be used as input to a machine learning model. The extracted features can take the form of a graph of nodes (such as WordNet) or of vectors representing words (doc2vec, word2vec, sent2vec, GloVe, etc.).

Word embedding is one such technique, where we represent text using vectors. The more popular forms of word embeddings are Bag-of-Words, TF-IDF, and Word2Vec.
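As a quick illustration of the word-vector idea, here is a minimal sketch of training embeddings, assuming gensim (version 4.0 or later) is installed; the library choice and the toy corpus are illustrative assumptions, not something the pipeline above requires:

```python
# A minimal Word2Vec sketch, assuming gensim >= 4.0 is installed.
# The toy corpus below is invented purely for illustration.
from gensim.models import Word2Vec

# A toy corpus: a list of tokenized sentences
sentences = [
    ["mr", "gyansetu", "graduated", "from", "iit", "delhi"],
    ["he", "later", "started", "an", "analytics", "firm", "called", "lux"],
    ["the", "firm", "catered", "to", "enterprise", "customers"],
]

# Train a small model: each word becomes a 50-dimensional vector
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)

# Look up the learned vector for a word
print(model.wv["firm"])

# Find the words whose vectors are closest to it
print(model.wv.most_similar("firm", topn=3))
```

On a real corpus, words that appear in similar contexts end up with similar vectors, which is exactly the property the feature extraction stage is after.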
Bag-of-Words Model

The bag-of-words model, or BoW for short, is a way of extracting features from text for use in modeling, such as with machine learning algorithms. It treats each document as a collection/bag of words. A bag-of-words is a representation of text that describes the occurrence of words within a document. It involves two things:

1. A vocabulary of known words.
2. A measure of the presence of those known words.
The intuition is that documents are similar if they have similar content. Further, from the content alone we can learn something about the meaning of the document. A worked example follows below.
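Here is a minimal sketch of building a bag-of-words representation, assuming scikit-learn is installed; the library choice and the two sample documents are illustrative assumptions:

```python
# A minimal bag-of-words sketch, assuming scikit-learn is installed.
# The two sample documents are invented purely for illustration.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "Gyansetu graduated from IIT Delhi",
    "Gyansetu started an analytics firm",
]

# Build the vocabulary and count word occurrences per document
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)

# The learned vocabulary of known words
print(vectorizer.get_feature_names_out())
# ['an' 'analytics' 'delhi' 'firm' 'from' 'graduated' 'gyansetu' 'iit' 'started']

# Each row is a document; each column counts one vocabulary word
print(bow.toarray())
# [[0 0 1 0 1 1 1 1 0]
#  [1 1 0 1 0 0 1 0 1]]
```

Note how the representation discards word order entirely (hence "bag") and keeps only which known words occur and how often.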
Modeling

The final stage of the NLP pipeline is modeling, which includes designing a statistical or machine learning model, fitting its parameters to training data using an optimization procedure, and then using the model to make predictions about unseen data. The nice thing about working with numerical features is that they allow you to choose from all machine learning models, or even a combination of them. Some of the models commonly used here are Naive Bayes, logistic regression, support vector machines, decision trees, and neural networks. A tiny end-to-end sketch follows below.
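To make the modeling stage concrete, here is a minimal sketch that chains feature extraction and a classifier, assuming scikit-learn is installed; the tiny labeled sentiment dataset is invented purely for illustration:

```python
# A tiny end-to-end modeling sketch, assuming scikit-learn is installed.
# The toy labeled dataset below is invented purely for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_docs = [
    "great product works well",
    "excellent service very happy",
    "terrible product broke quickly",
    "very unhappy with the service",
]
train_labels = ["positive", "positive", "negative", "negative"]

# Feature extraction: bag-of-words counts
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)

# Modeling: fit a Naive Bayes classifier to the training data
model = MultinomialNB()
model.fit(X_train, train_labels)

# Prediction on unseen text
X_test = vectorizer.transform(["happy with this great service"])
print(model.predict(X_test))  # likely ['positive']
```

Any of the models listed above could be swapped in for MultinomialNB here, since they all consume the same numerical feature matrix.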
Once you have a working model, you can deploy it as a web app or a mobile app, or integrate it with other products and services. The possibilities are endless!