Build a Text Classification Program: An NLP Tutorial
Deep learning has proven its power across many domains, from beating humans at complex board games to synthesizing music. It has also been used extensively in natural language processing.
In this article, Toptal Freelance Software Engineer Shanglun (Sean) Wang shows how easy it is to build a text classification program using different techniques and how well they perform against each other.
Deep learning is a technology that has become an essential part of machine learning workflows. Capitalizing on improvements in parallel computing power and supporting tools, complex and deep neural networks that were once impractical are now becoming viable.
The emergence of powerful and accessible libraries such as TensorFlow, Torch, and Deeplearning4j has also opened development to users beyond academia and the research departments of large technology companies. In a testament to its growing ubiquity, companies like Huawei and Apple are now including dedicated, deep-learning-optimized processors in their newest devices to power deep learning applications.
Deep learning has proven its power across many domains. Most notably, Google’s AlphaGo was able to defeat human players at Go, a game whose mind-boggling complexity was once deemed a near-insurmountable barrier for computers competing against humans. Sony’s Flow Machines project has developed a neural network that can compose music in the style of famous musicians of the past. FaceID, a security feature developed by Apple, uses deep learning to recognize the user’s face and to track changes to that face over time.
In this article, we will apply deep learning to two of my favorite topics: natural language processing and wine. We will build a model to understand natural-language wine reviews by experts and deduce the variety of the wine they’re reviewing.
Deep Learning for NLP
Deep learning has been used extensively in natural language processing (NLP) because it is well suited for learning the complex underlying structure of a sentence and semantic proximity of various words. For example, the current state of the art for sentiment analysis uses deep learning in order to capture hard-to-model linguistic concepts such as negations and mixed sentiments.
Deep learning has several advantages over other algorithms for NLP:
- Flexible models: Deep learning models are much more flexible than other ML models. We can easily experiment with different structures, adding and removing layers as needed. Deep learning models also allow for building models with flexible outputs. The flexibility is key to developing models that are well suited for understanding complex linguistic structures. It is also essential for developing NLP applications such as translations, chatbots, and text-to-speech applications.
- Less domain knowledge required: While one certainly needs some domain knowledge and intuition to develop a good deep learning model, deep learning algorithms’ ability to learn feature hierarchies on their own means that a developer doesn’t need as much in-depth knowledge of the problem space to develop deep learning NLP algorithms. For a problem space as complex as natural language, this is a very welcome advantage.
- Easier ongoing learning: Deep learning models are easy to update as new data comes in. Some machine learning algorithms require the entire dataset to be run through the model again in order to update it, which would present a problem for live, large datasets (see the sketch after this list).
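To make that last point concrete, here is a minimal sketch of incremental training in Keras, using a small hypothetical model and randomly generated stand-in data. Because a compiled Keras model keeps its weights between training calls, new data can be folded in without retraining from scratch:
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# A tiny hypothetical model, purely for illustration.
model = Sequential()
model.add(Dense(16, activation='relu', input_dim=20))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam')

# Initial training on the data available today (random stand-in data here)...
x_initial, y_initial = np.random.rand(100, 20), np.random.randint(0, 2, 100)
model.fit(x_initial, y_initial, epochs=3, batch_size=32)

# ...and later, when new examples arrive, we simply keep training.
# The weights are preserved between calls, so this is an incremental update.
x_new, y_new = np.random.rand(10, 20), np.random.randint(0, 2, 10)
model.train_on_batch(x_new, y_new)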
The Problem Today
Today, we will build a deep learning algorithm to determine the variety of the wine being reviewed based on the review text. We will be using the wine review dataset at https://www.kaggle.com/zynicide/wine-reviews, which is provided by Kaggle user zackthoutt.
Conceptually, the question is, can we take a wine review like…
Aromas include tropical fruit, broom, brimstone and dried herb. The palate isn’t overly expressive, offering unripened apple, citrus and dried sage alongside brisk acidity.
…and recognize that it is about a white blend? Some wine enthusiasts might recognize telltale signs of white wines such as apple, citrus, and a pronounced acidity, but can we train our neural network to recognize these signals? Additionally, can we train our neural network to recognize the subtle differences between a white blend review and a pinot grigio review?
Similar Algorithms
The problem we’re working with today is essentially an NLP classification problem. Several classification algorithms have been applied to various problems in NLP. For example, naive Bayes has been used in various spam detection algorithms, and support vector machines (SVMs) have been used to classify texts such as progress notes at healthcare institutions. It would be interesting to implement simple versions of these algorithms to serve as a baseline for our deep learning model.
Naive Bayes
A popular implementation of naive Bayes for NLP involves preprocessing the text using TF-IDF and then running multinomial naive Bayes on the preprocessed outputs. This allows the algorithm to be run on the most prominent words within a document. We can implement naive Bayes as follows:
import numpy as np
import pandas as pd
from collections import Counter
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

df = pd.read_csv('data/wine_data.csv')

# Keep only the 10 most common varieties and map each one to an integer label.
counter = Counter(df['variety'].tolist())
top_10_varieties = {i[0]: idx for idx, i in enumerate(counter.most_common(10))}
df = df[df['variety'].map(lambda x: x in top_10_varieties)]

description_list = df['description'].tolist()
varietal_list = [top_10_varieties[i] for i in df['variety'].tolist()]
varietal_list = np.array(varietal_list)

# Convert the review text into TF-IDF features.
count_vect = CountVectorizer()
x_train_counts = count_vect.fit_transform(description_list)
tfidf_transformer = TfidfTransformer()
x_train_tfidf = tfidf_transformer.fit_transform(x_train_counts)

train_x, test_x, train_y, test_y = train_test_split(x_train_tfidf, varietal_list, test_size=0.3)

clf = MultinomialNB().fit(train_x, train_y)
y_score = clf.predict(test_x)

# Count the predictions that match the true labels.
n_right = 0
for i in range(len(y_score)):
    if y_score[i] == test_y[i]:
        n_right += 1

print("Accuracy: %.2f%%" % ((n_right/float(len(test_y)) * 100)))
Run the above code and you should see something like the following output: Accuracy: 73.56%
Considering we’re looking at 10 classes, this is quite a good result.
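As a side note, the manual counting loop above could also be replaced with scikit-learn’s built-in metric; a minimal sketch, reusing the y_score and test_y variables from the script above:
from sklearn.metrics import accuracy_score

# Equivalent to the manual loop: the fraction of predictions matching the labels.
print("Accuracy: %.2f%%" % (accuracy_score(test_y, y_score) * 100))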
We can also use a support vector machine and see how it performs. To do so, simply add the SVC import and replace the classifier definition with:
from sklearn.svm import SVC
clf = SVC(kernel='linear').fit(train_x, train_y)
Run this and you should see the following output:
Accuracy: 80.66%
Not too shabby either.
Let’s see if we can build a deep learning model that can surpass or at least match these results. If we manage that, it would be a great indication that our deep learning model is effective in at least replicating the results of the popular machine learning models informed by domain expertise.
Building the Model
Today, we will be using Keras with TensorFlow to build our model. Keras is a Python library that makes building deep learning models very easy compared to the relatively low-level interface of the TensorFlow API. In addition to the dense layers, we will also use embedding and convolutional layers to learn the underlying semantic information of the words and potential structural patterns within the data.
Data Cleaning
First, we will have to restructure the data in a way that can be easily processed and understood by our neural network. We can do this by replacing the words with uniquely identifying numbers. Combined with an embedding vector, we are able to represent the words in a manner that is both flexible and semantically sensitive.
In practice, we will want to be a little smarter about this preprocessing. It would make sense to focus on the commonly used words, and to also filter out the most commonly used words (e.g., the, this, a).
We can implement this functionality using defaultdict and NLTK. Write the following code into a separate Python module; I placed it in lib/get_top_x_words.py.
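One caveat: NLTK’s word_tokenize relies on tokenizer models that aren’t bundled with the library itself, so if you haven’t used it before, you will likely need a one-time download first:
import nltk
nltk.download('punkt')  # one-time download of the tokenizer models used by word_tokenize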
from nltk import word_tokenize
from collections import defaultdict

def count_top_x_words(corpus, top_x, skip_top_n):
    # Count the occurrences of every token in the corpus...
    count = defaultdict(lambda: 0)
    for c in corpus:
        for w in word_tokenize(c):
            count[w] += 1
    # ...then keep the top_x most frequent words, skipping the first
    # skip_top_n (very common words such as "the", "this", "a").
    count_tuples = sorted([(w, c) for w, c in count.items()], key=lambda x: x[1], reverse=True)
    return [i[0] for i in count_tuples[skip_top_n: skip_top_n + top_x]]

def replace_top_x_words_with_vectors(corpus, top_x):
    # Map each kept word to a unique integer and encode every document as
    # the sequence of those integers, dropping words outside the kept list.
    topx_dict = {top_x[i]: i for i in range(len(top_x))}
    return [
        [topx_dict[w] for w in word_tokenize(s) if w in topx_dict]
        for s in corpus
    ], topx_dict

def filter_to_top_x(corpus, n_top, skip_n_top=0):
    top_x = count_top_x_words(corpus, n_top, skip_n_top)
    return replace_top_x_words_with_vectors(corpus, top_x)
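A quick usage sketch on a hypothetical toy corpus, just to make the output shape concrete:
corpus = ["the wine is crisp and dry", "the wine is sweet"]
encoded, word_index = filter_to_top_x(corpus, n_top=5, skip_n_top=0)
# encoded is a list of integer sequences, one per document;
# word_index maps each kept word to its integer ID.
print(encoded)
print(word_index)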
Now we’re ready to build the model. We want an embedding layer, a convolutional layer, and a dense layer to take advantage of all of the deep learning features that can be helpful for our application. With Keras, we can build the model very simply:
from keras.models import Sequential
from keras.layers import Dense, Conv1D, Flatten
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence
from keras.utils import to_categorical
import pandas as pd
from collections import Counter
from sklearn.model_selection import train_test_split
from lib.get_top_x_words import filter_to_top_x

df = pd.read_csv('data/wine_data.csv')

# As before, keep only the 10 most common varieties.
counter = Counter(df['variety'].tolist())
top_10_varieties = {i[0]: idx for idx, i in enumerate(counter.most_common(10))}
df = df[df['variety'].map(lambda x: x in top_10_varieties)]

description_list = df['description'].tolist()

# Encode each review using the 2,500 most common words, skipping the top 10.
mapped_list, word_list = filter_to_top_x(description_list, 2500, 10)
varietal_list_o = [top_10_varieties[i] for i in df['variety'].tolist()]
varietal_list = to_categorical(varietal_list_o)

# Pad/truncate every encoded review to a fixed length of 150 tokens.
max_review_length = 150
mapped_list = sequence.pad_sequences(mapped_list, maxlen=max_review_length)
train_x, test_x, train_y, test_y = train_test_split(mapped_list, varietal_list, test_size=0.3)

embedding_vector_length = 64

model = Sequential()
model.add(Embedding(2500, embedding_vector_length, input_length=max_review_length))
model.add(Conv1D(50, 5))
model.add(Flatten())
model.add(Dense(100, activation='relu'))
model.add(Dense(max(varietal_list_o) + 1, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

model.fit(train_x, train_y, epochs=3, batch_size=64)

y_score = model.predict(test_x)

# Convert the softmax probabilities to one-hot predictions...
y_score = [[1 if i == max(sc) else 0 for i in sc] for sc in y_score]

# ...and count how many match the one-hot test labels exactly.
n_right = 0
for i in range(len(y_score)):
    if all(y_score[i][j] == test_y[i][j] for j in range(len(y_score[i]))):
        n_right += 1

print("Accuracy: %.2f%%" % ((n_right/float(len(test_y)) * 100)))
Run the code and you should see the following output.
Accuracy: 77.20%
Recall that the accuracies for naive Bayes and SVC were 73.56% and 80.66%, respectively. So our neural network is very much holding its own against some of the more common text classification methods out there.
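As an aside, the one-hot comparison loop in the script above could also be written more compactly with NumPy’s argmax; a minimal sketch, assuming the model, test_x, and test_y variables from that script:
import numpy as np

# Compare the index of the highest softmax probability in each prediction
# against the index of the 1 in the corresponding one-hot test label.
predicted = np.argmax(model.predict(test_x), axis=1)
actual = np.argmax(test_y, axis=1)
print("Accuracy: %.2f%%" % (np.mean(predicted == actual) * 100))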
Conclusion
Today, we covered building a deep learning classification model to analyze wine reviews.
We found that we were able to build a model that competes with, and in some cases outperforms, other machine learning algorithms. We hope that you are inspired to use this information to build applications that analyze more complex datasets and generate more complex outputs!
Note: You can find the code I used for this article on GitHub.
Understanding the basics
What is natural language processing?
Natural language processing is the range of computational techniques used to analyze or produce human language and speech.