Text Classification: Predicting ‘Good’ or ‘Bad’ Statements using Natural Language Processing

Create and train an NLP model using spaCy library to predict and classify the input lines

Swati Rajwal
Towards Data Science


Picture by Ramdlon from Pixabay

This blog will cover a very fundamental method of predicting whether a given input statement should be classified as ‘Good’ or ‘Bad’. To do so, we will first train a Natural Language Processing (NLP) model on a small dataset. So, let’s begin!

Pre-requisite:

You should be aware of the Bag of Words (BOW) approach. You may check out [1] for more details. The BOW approach essentially converts text to numbers, making it simpler for an NLP model to learn.
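To make the idea concrete, here is a plain-Python sketch (not part of the article’s code) of how two sentences can be turned into count vectors over a shared vocabulary:

```python
from collections import Counter

# Two hypothetical sentences (not from the article's dataset)
sentences = ["I look awesome", "I look bad today"]

# Build the vocabulary: one index per unique lowercased word
vocab = sorted({word for s in sentences for word in s.lower().split()})

# Convert each sentence into a vector of word counts over the vocabulary
vectors = []
for s in sentences:
    counts = Counter(s.lower().split())
    vectors.append([counts[word] for word in vocab])

print(vocab)    # ['awesome', 'bad', 'i', 'look', 'today']
print(vectors)  # [[1, 0, 1, 1, 0], [0, 1, 1, 1, 1]]
```

Each sentence becomes a fixed-length numeric vector, which is exactly the representation the bow architecture below learns from.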

In this tutorial, Google Colab is used to run the script. You may choose any other platform of your choice. Also, the scripting language used is Python.

The Dataset

As this is a very introductory blog, I have written the dataset myself, with only 7 rows in total, as shown in Table 1. Once you are familiar with the essentials, I highly recommend that you pick a bigger dataset and apply similar processing to gain experience. Kaggle is a great place to find a large number of datasets.

Table 1: Custom Dataset with simple statements labeled as either ‘Good’ or ‘Bad’
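For reference, a dataset like this can be held in memory as a list of (text, annotations) pairs, which is the format spaCy’s training loop expects. The statements below are hypothetical stand-ins; the actual rows of Table 1 may differ:

```python
# Hypothetical stand-in for Table 1 (7 rows): each entry pairs a statement
# with its class annotations in the {"cats": {...}} format spaCy expects
train_data = [
    ("I look awesome",         {"cats": {"Good": True,  "Bad": False}}),
    ("This food tastes great", {"cats": {"Good": True,  "Bad": False}}),
    ("What a wonderful day",   {"cats": {"Good": True,  "Bad": False}}),
    ("I passed the exam",      {"cats": {"Good": True,  "Bad": False}}),
    ("I feel terrible",        {"cats": {"Good": False, "Bad": True}}),
    ("This movie was awful",   {"cats": {"Good": False, "Bad": True}}),
    ("I lost my keys again",   {"cats": {"Good": False, "Bad": True}}),
]

print(len(train_data))  # 7 rows, matching the custom dataset's size
```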

Python Script

You can find the entire codebase in this GitHub repository. In this blog, I will explain only the important code snippets. If you still need assistance with other parts of the code, do comment and I will be happy to help.

We are going to use the spaCy package, a free, open-source library for Natural Language Processing in Python. I highly recommend visiting the package’s website for a quick glance at what it offers.

After importing the dataset, we will create a blank model and add a text categorizer as provided by spaCy. Note that there are several text categorizer architectures, but we are going to use textcat as shown in the code below. textcat is used because we want to predict exactly one true label, which will be either Good or Bad.

nlp = spacy.blank("en")  # create a blank English model named nlp

# text categorizer with standard settings
textcat = nlp.create_pipe("textcat", config={
    "exclusive_classes": True,  # each statement gets exactly one label
    "architecture": "bow"})     # bow = bag of words

nlp.add_pipe(textcat)  # add textcat to the nlp pipeline

# register the two class labels before training
textcat.add_label("Good")
textcat.add_label("Bad")

Training the Model

In order to train a model, you will need an optimizer, and here the spaCy package comes to the rescue. The optimizer updates the model during the training phase, one minibatch at a time. The code below does exactly what we just described.
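Conceptually, minibatching just slices the training data into fixed-size chunks so the optimizer can update the model a little at a time. A plain-Python sketch of the idea (an illustration, not spaCy’s actual implementation):

```python
# Stand-in for 7 training examples
data = list(range(7))

def minibatch_sketch(items, size):
    """Yield consecutive chunks of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

batches = list(minibatch_sketch(data, size=3))
print(batches)  # [[0, 1, 2], [3, 4, 5], [6]]
```

spacy.util.minibatch, used below, plays the same role for our (text, label) pairs.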

from spacy.util import minibatch

optimizer = nlp.begin_training()  # create the optimizer spaCy uses to update the model

batches = minibatch(train_data, size=8)  # spaCy provides the minibatch function

for batch in batches:
    texts, labels = zip(*batch)  # split each (text, annotations) pair
    nlp.update(texts, labels, sgd=optimizer)  # update model weights on this batch

Making Predictions

In the previous step, we trained the model on the input dataset. Once that is complete, you can use the trained model to make predictions on input statements using the predict() method, as shown below:

# list all lines to be predicted in the 'Lines' array
Lines = ["I look awesome"]
docs = [nlp.tokenizer(text) for text in Lines]
textcat = nlp.get_pipe('textcat')
scores, _ = textcat.predict(docs)
# probability scores for both classes (Good/Bad)
print(scores)

Also, notice one thing: when the above code runs, the output will be something like:

[[0.50299996 0.49700007]]

The above output contains the probability scores for both class labels, which in the current scenario are Good and Bad. The probability of the given input line being Good is higher (0.50299996), so the model classifies the line as Good.

To make the predictions more readable, let’s print the class label for the given input line instead of the probability scores.

predicted_labels = scores.argmax(axis=1)
print([textcat.labels[label] for label in predicted_labels])

And you will see the output as below:

['Good']
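To see what argmax is doing under the hood, here is a plain-Python equivalent (the label order is assumed to match textcat.labels; this is an illustration, not the article’s code):

```python
# Scores copied from the output above; labels assumed in this order
labels = ["Good", "Bad"]
scores = [[0.50299996, 0.49700007]]

# For each row, find the index of the highest probability, then map it to a label
predicted = [labels[max(range(len(row)), key=lambda i: row[i])] for row in scores]
print(predicted)  # ['Good']
```

This is exactly the argmax-then-lookup step the two lines above perform with NumPy.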

Next Steps

First of all, congratulations! You just learned how to build a text classifier using the spaCy library. There are certainly many other ways to do this, and I will write more tutorial blogs later on. For now, I would like to encourage my readers to take what they learned from this tutorial and apply it to some relatively larger datasets.

Also, you can ask me a question on Twitter and LinkedIn!

References

[1] Ismayil, M. (2021, February 10). From Words To Vectors — Towards Data Science. Medium. https://towardsdatascience.com/from-words-to-vectors-e24f0977193e
