Naive Bayes classification of IMDB reviews

Text Classification (Sentiment Analysis) Using Naive Bayes Rule

An exploratory project on IMDB reviews, in the spirit of Kaggle’s “Bag of Words Meets Bags of Popcorn” competition

Objective:

As an introductory project, I wanted to write a Naive Bayes classifier by hand and explore different model variations to see how they perform. I therefore wrote this code using minimalist libraries (like NumPy) and then directly tested and visualized the results. I also repeatedly and iteratively refined this notebook to make it as efficient and modular as I could in the couple of days I had.

Goal:

We need a program that takes full-text reviews and classifies each one as a positive or negative review of that movie, business, etc. I will design and then train a model to pull meaningful words from the reviews and use them to determine the polarity (positive or negative) of the review. For training, I have data that include the correct polarity, derived from the star rating each reviewer submitted with their text review.

Methods:

As an introductory project, I will create a multinomial Naive Bayes classifier using the bag-of-words model. I will not, however, use sklearn or other libraries that do the Bayes classification for me. Instead, I’ll use NumPy and basic accessory libraries like NLTK to try stemming to improve results.

Desired comparisons:

  1. Unigrams, Bigrams, and/or Trigrams vs. model performance
  2. Stemming and/or Lemmatizing vs. model performance
  3. Model accuracy, precision, recall, and F-score

Evaluation:

Once complete, I will run it on the full dataset from the Kaggle competition to test the methods on a large dataset and to see how well it performs against the existing leaderboard. (NOTE: the competition has ended so the leaderboard is frozen and will not update with new submissions.)

Setup: installations and imports

In [1]:
#-------------------------------------------------------------------------------------
# One-time runs - do not uncomment unless needed to install on new device
#-------------------------------------------------------------------------------------

#!pip install stop-words #Used to compare to the stopwords supplied with the IMDB data

#import nltk
#nltk.download()

#!pip install kaggle
In [2]:
%load_ext autoreload
%autoreload 2

import tools as t     #tools.py file included with this data
import numpy as np
import string    #using punctuation
import pandas as pd
from math import log
import matplotlib.pyplot as plt

from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import PorterStemmer
from nltk.stem import LancasterStemmer

stopwords=set(t.read_txt_file('./data/english.stop'))

print("complete")
complete

Main functions and classes

Process for each IMDB review:

  1. Parse text, remove punctuation, and tokenize to unigram words
  2. Save unigram words to global variable for that dataset
  3. Generate new unigram matrices with each stemming and lemmatizing combination desired (4 combinations)
  4. Generate bigrams and trigrams and combinations thereof from each unigram matrix created in step 3 (5 combinations)
  5. Train for each combination, saving the self.prob for later use if model selected (4×5 = 20 combinations)
  6. Test for each combination and calculate accuracy, precision, recall, and F-score for each
  7. Plot each comparison as desired to evaluate results
  8. Choose best model for final use

Key Data:

Global variables:

X: (matrix) bag of words (uni, bi, and/or tri) for each movie review, one review per row
Y: (vector) known review classifications (only available for training/test data)
Combo: (dict) NaiveBayes objects created for each combination of stemming/lemmatizing and uni/bi/trigrams for model testing
acc_small: (dict) model accuracy values from testing
acc_final: (dict) model accuracy, precision, recall, and F-score values from testing

NaiveBayes class variables:

self.classes: (list) all classes (positive/negative, 1/0, etc) for the data in Y
self.prior: (dict) for this dataset, the probability of each class being the result
self.prob: (dict) for this dataset, the log-probability of each word appearing in each class

Functions (a short usage sketch follows this list):

  1. files_to_strings – pass in a list of all files to be read, return list of strings containing text of review
  2. parse_file – pass in a filename, return the string of raw text in it (unprocessed)
  3. parse_string – pass in a string, return a bag of words (unigrams)
  4. stem_lemm – pass in a bag of unigram words, return a bag of unigram words processed based on desired combo
  5. gen_bigrams – pass in a bag of unigram words, return a bag of bigram words
  6. gen_trigrams – pass in a bag of unigram words, return a bag of trigram words
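
To make the pipeline concrete, here is a minimal sketch of how a single review would flow through these functions once they are defined in the cells below (the review string and the 'Stem' mode are purely illustrative):

# Illustration only: a made-up review string, processed with the functions defined below
raw = "This movie was surprisingly good and the acting was excellent"

unigrams = parse_string(raw)              # steps 1-2: strip punctuation, tokenize, drop stopwords
unigrams = stem_lemm(unigrams, 'Stem')    # step 3: one of the four stemming/lemmatizing modes
bigrams = gen_bigrams(unigrams)           # step 4: adjacent word pairs
trigrams = gen_trigrams(unigrams)         # step 4: adjacent word triples
features = unigrams + bigrams + trigrams  # the 'Uni-Bi-Tri' bag of words fed to NaiveBayes in step 5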
In [3]:
def files_to_strings(filenames):
    """ Read an array of files where each file content is read in a string

        Input:
        -------
        X an array of file names

        Returns:
        --------
        A list with each row containing a read string from the file"""

    return ([ parse_file(file) for file in filenames ])
In [4]:
def parse_file(filename): # Parse a given file
    """ Parse the input file:

        Parameters:
        ----------
        filename: name of PLAIN TEXT file to be read (i.e., not CSV or TSV)

        Returns:
        ---------
        read file as raw string (with \n, \t, \r, etc included)"""
    
    #NOTE: each filename arrives as a one-element row of the data matrix, hence filename[0]
    with open(filename[0]) as file: return file.read()
In [5]:
def parse_string(string1): # Parse a read string into respective tokens
    """ Parse the input string and tokenize it into words:

        Parameters:
        ----------
        string: string to be parsed

        Returns:
        ---------
        list of tokens (unigrams) extracted from the string"""
    
    #Remove any punctuation
    for c in string.punctuation:
        string1 = string1.replace(c,"")

    #Keep lowercase tokens that start with a letter and are not stopwords
    return [a for a in string1.lower().split() if a[0] in 'abcdefghijklmnopqrstuvwxyz' and a not in stopwords]
In [6]:
def stem_lemm(bow, mode) : #bow = bag of words
    """ Apply the desired stemming or lemmatizing methods to the bag of unigram words:

        Parameters:
        ----------
        bow: bag of unigram words
        mode: type of stemming or lemmatizing method

        Returns:
        ---------
        list with processed bag of unigram words"""
    
    #Create instance of stemmer and lemmatizer
    ls = LancasterStemmer()
    lz = WordNetLemmatizer()
    
    #mode selects whether stemmer, lemmatizer, or both are applied
    if mode == 'None' :
        return bow

    elif mode == 'Stem' :
        return [ls.stem(word) for word in bow]

    elif mode == 'Lemm' :
        return [lz.lemmatize(word) for word in bow]
    
    elif mode == 'Lemm-Stem' :
        return [ls.stem(lz.lemmatize(word)) for word in bow]


def gen_bigrams(bow) :
    """ Create bag of bigram words from unigrams:

        Parameters:
        ----------
        bow: bag of unigram words

        Returns:
        ---------
        list with bag of bigram words"""

    return [' '.join([bow[a],bow[a+1]]) for a in range(len(bow)-1)]


def gen_trigrams(bow) :
    """ Create bag of trigram words from unigrams:

        Parameters:
        ----------
        bow: bag of unigram words

        Returns:
        ---------
        list with bag of trigram words"""

    return [' '.join([bow[a],bow[a+1],bow[a+2]]) for a in range(len(bow)-2)]
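
As a quick sanity check of the n-gram helpers, here is a toy bag of words and the output they would produce (shown in the comments):

bow = ['great', 'acting', 'terrible', 'plot']
print(gen_bigrams(bow))   # ['great acting', 'acting terrible', 'terrible plot']
print(gen_trigrams(bow))  # ['great acting terrible', 'acting terrible plot']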
In [7]:
def eval_model(pred,ans):
    """ calculate the model performance stats for this model:

        Parameters:
        ----------
        pred: list of class predictions from the model
        ans: actual classes

        Returns:
        ---------
        Accuracy, Precision, Recall, and F-score, each as a percentage"""
    
    #NOTE: relies on the global `classes`; classes[1] is treated as the positive class
    pred = [1 if a == classes[1] else 0 for a in pred]
    ans = [1 if a == classes[1] else 0 for a in ans]
    pred = np.array(pred,dtype=bool)
    ans = np.array(ans,dtype=bool)
    TP = np.sum((pred & ans))           #Predicted positive, was positive
    FN = np.sum((np.invert(pred) & ans))       #Predicted negative, was positive
    FP = np.sum((pred & np.invert(ans)))       #Predicted positive, was negative 
    TN = np.sum((np.invert(pred) & np.invert(ans)))   #Predicted negative, was negative
    Accuracy = 100*(TP+TN)/len(pred)
    Precision = 100*TP/(TP+FP)
    Recall = 100*TP/(TP+FN)
    Fscore = (2*Precision*Recall)/(Precision + Recall)
    return Accuracy, Precision, Recall, Fscore
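
A tiny worked example of eval_model (the values are invented for illustration; note that the function relies on the global classes, which the cross-validation cells below set via np.unique(labels)):

classes = ['neg', 'pos']  # illustration only; in this notebook it comes from np.unique(labels)
pred = ['pos', 'pos', 'neg', 'neg']
ans  = ['pos', 'neg', 'neg', 'pos']
print(eval_model(pred, ans))  # (50.0, 50.0, 50.0, 50.0): 1 TP, 1 FP, 1 TN, 1 FN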
In [8]:
class NaiveBayes:
    ''' Implements the Naive Bayes For Text Classification '''
    def __init__(self, classes):
        self.classes=classes
        self.prior = {} #dictionary mapping each class ('pos'/'neg') to its prior probability
        self.prob = {} #dictionary of log-probabilities as prob['pos'/'neg'][word] = log P(word|class)
        
    def train(self, X, Y):
        ''' Train the multiclass (or Binary) Bayes Rule using the given 
            X [m x d] data matrix and Y labels matrix

            Input:
            ------
            X: [m x d] a data matrix of m d-dimensional examples.
            Y: [m x 1] a label vector.

            Returns:
            -----------
            Nothing'''
        #Set classes:
        neg = self.classes[0]
        pos = self.classes[1]
        
        #Boolean arrays for positive and negative:
        pos_bool = Y==pos
        neg_bool = Y==neg
        
        #Prior probability:
        prob_pos = sum(pos_bool) / len(Y)
        prob_neg = sum(neg_bool) / len(Y)
        
        #Word probabilities:
        #Generate lists of all words in each class (positive class and negative class)
        X_pos = []
        X_neg = []
        for idx in range(len(X[:])): 
            if Y[idx]==pos : X_pos.extend(X[idx])
            elif Y[idx]==neg : X_neg.extend(X[idx])
            else : print("NON-CLASS ENTRY")

        #V = Vocabulary = all non-stop words in all IMDB reviews (positive or negative)
        V = sorted(set(X_pos+X_neg)) #sorted isn't necessary, but it helps for viewing AND converts to a list 
        V_count = len(V)

        #Get the words and counts for each non-stop word in positive and then negative reviews
        #NOTE: adding V into the list means EVERY word's count (+ 1) appears in both class lists 
        X_pos_unique, X_pos_counts = np.unique(V+X_pos, return_counts=True)
        X_neg_unique, X_neg_counts = np.unique(V+X_neg, return_counts=True)

        #Get the total number of considered words (i.e., not stop words) in each class
        X_pos_all_count = sum(X_pos_counts)
        X_neg_all_count = sum(X_neg_counts)

        #P(word|class) = [count of words in pos+1] / [(count of all words in pos) + (count of all words in V)]
        #NOTE: the +1 AND the +V was taken care of by "V+..." in the .unique step above
        #NOTE: ln (natural log) of probabilities taken to avoid floating-point underflow when multiplying many small probabilities
        word_given_pos_prob = dict( zip(X_pos_unique, np.log(X_pos_counts/X_pos_all_count)) )
        word_given_neg_prob = dict( zip(X_neg_unique, np.log(X_neg_counts/X_neg_all_count)) )
        
        #Save results to class variables
        self.prior = {'pos':prob_pos, 'neg':prob_neg}
        self.prob = {'pos':word_given_pos_prob, 'neg':word_given_neg_prob}
        
        return None 
        
        
    def test(self, X):
        ''' Test the trained classifiers on the given set of examples 

            Input:
            ------
            X: [m x d] a data matrix of m d-dimensional test examples.

            Returns:
            -----------
            pclass: the predicted class for each example, i.e. to which it belongs'''
        
        neg = self.classes[0]
        pos = self.classes[1]
        
        pclass = []
        for idx in range(len(X[:])) :
            #NOTE: words never seen in training default to 1 in both classes, so they add the same constant to both scores and do not affect the comparison
            prob_review_pos = log(self.prior['pos']) + sum([self.prob['pos'].get(word,1) for word in X[idx]])
            prob_review_neg = log(self.prior['neg']) + sum([self.prob['neg'].get(word,1) for word in X[idx]])
            if prob_review_pos > prob_review_neg : pclass.extend([pos])
            else : pclass.extend([neg])

        return pclass
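
In equation form, train() above implements standard multinomial Naive Bayes with add-one (Laplace) smoothing; prepending the vocabulary V to each class's word list is exactly what produces the +1 and +|V| terms. test() then scores each review in log space and picks the larger class:

$$\log P(w \mid c) = \log \frac{\mathrm{count}(w, c) + 1}{\sum_{w'} \mathrm{count}(w', c) + |V|}, \qquad \hat{c} = \arg\max_{c \in \{\mathrm{pos},\,\mathrm{neg}\}} \Big( \log P(c) + \sum_{w \in \mathrm{review}} \log P(w \mid c) \Big)$$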

Load small-dataset training and testing data:

In [9]:
#load data, get list of files for each class...
tdir='./data/imdb1/' # training dir...
posfiles=t.get_files(tdir+'/pos','*',withpath=True)
negfiles=t.get_files(tdir+'/neg','*',withpath=True)
In [10]:
#generate training and testing data...
labels = np.concatenate((['pos']*len(posfiles),['neg']*len(negfiles))) # concatenate the +ve and -ve labels
tX=np.concatenate((posfiles,negfiles)).reshape((len(posfiles)+len(negfiles),1))
print("Training data Dimensions =", tX.shape," Training labels dimensions=", labels.shape)
Training data Dimensions = (2000, 1)  Training labels dimensions= (2000,)
In [11]:
#It would be more efficient to perform parse_string inside the files_to_strings function. 
#However, since the Kaggle data used later doesn't go through files_to_strings, 
#parse_string is kept as a separate step.

X = files_to_strings(tX) # read files and convert each file into a string, saved in each row of a list
X = [parse_string(review) for review in X] #parse each review to generate unigrams

Comparative test runs with small dataset:

In [12]:
#Generate one NaiveBayes object at a time and save into COMBO
#Each object will train and test a model that uses a mode and gram combination as input to the Bayes process

nfolds=5

classes=np.unique(labels)
combo = {} #store objects created in this dictionary
acc_small = {} #store accuracy of each combo in this dictionary
modes = ['None','Stem','Lemm','Lemm-Stem']
gram_combos = ['Uni','Bi','Tri','Uni-Bi','Uni-Bi-Tri']

# Loop through to test each mode and create the x-gram combos to run
for mode in modes :
    unigrams = [stem_lemm(review,mode) for review in X]
    bigrams = []
    trigrams = []
    uni_bi = []
    uni_bi_tri = []
    
    for unigram in unigrams :
        bigram = gen_bigrams(unigram)
        trigram = gen_trigrams(unigram)
        uni_bi = uni_bi + [unigram + bigram]
        uni_bi_tri = uni_bi_tri + [unigram + bigram + trigram]
        bigrams = bigrams + [bigram]
        trigrams = trigrams + [trigram]

    for processed_reviews, gram_combo in zip([unigrams, bigrams, trigrams, uni_bi, uni_bi_tri],gram_combos) :
        folds=t.generate_folds(np.array(processed_reviews),labels,nfolds) # generate folds
        totacc = []
        totpre = []
        totrec = []
        totfsc = []

        for k in range(nfolds):
            traindata,trainlabels,testdata,testlabels=folds[k][0:4]
            combo[(mode,gram_combo)] = NaiveBayes(classes)
            combo[(mode,gram_combo)].train(traindata,trainlabels)
            pclasses = combo[(mode,gram_combo)].test(testdata)
            acc,pre,rec,fsc = eval_model(pclasses,testlabels)
            totacc.append(acc)
            totpre.append(pre)
            totrec.append(rec)
            totfsc.append(fsc)
        print('Model: Stem-Lemm = {}, Grams = {}'.format(mode, gram_combo))
        print('Accuracy: min = {}, mean = {}, max = {}'.format(np.min(totacc),np.mean(totacc),np.max(totacc)))
        print('Precision: min = {}, mean = {}, max = {}'.format(np.min(totpre),np.mean(totpre),np.max(totpre)))
        print('Recall: min = {}, mean = {}, max = {}'.format(np.min(totrec),np.mean(totrec),np.max(totrec)))
        print('Fscore: min = {}, mean = {}, max = {}'.format(np.min(totfsc),np.mean(totfsc),np.max(totfsc)))
        acc_small[(mode,gram_combo)]=(np.mean(totacc),np.mean(totpre),np.mean(totrec),np.mean(totfsc))
    print('') #space between each series
        
print('All models tested!')
Model: Stem-Lemm = None, Grams = Uni
Accuracy: min = 79.25, mean = 80.5, max = 81.75
Precision: min = 78.64077669902913, mean = 81.45331431708931, max = 83.78378378378379
Recall: min = 77.5, mean = 79.1, max = 81.0
Fscore: min = 79.19799498746868, mean = 80.23145196329503, max = 81.32992327365729
Model: Stem-Lemm = None, Grams = Bi
Accuracy: min = 74.5, mean = 77.05, max = 79.0
Precision: min = 74.5, mean = 78.61103010088759, max = 82.95454545454545
Recall: min = 73.0, mean = 74.5, max = 76.5
Fscore: min = 74.5, mean = 76.46483071606688, max = 77.66497461928934
Model: Stem-Lemm = None, Grams = Tri
Accuracy: min = 63.0, mean = 66.55, max = 71.25
Precision: min = 64.94252873563218, mean = 68.27543716938862, max = 72.72727272727273
Recall: min = 56.0, mean = 61.6, max = 68.0
Fscore: min = 60.42780748663101, mean = 64.73588381079742, max = 70.28423772609818
Model: Stem-Lemm = None, Grams = Uni-Bi
Accuracy: min = 77.5, mean = 80.0, max = 82.5
Precision: min = 74.33628318584071, mean = 77.12303580869009, max = 80.67632850241546
Recall: min = 80.0, mean = 85.4, max = 90.5
Fscore: min = 78.04878048780488, mean = 80.9998576707774, max = 83.7962962962963
Model: Stem-Lemm = None, Grams = Uni-Bi-Tri
Accuracy: min = 78.75, mean = 80.2, max = 82.25
Precision: min = 73.98373983739837, mean = 76.65440481213462, max = 79.18552036199095
Recall: min = 84.0, mean = 87.0, max = 91.0
Fscore: min = 79.80997624703086, mean = 81.46252745259424, max = 83.13539192399051

Model: Stem-Lemm = Stem, Grams = Uni
Accuracy: min = 77.5, mean = 78.95, max = 80.5
Precision: min = 78.1725888324873, mean = 79.45961653154066, max = 81.12244897959184
Recall: min = 76.0, mean = 78.1, max = 81.5
Fscore: min = 77.15736040609139, mean = 78.75987193059441, max = 80.3030303030303
Model: Stem-Lemm = Stem, Grams = Bi
Accuracy: min = 73.5, mean = 77.4, max = 79.0
Precision: min = 74.22680412371135, mean = 77.76671633928818, max = 82.58426966292134
Recall: min = 72.0, mean = 76.9, max = 81.5
Fscore: min = 73.0964467005076, mean = 77.26436844046518, max = 79.51219512195122
Model: Stem-Lemm = Stem, Grams = Tri
Accuracy: min = 63.25, mean = 66.95, max = 69.75
Precision: min = 66.25766871165644, mean = 68.64786373350015, max = 71.8232044198895
Recall: min = 54.0, mean = 62.3, max = 65.0
Fscore: min = 59.50413223140496, mean = 65.27385297436936, max = 68.24146981627295
Model: Stem-Lemm = Stem, Grams = Uni-Bi
Accuracy: min = 76.5, mean = 78.7, max = 82.25
Precision: min = 71.90082644628099, mean = 75.12597569507122, max = 78.9237668161435
Recall: min = 83.0, mean = 86.0, max = 88.0
Fscore: min = 78.73303167420815, mean = 80.164725733844, max = 83.21513002364067
Model: Stem-Lemm = Stem, Grams = Uni-Bi-Tri
Accuracy: min = 79.75, mean = 80.5, max = 81.0
Precision: min = 74.2063492063492, mean = 76.33969538168222, max = 77.43362831858407
Recall: min = 86.5, mean = 88.5, max = 93.5
Fscore: min = 81.0304449648712, mean = 81.93497102124552, max = 82.74336283185839

Model: Stem-Lemm = Lemm, Grams = Uni
Accuracy: min = 79.0, mean = 80.85, max = 83.25
Precision: min = 79.52380952380952, mean = 81.42490863295778, max = 83.41708542713567
Recall: min = 75.5, mean = 80.0, max = 83.5
Fscore: min = 78.23834196891193, mean = 80.65291242634629, max = 83.20802005012531
Model: Stem-Lemm = Lemm, Grams = Bi
Accuracy: min = 76.0, mean = 78.45, max = 80.0
Precision: min = 74.76190476190476, mean = 79.20768262502133, max = 81.48148148148148
Recall: min = 74.5, mean = 77.3, max = 80.0
Fscore: min = 76.58536585365854, mean = 78.20520815486907, max = 80.0
Model: Stem-Lemm = Lemm, Grams = Tri
Accuracy: min = 64.5, mean = 66.8, max = 68.25
Precision: min = 64.64646464646465, mean = 68.42043019772903, max = 71.6867469879518
Recall: min = 59.5, mean = 62.7, max = 65.0
Fscore: min = 64.321608040201, mean = 65.38184611743938, max = 67.18346253229976
Model: Stem-Lemm = Lemm, Grams = Uni-Bi
Accuracy: min = 77.5, mean = 81.0, max = 83.75
Precision: min = 73.91304347826087, mean = 77.91567203435515, max = 81.3953488372093
Recall: min = 85.0, mean = 86.7, max = 87.5
Fscore: min = 79.06976744186046, mean = 82.05574507436535, max = 84.33734939759036
Model: Stem-Lemm = Lemm, Grams = Uni-Bi-Tri
Accuracy: min = 77.5, mean = 79.65, max = 81.25
Precision: min = 73.96694214876032, mean = 76.16897205135184, max = 78.02690582959642
Recall: min = 83.5, mean = 86.4, max = 89.5
Fscore: min = 78.77358490566037, mean = 80.93733753969907, max = 82.26950354609929

Model: Stem-Lemm = Lemm-Stem, Grams = Uni
Accuracy: min = 78.0, mean = 79.5, max = 81.25
Precision: min = 78.04878048780488, mean = 80.06850749109498, max = 82.05128205128206
Recall: min = 74.0, mean = 78.6, max = 83.0
Fscore: min = 77.08333333333334, mean = 79.28592084170393, max = 81.37254901960785
Model: Stem-Lemm = Lemm-Stem, Grams = Bi
Accuracy: min = 76.0, mean = 78.05, max = 80.75
Precision: min = 74.76190476190476, mean = 77.61903628372211, max = 81.21827411167513
Recall: min = 76.5, mean = 78.9, max = 80.5
Fscore: min = 76.30922693266832, mean = 78.24431421619461, max = 80.60453400503778
Model: Stem-Lemm = Lemm-Stem, Grams = Tri
Accuracy: min = 66.25, mean = 67.05, max = 68.0
Precision: min = 65.70048309178743, mean = 68.31599702802468, max = 70.45454545454545
Recall: min = 60.5, mean = 63.8, max = 68.0
Fscore: min = 64.19098143236074, mean = 65.92445585701785, max = 67.0103092783505
Model: Stem-Lemm = Lemm-Stem, Grams = Uni-Bi
Accuracy: min = 76.25, mean = 79.05, max = 82.0
Precision: min = 71.96652719665272, mean = 76.1066129746233, max = 79.62962962962963
Recall: min = 82.5, mean = 84.9, max = 87.0
Fscore: min = 78.35990888382688, mean = 80.2254311121068, max = 82.69230769230771
Model: Stem-Lemm = Lemm-Stem, Grams = Uni-Bi-Tri
Accuracy: min = 77.25, mean = 79.35, max = 80.5
Precision: min = 73.5930735930736, mean = 74.88829422088509, max = 75.52742616033755
Recall: min = 85.0, mean = 88.3, max = 91.5
Fscore: min = 78.8863109048724, mean = 81.03566793860703, max = 82.43243243243244

All models tested!
In [13]:
#Save model performance stats to a Pandas dataframe for future analysis
df1 = pd.DataFrame.from_dict(acc_small,orient='columns',).T
df1.columns=['Accuracy','Precision','Recall','Fscore']

#unstack the dataframe in 2 ways for visualization purposes
df_grams = df1.unstack(level=0)
df_stemlemm = df1.unstack(level=-1)
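
For clarity on what the two unstacked frames contain (df1 is indexed by (mode, gram) tuples with one column per metric), a small sketch:

# df_grams:    rows = gram combos, columns = (metric, mode)       -> compare gram combos within each mode
# df_stemlemm: rows = modes,       columns = (metric, gram combo) -> compare modes within each gram combo
print(df_grams['Accuracy'].round(1))  # the gram-combo x mode table used for the bar plots below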
In [14]:
#optional: save data to a file for testing purposes
#df1.to_pickle('accsmall.pkl')

#read the data, if needed
#df1 = pd.read_pickle('accsmall.pkl')
In [15]:
#create 2 bar graphs for each model performance stat - one with modes as the x-variable, one with the grams 
figures = 8
df = df1
fig = plt.figure(figsize=(20,24))
plt.rcParams.update({'font.size': 12})
rows = int(figures/2)
ax = [0]*figures #pre-allocate a list to hold the subplot axis handles

ymin = int(df.min().min()-0.1)
ymax = int(df.max().max())+1

for idx in range(0,figures,2) : #[0,2,4,6]
    this_eval = df.columns[int(idx//2)] #get 'Accuracy','Precision', etc label
    ax[idx] = fig.add_subplot(rows,2,idx+1) #make space for left subplot (df_grams)
    ax[idx+1] = fig.add_subplot(rows,2,idx+2) #make space for right subplot (df_stemlemm)
    df_grams[this_eval].reindex(gram_combos).plot(kind="bar",fontsize=12, ax=ax[idx])
    df_stemlemm[this_eval].reindex(modes).plot(kind='bar',fontsize=12, ax=ax[idx+1])

    plt.setp(ax[idx],ylim=(ymin,ymax),xlabel='',ylabel='Score, in percent',title=this_eval)
    plt.setp(ax[idx+1],ylim=(ymin,ymax),xlabel='',ylabel='Score, in percent',title=this_eval)

    for label in ax[idx].get_xticklabels():
        label.set_rotation(0) 
        
    for label in ax[idx+1].get_xticklabels():
        label.set_rotation(0) 

plt.show()

Conclusion from small-dataset model comparisons:

  1. The trigrams-only models perform poorly on all 4 measures and will not be trained or tested further in their independent form (i.e., unless they are combined with unigrams or bigrams).
  2. The bigrams-only models have slightly lower performance on the key measures of accuracy and F-score and will not be trained or tested further.
  3. The stemming and lemmatizing methods do not show significant improvements over each other in the plots, but the dataset here is small. Therefore, testing should be done on a larger dataset, which I will get from Kaggle’s “Bag of Words Meets Bags of Popcorn” competition.

Larger dataset

To get a larger dataset to train and test on, I downloaded the data for Kaggle’s “Bag of Words Meets Bags of Popcorn” competition.
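
For reproducibility, the TSV files can also be fetched with the kaggle CLI installed in the setup cell (a sketch, assuming API credentials are configured and that word2vec-nlp-tutorial is the competition's slug):

#!kaggle competitions download -c word2vec-nlp-tutorial -p ./Kaggle_BOW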

In [16]:
# read the data-set
train=pd.read_csv('./Kaggle_BOW/labeledTrainData.tsv',sep='\t')
train.head()
Out[16]:
id sentiment review
0 5814_8 1 With all this stuff going down at the moment w…
1 2381_9 1 \The Classic War of the Worlds\” by Timothy Hi…
2 7759_3 0 The film starts with a manager (Nicholas Bell)…
3 3630_4 0 It must be assumed that those who praised this…
4 9495_8 1 Superbly trashy and wondrously unpretentious 8…
In [17]:
labels=np.array(train['sentiment'])

rawX=train['review']
X = [parse_string(review) for review in rawX]
In [18]:
#read test set...
test=pd.read_csv('./Kaggle_BOW/testData.tsv',sep='\t')
test.head()
Out[18]:
id review
0 12311_10 Naturally in a film who’s main themes are of m…
1 8348_2 This movie is a disaster within a disaster fil…
2 5828_4 All in all, this is a movie for kids. We saw i…
3 7186_2 Afraid of the Dark left me with the impression…
4 12128_7 A very accurate depiction of small time mob li…

Training and testing with the Kaggle dataset :

In [19]:
nfolds=5
classes=np.unique(labels)

#Generate one NaiveBayes object at a time and save into COMBO
combo = {} #store objects created in this dictionary
acc_final = {} #store accuracy of each combo in this dictionary
modes = ['None','Stem','Lemm','Lemm-Stem']
gram_combos = ['Uni','Uni-Bi','Uni-Bi-Tri']

# Loop through to test each mode and create the x-gram combos to run
for mode in modes :
    unigrams = [stem_lemm(review,mode) for review in X]
    bigrams = []
    trigrams = []
    uni_bi = []
    uni_bi_tri = []
    
    for unigram in unigrams :
        bigram = gen_bigrams(unigram)
        trigram = gen_trigrams(unigram)
        uni_bi = uni_bi + [unigram + bigram]
        uni_bi_tri = uni_bi_tri + [unigram + bigram + trigram]
        bigrams = bigrams + [bigram]
        trigrams = trigrams + [trigram]

    for processed_reviews, gram_combo in zip([unigrams, uni_bi, uni_bi_tri],gram_combos) :
        folds=t.generate_folds(np.array(processed_reviews),labels,nfolds) # generate folds
        totacc = []
        totpre = []
        totrec = []
        totfsc = []

        for k in range(nfolds):
            traindata,trainlabels,testdata,testlabels=folds[k][0:4] #split data into train and test data
            combo[(mode,gram_combo)] = NaiveBayes(classes)
            combo[(mode,gram_combo)].train(traindata,trainlabels) #train the model
            pclasses = combo[(mode,gram_combo)].test(testdata) #test the model
            acc,pre,rec,fsc = eval_model(pclasses,testlabels) #get model performance stats
            totacc.append(acc)
            totpre.append(pre)
            totrec.append(rec)
            totfsc.append(fsc)
        print('Model: Stem-Lemm = {}, Grams = {}'.format(mode, gram_combo))
        print('Accuracy: min = {}, mean = {}, max = {}'.format(np.min(totacc),np.mean(totacc),np.max(totacc)))
        print('Precision: min = {}, mean = {}, max = {}'.format(np.min(totpre),np.mean(totpre),np.max(totpre)))
        print('Recall: min = {}, mean = {}, max = {}'.format(np.min(totrec),np.mean(totrec),np.max(totrec)))
        print('Fscore: min = {}, mean = {}, max = {}'.format(np.min(totfsc),np.mean(totfsc),np.max(totfsc)))
        acc_final[(mode,gram_combo)]=(np.mean(totacc),np.mean(totpre),np.mean(totrec),np.mean(totfsc))
    print('')
        
print('All models tested!')
Model: Stem-Lemm = None, Grams = Uni
Accuracy: min = 85.1, mean = 85.59200000000001, max = 86.14
Precision: min = 86.35614179719703, mean = 86.67703607414145, max = 87.22702925422332
Recall: min = 83.08, mean = 84.112, max = 85.16
Fscore: min = 84.79281486017554, mean = 85.37427196356006, max = 86.00282771157343
Model: Stem-Lemm = None, Grams = Uni-Bi
Accuracy: min = 85.98, mean = 86.89600000000002, max = 87.62
Precision: min = 86.19718309859155, mean = 87.10401771910237, max = 87.66519823788546
Recall: min = 85.24, mean = 86.61600000000001, max = 87.76
Fscore: min = 85.93781344032097, mean = 86.85699093145129, max = 87.61256754052431
Model: Stem-Lemm = None, Grams = Uni-Bi-Tri
Accuracy: min = 86.0, mean = 86.944, max = 87.72
Precision: min = 86.17363344051446, mean = 86.91919059050642, max = 87.36133122028527
Recall: min = 85.76, mean = 86.976, max = 88.2
Fscore: min = 85.96631916599839, mean = 86.94677051838082, max = 87.77866242038216

Model: Stem-Lemm = Stem, Grams = Uni
Accuracy: min = 84.5, mean = 84.76, max = 85.24
Precision: min = 85.13238289205702, mean = 85.83791924398797, max = 86.56527249683143
Recall: min = 81.96, mean = 83.264, max = 83.92
Fscore: min = 84.19971234846928, mean = 84.52786085082542, max = 85.04256181597081
Model: Stem-Lemm = Stem, Grams = Uni-Bi
Accuracy: min = 85.76, mean = 86.50399999999999, max = 87.28
Precision: min = 85.99283724631914, mean = 86.62751326521803, max = 87.36968724939855
Recall: min = 85.28, mean = 86.33599999999998, max = 87.16
Fscore: min = 85.69131832797427, mean = 86.48087278400516, max = 87.26471766119344
Model: Stem-Lemm = Stem, Grams = Uni-Bi-Tri
Accuracy: min = 85.88, mean = 86.504, max = 87.18
Precision: min = 84.95713172252533, mean = 86.17517952768875, max = 87.19487795118047
Recall: min = 86.76, mean = 86.96799999999999, max = 87.2
Fscore: min = 86.06395578365576, mean = 86.56794092788762, max = 87.17743548709741

Model: Stem-Lemm = Lemm, Grams = Uni
Accuracy: min = 84.82, mean = 85.392, max = 86.04
Precision: min = 85.66161409258501, mean = 86.66859809038024, max = 87.26968174204355
Recall: min = 83.24, mean = 83.65599999999999, max = 84.6
Fscore: min = 84.6387370977535, mean = 85.13425349543881, max = 85.83603896103897
Model: Stem-Lemm = Lemm, Grams = Uni-Bi
Accuracy: min = 86.4, mean = 86.944, max = 87.56
Precision: min = 86.29600626468284, mean = 87.20196334947065, max = 88.2952691680261
Recall: min = 85.28, mean = 86.60799999999999, max = 88.16
Fscore: min = 86.24595469255664, mean = 86.89872799413509, max = 87.43941841680129
Model: Stem-Lemm = Lemm, Grams = Uni-Bi-Tri
Accuracy: min = 86.18, mean = 86.96400000000001, max = 87.9
Precision: min = 86.13663603675589, mean = 86.81272282057373, max = 88.08943089430895
Recall: min = 86.24, mean = 87.176, max = 88.72
Fscore: min = 86.18828702778332, mean = 86.99110701598211, max = 87.99841301329101

Model: Stem-Lemm = Lemm-Stem, Grams = Uni
Accuracy: min = 84.16, mean = 84.73999999999998, max = 85.5
Precision: min = 85.16235100698727, mean = 85.79129997614173, max = 86.47759967118783
Recall: min = 81.84, mean = 83.272, max = 84.28
Fscore: min = 83.7837837837838, mean = 84.51070372261967, max = 85.30306101763632
Model: Stem-Lemm = Lemm-Stem, Grams = Uni-Bi
Accuracy: min = 85.38, mean = 86.24, max = 87.08
Precision: min = 85.14104092173223, mean = 86.29342217205604, max = 86.72741679873216
Recall: min = 84.12, mean = 86.16799999999999, max = 87.56
Fscore: min = 85.26251773768497, mean = 86.22686018943062, max = 87.14171974522291
Model: Stem-Lemm = Lemm-Stem, Grams = Uni-Bi-Tri
Accuracy: min = 85.6, mean = 86.444, max = 87.52
Precision: min = 84.94163424124514, mean = 86.06595568153185, max = 87.52
Recall: min = 86.04, mean = 86.984, max = 87.6
Fscore: min = 85.66308243727597, mean = 86.5187288673859, max = 87.52

All models tested!
In [20]:
#Convert the model performance stats into a Pandas dataframe for visualization
df2 = pd.DataFrame.from_dict(acc_final,orient='columns',).T
df2.columns=['Accuracy','Precision','Recall','Fscore']

#Unstack the dataframe for two different plot types
df_grams = df2.unstack(level=0)
df_stemlemm = df2.unstack(level=-1)
In [21]:
#optional: save data to a file for testing purposes
#df2.to_pickle('accfinal.pkl')

#read the data, if needed
#df2 = pd.read_pickle('accfinal.pkl')
In [22]:
#create 2 bar graphs for each model performance stat - one with modes as the x-variable, one with the grams 
figures = 8
df = df2
fig = plt.figure(figsize=(20,figures*3))
plt.rcParams.update({'font.size': 12})
rows = int(figures/2)
ax = [0]*figures #pre-allocate a list to hold the subplot axis handles

ymin = int(df.min().min()-0.1)
ymax = int(df.max().max())+1

for idx in range(0,figures,2) : #[0,2,4,6]
    this_eval = df.columns[int(idx//2)] #get 'Accuracy','Precision', etc label
    ax[idx] = fig.add_subplot(rows,2,idx+1) #make space for left subplot (df_grams)
    ax[idx+1] = fig.add_subplot(rows,2,idx+2) #make space for right subplot (df_stemlemm)
    df_grams[this_eval].reindex(gram_combos).plot(kind="bar",fontsize=12, ax=ax[idx])
    df_stemlemm[this_eval].reindex(modes).plot(kind='bar',fontsize=12, ax=ax[idx+1])

    plt.setp(ax[idx],ylim=(ymin,ymax),xlabel='',ylabel='Score, in percent',title=this_eval)
    plt.setp(ax[idx+1],ylim=(ymin,ymax),xlabel='',ylabel='Score, in percent',title=this_eval)

    for label in ax[idx].get_xticklabels():
        label.set_rotation(0) 
        
    for label in ax[idx+1].get_xticklabels():
        label.set_rotation(0) 

plt.show()

Conclusion from model comparisons on larger dataset:

  1. Unigrams consistently underperform the combinations with bigrams and trigrams. Hence, they will not be used further.
  2. The combination lemmatized-stemmed models offer little to no gains while having the highest cost in terms of processing time and power. Hence, this combination will not be used further.
In [23]:
#Convert the model performance stats into a Pandas dataframe for visualization
df_cleaned = df.drop([('None','Uni'),('Stem','Uni'),('Lemm','Uni'),
                      ('Lemm-Stem','Uni'),('Lemm-Stem','Uni-Bi'),('Lemm-Stem','Uni-Bi-Tri')])

df_grams = df_cleaned.unstack(level=0)
df_stemlemm = df_cleaned.unstack(level=-1)
In [24]:
#create 2 bar graphs for each model performance stat - one with modes as the x-variable, one with the grams 
modes = ['None','Stem','Lemm']
gram_combos = ['Uni-Bi','Uni-Bi-Tri']
df = df_cleaned
figures = 8

fig = plt.figure(figsize=(20,figures*3))
plt.rcParams.update({'font.size': 12})
rows = int(figures/2)
ax = [0]*figures #pre-allocate a list to hold the subplot axis handles

ymin = int(df.min().min()-0.1)
ymax = int(df.max().max())+1

for idx in range(0,figures,2) : #[0,2,4,6]
    this_eval = df.columns[int(idx//2)] #get 'Accuracy','Precision', etc label
    ax[idx] = fig.add_subplot(rows,2,idx+1) #make space for left subplot (df_grams)
    ax[idx+1] = fig.add_subplot(rows,2,idx+2) #make space for right subplot (df_stemlemm)
    df_grams[this_eval].reindex(gram_combos).plot(kind="bar",fontsize=12, ax=ax[idx])
    df_stemlemm[this_eval].reindex(modes).plot(kind='bar',fontsize=12, ax=ax[idx+1])

    plt.setp(ax[idx],ylim=(ymin,ymax),xlabel='',ylabel='Score, in percent',title=this_eval)
    plt.setp(ax[idx+1],ylim=(ymin,ymax),xlabel='',ylabel='Score, in percent',title=this_eval)

    for label in ax[idx].get_xticklabels():
        label.set_rotation(0) 
        
    for label in ax[idx+1].get_xticklabels():
        label.set_rotation(0) 

plt.show()

These results all look very similar in performance. Looking more closely, though:

  1. In terms of accuracy, the model with no stemming or lemmatizing performed best, by a small margin.
  2. In terms of precision, the models without stemming or lemmatizing performed best, and the unigram-bigram-trigram combination also showed the best results.
  3. In terms of recall, the unigram-bigram-trigram combination performed best.
  4. Since we have no particular bias toward positive or negative outcomes, neither recall nor precision is more important than the other, so these results are simply notes of interest. If, for example, we were mainly interested in identifying only those who wrote positive reviews, that weighting would change.
  5. Overall, in terms of F-score, the model without any stemming or lemmatizing did slightly better than the others, but the differences are close to negligible.
  6. Given that accuracy is the priority here, the low-cost approach of skipping stemming and lemmatizing will be used, along with the unigram-bigram-trigram combination.

In summary:

  1. The models with mode “None” that used neither lemmatizing nor stemming performed best, so this will be used in the final model.
  2. The models that combined unigrams, bigrams, and trigrams performed best, so this will be used in the final model.

Finally, let’s train on the complete dataset and test on Kaggle’s competition test set:

In [25]:
classes=np.unique(labels)
print ('Training a Classifier on Full training set with classes =', classes)

mode = 'None'
gram_combo = 'Uni-Bi-Tri'
combo = {}

unigrams = X #no stemming or lemmatizing
bigrams = []
trigrams = []
uni_bi_tri = []
    
for unigram in unigrams :
    bigram = gen_bigrams(unigram)
    trigram = gen_trigrams(unigram)
    uni_bi_tri = uni_bi_tri + [unigram + bigram + trigram]
    bigrams = bigrams + [bigram]
    trigrams = trigrams + [trigram]

traindata,trainlabels,testdata,testlabels=t.split_data(np.array(uni_bi_tri),labels)

combo[(mode,gram_combo)] = NaiveBayes(classes)
combo[(mode,gram_combo)].train(traindata,trainlabels)
        
print('The model is trained!')
Training a Classifier on Full training set with classes = [0 1]
The model is trained!
In [26]:
#prep the test data
Xtest=test['review']
unigrams = [parse_string(string) for string in Xtest]
bigrams = []
trigrams = []
uni_bi_tri = []
    
for unigram in unigrams :
    bigram = gen_bigrams(unigram)
    trigram = gen_trigrams(unigram)
    uni_bi_tri = uni_bi_tri + [unigram + bigram + trigram]
    bigrams = bigrams + [bigram]
    trigrams = trigrams + [trigram]
    
Xtest=np.array(uni_bi_tri)

#test the classifier on the provided test set...
pclasses=combo[(mode,gram_combo)].test(Xtest)
print ("done")
done
In [27]:
#write the result in Kaggle's required format
output = pd.DataFrame( data={"id":test["id"], "sentiment":pclasses} )

# Use Pandas to write the comma-separated output file
output.to_csv( "Naive_Bayes_Bag_of_Words_model.csv", index=False, quoting=3 )

Finally, upload the prediction to Kaggle:

Now I will upload the result to Kaggle to see my ranking and score. I was told that, using Naive Bayes, I could expect an accuracy of around 0.80960.
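
A sketch of the upload step, assuming the kaggle CLI is configured and that word2vec-nlp-tutorial is the competition slug:

#!kaggle competitions submit -c word2vec-nlp-tutorial -f Naive_Bayes_Bag_of_Words_model.csv -m "Naive Bayes, Uni-Bi-Tri, no stemming"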

It seems that I did well!

This was fun, and I learned a lot. Thanks for reading!

[Image: Bag of Popcorn results.png]
