Text Classification (Sentiment Analysis) Using the Naive Bayes Rule¶
An exploratory project for classifying IMDB reviews, in the style of Kaggle's "Bag of Words Meets Bags of Popcorn" competition¶
Objective:¶
As an introductory project, I wanted to write a Naive Bayes classifier by hand and explore different model variations to see how they perform. I therefore wrote this code using minimalist libraries (like numpy), then directly tested and visualized the results. I also iteratively refined this notebook to make it as efficient and modular as I could in the couple of days I had.
Goal:¶
We need a program that takes full-text reviews and classifies them as either positive or negative reviews of a movie, business, etc. I will design and train a model that pulls meaningful words from the reviews and uses them to determine the polarity (positive or negative) of each review. For training, I have data that include the correct polarity, derived from the star rating each reviewer submitted with their text review.
Methods:¶
As an introductory project, I will create a Naive Bayes multinomial classifier using the bag of words model. I will not, however, use sklearn or other such libraries that do the Bayes classification for me. Instead, I'll use numpy and basic accessory libraries like NLTK for stemming and lemmatizing to try to improve results.
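Concretely, the classifier will pick the class $\hat{c}$ that maximizes the log-posterior of a review's bag of words under the naive independence assumption, using add-one (Laplace) smoothing of the word counts:
$$\hat{c} = \underset{c \in \{\text{pos},\,\text{neg}\}}{\arg\max} \left[ \log P(c) + \sum_{w \in \text{review}} \log \frac{\text{count}(w, c) + 1}{N_c + |V|} \right]$$
where $N_c$ is the total number of (non-stop) words seen in class $c$ and $|V|$ is the vocabulary size.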
Desired comparisons:¶
- Unigrams, Bigrams, and/or Trigrams vs. model performance
- Stemming and/or Lemmatizing vs. model performance
- Model accuracy, precision, recall, and F-score
Evaluation:¶
Once complete, I will run it on the full dataset from the Kaggle competition to test the methods on a large dataset and to see how well it performs against the existing leaderboard. (NOTE: the competition has ended so the leaderboard is frozen and will not update with new submissions.)
Setup: installations and imports¶
#-------------------------------------------------------------------------------------
# One-time runs - do not uncomment unless needed to install on new device
#-------------------------------------------------------------------------------------
#!pip install stop-words #Used to compare to the stopwords supplied with the IMDB data
#import nltk
#nltk.download()
#!pip install kaggle
%load_ext autoreload
%autoreload 2
import tools as t #tools.py file included with this data
import numpy as np
import string #using punctuation
import pandas as pd
from math import log
import matplotlib.pyplot as plt
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import PorterStemmer
from nltk.stem import LancasterStemmer
stopwords=set(t.read_txt_file('./data/english.stop'))
print("complete")
Main functions and classes¶
Process for each IMDB review:¶
1. Parse text, remove punctuation, and tokenize to unigram words
2. Save unigram words to a global variable for that dataset
3. Generate new unigram matrices with each stemming and lemmatizing combination desired (4 combinations)
4. Generate bigrams and trigrams and combinations thereof from each unigram matrix created in step 3 (5 combinations)
5. Train for each combination, saving self.prob for later use if that model is selected (4×5 = 20 combinations)
6. Test each combination and calculate accuracy, precision, recall, and F-score for each
7. Plot each comparison as desired to evaluate results
8. Choose the best model for final use
Key Data:¶
Global variables:¶
X: (matrix) bag of words (uni, bi, and/or tri) for each movie review, one review per row
Y: (vector) known review classifications (only available for training/test data)
combo: (dict) NaiveBayes objects created for each combination of stemming/lemmatizing and uni/bi/trigrams for model testing
acc_small: (dict) model accuracy, precision, recall, and F-score values from small-dataset testing
acc_final: (dict) model accuracy, precision, recall, and F-score values from Kaggle-dataset testing
NaiveBayes class variables:¶
self.classes: (list) all classes (positive/negative, 1/0, etc) for the data in Y
self.prior: (dict) for this dataset, the probability of each class being the result
self.prob: (dict) for this dataset, the log-probability of each word appearing in each class
Functions:¶
- files_to_strings – pass in a list of all files to be read, return list of strings containing text of review
- parse_file – pass in a filename, return the string of raw text in it (unprocessed)
- parse_string – pass in a string, return a bag of words (unigrams)
- stem_lemm – pass in a bag of unigram words, return a bag of unigram words processed based on desired combo
- gen_bigrams – pass in a bag of unigram words, return a bag of bigram words
- gen_trigrams – pass in a bag of unigram words, return a bag of trigram words
def files_to_strings(filenames):
    """ Read an array of files, loading each file's content into a string
    Input:
    -------
    filenames: an array of file names
    Returns:
    --------
    A list with each row containing the string read from one file"""
    return [parse_file(file) for file in filenames]
def parse_file(filename): # Parse a given file
    """ Parse the input file:
    Parameters:
    ----------
    filename: name of PLAIN TEXT file to be read (i.e., not CSV or TSV)
    Returns:
    ---------
    read file as raw string (with \n, \t, \r, etc. included)"""
    #filename arrives as a one-element array (a row of the [m x 1] file matrix), hence the [0]
    with open(filename[0]) as file:
        return file.read()
def parse_string(string1): # Parse a read string into respective tokens
    """ Parse the input string and tokenize it into words:
    Parameters:
    ----------
    string1: string to be parsed
    Returns:
    ---------
    list of tokens (unigrams) extracted from the string"""
    #Remove all punctuation
    for c in string.punctuation:
        string1 = string1.replace(c, "")
    #Lowercase and split, then drop stopwords and any token that doesn't start with a letter
    return [a for a in string1.lower().split() if a[0] in 'abcdefghijklmnopqrstuvwxyz' and a not in stopwords]
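As a quick sanity check, parse_string can be run on a sample sentence (the exact output depends on the stopword list loaded above; the expected tokens below assume common words like "this", "was", and "a" are in it):
#Hypothetical sanity check for parse_string
sample = "This movie was BRILLIANT, a great film!"
print(parse_string(sample)) #expected (with typical stopwords): ['movie', 'brilliant', 'great', 'film']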
def stem_lemm(bow, mode) : #bow = bag of words
""" Apply the desired stemming or lemmatizing methods to the bag of unigram words:
Parameters:
----------
bow: bag of unigram words
mode: type of stemming or lemmatizing method
Returns:
---------
list with processed bag of unigram words"""
#Create instance of stemmer and lemmatizer
ls = LancasterStemmer()
lz = WordNetLemmatizer()
    #mode selects whether the stemmer, the lemmatizer, both, or neither is applied
    if mode == 'None' :
        return bow
    elif mode == 'Stem' :
        return [ls.stem(word) for word in bow]
    elif mode == 'Lemm' :
        return [lz.lemmatize(word) for word in bow]
    elif mode == 'Lemm-Stem' :
        return [ls.stem(lz.lemmatize(word)) for word in bow]
    else :
        raise ValueError('Unknown stem_lemm mode: {}'.format(mode)) #fail loudly rather than silently returning None
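For illustration, the modes can be compared on a small bag of words (the exact stems depend on NLTK's Lancaster and WordNet rules, so the outputs noted in the comments are indicative, not guaranteed):
#Hypothetical illustration of the stem/lemm modes
words = ['movies', 'running', 'better']
print(stem_lemm(words, 'Stem')) #Lancaster stemming is aggressive, e.g. 'running' -> 'run'
print(stem_lemm(words, 'Lemm')) #WordNet lemmatizing is conservative, e.g. 'movies' -> 'movie'
print(stem_lemm(words, 'Lemm-Stem')) #lemmatize first, then stem the result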
def gen_bigrams(bow) :
""" Create bag of bigram words from unigrams:
Parameters:
----------
bow: bag of unigram words
Returns:
---------
list with bag of bigram words"""
return [' '.join([bow[a],bow[a+1]]) for a in range(len(bow)-1)]
def gen_trigrams(bow) :
""" Create bag of trigram words from unigrams:
Parameters:
----------
bow: bag of unigram words
Returns:
---------
list with bag of trigram words"""
return [' '.join([bow[a],bow[a+1],bow[a+2]]) for a in range(len(bow)-2)]
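Both helpers simply slide a window over the unigram list, for example:
#Small deterministic example of the n-gram helpers
print(gen_bigrams(['a', 'b', 'c', 'd'])) #['a b', 'b c', 'c d']
print(gen_trigrams(['a', 'b', 'c', 'd'])) #['a b c', 'b c d']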
def eval_model(pred,ans):
""" calculate the model performance stats for this model:
Parameters:
----------
pred: list of class predictions from the model
ans: actual classes
Returns:
---------
    Model performance stats: accuracy, precision, recall, and F-score, each in percent"""
    #NOTE: relies on the global `classes` array; classes[1] is treated as the positive class
    pred = [1 if a == classes[1] else 0 for a in pred]
    ans = [1 if a == classes[1] else 0 for a in ans]
pred = np.array(pred,dtype=bool)
ans = np.array(ans,dtype=bool)
TP = np.sum((pred & ans)) #Predicted positive, was positive
FN = np.sum((np.invert(pred) & ans)) #Predicted negative, was positive
FP = np.sum((pred & np.invert(ans))) #Predicted positive, was negative
TN = np.sum((np.invert(pred) & np.invert(ans))) #Predicted negative, was negative
Accuracy = 100*(TP+TN)/len(pred)
Precision = 100*TP/(TP+FP)
Recall = 100*TP/(TP+FN)
Fscore = (2*Precision*Recall)/(Precision + Recall)
return Accuracy, Precision, Recall, Fscore
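As a worked example of these formulas, with hypothetical counts TP = 40, FP = 10, FN = 20, and TN = 30: accuracy = (40+30)/100 = 70%, precision = 40/50 = 80%, recall = 40/60 ≈ 66.7%, and F-score = 2·80·66.7/(80+66.7) ≈ 72.7.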
class NaiveBayes:
''' Implements the Naive Bayes For Text Classification '''
def __init__(self, classes):
self.classes=classes
        self.prior = {} #dictionary mapping each class to its prior probability
        self.prob = {} #dictionary of log-probabilities for each class as prob[pos/neg][word] = log probability
def train(self, X, Y):
''' Train the multiclass (or Binary) Bayes Rule using the given
X [m x d] data matrix and Y labels matrix
Input:
------
X: [m x d] a data matrix of m d-dimensional examples.
Y: [m x 1] a label vector.
Returns:
-----------
Nothing'''
#Set classes:
neg = self.classes[0]
pos = self.classes[1]
#Boolean arrays for positive and negative:
pos_bool = Y==pos
neg_bool = Y==neg
#Prior probability:
prob_pos = sum(pos_bool) / len(Y)
prob_neg = sum(neg_bool) / len(Y)
#Word probabilities:
#Generate lists of all words in each class (positive class and negative class)
X_pos = []
X_neg = []
        for idx in range(len(X)):
if Y[idx]==pos : X_pos.extend(X[idx])
elif Y[idx]==neg : X_neg.extend(X[idx])
else : print("NON-CLASS ENTRY")
#V = Vocabulary = all non-stop words in all IMDB reviews (positive or negative)
V = sorted(set(X_pos+X_neg)) #sorted isn't necessary, but it helps for viewing AND converts to a list
V_count = len(V)
#Get the words and counts for each non-stop word in positive and then negative reviews
#NOTE: adding V into the list means EVERY word's count (+ 1) appears in both class lists
X_pos_unique, X_pos_counts = np.unique(V+X_pos, return_counts=True)
X_neg_unique, X_neg_counts = np.unique(V+X_neg, return_counts=True)
#Get the total number of considered words (i.e., not stop words) in each class
X_pos_all_count = sum(X_pos_counts)
X_neg_all_count = sum(X_neg_counts)
        #P(word|class) = [count of word in class + 1] / [(count of all words in class) + (count of all words in V)]
        #NOTE: both the +1 and the +V are taken care of by "V+..." in the np.unique step above
        #NOTE: the natural log of each probability is taken to avoid floating-point underflow when many small probabilities are multiplied
        word_given_pos_prob = dict( zip(X_pos_unique, np.log(X_pos_counts/X_pos_all_count)) )
        word_given_neg_prob = dict( zip(X_neg_unique, np.log(X_neg_counts/X_neg_all_count)) )
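        #Worked (hypothetical) example: if V = ['bad','good'] and X_pos = ['good','good'],
        #np.unique(V+X_pos) yields counts {'bad': 1, 'good': 3}, and the denominator
        #sum(X_pos_counts) = 4 = (2 words in pos) + (2 words in V): add-one smoothing exactly.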
#Save results to class variables
self.prior = {'pos':prob_pos, 'neg':prob_neg}
self.prob = {'pos':word_given_pos_prob, 'neg':word_given_neg_prob}
return None
def test(self, X):
''' Test the trained classifiers on the given set of examples
Input:
------
X: [m x d] a data matrix of m d-dimensional test examples.
Returns:
-----------
pclass: the predicted class for each example, i.e. to which it belongs'''
neg = self.classes[0]
pos = self.classes[1]
pclass = []
        for idx in range(len(X)) :
            #NOTE: both prob dicts were built over the same vocabulary V, so the default
            #value for unseen words (1 here) is added to both classes equally and cannot
            #flip the comparison; 0 would be the more conventional default for a log-probability
            prob_review_pos = log(self.prior['pos']) + sum([self.prob['pos'].get(word,1) for word in X[idx]])
            prob_review_neg = log(self.prior['neg']) + sum([self.prob['neg'].get(word,1) for word in X[idx]])
            if prob_review_pos > prob_review_neg : pclass.append(pos)
            else : pclass.append(neg)
return pclass
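To make the train/test API concrete before running it on real reviews, here is a minimal toy run on hypothetical data:
#Minimal toy example of the NaiveBayes API (hypothetical data)
toy_X = [['good', 'fun'], ['bad', 'boring'], ['good', 'great'], ['bad', 'awful']]
toy_Y = np.array(['pos', 'neg', 'pos', 'neg'])
toy_nb = NaiveBayes(['neg', 'pos']) #classes[0] = negative, classes[1] = positive
toy_nb.train(toy_X, toy_Y)
print(toy_nb.test([['good', 'great', 'fun'], ['awful', 'boring']])) #expected: ['pos', 'neg']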
Load small-dataset training and testing data:¶
#load data, get list of files for each class...
tdir='./data/imdb1/' # training dir...
posfiles=t.get_files(tdir+'/pos','*',withpath=True)
negfiles=t.get_files(tdir+'/neg','*',withpath=True)
#generate training and testing data...
labels = np.concatenate((['pos']*len(posfiles),['neg']*len(negfiles))) # concatenate the +ve and -ve labels
tX=np.concatenate((posfiles,negfiles)).reshape((len(posfiles)+len(negfiles),1))
print("Training data Dimensions =", tX.shape," Training labels dimensions=", labels.shape)
#It would be more efficient to call parse_string inside files_to_strings.
#However, the Kaggle data used later doesn't go through files_to_strings,
#so parse_string is kept as a separate step that both pipelines can share.
X = files_to_strings(tX) # read files and convert each file into a string, saved in each row of a list
X = [parse_string(review) for review in X] #parse each review to generate unigrams
Comparative test runs with small dataset:¶
#Generate one NaiveBayes object at a time and save into COMBO
#Each object will train and test a model that uses a mode and gram combination as input to the Bayes process
nfolds=5
classes=np.unique(labels)
combo = {} #store objects created in this dictionary
acc_small = {} #store accuracy of each combo in this dictionary
modes = ['None','Stem','Lemm','Lemm-Stem']
gram_combos = ['Uni','Bi','Tri','Uni-Bi','Uni-Bi-Tri']
# Loop through to test each mode and create the x-gram combos to run
for mode in modes :
unigrams = [stem_lemm(review,mode) for review in X]
bigrams = []
trigrams = []
uni_bi = []
uni_bi_tri = []
    for unigram in unigrams :
        bigram = gen_bigrams(unigram)
        trigram = gen_trigrams(unigram)
        uni_bi.append(unigram + bigram)
        uni_bi_tri.append(unigram + bigram + trigram)
        bigrams.append(bigram)
        trigrams.append(trigram)
for processed_reviews, gram_combo in zip([unigrams, bigrams, trigrams, uni_bi, uni_bi_tri],gram_combos) :
folds=t.generate_folds(np.array(processed_reviews),labels,nfolds) # generate folds
totacc = []
totpre = []
totrec = []
totfsc = []
for k in range(nfolds):
traindata,trainlabels,testdata,testlabels=folds[k][0:4]
            combo[(mode,gram_combo)] = NaiveBayes(classes) #NOTE: re-created each fold, so only the last fold's model remains in combo
combo[(mode,gram_combo)].train(traindata,trainlabels)
pclasses = combo[(mode,gram_combo)].test(testdata)
acc,pre,rec,fsc = eval_model(pclasses,testlabels)
totacc.append(acc)
totpre.append(pre)
totrec.append(rec)
totfsc.append(fsc)
print('Model: Stem-Lemm = {}, Grams = {}'.format(mode, gram_combo))
print('Accuracy: min = {}, mean = {}, max = {}'.format(np.min(totacc),np.mean(totacc),np.max(totacc)))
print('Precision: min = {}, mean = {}, max = {}'.format(np.min(totpre),np.mean(totpre),np.max(totpre)))
print('Recall: min = {}, mean = {}, max = {}'.format(np.min(totrec),np.mean(totrec),np.max(totrec)))
print('Fscore: min = {}, mean = {}, max = {}'.format(np.min(totfsc),np.mean(totfsc),np.max(totfsc)))
acc_small[(mode,gram_combo)]=(np.mean(totacc),np.mean(totpre),np.mean(totrec),np.mean(totfsc))
print('') #space between each series
print('All models tested!')
#Save model performance stats to a Pandas dataframe for future analysis
df1 = pd.DataFrame.from_dict(acc_small,orient='columns').T
df1.columns=['Accuracy','Precision','Recall','Fscore']
#unstack the dataframe in 2 ways for visualization purposes
df_grams = df1.unstack(level=0)
df_stemlemm = df1.unstack(level=-1)
#optional: save data to a file for testing purposes
#df1.to_pickle('accsmall.pkl')
#read the data, if needed
#df1 = pd.read_pickle('accsmall.pkl')
#create 2 bar graphs for each model performance stat - one with modes as the x-variable, one with the grams
figures = 8
df = df1
fig = plt.figure(figsize=(20,24))
plt.rcParams.update({'font.size': 12})
rows = int(figures/2)
ax = [0]*figures #pre-allocate a list to hold the axes handle for each subplot
ymin = int(df.min().min()-0.1)
ymax = int(df.max().max())+1
for idx in range(0,figures,2) : #[0,2,4,6]
this_eval = df.columns[int(idx//2)] #get 'Accuracy','Precision', etc label
ax[idx] = fig.add_subplot(rows,2,idx+1) #make space for left subplot (df_grams)
ax[idx+1] = fig.add_subplot(rows,2,idx+2) #make space for right subplot (df_stemlemm)
df_grams[this_eval].reindex(gram_combos).plot(kind="bar",fontsize=12, ax=ax[idx])
df_stemlemm[this_eval].reindex(modes).plot(kind='bar',fontsize=12, ax=ax[idx+1])
plt.setp(ax[idx],ylim=(ymin,ymax),xlabel='',ylabel='Score, in percent',title=this_eval)
plt.setp(ax[idx+1],ylim=(ymin,ymax),xlabel='',ylabel='Score, in percent',title=this_eval)
for label in ax[idx].get_xticklabels():
label.set_rotation(0)
for label in ax[idx+1].get_xticklabels():
label.set_rotation(0)
plt.show()
Conclusion from small-dataset model comparisons:¶
- The trigrams-only models perform poorly on all 4 measures and will not be trained or tested further in their standalone form (i.e., unless combined with unigrams or bigrams).
- The bigrams-only models have slightly lower performance on the key measures of accuracy and F-score and will not be trained or tested further.
- The stemming and lemmatizing methods do not show significant improvements over each other in the plots, but this dataset is small. Testing should therefore be repeated on a larger dataset, which I will get from Kaggle's Bag of Words Meets Bags of Popcorn competition.
Larger dataset¶
To get a larger dataset to train and test on, I downloaded the data for the "Bag of Words Meets Bags of Popcorn" competition.
# read the data-set
train=pd.read_csv('./Kaggle_BOW/labeledTrainData.tsv',sep='\t')
train.head()
labels=np.array(train['sentiment'])
rawX=train['review']
X = [parse_string(review) for review in rawX]
#read test set...
test=pd.read_csv('./Kaggle_BOW/testData.tsv',sep='\t')
test.head()
Training and testing with the Kaggle dataset :¶
nfolds=5
classes=np.unique(labels)
#Generate one NaiveBayes object at a time and save into COMBO
combo = {} #store objects created in this dictionary
acc_final = {} #store accuracy of each combo in this dictionary
modes = ['None','Stem','Lemm','Lemm-Stem']
gram_combos = ['Uni','Uni-Bi','Uni-Bi-Tri']
# Loop through to test each mode and create the x-gram combos to run
for mode in modes :
unigrams = [stem_lemm(review,mode) for review in X]
    #NOTE: standalone bigram and trigram models were ruled out by the small-dataset tests,
    #so only the uni-bi and uni-bi-tri combinations need to be accumulated here
    uni_bi = []
    uni_bi_tri = []
    for unigram in unigrams :
        bigram = gen_bigrams(unigram)
        trigram = gen_trigrams(unigram)
        uni_bi.append(unigram + bigram)
        uni_bi_tri.append(unigram + bigram + trigram)
for processed_reviews, gram_combo in zip([unigrams, uni_bi, uni_bi_tri],gram_combos) :
folds=t.generate_folds(np.array(processed_reviews),labels,nfolds) # generate folds
totacc = []
totpre = []
totrec = []
totfsc = []
for k in range(nfolds):
traindata,trainlabels,testdata,testlabels=folds[k][0:4] #split data into train and test data
combo[(mode,gram_combo)] = NaiveBayes(classes)
combo[(mode,gram_combo)].train(traindata,trainlabels) #train the model
pclasses = combo[(mode,gram_combo)].test(testdata) #test the model
acc,pre,rec,fsc = eval_model(pclasses,testlabels) #get model performance stats
totacc.append(acc)
totpre.append(pre)
totrec.append(rec)
totfsc.append(fsc)
print('Model: Stem-Lemm = {}, Grams = {}'.format(mode, gram_combo))
print('Accuracy: min = {}, mean = {}, max = {}'.format(np.min(totacc),np.mean(totacc),np.max(totacc)))
print('Precision: min = {}, mean = {}, max = {}'.format(np.min(totpre),np.mean(totpre),np.max(totpre)))
print('Recall: min = {}, mean = {}, max = {}'.format(np.min(totrec),np.mean(totrec),np.max(totrec)))
print('Fscore: min = {}, mean = {}, max = {}'.format(np.min(totfsc),np.mean(totfsc),np.max(totfsc)))
acc_final[(mode,gram_combo)]=(np.mean(totacc),np.mean(totpre),np.mean(totrec),np.mean(totfsc))
print('')
print('All models tested!')
#Convert the model performance stats into a Pandas dataframe for visualization
df2 = pd.DataFrame.from_dict(acc_final,orient='columns').T
df2.columns=['Accuracy','Precision','Recall','Fscore']
#Unstack the dataframe for two different plot types
df_grams = df2.unstack(level=0)
df_stemlemm = df2.unstack(level=-1)
#optional: save data to a file for testing purposes
#df2.to_pickle('accfinal.pkl')
#read the data, if needed
#df2 = pd.read_pickle('accfinal.pkl')
#create 2 bar graphs for each model performance stat - one with modes as the x-variable, one with the grams
figures = 8
df = df2
fig = plt.figure(figsize=(20,figures*3))
plt.rcParams.update({'font.size': 12})
rows = int(figures/2)
ax = [0]*figures #pre-allocate a list to hold the axes handle for each subplot
ymin = int(df.min().min()-0.1)
ymax = int(df.max().max())+1
for idx in range(0,figures,2) : #[0,2,4,6]
this_eval = df.columns[int(idx//2)] #get 'Accuracy','Precision', etc label
ax[idx] = fig.add_subplot(rows,2,idx+1) #make space for left subplot (df_grams)
ax[idx+1] = fig.add_subplot(rows,2,idx+2) #make space for right subplot (df_stemlemm)
df_grams[this_eval].reindex(gram_combos).plot(kind="bar",fontsize=12, ax=ax[idx])
df_stemlemm[this_eval].reindex(modes).plot(kind='bar',fontsize=12, ax=ax[idx+1])
plt.setp(ax[idx],ylim=(ymin,ymax),xlabel='',ylabel='Score, in percent',title=this_eval)
plt.setp(ax[idx+1],ylim=(ymin,ymax),xlabel='',ylabel='Score, in percent',title=this_eval)
for label in ax[idx].get_xticklabels():
label.set_rotation(0)
for label in ax[idx+1].get_xticklabels():
label.set_rotation(0)
plt.show()
Conclusion from model comparisons on larger dataset:¶
- Unigrams consistently underperform the combinations with bigrams and trigrams. Hence, they will not be used further.
- The combination lemmatized-stemmed models offer little to no gains while having the highest cost in terms of processing time and power. Hence, this combination will not be used further.
#Drop the under-performing model variants from the dataframe before re-plotting
df_cleaned = df.drop([('None','Uni'),('Stem','Uni'),('Lemm','Uni'),
('Lemm-Stem','Uni'),('Lemm-Stem','Uni-Bi'),('Lemm-Stem','Uni-Bi-Tri')])
df_grams = df_cleaned.unstack(level=0)
df_stemlemm = df_cleaned.unstack(level=-1)
#create 2 bar graphs for each model performance stat - one with modes as the x-variable, one with the grams
modes = ['None','Stem','Lemm']
gram_combos = ['Uni-Bi','Uni-Bi-Tri']
df = df_cleaned
figures = 8
fig = plt.figure(figsize=(20,figures*3))
plt.rcParams.update({'font.size': 12})
rows = int(figures/2)
ax = [0]*figures #pre-allocate a list to hold the axes handle for each subplot
ymin = int(df.min().min()-0.1)
ymax = int(df.max().max())+1
for idx in range(0,figures,2) : #[0,2,4,6]
this_eval = df.columns[int(idx//2)] #get 'Accuracy','Precision', etc label
ax[idx] = fig.add_subplot(rows,2,idx+1) #make space for left subplot (df_grams)
ax[idx+1] = fig.add_subplot(rows,2,idx+2) #make space for right subplot (df_stemlemm)
df_grams[this_eval].reindex(gram_combos).plot(kind="bar",fontsize=12, ax=ax[idx])
df_stemlemm[this_eval].reindex(modes).plot(kind='bar',fontsize=12, ax=ax[idx+1])
plt.setp(ax[idx],ylim=(ymin,ymax),xlabel='',ylabel='Score, in percent',title=this_eval)
plt.setp(ax[idx+1],ylim=(ymin,ymax),xlabel='',ylabel='Score, in percent',title=this_eval)
for label in ax[idx].get_xticklabels():
label.set_rotation(0)
for label in ax[idx+1].get_xticklabels():
label.set_rotation(0)
plt.show()
These results all look very similar. Looking closely, though:
- In terms of accuracy, the model with no stemming or lemmatizing performed best, by a small margin.
- In terms of precision, no stemming or lemmatizing again performed best, as did the unigram-bigram-trigram combination.
- In terms of recall, the unigram-bigram-trigram combination performed best.
- Since we have no particular bias toward positive or negative outcomes, neither recall nor precision matters more than the other here, so these results are simply notes of interest. If, for example, we cared mainly about identifying only those who wrote positive reviews, that balance would change.
- Overall, in terms of F-score, the model without any stemming or lemmatizing did slightly better than the others, though it is close to equivalent.
- Given the priority of accuracy over the other metrics, the final model will use the low-cost approach of no stemming or lemmatizing, along with the unigram-bigram-trigram combination.
In summary:
- The models with mode “None” that used neither lemmatizing nor stemming performed best, so this will be used in the final model.
- The models that combined unigrams, bigrams, and trigrams performed best, so this will be used in the final model.
Finally, let’s train on the complete dataset and test on Kaggle’s competition test set:¶
classes=np.unique(labels)
print ('Training a Classifier on Full training set with classes =', classes)
mode = 'None'
gram_combo = 'Uni-Bi-Tri'
combo = {}
unigrams = X #no stemming or lemmatizing
uni_bi_tri = [] #only the chosen unigram-bigram-trigram combination is needed here
for unigram in unigrams :
    bigram = gen_bigrams(unigram)
    trigram = gen_trigrams(unigram)
    uni_bi_tri.append(unigram + bigram + trigram)
traindata,trainlabels,testdata,testlabels=t.split_data(np.array(uni_bi_tri),labels)
combo[(mode,gram_combo)] = NaiveBayes(classes)
combo[(mode,gram_combo)].train(traindata,trainlabels)
print('The model is trained!')
#prep the test data
Xtest=test['review']
unigrams = [parse_string(string) for string in Xtest]
uni_bi_tri = []
for unigram in unigrams :
    bigram = gen_bigrams(unigram)
    trigram = gen_trigrams(unigram)
    uni_bi_tri.append(unigram + bigram + trigram)
Xtest=np.array(uni_bi_tri)
#test the classifier on the provided test set...
pclasses=combo[(mode,gram_combo)].test(Xtest)
print ("done")
#write the result in Kaggle's required format
output = pd.DataFrame( data={"id":test["id"], "sentiment":pclasses} )
# Use Pandas to write the comma-separated output file
output.to_csv( "Naive_Bayes_Bag_of_Words_model.csv", index=False, quoting=3 )
Finally, upload the prediction to Kaggle:¶
Now I will upload the result to Kaggle to see my ranking and score. I had been told that a Naive Bayes approach could reach an accuracy of around 0.80960.
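With the CSV written, the result can be submitted from the command line using the kaggle package installed in the setup cell. A sketch, assuming the competition slug taken from the competition's URL:
#Hypothetical submission command; slug assumed from the competition URL
!kaggle competitions submit -c word2vec-nlp-tutorial -f Naive_Bayes_Bag_of_Words_model.csv -m "Manual Naive Bayes, uni-bi-tri, no stemming"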
It seems that I did well!
This was fun, and I learned a lot. Thanks for reading!