Sentiment Analysis

Artificial Intelligence Nanodegree Program | Natural Language Processing


With the rise of online social media platforms like Twitter, Facebook and Reddit, and the proliferation of customer reviews on sites like Amazon and Yelp, we now have access, more than ever before, to massive text-based data sets! They can be analyzed in order to determine how large portions of the population feel about certain products, events, etc. This sort of analysis is called sentiment analysis. In this notebook you will build an end-to-end sentiment classification system from scratch.

Instructions

Some template code has already been provided for you, and you will need to implement additional functionality to successfully complete this notebook. You will not need to modify the included code beyond what is requested. Sections that begin with 'TODO' in the header indicate that you need to complete or implement some portion within them. Instructions will be provided for each section and the specifics of the implementation are marked in the code block with a # TODO: ... comment. Please be sure to read the instructions carefully!

In addition to implementing code, there will be questions for you to answer which relate to the task and your implementation. Each section where you will answer a question is preceded by a 'Question:' header. Carefully read each question and provide your answer below the 'Answer:' header by editing the Markdown cell.

Note: Code and Markdown cells can be executed using the Shift+Enter keyboard shortcut. In addition, a cell can be edited by typically clicking it (double-click for Markdown cells) or by pressing Enter while it is highlighted.

Step 1: Exploring the data!

The dataset we are going to use is very popular among researchers in Natural Language Processing, usually referred to as the IMDb dataset. It consists of movie reviews from the website imdb.com, each labeled as either 'positive', if the reviewer enjoyed the film, or 'negative' otherwise.

Maas, Andrew L., et al. Learning Word Vectors for Sentiment Analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 2011.

We have provided the dataset for you. You can load it in by executing the Python cell below.

In [10]:
import os
import glob

def read_imdb_data(data_dir='data/imdb-reviews'):
    """Read IMDb movie reviews from given directory.
    
    Directory structure expected:
    - data/
        - train/
            - pos/
            - neg/
        - test/
            - pos/
            - neg/
    
    """

    # Data, labels to be returned in nested dicts matching the dir. structure
    data = {}
    labels = {}

    # Assume 2 sub-directories: train, test
    for data_type in ['train', 'test']:
        data[data_type] = {}
        labels[data_type] = {}

        # Assume 2 sub-directories for sentiment (label): pos, neg
        for sentiment in ['pos', 'neg']:
            data[data_type][sentiment] = []
            labels[data_type][sentiment] = []
            
            # Fetch list of files for this sentiment
            path = os.path.join(data_dir, data_type, sentiment, '*.txt')
            files = glob.glob(path)
            
            # Read reviews data and assign labels
            for f in files:
                with open(f) as review:
                    data[data_type][sentiment].append(review.read())
                    labels[data_type][sentiment].append(sentiment)
            
            assert len(data[data_type][sentiment]) == len(labels[data_type][sentiment]), \
                    "{}/{} data size does not match labels size".format(data_type, sentiment)
    
    # Return data, labels as nested dicts
    return data, labels


data, labels = read_imdb_data()
print("IMDb reviews: train = {} pos / {} neg, test = {} pos / {} neg".format(
        len(data['train']['pos']), len(data['train']['neg']),
        len(data['test']['pos']), len(data['test']['neg'])))
IMDb reviews: train = 12500 pos / 12500 neg, test = 12500 pos / 12500 neg

Now that the data is loaded in, let's take a quick look at one of the positive reviews:

In [11]:
print(data['train']['pos'][2])
Dark Remains is a home run plain and simple. The film is full of creepy visuals, and scares' that will make the most seasoned horror veteran jump straight out of there seat. The staircase scene in particular, these guys are good. Although they weren't working on a huge budget everything looks good, and the actors come through. Dark Remains does have one of those interpretive endings which may be a negative for some, but I guess it makes you think. Cheri Christian and Greg Thompson are spot on as the grieving couple trying to rebuild there lives', however some side characters like the Sheriff didn't convince me. They aren't all that important anyways. I give Dark Remains a perfect ten rating for being ten times scarier than any recent studio ghost story/ Japanese remake.

And one with a negative sentiment:

In [12]:
print(data['train']['neg'][2])
Evil Aliens owes a huge debt to Peter Jacksons early films Bad Taste and Braindead.I must confess to never enjoying those films particularly and i say the same about this.Jake West is a director who clearly lacks inspiration of his own and chooses to steal from those whom he looks up to.I lost count of the amount of times a major Hollywood film was quoted most notably James Camerons Aliens.The amount of blood and gore on show here isn't funny either,the latter end of the film becomes tired and dragged out.Maybe it would have worked better as a short film.The actors a poor,the direction is weak and the plot is non existent.I can see what the director was trying to do,the homage he was trying to pay,but others have done the same thing a lot better than presented here. 4/10

We can also make a wordcloud visualization of the reviews.

In [13]:
import matplotlib.pyplot as plt
%matplotlib inline

from wordcloud import WordCloud, STOPWORDS

sentiment = 'neg'

# Combine all reviews for the desired sentiment
combined_text = " ".join([review for review in data['train'][sentiment]])

# Initialize wordcloud object
wc = WordCloud(background_color='white', max_words=50,
        # update stopwords to include common words like film and movie
        stopwords = STOPWORDS.update(['br','film','movie']))

# Generate and plot wordcloud
plt.imshow(wc.generate(combined_text))
plt.axis('off')
plt.show()

Try changing the sentiment to 'neg' and see if you can spot any obvious differences between the wordclouds.

TODO: Form training and test sets

Now that you've seen what the raw data looks like, combine the positive and negative documents to get one unified training set and one unified test set.

In [14]:
from sklearn.utils import shuffle

def prepare_imdb_data(data):
    """Prepare training and test sets from IMDb movie reviews."""
    
    # TODO: Combine positive and negative reviews and labels
    data_train=data['train']['pos']+data['train']['neg']
    labels_train=labels['train']['pos']+labels['train']['neg']
    data_test=data['test']['pos']+data['test']['neg']
    labels_test=labels['test']['pos']+labels['test']['neg']
    
    # TODO: Shuffle reviews and corresponding labels within training and test sets
    data_train,labels_train=shuffle(data_train,labels_train)
    data_test,labels_test=shuffle(data_test,labels_test)
    
    # Return a unified training data, test data, training labels, test labets
    return data_train, data_test, labels_train, labels_test


data_train, data_test, labels_train, labels_test = prepare_imdb_data(data)
print("IMDb reviews (combined): train = {}, test = {}".format(len(data_train), len(data_test)))
IMDb reviews (combined): train = 25000, test = 25000

Step 2. Preprocessing

As you might have noticed in the sample reviews, our raw data includes HTML. Therefore there are HTML tags that need to be removed. We also need to remove non-letter characters, normalize uppercase letters by converting them to lowercase, tokenize, remove stop words, and stem the remaining words in each document.

TODO: Convert each review to words

As your next task, you should complete the function review_to_words() that performs all these steps. For your convenience, in the Python cell below we provide you with all the libraries that you may need in order to accomplish these preprocessing steps. Make sure you can import all of them! (If not, pip install from a terminal and run/import again.)

In [15]:
# BeautifulSoup to easily remove HTML tags
from bs4 import BeautifulSoup 

# RegEx for removing non-letter characters
import re

# NLTK library for the remaining steps
import nltk
nltk.download("stopwords")   # download list of stopwords (only once; need not run it again)
from nltk.corpus import stopwords # import stopwords

from nltk.stem.porter import *
#词干提取器
stemmer = PorterStemmer()
[nltk_data] Downloading package stopwords to /home/zzx/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
In [16]:
debug=0
def log(text):
    if debug>0:
        print(text)


def review_to_words(review):
    """Convert a raw review string into a sequence of words."""
    
    # TODO: Remove HTML tags and non-letters,
    #       convert to lowercase, tokenize,
    #       remove stopwords and stem
    log('Raw input :')
    log(review)
    
    # remove HTML tags
    #去掉文本的html格式
    review=BeautifulSoup(review, "html5lib").get_text()
    log('\nHTML tags removed')
    log(review)
    
    # remove punctuation
    review = re.sub(r"[^a-zA-Z0-9]", " ", review) 
    log('\n Punctuation removed')
    log(review)
    
    # lowercase    
    review = review.lower()
    
    # tokenize
    review = nltk.word_tokenize(review)
    log('\n Tokenized')
    log(review)
    
    # remove stop words
    review = [w for w in review if w not in stopwords.words("english")]
    log('\n Stop words removed')
    log(review)
    
    # stemming
    review = [PorterStemmer().stem(w) for w in review]
    log('\n Stemmed')
    log(review)
    
    log('\n\n')
    #words=[]
    
    # Return final list of words
    return review


review_to_words('''This is just a <em>test</em>.<br/><br />
But if it wasn't a test, it would make for a <b>Great</b> movie review!''')

#review_to_words(data_train[0])
Out[16]:
['test', 'test', 'would', 'make', 'great', 'movi', 'review']

With the function review_to_words() fully implemeneted, we can apply it to all reviews in both training and test datasets. This may take a while, so let's build in a mechanism to write to a cache file and retrieve from it later.

In [17]:
import pickle

cache_dir = os.path.join("cache", "sentiment_analysis")  # where to store cache files
os.makedirs(cache_dir, exist_ok=True)  # ensure cache directory exists

def preprocess_data(data_train, data_test, labels_train, labels_test,
                    cache_dir=cache_dir, cache_file="preprocessed_data.pkl"):
    """Convert each review to words; read from cache if available."""

    # If cache_file is not None, try to read from it first
    cache_data = None
    if cache_file is not None:
        try:
            with open(os.path.join(cache_dir, cache_file), "rb") as f:
                cache_data = pickle.load(f)
            print("Read preprocessed data from cache file:", cache_file)
        except:
            pass  # unable to read from cache, but that's okay
    
    # If cache is missing, then do the heavy lifting
    if cache_data is None:
        # Preprocess training and test data to obtain words for each review
        words_train = list(map(review_to_words, data_train))
        words_test = list(map(review_to_words, data_test))
        
        # Write to cache file for future runs
        if cache_file is not None:
            cache_data = dict(words_train=words_train, words_test=words_test,
                              labels_train=labels_train, labels_test=labels_test)
            with open(os.path.join(cache_dir, cache_file), "wb") as f:
                pickle.dump(cache_data, f)
            print("Wrote preprocessed data to cache file:", cache_file)
    else:
        # Unpack data loaded from cache file
        words_train, words_test, labels_train, labels_test = (cache_data['words_train'],
                cache_data['words_test'], cache_data['labels_train'], cache_data['labels_test'])
    
    return words_train, words_test, labels_train, labels_test


# Preprocess data
words_train, words_test, labels_train, labels_test = preprocess_data(
        data_train, data_test, labels_train, labels_test)

# Take a look at a sample
print("\n--- Raw review ---")
print(data_train[1])
print("\n--- Preprocessed words ---")
print(words_train[1])
print("\n--- Label ---")
print(labels_train[1])
Read preprocessed data from cache file: preprocessed_data.pkl

--- Raw review ---
Starring an unknown cast which seem likely to remain that way, this "film" is yet another cheap slasher flick which amazes me how this was released. I have no problem with horrors and slasher flicks in particular, in fact they are my favourites. But when they are done THIS BAD, it really does take the monkey and its no wonder the genre has such a hard time. The story is as clichéd and without imagination as possible with a bunch of people in a cabin out in the woods being slashed and hacked up by this zombie/ghost guy. Its not the story that sucks the most, its the atrocious acting and dialouge, home made directing quality and an awful soundtrack. Not to mention laughable effects and some incredibly lazy film making - these morons are outside in clear daylight yet we are meant to believe it's night?? What the hell was the director thinking with this move? What, he had only one day to film all this in? He was scared of the dark? ( Is hilarious seeing a cop walking around in pure daylight with a torch acting as if its pitch black, though)<br /><br />I guess the positive side for the actors is they look like people who work in the local supermarket so at least they could possibly escape from this film without ever being noticed. Im sure one of the "teens" plays bingo down the local pub - but she's 40-45.<br /><br />Anyway, good for a laugh but just another waste of film and time.

--- Preprocessed words ---
['stephen', 'king', 'adapt', 'script', 'king', 'young', 'famili', 'newcom', 'rural', 'main', 'find', 'pet', 'cemeteri', 'close', 'home', 'father', 'dale', 'midkiff', 'find', 'micmac', 'burial', 'ground', 'beyond', 'pet', 'cemeteri', 'power', 'resurrect', 'cours', 'anyth', 'buri', 'come', 'back', 'quit', 'right', 'averag', 'horror', 'pictur', 'start', 'clumsi', 'insult', 'inept', 'continu', 'way', 'absolut', 'worst', 'element', 'midkiff', 'worthless', 'perform', 'get', 'littl', 'better', 'toward', 'end', 'genuin', 'disturb', 'final', 'point', 'fact', 'whole', 'movi', 'realli', 'disturb', 'complet', 'dismiss', 'least', 'someth', 'make', 'memor', 'decent', 'support', 'perform', 'fred', 'gwynn', 'wise', 'old', 'age', 'neighbor', 'brad', 'greenquist', 'disfigur', 'spirit', 'victor', 'pascow', 'enough', 'realli', 'redeem', 'film', 'king', 'usual', 'cameo', 'minist', 'follow', 'sequel', 'also', 'direct', 'mari', 'lambert', 'wonder', 'mainstream', 'film', 'work', 'sinc', '4', '10']

--- Label ---
neg

Step 3: Extracting Bag-of-Words features

Now that each document has been preprocessed, we can transform each into a Bag-of-Words feature representation. Note that we need to create this transformation based on the training data alone, as we are not allowed to peek at the testing data at all!

The dictionary or vocabulary $V$ (set of words shared by documents in the training set) used here will be the one on which we train our supervised learning algorithm. Any future test data must be transformed in the same way for us to be able to apply the learned model for prediction. Hence, it is important to store the transformation / vocabulary as well.

Note: The set of words in the training set may not be exactly the same as the test set. What do you do if you encounter a word during testing that you haven't seen before? Unfortunately, we'll have to ignore it, or replace it with a special <UNK> token.

TODO: Compute Bag-of-Words features

Implement the extract_BoW_features() function, apply it to both training and test datasets, and store the results in features_train and features_test NumPy arrays, respectively. Choose a reasonable vocabulary size, say $|V| = 5000$, and keep only the top $|V|$ occuring words and discard the rest. This number will also serve as the number of columns in the BoW matrices.

Hint: You may find it useful to take advantage of CountVectorizer from scikit-learn. Also make sure to pickle your Bag-of-Words transformation so that you can use it in future.

In [18]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.externals import joblib
# joblib is an enhanced version of pickle that is more efficient for storing NumPy arrays

def extract_BoW_features(words_train, words_test, vocabulary_size=5000,
                         cache_dir=cache_dir, cache_file="bow_features.pkl"):
    """Extract Bag-of-Words for a given set of documents, already preprocessed into words."""
    
    # If cache_file is not None, try to read from it first
    cache_data = None
    if cache_file is not None:
        try:
            with open(os.path.join(cache_dir, cache_file), "rb") as f:
                cache_data = joblib.load(f)
            print("Read features from cache file:", cache_file)
        except:
            pass  # unable to read from cache, but that's okay
    
    # If cache is missing, then do the heavy lifting
    if cache_data is None:
        # TODO: Fit a vectorizer to training documents and use it to transform them
        # NOTE: Training documents have already been preprocessed and tokenized into words;
        #       pass in dummy functions to skip those steps, e.g. preprocessor=lambda x: x
        vectorizer = CountVectorizer(preprocessor=lambda x: x,tokenizer=lambda x: x, lowercase=False,max_features=5000)
        features_train = vectorizer.fit_transform(words_train).toarray()

        # TODO: Apply the same vectorizer to transform the test documents (ignore unknown words)
        features_test = vectorizer.fit_transform(words_test).toarray()
        
        # NOTE: Remember to convert the features using .toarray() for a compact representation
        
        # Write to cache file for future runs (store vocabulary as well)
        if cache_file is not None:
            vocabulary = vectorizer.vocabulary_
            cache_data = dict(features_train=features_train, features_test=features_test,
                             vocabulary=vocabulary)
            with open(os.path.join(cache_dir, cache_file), "wb") as f:
                joblib.dump(cache_data, f)
            print("Wrote features to cache file:", cache_file)
    else:
        # Unpack data loaded from cache file
        features_train, features_test, vocabulary = (cache_data['features_train'],
                cache_data['features_test'], cache_data['vocabulary'])
    
    # Return both the extracted features as well as the vocabulary
    return features_train, features_test, vocabulary


# Extract Bag of Words features for both training and test datasets
features_train, features_test, vocabulary = extract_BoW_features(words_train, words_test)

# Inspect the vocabulary that was computed
print("Vocabulary: {} words".format(len(vocabulary)))

import random
print("Sample words: {}".format(random.sample(list(vocabulary.keys()), 8)))

# Sample
print("\n--- Preprocessed words ---")
print(words_train[5])
print("\n--- Bag-of-Words features ---")
print(features_train[5])
print("\n--- Label ---")
print(labels_train[5])
Read features from cache file: bow_features.pkl
Vocabulary: 5000 words
Sample words: ['pleasantli', 'ii', 'urban', 'armor', 'catherin', 'box', 'dove', 'donna']

--- Preprocessed words ---
['show', 'great', 'histori', 'stori', 'everyth', 'slaveri', 'way', 'treat', 'religion', 'way', 'jew', 'sent', 'hide', 'inquisit', 'belief', 'orisha', 'african', 'god', 'way', 'women', 'treat', 'includ', 'daughter', 'even', 'homosexu', 'way', 'charact', 'intertwin', 'violant', 'charact', 'sadden', 'desper', 'love', 'destroy', 'everyon', 'around', 'glad', 'decid', 'releas', 'v', 'although', 'would', 'love', 'see', 'unedit', 'version', 'xica', 'becom', 'heroin', 'look', 'way', 'use', 'power', 'help', 'seek', 'love', 'charact', 'found', 'relat', 'mani', 'peopl', 'centuri', 'look', 'forward', 'xica', 'everi', 'night', 'would', 'great', 'dub', 'english', 'american', 'love']

--- Bag-of-Words features ---
[0 0 0 ..., 0 0 0]

--- Label ---
pos
In [19]:
[index for index in features_train[5] if index != 0]
Out[19]:
[1,
 1,
 1,
 1,
 1,
 1,
 1,
 3,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 2,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 2,
 4,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 2,
 1,
 1,
 1,
 5,
 1,
 2]

Let's try to visualize the Bag-of-Words feature vector for one of our training documents.

In [20]:
# Plot the BoW feature vector for a training document
plt.plot(features_train[5,:])
plt.xlabel('Word')
plt.ylabel('Count')
plt.show()

Question: Reflecting on Bag-of-Words feature representation

What is the average sparsity level of BoW vectors in our training set? In other words, on average what percentage of entries in a BoW feature vector are zero?

Answer:

...

Zipf's law

Zipf's law, named after the famous American linguist George Zipf, is an empirical law stating that given a large collection of documents, the frequency of any word is inversely proportional to its rank in the frequency table. So the most frequent word will occur about twice as often as the second most frequent word, three times as often as the third most frequent word, and so on. In the figure below we plot number of appearances of each word in our training set against its rank.

In [21]:
# Find number of occurrences for each word in the training set
word_freq = features_train.sum(axis=0)

# Sort it in descending order
sorted_word_freq = np.sort(word_freq)[::-1]

# Plot 
plt.plot(sorted_word_freq)
plt.gca().set_xscale('log')
plt.gca().set_yscale('log')
plt.xlabel('Rank')
plt.ylabel('Number of occurrences')
plt.show()

Question: Zipf's law

What is the total number of occurrences of the most frequent word? What is the the total number of occurrences of the second most frequent word? Do your numbers follow Zipf's law? If not, why?

Answer:

...

TODO: Normalize feature vectors

Bag-of-Words features are intuitive to understand as they are simply word counts. But counts can vary a lot, and potentially throw off learning algorithms later in the pipeline. So, before we proceed further, let's normalize the BoW feature vectors to have unit length.

This makes sure that each document's representation retains the unique mixture of feature components, but prevents documents with large word counts from dominating those with fewer words.

In [22]:
import sklearn.preprocessing as pr

# TODO: Normalize BoW features in training and test set
features_train=pr.normalize(features_train, norm='l2',copy=False)
features_test=pr.normalize(features_test, norm='l2',copy=False)
/home/zzx/anaconda3/envs/rnn/lib/python3.5/site-packages/sklearn/utils/validation.py:429: DataConversionWarning: Data with input dtype int64 was converted to float64 by the normalize function.
  warnings.warn(msg, _DataConversionWarning)
In [23]:
[index for index in features_train[5] if index != 0]
Out[23]:
[0.095346258924559238,
 0.095346258924559238,
 0.095346258924559238,
 0.095346258924559238,
 0.095346258924559238,
 0.095346258924559238,
 0.095346258924559238,
 0.28603877677367767,
 0.095346258924559238,
 0.095346258924559238,
 0.095346258924559238,
 0.095346258924559238,
 0.095346258924559238,
 0.095346258924559238,
 0.095346258924559238,
 0.095346258924559238,
 0.095346258924559238,
 0.095346258924559238,
 0.095346258924559238,
 0.095346258924559238,
 0.095346258924559238,
 0.095346258924559238,
 0.19069251784911848,
 0.095346258924559238,
 0.095346258924559238,
 0.095346258924559238,
 0.095346258924559238,
 0.095346258924559238,
 0.095346258924559238,
 0.095346258924559238,
 0.19069251784911848,
 0.38138503569823695,
 0.095346258924559238,
 0.095346258924559238,
 0.095346258924559238,
 0.095346258924559238,
 0.095346258924559238,
 0.095346258924559238,
 0.095346258924559238,
 0.095346258924559238,
 0.095346258924559238,
 0.095346258924559238,
 0.095346258924559238,
 0.095346258924559238,
 0.19069251784911848,
 0.095346258924559238,
 0.095346258924559238,
 0.095346258924559238,
 0.47673129462279618,
 0.095346258924559238,
 0.19069251784911848]

Step 4: Classification using BoW features

Now that the data has all been properly transformed, we can feed it into a classifier. To get a baseline model, we train a Naive Bayes classifier from scikit-learn (specifically, GaussianNB), and evaluate its accuracy on the test set.

In [14]:
from sklearn.naive_bayes import GaussianNB

# TODO: Train a Guassian Naive Bayes classifier
clf1 = GaussianNB().fit(features_train,labels_train)

# Calculate the mean accuracy score on training and test sets
print("[{}] Accuracy: train = {}, test = {}".format(
        clf1.__class__.__name__,
        clf1.score(features_train, labels_train),
        clf1.score(features_test, labels_test)))
[GaussianNB] Accuracy: train = 0.81916, test = 0.48212

Tree-based algorithms often work quite well on Bag-of-Words as their highly discontinuous and sparse nature is nicely matched by the structure of trees. As your next task, you will try to improve on the Naive Bayes classifier's performance by using scikit-learn's Gradient-Boosted Decision Tree classifer.

TODO: Gradient-Boosted Decision Tree classifier

Use GradientBoostingClassifier from scikit-learn to classify the BoW data. This model has a number of parameters. We use default parameters for some of them and pre-set the rest for you, except one: n_estimators. Find a proper value for this hyperparameter, use it to classify the data, and report how much improvement you get over Naive Bayes in terms of accuracy.

Tip: Use a model selection technique such as cross-validation, grid-search, or an information criterion method, to find an optimal value for the hyperparameter.

In [15]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

n_estimators = range(20, 81, 10)

def classify_gboost(X_train, X_test, y_train, y_test):        
    # Initialize classifier
    clf = GradientBoostingClassifier(n_estimators=n_estimators, learning_rate=1.0, max_depth=1, random_state=0)

    # TODO: Classify the data using GradientBoostingClassifier
    gsearch = GridSearchCV(estimator = clf, param_grid = {'n_estimators': n_estimators}, verbose=50, n_jobs=-1)
    gsearch.fit(X_train, y_train)
    # TODO(optional): Perform hyperparameter tuning / model selection
    best_clf = gsearch.best_estimator_
    # TODO: Print final training & test accuracy
    print(gsearch.best_params_)
    # Return best classifier model
    print("[{}] Accuracy: train = {}, test = {}".format(
    best_clf.__class__.__name__,
    best_clf.score(X_train, y_train),
    best_clf.score(X_test, y_test)))
    
    return best_clf


clf2 = classify_gboost(features_train, features_test, labels_train, labels_test)
Fitting 3 folds for each of 7 candidates, totalling 21 fits
[CV] n_estimators=20 .................................................
[CV] n_estimators=20 .................................................
[CV] n_estimators=20 .................................................
[CV] n_estimators=30 .................................................
[CV] n_estimators=30 .................................................
[CV] n_estimators=30 .................................................
[CV] n_estimators=40 .................................................
[CV] n_estimators=40 .................................................
[CV] n_estimators=40 .................................................
[CV] n_estimators=50 .................................................
[CV] n_estimators=50 .................................................
[CV] n_estimators=50 .................................................
[CV] n_estimators=60 .................................................
[CV] n_estimators=60 .................................................
[CV] n_estimators=60 .................................................
[CV] n_estimators=70 .................................................
[CV] n_estimators=70 .................................................
[CV] n_estimators=70 .................................................
[CV] n_estimators=80 .................................................
[CV] .................. n_estimators=20, score=0.768419, total=  33.2s
[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:   38.8s
[CV] n_estimators=80 .................................................
[CV] n_estimators=80 .................................................
[CV] .................. n_estimators=20, score=0.767699, total=  48.1s
[Parallel(n_jobs=-1)]: Done   2 out of  21 | elapsed:   57.3s remaining:  9.1min
[CV] .................. n_estimators=20, score=0.773164, total=  50.7s
[Parallel(n_jobs=-1)]: Done   3 out of  21 | elapsed:   58.8s remaining:  5.9min
[CV] .................. n_estimators=30, score=0.783897, total= 1.1min
[Parallel(n_jobs=-1)]: Done   4 out of  21 | elapsed:  1.3min remaining:  5.6min
[CV] .................. n_estimators=30, score=0.784257, total= 1.4min
[Parallel(n_jobs=-1)]: Done   5 out of  21 | elapsed:  1.6min remaining:  5.1min
[CV] .................. n_estimators=30, score=0.790687, total= 1.4min
[Parallel(n_jobs=-1)]: Done   6 out of  21 | elapsed:  1.6min remaining:  4.1min
[CV] .................. n_estimators=40, score=0.805089, total= 1.7min
[Parallel(n_jobs=-1)]: Done   7 out of  21 | elapsed:  2.0min remaining:  4.0min
[CV] .................. n_estimators=40, score=0.799976, total= 1.8min
[Parallel(n_jobs=-1)]: Done   8 out of  21 | elapsed:  2.1min remaining:  3.4min
[CV] .................. n_estimators=40, score=0.796136, total= 2.0min
[Parallel(n_jobs=-1)]: Done   9 out of  21 | elapsed:  2.3min remaining:  3.0min
[CV] .................. n_estimators=60, score=0.816775, total= 2.6min
[Parallel(n_jobs=-1)]: Done  10 out of  21 | elapsed:  3.1min remaining:  3.4min
[CV] .................. n_estimators=60, score=0.816295, total= 2.7min
[Parallel(n_jobs=-1)]: Done  11 out of  21 | elapsed:  3.1min remaining:  2.9min
[CV] .................. n_estimators=50, score=0.817211, total= 2.8min
[Parallel(n_jobs=-1)]: Done  12 out of  21 | elapsed:  3.2min remaining:  2.4min
[CV] .................. n_estimators=50, score=0.805856, total= 2.9min
[Parallel(n_jobs=-1)]: Done  13 out of  21 | elapsed:  3.2min remaining:  2.0min
[CV] .................. n_estimators=60, score=0.824532, total= 2.8min
[Parallel(n_jobs=-1)]: Done  14 out of  21 | elapsed:  3.3min remaining:  1.6min
[CV] .................. n_estimators=50, score=0.808495, total= 3.0min
[Parallel(n_jobs=-1)]: Done  15 out of  21 | elapsed:  3.4min remaining:  1.4min
[CV] .................. n_estimators=70, score=0.828012, total= 2.9min
[Parallel(n_jobs=-1)]: Done  16 out of  21 | elapsed:  3.5min remaining:  1.1min
[CV] .................. n_estimators=70, score=0.817375, total= 2.9min
[Parallel(n_jobs=-1)]: Done  17 out of  21 | elapsed:  3.5min remaining:   49.6s
[CV] .................. n_estimators=80, score=0.822774, total= 3.0min
[Parallel(n_jobs=-1)]: Done  18 out of  21 | elapsed:  3.6min remaining:   36.4s
[CV] .................. n_estimators=80, score=0.821934, total= 3.0min
[Parallel(n_jobs=-1)]: Done  19 out of  21 | elapsed:  3.6min remaining:   23.0s
[CV] .................. n_estimators=80, score=0.834494, total= 3.2min
[CV] .................. n_estimators=70, score=0.820494, total= 3.5min
[Parallel(n_jobs=-1)]: Done  21 out of  21 | elapsed:  4.0min remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  21 out of  21 | elapsed:  4.0min finished
{'n_estimators': 80}
[GradientBoostingClassifier] Accuracy: train = 0.84024, test = 0.53692

TODO: Adverserial testing

Write a short movie review to trick your machine learning model! That is, a movie review with a clear positive or negative sentiment that your model will classify incorrectly.

Hint: You might want to take advantage of the biggest weakness of the Bag-of-Words scheme!

In [16]:
# TODO: Write a sample review and set its true sentiment
my_review = "The action was bad-ass and the visuals were sick!  The vocal delivery of the lead villian gave me creeps the way it should whereas the protagonist's demeanor was uninspiring until boldly salvaging his role at the end."
true_sentiment = 'pos'  # sentiment must be 'pos' or 'neg'

# TODO: Apply the same preprocessing and vectorizing steps as you did for your training data
my_words = review_to_words(my_review)
vectorizer = CountVectorizer(vocabulary=vocabulary, preprocessor=lambda x: x, tokenizer=lambda x: x)
my_features = vectorizer.transform(my_words).toarray()
my_features = pr.normalize(my_features)
# TODO: Then call your classifier to label it
yhat = clf2.predict(my_features)
print(yhat[-1])
pos
/home/zzx/anaconda3/envs/rnn/lib/python3.5/site-packages/sklearn/utils/validation.py:429: DataConversionWarning: Data with input dtype int64 was converted to float64 by the normalize function.
  warnings.warn(msg, _DataConversionWarning)

Step 5: Switching gears - RNNs

We just saw how the task of sentiment analysis can be solved via a traditional machine learning approach: BoW + a nonlinear classifier. We now switch gears and use Recurrent Neural Networks, and in particular LSTMs, to perform sentiment analysis in Keras. Conveniently, Keras has a built-in IMDb movie reviews dataset that we can use, with the same vocabulary size.

In [3]:
from keras.datasets import imdb  # import the built-in imdb dataset in Keras

# Set the vocabulary size
vocabulary_size = 5000

# Load in training and test data (note the difference in convention compared to scikit-learn)
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=vocabulary_size)
print("Loaded dataset with {} training samples, {} test samples".format(len(X_train), len(X_test)))
Loaded dataset with 25000 training samples, 25000 test samples
In [18]:
# Inspect a sample review and its label
print("--- Review ---")
print(X_train[7])
print("--- Label ---")
print(y_train[7])
--- Review ---
[1, 4, 2, 716, 4, 65, 7, 4, 689, 4367, 2, 2343, 4804, 2, 2, 2, 2, 2315, 2, 2, 2, 2, 4, 2, 628, 2, 37, 9, 150, 4, 2, 4069, 11, 2909, 4, 2, 847, 313, 6, 176, 2, 9, 2, 138, 9, 4434, 19, 4, 96, 183, 26, 4, 192, 15, 27, 2, 799, 2, 2, 588, 84, 11, 4, 3231, 152, 339, 2, 42, 4869, 2, 2, 345, 4804, 2, 142, 43, 218, 208, 54, 29, 853, 659, 46, 4, 882, 183, 80, 115, 30, 4, 172, 174, 10, 10, 1001, 398, 1001, 1055, 526, 34, 3717, 2, 2, 2, 17, 4, 2, 1094, 871, 64, 85, 22, 2030, 1109, 38, 230, 9, 4, 4324, 2, 251, 2, 1034, 195, 301, 14, 16, 31, 7, 4, 2, 8, 783, 2, 33, 4, 2945, 103, 465, 2, 42, 845, 45, 446, 11, 1895, 19, 184, 76, 32, 4, 2, 207, 110, 13, 197, 4, 2, 16, 601, 964, 2152, 595, 13, 258, 4, 1730, 66, 338, 55, 2, 4, 550, 728, 65, 1196, 8, 1839, 61, 1546, 42, 2, 61, 602, 120, 45, 2, 6, 320, 786, 99, 196, 2, 786, 2, 4, 225, 4, 373, 1009, 33, 4, 130, 63, 69, 72, 1104, 46, 1292, 225, 14, 66, 194, 2, 1703, 56, 8, 803, 1004, 6, 2, 155, 11, 4, 2, 3231, 45, 853, 2029, 8, 30, 6, 117, 430, 19, 6, 2, 9, 15, 66, 424, 8, 2337, 178, 9, 15, 66, 424, 8, 1465, 178, 9, 15, 66, 142, 15, 9, 424, 8, 28, 178, 662, 44, 12, 17, 4, 130, 898, 1686, 9, 6, 2, 267, 185, 430, 4, 118, 2, 277, 15, 4, 1188, 100, 216, 56, 19, 4, 357, 114, 2, 367, 45, 115, 93, 788, 121, 4, 2, 79, 32, 68, 278, 39, 8, 818, 162, 4165, 237, 600, 7, 98, 306, 8, 157, 549, 628, 11, 6, 2, 13, 824, 15, 4104, 76, 42, 138, 36, 774, 77, 1059, 159, 150, 4, 229, 497, 8, 1493, 11, 175, 251, 453, 19, 2, 189, 12, 43, 127, 6, 394, 292, 7, 2, 4, 107, 8, 4, 2826, 15, 1082, 1251, 9, 906, 42, 1134, 6, 66, 78, 22, 15, 13, 244, 2519, 8, 135, 233, 52, 44, 10, 10, 466, 112, 398, 526, 34, 4, 1572, 4413, 2, 1094, 225, 57, 599, 133, 225, 6, 227, 7, 541, 4323, 6, 171, 139, 7, 539, 2, 56, 11, 6, 3231, 21, 164, 25, 426, 81, 33, 344, 624, 19, 6, 4617, 7, 2, 2, 6, 2, 4, 22, 9, 1082, 629, 237, 45, 188, 6, 55, 655, 707, 2, 956, 225, 1456, 841, 42, 1310, 225, 6, 2493, 1467, 2, 2828, 21, 4, 2, 9, 364, 23, 4, 2228, 2407, 225, 24, 76, 133, 18, 4, 189, 2293, 10, 10, 814, 11, 2, 11, 2642, 14, 47, 15, 682, 364, 352, 168, 44, 12, 45, 24, 913, 93, 21, 247, 2441, 4, 116, 34, 35, 1859, 8, 72, 177, 9, 164, 8, 901, 344, 44, 13, 191, 135, 13, 126, 421, 233, 18, 259, 10, 10, 4, 2, 2, 4, 2, 3074, 7, 112, 199, 753, 357, 39, 63, 12, 115, 2, 763, 8, 15, 35, 3282, 1523, 65, 57, 599, 6, 1916, 277, 1730, 37, 25, 92, 202, 6, 2, 44, 25, 28, 6, 22, 15, 122, 24, 4171, 72, 33, 32]
--- Label ---
0

Notice that the label is an integer (0 for negative, 1 for positive), and the review itself is stored as a sequence of integers. These are word IDs that have been preassigned to individual words. To map them back to the original words, you can use the dictionary returned by imdb.get_word_index().

In [19]:
# Map word IDs back to words
word2id = imdb.get_word_index()
id2word = {i: word for word, i in word2id.items()}
print("--- Review (with words) ---")
print([id2word.get(i, " ") for i in X_train[7]])
print("--- Label ---")
print(y_train[7])
--- Review (with words) ---
['the', 'of', 'and', 'local', 'of', 'their', 'br', 'of', 'attention', 'widow', 'and', 'captures', 'parties', 'and', 'and', 'and', 'and', 'excitement', 'and', 'and', 'and', 'and', 'of', 'and', 'english', 'and', 'like', 'it', 'years', 'of', 'and', 'unintentional', 'this', 'hitchcock', 'of', 'and', 'learn', 'everyone', 'is', 'quite', 'and', 'it', 'and', 'such', 'it', 'bonus', 'film', 'of', 'too', 'seems', 'he', 'of', 'enough', 'for', 'be', 'and', 'editing', 'and', 'and', 'please', 'great', 'this', 'of', 'shoots', 'thing', '3', 'and', "it's", 'mentioning', 'and', 'and', 'given', 'parties', 'and', 'back', 'out', 'interesting', 'times', 'no', 'all', 'average', 'talking', 'some', 'of', 'nor', 'seems', 'into', 'best', 'at', 'of', 'every', 'cast', 'i', 'i', 'inside', 'keep', 'inside', 'large', 'viewer', 'who', 'obscure', 'and', 'and', 'and', 'movie', 'of', 'and', 'entirely', "you've", 'see', 'because', 'you', 'deals', 'successful', 'her', 'anything', 'it', 'of', 'dedicated', 'and', 'hard', 'and', 'further', "that's", 'takes', 'as', 'with', 'by', 'br', 'of', 'and', 'in', 'minute', 'and', 'they', 'of', 'westerns', 'watch', 'seemed', 'and', "it's", 'lee', 'if', 'oh', 'this', 'japan', 'film', 'around', 'get', 'an', 'of', 'and', 'always', 'life', 'was', 'between', 'of', 'and', 'with', 'group', 'rate', 'code', "film's", 'was', 'although', 'of', 'arts', 'had', 'death', 'time', 'and', 'of', 'anyway', 'romantic', 'their', 'won', 'in', 'kevin', 'only', 'flying', "it's", 'and', 'only', 'cut', 'show', 'if', 'and', 'is', 'star', 'stay', 'movies', 'both', 'and', 'stay', 'and', 'of', 'music', 'of', 'tell', 'missing', 'they', 'of', 'here', 'really', 'me', 'we', 'value', 'some', 'silent', 'music', 'as', 'had', 'thought', 'and', 'realized', 'she', 'in', 'sorry', 'reasons', 'is', 'and', '10', 'this', 'of', 'and', 'shoots', 'if', 'average', 'remembered', 'in', 'at', 'is', 'over', 'worse', 'film', 'is', 'and', 'it', 'for', 'had', 'absolutely', 'in', 'naive', 'want', 'it', 'for', 'had', 'absolutely', 'in', 'j', 'want', 'it', 'for', 'had', 'back', 'for', 'it', 'absolutely', 'in', 'one', 'want', 'shots', 'has', 'that', 'movie', 'of', 'here', 'write', 'whatsoever', 'it', 'is', 'and', 'set', 'got', 'worse', 'of', 'where', 'and', 'once', 'for', 'of', 'accent', 'after', 'saw', 'she', 'film', 'of', 'rest', 'little', 'and', 'camera', 'if', 'best', 'way', 'elements', 'know', 'of', 'and', 'also', 'an', 'were', 'sense', 'or', 'in', 'realistic', 'actually', 'satan', "he's", 'score', 'br', 'any', 'himself', 'in', 'another', 'type', 'english', 'this', 'is', 'and', 'was', 'tom', 'for', 'dating', 'get', "it's", 'such', 'from', 'fantastic', 'will', 'pace', 'new', 'years', 'of', 'guy', 'game', 'in', 'murders', 'this', 'us', 'hard', 'lives', 'film', 'and', 'fact', 'that', 'out', 'end', 'is', 'getting', 'together', 'br', 'and', 'of', 'seen', 'in', 'of', 'jail', 'for', 'sees', 'utterly', 'it', 'meet', "it's", 'depth', 'is', 'had', 'do', 'you', 'for', 'was', 'rather', 'convince', 'in', 'why', 'last', 'very', 'has', 'i', 'i', 'throughout', 'never', 'keep', 'viewer', 'who', 'of', 'becoming', 'switch', 'and', 'entirely', 'music', 'even', 'interest', 'scene', 'music', 'is', 'far', 'br', 'voice', 'riveting', 'is', 'again', 'something', 'br', 'decent', 'and', 'she', 'this', 'is', 'shoots', 'not', 'director', 'have', 'against', 'people', 'they', 'line', 'cinematography', 'film', 'is', 'couples', 'br', 'and', 'and', 'is', 'and', 'of', 'you', 'it', 'sees', 'hero', "he's", 'if', "can't", 'is', 'time', 'husband', 'silly', 'and', 'result', 'music', 'image', 'sequences', "it's", 'chase', 'music', 'is', 'veteran', 'include', 'and', 'freeman', 'not', 'of', 'and', 'it', 'along', 'are', 'of', 'hearing', 'cutting', 'music', 'his', 'get', 'scene', 'but', 'of', 'fact', 'correct', 'i', 'i', 'means', 'this', 'and', 'this', 'blockbuster', 'as', 'there', 'for', 'disappointed', 'along', 'wrong', 'few', 'has', 'that', 'if', 'his', 'weird', 'way', 'not', 'girl', 'display', 'of', 'love', 'who', 'so', 'friendship', 'in', 'we', 'down', 'it', 'director', 'in', 'situation', 'line', 'has', 'was', 'big', 'why', 'was', 'your', 'supposed', 'last', 'but', 'especially', 'i', 'i', 'of', 'and', 'and', 'of', 'and', 'internet', 'br', 'never', 'give', 'theme', 'rest', 'or', 'really', 'that', 'best', 'and', 'release', 'in', 'for', 'so', 'multi', 'random', 'their', 'even', 'interest', 'is', 'judge', 'once', 'arts', 'like', 'have', 'then', 'own', 'is', 'and', 'has', 'have', 'one', 'is', 'you', 'for', 'off', 'his', 'dutch', 'we', 'they', 'an']
--- Label ---
0

Unlike our Bag-of-Words approach, where we simply summarized the counts of each word in a document, this representation essentially retains the entire sequence of words (minus punctuation, stopwords, etc.). This is critical for RNNs to function. But it also means that now the features can be of different lengths!

Question: Variable length reviews

What is the maximum review length (in terms of number of words) in the training set? What is the minimum?

Answer:

...

TODO: Pad sequences

In order to feed this data into your RNN, all input documents must have the same length. Let's limit the maximum review length to max_words by truncating longer reviews and padding shorter reviews with a null value (0). You can accomplish this easily using the pad_sequences() function in Keras. For now, set max_words to 500.

In [4]:
from keras.preprocessing import sequence

# Set the maximum number of words per document (for both training and testing)
max_words = 500

# TODO: Pad sequences in X_train and X_test
X_train = sequence.pad_sequences(X_train, maxlen=max_words)
X_test = sequence.pad_sequences(X_test, maxlen=max_words)

TODO: Design an RNN model for sentiment analysis

Build your model architecture in the code cell below. We have imported some layers from Keras that you might need but feel free to use any other layers / transformations you like.

Remember that your input is a sequence of words (technically, integer word IDs) of maximum length = max_words, and your output is a binary sentiment label (0 or 1).

In [5]:
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense, Dropout

#使用多GPU训练
import tensorflow as tf
server = tf.train.Server.create_local_server()
sess = tf.Session(server.target)
from keras import backend as K
K.set_session(sess)

# TODO: Design your model
model = Sequential()
model.add(Embedding(vocabulary_size, 128, input_length=max_words))
model.add(LSTM(128))
model.add(Dropout(0.2))
model.add(Dense(1, activation='sigmoid'))
print(model.summary())
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, 500, 128)          640000    
_________________________________________________________________
lstm_1 (LSTM)                (None, 128)               131584    
_________________________________________________________________
dropout_1 (Dropout)          (None, 128)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 129       
=================================================================
Total params: 771,713.0
Trainable params: 771,713.0
Non-trainable params: 0.0
_________________________________________________________________
None

Question: Architecture and parameters

Briefly describe your neural net architecture. How many model parameters does it have that need to be trained?

Answer:

...

TODO: Train and evaluate your model

Now you are ready to train your model. In Keras world, you first need to compile your model by specifying the loss function and optimizer you want to use while training, as well as any evaluation metrics you'd like to measure. Specify the approprate parameters, including at least one metric 'accuracy'.

In [6]:
# TODO: Compile your model, specifying a loss function, optimizer, and metrics
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

Once compiled, you can kick off the training process. There are two important training parameters that you have to specify - batch size and number of training epochs, which together with your model architecture determine the total training time.

Training may take a while, so grab a cup of coffee, or better, go for a hike! If possible, consider using a GPU, as a single training run can take several hours on a CPU.

Tip: You can split off a small portion of the training set to be used for validation during training. This will help monitor the training process and identify potential overfitting. You can supply a validation set to model.fit() using its validation_data parameter, or just specify validation_split - a fraction of the training data for Keras to set aside for this purpose (typically 5-10%). Validation metrics are evaluated once at the end of each epoch.

In [7]:
# TODO: Specify training parameters: batch size and number of epochs
batch_size = 512
num_epochs = 8

# TODO(optional): Reserve/specify some training data for validation (not to be used for training)

# TODO: Train your model
model.fit(X_train, y_train, batch_size=batch_size, epochs=num_epochs, verbose=1, validation_data=(X_test, y_test))
Train on 25000 samples, validate on 25000 samples
Epoch 1/8
25000/25000 [==============================] - 75s - loss: 0.6312 - acc: 0.6773 - val_loss: 0.5712 - val_acc: 0.7383
Epoch 2/8
25000/25000 [==============================] - 74s - loss: 0.4074 - acc: 0.8220 - val_loss: 0.3381 - val_acc: 0.8574
Epoch 3/8
25000/25000 [==============================] - 73s - loss: 0.2728 - acc: 0.8927 - val_loss: 0.3313 - val_acc: 0.8562
Epoch 4/8
25000/25000 [==============================] - 73s - loss: 0.2893 - acc: 0.8858 - val_loss: 0.3239 - val_acc: 0.8689
Epoch 5/8
25000/25000 [==============================] - 73s - loss: 0.2192 - acc: 0.9166 - val_loss: 0.3204 - val_acc: 0.8746
Epoch 6/8
25000/25000 [==============================] - 73s - loss: 0.1976 - acc: 0.9268 - val_loss: 0.3293 - val_acc: 0.8715
Epoch 7/8
25000/25000 [==============================] - 74s - loss: 0.1708 - acc: 0.9390 - val_loss: 0.3495 - val_acc: 0.8716
Epoch 8/8
25000/25000 [==============================] - 74s - loss: 0.1507 - acc: 0.9465 - val_loss: 0.3453 - val_acc: 0.8719
Out[7]:
<keras.callbacks.History at 0x7fb13ed71f60>
In [24]:
# Save your model, so that you can quickly load it in future (and perhaps resume training)
model_file = "rnn_model.h5"  # HDF5 file
model.save(os.path.join(cache_dir, model_file))

# Later you can load it using keras.models.load_model()
from keras.models import load_model
model = load_model(os.path.join(cache_dir, model_file))

Once you have trained your model, it's time to see how well it performs on unseen test data.

In [25]:
# Evaluate your model on the test set
scores = model.evaluate(X_test, y_test, verbose=1)  # returns loss and other metrics specified in model.compile()
print("Test accuracy:", scores[1])  # scores[1] should correspond to accuracy if you passed in metrics=['accuracy']
25000/25000 [==============================] - 228s     
Test accuracy: 0.87188

Question: Comparing RNNs and Traditional Methods

How well does your RNN model perform compared to the BoW + Gradient-Boosted Decision Trees?

Answer:

...

Extensions

There are several ways in which you can build upon this notebook. Each comes with its set of challenges, but can be a rewarding experience.

  • The first thing is to try and improve the accuracy of your model by experimenting with different architectures, layers and parameters. How good can you get without taking prohibitively long to train? How do you prevent overfitting?

  • Then, you may want to deploy your model as a mobile app or web service. What do you need to do in order to package your model for such deployment? How would you accept a new review, convert it into a form suitable for your model, and perform the actual prediction? (Note that the same environment you used during training may not be available.)

  • One simplification we made in this notebook is to limit the task to binary classification. The dataset actually includes a more fine-grained review rating that is indicated in each review's filename (which is of the form <[id]_[rating].txt> where [id] is a unique identifier and [rating] is on a scale of 1-10; note that neutral reviews > 4 or < 7 have been excluded). How would you modify the notebook to perform regression on the review ratings? In what situations is regression more useful than classification, and vice-versa?

Whatever direction you take, make sure to share your results and learnings with your peers, through blogs, discussions and participating in online competitions. This is also a great way to become more visible to potential employers!

In [ ]: