Unsupervised Language Modeling at scale for robust sentiment classification
PyTorch Unsupervised Sentiment Discovery
This codebase is part of our effort to reproduce, analyze, and scale the Generating Reviews and Discovering Sentiment paper from OpenAI.
Early efforts have yielded a training time of 5 days on 8 volta-class gpus down from the training time of 1 month reported in the paper.
The learned language model can be transferred to other natural language processing (NLP) tasks where it is used to featurize text samples. The featurizations provide a strong initialization point for discriminative language tasks, and allow for competitive task performance given only a few labeled samples. To illustrate this the model is transferred to the Binary Stanford Sentiment Treebank task.
The model's performance as a whole will increase as it processes more data. However, as is discovered early on in the training process, transferring the model to sentiment classification via Logistic Regression with an L1 penalty leads to the emergence of one neuron (feature) that correlates strongly with sentiment.
This sentiment neuron can be used to accurately and robustly discriminate between postive and negative sentiment given the single feature. A decision boundary of 0 is able to split the distribution of the neuron features into (relatively) distinct positive and negative sentiment clusters.
Samples from the tails of the feature distribution correlate strongly with positive/negative sentiment.
- SentimentDiscovery Package
Install the sentiment_discovery package with
pip setup.py install in order to run the modules/scripts within this repo.
At this time we only support python3. * numpy * pytorch (>= 0.3.0) * pandas * scikit-learn * matplotlib * unidecode
We've included a link to a preprocessed version of the amazon review dataset. We've also included links to both the full dataset, and a pre-sharded (into train/val/test) version meant for lazy evaluation that we used to generate our results. * full[40GB] * lazy shards[37GB]
In addition to providing easily reusable code of the core functionalities of this work in our
sentiment_discovery package, we also provide scripts to perform the three main high-level functionalities in the paper: * unsupervised reconstruction/language modeling of a corpus of text * transfer of learned language model to perform sentiment analysis on a specified corpus * heatmap visualization of sentiment in a body of text, and conversely generating text according to a fixed sentiment
Script results will be saved/logged to the
//* directory hierarchy.
Train a recurrent language model (mlstm/lstm) and save it to the
/ directory in the above hierarchy. The metric histories will be available in the appropriate
/ directory in the same hierarchy. In order to apply
weight_norm only to lstm parameters as specified by the paper we use the
python text_reconstruction.py -experiment_dir ./experiments -experiment_name mlstm -model_dir model -cuda \-embed_size 64 -rnn_size 4096 -layers 1 -weight_norm -lstm_only -rnn_type mlstm -dropout 0 -persist_state 1 \-seq_length 256 -batch_size 32 -lr 0.000125 --optimizer_type=Adam -lr_scheduler LinearLR -epochs 1 \-train data/amazon/reviews.json -valid data/amazon/reviews.json -test data/amazon/reviews.json \-data_set_type unsupervised -lazy -text_key sentence
The training process takes in a json list (or csv) file as a dataset where each entry has key (or column)
text_key containing samples of text to model.
The unsupervised dataset will proportionally take 1002 (1000 train) shards from the training, validation, and test sets. Currently the same dataset must be used for all three.
The dataset entries are transposed to allow for the concatenation of sequences in order to persist hidden state across a shard in the training process.
Hidden states are reset either at the start of every sequence, every shard, or never based on the value of
persist_state. See data flags.
Lastly, We know waiting for more than 1 million updates over 1 epoch for large datasets is a long time. We've set the training script to save the most current model (and data history) every 5000 iterations, and also included a
-max_iters flag to end training early.
The resulting loss curve over 1 (x4 because of data parallelism) million iterations should produce a graph similar to this.
Note: if it's your first time modeling a particular corpus make sure to turn on the
For a list of default values for all flags look at
no_loss - do not compute or track the loss curve of the model *
lr - Starting learning rate. *
lr_scheduler - one of [ExponentialLR,LinearLR] *
lr_factor - factor by which to decay exponential learning rate *
optimizer_type - Class name of optimizer to use as listed in torch.optim class (ie. SGD, RMSProp, Adam) *
clip - Clip gradients to this maximum value. Default: 1. *
epochs - number of epochs to train for *
max_iters - total number of training iterations to run. Takes presedence over number of epochs *
start_epoch - epoch to start training at (used to resume training for exponential decay scheduler) *
start_iter - what iteration to start training at (used mainly for resuming linear learning rate scheduler)
This script parses sample text and binary sentiment labels from csv (or json) data files according to
label_key. The text is then featurized and fit to a logistic regression model.
The index of the feature with the largest L1 penalty is then used as the index of the sentiment neuron. The relevant sentiment feature is extracted from the samples' features and then also fit to a logistic regression model to compare performance.
Below is example output for transfering our trained language model to the Binary Stanford Sentiment Treebank task.
python3 sentiment_transfer.py -experiment_dir ./experiments -experiment_name mlstm -model_dir model -cuda \-train binary_sst/train.csv -valid binary_sst/val.csv -test binary_sst/test.csv \-text_key sentence -label_key label -load_model e0.pt
92.11/90.02/91.27 train/val/test accuracy w/ all_neurons00.12 regularization coef w/ all_neurons00107 features used w/ all neurons all_neuronsneuron(s) 1285, 1503, 3986, 1228, 2301 are top sentiment neurons86.36/85.44/86.71 train/val/test accuracy w/ n_neurons00.50 regularization coef w/ n_neurons00001 features used w/ all neurons n_neurons
These results are saved to
Note: The other neurons are firing, and we can use these neurons in our prediction metrics by setting
-num_neurons to the desired number of heavily penalized neurons. Even if this option is not enabled the logit visualizations/data are always saved for at least the top 5 neurons.
For a list of default values for all involved flags look at
num_neurons - number of neurons to consider for sentiment classification *
no_test_eval - do not evaluate the test data for accuracy (useful when your test set has no labels) *
write_results - write results (output probabilities as well as logits of top neurons) of model on test (or train if none is specified) data to specified filepath. (Can only write results to a csv file)
Transfer to your own data
Analyze sentiment on your own data by using one of the pretrained models to train on the Stanford Treebank (or other sentiment benchmarks) and evaluating on your own
If you don't have labels for your data set make sure to use the
Train a Different Classifier
If you have your own unique classification dataset complete with training labels, you can also use a pretrained model to extract relevant features for classification.
Supply your own
-train (and evaluation) dataset while making sure to specify the right text and label keys.
This script takes a piece of text specified by
text and then visualizes it via a karpathy-style heatmap based on the activations of the specified
neuron. In order to find the neuron to specify, find the index of the neuron with the largest l1 penalty based on the neuron weight visualizations from the previous section.
Without loss of generality we're going to go through this section first referring to OpenAI's neuron weights. For this set of weights the sentiment neuron can be found at neuron 2388.
Armed with this information we can use the script to analyze the following exerpt from the beginning of a review about The League of Extraordinary Gentlemen.
python3 visualize.py -experiment_dir ./experiments -experiment_name mlstm_paper -model_dir model -cuda \-text "25 August 2003 League of Extraordinary Gentlemen: Sean Connery is one of the all time greats \I have been a fan of his since the 1950's." -neuron 2388 -load_model e0.pt
However, not only does this script analyze text, but the user can also specify the begining of a text snippet with
text and generate additional following text up to a total
seq_length. Simply set the
python3 visualize.py -experiment_dir ./experiments -experiment_name mlstm_paper -model_dir model -cuda \-text "25 August 2003 League of Extraordinary Gentlemen: Sean Connery is" \-neuron 2388 -load_model e0.pt -generate
In addition to generating text, the user can also set the sentiment of the generated text to positive/negative by setting
-overwrite to +/- 1. Continuing our previous example:
When training a sentiment neuron with this method, sometimes we learn a positive sentiment neuron, and sometimes a negative sentiment neuron. This doesn’t matter, except for when running the visualization. The
negate flag improves the visualization of negative sentiment models, if you happen to learn one.
python3 visualize.py -experiment_dir ./experiments -experiment_name mlstm -model_dir model -cuda \-text "25 August 2003 League of Extraordinary Gentlemen: Sean Connery is" \-neuron 1285 -load_model e0.pt -generate -negate -overwrite +/-1
The resulting heatmaps are all saved to a png file corresponding of the first 100 characters of (generated or analyzed) text at
For a list of default values for all flags look at
text - whole or initial text for visualization *
neuron - index of sentiment neuron *
generate - generates text following initial text up to a total length of
temperature - temperature from sampling from language model while generating text *
overwrite - For generated portion of text overwrite the neuron's value while generating *
negate - If
neuron corresponds to a negative sentiment neuron rather than positive sentiment *
layer - layer of recurrent net to extract neurons from
python text_reconstruction.py -experiment_dir ./experiments -experiment_name mlstm -model_dir model -cuda \-embed_size 64 -rnn_size 4096 -layers 1 -weight_norm -lstm_only -rnn_type mlstm -dropout 0 -persist_state 1 \-seq_length 256 -batch_size 32 -lr 0.00025 --optimizer_type=Adam -lr_scheduler LinearLR -epochs 1 \-train data/amazon/reviews.json -valid data/amazon/reviews.json -test data/amazon/reviews.json \-data_set_type unsupervised -lazy -text_key sentence -num_gpus 2 -distributed
In order to utilize Data Parallelism during training time ensure that
-cuda is in use and that
-num_gpus >1. As mentioned previously, vanilla DataParallelism produces no speedup for recurrent architectures. In order to circumvent this problem turn on the
-distributed flag to utilize PyTorch's DistributedDataParallel instead and experience speedup gains.
Also make sure to scale the learning rate as appropriate. Note how we went from
2.5e-4 learning rates when we doubled the number of gpus.
Note: validation and test datasets are currently not functioning properly for unsupervised reconstruction.
sentiment_discovery.modules: implementation of multiplicative LSTM, stacked LSTM.
sentiment_discovery.model: implementation of module for processing text, and model wrapper for running experiments
sentiment_discovery.reparametrization: implementation of weight reparameterization abstract class, and weight norm reparameterization.
sentiment_discovery.data_utils: implementations of datasets, data loaders, data samplers, data preprocessing utils.
sentiment_discovery.neuron_transfer: contains utilities for extracting neurons and featurizing text with the values of a model's neurons.
sentiment_discovery.learning_rates: contains implementations of learning rate schedules (ie. linear and exponential decay)
experiment_dir- Root directory for saving results, models, and figures
experiment_name- Name of experiment used for logging to
should_test- whether to train or evaluate a model
embed_size- embedding size for data
rnn_type- one of <
rnn_size- hidden state dimension of rnn
layers- Number of stacked recurrent layers
dropout- dropout probability after hidden state (0 = None, 1 = all dropout)
weight_norm- boolean flag, which, when enabled, will apply weight norm only to the recurrent (m)LSTM parameters
-weight_normis applied to the model, apply it to the lstm parmeters only
model_dir- subdirectory where models are saved in
load_model- a specific checkpoint file to load from
batch_size- minibatch size
data_size- dimension of each data point
seq_length- length of time sequence for reconstruction/generation (sentiment inference has no maximum sequence length)
data_set_type- json or csv for sentiment transfer or unsupervised (only json available) for unsupervised reconstruction
persist_state- degree to which to persist state across samples for unsupervised learnign. 0 = reset state after every sample, 1 = reset state after finishing a shard, -1 = never reset state
transpose- transpose dataset, given batch size. Is always on for unsupervised learning, (should probably only be used for unsupervised).
no_wrap- In order to better concatenate strings together for unsupervised learning the batch data sampler does not drop the last batch, it instead "wraps" the dataset around to fill in the last batch of an epoch. On subsequent calls to the batch sampler the data set will have been shifted to account for this "wrapping" and allow for proper hidden state persistence. This is turned on by default in non-unsupervised scripts.
cache- cache recently used samples, and preemptively cache future data samples. Helps avoid thrashing in large datasets. Note: currently buggy and provides no performance benefit
Data Processing Flags
lazy- lazily load dataset (necessary for large datasets such as amazon)
preprocess- load dataset into memory and preprocess it according to section 4 of the paper. (should only be called once on a dataset)
shuffle- shuffle unsupervised dataset.
text_key- column name/dictionary key for extacting text samples from csv/json files.
label_key- column name/dictionary key for extacting sample labels from csv/json files.
delim- column delimiter for csv tokens (eg. ',', '\t', '|')
drop_unlabeled- drops unlabeled samples from csv/json file
binarize_sent- if sentiment labels are not 0 or 1, then binarize them so that the lower half of sentiment ratings get set to 0, and the upper half of the rating scale gets set to 1.
Dataset Path Flags
train- path to training set
split- comma-separated list of proportions for splitting the training set into training, validation, and test sets
valid- path to validation set
test- path to test set
cuda- utilize gpus to process strings. Should be on as cpu is too slow to process this.
num_gpus- number of gpus available
rank- distributed process index.
distributed- train with data parallelism distributed across multiple processes
world_size- number of distributed processes to run. Do not set. The scripts will automatically set it equal to
num_gpus(one gpu per process).
verbose- 2 = enable stdout for all processes (including all the data parallel workers). 1 = enable stdout for master worker. 0 = disable all stdout
seed- random seed for pytorch random operations
- Why Unsupervised Language Modeling?
- Reproducing Results
- Data Parallel Scalability
- Open Questions
Thanks to @guillette for providing a lightweight pytorch port of the original weights.
This project uses the amazon review dataset collected by J. McAuley
Want to help out? Open up an issue with questions/suggestions or pull requests ranging from minor fixes to new functionality.
May your learning be Deep and Unsupervised.
To restore the repository download the bundle
git clone NVIDIA-sentiment-discovery_-_2017-12-07_19-30-58.bundle
Upload date: 2017-12-07
- 2017-12-07 19:54:15
- 2017-12-07 19:30:58
- Internet Archive Python library 1.5.0
- iagitup - v1.0