I have been interested in exploring machine comprehension for a long time, and the new year seemed like the perfect time to start. This post discusses the progress made on content-preserving text generation with attribute controls. Such architectures let us rewrite sentences to target attributes like sentiment, tense and voice while preserving the original content.
Query: “The people behind the counter were not friendly whatsoever.”
Output: “The people at the counter were very friendly and helpful.”
Suppose we want to build a model which takes a text of sentiment x and outputs a text of sentiment y. We will use the Yelp reviews dataset for this purpose.
One approach to this problem is to first build a seq2seq autoencoder. The only thing we need to do differently is to add a sentiment word vector to the embedded input sequence.
pos_label = [1, 0, 0, … emb_size]
neg_label = [2, 0, 0, … emb_size]
For positive-labelled sentences, we prepend pos_label to the text, and likewise neg_label for negative ones. This extra label lets us control the output later, when we want to encode a negative sentence with a positive target sentiment. At that time we construct the text in this fashion:
input = [pos_label, The, people, behind, the, counter, were, not, friendly, whatsoever, .]
output = [The, people, behind, the, counter, were, not, friendly, whatsoever, .]
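As a sketch, the label-prepending step might look like the following (a toy numpy version with hypothetical helper names; the real model would prepend the label inside a trained embedding layer rather than using random vectors):

```python
import numpy as np

emb_size = 100  # matches the pre-trained GloVe 100 embeddings used for the task

def label_vector(sentiment, emb_size):
    # pos_label = [1, 0, 0, ...], neg_label = [2, 0, 0, ...] as above
    v = np.zeros(emb_size)
    v[0] = 1.0 if sentiment == "pos" else 2.0
    return v

def build_input(token_embeddings, target_sentiment):
    # token_embeddings: (seq_len, emb_size) word vectors of the sentence;
    # the target label becomes the first "word" the encoder sees.
    lbl = label_vector(target_sentiment, token_embeddings.shape[1])
    return np.vstack([lbl[None, :], token_embeddings])

tokens = np.random.randn(10, emb_size)  # "The people behind ... whatsoever ."
inp = build_input(tokens, "pos")
print(inp.shape)
```

The decoder is then trained to emit the original tokens without the label, as in the input/output pair above.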
Once we have the encoder, we need a classifier trained on the labelled Yelp data. This classifier can be anything, as long as it is accurate. Let's take the ULMFiT model trained on Yelp, which has an error rate of just 2.16%.
Once both components are ready, we can connect them in a pipeline and force the seq2seq model to produce output matching the target class.
- The embedding layer is kept untrainable: we are not interested in learning better embeddings, since pre-trained GloVe 100 vectors, already fine-tuned for the task, were used.
- The seq2seq model is kept trainable, as it has to learn to output sentences according to the target task. The encoder learns to encode the sentence along with the target label, and the decoder learns to output a sentence that is classified in accordance with pos_label.
- The ULMFiT classifier is kept trainable.
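The pipeline's training signal can be sketched as a reconstruction term plus a classification term on the target label. A minimal numpy sketch, assuming `recon_loss` comes from the seq2seq decoder and `clf_probs` from the sentiment classifier; `alpha` is a hypothetical weighting hyperparameter, not something specified above:

```python
import numpy as np

def pipeline_loss(recon_loss, clf_probs, target_label, alpha=1.0):
    # clf_probs: the classifier's predicted class probabilities for the
    # generated sentence; we penalise low probability on the target class.
    clf_loss = -np.log(clf_probs[target_label] + 1e-12)
    return recon_loss + alpha * clf_loss

# Toy numbers: reconstruction loss 2.5, and the classifier gives the
# target (positive) class probability 0.9.
loss = pipeline_loss(2.5, np.array([0.1, 0.9]), target_label=1)
print(round(loss, 4))
```

Gradients from the classification term flow back through the generated sentence into the seq2seq model, which is what pushes its outputs toward the target class.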
As we can see, if we feed a negative sentence along with pos_label, the model is able to output a positive sentence. But the content of the generated sentence differs from the input: the input sentence was about the "counter people" while the output sentence is about the "restaurant". This means our network has learned to cheat to satisfy the objective. Since the only constraint on the generator is to produce a sentence of the given label, it has the freedom to generate any such sentence. If we want to preserve the content, we need to enforce content preservation explicitly.
Now we create a network using an encoder and a decoder along with a discriminator, as described in the paper Content Preserving Text Generation with Attribute Controls, which was accepted at NeurIPS 2018 and set a new benchmark.
Let $x$ be a sentence and $l$ the corresponding attribute vector. Let $z_x$ be the encoded representation of $x$. Since sentence $x$ should have high probability under $p(x \mid z_x, l)$, we enforce this constraint using an auto-encoding loss:

$$\mathcal{L}_{ae} = -\log p(x \mid z_x, l)$$
Consider $l'$, an arbitrary attribute vector different from $l$ (i.e., $l'$ corresponds to a different set of attribute values). Let $y$ be a generated sentence conditioned on $(z_x, l')$. Assuming a well-trained model, the sampled sentence $y$ will preserve the content of $x$. In this case, sentence $x$ should have high probability under $p(x \mid z_y, l)$, where $z_y$ is the encoded representation of sentence $y$. This requirement can be enforced with a back-translation loss:

$$\mathcal{L}_{bt} = -\log p(x \mid z_y, l)$$
A common pitfall of the auto-encoding loss in auto-regressive models is that the model learns to simply copy the input sequence without capturing any informative features in the latent representation. A de-noising formulation is often used instead, where noise is introduced into the input sequence by deleting, swapping or re-arranging words. On the other hand, the generated sample (call it $y$) can be mismatched in content from the input $x$ during the early stages of training, so the back-translation loss can potentially misguide the generator. We address both issues by interpolating the latent representations of the ground-truth sentence $x$ and the generated sentence $y$.
Interpolated reconstruction loss
We merge the auto-encoding and back-translation losses by fusing the two latent representations $z_x$ and $z_y$ into $z_{xy} = g \odot z_x + (1 - g) \odot z_y$, where $g$ is a binary random vector whose entries are sampled from a Bernoulli distribution with parameter $\gamma$. We define a new reconstruction loss which uses $z_{xy}$ to reconstruct the original sentence:

$$\mathcal{L}_{int} = -\log p(x \mid z_{xy}, l)$$
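The Bernoulli-masked fusion of the auto-encoding representation and the back-translation representation can be sketched in a few lines of numpy (the toy vectors are for illustration only; `gamma` is the Bernoulli parameter from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

def interpolate(z_x, z_y, gamma, rng):
    # g is a binary mask: each coordinate of the fused representation is
    # taken from z_x with probability gamma, else from z_y.
    g = rng.random(z_x.shape) < gamma
    return np.where(g, z_x, z_y)

z_x = np.ones(8)    # stand-in for the encoding of the ground-truth sentence
z_y = np.zeros(8)   # stand-in for the encoding of the generated sentence
z_xy = interpolate(z_x, z_y, gamma=0.5, rng=rng)
print(z_xy)
```

Early in training the mask mostly exposes the reliable ground-truth coordinates; as the generator improves, the coordinates coming from the generated sentence become informative too.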
We consider an adversarial loss which encourages generating realistic and attribute-compatible sentences. The adversarial loss tries to match the distribution of sentence and attribute vector pairs, where the sentence can be either a real or a generated one. Let $h_x$ and $h_y$ be the decoder hidden-state sequences corresponding to $x$ and $y$ respectively.
It is possible that the discriminator ignores the attributes and makes the real/fake decision based on the hidden states alone, or vice versa. To prevent this, we consider additional fake pairs $(h_x, l')$, which pair the decoder states of a real sentence with a mismatched attribute vector $l'$, and encourage the discriminator $D$ to classify these pairs as fake. The new objective takes the following form:

$$\mathcal{L}_{adv} = \mathbb{E}\big[\log D(h_x, l)\big] + \tfrac{1}{2}\,\mathbb{E}\big[\log\big(1 - D(h_y, l')\big)\big] + \tfrac{1}{2}\,\mathbb{E}\big[\log\big(1 - D(h_x, l')\big)\big]$$
The discriminator architecture follows the projection discriminator:

$$D(h, l) = \sigma\big(l^\top W E_D(h) + v^\top E_D(h) + b\big)$$

where $l$ represents the binary attribute vector corresponding to the hidden-state sequence $h$, $E_D$ is a bi-directional RNN encoder ($E_D(h)$ denotes its final hidden state), $W$, $v$ and $b$ are learnable parameters, and $\sigma$ is the sigmoid function. The overall loss function is $\mathcal{L} = \mathcal{L}_{int} + \lambda \mathcal{L}_{adv}$, where $\lambda$ is a hyperparameter.
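A rough numpy sketch of a projection-style discriminator score (dimensions and the random parameters are purely illustrative; `phi` stands in for the final hidden state of the bi-directional RNN encoder over the decoder states):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def discriminator(phi, l, W, v, b):
    # Projection-style score: an attribute-conditioned bilinear term
    # l^T W phi plus an unconditional linear term v^T phi, squashed to (0, 1).
    return sigmoid(l @ W @ phi + v @ phi + b)

rng = np.random.default_rng(1)
d_hid, n_attr = 6, 3
phi = rng.standard_normal(d_hid)       # encoded hidden-state sequence
l = np.array([1.0, 0.0, 1.0])          # binary attribute vector
W = rng.standard_normal((n_attr, d_hid))
v = rng.standard_normal(d_hid)
score = discriminator(phi, l, W, v, b=0.0)
print(score)
```

The projection term is what forces the discriminator to judge realism jointly with the attributes rather than from the hidden states alone.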
Training and hyperparameters
We use the following validation metrics for model selection. The auto-encoding loss measures how well the model generates content-compatible sentences. Attribute compatibility is measured by generating sentences conditioned on a set of labels and using pre-trained attribute classifiers to check how well the samples match the conditioning labels.
For all tasks we use a GRU (Gated Recurrent Unit) RNN with hidden state size 500 as the encoder. Attribute labels are represented as a binary vector, and an attribute embedding is constructed via linear projection. The decoder is initialized using a concatenation of the representation coming from the encoder and the attribute embedding. Attribute embeddings of size 200 and a decoder GRU with hidden state size 700 were used. The discriminator receives an RNN hidden-state sequence and an attribute vector as input; the hidden-state sequence is encoded using a bi-directional RNN with hidden state size 500. The interpolation probability and the weight of the adversarial loss are chosen based on the validation metrics above. Word embeddings are initialized with pre-trained GloVe embeddings.
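One sanity check on these numbers: the 700-dimensional decoder hidden state is exactly the concatenation of the 500-dimensional encoder representation and the 200-dimensional attribute embedding:

```python
import numpy as np

# Zero vectors used purely to check that the dimensions line up.
enc_state = np.zeros(500)   # encoder GRU final state
attr_emb = np.zeros(200)    # linearly projected attribute embedding
dec_init = np.concatenate([enc_state, attr_emb])
print(dec_init.shape)       # must match the decoder GRU hidden size (700)
```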
To quantitatively evaluate how well the generated samples match the conditioning labels, we adopt a protocol similar to Toward Controlled Generation of Text. We generate samples from the model and measure label accuracy using a pre-trained sentiment classifier. For the sentiment experiments, the pre-trained classifiers are CNNs trained to perform sentiment analysis at the review level on the Yelp and IMDB datasets. The classifiers achieve test accuracies of 95% and 90% on the respective datasets.
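The attribute-compatibility number itself is just label-match accuracy over generated samples; a minimal sketch with made-up labels:

```python
import numpy as np

def attribute_accuracy(predicted_labels, conditioning_labels):
    # Fraction of generated samples whose predicted label (from the
    # pre-trained classifier) matches the label they were conditioned on.
    predicted = np.asarray(predicted_labels)
    target = np.asarray(conditioning_labels)
    return float((predicted == target).mean())

acc = attribute_accuracy([1, 1, 0, 1], [1, 0, 0, 1])
print(acc)
```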
| Tense | Voice | Negative | John was born in the camp |
| --- | --- | --- | --- |
| Past | Passive | No | john was born in the camp . |
| Past | Passive | Yes | john wasn't born in the camp . |
| Past | Active | No | john had lived in the camp . |
| Past | Active | Yes | john didn't live in the camp . |
| Present | Passive | No | john is born in the camp . |
| Present | Passive | Yes | john isn't born in the camp . |
| Present | Active | No | john has lived in the camp . |
| Present | Active | Yes | john doesn't live in the camp . |
| Future | Passive | No | john will be born in the camp . |
| Future | Passive | Yes | john will not be born in the camp . |
| Future | Active | No | john will live in the camp . |
| Future | Active | Yes | john will not survive in the camp . |
| Cond | Passive | No | john could be born in the camp . |
| Cond | Passive | Yes | john couldn't live in the camp . |
| Cond | Active | No | john could live in the camp . |
| Cond | Active | Yes | john couldn't live in the camp . |
Generated output comparison table
negative --> positive

| Model | Sentence |
| --- | --- |
| Query | the people behind the counter were not friendly whatsoever. |
| Ctrl-gen | the food did n't taste as fresh as it could have been either . |
| Cross-align | the owners are the staff is so friendly . |
| Content preserving | the people at the counter were very friendly and helpful . |

positive --> negative

| Model | Sentence |
| --- | --- |
| Query | they do an exceptional job here , the entire staff is professional and accommodating ! |
| Ctrl-gen | very little water just boring ruined ! |
| Cross-align | they do not be back here , the service is so rude and do n't care ! |
| Content preserving | they do not care about customer service , the staff is rude and unprofessional ! |

negative --> positive

| Model | Sentence |
| --- | --- |
| Query | once again , in this short , there isn't much plot . |
| Ctrl-gen | it's perfectly executed with some idiotically amazing directing . |
| Content preserving | first off , in this film , there is nothing more interesting . |

positive --> negative

| Model | Sentence |
| --- | --- |
| Query | that's another interesting aspect about the film . |
| Ctrl-gen | peter was an ordinary guy and had problems we all could |
| Content preserving | there's no redeeming qualities about the film . |