
Machine Comprehension – 2

I have been interested in exploring the direction of machine comprehension for a long time, and the new year seemed like the perfect moment to start. This post discusses the progress we have made on content-preserving text generation with attribute controls. Such architectures let us rewrite sentences to match target attributes like sentiment, tense, and voice while preserving the original content.

Query: “The people behind the counter were not friendly whatsoever.”

Query-sentiment: “Negative”

Output: “The people at the counter were very friendly and helpful.”

Output-sentiment: “Positive”

Suppose we want to build a model that takes text of sentiment x and outputs text of sentiment y. We will use the Yelp reviews dataset for this purpose.

One approach to this problem is to first build a seq2seq autoencoder. The only thing we need to do differently is to add a sentiment word vector to the embedding matrix.

pos_label = [1, 0, 0, …]  # vector of length emb_size

neg_label = [2, 0, 0, …]  # vector of length emb_size

For positively labelled sentences, we prepend the pos_label token, and for negative sentences the neg_label token. This extra label lets us control the output later, when we want to encode a negative sentence but decode it with a positive target label. At that point we construct the text in this fashion.

input = [pos_label, The, people, behind, the, counter, were, not, friendly, whatsoever, .]

output = [The, people, behind, the, counter, were, not, friendly, whatsoever, .]
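As a minimal sketch in pure Python (the `pos_label`/`neg_label` strings stand in for the special embedding rows described above; `build_labeled_input` is a hypothetical helper, not from the original pipeline):

```python
def build_labeled_input(tokens, target_sentiment):
    """Return (input, output) token sequences for the seq2seq autoencoder.

    The target sentiment label is prepended to the input so the encoder
    sees the attribute as the first "word"; the output is the bare sentence.
    """
    label = "pos_label" if target_sentiment == "positive" else "neg_label"
    return [label] + tokens, list(tokens)

tokens = ["The", "people", "behind", "the", "counter",
          "were", "not", "friendly", "whatsoever", "."]
inp, out = build_labeled_input(tokens, "positive")
# inp starts with "pos_label"; out is the original token sequence
```

At inference time we simply swap the label to the target sentiment while keeping the sentence tokens unchanged.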

Once we have the encoder, we need a classifier trained on the labelled Yelp data. This classifier can be anything as long as it is accurate. Let's take the ULMFiT model trained on Yelp, since it has an error rate of just 2.16%.

Once both components are ready, we can connect them in a pipeline and force the seq2seq model to produce output matching the target class.


  • The embedding layer is kept untrainable: pre-trained GloVe-100 embeddings were taken and fine-tuned for the task beforehand, so we are not interested in learning better embeddings here.
  • The seq2seq model is kept trainable, since it has to learn to output sentences matching the target task. The encoder learns to encode the sentence together with the target label, and the decoder learns to output a sentence that the classifier assigns to the class indicated by pos_label.
  • The ULMFiT classifier is kept untrainable (frozen), so its gradients only guide the generator.

As we can see, if we feed a negative sentence along with pos_label, the model is able to output a positive sentence. But the content of the generated sentence differs from the input: the input sentence was about the "counter people" while the output sentence is about the "restaurant". This means our network has learned to cheat its way to the required output. Since the only constraint on the generator is to produce a sentence of the given label, it is free to generate any such sentence. If we want to keep the content, we must enforce content preservation as well.

Content preserving text generation with attribute controls

Now we create a network with an encoder, a decoder, and a discriminator, as described in this paper, which was accepted at NeurIPS 2018 and set a new benchmark.

Autoencoding loss

Let x be a sentence and the corresponding attribute vector be l. Let z_x = G_\text{enc}(x) be the encoded representation of x. Since sentence x should have high probability under G(\cdot|z_x,l), we enforce this constraint using an auto-encoding loss.

    \[ \mathcal{L}^{ae} (x,l) = -\log p_G(x|z_x, l) \]
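Since the decoder is auto-regressive, this loss is just the sum of per-token negative log-probabilities of the ground-truth sentence. A minimal NumPy sketch, assuming we already have the decoder's probability for each true token (the `probs` values below are hypothetical):

```python
import numpy as np

def autoencoding_loss(token_probs):
    """L^ae(x, l) = -log p_G(x | z_x, l), computed as the sum of
    per-token negative log-probabilities under the decoder."""
    return -np.sum(np.log(token_probs))

# Hypothetical decoder probabilities for each ground-truth token of x.
probs = np.array([0.9, 0.8, 0.95])
loss = autoencoding_loss(probs)
```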

Back-translation loss

Consider l', an arbitrary attribute vector different from l (i.e., corresponds to a different set of attribute values). Let y \sim p_G(\cdot | z_x, l') be a generated sentence conditioned on x, l'. Assuming a well-trained model, the sampled sentence y will preserve the content of x. In this case, sentence x should have high probability under p_G(\cdot|z_y,l) where z_y = G_\text{enc}(y) is the encoded representation of sentence y. This requirement can be enforced in a back-translation loss as follows.

    \[ \mathcal{L}^{bt} (x,l) = -\log p_G(x|z_y, l) \]

A common pitfall of the auto-encoding loss in auto-regressive models is that the model learns to simply copy the input sequence without capturing any informative features in the latent representation. A de-noising formulation is often considered where noise is introduced to the input sequence by deleting, swapping or re-arranging words. On the other hand, the generated sample y can be mismatched in content from x during the early stages of training, so that the back-translation loss can potentially misguide the generator. We address these issues by interpolating the latent representations of ground truth sentence x and generated sentence y.

Interpolated reconstruction loss

We merge the autoencoding and back-translation losses by fusing the two latent representations z_x, z_y. We consider z_{xy} = g \odot z_x + (1 - g) \odot z_y, where g is a binary random vector whose entries are sampled from a Bernoulli distribution with parameter \Gamma. We define a new reconstruction loss which uses z_{xy} to reconstruct the original sentence.

    \[\mathcal{L}^{int} = \mathbb{E}_{(x,l)\sim p_\text{data}, y \sim p_G(\cdot | z_x,l')} [-\log p_G(x|z_{xy},l)]\]
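The interpolation step itself is simple: each dimension of the fused latent comes from either z_x or z_y, chosen by an independent Bernoulli coin flip. A NumPy sketch (the all-ones/all-zeros latents are placeholders for real encoder outputs):

```python
import numpy as np

def interpolate_latents(z_x, z_y, gamma, rng):
    """z_xy = g * z_x + (1 - g) * z_y, with g ~ Bernoulli(gamma) per dimension."""
    g = rng.random(z_x.shape) < gamma  # binary gate vector
    return np.where(g, z_x, z_y)

rng = np.random.default_rng(0)
z_x = np.ones(500)    # encoding of the ground-truth sentence x
z_y = np.zeros(500)   # encoding of the generated sentence y
z_xy = interpolate_latents(z_x, z_y, gamma=0.5, rng=rng)
```

With \Gamma = 1 this recovers the pure autoencoding loss, and with \Gamma = 0 the pure back-translation loss, so the hyperparameter smoothly trades off the two objectives.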

Attribute compatibility

We consider an adversarial loss which encourages generating realistic and attribute-compatible sentences. The adversarial loss tries to match the distribution of sentence and attribute vector pairs (s, a), where the sentence can be either a real or a generated sentence. Let h_x and h_y be the decoder hidden-state sequences corresponding to x and y, respectively.

    \[\mathcal{L}^\text{adv} = \min_G \max_D \mathbb{E}_{(x,l)\sim p_\text{data}, y \sim p_G(\cdot | z_x,l')} [\log D(h_x, l) + \log(1 - D(h_y, l'))]\]

It is possible that the discriminator ignores the attributes and makes the real/fake decision based on just the hidden states, or vice versa. To prevent this situation, we consider additional fake pairs (x,l') where we consider a real sentence and a mismatched attribute vector, and encourage the discriminator to classify these pairs as fake. The new objective takes the following form.

    \[ \mathcal{L}^\text{adv} = \min_G \max_D \mathbb{E}_{\begin{subarray}{l}(x,l) \sim p_\text{data}\\y \sim p_G(\cdot|z_x, l')\end{subarray}} [2 \log D(h_x, l) + \log(1 - D(h_y, l')) + \log(1 - D(h_x, l'))]\]

The discriminator architecture follows the projection discriminator,

    \[D(s, l) = \sigma(l_v^T W \phi(s) + v^T \phi(s))\]

where l_v represents the binary attribute vector corresponding to l. \phi is a bi-directional RNN encoder (\phi(\cdot) represents the final hidden state). W, v are learnable parameters and \sigma is the sigmoid function. The overall loss function is given by \mathcal{L}^\text{int} + \lambda \mathcal{L}^\text{adv} where \lambda is a hyperparameter.
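A minimal NumPy sketch of the projection discriminator's forward pass. The sizes and the random placeholder parameters below are assumptions for illustration; in the real model W and v are learned and \phi(s) comes from the bi-directional RNN:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def projection_discriminator(phi_s, l_v, W, v):
    """D(s, l) = sigma(l_v^T W phi(s) + v^T phi(s)).

    The first term projects the encoded sentence onto the attribute
    vector; the second is an unconditional real/fake score.
    """
    return sigmoid(l_v @ W @ phi_s + v @ phi_s)

rng = np.random.default_rng(0)
phi_s = rng.normal(size=1000)           # final bi-RNN state (2 x 500 directions)
l_v = np.array([1.0, 0.0])              # binary attribute vector, e.g. [pos, neg]
W = 0.01 * rng.normal(size=(2, 1000))   # learnable projection matrix
v = 0.01 * rng.normal(size=1000)        # learnable score vector
score = projection_discriminator(phi_s, l_v, W, v)
```

The output is a probability in (0, 1) that the (sentence, attribute) pair is real and compatible.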

Training and hyperparameters

We use the following validation metrics for model selection. The autoencoding loss \mathcal{L}^{ae} is used to measure how well the model generates content-compatible sentences. Attribute compatibility is measured by generating sentences conditioned on a set of labels and using pre-trained attribute classifiers to measure how well the samples match the conditioning labels.

For all tasks we use a GRU (Gated Recurrent Unit) RNN with hidden state size 500 as the encoder G_\text{enc}. Attribute labels are represented as a binary vector, and an attribute embedding is constructed via linear projection. The decoder G_\text{dec} is initialized using a concatenation of the representation coming from the encoder and the attribute embedding. Attribute embeddings of size 200 and a decoder GRU with hidden state size 700 were used. The discriminator receives an RNN hidden state sequence and an attribute vector as input. The hidden state sequence is encoded using a bi-directional RNN \phi with hidden state size 500. The interpolation probability \Gamma \in \{0, 0.1, 0.2, \ldots, 1.0\} and the weight of the adversarial loss \lambda \in \{0.5, 1.0, 1.5\} are chosen based on the validation metrics above. Word embeddings are initialized with pre-trained GloVe embeddings.


To quantitatively evaluate how well the generated samples match the conditioning labels, we adopt a protocol similar to Toward Controlled Generation of Text. We generate samples from the model and measure label accuracy using a pre-trained sentiment classifier. For the sentiment experiments, the pre-trained classifiers are CNNs trained to perform sentiment analysis at the review level on the Yelp and IMDB datasets. The classifiers achieve test accuracies of 95% and 90% on the respective datasets.


Generated output

Query: "John was born in the camp"

Tense        Voice    Negation  Generated output
Past         Passive  No        john was born in the camp .
Past         Passive  Yes       john wasn't born in the camp .
Past         Active   No        john had lived in the camp .
Past         Active   Yes       john didn't live in the camp .
Present      Passive  No        john is born in the camp .
Present      Passive  Yes       john isn't born in the camp .
Present      Active   No        john has lived in the camp .
Present      Active   Yes       john doesn't live in the camp .
Future       Passive  No        john will be born in the camp .
Future       Passive  Yes       john will not be born in the camp .
Future       Active   No        john will live in the camp .
Future       Active   Yes       john will not survive in the camp .
Conditional  Passive  No        john could be born in the camp .
Conditional  Passive  Yes       john couldn't live in the camp .
Conditional  Active   No        john could live in the camp .
Conditional  Active   Yes       john couldn't live in the camp .

Generated output comparison table

Restaurant reviews

negative → positive
  Query:              the people behind the counter were not friendly whatsoever .
  Ctrl-gen:           the food did n't taste as fresh as it could have been either .
  Cross-align:        the owners are the staff is so friendly .
  Content preserving: the people at the counter were very friendly and helpful .

positive → negative
  Query:              they do an exceptional job here , the entire staff is professional and accommodating !
  Ctrl-gen:           very little water just boring ruined !
  Cross-align:        they do not be back here , the service is so rude and do n't care !
  Content preserving: they do not care about customer service , the staff is rude and unprofessional !

Movie reviews

negative → positive
  Query:              once again , in this short , there isn't much plot .
  Ctrl-gen:           it's perfectly executed with some idiotically amazing directing .
  Cross-align:        but , , the film is so good , it is .
  Content preserving: first off , in this film , there is nothing more interesting .

positive → negative
  Query:              that's another interesting aspect about the film .
  Ctrl-gen:           peter was an ordinary guy and had problems we all could with
  Cross-align:        it's the and the plot .
  Content preserving: there's no redeeming qualities about the film .
