Power of Transfer Learning in NLP

Transfer Learning in NLP

In this post, we will understand the true power of transfer learning in NLP, why it matters, and how it compares with the recurrent architectures from previous posts, using a dataset of Tweets on US Airlines.

All the code is implemented in Jupyter notebooks using Keras, PyTorch, Flair, fastai and allennlp.

All the code can be run on Google Colab (link provided in the notebook).

Hey yo, but how?

Well, sit tight and buckle up. I will go through everything in detail.


NLP Tasks and Datasets

The ultimate goal is to make machines understand language (natural language understanding) as we humans do. These are some of the tasks which need to be accomplished for machines to be able to comprehend natural language as we do.

Sentiment analysis


Sentiment analysis is the task of classifying the polarity of a given text.


Current SoTA : Sentiment Analysis

Sample Example

  • Input

Sentence: Avengers Endgame is the best movie. Kudos Russo brothers.

  • Output

Positive (100% accuracy) (sentiment)



A Part-Of-Speech Tagger (POS Tagger) is a piece of software that reads text in some language and assigns a part of speech, such as noun, verb or adjective, to each word (and other tokens).


Current SoTA : Part-of-speech tagging

Sample Example

  • Input

Sentence: Apple is looking at buying U.K. startup for $1 billion

  • Output

Note: Output obtained from spaCy POS Tagging. Try now!



Named Entity Recognition (NER) labels sequences of words in a text which are the names of things, such as person and company names, or gene and protein names.


Current SoTA : Named entity recognition

Sample Example

  • Input

Sentence: Apple is looking at buying U.K. startup for $1 billion

  • Output

Note: Output obtained from spaCy Named Entities. Try now! Also, here is a live demo from Allennlp for Named Entity Recognition.

Textual Entailment


Textual Entailment (TE), also known as Natural Language Inference (NLI), takes a pair of sentences and predicts whether the facts in the first necessarily imply the facts in the second. Put differently, it is the task of determining whether a “hypothesis” is true (entailment), false (contradiction), or undetermined (neutral) given a “premise”.


Current SoTA : Natural language inference

Sample Example

  • Input

Premise : If you help the needy, God will reward you.

Hypothesis : Giving money to the poor has good consequences.

  • Output

Note: Here is a live demo from Allennlp for Textual Entailment.

Coreference resolution


Coreference resolution is the task of finding all expressions that refer to the same entity in a text.

For example: The trophy would not fit in the brown suitcase because it was too big (small). What was too big (small)?

Answer 0: the trophy

Answer 1: the suitcase


Current SoTA : Coreference resolution

Sample Example

  • Input

Sentence: The trophy would not fit in the brown suitcase because it was too big. What was too big? the trophy or the suitcase?

  • Output

Note: Here is a live demo from huggingface for Coreference resolution. Also check Winograd Challenge. Here is a live demo from Allennlp for Coreference resolution.

Question Answering


Reading comprehension, or Question Answering, is the task of answering questions about a passage of text to show that the system understands the passage.


and many more!

Current SoTA : Question Answering

Sample Example

Note: Here is a live demo from Allennlp for QA.

There are many more challenges, and nlpprogress provides a great overview of the challenges and the current SOTA for each of them. Be sure to check it out! Here is a list of 34 datasets from the Allen Institute for Artificial Intelligence.

Transfer Learning in NLP

A long time ago in a galaxy far, far away….

I-know-everything: Today’s topic is a very interesting one: Transfer Learning in NLP. Can we transfer the knowledge learned about a language and fine-tune it for the task at hand? It’s the same concept we saw in Power of Transfer Learning for Computer Vision.

I-know-nothing: Will we be using the same embedding models which we learned about in previous posts? Will transfer learning in NLP be the same as in CV, i.e. train on some large dataset and finetune on some target data?

I-know-everything: Well, there’s a catch. The answer to your first question is no: we will not be using traditional embedding models. The answer to your second question is yes.

The embedding models which we discussed earlier, like word2vec, GloVe and fastText, are fantastic at capturing the meaning of individual words and their relationships by leveraging large datasets. These models generate n-dimensional word vectors which a neural network uses as the starting point of training. The word vectors can be initialized to lists of random numbers before a model is trained for a specific task, or initialized with word vectors obtained from the above embedding models.

Here is one such relationship learned through embeddings,

How amazingly word2vec learns the capitals and their relation to the countries in the first example! With simple vector arithmetic, b - a + c gives the correct answer: for France : Paris :: Japan : ?, computing Paris - France + Japan lands closest to Tokyo. So cool.
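To make the arithmetic concrete, here is a minimal sketch of the analogy lookup. The 3-dimensional vectors are hand-made for illustration (real word2vec vectors have hundreds of dimensions), and the names `toy_vectors` and `analogy` are hypothetical helpers, not part of any library.

```python
import math

# Hand-made toy "embeddings" (an assumption for illustration, not real word2vec output).
toy_vectors = {
    "France": [1.0, 0.0, 0.0],
    "Paris":  [1.0, 1.0, 0.0],
    "Japan":  [0.0, 0.0, 1.0],
    "Tokyo":  [0.0, 1.0, 1.0],
    "Berlin": [0.5, 1.0, -1.0],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c, vectors):
    """Solve a : b :: c : ? by finding the word closest to b - a + c."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # Exclude the three query words themselves from the candidates.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

answer = analogy("France", "Paris", "Japan", toy_vectors)
print(answer)
```

With these toy vectors the "capital of" direction is simply the second coordinate, which is why the lookup works; real embeddings learn such directions from data.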

In the above embedding models, a word is assigned the same vector representation no matter where it appears and how it’s used, because word embeddings rely on just a look-up table. In other words, they ignore polysemy, the fact that words can have multiple meanings. To drive this point home, let’s consider an example: The way Messi plays football can only be on par with the greatest Broadway plays. Notice the word plays in the sentence: the first plays is related to playing a sport while the second plays is related to drama. The traditional embedding models will assign the same vector to both occurrences, whereas we need an embedding that also takes into consideration the context in which the word is used. Those are the embeddings we will learn about in the following approaches, along with how such context-conscious embeddings can be achieved.

The basic idea of the approaches we will look into is to learn a context-dependent representation, instead of a fixed embedding for each word, by training a deep language model, and then to use the representation learned by the language model in downstream tasks.


In NLP tasks, context matters. Words rarely appear in isolation, so understanding context is essential to all NLP tasks and to language understanding in general. One example is Question Answering, where the model must understand how the words in the question shift the importance of the words in the document. Another is Summarization, where the model needs to understand which words capture the context in order to summarize succinctly. The ability to share a common representation of words in the context of the sentences that include them could further improve transfer learning in NLP. This is where CoVe comes into play. It transfers information from large amounts of training data in the form of word vectors, using an encoder to contextualize the word vectors, and has been shown to improve performance over random word vector initialization on a variety of downstream tasks, e.g. POS, NER and QA.

How it Works?

CoVe stands for Contextual Word Vectors, a type of word embedding learned by the encoder of an attentional sequence-to-sequence machine translation model. The team at Salesforce explained CoVe in the best way on their research blog, and it is also outlined in their paper. We will look at the specific example of Machine Translation from English (source language) to German (target language).

  • Encoder

A neural network BiLSTM takes word vectors as input and outputs a new vector called hidden vector. This process is often referred to as encoding the sequence, and the neural network that does the encoding is referred to as an encoder. BiLSTM (forward and backward LSTM) is used to incorporate information from words that appear later in the sequence.

CoVe uses two BiLSTM layers as the encoder; the first BiLSTM processes its entire sequence before passing outputs to the second. Let \(w^{x}\) = [\(w^{x}_{1}, w^{x}_{2}, ..., w^{x}_{n}\)] be the sequence of words in the source language. Then the output hidden vectors h, i.e. the CoVe vectors, are

\[\begin{aligned} \textbf{Encoder} : CoVe(w^{x}) & = BiLSTM(GloVe(w^{x})) \\ h = [h_{1}, h_{2}, ..., h_{n}] & = BiLSTM(GloVe(w^{x})) \\ h_{t} & = [\overset{\leftarrow}{h_{t}}; \overset{\rightarrow}{h_{t}}] \\ \overset{\leftarrow}{h_{t}} & = LSTM(GloVe(w^{x}_{t}), \overset{\leftarrow}{h_{t+1}}) \\ \overset{\rightarrow}{h_{t}} & = LSTM(GloVe(w^{x}_{t}), \overset{\rightarrow}{h_{t-1}}) \\ \end{aligned}\]

The forward LSTM reads the sentence left to right, so its state at step t depends on the state at t-1, while the backward LSTM reads right to left, so its state depends on the state at t+1. Just as the pretrained vectors obtained from embeddings captured some interesting relationships, similar results are obtained from the hidden vectors (h). In our case of Machine Translation, the inputs are the GloVe embeddings of the English words of the input sentence (GloVe(\(w^{x}\))) and the outputs are its hidden vectors (h). After training, we call this pretrained encoder LSTM an MT-LSTM (Machine Translation LSTM), and it can serve as a pretrained model to generate hidden vectors for new sentences. When using these machine translation hidden vectors as inputs to another NLP model, we refer to them as context vectors (CoVe).

  • Decoder

The encoder produces hidden vectors for the English sentences it is given. Another neural network, called the decoder, references those hidden vectors to generate the German sentence. The decoder LSTM is initialized from the final states of the encoder, reads in a special German word vector to start, and generates a decoder state vector.

The decoder takes as input the randomly initialized embeddings of the target words \(w^{z}\) = [\(w^{z}_{1}, w^{z}_{2}, ..., w^{z}_{n}\)], the context-adjusted state generated by the attention mechanism, and the previous hidden state of the LSTM. Similar to the encoder, the decoder uses a two-layer (unidirectional) LSTM to create the decoder state from the input word vectors.

\[\begin{aligned} \textbf{Decoder hidden state} : s_{t} = LSTM([w^{z}_{t-1}; \tilde{h}_{t-1}], s_{t-1}) \end{aligned}\]
  • Attention

The attention mechanism is one of the most interesting mechanisms in NLP. It looks back at the hidden vectors in order to decide which part of the English sentence to translate next. It uses the decoder state vector to determine how important each hidden vector is, and then it produces a new vector, which we will call the context-adjusted state, to record its observation.

In other words, the attention mechanism uses the hidden vectors generated by the encoder and the decoder state generated by the decoder to produce the context-adjusted state. It decides which context words play an important role in the translation and focuses on them rather than on the whole sentence.

\[\begin{aligned} \textbf{Attention Weights} : \alpha_{t} & = softmax(H(W_{1}s_{t} + b_{1}))\\ \textbf{Context-adjusted State} : \tilde{h}_{t} & = tanh(W_{2}[H^{T} \alpha_{t}; s_{t}] + b_{2})\\ \end{aligned}\]

Here H is a stack of the hidden states {h} along the time dimension.
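As a rough illustration of the two attention equations, here is a small pure-Python sketch of a single attention step. All sizes and weight values are toy assumptions chosen so the numbers are easy to follow, and `attention_step` is a hypothetical helper, not code from the CoVe release.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def attention_step(H, s_t, W1, b1, W2, b2):
    # alpha_t = softmax(H (W1 s_t + b1)): one weight per encoder position
    query = [q + b for q, b in zip(matvec(W1, s_t), b1)]
    alpha = softmax(matvec(H, query))
    # H^T alpha_t: attention-weighted sum of the encoder hidden vectors
    dim = len(H[0])
    context = [sum(a * h[d] for a, h in zip(alpha, H)) for d in range(dim)]
    # h~_t = tanh(W2 [H^T alpha_t; s_t] + b2): the context-adjusted state
    pre = matvec(W2, context + s_t)
    h_tilde = [math.tanh(p + b) for p, b in zip(pre, b2)]
    return alpha, h_tilde

# Toy setup: 3 encoder positions, hidden size 2 (assumed values).
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
s_t = [1.0, 0.0]
W1 = [[1.0, 0.0], [0.0, 1.0]]
b1 = [0.0, 0.0]
W2 = [[0.5, 0.0, 0.0, 0.5], [0.0, 0.5, 0.5, 0.0]]
b2 = [0.0, 0.0]

alpha, h_tilde = attention_step(H, s_t, W1, b1, W2, b2)
print(alpha, h_tilde)
```

In this toy setup the decoder state points along the first dimension, so the first and third encoder positions, which share that dimension, receive more attention than the second.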

  • Generation

The generator (not a separate layer; it is the decoder, but the step is called generation because it generates the output sentence) then looks at the context-adjusted state to determine which German word to output, and the context-adjusted state is passed back to the decoder so that it has an accurate sense of what it has already translated. The decoder repeats this process until it is done translating. This is a standard attentional encoder-decoder architecture for learning sequence-to-sequence tasks like machine translation.

Putting it together: the attention mechanism takes all the hidden vectors and the first decoder state as input to produce a context-adjusted state, which the generator uses to produce the output German word and which is fed back to the decoder to produce the next decoder state. This decoder state, along with all the hidden vectors, is again the input to the attention mechanism, which generates another context-adjusted state that is fed into the third decoder step, and so on.

\[\begin{aligned} \textbf{Generator Output} : p(y_{t} \mid H,y_{1},y_{2}, \dots,y_{t-1}) & = softmax(W_{out}\tilde{h}_{t} + b_{out}) \end{aligned}\]

Here \(p(y_{t} \mid H,y_{1},y_{2}, \dots,y_{t-1})\) is a probability distribution over output words.


  • Use the traditional encoder-decoder architecture from seq2seq learning to learn words in context: the GloVe embeddings of the words in a sentence are given to the encoder, and two stacked BiLSTM layers output the hidden vectors, i.e. the context vectors.
  • We looked at one specific example, MT, where the encoder generates the context vectors and the decoder uses these context vectors, along with the attention mechanism (which outputs the context-adjusted state), to produce the output sentence in the target language.


Hmm, that seems like a simple process, but what about the results? Was it a SOTA breaker?

Here CoVe+GloVe means that we take the GloVe sequence, run it through a pretrained MT-LSTM to get the CoVe sequence, and concatenate each vector in the CoVe sequence with the corresponding vector in the GloVe sequence.

SOTA in 3 out of 7 tasks, well that’s a good start with using CoVe pretrained vectors.

What this means?

Replacing the good ol’ GloVe, word2vec and fastText with CoVe seems to do a good job on the tasks where context matters. Using a pretrained CoVe model is also simple. Just take the text of the task at hand (e.g. Amazon reviews for SST or the 50,000 unlabelled IMDb reviews for the IMDb sentiment analysis task), pass it through the encoder (the MT-LSTM, which was itself trained in supervised fashion) to generate CoVe vectors, and use those CoVe vectors together with the GloVe vectors as the initial embeddings when training for a specific task like sentiment analysis, Question Answering, Machine Translation, etc. The more data we use to train the MT-LSTM, the more pronounced the improvement, which seems to be complementary to the improvements that come from using other forms of pretrained vector representations.

The disadvantage is that the pretrained CoVe encoder can only be trained on the data available for supervised training of the encoder-decoder architecture. Supervised learning requires labels, so the large unlabelled datasets which are everywhere cannot be used.


Hi, my name is ELMo and I will overcome the limitation of CoVe by generating contextual embeddings in an unsupervised fashion.

ELMo stands for Embeddings from Language Models. ELMo is a word representation technique proposed by the AllenNLP group.

How it Works?

ELMo word representations are functions of the entire input sentence and are computed on top of a two-layer biLM with character convolutions.

  • Bidirectional Language Model

A language model is an NLP model which learns to predict the next word in a sentence. For instance, if your mobile phone keyboard guesses what word you are going to want to type next, then it’s using a language model. The reason this is important is because for a language model to be really good at guessing what you’ll say next, it needs a lot of world knowledge (e.g. “I ate a hot” → “dog”, “It is very hot” → “weather”), and a deep understanding of grammar, semantics, and other elements of natural language.

Given a sequence of N tokens (\(t_{1}, t_{2}, ..., t_{N}\)), a forward language model (LM) computes the probability of the sequence by modeling the probability of token \(t_{k}\) given its history (\(t_{1}, t_{2}, ..., t_{k-1}\)):

\[\begin{aligned} p(t_{1}, t_{2}, ..., t_{N}) = \prod_{k=1}^{N}p(t_{k} \mid t_{1}, t_{2}, ..., t_{k-1}) \end{aligned}\]

Given the same sequence of N tokens (\(t_{1}, t_{2}, ..., t_{N}\)), a backward language model (LM) computes the probability of the sequence by modeling the probability of token \(t_{k}\) given its future context (\(t_{k+1}, t_{k+2}, ..., t_{N}\)):

\[\begin{aligned} p(t_{1}, t_{2}, ..., t_{N}) = \prod_{k=1}^{N}p(t_{k} \mid t_{k+1}, t_{k+2}, ..., t_{N}) \end{aligned}\]
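The chain-rule factorization above can be sketched in a few lines. The conditional model here is a hypothetical toy that looks only at the previous token (a bigram simplification, whereas the LSTM-based LM conditions on the whole history), and the probabilities in `toy_table` are made up for illustration.

```python
import math

def sequence_log_prob(tokens, cond_prob):
    """Chain rule: log p(t_1..t_N) = sum_k log p(t_k | t_1..t_{k-1})."""
    return sum(math.log(cond_prob(tokens[k], tokens[:k]))
               for k in range(len(tokens)))

# Made-up conditional probabilities p(token | previous token).
toy_table = {
    ("<s>", "I"): 0.5,
    ("I", "ate"): 0.4,
    ("ate", "a"): 0.5,
    ("a", "hot"): 0.2,
    ("hot", "dog"): 0.25,
}

def toy_forward_lm(token, history):
    """Toy forward LM: uses only the last token of the history."""
    prev = history[-1] if history else "<s>"
    return toy_table.get((prev, token), 1e-6)

tokens = ["I", "ate", "a", "hot", "dog"]
log_p = sequence_log_prob(tokens, toy_forward_lm)
print(math.exp(log_p))
```

A backward LM would use the same function on the reversed token list, with a model that conditions on the tokens to the right instead of the left.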

A bidirectional language model (biLM) combines a forward LM and a backward LM. It is trained to jointly maximize the log likelihood of the forward and backward directions (equivalently, to minimize the negative log likelihood of the true words):

\[\begin{aligned} \mathcal{L}_{LM} = \sum_{k=1}^{N}( & \log p(t_{k} \mid t_{1}, t_{2}, ..., t_{k-1}; \Theta_{x}, \overset{\rightarrow}\Theta_\text{LSTM}, \Theta_{s}) \\ + & \log p(t_{k} \mid t_{k+1}, t_{k+2}, ..., t_{N}; \Theta_{x}, \overset{\leftarrow}\Theta_\text{LSTM}, \Theta_{s})) \end{aligned}\]

Here \(\Theta_{x}\) and \(\Theta_{s}\) are the parameters of the embedding layer and the softmax layer, which are shared between the two directions, while each direction has its own LSTM parameters. Overall, this formulation is similar to the approach of Peters et al. (2017), with the exception that we share some weights between directions instead of using completely independent parameters. The internal states of the forward pass at a certain word reflect the word itself and what has happened before that word, whereas the internal states of the backward pass reflect the word itself and what happens after it. The states of these two passes are concatenated to get the intermediate word vector of that word. Therefore, this intermediate word vector is still a representation of what the word means, but it “knows” what is happening in the rest of the sentence (i.e. it captures the context) and how the word is used.

ELMo uses a two-layer biLM where each biLM layer consists of one forward pass and one backward pass, so the sentence is scanned in both directions. ELMo is a task-specific combination of the intermediate layer representations in the biLM. For each token \(t_{k}\), an L-layer biLM computes a set of 2L+1 representations (one token embedding plus a forward and a backward state per layer), which become L+1 vectors once the two directions of each layer are concatenated:

\[\begin{aligned} \mathcal{R}_{k} & = \{x_{k}^{LM}, \overset{\rightarrow}h_{k,j}^{LM} ,\overset{\leftarrow}h_{k,j}^{LM} \mid j=1,2,...,L\} \\ & = \{h_{k, j}^{LM} \mid j = 0,1,2,...,L\} \end{aligned}\]

where \(h_{k, 0}^{LM}\) is the embedding layer and \(h_{k, j}^{LM} = [\overset{\rightarrow}h_{k,j}^{LM} ; \overset{\leftarrow}h_{k,j}^{LM}]\) for each biLSTM layer.

For inclusion in a downstream model, ELMo collapses all layers in \(\mathcal{R}\) into a single vector,

\[\begin{aligned} \text{ELMo}_{k}^{task} & = E[\mathcal{R}_{k}; \Theta^{task}]\\ & = \gamma^{task} \sum_{j=0}^{L}s_{j}^{task}h_{k, j}^{LM} \end{aligned}\]

where \(s^{task}\) are softmax-normalized weights and the scalar parameter \(\gamma^{task}\) allows the task model to scale the entire ELMo vector.
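The collapse into a single vector is just a scaled, softmax-weighted sum over layers. Below is a minimal sketch for one token; the layer vectors and weights are toy values, and `elmo_vector` is a hypothetical helper rather than the AllenNLP implementation.

```python
import math

def elmo_vector(layer_states, raw_weights, gamma):
    """Combine the L+1 layer vectors h_{k,j} of one token.

    layer_states: list of L+1 vectors (embedding layer first)
    raw_weights:  L+1 unnormalized scalars, softmax-normalized into s_j
    gamma:        task-specific scalar that scales the whole vector
    """
    m = max(raw_weights)
    exps = [math.exp(w - m) for w in raw_weights]
    total = sum(exps)
    s = [e / total for e in exps]
    dim = len(layer_states[0])
    return [gamma * sum(s_j * h[d] for s_j, h in zip(s, layer_states))
            for d in range(dim)]

# Toy example: L = 2 biLSTM layers plus the embedding layer, hidden size 2.
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
vec = elmo_vector(layers, raw_weights=[0.0, 0.0, 0.0], gamma=2.0)
print(vec)
```

With equal raw weights every layer contributes one third, so the result is simply twice the average of the layer vectors. In a real task model the raw weights and gamma would be learned together with the downstream parameters.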

Finally, ELMo uses a character CNN (convolutional neural network) to compute the raw word embeddings that get fed into the first layer of the biLM. The input to the biLM is computed purely from characters (and combinations of characters) within a word, without relying on a lookup table like the ones used by word2vec and GloVe. Character n-grams of this kind were seen in fastText embeddings and are well known for handling OOV (out of vocabulary) words. Thus, ELMo embeddings can handle OOV words gracefully.

The “what information is captured by biLM representations” section of the paper indicates that syntactic information is better represented at lower layers while semantic information is captured by higher layers. Because different layers tend to carry different types of information, combining them helps.

Masato Hagiwara points out difference between biLM and biLSTM clearly,

A word of caution: the biLM used by ELMo is different from biLSTM although they are very similar. biLM is just a concatenation of two LMs, one forward and one backward. biLSTM, on the other hand, is something more than just a concatenation of two separate LSTMs. The main difference is that in biLSTM, internal states from both directions are concatenated before they are fed to the next layer, while in biLM, internal states are just concatenated from two independently-trained LMs.


  • Words carry different meanings depending on context, so their embeddings should also take context into account.
  • ELMo trains a bidirectional LM and extracts the hidden state of each layer for the input sequence of words.
  • Then it computes a weighted sum of those hidden states to obtain an embedding for each word. The weight of each hidden state is task-dependent and is learned.
  • This learned ELMo embedding is used in the specific downstream task for which it was obtained.


Well, ELMo certainly outperforms CoVe and emerges as the new SOTA on all 6 tasks, with relative error reductions ranging from 6% to 20%.

What this means?

Here is one of the results from the contextual embeddings of the biLM.

Notice how the biLM is able to disambiguate both the part of speech and the word sense of the word “play” in the source sentence, unlike its GloVe counterpart, which has fixed neighbours no matter the context.

ELMo improves task performance over word vectors because the biLM’s contextual representations encode information that is generally useful for NLP tasks and is not captured in word vectors.

Once pretrained, the biLM can compute representations for any task. In some cases, fine tuning the biLM on domain specific data leads to significant drops in perplexity and an increase in downstream task performance. Given a pretrained LM and a supervised architecture for a target NLP task, it is a simple process to use the biLM to improve the task model. We simply run the biLM and record all of the layer representations for each word. Then, we let the end task model learn a linear combination of these representations.

To add ELMo to the supervised model, we first freeze the weights of the biLM and then concatenate the ELMo vector \(\text{ELMo}^{task}\) with \(x_{k}\) and pass the ELMo enhanced representation [\(x_{k}; \text{ELMo}^{task}\)] into task RNN.


The paper by Jeremy Howard and Sebastian Ruder proposes a transfer learning method for NLP similar to the one we saw in our previous blog on Transfer Learning on images. So cool!

Earlier transfer learning in NLP meant either fine-tuning pretrained word embeddings, a simple technique that only targets a model’s first layer, or approaches like ELMo and CoVe that concatenate embeddings derived from other tasks with the input at different layers. Compared with fine-tuning as seen in Computer Vision, this barely scratches the surface of the model: these approaches mainly transfer word-level information instead of high-level semantics. The authors argue that it is not the idea of LM fine-tuning itself but our lack of knowledge of how to train such models effectively that has been hindering wider adoption.

How it Works?

Universal Language Model Fine-tuning(ULMFiT) is the model that addresses the issues mentioned above and enables robust inductive transfer learning for any NLP task.

ULMFiT consists of three stages:

  1. General-domain LM pretraining : The typical routine for creating pretrained vision models is to train on a very large corpus of data (ImageNet size) and then use that frozen model as the base model for finetuning. Similarly, Wikitext-103, consisting of 28,595 preprocessed Wikipedia articles and 103 million words, is used to pretrain a language model. A language model, as we discussed in the ELMo section, learns to predict the next word in a sentence. This prediction task forces the language model to pick up the grammar, semantics and other elements of the corpus it is trained on. The base pretrained language model is AWD-LSTM, described in another paper by a group at Salesforce, Merity et al. This step is expensive, but it needs to be performed only once (to obtain a pretrained model on a large corpus).

  2. Target task LM fine-tuning : The data of the target task and the general-domain data used for pretraining can be different (come from different distributions). This step finetunes the LM on the target task data. As noted above, the lack of knowledge of how to train effectively is what holds back transfer learning in NLP. To stabilize the finetuning process, the authors propose two methods: a) Discriminative fine-tuning and b) Slanted triangular learning rates.

a) Discriminative fine-tuning : We have seen, both when visualizing layers and in the biLM of ELMo, how different layers capture different types of information. In discriminative fine-tuning, each layer is updated using a different learning rate {\(\eta^{1}, ..\eta^{L}\)} for the L layers in the model, where \(\eta^{l}\) is the learning rate of the l-th layer. In practice, choosing the learning rate \(\eta^{L}\) of the last layer by fine-tuning only the last layer, and using \(\eta^{l-1}\) = \(\eta^{l}\)/2.6 as the learning rate for lower layers, is found to work well.

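b) Slanted triangular learning rates: We would like the model to converge quickly to a suitable region of the parameter space at the beginning of training and then refine its parameters. Using the same learning rate (LR) or an annealed learning rate throughout training is not the best way to achieve this behaviour. Instead, the authors propose slanted triangular learning rates (STLR), which first linearly increase the learning rate and then linearly decay it according to the following update schedule.

\[\begin{aligned} cut & = \lfloor T \cdot cut\_frac \rfloor \\ p & = \begin{cases} t/cut, & \text{if } t < cut \\ 1 - \frac{t - cut}{cut \cdot (1/cut\_frac - 1)}, & \text{otherwise} \end{cases} \\ \eta_{t} & = \eta_{max} \cdot \frac{1 + p \cdot (ratio - 1)}{ratio} \end{aligned}\]

where T is the number of training iterations (number of epochs x number of updates per epoch), cut_frac is the fraction of iterations during which we increase the LR, cut is the iteration at which we switch from increasing to decreasing the LR, p is the fraction of the number of iterations we have increased or will decrease the LR respectively, ratio specifies how much smaller the lowest LR is than the maximum LR \(\eta_{max}\), and \(\eta_{t}\) is the learning rate at iteration t. In practice, ratio = 32, cut_frac = 0.1 and \(\eta_{max}\) = 0.01 are used.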
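The two stabilizing tricks of stage 2 can be sketched in plain Python. This is a minimal illustration, not the fastai implementation: the function names are hypothetical, the piecewise form of p follows the schedule as given in the ULMFiT paper, and the defaults are the values quoted above.

```python
import math

def discriminative_lrs(last_layer_lr, num_layers, factor=2.6):
    """Return [eta^1, ..., eta^L] with eta^(l-1) = eta^l / factor."""
    return [last_layer_lr / factor ** (num_layers - l)
            for l in range(1, num_layers + 1)]

def stlr(t, T, cut_frac=0.1, ratio=32, eta_max=0.01):
    """Slanted triangular learning rate at iteration t out of T."""
    cut = math.floor(T * cut_frac)          # iteration where the LR peaks
    if t < cut:
        p = t / cut                          # linear warm-up phase
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))  # linear decay phase
    return eta_max * (1 + p * (ratio - 1)) / ratio

# Per-layer learning rates for a hypothetical 4-layer model.
lrs = discriminative_lrs(last_layer_lr=0.01, num_layers=4)

# Learning rate over a hypothetical run of 100 iterations.
schedule = [stlr(t, T=100) for t in range(101)]
print(lrs)
print(schedule[0], schedule[10], schedule[100])
```

With these settings the LR starts at eta_max / 32, peaks at eta_max after 10% of the iterations, and decays back to eta_max / 32 by the end. The short warm-up and long decay are what make the triangle "slanted".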

  3. Target task classifier fine-tuning :

For finetuning the classifier, the pretrained language model is augmented with two additional linear blocks. Two techniques are used when training it: a) concat pooling and b) gradual unfreezing.

a) Concat pooling: The authors state that as an input document can consist of hundreds of words, information may get lost if we only consider the last hidden state of the model. For this reason, we concatenate the hidden state at the last time step \(h_{T}\) of the document with both the max-pooled and the mean-pooled representation of the hidden states over as many time steps as fit in GPU memory. If \(\mathcal{H} = [h_{1},...,h_{T}]\), then \(h_{c} = [h_{T}, \text{maxpool}(\mathcal{H}), \text{meanpool}(\mathcal{H})]\).

b) Gradual unfreezing: Rather than fine-tuning all layers at once, which may result in catastrophic forgetting, the authors propose gradual unfreezing, starting from the last layer as it contains the least general knowledge. The steps involved are: we first unfreeze the last layer and fine-tune all unfrozen layers for one epoch. We then unfreeze the next lower frozen layer and repeat, until all layers are being fine-tuned, and then train until convergence at the last iteration.
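Both classifier fine-tuning tricks are easy to sketch. The snippet below is an illustration under toy assumptions: the hidden states are tiny hand-written vectors, the layer names are made up, and `concat_pool` and `gradual_unfreezing_schedule` are hypothetical helpers rather than fastai functions.

```python
def concat_pool(hidden_states):
    """h_c = [h_T, maxpool(H), meanpool(H)] for H = [h_1, ..., h_T]."""
    dim = len(hidden_states[0])
    steps = len(hidden_states)
    last = list(hidden_states[-1])
    max_pooled = [max(h[d] for h in hidden_states) for d in range(dim)]
    mean_pooled = [sum(h[d] for h in hidden_states) / steps
                   for d in range(dim)]
    return last + max_pooled + mean_pooled

def gradual_unfreezing_schedule(layer_names):
    """Yield, epoch by epoch, the layers that are trainable.

    The last layer is unfrozen first; each following epoch unfreezes
    the next lower layer, until every layer is trainable.
    """
    for epoch in range(1, len(layer_names) + 1):
        yield layer_names[-epoch:]

# Toy document of 3 time steps with hidden size 2.
H = [[1.0, 4.0], [3.0, 2.0], [2.0, 0.0]]
h_c = concat_pool(H)

# Hypothetical layer names, listed from lowest to highest layer.
layers = ["embedding", "lstm1", "lstm2", "lstm3", "classifier"]
schedule = list(gradual_unfreezing_schedule(layers))
print(h_c)
print(schedule)
```

Note that the pooled vector is three times the hidden size, so the first linear block of the classifier has to accept an input of that width.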


  • Wooh hoo, CV transfer learning style training. Create a pretrained language model by training on a large corpus like Wikitext-103.
  • Finetune the LM on the target data. To stabilize this finetuning, two methods are used: discriminative fine-tuning and slanted triangular learning rates.
  • To finetune the target task classifier on top of the finetuned LM, additional linear blocks are added to the language model architecture, with concat pooling on the input side and gradual unfreezing during training.


ULMFiT method significantly outperforms the SOTA on six text classification tasks, reducing the error by 18-24% on the majority of datasets.

What this means?

Ooooh, this is very exciting. SoTA on everything! Take my money already.

ULMFiT shows one of the best approaches to tackling a difficult problem: combining different methods into one. Transfer Learning has certainly changed the Computer Vision field, and this method surely opens the door for similar breakthroughs in the NLP field.

Transformer Architectures References

Before proceeding to GPT and BERT, it is necessary to properly understand the Transformer architecture introduced in the paper “Attention is All You Need”. Here are some very cool recommended resources, other than the paper, to get you started.

Note: Dissecting BERT on Medium dissects BERT and the Transformer. For an in-depth understanding of the BERT Encoder look at part-1 and part-2, and for the Decoder of the Transformer architecture look here.

Note: keitakurita does a great job in dissecting the paper on the blog.

Note: Jay Alammar explains Transformer architectures through visualizations in the blog post The Illustrated Transformer.

Note: Harvard NLP group has excellent blog detailing the paper “Attention is All You Need” which describes the Transformer architecture used by GPT and BERT with PyTorch implementation details step-by-step.


The group at OpenAI proposed a new method, GPT, and showed that large gains on various NLP tasks can be realized by generative pretraining of a language model on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task. The main goal of the paper was to learn a universal representation that transfers with little adaptation to a wide range of tasks.

How it works?

GPT is short for Generative Pre-trained Transformer. GPT uses a combination of unsupervised pretraining and supervised fine-tuning. Unlike ULMFiT, the authors turned to Transformer architectures instead of recurrent architectures for building the language model.

GPT training procedure consists of two steps:

  • Unsupervised pretraining

GPT, similar to ELMo, uses a standard language model objective. However, instead of a biLM with both forward and backward directions, GPT uses only the forward direction, and the model architecture is a multi-layer Transformer decoder adapted from this paper for language modeling. This model applies multiple transformer blocks over the embeddings of the input sequences. Each block contains a masked multi-headed self-attention layer and a pointwise feed-forward layer. The final output produces a distribution over target tokens after softmax normalization.

\[\begin{aligned} h_{0} & = UW_{e} + W_{p} \\ h_{l} & = transformer\_block(h_{l-1}) \\ P(u) & = softmax(h_{n}W_{e}^{T}) \\ \end{aligned}\]

where \(W_{e}\) is the token embedding matrix, \(W_{p}\) is the position embedding matrix, n is the number of layers and U = (\(u_{-k}, ..., u_{-1}\)) is the context vector of tokens.

The objective to maximize is just the forward direction of the biLM objective seen in ELMo.

\[\begin{aligned} \mathcal{L}_{LM} = \sum_{k=1}^{N}\log p(t_{k} \mid t_{1}, t_{2}, ..., t_{k-1}) \end{aligned}\]

Byte Pair Encoding (BPE) is used to encode the input sequences. Motivated by the intuition that rare and unknown words can often be decomposed into multiple subwords, BPE finds the best word segmentation by iteratively and greedily merging frequent pairs of characters.
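The greedy merging loop behind BPE can be sketched as follows. This is a simplified illustration of the general algorithm on a made-up word-frequency table; it leaves out details of real tokenizers such as end-of-word markers, and the helper names are hypothetical.

```python
from collections import Counter

def most_frequent_pair(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair with one merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

def learn_bpe(word_freqs, num_merges):
    """Start from characters and greedily merge the most frequent pair."""
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(vocab)
        if pair is None:
            break
        merges.append(pair)
        vocab = merge_pair(pair, vocab)
    return merges, vocab

# Made-up corpus statistics: word -> frequency.
word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, vocab = learn_bpe(word_freqs, num_merges=2)
print(merges)
print(vocab)
```

After two merges the frequent suffix "est" has become a single subword, so a rare word such as "lowest" could later be segmented into known pieces instead of being treated as unknown.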

  • Supervised fine-tuning

After training with the objective \(\mathcal{L}_{LM}\), each labeled instance, consisting of a sequence of input tokens \(x^{1}, x^{2} ..., x^{m}\) along with a label y, is passed through the pretrained model to obtain the final transformer block’s activation \(h_{l}^{m}\), which is then fed into an added linear output layer with parameters \(W_{y}\) to predict y:

\[\begin{aligned} P(y \mid x^{1}, x^{2}, ..., x^{m}) & = softmax(h_{l}^{m}W_{y}) \\ \mathcal{L}_{C} & = \sum_{(x,y)}\log P(y \mid x^{1}, x^{2}, ..., x^{m}) \\ \mathcal{L}_{total} & = \mathcal{L}_{C} + \lambda * \mathcal{L}_{LM} \end{aligned}\]

GPT needs only minimal task-specific customization when applied across various tasks. If the task input contains multiple sentences, a special delimiter token ($) is added between each pair of sentences. The embedding for this delimiter token is a new parameter we need to learn, but the number of such new parameters is small. All transformations include adding randomly initialized start and end tokens (〈s〉,〈e〉).
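As a small illustration of the input transformation described above, the sketch below wraps one or two token sequences with the start, delimiter and end tokens. The ASCII token strings and the function name `format_input` are assumptions made for this example; a real implementation would work on token ids.

```python
# Special tokens, written as plain strings for illustration.
START, DELIM, END = "<s>", "$", "<e>"

def format_input(*sentences):
    """Join one or more token lists into a single GPT-style input sequence."""
    sequence = [START]
    for i, tokens in enumerate(sentences):
        if i > 0:
            sequence.append(DELIM)   # delimiter between each pair of sentences
        sequence.extend(tokens)
    sequence.append(END)
    return sequence

# Single-sentence task (e.g. classification).
single = format_input(["great", "movie"])

# Two-sentence task (e.g. entailment: premise, then hypothesis).
pair = format_input(["it", "rains"], ["the", "ground", "is", "wet"])
print(single)
print(pair)
```

The model then reads the whole sequence left to right, and the activation at the final position is what feeds the added linear output layer.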


  • GPT makes use of unlabelled data to train a language model using a multi-layer Transformer decoder architecture.
  • The language model pretrained above can be applied across various tasks directly, instead of training different language models for different tasks.


That’s a lot of results. GPT significantly improves upon the SOTA in 9 out of the 12 tasks.

What this means?

By pretraining on a diverse corpus with long stretches of contiguous text our model acquires significant world knowledge and ability to process long-range dependencies which are then successfully transferred to solving discriminative tasks such as question answering, semantic similarity assessment, entailment determination, and text classification, improving the state of the art. The advantage of this approach is few parameters need to be learned from scratch.

One limitation of GPT is its unidirectional nature: the model is trained only to predict the next token from the context to its left.


Yo myself, BERT. I will improve the shortcomings of GPT.

How does this work?

BERT stands for Bidirectional Encoder Representations from Transformers. BERT was designed by a group at Google AI Language to pretrain deep bidirectional representations by jointly conditioning on both left and right context in all layers. By adding different output layers to pretrained BERT, the model can be used for various NLP tasks.

We have seen two strategies for applying pretrained language representations to downstream tasks: feature-based and finetuning. ELMo is an example of the feature-based approach, where task-specific architectures include the pretrained representations as additional features. GPT is an example of finetuning, which introduces minimal task-specific parameters and is trained on the downstream tasks by simply finetuning the pretrained parameters.

Here are the differences in pretraining model architectures. BERT uses a bidirectional Transformer. OpenAI GPT uses a left-to-right Transformer. ELMo uses the concatenation of independently trained left-to-right and right-to-left LSTMs to generate features for downstream tasks. The BERT Transformer uses bidirectional self-attention, while the GPT Transformer uses constrained self-attention where every token can only attend to context to its left. In the literature the bidirectional Transformer is often referred to as a “Transformer encoder” while the left-context-only version is referred to as a “Transformer decoder” since it can be used for text generation.
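The difference between the two kinds of self-attention comes down to which positions each token is allowed to see. A minimal sketch of the two attention masks (1 = may attend, 0 = blocked), with hypothetical function names:

```python
def bidirectional_mask(n):
    """BERT-style encoder: every token may attend to every position."""
    return [[1] * n for _ in range(n)]

def left_to_right_mask(n):
    """GPT-style decoder: token i may attend only to positions j <= i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

# Row i lists what token i can see in a 3-token sequence.
causal = left_to_right_mask(3)   # [[1, 0, 0], [1, 1, 0], [1, 1, 1]]
full = bidirectional_mask(3)     # [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
```

The lower-triangular mask is what makes the left-to-right model usable for generation: no token ever depends on tokens that have not been produced yet.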

The authors argue that the left-to-right architecture GPT uses for standard language modeling limits the choice of architectures that can be used during pretraining, and that a deep bidirectional model is strictly more powerful than either a left-to-right model (GPT) or the shallow concatenation of a left-to-right and a right-to-left model (ELMo). They propose two new pretraining objectives: “masked language model” (MLM) and “next sentence prediction”.

Input to BERT is built from three embeddings plus special tokens: (i) Token Embeddings: WordPiece embeddings with a 30,000 token vocabulary, where split word pieces are denoted with ##. (ii) Position Embeddings: learned positional embeddings supporting sequence lengths up to 512 tokens. (iii) Segment Embeddings: sentence pairs are packed together into a single sequence, and we differentiate the sentences in two ways. First, we separate them with a special token ([SEP]). Second, we add a learned sentence A embedding to every token of the first sentence and a sentence B embedding to every token of the second sentence; for single-sentence inputs we only use the sentence A embeddings. In addition, the first token of every sequence is always the special classification token ([CLS]).

BERT’s input representation is constructed by summing the corresponding token, segment and position embeddings.
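A small sketch of how a sentence pair could be packed and how the three embeddings are summed per token. The helper names are hypothetical, the trailing [SEP] after the second sentence follows the usual BERT convention, and embedding lookups are replaced by plain vectors to keep the example self-contained.

```python
def pack_inputs(tokens_a, tokens_b=None):
    """Build the token sequence plus segment and position ids for one example."""
    tokens = ["[CLS]"] + list(tokens_a) + ["[SEP]"]
    segment_ids = [0] * len(tokens)              # sentence A, including [CLS] and its [SEP]
    if tokens_b:
        tokens += list(tokens_b) + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)  # sentence B and the closing [SEP]
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids

def input_representation(token_vec, segment_vec, position_vec):
    """Element-wise sum of the token, segment and position embeddings for one token."""
    return [t + s + p for t, s, p in zip(token_vec, segment_vec, position_vec)]

tokens, segment_ids, position_ids = pack_inputs(["my", "dog"], ["is", "hairy"])
# tokens      -> ['[CLS]', 'my', 'dog', '[SEP]', 'is', 'hairy', '[SEP]']
# segment_ids -> [0, 0, 0, 0, 1, 1, 1]
```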

Similar to GPT, BERT training takes place in two steps:

  • Pretraining tasks: Unlike GPT, BERT’s model architecture is a multi-layer bidirectional Transformer encoder. To encourage bidirectional prediction and sentence-level understanding, BERT is trained with two auxiliary tasks (masking random words and next sentence prediction) instead of the basic language modeling task (that is, predicting the next token given the context).

a) Task #1: Masked LM: Here we mask some percentage of the input tokens at random, and then predict only those masked tokens. BERT randomly selects 15% of the tokens in a sequence. Consider for example the sentence: my dog is hairy, and suppose it chooses hairy. Rather than always replacing the chosen words with [MASK], the data generator will do the following: (i) Replace the word with the [MASK] token 80% of the time, e.g. my dog is hairy → my dog is [MASK], (ii) Replace the word with a random word 10% of the time, e.g. my dog is hairy → my dog is apple, (iii) Keep the word unchanged 10% of the time, e.g. my dog is hairy → my dog is hairy. The purpose of keeping the word is to bias the representation towards the actual observed word. The Transformer encoder does not know which words it will be asked to predict or which have been replaced by random words, which forces it to keep a distributional contextual representation of every input token.

b) Task #2: Next Sentence Prediction: Downstream tasks such as Question Answering (QA) and Natural Language Inference (NLI) depend on understanding the relationship between two sentences, which language modeling does not capture directly. To train a model that understands sentence relationships, BERT is also pretrained to predict whether one sentence follows another. When choosing sentences A and B for each pretraining example, 50% of the time B is the actual next sentence that follows A, and 50% of the time it is a random sentence from the corpus. The final pretrained model achieves 97%-98% accuracy at this task.

  • Finetuning Procedure: For a classification task, we take the final hidden state (i.e. the output of the Transformer) for the first token in the input, which is the special token [CLS], \(h_{L}^{CLS}\), and multiply it with the weight matrix of the classification layer \(W_{CLS}\), which is the only parameter added during fine-tuning. The label probabilities are then obtained with a standard softmax: \(P = softmax(h_{L}^{CLS} W_{CLS}^{T})\). For other downstream tasks, the following figure explains some task-specific modifications to be made.
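The 80/10/10 rule from the Masked LM task above can be sketched in a few lines. This is a simplified illustration: `mask_tokens` is a hypothetical helper, and it selects each token independently with probability 15% rather than picking exactly 15% of the positions.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (model inputs, labels); labels are None where no prediction is required."""
    if rng is None:
        rng = random.Random()
    inputs = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue                        # token not selected for prediction
        labels[i] = token                   # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = "[MASK]"            # 80%: replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)   # 10%: replace with a random word
        # remaining 10%: keep the original word unchanged
    return inputs, labels

# Toy usage with a made-up vocabulary and a seeded generator for repeatability.
toy_vocab = ["my", "dog", "is", "hairy", "apple"]
inputs, labels = mask_tokens(["my", "dog", "is", "hairy"], toy_vocab, rng=random.Random(0))
```

Only positions with a non-None label contribute to the masked LM loss, which is why the model cannot tell from the input alone which tokens it will be asked about.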

Understanding and choosing the correct hyperparameters (there are many) can make or break BERT, so we need to choose wisely. The paper outlines some experiments which I would encourage curious readers to have a look at.

The paper also outlines the differences between BERT and GPT.

  • GPT is trained on the BooksCorpus (800M words); BERT is trained on the BooksCorpus (800M words) and Wikipedia (2,500M words).
  • GPT uses a sentence separator ([SEP]) and classifier token ([CLS]) which are only introduced at fine-tuning time; BERT learns [SEP], [CLS] and sentence A/B embeddings during pre-training.
  • GPT was trained for 1M steps with a batch size of 32,000 words; BERT was trained for 1M steps with a batch size of 128,000 words.
  • GPT used the same learning rate of 5e-5 for all fine-tuning experiments; BERT chooses a task-specific fine-tuning learning rate which performs the best on the development set.


  • Use a large corpus of unlabeled data to learn a language model (which captures semantics, etc. of the language) by training on two tasks, Masked Language Model and Next Sentence Prediction, using a multi-layer bidirectional Transformer encoder architecture.
  • Finetune the pretrained language model for specific downstream tasks, with task-specific modifications where needed.


Hold on, here comes the result. BERT outperforms previous SOTA in 11 tasks. Yay!! Go, BERT.

What does this mean?

This means BERT is super cool, that’s it! We can use pretrained BERT models to finetune for specific tasks.


Look who shows up at the showdown between GPT and BERT: GPT’s big brother, GPT-2. The OpenAI team introduces the next version of GPT in the paper, GPT-2.

How does it work?

GPT-2 is a large transformer-based language model with 1.5 billion parameters (more than 10x the parameters of GPT), trained on a dataset of 8 million web pages. GPT-2 is trained with a simple objective: predict the next word, given all of the previous words within some text.

The authors state that the paper from Google AI which performed multi-task learning on 8 different tasks required supervision, whereas language modeling is, in principle, able to learn such tasks without explicit supervision. The authors perform preliminary experiments to confirm that sufficiently large language models are able to perform multitask learning in a toy-ish setup, but learning is much slower than in explicitly supervised approaches.

The internet contains a vast amount of information that is passively available without the need for interactive communication like in dialog or QA tasks. The authors speculate that a language model with sufficient capacity will begin to learn to infer and perform the tasks demonstrated in natural language sequences in order to better predict them, regardless of their method of procurement. If a language model is able to do this it will be, in effect, performing unsupervised multitask learning. The authors propose Zero-shot Transfer: instead of finetuning for separate tasks, where each task models the conditional probability p(output|input), a single language model is pretrained on diverse text and conditioned on the task along with the input to get task-specific output, p(output|input,task).

  • Zero-shot Transfer : GPT-2 learns its language model on a diverse dataset in order to collect natural language demonstrations of tasks in as varied domains and contexts as possible. On input representation, the authors state that current byte-level LMs are not competitive with word-level LMs on large scale datasets. They modify BPE (Byte Pair Encoding) to combine the benefits of word-level LMs with the generality of byte-level approaches.

  • Byte Pair Encoding : Byte Pair Encoding (BPE) is a practical middle ground between character and word level language modeling which effectively interpolates between word level inputs for frequent symbol sequences and character level inputs for infrequent symbol sequences. BPE implementations often operate on Unicode code points and not byte sequences, which requires a very large base vocabulary. Each byte can represent 256 different values in 8 bits, and UTF-8 uses one to four bytes per character to cover the whole Unicode range. Therefore, with a byte sequence representation we only need a base vocabulary of size 256 and do not need to worry about pre-processing, tokenization, etc. BPE merges frequently co-occurring byte pairs in a greedy manner. To prevent it from generating multiple versions of common words (e.g. dog., dog! and dog? for the word dog), GPT-2 prevents BPE from merging characters across categories (thus dog would not be merged with punctuation like ., ! and ?). This trick improves the compression efficiency while adding only minimal fragmentation of words across multiple vocab tokens.
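Two of the ideas above can be illustrated with the standard library alone: any string maps onto a base vocabulary of 256 byte values, and text can be split by character category before merging so that dog never fuses with the punctuation that follows it. This is a simplified sketch with hypothetical helper names; GPT-2's actual pre-tokenization rule is more involved.

```python
from itertools import groupby

def to_byte_ids(text):
    """Encode text as UTF-8 and return the byte values (each in 0..255)."""
    return list(text.encode("utf-8"))

def from_byte_ids(ids):
    """Inverse of to_byte_ids."""
    return bytes(ids).decode("utf-8")

def char_category(ch):
    """Coarse character category used to stop merges across category boundaries."""
    if ch.isalpha():
        return "letter"
    if ch.isdigit():
        return "digit"
    if ch.isspace():
        return "space"
    return "other"

def split_by_category(text):
    """Split text into runs of same-category characters; BPE would then merge only within a run."""
    return ["".join(group) for _, group in groupby(text, key=char_category)]

ascii_ids = to_byte_ids("dog")        # [100, 111, 103]
accent_ids = to_byte_ids("é")         # [195, 169], two bytes for one character
pieces = split_by_category("dog!")    # ['dog', '!']
```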

GPT-2 follows a Transformer architecture similar to the one used in GPT. The model details are largely the same, with a few modifications: layer normalization was moved to the input of each sub-block, similar to a pre-activation residual network, and an additional layer normalization was added after the final self-attention block. A modified initialization was constructed as a function of the model depth, scaling the weights of residual layers at initialization by a factor of \(1/ \sqrt{N}\) where N is the number of residual layers. The vocabulary is expanded to 50,257, the context size is increased from 512 to 1024 tokens, and a larger batch size of 512 is used.
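The change in where layer normalization sits, and the depth-dependent initialization scale, can be sketched as follows. This is a bare-bones illustration in plain Python: the layer norm has no learned gain or bias, `sublayer` stands for any attention or feed-forward function, and all names are hypothetical.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned gain or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def post_ln_block(x, sublayer):
    """GPT-style ordering: normalize after the residual addition."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_ln_block(x, sublayer):
    """GPT-2-style ordering: normalize the input of the sub-block, then add the residual."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def residual_init_scale(n_residual_layers):
    """Scale applied to residual-layer weights at initialization: 1 / sqrt(N)."""
    return 1.0 / math.sqrt(n_residual_layers)

# With a sub-block that outputs zeros, the pre-LN block passes the input through unchanged.
identity_out = pre_ln_block([1.0, 3.0], lambda h: [0.0 for _ in h])
```

In the pre-LN ordering the residual path is never normalized, so the input can flow through the stack untouched, which is one common explanation for why deeper models train more stably with it.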

Downstream Tasks

  • Text Generation : Text generation comes for free with a pretrained LM. Here is one example of text generation:

So real but not real! or is it?

  • Summarization : Appending TL;DR: after an article prompts the model to produce a summary.

  • Machine Translation : The model is conditioned on example pairs of source and target sentences, and the translation is obtained from the conditional probability of the target language.

  • Question Answering : Similar to translation, the model is conditioned on the context and example question-answer pairs to give the answer for the required question.
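Since there is no finetuning, the task is specified entirely through the text the model is conditioned on. The downstream tasks above can be sketched as prompt builders; the exact layouts (line breaks, the "Q:"/"A:" markers) and function names are illustrative assumptions, and the resulting string would be passed to a language model for continuation.

```python
def summarization_prompt(article):
    """Append the TL;DR: cue so the continuation reads as a summary."""
    return article.strip() + "\nTL;DR:"

def translation_prompt(example_pairs, source_sentence):
    """Condition on 'source = target' example pairs, then leave the last target open."""
    lines = [f"{src} = {tgt}" for src, tgt in example_pairs]
    lines.append(f"{source_sentence} =")
    return "\n".join(lines)

def qa_prompt(context, example_qa_pairs, question):
    """Condition on a context and example question-answer pairs, then ask the new question."""
    lines = [context.strip()]
    for q, a in example_qa_pairs:
        lines.append(f"Q: {q}")
        lines.append(f"A: {a}")
    lines.append(f"Q: {question}")
    lines.append("A:")
    return "\n".join(lines)

# Toy usage: whatever the model generates after the trailing '=' is read as the translation.
prompt = translation_prompt([("hello", "bonjour")], "cat")
```

Each builder is one way of writing p(output|input,task) as ordinary next-word prediction: the task lives in the prompt, not in the model's parameters.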


  • A large and diverse amount of data is enough to capture language semantics related to different tasks, instead of training a language model for separate tasks.
  • The pretrained language model does an excellent job on various tasks such as question answering, machine translation, summarization and especially text generation without having to train explicitly for each particular task. No task-specific finetuning required.
  • GPT-2 achieves mind blowing results just through a pretrained language model.


I bet results would be SOTA and they are, on 7 tasks out of 8.

What does this mean?

Just training LM (no task-specific finetuning) that is all it took. Results are mind (into tiny pieces) blowing.

There is a recent approach from Baidu, called (wait for it) ERNIE, which outperforms BERT in certain Chinese NLP tasks such as natural language inference, semantic similarity, named entity recognition, sentiment analysis and question-answer matching.


To test drive these approaches, we use the Twitter US Airlines Sentiment dataset. The Twitter data was scraped in February of 2015, and contributors were asked to first classify tweets as positive, negative, or neutral, followed by categorizing negative reasons (such as “late flight” or “rude service”). It covers tweets about six US airlines, each labeled with its sentiment.

We will apply the recently learned NLP techniques and see what they bring to the table.


| Approach | Epochs | Time (sec) | Train Accuracy (%) | Dev Accuracy (%) |
|---|---|---|---|---|
| LSTM | 10 | 250 | 82 | 80 |
| BiLSTM | 10 | 500 | 83 | 79 |
| GRU | 10 | 300 | 88 | 77 |
| CoVe | 10 | 450 | 72 | 72 |
| BERT | 3 | 500 | - | 85 |


| Approach | Epochs | Time (sec) | Train Accuracy (%) | Dev Accuracy (%) |
|---|---|---|---|---|
| LSTM | 10 | 25 | 98 | 78.8 |
| BiLSTM | 10 | 35 | 98 | 79.1 |
| GRU | 10 | 27 | 92 | 79.3 |
| BERT | 3 | 600 | - | 85.03 |

How can I not try the open GPT-2 language generation model?

Here are the results with random seeds:

Sample Text 1

Take me to the QB!

I came pregnant.

"I'm sure kids will be learning that to be a tad boy."<|endoftext|>Embed This Video On Your Site With This Html: Copy Embed code

<iframe src="http://www.youjizz.com/videos/embed/242203" frameborder="0" style="width:100%; height:570px;" scrolling="no" allowtransparency="true"></iframe><|endoftext|>As president-elect Donald Trump takes his dismantling of Barack Obama's health care law to the White House, it's hoped the good will of both generations won't override Kansas' ready-made attorney general. Getting somebody to poke fun at Trump for not keeping things on her from drawing up a repeal budget, following that up with a lame attempt by Sen. Susan Collins, R-Maine, to slash federal penalties on consumers who'd pay more for insurance through a Medicaid replacement, won't be a serious blow.

That may be largely because it works; reaching a single-payer, single-payer healthcare system under Trump's new administration is not much of a magic bullet, and will make us reluctant to defund Planned Parenthood (which is what the Trump administration and the GOP are taking away as a price to pay for shutting it down), but Republicans weigh in as a country trying to find an adequate replacement.

As Chairman of the Joint Committee on Taxation won't the CBO represent, Trump decides, it would give his political opponents less power to make legitimate criticism of AHCA available, or to oppose repeal legislation without sharing the pieces. "Nobody can refuse to make it available. I mean, let's take a look here," said Senate Minority Leader Chuck Schumer, C-N.Y. (four of the sixteen Democratic senators who voted to repeal tax credits for having children care for themselves or their parents, by making free outreach to Russian entertainment hit shows about women injured in Syria). "To me that means that's another piece of legislation from the Oval Office devoted exclusively to people who feel like we're firing them immediately."

Senators are not Republican only because they represent nothing less than sovereignty on their party's national stage, but because they can break through entrenched partisan historical convictions in one moment with one substantive change they want to see voted upon in the next, and then repeat it because that moment feels unreal and basic.

Now this is my heart, because every unelected bit of Republican leadership in the United States has a

Sample Text 2

[sniffs] I mean Christian, you know, those Christians, we have a holy book that is doctrinal moderate ... like a god? I think that it's time to come out and say like, wow, I just also see the long sweep animosity some people have toward Rule 2. . . but Franciscans also are notoriously violent there and they intend to be in charge at the end of each put. God is the judge of every place."

The Patriot Leader has a tone similar to that of Rev. Carl Nelson's."Let me ask, is a right to exhibit racial and religious symbols on your property such a my God," he marveled, as one inquirer characterized those religious symbols."Jeb Talmage, you did this painting there, did you know?" asked Gagnon.

Through a crack at Trump for his response, Fitzgerald efficient rebuked; not once did Fitzgerald otherwise react. More snd just finished writing this post which the Journal overlooked due to focus and questionability and yet in this city transcribed the posting over 887 frustrating seconds in length and markedly. You know probably will not boo the level of the Presidential "say turn" for much money to basically pay for his utter disregard for our how we live. Not that there's anything wrong with pay-to-play or kicking Fauxouts. As a mother of one young daughter, pray for my voice. See "Before this she left to go to Gothic, I looked up the alphabet to eat cookies For my Christian a Werewolf Wolf, and my Dog ; quote: Heaven my prayer

I saw the scene these moronchildren were living in,

they had room to spare from their loves.

I couldn't make any conversation like a Walter White subject... Reject madmen of the City of Independence. [[End Post, 5/17/08] Hearing what Fitzgerald Barbara is saying, began to seem superfluous. Fitzgerald burst into a Googling of white people's black "hypocrisy" and found herself empowered with an insight understanding that the principle of "black roots arrogant of white" must be Austin McKetty's's vision of white supremacy, when it was the preeminence of a man and its map to totalitarianism, its sensibilities, its pleading what could be called "impartiality to all our problems" and the above alienation of white people. Within and of the Gentlemen's splashy rendition of the words, The Advocate adapted this core reading to help these white nationalist bast

Mind blowing 🤯


| Approach | Epochs | Time (min) | Train loss | Dev loss | Dev Accuracy (%) |
|---|---|---|---|---|---|
| Finetune LM | 15 | 6 | 3.575478 | 4.021957 | 26.4607 |
| Finetune Classifier | 5 | 2 | 0.786838 | 0.658620 | 72.4479 |
| Gradual Unfreezing (Last 1 layer) | 5 | 2 | 0.725324 | 0.590953 | 75.2134 |
| Gradual Unfreezing (Last 2 layer) | 5 | 3 | 0.556359 | 0.486604 | 81.2564 |
| Unfreeze whole and train | 8 | 7 | 0.474538 | 0.446159 | 82.9293 |
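The later rows of the table above follow a gradual unfreezing schedule: train the last layer group first, then progressively make earlier groups trainable until the whole model is unfrozen. A minimal sketch of such a schedule, with made-up layer group names and no actual training:

```python
def gradual_unfreezing_schedule(layer_groups):
    """Yield, stage by stage, the layer groups that are trainable.

    Stage 1 trains only the last group, stage 2 the last two, and so on
    until every group is unfrozen.
    """
    for n_unfrozen in range(1, len(layer_groups) + 1):
        yield layer_groups[-n_unfrozen:]

# Hypothetical layer groups of a language-model-based classifier.
groups = ["embedding", "rnn_1", "rnn_2", "classifier_head"]
stages = list(gradual_unfreezing_schedule(groups))
# stages[0] -> ['classifier_head']
# stages[1] -> ['rnn_2', 'classifier_head']
```

Training each stage for a few epochs before unfreezing more lets the randomly initialized head settle before the pretrained layers start to move.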

Next, we will move back to vision and look at a very serious problem in the next post, Mystery of Adversarial Learning.

Happy Learning!

Note: Caveats on terminology

Power of Transfer Learning - Transfer Learning

NLP - Natural Language Processing

POS - Part-of-speech

NER - Named Entity Recognition

Power of Transfer Learning in NLP - Transfer Learning in NLP

CoVe - Contextualized Word Vectors

ELMo - Embeddings from Language Models

ULMFiT - Universal Language Model Finetuning

GPT - Generative Pre-trained Transformer

BERT - Bidirectional Encoder Representations from Transformers

ERNIE - Enhanced Representation through kNowledge IntEgration

Further Reading

Must Read! Awesome Lil’Log Generalized Language Models

Must Read! Sebastian Ruder NLP’s ImageNet moment has arrived and 10 Exciting Ideas of 2018 in NLP

Must Read! Dissecting BERT Part 1: The Encoder, Understanding BERT Part 2: BERT Specifics and Dissecting BERT Appendix: The Decoder

Must Read! Very cool visualizations by Jay Alammar on Illustrated bert, Illustrated Transformer and Visualizing A Neural Machine Translation Model

Must Read! mlexplained.com by keitakurita Awesome Paper Dissected BERT, ELMo and Attention Is All You Need

Must Read! Harvard NLP The Annotated Transformer

Salesforce Research Blog: CoVe



ELMo blog by Masato Hagiwara


Fastai blog on ULMFiT

Attention is All You Need

Semi-supervised Sequence Learning

Byte Pair Encoding


OpenAI GPT Blog


Google AI blog BERT


OpenAI GPT-2 Blog

Can you compare perplexity across different segmentations?

Footnotes and Credits

Kaggle Dataset Tweets Sentiment Analysis

Star Wars gif

Meme src

CoVe decoder, attention, generator, results






ULMFiT SLR, results

GPT Transformer, results

BERT Transformer, Inputs, results

GPT-2 examples

Karen Hao’s post in MIT Technology Review Four different philosophies of language

