Bengio, Y., Ducharme, R. and Vincent, P., 2000. A neural probabilistic language model. Advances in neural information processing systems, 13.


Starting today, we are also adding an "Explain like I am five" section that explains the complex paper in simple language.


  • A fundamental problem that makes language modelling and other learning problems difficult is the curse of dimensionality.
  • The paper shows that distributed representations can encode the same information far more compactly and can also capture more complex relationships between variables. This improves the accuracy of language models and lets them capture more nuanced relationships between words.
  • Because of the curse of dimensionality, it is intrinsically difficult to learn the joint probability function of word sequences in a language: a word sequence seen at test time is likely to differ from every word sequence seen during training. Traditionally, n-gram models achieve generalization by concatenating very short overlapping sequences seen in the training set. This lets the model assign a probability to a sequence of words it has never seen before. It does so by assuming that each word depends only on the few words immediately before it, so the probability of the whole sequence is the product of these short-context conditional probabilities.
  • The idea is that, with a learned distributed representation, each training sentence can inform the model about a large number of semantically related sentences. This lets the model generalize well beyond the exact word sequences it was trained on, leading to improved performance.
  • The distributed representations of words, which are learned through the probabilistic model, are able to capture the semantic relationships between words, which allows the model to recognize the relationships between similar sentences. This, in turn, helps the model to understand the context of the sentence and make more accurate predictions even when there are limited training examples.
  • This is because neural networks learn from patterns. As they process a sequence of words that is similar to one they have seen before, they are able to recognize the pattern and assign a higher probability to that sequence. As a result, the neural network is able to generalize and make predictions about new sequences of words.
  • Experiments on two corpora, a medium one (1.2 million words) and a large one (34 million words), show that the proposed approach yields much better perplexity than a state-of-the-art method, the smoothed trigram, with differences in the order of 20% to 35%.
  • Learnings: The main result is that the neural network performs much better than the smoothed trigram. The other findings are:
  1. More context is useful.
  2. Hidden units help.
  3. Learning the word features jointly is important.
  4. Initialization is not so useful.
  5. Direct architecture works a bit better.
  6. A conditional mixture helps but even without it the neural net is better.
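
To make the n-gram baseline and the perplexity metric from the bullets above concrete, here is a minimal sketch in plain Python. It is not the paper's smoothed trigram: it uses a bigram model with add-one smoothing, and the toy sentences and all function names are assumptions chosen for illustration.

```python
import math
from collections import Counter

def train_bigram(tokens):
    """Count contexts and bigrams in a token list (toy training step)."""
    context_counts = Counter(tokens[:-1])
    bigram_counts = Counter(zip(tokens[:-1], tokens[1:]))
    vocab = sorted(set(tokens))
    return context_counts, bigram_counts, vocab

def bigram_prob(prev, word, context_counts, bigram_counts, vocab):
    """P(word | prev) with add-one smoothing (an illustrative choice)."""
    numerator = bigram_counts[(prev, word)] + 1
    denominator = context_counts[prev] + len(vocab)
    return numerator / denominator

def perplexity(tokens, context_counts, bigram_counts, vocab):
    """exp of the average negative log-probability per predicted word."""
    log_sum = 0.0
    for prev, word in zip(tokens[:-1], tokens[1:]):
        p = bigram_prob(prev, word, context_counts, bigram_counts, vocab)
        log_sum += math.log(p)
    return math.exp(-log_sum / (len(tokens) - 1))

# Hypothetical toy data, far too small for a real language model.
train = "the cat is walking in the bedroom".split()
model = train_bigram(train)

seen = perplexity("the cat is walking".split(), *model)
unseen = perplexity("the bedroom is walking".split(), *model)
print(f"perplexity on seen word pairs:   {seen:.2f}")
print(f"perplexity on an unseen pair:    {unseen:.2f}")
```

Lower perplexity means the model was less surprised by the text. The second sentence contains a word pair that never occurred in training, so the count-based model has only the smoothing term to fall back on and its perplexity goes up.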

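The neural model described above can be sketched as a single forward pass: look up a feature vector for each context word, concatenate them, pass them through a tanh hidden layer, and apply a softmax over the vocabulary. The sketch below follows the form y = b + Wx + U tanh(d + Hx), where W carries the direct connections mentioned in finding 5. The class name, layer sizes, and random untrained weights are assumptions, and training is left out entirely.

```python
import math
import random

def make_matrix(rows, cols, rng, scale=0.1):
    """Random matrix as a list of rows (weights here are untrained)."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

def matvec(matrix, vector):
    """Matrix-vector product for list-of-rows matrices."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def softmax(scores):
    """Numerically stable softmax."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

class TinyNPLM:
    """Hypothetical, untrained toy version of the model's forward pass."""

    def __init__(self, vocab, context_size=2, feat_dim=3, hidden=4, seed=0):
        rng = random.Random(seed)
        self.index = {w: i for i, w in enumerate(vocab)}
        self.context_size = context_size
        in_dim = context_size * feat_dim
        size = len(vocab)
        self.C = make_matrix(size, feat_dim, rng)  # one feature vector per word
        self.H = make_matrix(hidden, in_dim, rng)  # input -> hidden weights
        self.d = [0.0] * hidden                    # hidden biases
        self.U = make_matrix(size, hidden, rng)    # hidden -> output weights
        self.W = make_matrix(size, in_dim, rng)    # direct input -> output weights
        self.b = [0.0] * size                      # output biases

    def next_word_probs(self, context):
        """Return P(next word | context) for every word in the vocabulary."""
        assert len(context) == self.context_size
        x = []
        for word in context:
            x.extend(self.C[self.index[word]])     # concatenate feature vectors
        h = [math.tanh(a + dj) for a, dj in zip(matvec(self.H, x), self.d)]
        direct = matvec(self.W, x)
        through_hidden = matvec(self.U, h)
        y = [bi + wi + ui for bi, wi, ui in zip(self.b, direct, through_hidden)]
        return softmax(y)

vocab = ["the", "cat", "dog", "is", "walking", "running"]
model = TinyNPLM(vocab)
probs = model.next_word_probs(["the", "cat"])
print({w: round(p, 3) for w, p in zip(vocab, probs)})
```

Because the feature vectors in C are shared across all positions and learned jointly with the rest of the network, words that appear in similar contexts can end up with similar vectors, which is what lets the trained model give sensible probabilities to sequences it has never seen. With the random weights used here the output is close to uniform.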