Hello, and welcome.
In this video, we will explain what you need to know in order to apply recurrent neural
networks to language modelling.
Language modelling is a gateway into many exciting deep learning applications like speech
recognition, machine translation, and image captioning.
At its simplest, language modelling is the process of assigning probabilities to sequences
of words.
So for example, a language model could analyze a sequence of words, and predict which word
is most likely to follow.
So with the sequence "This is an" which you see here, a language model might predict that
the word "example" is most likely to follow, with an 80 percent probability.
This boils down to a sequential data analysis problem.
The sequence of words forms the context, and the most recent word is the input data.
Using these two pieces of information, you need to output both a predicted word and a new context that incorporates the input word.
Recurrent neural networks are a great fit for this type of problem.
At each time step, a recurrent net can receive a word as input and the current sequence of
words as the context.
After processing, the net can then form a new context and repeat the steps until the
sentence is complete.
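As a rough sketch of that loop (the weights, sizes, and one-hot encoding below are illustrative, not taken from any particular library), one step of a simple recurrent cell looks like this:

import numpy as np

hidden_size, vocab_size = 4, 3
rng = np.random.default_rng(0)

# Illustrative weights: input-to-hidden, hidden-to-hidden, hidden-to-output
W_x = rng.normal(size=(hidden_size, vocab_size))
W_h = rng.normal(size=(hidden_size, hidden_size))
W_y = rng.normal(size=(vocab_size, hidden_size))

def rnn_step(word_onehot, context):
    # Combine the current word with the previous context to form a new context
    new_context = np.tanh(W_x @ word_onehot + W_h @ context)
    # Score every word in the vocabulary as the possible next word
    logits = W_y @ new_context
    probs = np.exp(logits) / np.sum(np.exp(logits))
    return probs, new_context

context = np.zeros(hidden_size)           # empty context at the start of a sentence
word    = np.array([1.0, 0.0, 0.0])       # one-hot vector for the current word
probs, context = rnn_step(word, context)  # repeat this step for each word in the sequence
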
The main metric for language modelling is known as perplexity.
Perplexity is a measure of how well the model is able to predict a sample.
Keep in mind that a low perplexity means the model assigns high probability to the words that actually occur, so it corresponds to greater confidence in the prediction.
So we want our model to have as low a perplexity as possible.
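Concretely, perplexity is usually computed as the exponential of the average negative log-probability the model assigns to the correct words. Here is a minimal sketch, where the probabilities are made-up examples rather than real model output:

import numpy as np

# Made-up probabilities the model assigned to each correct word in a test sentence
probs_of_correct_words = np.array([0.8, 0.5, 0.6, 0.4])

# Perplexity is the exp of the average negative log-likelihood; lower is better
perplexity = np.exp(-np.mean(np.log(probs_of_correct_words)))
print(f"perplexity = {perplexity:.2f}")
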
When it comes to actually training and testing a language model, you'll find that good datasets
are hard to come by.
Since the data points are words or sentences, the data has to be annotated, or at least
validated, by a human.
This is time-consuming and typically constrains the dataset's size.
One of the biggest datasets for language modelling is the Penn Treebank.
The Penn Treebank was created by scholars at the University of Pennsylvania.
It holds over four million annotated words, tagged with many different types of classifications.
In order to build such a large dataset, all of the words were first tagged by machines,
and then validated and corrected by humans.
The data comes from many different sources, ranging from papers published by the Department of
Energy to excerpts from the Library of America.
As we mentioned, the Penn Treebank is the go-to dataset for language modelling, and
natural language processing in general.
The Penn Treebank is versatile, but if you're only interested in predicting words rather
than meaning or part of speech, then you don't need to use the tags in the dataset.
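For example, if you have the small Penn Treebank sample that ships with NLTK installed (NLTK distributes only an excerpt of the full corpus), you can read the raw word sequence and simply ignore the part-of-speech tags:

import nltk
nltk.download("treebank")            # fetch the Penn Treebank sample bundled with NLTK
from nltk.corpus import treebank

print(treebank.tagged_words()[:3])   # e.g. [('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ',')]
print(treebank.words()[:3])          # the same words with the tags dropped
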
An interesting way to process words is through a structure known as a Word Embedding.
A word embedding is an n-dimensional vector of real numbers.
The vector is typically large, with n greater than 100.
The vector is also initialized randomly.
You can see what that might look like with the example here.
During the recurrent network's training, the vector values are updated based on the contexts
in which the word appears.
So words that are used in similar contexts end up with similar positions in the vector space.
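A minimal sketch of what that looks like in code, using a randomly initialized lookup table that training would later adjust; the vocabulary and dimension here are arbitrary choices for illustration:

import numpy as np

vocab = {"this": 0, "is": 1, "an": 2, "example": 3}
embedding_dim = 128                  # n greater than 100, as mentioned above

# Random initialization: one row of real numbers per word in the vocabulary
rng = np.random.default_rng(42)
embeddings = rng.normal(scale=0.1, size=(len(vocab), embedding_dim))

# Looking up a word's embedding is just indexing its row; during training these
# values are nudged so that words used in similar contexts end up near each other
vector_for_example = embeddings[vocab["example"]]
print(vector_for_example[:5])
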
This can be visualized by utilizing a dimensionality-reduction algorithm, such as t-SNE.
Take a look at this example here.
Words are grouped together either because they're synonyms or because they're used in similar
places within a sentence.
The words "zero" and "none" are close semantically, so it's natural for them to be close together.
And while "Italy" and "Germany" aren't synonyms, they can be interchanged in several sentences
without distorting the grammar.
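For instance, with scikit-learn you could project trained embedding vectors down to two dimensions and print or plot them; the embeddings below are just random stand-ins for trained ones:

import numpy as np
from sklearn.manifold import TSNE

words = ["zero", "none", "one", "two", "Italy", "Germany", "France", "dog"]
embeddings = np.random.default_rng(0).normal(size=(len(words), 50))  # stand-in for trained vectors

# t-SNE squeezes the 50-dimensional vectors into 2-D points we can plot;
# perplexity must be smaller than the number of points
points = TSNE(n_components=2, perplexity=3, random_state=0).fit_transform(embeddings)

for word, (x, y) in zip(words, points):
    print(f"{word:8s} -> ({x:6.2f}, {y:6.2f})")
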
By now, you should understand the theory behind Language Modelling, the importance of the
Penn Treebank, and the application of recurrent nets to language problems.
Thank you for watching this video.