MLENGram

class numpy_ml.ngram.MLENGram(N, unk=True, filter_stopwords=True, filter_punctuation=True)[source]
A simple, unsmoothed N-gram model.
Parameters:
  N (int) – The maximum length (in words) of the context window to use in the language model. The model will compute all n-grams from 1, …, N.
  unk (bool) – Whether to include the <unk> (unknown) token in the LM. Default is True.
  filter_stopwords (bool) – Whether to remove stopwords before training. Default is True.
  filter_punctuation (bool) – Whether to remove punctuation before training. Default is True.

log_prob(words, N)[source]
Compute the log probability of a sequence of words under the unsmoothed, maximum-likelihood N-gram language model.
Parameters:
  words (list of strings) – A sequence of words.
  N (int) – The gram-size of the language model to use when calculating the log probabilities of the sequence.
Returns:
  total_prob (float) – The total log-probability of the sequence words under the N-gram language model.
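The underlying computation can be sketched as follows. This is a minimal illustration, not the library's implementation: the `mle_log_prob` helper and the flat `Counter` keyed by n-gram tuples are assumptions made for the example.

```python
from collections import Counter
from math import log

def mle_log_prob(words, N, counts):
    """Sum the MLE log-probabilities of each N-gram in `words`.

    `counts` maps n-gram tuples (and their (N-1)-gram prefixes) to corpus
    frequencies; p(w_N | w_1..w_{N-1}) = count(ngram) / count(prefix).
    """
    total = 0.0
    for i in range(len(words) - N + 1):
        ngram = tuple(words[i : i + N])
        num, den = counts[ngram], counts[ngram[:-1]]
        if num == 0 or den == 0:
            return float("-inf")  # unsmoothed: unseen N-grams have zero probability
        total += log(num / den)
    return total

# Toy corpus: tally all unigrams and bigrams
corpus = "the cat sat on the mat".split()
counts = Counter()
for n in (1, 2):
    for i in range(len(corpus) - n + 1):
        counts[tuple(corpus[i : i + n])] += 1

print(mle_log_prob(["the", "cat"], 2, counts))  # log(1/2): "the" occurs twice, "the cat" once
```

Because the model is unsmoothed, any sequence containing an unseen N-gram receives a log probability of negative infinity.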

completions(words, N)[source]
Return the distribution over proposed next words under the N-gram language model.
Parameters:
  words (list or tuple of strings) – The initial sequence of words.
  N (int) – The gram-size of the language model to use to generate completions.
Returns:
  probs (list of (word, log_prob) tuples) – The list of possible next words and their log probabilities under the N-gram language model (unsorted).
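A minimal sketch of how such a distribution can be computed from raw counts; the `completions` helper below and its Counter-based count store are assumptions for illustration, not the library's actual code.

```python
from collections import Counter
from math import log

def completions(words, N, counts, vocab):
    """Return (word, log_prob) pairs for each candidate next word.

    Scores each vocabulary word w by the MLE probability
    count(context + (w,)) / count(context), where context is the last
    N-1 words of `words`. Zero-count continuations are skipped.
    """
    context = tuple(words[-(N - 1):]) if N > 1 else ()
    den = counts[context]
    probs = []
    for w in vocab:
        num = counts[context + (w,)]
        if num > 0 and den > 0:
            probs.append((w, log(num / den)))
    return probs

corpus = "the cat sat on the mat".split()
counts = Counter()
for n in (1, 2):
    for i in range(len(corpus) - n + 1):
        counts[tuple(corpus[i : i + n])] += 1

print(completions(["on"], 2, counts, set(corpus)))  # [('the', 0.0)]: "on" is always followed by "the"
```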

cross_entropy(words, N)[source]
Calculate the model cross-entropy on a sequence of words against the empirical distribution of the words in a sample.
Notes
The model cross-entropy, H, is defined as
\[H(W) = -\frac{\log p(W)}{n}\]
where \(W = [w_1, \ldots, w_k]\) is a sequence of words, and n is the number of N-grams in W.
The model cross-entropy is proportional (not equal, since we use base e) to the average number of bits necessary to encode W under the model distribution.
Parameters:
  words (list or tuple of strings) – The sequence of words to compute the cross-entropy on.
  N (int) – The gram-size of the language model to use when calculating the log probabilities of the sequence.
Returns:
  H (float) – The model cross-entropy for the words in words.

generate(N, seed_words=['<bol>'], n_sentences=5)[source]
Use the N-gram language model to generate sentences.
Parameters:
  N (int) – The gram-size of the language model to use to generate sentences.
  seed_words (list of strings) – A list of seed words used to condition sentence generation. Default is ['<bol>'].
  n_sentences (int) – The number of sentences to generate. Default is 5.
Returns:
  sentences (str) – Samples from the N-gram model, joined by white spaces, with individual sentences separated by newlines.
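Generation amounts to repeatedly sampling the next word from the conditional MLE distribution. The sketch below illustrates that loop under the same assumed Counter-based count store as above; the `generate_sentence` helper and its parameters are hypothetical, not the library's API.

```python
import random
from collections import Counter

def generate_sentence(N, counts, vocab, seed, max_len=10, rng=None):
    """Sample words one at a time from the MLE N-gram distribution,
    conditioning each draw on the previous N-1 words."""
    rng = rng or random.Random(0)
    words = list(seed)
    for _ in range(max_len):
        context = tuple(words[-(N - 1):]) if N > 1 else ()
        # Candidate continuations are the words seen after this context in training
        cands = [(w, counts[context + (w,)]) for w in vocab if counts[context + (w,)] > 0]
        if not cands:
            break  # dead end: this context never appeared in the corpus
        ws, cs = zip(*cands)
        words.append(rng.choices(ws, weights=cs)[0])
    return " ".join(words)

corpus = "the cat sat on the mat".split()
counts = Counter()
for n in (1, 2):
    for i in range(len(corpus) - n + 1):
        counts[tuple(corpus[i : i + n])] += 1

print(generate_sentence(2, counts, set(corpus), ["the"]))
```

Every sampled word is drawn in proportion to its bigram count after the current context, so the output always consists of word pairs observed in training.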

perplexity(words, N)[source]
Calculate the model perplexity on a sequence of words.
Notes
Perplexity, PP, is defined as
\[PP(W) = \left( \frac{1}{p(W)} \right)^{1 / n}\]
or simply
\[\begin{split}PP(W) &= \exp(-\log p(W) / n) \\ &= \exp(H(W))\end{split}\]
where \(W = [w_1, \ldots, w_k]\) is a sequence of words, H(W) is the cross-entropy of W under the current model, and n is the number of N-grams in W.
Minimizing perplexity is equivalent to maximizing the probability of the words under the N-gram model. It may also be interpreted as the average branching factor when predicting the next word under the language model.
Parameters:
  words (list or tuple of strings) – The sequence of words to compute perplexity on.
  N (int) – The gram-size of the language model to use when calculating the log probabilities of the sequence.
Returns:
  perplexity (float) – The model perplexity for the words in words.
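The relation PP(W) = exp(H(W)) can be verified numerically with a short sketch. As before, the helpers and Counter-based count store are assumptions for illustration only.

```python
from collections import Counter
from math import exp, log

def cross_entropy(words, N, counts):
    """H(W) = -log p(W) / n, where n is the number of N-grams in W."""
    n = len(words) - N + 1
    log_p = 0.0
    for i in range(n):
        ngram = tuple(words[i : i + N])
        log_p += log(counts[ngram] / counts[ngram[:-1]])
    return -log_p / n

def perplexity(words, N, counts):
    """PP(W) = exp(H(W))."""
    return exp(cross_entropy(words, N, counts))

corpus = "the cat sat on the mat".split()
counts = Counter()
for n in (1, 2):
    for i in range(len(corpus) - n + 1):
        counts[tuple(corpus[i : i + n])] += 1

# "the" is followed by "cat" or "mat" equally often, so the model is
# effectively choosing between two continuations: branching factor 2.
print(perplexity(["the", "cat"], 2, counts))  # 2.0
```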

train(corpus_fp, vocab=None, encoding=None)[source]
Compile the n-gram counts for the text(s) in corpus_fp.
Notes
After running train, the self.counts attribute will store dictionaries of the N, N-1, …, 1-gram counts.
Parameters:
  corpus_fp (str) – The path to a newline-separated text corpus file.
  vocab (Vocabulary instance or None) – If not None, only the words in vocab will be used to construct the language model; all out-of-vocabulary words will either be mapped to <unk> (if self.unk = True) or removed (if self.unk = False). Default is None.
  encoding (str or None) – Specifies the text encoding for the corpus. Common entries are 'utf-8', 'utf-8-sig', 'utf-16'. Default is None.
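The count-compilation step can be sketched as follows. This simplified stand-in skips vocabulary mapping and tokenization details; `train_counts` and its `{n: Counter}` return shape are assumptions for the example, not the structure of the library's self.counts attribute.

```python
from collections import Counter

def train_counts(corpus_fp, N, encoding=None):
    """Compile n-gram counts for n = 1, ..., N from a newline-separated
    corpus file. Returns a dict mapping each n to a Counter of n-gram tuples."""
    counts = {n: Counter() for n in range(1, N + 1)}
    with open(corpus_fp, encoding=encoding) as f:
        for line in f:
            words = line.strip().split()
            # Tally every n-gram of every order up to N in this line
            for n in range(1, N + 1):
                for i in range(len(words) - n + 1):
                    counts[n][tuple(words[i : i + n])] += 1
    return counts
```

N-grams are counted within lines only, matching the newline-separated corpus format described above.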