AdditiveNGram

class numpy_ml.ngram.AdditiveNGram(N, K=1, unk=True, filter_stopwords=True, filter_punctuation=True)[source]

An N-gram model with smoothed probabilities calculated via additive / Lidstone smoothing.
Notes
The resulting estimates correspond to the expected value of the posterior, p(ngram_prob | counts), when using a symmetric Dirichlet prior on counts with parameter K.
Parameters:
- N (int) – The maximum length (in words) of the context window to use in the language model. The model will compute all N-grams from 1, …, N.
- K (float) – The pseudocount to add to each observation. Larger values allocate more probability toward unseen events. When K = 1, the model is known as Laplace smoothing. When K = 0.5, the model is known as expected likelihood estimation (ELE) or the Jeffreys-Perks law. Default is 1.
- unk (bool) – Whether to include the <unk> (unknown) token in the LM. Default is True.
- filter_stopwords (bool) – Whether to remove stopwords before training. Default is True.
- filter_punctuation (bool) – Whether to remove punctuation before training. Default is True.
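To make the role of K concrete, here is a hypothetical standalone helper (not part of numpy_ml) computing the Lidstone estimate (count + K) / (total + K·V) for a single unseen event:

```python
# Hypothetical helper (not numpy_ml's API): the Lidstone-smoothed
# probability estimate (count + K) / (total + K * V).
def additive_estimate(count, total, vocab_size, K):
    return (count + K) / (total + K * vocab_size)

# Probability of an unseen event (count = 0) after 100 observations
# over a 50-word vocabulary, under two common choices of K:
laplace = additive_estimate(0, 100, 50, K=1.0)  # Laplace smoothing
ele = additive_estimate(0, 100, 50, K=0.5)      # Jeffreys-Perks / ELE
```

As the parameter description says, the larger K (Laplace) allocates more probability to the unseen event than the smaller K (ELE).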

log_prob(words, N)[source]

Compute the smoothed log probability of a sequence of words under the N-gram language model with additive smoothing.
Notes
For a bigram, additive smoothing amounts to:

\[P(w_i \mid w_{i-1}) = \frac{A + K}{B + KV}\]

where

\[\begin{split}A &= \text{Count}(w_{i-1}, w_i) \\ B &= \sum_j \text{Count}(w_{i-1}, w_j) \\ V &= |\{ w_j \ : \ \text{Count}(w_{i-1}, w_j) > 0 \}|\end{split}\]

This is equivalent to pretending we've seen every possible N-gram sequence at least K times.
Additive smoothing can be problematic, as it:
- Treats each predicted word in the same way
- Can assign too much probability mass to unseen N-grams

Parameters:
- words (list of strings) – A sequence of words.
- N (int) – The gram-size of the language model to use when calculating the log probabilities of the sequence.

Returns: total_prob (float) – The total log-probability of the sequence words under the N-gram language model.
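As an illustration, the A, B, and V quantities above can be computed by hand on a toy corpus. The following is a hedged, self-contained re-implementation of the bigram case, not numpy_ml's internal code:

```python
import math
from collections import Counter

# Toy corpus and its bigram counts.
corpus = "the cat sat on the mat the cat ran".split()
bigram_counts = Counter(zip(corpus, corpus[1:]))
K = 1.0

def bigram_log_prob(words):
    """Additive-smoothed log probability of `words` under a bigram model."""
    total = 0.0
    for prev, cur in zip(words, words[1:]):
        A = bigram_counts[(prev, cur)]  # Count(w_{i-1}, w_i)
        followers = {w: c for (p, w), c in bigram_counts.items() if p == prev}
        B = sum(followers.values())     # sum_j Count(w_{i-1}, w_j)
        V = len(followers)              # distinct observed continuations
        total += math.log((A + K) / (B + K * V))
    return total
```

For example, "the" is followed by "cat" twice and "mat" once in the toy corpus, so P(cat | the) = (2 + 1) / (3 + 1·2) = 0.6, while the unseen continuation "ran" gets (0 + 1) / (3 + 1·2) = 0.2.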

completions(words, N)[source]

Return the distribution over proposed next words under the N-gram language model.

Parameters:
- words (list of strings) – The sequence of words to condition on when proposing the next word.
- N (int) – The gram-size of the language model to use when generating proposals.

Returns: probs (list of (word, log_prob) tuples) – The list of possible next words and their log probabilities under the N-gram language model (unsorted).

cross_entropy(words, N)[source]

Calculate the model cross-entropy on a sequence of words against the empirical distribution of words in a sample.

Notes

Model cross-entropy, H, is defined as

\[H(W) = -\frac{\log p(W)}{n}\]

where \(W = [w_1, \ldots, w_k]\) is a sequence of words, and n is the number of N-grams in W.

The model cross-entropy is proportional (not equal, since we use base e) to the average number of bits necessary to encode W under the model distribution.

Parameters:
- words (list of strings) – A sequence of words.
- N (int) – The gram-size of the language model to use when calculating the cross-entropy.

Returns: H (float) – The model cross-entropy for the words in words.
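Given a total log-probability for a sequence (e.g. from log_prob), the definition of H reduces to a one-liner. A hypothetical sketch, not numpy_ml's implementation:

```python
import math

# H(W) = -log p(W) / n, where n is the number of N-grams in W.
def cross_entropy(total_log_prob, n_ngrams):
    return -total_log_prob / n_ngrams

# A sequence with total probability 0.25 spread over 2 N-grams
# has cross-entropy -log(0.25) / 2 = log(2) nats.
H = cross_entropy(math.log(0.25), 2)
```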

generate(N, seed_words=['<bol>'], n_sentences=5)[source]

Use the N-gram language model to generate sentences.

Parameters:
- N (int) – The gram-size of the language model to use when generating sentences.
- seed_words (list of strings) – A list of seed words used to condition the start of each generated sentence. Default is ['<bol>'].
- n_sentences (int) – The number of sentences to generate. Default is 5.

Returns: sentences (str) – Samples from the N-gram model, joined by white spaces, with individual sentences separated by newlines.

perplexity(words, N)[source]

Calculate the model perplexity on a sequence of words.

Notes

Perplexity, PP, is defined as

\[PP(W) = \left( \frac{1}{p(W)} \right)^{1 / n}\]

or simply

\[\begin{split}PP(W) &= \exp(-\log p(W) / n) \\ &= \exp(H(W))\end{split}\]

where \(W = [w_1, \ldots, w_k]\) is a sequence of words, H(W) is the cross-entropy of W under the current model, and n is the number of N-grams in W.

Minimizing perplexity is equivalent to maximizing the probability of words under the N-gram model. It may also be interpreted as the average branching factor when predicting the next word under the language model.

Parameters:
- words (list of strings) – A sequence of words.
- N (int) – The gram-size of the language model to use when calculating the perplexity.

Returns: perplexity (float) – The model perplexity for the words in words.
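The exp(H(W)) identity makes the branching-factor interpretation concrete: a model that is uniformly uncertain over V candidate next words has perplexity exactly V. A minimal sketch (not numpy_ml's code):

```python
import math

# PP(W) = exp(-log p(W) / n) = exp(H(W))
def perplexity(total_log_prob, n_ngrams):
    return math.exp(-total_log_prob / n_ngrams)

# A model uniform over a 50-word vocabulary assigns each of 10
# next words probability 1/50; its perplexity is the branching
# factor, 50.
pp = perplexity(10 * math.log(1 / 50), 10)
```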

train(corpus_fp, vocab=None, encoding=None)[source]

Compile the N-gram counts for the text(s) in corpus_fp.

Notes

After running train, the self.counts attribute will store dictionaries of the N, N-1, …, 1-gram counts.

Parameters:
- corpus_fp (str) – The path to a newline-separated text corpus file.
- vocab (Vocabulary instance or None) – If not None, only the words in vocab will be used to construct the language model; all out-of-vocabulary words will either be mapped to <unk> (if self.unk = True) or removed (if self.unk = False). Default is None.
- encoding (str or None) – Specifies the text encoding for corpus. Common entries are 'utf-8', 'utf-8-sig', 'utf-16'. Default is None.
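For intuition, compiling the 1- through N-gram counts amounts to sliding a window of each size over the token stream. The following simplified sketch illustrates the idea; the real train additionally reads from a file and handles <unk> mapping and stopword/punctuation filtering:

```python
from collections import Counter

def compile_counts(tokens, N):
    """Count all 1-, ..., N-grams in a token list (simplified sketch)."""
    counts = {}
    for n in range(1, N + 1):
        grams = (tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
        counts[n] = Counter(grams)
    return counts

counts = compile_counts("a b a b c".split(), 2)
```

Here counts[1] holds unigram counts (e.g. ("a",) appears twice) and counts[2] holds bigram counts (e.g. ("a", "b") appears twice).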