AdditiveNGram

class numpy_ml.ngram.AdditiveNGram(N, K=1, unk=True, filter_stopwords=True, filter_punctuation=True)[source]

An N-Gram model with smoothed probabilities calculated via additive / Lidstone smoothing.

Notes

The resulting estimates correspond to the expected value of the posterior, p(ngram_prob | counts), when using a symmetric Dirichlet prior on counts with parameter K.

Parameters:
  • N (int) – The maximum length (in words) of the context window to use in the language model. The model will compute all n-grams from 1, …, N.
  • K (float) – The pseudocount to add to each observation. Larger values allocate more probability toward unseen events. When K = 1, the model is known as Laplace smoothing. When K = 0.5, the model is known as expected likelihood estimation (ELE) or the Jeffreys-Perks law. Default is 1.
  • unk (bool) – Whether to include the <unk> (unknown) token in the LM. Default is True.
  • filter_stopwords (bool) – Whether to remove stopwords before training. Default is True.
  • filter_punctuation (bool) – Whether to remove punctuation before training. Default is True.
log_prob(words, N)[source]

Compute the smoothed log probability of a sequence of words under the N-gram language model with additive smoothing.

Notes

For a bigram, additive smoothing amounts to:

\[P(w_i \mid w_{i-1}) = \frac{A + K}{B + KV}\]

where

\[\begin{split}A &= \text{Count}(w_{i-1}, w_i) \\ B &= \sum_j \text{Count}(w_{i-1}, w_j) \\ V &= |\{ w_j \ : \ \text{Count}(w_{i-1}, w_j) > 0 \}|\end{split}\]

This is equivalent to pretending we’ve seen every possible N-gram sequence K additional times.
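The bigram formula above can be checked with a small hand computation. The counts below are toy values invented for illustration; `A`, `B`, and `V` follow the definitions in the Notes.

```python
from collections import Counter

# Hypothetical counts of words observed after the context word "the"
context_counts = Counter({"cat": 3, "dog": 2, "end": 1})

K = 1.0                             # pseudocount (K = 1 is Laplace smoothing)
A = context_counts["cat"]           # Count(w_{i-1}, w_i) = 3
B = sum(context_counts.values())    # total count of the context = 6
V = len(context_counts)             # distinct observed continuations = 3

p_seen = (A + K) / (B + K * V)      # smoothed P(cat | the) = 4/9
p_unseen = (0 + K) / (B + K * V)    # an unseen continuation still gets 1/9
```

Note that the unseen continuation receives nonzero mass, which is the point of the smoothing.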

Additive smoothing can be problematic, as it:
  • Treats every unseen event identically, regardless of how plausible it is
  • Can assign too much probability mass to unseen N-grams
Parameters:
  • words (list of strings) – A sequence of words.
  • N (int) – The gram-size of the language model to use when calculating the log probabilities of the sequence.
Returns:

total_prob (float) – The total log-probability of the sequence words under the N-gram language model.
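The log probability of a sequence is the sum of the smoothed log probabilities of its component n-grams. A minimal standalone sketch for the bigram case (a hypothetical helper, not the numpy_ml implementation):

```python
import math
from collections import Counter

def smoothed_log_prob(words, bigrams, K=1.0):
    """Additive-smoothed bigram log-probability of a word sequence.

    `bigrams` maps (w_{i-1}, w_i) tuples to counts; A, B, and V follow
    the definitions in the Notes above."""
    total = 0.0
    for prev, cur in zip(words[:-1], words[1:]):
        A = bigrams[(prev, cur)]
        B = sum(c for (w1, _), c in bigrams.items() if w1 == prev)
        V = sum(1 for (w1, _) in bigrams if w1 == prev)
        total += math.log((A + K) / (B + K * V))
    return total

# Toy counts for illustration
bigrams = Counter({("a", "b"): 2, ("a", "c"): 1, ("b", "a"): 1})
lp = smoothed_log_prob(["a", "b", "a"], bigrams)  # log(3/5) + log(2/2)
```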

completions(words, N)[source]

Return the distribution over proposed next words under the N-gram language model.

Parameters:
  • words (list or tuple of strings) – The initial sequence of words
  • N (int) – The gram-size of the language model to use to generate completions
Returns:

probs (list of (word, log_prob) tuples) – The list of possible next words and their log probabilities under the N-gram language model (unsorted)
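The returned distribution assigns one smoothed log probability to each candidate next word. A toy bigram sketch of the idea (hypothetical helper; the real method also scores unseen vocabulary words via the pseudocount):

```python
import math
from collections import Counter

def bigram_completions(context, bigrams, K=1.0):
    """Unsorted (word, log_prob) pairs for observed continuations of `context`."""
    conts = {w2: c for (w1, w2), c in bigrams.items() if w1 == context}
    B, V = sum(conts.values()), len(conts)
    return [(w, math.log((c + K) / (B + K * V))) for w, c in conts.items()]

# Toy counts for illustration
bigrams = Counter({("the", "cat"): 3, ("the", "dog"): 2, ("the", "end"): 1})
probs = bigram_completions("the", bigrams)
```

Over the observed continuations, the smoothed probabilities sum to one, so the result is a proper distribution on that support.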

cross_entropy(words, N)[source]

Calculate the model cross-entropy on a sequence of words against the empirical distribution of words in a sample.

Notes

Model cross-entropy, H, is defined as

\[H(W) = -\frac{\log p(W)}{n}\]

where \(W = [w_1, \ldots, w_k]\) is a sequence of words, and n is the number of N-grams in W.

The model cross-entropy is proportional (not equal, since we use base e) to the average number of bits necessary to encode W under the model distribution.
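The definition is a one-liner to verify numerically. The values below are hypothetical:

```python
import math

# Suppose a sequence containing n = 2 bigrams has total model probability
# p(W) = 0.25 (hypothetical values for illustration).
log_p = math.log(0.25)   # total log-probability, as returned by log_prob
n = 2                    # number of N-grams in W
H = -log_p / n           # cross-entropy in nats; here H = log(2)
```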

Parameters:
  • N (int) – The gram-size of the model to calculate cross-entropy on.
  • words (list or tuple of strings) – The sequence of words to compute cross-entropy on.
Returns:

H (float) – The model cross-entropy for the words in words.

generate(N, seed_words=['<bol>'], n_sentences=5)[source]

Use the N-gram language model to generate sentences.

Parameters:
  • N (int) – The gram-size of the model to generate from
  • seed_words (list of strs) – A list of seed words to use to condition the initial sentence generation. Default is ["<bol>"].
  • n_sentences (int) – The number of sentences to generate from the N-gram model. Default is 5.
Returns:

sentences (str) – Samples from the N-gram model, joined by whitespace, with individual sentences separated by newlines.

perplexity(words, N)[source]

Calculate the model perplexity on a sequence of words.

Notes

Perplexity, PP, is defined as

\[PP(W) = \left( \frac{1}{p(W)} \right)^{1 / n}\]

or simply

\[\begin{split}PP(W) &= \exp(-\log p(W) / n) \\ &= \exp(H(W))\end{split}\]

where \(W = [w_1, \ldots, w_k]\) is a sequence of words, H(W) is the cross-entropy of W under the current model, and n is the number of N-grams in W.

Minimizing perplexity is equivalent to maximizing the probability of words under the N-gram model. It may also be interpreted as the average branching factor when predicting the next word under the language model.
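The equivalence between the two formulas above is easy to confirm with hypothetical numbers:

```python
import math

# Suppose a sequence containing n = 3 trigrams has total model probability
# p(W) = 1/8 (hypothetical values for illustration).
log_p = math.log(1 / 8)
n = 3

H = -log_p / n                       # cross-entropy in nats
PP = math.exp(H)                     # perplexity via exp(H(W))
PP_direct = (1 / (1 / 8)) ** (1 / n) # perplexity via (1 / p(W))^(1/n)
```

Both routes give PP = 2, i.e. the model is as uncertain as a fair two-way choice at each step.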

Parameters:
  • N (int) – The gram-size of the model to calculate perplexity with.
  • words (list or tuple of strings) – The sequence of words to compute perplexity on.
Returns:

perplexity (float) – The model perplexity for the words in words.

train(corpus_fp, vocab=None, encoding=None)[source]

Compile the n-gram counts for the text(s) in corpus_fp.

Notes

After running train, the self.counts attribute will store dictionaries of the N, N-1, …, 1-gram counts.

Parameters:
  • corpus_fp (str) – The path to a newline-separated text corpus file.
  • vocab (Vocabulary instance or None) – If not None, only the words in vocab will be used to construct the language model; all out-of-vocabulary words will either be mapped to <unk> (if self.unk = True) or removed (if self.unk = False). Default is None.
  • encoding (str or None) – Specifies the text encoding for corpus. Common entries are ‘utf-8’, ‘utf-8-sig’, ‘utf-16’. Default is None.
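A rough sketch of what count compilation involves, using a temporary corpus file. This is a simplified stand-in (whitespace tokenization, no vocab filtering, no <unk> mapping, no stopword or punctuation handling), not the library's train() implementation:

```python
import tempfile
from collections import Counter

def compile_counts(corpus_fp, N=2, encoding="utf-8"):
    """Compile 1- through N-gram counts from a newline-separated corpus file."""
    counts = {n: Counter() for n in range(1, N + 1)}
    with open(corpus_fp, encoding=encoding) as fh:
        for line in fh:
            toks = line.lower().split()
            for n in range(1, N + 1):
                for i in range(len(toks) - n + 1):
                    counts[n][tuple(toks[i : i + n])] += 1
    return counts

# Write a tiny two-line corpus to a temporary file for illustration
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("the cat sat\nthe dog ran\n")
    path = f.name

counts = compile_counts(path, N=2)
```

After running, counts[1] and counts[2] hold the unigram and bigram tallies, mirroring the structure described for self.counts.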