GoodTuringNGram

class numpy_ml.ngram.GoodTuringNGram(N, conf=1.96, unk=True, filter_stopwords=True, filter_punctuation=True)[source]

An N-gram model with smoothed probabilities calculated with the simple Good-Turing estimator from Gale (2001).

Parameters:
 N (int) – The maximum length (in words) of the context window to use in the language model. The model will compute all n-grams from 1, …, N.
 conf (float) – The multiplier of the standard deviation of the empirical smoothed count (the default, 1.96, corresponds to a 95% confidence interval). Controls how many datapoints are smoothed using the log-linear model.
 unk (bool) – Whether to include the <unk> (unknown) token in the LM. Default is True.
 filter_stopwords (bool) – Whether to remove stopwords before training. Default is True.
 filter_punctuation (bool) – Whether to remove punctuation before training. Default is True.

train(corpus_fp, vocab=None, encoding=None)[source]

Compile the n-gram counts for the text(s) in corpus_fp. Upon completion the self.counts attribute will store dictionaries of the N, N-1, …, 1-gram counts.

Parameters:
 corpus_fp (str) – The path to a newline-separated text corpus file.
 vocab (Vocabulary instance or None) – If not None, only the words in vocab will be used to construct the language model; all out-of-vocabulary words will either be mapped to <unk> (if self.unk = True) or removed (if self.unk = False). Default is None.
 encoding (str or None) – Specifies the text encoding for the corpus. Common entries are ‘utf-8’, ‘utf-8-sig’, ‘utf-16’. Default is None.

log_prob(words, N)[source]

Compute the smoothed log probability of a sequence of words under the N-gram language model with Good-Turing smoothing.
Notes
For a bigram, Good-Turing smoothing amounts to:
\[P(w_i \mid w_{i-1}) = \frac{C^*}{\text{Count}(w_{i-1})}\]
where \(C^*\) is the Good-Turing smoothed estimate of the bigram count:
\[C^* = \frac{(c + 1) \text{NumCounts}(c + 1, 2)}{\text{NumCounts}(c, 2)}\]
where
\[\begin{split}c &= \text{Count}(w_{i-1}, w_i) \\ \text{NumCounts}(r, k) &= |\{ k\text{-gram} : \text{Count}(k\text{-gram}) = r \}|\end{split}\]
In words, the probability of an N-gram that occurs r times in the corpus is estimated by dividing up the probability mass occupied by N-grams that occur r+1 times.
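As a concrete illustration of the estimator (a standalone sketch with toy data, not the library's implementation), the unsmoothed Good-Turing count can be computed directly from a count-of-counts table:

```python
from collections import Counter

def good_turing_count(c, num_counts):
    """Unsmoothed Good-Turing estimate C* = (c + 1) * N_{c+1} / N_c,
    where N_r is the number of distinct n-gram types seen exactly r times."""
    return (c + 1) * num_counts[c + 1] / num_counts[c]

# Toy bigram counts: how many times each bigram was observed.
bigram_counts = Counter({
    ("the", "cat"): 1, ("the", "dog"): 1, ("a", "cat"): 1,
    ("the", "mat"): 2, ("on", "the"): 2,
    ("in", "the"): 3,
})

# N_r: number of bigram types observed exactly r times -> {1: 3, 2: 2, 3: 1}.
num_counts = Counter(bigram_counts.values())

# A bigram seen once is discounted toward the mass of twice-seen bigrams:
c_star = good_turing_count(1, num_counts)  # (1 + 1) * 2 / 3 = 4/3
```

The discount (4/3 rather than a raw count of 1, in this toy table where singletons outnumber doubletons) is what frees probability mass for unseen bigrams.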
For large values of r, NumCounts becomes unreliable. In this case, we compute a smoothed version of NumCounts using a power-law function:
\[\log \text{NumCounts}(r) = b + a \log r\]
Under the Good-Turing estimator, the total probability assigned to unseen N-grams is equal to the relative occurrence of N-grams that appear only once.
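The log-linear smoothing of NumCounts can be sketched with an ordinary least-squares fit in log-log space (a minimal illustration on made-up count-of-counts data; the library's fitting procedure may differ):

```python
import numpy as np

# Observed count-of-counts: r -> NumCounts(r) (toy values, roughly power-law).
r = np.array([1, 2, 3, 5, 10], dtype=float)
num_counts = np.array([120, 40, 18, 6, 1], dtype=float)

# Fit log NumCounts(r) = b + a * log r by least squares.
# np.polyfit returns coefficients highest-degree first: [slope a, intercept b].
a, b = np.polyfit(np.log(r), np.log(num_counts), deg=1)

def smoothed_num_counts(r):
    """Power-law estimate of NumCounts(r), usable even where the raw
    count-of-counts table is zero or noisy."""
    return np.exp(b + a * np.log(r))
```

Because the fit is defined for every r ≥ 1, the smoothed C* formula never divides by a zero NumCounts entry.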
Parameters:
 words (list of strings) – A sequence of words.
 N (int) – The gram-size of the language model to use when calculating the log probabilities of the sequence.
Returns: total_prob (float) – The total log probability of the sequence words under the N-gram language model.

completions(words, N)[source]

Return the distribution over proposed next words under the N-gram language model.

Parameters:
 words (list of strings) – A sequence of words to condition on.
 N (int) – The gram-size of the language model to use.
Returns: probs (list of (word, log_prob) tuples) – The list of possible next words and their log probabilities under the N-gram language model (unsorted).

cross_entropy(words, N)[source]

Calculate the model cross-entropy on a sequence of words against the empirical distribution of words in a sample.
Notes
Model cross-entropy, H, is defined as
\[H(W) = -\frac{\log p(W)}{n}\]
where \(W = [w_1, \ldots, w_k]\) is a sequence of words, and n is the number of N-grams in W.
The model cross-entropy is proportional (not equal, since we use base e) to the average number of bits necessary to encode W under the model distribution.
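The definition can be checked on a toy sequence (a standalone sketch, independent of the class; the per-N-gram probabilities are made up):

```python
import math

# Model log probabilities for each of the n = 4 N-grams in a toy sequence.
log_probs = [math.log(0.25), math.log(0.1), math.log(0.5), math.log(0.2)]

# log p(W) is the sum of the individual N-gram log probabilities.
log_p_W = sum(log_probs)
n = len(log_probs)

# H(W) = -log p(W) / n: the average negative log probability per N-gram.
H = -log_p_W / n
```

Since p(W) = 0.25 * 0.1 * 0.5 * 0.2 = 0.0025, this gives H = ln(400) / 4, about 1.498 nats per N-gram.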
Parameters:
 words (list of strings) – A sequence of words.
 N (int) – The gram-size of the language model to use when calculating the cross-entropy.
Returns: H (float) – The model cross-entropy for the words in words.

generate(N, seed_words=['<bol>'], n_sentences=5)[source]

Use the N-gram language model to generate sentences.

Parameters:
 N (int) – The gram-size of the language model to use for generation.
 seed_words (list of strings) – A list of seed words to condition sentence generation on. Default is ['<bol>'].
 n_sentences (int) – The number of sentences to generate. Default is 5.
Returns: sentences (str) – Samples from the N-gram model, joined by whitespace, with individual sentences separated by newlines.

perplexity(words, N)[source]

Calculate the model perplexity on a sequence of words.
Notes
Perplexity, PP, is defined as
\[PP(W) = \left( \frac{1}{p(W)} \right)^{1 / n}\]
or simply
\[\begin{split}PP(W) &= \exp(-\log p(W) / n) \\ &= \exp(H(W))\end{split}\]
where \(W = [w_1, \ldots, w_k]\) is a sequence of words, H(W) is the cross-entropy of W under the current model, and n is the number of N-grams in W.
Minimizing perplexity is equivalent to maximizing the probability of words under the N-gram model. It may also be interpreted as the average branching factor when predicting the next word under the language model.
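The equivalence of the two forms above is easy to verify numerically (a standalone sketch with a made-up sequence probability):

```python
import math

# Total model probability of a toy sequence containing n = 4 N-grams.
p_W = 0.0025
n = 4

# Cross-entropy and perplexity are linked by PP(W) = exp(H(W)).
H = -math.log(p_W) / n
pp = math.exp(H)

# Equivalent direct form: PP(W) = (1 / p(W)) ** (1 / n).
pp_direct = (1 / p_W) ** (1 / n)
```

Here PP(W) = 400 ** 0.25 ≈ 4.47: on average the model behaves as if choosing among roughly 4.5 equally likely next words.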
Parameters:
 words (list of strings) – A sequence of words.
 N (int) – The gram-size of the language model to use when calculating perplexity.
Returns: perplexity (float) – The model perplexity for the words in words.