GoodTuringNGram
class numpy_ml.ngram.GoodTuringNGram(N, conf=1.96, unk=True, filter_stopwords=True, filter_punctuation=True)
An N-gram model with smoothed probabilities calculated with the simple Good-Turing estimator from Gale (2001).
Parameters:
- N (int) – The maximum length (in words) of the context window to use in the language model. The model will compute all n-grams from 1, …, N.
- conf (float) – The multiplier of the standard deviation of the empirical smoothed count (the default, 1.96, corresponds to a 95% confidence interval). Controls how many datapoints are smoothed using the log-linear model.
- unk (bool) – Whether to include the <unk> (unknown) token in the LM. Default is True.
- filter_stopwords (bool) – Whether to remove stopwords before training. Default is True.
- filter_punctuation (bool) – Whether to remove punctuation before training. Default is True.
train(corpus_fp, vocab=None, encoding=None)
Compile the n-gram counts for the text(s) in corpus_fp. Upon completion, the self.counts attribute will store dictionaries of the N, N-1, …, 1-gram counts.
Parameters:
- corpus_fp (str) – The path to a newline-separated text corpus file.
- vocab (Vocabulary instance or None) – If not None, only the words in vocab will be used to construct the language model; all out-of-vocabulary words will either be mapped to <unk> (if self.unk = True) or removed (if self.unk = False). Default is None.
- encoding (str or None) – Specifies the text encoding for the corpus. Common entries are 'utf-8', 'utf-8-sig', 'utf-16'. Default is None.
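The counting step that train describes can be illustrated with a minimal, library-independent sketch. The helper name count_ngrams and the whitespace tokenization are assumptions made for illustration; the sketch does no stopword or punctuation filtering, vocabulary mapping, or sentence padding, and it is not the library's implementation.

```python
from collections import Counter


def count_ngrams(lines, N):
    """Return {n: Counter of n-gram tuples} for n = 1, ..., N.

    Hypothetical helper: tokenizes on whitespace only.
    """
    counts = {n: Counter() for n in range(1, N + 1)}
    for line in lines:
        words = line.split()
        for n in range(1, N + 1):
            # Slide a window of width n over the tokenized line.
            for i in range(len(words) - n + 1):
                counts[n][tuple(words[i:i + n])] += 1
    return counts


# Each string stands in for one line of a newline-separated corpus file.
corpus = ["the cat sat", "the cat ran"]
counts = count_ngrams(corpus, N=2)
print(counts[1][("the",)])        # unigram count of "the"
print(counts[2][("the", "cat")])  # bigram count of "the cat"
```

Keying the result by n-gram length mirrors the description of self.counts as a collection of N, N-1, …, 1-gram count dictionaries, although the actual layout of that attribute is not specified here.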
log_prob(words, N)
Compute the smoothed log probability of a sequence of words under the N-gram language model with Good-Turing smoothing.
Notes
For a bigram, Good-Turing smoothing amounts to:
\[P(w_i \mid w_{i-1}) = \frac{C^*}{\text{Count}(w_{i-1})}\]
where \(C^*\) is the Good-Turing smoothed estimate of the bigram count:
\[C^* = \frac{(c + 1) \text{NumCounts}(c + 1, 2)}{\text{NumCounts}(c, 2)}\]
where
\[\begin{split}c &= \text{Count}(w_{i-1}, w_i) \\ \text{NumCounts}(r, k) &= |\{ k\text{-gram} : \text{Count}(k\text{-gram}) = r \}|\end{split}\]
In words, the probability of an N-gram that occurs r times in the corpus is estimated by dividing up the probability mass occupied by N-grams that occur r+1 times.
For large values of r, NumCounts becomes unreliable. In this case, we compute a smoothed version of NumCounts using a power law function:
\[\log \text{NumCounts}(r) = b + a \log r\]
Under the Good-Turing estimator, the total probability assigned to unseen N-grams is equal to the relative occurrence of N-grams that appear only once.
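As a rough illustration of the formulas in these notes, the sketch below computes the adjusted count C* for each observed bigram count and the probability mass reserved for unseen bigrams. The function name good_turing_counts, the toy counts, and the fallback to the raw count when NumCounts(c + 1) is zero are assumptions made for this example; the fallback stands in for the power-law smoothing described above.

```python
from collections import Counter


def good_turing_counts(ngram_counts):
    """Map each observed count c to C* = (c + 1) * NumCounts(c + 1) / NumCounts(c)."""
    # NumCounts(r): the number of distinct n-grams that occur exactly r times.
    num_counts = Counter(ngram_counts.values())
    adjusted = {}
    for c in sorted(num_counts):
        if num_counts.get(c + 1, 0) > 0:
            adjusted[c] = (c + 1) * num_counts[c + 1] / num_counts[c]
        else:
            # Assumption for this sketch: keep the raw count when no
            # n-gram occurs c + 1 times.
            adjusted[c] = float(c)
    return adjusted, num_counts


# Toy bigram counts: three bigrams seen once, two seen twice, one seen three times.
bigram_counts = Counter({
    ("a", "b"): 1, ("b", "c"): 1, ("c", "d"): 1,
    ("a", "c"): 2, ("b", "d"): 2,
    ("a", "d"): 3,
})
adjusted, num_counts = good_turing_counts(bigram_counts)

# Mass reserved for unseen bigrams: bigrams seen once / total bigram tokens.
total_tokens = sum(bigram_counts.values())
p_unseen = num_counts[1] / total_tokens
print(adjusted, p_unseen)
```

With these toy counts, bigrams seen once are discounted from 1 to 4/3 · (2/2) scaled by the count-of-counts ratio, i.e. C* = 2 · 2 / 3, and 3 of the 10 bigram tokens' worth of mass is set aside for unseen bigrams.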
Parameters:
- words (list of strings) – A sequence of words.
- N (int) – The gram-size of the language model to use when calculating the log probabilities of the sequence.
Returns:
- total_prob (float) – The total log-probability of the sequence words under the N-gram language model.
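The power-law smoothing of NumCounts mentioned in the notes for log_prob amounts to a straight-line fit in log-log space. The sketch below is one way to do that with ordinary least squares in pure Python; the names fit_power_law and smoothed_num_counts are hypothetical, the data are synthetic, and the sketch fits the raw count-of-counts directly, omitting any preprocessing a full implementation might apply.

```python
import math


def fit_power_law(num_counts):
    """Least-squares fit of log NumCounts(r) = b + a * log r. Returns (a, b)."""
    xs = [math.log(r) for r in num_counts]
    ys = [math.log(n) for n in num_counts.values()]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    a = sxy / sxx            # slope
    b = y_mean - a * x_mean  # intercept
    return a, b


def smoothed_num_counts(r, a, b):
    """Evaluate the fitted power law at count r."""
    return math.exp(b + a * math.log(r))


# Synthetic count-of-counts following an exact power law: NumCounts(r) = 100 * r^-2.
num_counts = {r: 100.0 * r ** -2 for r in range(1, 6)}
a, b = fit_power_law(num_counts)
print(a, math.exp(b))                  # slope close to -2, scale close to 100
print(smoothed_num_counts(10, a, b))   # extrapolated value at r = 10
```

The fitted curve gives a nonzero estimate of NumCounts(r) even for values of r that never occur in the data, which is what makes the C* ratio usable for large r.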
completions(words, N)
Return the distribution over proposed next words under the N-gram language model.
Returns:
- probs (list of (word, log_prob) tuples) – The list of possible next words and their log probabilities under the N-gram language model (unsorted).
cross_entropy(words, N)
Calculate the model cross-entropy on a sequence of words against the empirical distribution of words in a sample.
Notes
Model cross-entropy, H, is defined as
\[H(W) = -\frac{\log p(W)}{n}\]
where \(W = [w_1, \ldots, w_k]\) is a sequence of words, and n is the number of N-grams in W.
The model cross-entropy is proportional (not equal, since we use base e) to the average number of bits necessary to encode W under the model distribution.
Returns:
- H (float) – The model cross-entropy for the words in words.
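The definition of H(W) in the notes above can be checked with a short worked example. The sketch assumes the per-N-gram log probabilities are already available as a list of natural-log values; the function name sequence_cross_entropy and the example probabilities are hypothetical.

```python
import math


def sequence_cross_entropy(ngram_log_probs):
    """H(W) = -log p(W) / n, where log p(W) is the sum of per-N-gram log probs."""
    n = len(ngram_log_probs)
    return -sum(ngram_log_probs) / n


# Three N-grams with probabilities 1/2, 1/4, and 1/8 (natural logs).
log_probs = [math.log(0.5), math.log(0.25), math.log(0.125)]
H_nats = sequence_cross_entropy(log_probs)

# Dividing by ln 2 converts from nats (base e) to bits (base 2).
H_bits = H_nats / math.log(2)
print(H_nats, H_bits)
```

Here the three N-grams cost 1, 2, and 3 bits respectively, so the average is about 2 bits per N-gram, and the base-e value differs from it only by the constant factor ln 2.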
generate(N, seed_words=['<bol>'], n_sentences=5)
Use the N-gram language model to generate sentences.
Returns:
- sentences (str) – Samples from the N-gram model, joined by white spaces, with individual sentences separated by newlines.
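Sentence generation of this kind can be sketched as repeated sampling from a next-word distribution until an end marker is drawn. Everything in the block below is an assumption made for illustration: the function name sample_sentence, the hand-written bigram table, the '<eol>' end-of-line token (only '<bol>' appears in the signature above), and the max_len cap.

```python
import random


def sample_sentence(next_word_probs, seed_word="<bol>", end_word="<eol>",
                    max_len=20, rng=None):
    """Sample words one at a time until end_word is drawn or max_len is reached."""
    rng = rng or random.Random(0)
    words = [seed_word]
    while words[-1] != end_word and len(words) < max_len:
        candidates = next_word_probs[words[-1]]
        choices, weights = zip(*candidates.items())
        # Draw the next word in proportion to its conditional probability.
        words.append(rng.choices(choices, weights=weights, k=1)[0])
    return " ".join(words)


# Hypothetical bigram distribution P(next word | previous word).
next_word_probs = {
    "<bol>": {"the": 1.0},
    "the": {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 1.0},
    "dog": {"sat": 1.0},
    "sat": {"<eol>": 1.0},
}

# Words joined by spaces, sentences separated by newlines.
sentences = "\n".join(
    sample_sentence(next_word_probs, rng=random.Random(i)) for i in range(3)
)
print(sentences)
```

Passing an explicit random.Random instance keeps the sampling reproducible, which is convenient when comparing generated text across runs.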
perplexity(words, N)
Calculate the model perplexity on a sequence of words.
Notes
Perplexity, PP, is defined as
\[PP(W) = \left( \frac{1}{p(W)} \right)^{1 / n}\]
or simply
\[\begin{split}PP(W) &= \exp(-\log p(W) / n) \\ &= \exp(H(W))\end{split}\]
where \(W = [w_1, \ldots, w_k]\) is a sequence of words, H(W) is the cross-entropy of W under the current model, and n is the number of N-grams in W.
Minimizing perplexity is equivalent to maximizing the probability of words under the N-gram model. It may also be interpreted as the average branching factor when predicting the next word under the language model.
Returns:
- perplexity (float) – The model perplexity for the words in words.
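The branching-factor interpretation in the notes above can be illustrated with a small sketch: a model that spreads probability uniformly over V candidate words has perplexity V. The function name perplexity_from_log_probs and the choice of V = 8 are assumptions made for this example, and the input is a list of per-N-gram natural-log probabilities.

```python
import math


def perplexity_from_log_probs(ngram_log_probs):
    """PP(W) = exp(-log p(W) / n) = exp(H(W))."""
    n = len(ngram_log_probs)
    H = -sum(ngram_log_probs) / n  # cross-entropy in nats
    return math.exp(H)


# A model that assigns probability 1/V to every N-gram in a 5-gram sequence.
V = 8
uniform_log_probs = [math.log(1.0 / V)] * 5
pp = perplexity_from_log_probs(uniform_log_probs)
print(pp)  # close to V
```

A lower perplexity than V on the same sequence would mean the model is, on average, choosing among fewer than V equally likely next words.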