# SmoothedLDA

class numpy_ml.lda.SmoothedLDA(T, **kwargs)[source]

Bases: object

A smoothed LDA model trained using collapsed Gibbs sampling. Generates posterior mean estimates for model parameters phi and theta.

Parameters:

- **T** (int) – Number of topics.

Attributes:

- **D** (int) – Number of documents.
- **N** (int) – Total number of words across all documents.
- **V** (int) – Number of unique word tokens across all documents.
- **phi** (ndarray of shape (V, T)) – The word-topic distribution.
- **theta** (ndarray of shape (D, T)) – The document-topic distribution.
- **alpha** (ndarray of shape (1, T)) – Parameter for the Dirichlet prior on the document-topic distribution.
- **beta** (ndarray of shape (V, T)) – Parameter for the Dirichlet prior on the topic-word distribution.
train(texts, tokens, n_gibbs=2000)[source]

Trains a topic model on the documents in texts.

Parameters:

- **texts** (array of length (D,)) – The training corpus represented as an array of subarrays, where each subarray corresponds to the tokenized words of a single document.
- **tokens** (array of length (V,)) – The set of unique tokens in the documents in texts.
- **n_gibbs** (int) – The number of steps to run the collapsed Gibbs sampler during training. Default is 2000.

Attributes:

- **C_wt** (ndarray of shape (V, T)) – The word-topic count matrix.
- **C_dt** (ndarray of shape (D, T)) – The document-topic count matrix.
- **assignments** (ndarray of shape (N, n_gibbs)) – The topic assignments for each word in the corpus on each Gibbs step.
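The collapsed Gibbs update that training performs can be sketched as follows. This is a minimal NumPy illustration, not the library's implementation: it assumes symmetric scalar hyperparameters `alpha` and `beta`, documents as lists of integer word ids, and omits the per-step `assignments` history.

```python
import numpy as np

def gibbs_sweep(docs, z, C_wt, C_dt, n_t, alpha, beta, rng):
    # One full sweep of collapsed Gibbs sampling: resample the topic of
    # every word token from its full conditional, keeping the count
    # matrices C_wt (V, T) and C_dt (D, T) in sync.
    V, T = C_wt.shape
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            # Remove the current assignment from all counts.
            C_wt[w, t] -= 1
            C_dt[d, t] -= 1
            n_t[t] -= 1
            # Full conditional for symmetric priors:
            # p(t | rest) ∝ (C_wt[w,t] + beta) * (C_dt[d,t] + alpha) / (n_t[t] + V*beta)
            p = (C_wt[w] + beta) * (C_dt[d] + alpha) / (n_t + V * beta)
            t = rng.choice(T, p=p / p.sum())
            # Record the new assignment and restore the counts.
            z[d][i] = t
            C_wt[w, t] += 1
            C_dt[d, t] += 1
            n_t[t] += 1

# Toy corpus: word ids over a vocabulary of V=4 tokens, T=2 topics.
rng = np.random.default_rng(0)
docs = [[0, 1, 1, 2], [2, 3, 3, 0]]
V, T = 4, 2
z = [[int(rng.integers(T)) for _ in doc] for doc in docs]

# Initialize the count matrices from the random assignments.
C_wt = np.zeros((V, T))
C_dt = np.zeros((len(docs), T))
n_t = np.zeros(T)
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        C_wt[w, z[d][i]] += 1
        C_dt[d, z[d][i]] += 1
        n_t[z[d][i]] += 1

for _ in range(20):  # n_gibbs sweeps
    gibbs_sweep(docs, z, C_wt, C_dt, n_t, alpha=0.1, beta=0.1, rng=rng)
```

Because each update removes a token's count before resampling and adds it back after, the totals in `C_wt`, `C_dt`, and `n_t` are conserved across sweeps.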
what_did_you_learn(top_n=10)[source]

Print the top_n most probable words under each topic.
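The readout this method performs can be sketched from the fitted phi matrix alone. The helper below (`top_words` is a hypothetical name, not part of the library) sorts each topic's column of phi and keeps the most probable tokens:

```python
import numpy as np

def top_words(phi, tokens, top_n=10):
    # phi is the (V, T) word-topic distribution; tokens is the length-V
    # vocabulary. For each topic, sort its column of phi in descending
    # order and keep the top_n most probable tokens.
    return [
        [tokens[i] for i in np.argsort(phi[:, t])[::-1][:top_n]]
        for t in range(phi.shape[1])
    ]

# Toy example: V=3 words, T=2 topics.
phi = np.array([[0.5, 0.1],
                [0.3, 0.2],
                [0.2, 0.7]])
tokens = ["cat", "dog", "fish"]
```

Here `top_words(phi, tokens, top_n=2)` yields `["cat", "dog"]` for topic 0 and `["fish", "dog"]` for topic 1.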

fit_params(C_wt, C_dt)[source]

Estimate phi, the word-topic distribution, and theta, the document-topic distribution.

Parameters:

- **C_wt** (ndarray of shape (V, T)) – The word-topic count matrix.
- **C_dt** (ndarray of shape (D, T)) – The document-topic count matrix.

Returns:

- **phi** (ndarray of shape (V, T)) – The word-topic distribution.
- **theta** (ndarray of shape (D, T)) – The document-topic distribution.
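The posterior mean estimates for a smoothed LDA model follow the standard formulas: smooth each count matrix with its Dirichlet prior, then normalize. A minimal NumPy sketch (not the library's code), assuming alpha of shape (1, T) and beta of shape (V, T) as above:

```python
import numpy as np

def fit_params_sketch(C_wt, C_dt, alpha, beta):
    # Posterior mean of the word-topic distribution: column t of phi is
    # a distribution over the V vocabulary words under topic t.
    phi = (C_wt + beta) / (C_wt + beta).sum(axis=0, keepdims=True)
    # Posterior mean of the document-topic distribution: row d of theta
    # is a distribution over the T topics for document d.
    theta = (C_dt + alpha) / (C_dt + alpha).sum(axis=1, keepdims=True)
    return phi, theta

# Toy counts: V=3, T=2, D=2.
C_wt = np.array([[3., 0.], [1., 2.], [0., 4.]])
C_dt = np.array([[2., 2.], [2., 2.]])
phi, theta = fit_params_sketch(C_wt, C_dt,
                               np.full((1, 2), 0.1), np.full((3, 2), 0.1))
```

Each column of phi and each row of theta sums to one, so both are valid probability distributions regardless of the raw counts.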