# SmoothedLDA

class numpy_ml.lda.SmoothedLDA(T, **kwargs)[source]

Bases: object

A smoothed LDA model trained using collapsed Gibbs sampling. Generates posterior mean estimates for model parameters phi and theta.

Parameters:

- **T** (int) – Number of topics.

Attributes:

- **D** (int) – Number of documents.
- **N** (int) – Total number of words across all documents.
- **V** (int) – Number of unique word tokens across all documents.
- **phi** (ndarray of shape (V, T)) – The word-topic distribution.
- **theta** (ndarray of shape (D, T)) – The document-topic distribution.
- **alpha** (ndarray of shape (1, T)) – Parameter for the Dirichlet prior on the document-topic distribution.
- **beta** (ndarray of shape (V, T)) – Parameter for the Dirichlet prior on the topic-word distribution.
train(texts, tokens, n_gibbs=2000)[source]

Trains a topic model on the documents in texts.

Parameters:

- **texts** (array of length (D,)) – The training corpus represented as an array of subarrays, where each subarray corresponds to the tokenized words of a single document.
- **tokens** (array of length (V,)) – The set of unique tokens in the documents in texts.
- **n_gibbs** (int) – The number of steps to run the collapsed Gibbs sampler during training. Default is 2000.

Attributes:

- **C_wt** (ndarray of shape (V, T)) – The word-topic count matrix.
- **C_dt** (ndarray of shape (D, T)) – The document-topic count matrix.
- **assignments** (ndarray of shape (N, n_gibbs)) – The topic assignments for each word in the corpus on each Gibbs step.
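The collapsed Gibbs update that training performs can be sketched as follows. This is a minimal NumPy illustration, not the library's implementation: it assumes symmetric scalar hyperparameters `alpha` and `beta`, documents as lists of integer word ids, and omits the per-step `assignments` history.

```python
import numpy as np

def gibbs_sweep(docs, z, C_wt, C_dt, n_t, alpha, beta, rng):
    # One full sweep of collapsed Gibbs sampling: resample the topic of
    # every word token from its full conditional, keeping the count
    # matrices C_wt (V, T) and C_dt (D, T) in sync.
    V, T = C_wt.shape
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            # Remove the current assignment from all counts.
            C_wt[w, t] -= 1
            C_dt[d, t] -= 1
            n_t[t] -= 1
            # Full conditional for symmetric priors:
            # p(t | rest) ∝ (C_wt[w,t] + beta) * (C_dt[d,t] + alpha) / (n_t[t] + V*beta)
            p = (C_wt[w] + beta) * (C_dt[d] + alpha) / (n_t + V * beta)
            t = rng.choice(T, p=p / p.sum())
            # Record the new assignment and restore the counts.
            z[d][i] = t
            C_wt[w, t] += 1
            C_dt[d, t] += 1
            n_t[t] += 1

# Toy corpus: word ids over a vocabulary of V=4 tokens, T=2 topics.
rng = np.random.default_rng(0)
docs = [[0, 1, 1, 2], [2, 3, 3, 0]]
V, T = 4, 2
z = [[int(rng.integers(T)) for _ in doc] for doc in docs]

# Initialize the count matrices from the random assignments.
C_wt = np.zeros((V, T))
C_dt = np.zeros((len(docs), T))
n_t = np.zeros(T)
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        C_wt[w, z[d][i]] += 1
        C_dt[d, z[d][i]] += 1
        n_t[z[d][i]] += 1

for _ in range(20):  # n_gibbs sweeps
    gibbs_sweep(docs, z, C_wt, C_dt, n_t, alpha=0.1, beta=0.1, rng=rng)
```

Because each update removes a token's count before resampling and adds it back after, the totals in `C_wt`, `C_dt`, and `n_t` are conserved across sweeps.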
what_did_you_learn(top_n=10)[source]

Print the top_n most probable words under each topic.
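The readout this method performs can be sketched from the fitted phi matrix alone. The helper below (`top_words` is a hypothetical name, not part of the library) sorts each topic's column of phi and keeps the most probable tokens:

```python
import numpy as np

def top_words(phi, tokens, top_n=10):
    # phi is the (V, T) word-topic distribution; tokens is the length-V
    # vocabulary. For each topic, sort its column of phi in descending
    # order and keep the top_n most probable tokens.
    return [
        [tokens[i] for i in np.argsort(phi[:, t])[::-1][:top_n]]
        for t in range(phi.shape[1])
    ]

# Toy example: V=3 words, T=2 topics.
phi = np.array([[0.5, 0.1],
                [0.3, 0.2],
                [0.2, 0.7]])
tokens = ["cat", "dog", "fish"]
```

Here `top_words(phi, tokens, top_n=2)` yields `["cat", "dog"]` for topic 0 and `["fish", "dog"]` for topic 1.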

fit_params(C_wt, C_dt)[source]

Estimate phi, the word-topic distribution, and theta, the document-topic distribution.

Parameters:

- **C_wt** (ndarray of shape (V, T)) – The word-topic count matrix.
- **C_dt** (ndarray of shape (D, T)) – The document-topic count matrix.

Returns:

- **phi** (ndarray of shape (V, T)) – The word-topic distribution.
- **theta** (ndarray of shape (D, T)) – The document-topic distribution.
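The posterior mean estimates for a smoothed LDA model follow the standard formulas: smooth each count matrix with its Dirichlet prior, then normalize. A minimal NumPy sketch (not the library's code), assuming alpha of shape (1, T) and beta of shape (V, T) as above:

```python
import numpy as np

def fit_params_sketch(C_wt, C_dt, alpha, beta):
    # Posterior mean of the word-topic distribution: column t of phi is
    # a distribution over the V vocabulary words under topic t.
    phi = (C_wt + beta) / (C_wt + beta).sum(axis=0, keepdims=True)
    # Posterior mean of the document-topic distribution: row d of theta
    # is a distribution over the T topics for document d.
    theta = (C_dt + alpha) / (C_dt + alpha).sum(axis=1, keepdims=True)
    return phi, theta

# Toy counts: V=3, T=2, D=2.
C_wt = np.array([[3., 0.], [1., 2.], [0., 4.]])
C_dt = np.array([[2., 2.], [2., 2.]])
phi, theta = fit_params_sketch(C_wt, C_dt,
                               np.full((1, 2), 0.1), np.full((3, 2), 0.1))
```

Each column of phi and each row of theta sums to one, so both are valid probability distributions regardless of the raw counts.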