SmoothedLDA

class numpy_ml.lda.SmoothedLDA(T, **kwargs)[source]

Bases: object

A smoothed LDA model trained using collapsed Gibbs sampling. Generates posterior mean estimates for model parameters phi and theta.

Parameters:

T (int) – Number of topics

Variables:
  • D (int) – Number of documents
  • N (int) – Total number of words across all documents
  • V (int) – Number of unique word tokens across all documents
  • phi (ndarray of shape (V, T)) – The word-topic distribution
  • theta (ndarray of shape (D, T)) – The document-topic distribution
  • alpha (ndarray of shape (1, T)) – Parameter for the Dirichlet prior on the document-topic distribution
  • beta (ndarray of shape (V, T)) – Parameter for the Dirichlet prior on the topic-word distribution
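
For orientation, the generative story that the collapsed Gibbs sampler inverts can be sketched in a few lines of NumPy. The sizes and the symmetric prior values below are illustrative assumptions, not attributes of a trained model:

    import numpy as np

    # Illustrative sizes; a trained model stores its own T, D, and V.
    rng = np.random.default_rng(0)
    T, D, V, N_d = 3, 5, 50, 20

    alpha = np.full(T, 50.0 / T)  # symmetric document-topic prior (assumed value)
    beta = np.full(V, 0.01)       # symmetric topic-word prior (assumed value)

    phi = rng.dirichlet(beta, size=T)     # (T, V): one word distribution per topic
    theta = rng.dirichlet(alpha, size=D)  # (D, T): one topic distribution per document

    docs = []
    for d in range(D):
        z = rng.choice(T, size=N_d, p=theta[d])            # topic for each word slot
        docs.append([rng.choice(V, p=phi[t]) for t in z])  # word id drawn from that topic
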
train(texts, tokens, n_gibbs=2000)[source]

Trains a topic model on the documents in texts.

Parameters:
  • texts (array of length (D,)) – The training corpus represented as an array of subarrays, where each subarray corresponds to the tokenized words of a single document.
  • tokens (array of length (V,)) – The set of unique tokens in the documents in texts.
  • n_gibbs (int) – The number of steps to run the collapsed Gibbs sampler during training. Default is 2000.
Returns:

  • C_wt (ndarray of shape (V, T)) – The word-topic count matrix
  • C_dt (ndarray of shape (D, T)) – The document-topic count matrix
  • assignments (ndarray of shape (N, n_gibbs)) – The topic assignments for each word in the corpus on each Gibbs step.
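
A minimal usage sketch on a toy corpus follows; the corpus itself, the object-array construction for the ragged documents, and the small n_gibbs are illustrative assumptions:

    import numpy as np
    from numpy_ml.lda import SmoothedLDA

    # Toy corpus: D = 3 documents, each a subarray of token strings.
    texts = np.array(
        [["the", "cat", "sat"], ["the", "dog", "ran", "fast"], ["cat", "and", "dog"]],
        dtype=object,
    )
    tokens = np.unique(np.concatenate(texts))  # unique vocabulary, length V

    model = SmoothedLDA(T=2)
    C_wt, C_dt, assignments = model.train(texts, tokens, n_gibbs=100)
    # C_wt.shape == (V, T); C_dt.shape == (D, T); assignments.shape == (N, n_gibbs)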

what_did_you_learn(top_n=10)[source]

Prints the top_n most probable words under each topic.
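
Continuing the training sketch above, for example:

    model.what_did_you_learn(top_n=5)  # prints the 5 most probable words per topic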

fit_params(C_wt, C_dt)[source]

Estimates phi, the word-topic distribution, and theta, the document-topic distribution.

Parameters:
  • C_wt (ndarray of shape (V, T)) – The word-topic count matrix
  • C_dt (ndarray of shape (D, T)) – The document-topic count matrix
Returns:

  • phi (ndarray of shape (V, T)) – The word-topic distribution
  • theta (ndarray of shape (D, T)) – The document-topic distribution
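
For collapsed Gibbs sampling, the posterior means are the smoothed, normalized count matrices: phi[v, t] = (C_wt[v, t] + beta) / (sum_v C_wt[v, t] + V*beta), and analogously for theta over rows of C_dt. A minimal vectorized sketch, assuming scalar symmetric priors (the class uses its stored alpha and beta, and its exact smoothing may differ):

    import numpy as np

    def posterior_means(C_wt, C_dt, alpha=0.1, beta=0.01):
        # Hypothetical standalone version of fit_params with scalar symmetric priors.
        V, T = C_wt.shape
        # phi[v, t] ~ P(word v | topic t): normalize each column of the
        # smoothed word-topic counts.
        phi = (C_wt + beta) / (C_wt.sum(axis=0, keepdims=True) + V * beta)
        # theta[d, t] ~ P(topic t | document d): normalize each row of the
        # smoothed document-topic counts.
        theta = (C_dt + alpha) / (C_dt.sum(axis=1, keepdims=True) + T * alpha)
        return phi, theta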