# LDA

class numpy_ml.lda.LDA(T=10)

Vanilla (non-smoothed) LDA model trained using variational EM. Generates maximum-likelihood estimates for the model parameters alpha and beta.

Parameters:

- **T** (int) – Number of topics
- **D** (int) – Number of documents
- **N** (list of length D) – Number of words in each document
- **V** (int) – Number of unique word tokens across all documents
- **phi** (ndarray of shape (D, N[d], T)) – Variational approximation to the word-topic distribution
- **gamma** (ndarray of shape (D, T)) – Variational approximation to the document-topic distribution
- **alpha** (ndarray of shape (1, T)) – Parameter for the Dirichlet prior on the document-topic distribution
- **beta** (ndarray of shape (V, T)) – Word-topic distribution
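As a rough illustration of the parameter shapes listed above, the sketch below sets up arrays with the documented dimensions for a toy corpus. The random initializations are for illustration only and are not the library's actual initialization scheme; since phi has a ragged second dimension (N[d] varies per document), it is represented here as a list of per-document arrays:

```python
import numpy as np

# Toy dimensions: 3 documents, 4 topics, vocabulary of 6 token types
D, T, V = 3, 4, 6
N = [5, 2, 4]  # words per document (list of length D)

rng = np.random.default_rng(0)

# alpha: parameter for the Dirichlet prior on the document-topic
# distribution, shape (1, T)
alpha = rng.random((1, T))

# beta: word-topic distribution, shape (V, T); each column is a
# distribution over the vocabulary, so columns sum to 1
beta = rng.dirichlet(np.ones(V), size=T).T

# gamma: variational document-topic distribution, shape (D, T)
gamma = rng.random((D, T))

# phi: variational word-topic distribution; one (N[d], T) block per
# document, initialized uniformly over topics
phi = [np.full((N[d], T), 1 / T) for d in range(D)]

print(alpha.shape, beta.shape, gamma.shape, [p.shape for p in phi])
```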
VLB()

Return the variational lower bound associated with the current model parameters.
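For reference, the quantity being maximized is the standard evidence lower bound for vanilla LDA (as in Blei, Ng & Jordan, 2003); the exact term grouping used internally by this method is an assumption:

```latex
\mathcal{L}(\gamma, \phi;\, \alpha, \beta)
  = \mathbb{E}_q\!\left[\log p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)\right]
  - \mathbb{E}_q\!\left[\log q(\theta, \mathbf{z} \mid \gamma, \phi)\right]
```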

initialize_parameters()

Provide reasonable initializations for model and variational parameters.

train(corpus, verbose=False, max_iter=1000, tol=5)

Train the LDA model on a corpus of documents (bags of words).

Parameters:

- **corpus** (list of length D) – A list of lists, with each sublist containing the tokenized text of a single document.
- **verbose** (bool) – Whether to print the VLB at each training iteration. Default is False.
- **max_iter** (int) – The maximum number of training iterations to perform before breaking. Default is 1000.
- **tol** (int) – Break the training loop if the difference between the VLB on the current iteration and the previous iteration is less than tol. Default is 5.
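A minimal sketch of the expected corpus format: each document is a list of tokens, and D, N, and V follow from it as described above. The train call is shown as a comment because the exact token representation expected by the library (strings vs. integer ids) is an assumption here:

```python
# Three toy documents, each a sublist of tokenized text
corpus = [
    ["apple", "banana", "apple", "cherry"],
    ["banana", "cherry"],
    ["apple", "apple", "date"],
]

D = len(corpus)                                   # number of documents
N = [len(doc) for doc in corpus]                  # words per document
V = len({tok for doc in corpus for tok in doc})   # unique token types

print(D, N, V)

# Hypothetical usage (assumes numpy_ml is installed):
# from numpy_ml.lda import LDA
# model = LDA(T=5)
# model.train(corpus, verbose=True, max_iter=1000, tol=5)
```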