Natural language processing

BytePairEncoder

class numpy_ml.preprocessing.nlp.BytePairEncoder(max_merges=3000, encoding='utf-8')

A byte-pair encoder for sub-word embeddings.
Notes

Byte-pair encoding [1][2] is a compression algorithm that iteratively replaces the most frequently occurring byte pairs in a set of documents with a new, single token. It has gained popularity as a preprocessing step for many NLP tasks due to its simplicity and expressiveness: using a base codebook of just 256 unique tokens (bytes), any string can be encoded.
References

[1] Gage, P. (1994). A new algorithm for data compression. C Users Journal, 12(2), 23-38.
[2] Sennrich, R., Haddow, B., & Birch, A. (2015). Neural machine translation of rare words with subword units. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, 1715-1725.

Parameters:
- max_merges (int) – The maximum number of byte pair merges to perform during fit. Default is 3000.
- encoding (str) – The text encoding to use when reading documents. Default is 'utf-8'.
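Independent of this class, the merge loop described in the Notes can be sketched in a few lines of plain Python. This is an illustration only (byte_pair_merges is a made-up helper name, and the real fit() additionally retains the learned codebook so transform can reuse it):

```python
from collections import Counter

def byte_pair_merges(ids, max_merges):
    """Greedily replace the most frequent adjacent token pair with a new ID."""
    ids = list(ids)
    next_id = 256  # base codebook: one token per possible byte value
    merges = {}    # maps merged pair -> new token ID
    for _ in range(max_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so merging cannot compress further
        merges[pair] = next_id
        # rewrite the sequence, replacing each occurrence of `pair`
        out, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
                out.append(next_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
        next_id += 1
    return ids, merges

# Gage's classic example: the 11-byte string compresses to 5 tokens
ids, merges = byte_pair_merges(b"aaabdaaabac", max_merges=10)
```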
fit(corpus_fps, encoding='utf-8')

Train a byte pair codebook on a set of documents.

Parameters:
- corpus_fps (str or list of strs) – The filepath / list of filepaths for the document(s) used to learn the byte pair codebook.
- encoding (str) – Specifies the text encoding for the corpus. Default is 'utf-8'.
transform(text)

Transform the words in text into their byte pair encoded token IDs.

Parameters: text (str or list of N strings) – The list of strings to encode.
Returns: codes (list of N lists) – A list of byte pair token IDs for each of the N strings in text.

Examples
>>> B = BytePairEncoder(max_merges=100).fit("./example.txt")
>>> encoded_tokens = B.transform("Hello! How are you 😁 ?")
>>> encoded_tokens
[[72, 879, 474, ...]]
inverse_transform(codes)

Transform an encoded sequence of byte pair codeword IDs back into human-readable text.

Parameters: codes (list of N lists) – A list of N lists. Each sublist is a collection of integer byte-pair token IDs representing a particular text string.
Returns: text (list of N strings) – The decoded strings corresponding to the N sublists in codes.

Examples
>>> B = BytePairEncoder(max_merges=100).fit("./example.txt")
>>> encoded_tokens = B.transform("Hello! How are you 😁 ?")
>>> encoded_tokens
[[72, 879, 474, ...]]
>>> B.inverse_transform(encoded_tokens)
["Hello! How are you 😁 ?"]
HuffmanEncoder

class numpy_ml.preprocessing.nlp.HuffmanEncoder

fit(text)

Build a Huffman tree for the tokens in text and compute each token's binary encoding.
Notes
In a Huffman code, tokens that occur more frequently are (generally) represented using fewer bits. Huffman codes produce the minimum expected codeword length among all methods for encoding tokens individually.
Huffman codes correspond to paths through a binary tree, with 1 corresponding to “move right” and 0 corresponding to “move left”. In contrast to standard binary trees, the Huffman tree is constructed from the bottom up. Construction begins by initializing a min-heap priority queue consisting of each token in the corpus, with priority corresponding to the token frequency. At each step, the two most infrequent tokens in the corpus are removed and become the children of a parent pseudotoken whose “frequency” is the sum of the frequencies of its children. This new parent pseudotoken is added to the priority queue and the process is repeated until only a single root node remains.
Parameters: text (list of strs or Vocabulary instance) – The tokenized text or a pretrained Vocabulary object to use for building the Huffman code.
transform(text)

Transform the words in text into their Huffman-code representations.

Parameters: text (list of N strings) – The list of words to encode.
Returns: codes (list of N binary strings) – The encoded words in text.
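The bottom-up construction described in the Notes can be sketched with the standard-library heapq module. This is a self-contained illustration, not the class's internals (huffman_codes is a made-up helper name):

```python
import heapq
from collections import Counter

def huffman_codes(tokens):
    """Return a binary code string per token; frequent tokens get shorter codes."""
    freqs = Counter(tokens)
    # Min-heap of (frequency, tiebreak, node). A node is either a token
    # string (leaf) or a (left, right) tuple (internal pseudotoken).
    heap = [(f, i, tok) for i, (tok, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    uid = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)   # the two least frequent nodes
        f2, _, right = heapq.heappop(heap)  # become children of a new
        heapq.heappush(heap, (f1 + f2, uid, (left, right)))  # pseudotoken
        uid += 1
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")  # 0 = move left
            walk(node[1], prefix + "1")  # 1 = move right
        else:
            codes[node] = prefix or "0"  # degenerate single-token corpus
    walk(heap[0][2], "")
    return codes

codes = huffman_codes(["the"] * 5 + ["cat"] * 2 + ["sat"] + ["mat"])
```

Here the most frequent token ("the") receives a 1-bit code, while the rare tokens receive 3-bit codes, and no code is a prefix of another.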
TFIDFEncoder

class numpy_ml.preprocessing.nlp.TFIDFEncoder(vocab=None, lowercase=True, min_count=0, smooth_idf=True, max_tokens=None, input_type='files', filter_stopwords=True, filter_punctuation=True, tokenizer='words')

An object for compiling and encoding the term-frequency inverse-document-frequency (TF-IDF) representation of the tokens in a text corpus.
Notes

TF-IDF is intended to reflect how important a word is to a document in a collection or corpus. For a word token w in a document d, and a corpus \(D = \{d_1, \ldots, d_N\}\), we have:

\[\begin{split}\text{TF}(w, d) &= \text{num. occurrences of } w \text{ in document } d \\ \text{IDF}(w, D) &= \log \frac{|D|}{|\{ d \in D: w \in d \}|}\end{split}\]

Parameters:
- vocab (Vocabulary object or list-like) – An existing vocabulary to filter the tokens in the corpus against. Default is None.
- lowercase (bool) – Whether to convert each string to lowercase before tokenization. Default is True.
- min_count (int) – Minimum number of times a token must occur in order to be included in vocab. Default is 0.
- smooth_idf (bool) – Whether to add 1 to the denominator of the IDF calculation to avoid divide-by-zero errors. Default is True.
- max_tokens (int) – Only add the max_tokens most frequent tokens that occur more than min_count to the vocabulary. If None, add all tokens that occur more than min_count. Default is None.
- input_type ({'files', 'strings'}) – If ‘files’, the sequence input to fit is expected to be a list of filepaths. If ‘strings’, the input is expected to be a list of lists, each sublist containing the raw strings for a single document in the corpus. Default is ‘files’.
- filter_stopwords (bool) – Whether to remove stopwords before encoding the words in the corpus. Default is True.
- filter_punctuation (bool) – Whether to remove punctuation before encoding the words in the corpus. Default is True.
- tokenizer ({'whitespace', 'words', 'characters', 'bytes'}) – Strategy to follow when mapping strings to tokens. The ‘whitespace’ tokenizer splits strings at whitespace characters. The ‘words’ tokenizer splits strings using a “word” regex. The ‘characters’ tokenizer splits strings into individual characters. The ‘bytes’ tokenizer splits strings into a collection of individual bytes.
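The formulas in the Notes are straightforward to reproduce directly in numpy. A minimal sketch, independent of this class (it applies smooth_idf's +1 to the IDF denominator as described above; the library's exact normalization may differ in detail):

```python
import numpy as np

docs = [["the", "cat", "sat"], ["the", "dog", "sat", "sat"], ["a", "cat"]]
vocab = sorted({w for d in docs for w in d})

# TF(w, d): raw count of token w in document d
tf = np.array([[d.count(w) for w in vocab] for d in docs], dtype=float)

# IDF(w, D) = log(|D| / |{d in D : w in d}|), with 1 added to the
# denominator (smooth_idf) to avoid division by zero
df = np.count_nonzero(tf, axis=0)
idf = np.log(len(docs) / (1.0 + df))

tfidf = tf * idf  # shape: (num documents, vocab size)
```

Tokens that appear in only one document ("dog", "a") receive a larger IDF weight than tokens spread across the corpus ("the", "sat").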
fit(corpus_seq, encoding='utf-8-sig')

Compute term frequencies and inverse document frequencies on a collection of documents.

Parameters:
- corpus_seq (str or list of strs) – The filepath / list of filepaths / raw string contents of the document(s) to be encoded, in accordance with the input_type parameter passed to the __init__() method. Each document is expected to be a string of tokens separated by whitespace.
- encoding (str) – Specifies the text encoding for the corpus if input_type is ‘files’. Common entries are either ‘utf-8’ (no header byte) or ‘utf-8-sig’ (header byte). Default is ‘utf-8-sig’.

Returns: self
transform(ignore_special_chars=True)

Generate the term-frequency inverse-document-frequency encoding of a text corpus.

Parameters: ignore_special_chars (bool) – Whether to drop columns corresponding to “<eol>”, “<bol>”, and “<unk>” tokens from the final tfidf encoding. Default is True.
Returns: tfidf (numpy array of shape (D, M [- 3])) – The encoded corpus, with each row corresponding to a single document, and each column corresponding to a token ID. The mapping between column numbers and tokens is stored in the idx2token attribute IFF ignore_special_chars is False. Otherwise, the mappings are not accurate.
Vocabulary

class numpy_ml.preprocessing.nlp.Vocabulary(lowercase=True, min_count=None, max_tokens=None, filter_stopwords=True, filter_punctuation=True, tokenizer='words')

An object for compiling and encoding the unique tokens in a text corpus.
Parameters:
- lowercase (bool) – Whether to convert each string to lowercase before tokenization. Default is True.
- min_count (int) – Minimum number of times a token must occur in order to be included in vocab. If None, include all tokens from corpus_fp in vocab. Default is None.
- max_tokens (int) – Only add the max_tokens most frequent tokens that occur more than min_count to the vocabulary. If None, add all tokens that occur more than min_count. Default is None.
- filter_stopwords (bool) – Whether to remove stopwords before encoding the words in the corpus. Default is True.
- filter_punctuation (bool) – Whether to remove punctuation before encoding the words in the corpus. Default is True.
- tokenizer ({'whitespace', 'words', 'characters', 'bytes'}) – Strategy to follow when mapping strings to tokens. The ‘whitespace’ tokenizer splits strings at whitespace characters. The ‘words’ tokenizer splits strings using a “word” regex. The ‘characters’ tokenizer splits strings into individual characters. The ‘bytes’ tokenizer splits strings into a collection of individual bytes.
filter(words, unk=True)

Filter (or replace) any word in words that is not present in Vocabulary.

Parameters:
- words (list of strs) – A list of words to filter.
- unk (bool) – Whether to replace any out-of-vocabulary words in words with the <unk> token (True) or skip them entirely (False). Default is True.

Returns: filtered (list of strs) – The list of words filtered against the words in Vocabulary.
words_to_indices(words)

Convert the words in words to their token indices. If a word is not in the vocabulary, return the index for the <unk> token.

Parameters: words (list of strs) – A list of words to convert.
Returns: indices (list of ints) – The token indices for each word in words.
indices_to_words(indices)

Convert the indices in indices to their word values. If an index is not in the vocabulary, return the <unk> token.

Parameters: indices (list of ints) – The token indices to convert.
Returns: words (list of strs) – The word strings corresponding to each token index in indices.
fit(corpus_fps, encoding='utf-8-sig')

Compute the vocabulary across a collection of documents.
Parameters: - corpus_fps (str or list of strs) – The filepath / list of filepaths for the document(s) to be encoded. Each document is expected to be encoded as newline-separated string of text, with adjacent tokens separated by a whitespace character.
- encoding (str) – Specifies the text encoding for corpus. Common entries are either ‘utf-8’ (no header byte), or ‘utf-8-sig’ (header byte). Default is ‘utf-8-sig’.
Returns: self
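The index round trip performed by words_to_indices and indices_to_words can be sketched with a plain dict and an <unk> fallback. The names below are illustrative, not the class's internals:

```python
# Build a toy token <-> index mapping with <unk> reserved at index 0.
tokens = "the cat sat on the mat".split()
token2idx = {"<unk>": 0}
for tok in tokens:
    token2idx.setdefault(tok, len(token2idx))
idx2token = {i: t for t, i in token2idx.items()}

def words_to_indices(words):
    # out-of-vocabulary words map to the <unk> index
    return [token2idx.get(w, token2idx["<unk>"]) for w in words]

def indices_to_words(indices):
    # unknown indices map back to the <unk> token
    return [idx2token.get(i, "<unk>") for i in indices]
```

For in-vocabulary words the two functions are exact inverses; anything unseen collapses to <unk>.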
ngrams

remove_stop_words

strip_punctuation

tokenize_words

tokenize_whitespace

tokenize_chars

tokenize_bytes_raw
numpy_ml.preprocessing.nlp.tokenize_bytes_raw(line, encoding='utf-8', splitter=None, **kwargs)

Convert the characters in line to a collection of bytes. Each byte is represented in decimal as an integer between 0 and 255.
Parameters: - line (str) – The string to tokenize.
- encoding (str) – The encoding scheme for the characters in line. Default is ‘utf-8’.
- splitter ({'punctuation', None}) – If ‘punctuation’, split the string at any punctuation character before encoding into bytes. If None, do not split line at all. Default is None.
Returns: bytes (list) – A list of the byte-encoded characters in line. Each item in the list is a string of space-separated integers between 0 and 255 representing the bytes encoding the characters in line.
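The behavior described above can be approximated in a few lines. This is a sketch only: tokenize_bytes_sketch is a made-up name, it uses string.punctuation as the punctuation set, and it ignores the real function's **kwargs handling:

```python
import string

def tokenize_bytes_sketch(line, encoding="utf-8", splitter=None):
    """Encode line as decimal byte values (0-255); optionally split at
    punctuation first so each chunk becomes its own byte string."""
    if splitter == "punctuation":
        chunks, cur = [], ""
        for ch in line:
            if ch in string.punctuation:
                if cur:
                    chunks.append(cur)
                chunks.append(ch)  # punctuation becomes its own chunk
                cur = ""
            else:
                cur += ch
        if cur:
            chunks.append(cur)
    else:
        chunks = [line]
    # each chunk -> a string of space-separated byte values
    return [" ".join(str(b) for b in chunk.encode(encoding)) for chunk in chunks]
```

For example, "Hi!" encodes to ["72 105 33"], or to ["72 105", "33"] when splitter='punctuation'; multi-byte characters such as "é" expand to one integer per byte ("195 169").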