turicreate.text_analytics.parse_docword

turicreate.text_analytics.parse_docword(filename, vocab_filename)

Parse a file that’s in “docword” format. The file begins with a 3-line header giving the document count, the vocabulary size, and the number of tokens, i.e. unique (doc_id, word_id) pairs. After the header, each line contains a space-separated triple (doc_id, word_id, frequency), where frequency is the number of times word_id occurred in document doc_id.

This format assumes that documents and words are identified by positive integers starting at 1. Thus, the first word in the vocabulary file has word_id=1.

2
272
5
1 5 1
1 105 3
1 272 5
2 1 3
…
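To make the format concrete, the following is a minimal pure-Python sketch of reading docword data into one word-count dictionary per document. It is not turicreate's implementation: the helper name parse_docword_text, the three-word vocabulary, and the sample data are all made up for illustration.

```python
from collections import defaultdict


def parse_docword_text(text, vocab):
    """Hypothetical helper: turn docword-formatted text into a list of
    {word: count} dictionaries, one per document."""
    lines = [line for line in text.splitlines() if line.strip()]

    # The 3-line header: document count, vocabulary size, number of pairs.
    num_docs, num_words, _num_pairs = (int(lines[i]) for i in range(3))
    if num_words != len(vocab):
        raise ValueError("vocabulary size does not match the header")

    docs = [defaultdict(int) for _ in range(num_docs)]
    for line in lines[3:]:
        doc_id, word_id, freq = (int(field) for field in line.split())
        # Both ids are 1-based, so shift them to index Python lists.
        docs[doc_id - 1][vocab[word_id - 1]] += freq
    return [dict(d) for d in docs]


# Illustrative data only: 2 documents, 3 vocabulary words, 4 pairs.
sample = "2\n3\n4\n1 1 2\n1 3 1\n2 2 5\n2 3 1\n"
vocab = ["apple", "banana", "cherry"]
bags = parse_docword_text(sample, vocab)
```

Here bags[0] holds the counts for document 1 and bags[1] those for document 2, with word ids already replaced by the words they index in the vocabulary.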

Parameters:
filename : str

The name of the file to parse.

vocab_filename : str

The name of the vocabulary file, which lists the words used in this data set.

Returns:
out : SArray

Each element represents a document in bag-of-words format.

Examples

>>> import turicreate
>>> textfile = 'https://static.turi.com/datasets/text/docword.nips.txt'
>>> vocab = 'https://static.turi.com/datasets/text/vocab.nips.txt'
>>> docs = turicreate.text_analytics.parse_docword(textfile, vocab)