turicreate.text_analytics.parse_docword
turicreate.text_analytics.parse_docword(filename, vocab_filename)
Parse a file in “docword” format. The file begins with a 3-line header giving the document count, the vocabulary size, and the number of tokens, i.e. unique (doc_id, word_id) pairs. After the header, each line contains a space-separated triple (doc_id, word_id, frequency), where frequency is the number of times word_id occurs in document doc_id.
This format identifies documents and words by positive integers, starting at 1. Thus, the first word in the vocabulary file has word_id=1.
    2
    272
    5
    1 5 1
    1 105 3
    1 272 5
    2 1 3
    …
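To make the layout concrete, the following is a minimal pure-Python sketch of how such a file could be read into one word-count dictionary per document. It is not the Turi Create implementation: the function name `parse_docword_text`, the in-memory list of lines, and the three-word vocabulary are all hypothetical and chosen only for illustration.

```python
def parse_docword_text(docword_lines, vocab):
    """Parse docword-format lines into one {word: count} dict per document.

    Hypothetical helper for illustration; not part of turicreate.
    `vocab` is a list of words where vocab[0] corresponds to word_id=1.
    """
    lines = [ln.strip() for ln in docword_lines if ln.strip()]

    # 3-line header: document count, vocabulary size, number of (doc, word) pairs.
    num_docs, vocab_size, num_pairs = (int(x) for x in lines[:3])
    if len(vocab) != vocab_size:
        raise ValueError("vocabulary length does not match the header")

    docs = [dict() for _ in range(num_docs)]
    for line in lines[3:]:
        doc_id, word_id, freq = (int(x) for x in line.split())
        # IDs in the file are 1-based, so shift by one for Python lists.
        docs[doc_id - 1][vocab[word_id - 1]] = freq
    return docs


# Made-up sample: 2 documents, 3 vocabulary words, 4 (doc_id, word_id) pairs.
sample = ["2", "3", "4", "1 1 2", "1 3 1", "2 2 5", "2 3 1"]
vocab = ["apple", "banana", "cherry"]
docs = parse_docword_text(sample, vocab)
print(docs)
```

Each resulting dictionary is one document in bag-of-words form: keys are vocabulary words and values are their frequencies in that document.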
Parameters:
- filename : str
  The name of the file to parse.
- vocab_filename : str
  The name of the vocabulary file listing the words used in this data set.
Returns:
- out : SArray
  Each element represents a document in bag-of-words format.
Examples
>>> import turicreate
>>> textfile = 'https://static.turi.com/datasets/text/docword.nips.txt'
>>> vocab = 'https://static.turi.com/datasets/text/vocab.nips.txt'
>>> docs = turicreate.text_analytics.parse_docword(textfile, vocab)