turicreate.text_analytics.parse_sparse

turicreate.text_analytics.parse_sparse(filename, vocab_filename)

Parse a file that’s in libSVM format. In libSVM format each line of the text file represents a document in bag of words format:

num_unique_words_in_doc word_id:count another_id:count

The word_ids use 0-based indexing, i.e. 0 corresponds to the first word in the vocabulary file.
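To make the line format concrete, here is a minimal pure-Python sketch of how a single line could be decoded. It is not the library's implementation: the helper name parse_libsvm_line and the three-word vocabulary are hypothetical, and it assumes each resulting dictionary maps a word to its count.

```python
def parse_libsvm_line(line, vocab):
    """Convert one 'n id:count id:count ...' line into a {word: count} dict.

    Illustrative helper only (not part of turicreate).
    """
    fields = line.split()
    num_unique = int(fields[0])   # leading field: number of unique words
    pairs = fields[1:]            # remaining fields: word_id:count
    if num_unique != len(pairs):
        raise ValueError("leading count does not match number of id:count pairs")
    doc = {}
    for pair in pairs:
        word_id, count = pair.split(":")
        # word_ids are 0-based positions in the vocabulary
        doc[vocab[int(word_id)]] = int(count)
    return doc


vocab = ["apple", "banana", "cherry"]  # hypothetical vocabulary, one word per id
doc = parse_libsvm_line("2 0:3 2:1", vocab)
print(doc)  # {'apple': 3, 'cherry': 1}
```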

Parameters:
filename : str

The name of the file to parse.

vocab_filename : str

The name of the vocabulary file, which lists the words used in this data set, one word per line.

Returns:
out : SArray

Each element represents a document in bag-of-words format.

Examples

If we have two documents:

1. “It was the best of times, it was the worst of times”
2. “It was the age of wisdom, it was the age of foolishness”

Then the vocabulary file might contain the unique words, with a word on each line, in the following order: it, was, the, best, of, times, worst, age, wisdom, foolishness

In this case, the file in libSVM format would have two lines:

7 0:2 1:2 2:2 3:1 4:2 5:2 6:1
7 0:2 1:2 2:2 4:2 7:2 8:1 9:1
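The encoding of the two example sentences can be checked with a short sketch. The helper to_libsvm_line is hypothetical (it is not a turicreate function), and its tokenization, lowercasing the text and keeping only runs of letters, is an assumption made for this example.

```python
import re
from collections import Counter


def to_libsvm_line(text, vocab):
    """Encode a document as 'num_unique_words id:count ...' (illustrative only)."""
    word_ids = {word: i for i, word in enumerate(vocab)}  # 0-based ids
    tokens = re.findall(r"[a-z]+", text.lower())          # assumed tokenization
    counts = Counter(word_ids[t] for t in tokens)
    pairs = ["%d:%d" % (i, c) for i, c in sorted(counts.items())]
    return " ".join([str(len(counts))] + pairs)


vocab = ["it", "was", "the", "best", "of", "times",
         "worst", "age", "wisdom", "foolishness"]
docs = [
    "It was the best of times, it was the worst of times",
    "It was the age of wisdom, it was the age of foolishness",
]
lines = [to_libsvm_line(d, vocab) for d in docs]
for line in lines:
    print(line)
# 7 0:2 1:2 2:2 3:1 4:2 5:2 6:1
# 7 0:2 1:2 2:2 4:2 7:2 8:1 9:1
```

Every word_id in the output stays within 0 to 9 because the vocabulary has ten entries.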

The following commands parse a libSVM-format file and its vocabulary file (here, hosted example files) into an SArray of type dict.

>>> import turicreate
>>> filename = 'https://static.turi.com/datasets/text/ap.dat'
>>> vocab = 'https://static.turi.com/datasets/text/ap.vocab.txt'
>>> docs = turicreate.text_analytics.parse_sparse(filename, vocab)