turicreate.text_analytics.bm25

turicreate.text_analytics.bm25(dataset, query, k1=1.5, b=0.75)

For a given query and set of documents, compute the BM25 score for each document. For a query with words q_1, …, q_n, the BM25 score for a document is:

\[\sum_{i=1}^{n} \mbox{IDF}(q_i)\frac{f(q_i) \cdot (k_1+1)}{f(q_i) + k_1 \cdot (1-b+b \cdot |D|/d_{avg})}\]

where

  • \(\mbox{IDF}(q_i) = \log((N - n(q_i) + .5)/(n(q_i) + .5))\)
  • \(N\) is the number of documents in the corpus
  • \(f(q_i)\) is the number of times q_i occurs in the document
  • \(n(q_i)\) is the number of documents containing q_i
  • \(|D|\) is the number of words in the document
  • \(d_{avg}\) is the average number of words per document in the corpus
  • \(k_1\) and \(b\) are free parameters.
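
The formula above can be sketched in plain Python to make each term concrete. This is a minimal illustration, not Turi Create's implementation: the function name `bm25_scores` is hypothetical, the logarithm is assumed to be natural, documents are assumed to already be in bag-of-words dict form, and only documents containing at least one query word are scored.

```python
import math

def bm25_scores(docs, query, k1=1.5, b=0.75):
    """Hypothetical helper: map doc index -> BM25 score for bag-of-words dicts."""
    num_docs = len(docs)                              # N
    doc_lengths = [sum(d.values()) for d in docs]     # |D| for each document
    avg_length = sum(doc_lengths) / num_docs          # d_avg

    scores = {}
    for doc_id, (doc, length) in enumerate(zip(docs, doc_lengths)):
        score = 0.0
        matched = False
        for word in query:
            freq = doc.get(word, 0)                   # f(q_i)
            if freq == 0:
                continue                              # term contributes nothing
            matched = True
            doc_count = sum(1 for d in docs if word in d)   # n(q_i)
            # Natural log is an assumption of this sketch.
            idf = math.log((num_docs - doc_count + 0.5) / (doc_count + 0.5))
            length_norm = 1 - b + b * length / avg_length
            score += idf * freq * (k1 + 1) / (freq + k1 * length_norm)
        if matched:
            scores[doc_id] = score                    # skip docs with no query word
    return scores

docs = [
    {'a': 5, 'b': 7, 'c': 10},
    {'a': 3, 'c': 1, 'd': 2},
    {'a': 10, 'b': 3, 'e': 5},
    {'a': 1},
    {'f': 5},
]
print(bm25_scores(docs, ['a', 'b', 'c']))
```

Note that with this IDF definition, a word that occurs in more than half of the documents (such as 'a' here) gets a negative IDF and therefore lowers a document's score.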
Parameters:
dataset : SArray of type dict, list, or str

An SArray where each element represents a document in one of the following formats:

  • dict : a bag-of-words format, where each key is a word and each value is the number of times that word occurs in the document.
  • list : The list is converted to bag-of-words format, where the keys are the unique elements in the list and the values are the counts of those unique elements. After this step, the behaviour is identical to dict.
  • string : Behaves identically to a dict, where the dictionary is generated by converting the string into a bag-of-words format. For example, 'I really like really fluffy dogs' would get converted to {'I': 1, 'really': 2, 'like': 1, 'fluffy': 1, 'dogs': 1}.
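
The three input formats can be pictured as being normalised to one bag-of-words dict. The sketch below is only an approximation of that idea: `to_bag_of_words` is a hypothetical helper, and it assumes simple whitespace splitting with no lowercasing or punctuation handling, which may differ from the tokenization Turi Create applies.

```python
from collections import Counter

def to_bag_of_words(document):
    """Hypothetical helper: normalise a dict, list, or str into a word-count dict."""
    if isinstance(document, dict):
        return dict(document)              # already bag-of-words
    if isinstance(document, str):
        document = document.split()        # assumption: whitespace tokenization
    return dict(Counter(document))         # count each unique element

print(to_bag_of_words('I really like really fluffy dogs'))
print(to_bag_of_words(['fluffy', 'dogs', 'fluffy']))
```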
query : A list, set, or SArray of type str

A list, set or SArray where each element is a word.

k1 : float, optional

Free parameter which controls the relative importance of term frequencies. Recommended values are in the range [1.2, 2.0].

b : float, optional

Free parameter which controls how much to downweight scores for long documents. Recommended value is 0.75.
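
To see what k1 and b do, it can help to isolate the per-term factor of the score (everything except the IDF). The sketch below uses a hypothetical function `term_weight` with made-up document lengths. It illustrates the formula only and is not part of the Turi Create API.

```python
def term_weight(freq, doc_length, avg_length, k1=1.5, b=0.75):
    """Hypothetical helper: the non-IDF factor of one BM25 term."""
    length_norm = 1 - b + b * doc_length / avg_length
    return freq * (k1 + 1) / (freq + k1 * length_norm)

# Same term frequency, short vs. long document (average length 100 words).
short_doc = term_weight(3, doc_length=50, avg_length=100)
long_doc = term_weight(3, doc_length=200, avg_length=100)
print(short_doc > long_doc)    # longer documents are downweighted when b > 0

# With b=0, document length has no effect on the weight.
print(term_weight(3, 50, 100, b=0) == term_weight(3, 200, 100, b=0))

# The weight saturates: it stays below k1 + 1 however often the term occurs.
print(term_weight(10**6, 100, 100) < 1.5 + 1)
```

Larger values of k1 make the weight saturate more slowly, so repeated occurrences of a term keep adding to the score for longer.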

Returns:
out : SFrame

An SFrame containing the BM25 score for each document that contains at least one of the query words. The doc_id column is the row number of the document.

References

[BM25] “Okapi BM-25”

Examples

>>> import turicreate

>>> dataset = turicreate.SArray([
...   {'a':5, 'b':7, 'c':10},
...   {'a':3, 'c':1, 'd':2},
...   {'a':10, 'b':3, 'e':5},
...   {'a':1},
...   {'f':5}])

>>> query = ['a', 'b', 'c']
>>> turicreate.text_analytics.bm25(dataset, query)