turicreate.nearest_neighbors.create

turicreate.nearest_neighbors.create(dataset, label=None, features=None, distance=None, method='auto', verbose=True, **kwargs)

Create a nearest neighbor model, which can be searched efficiently and quickly for the nearest neighbors of a query observation. If the method argument is specified as auto, the type of model is chosen automatically based on the type of data in dataset.

Parameters:
dataset : SFrame

Reference data. If the features for each observation are numeric, they may be in separate columns of ‘dataset’ or a single column with lists of values. The features may also be in the form of a column of sparse vectors (i.e. dictionaries), with string keys and numeric values.

label : string, optional

Name of the SFrame column with row labels. If ‘label’ is not specified, row numbers are used to identify reference dataset rows when the model is queried.

features : list[string], optional

Name of the columns with features to use in computing distances between observations and the query points. ‘None’ (the default) indicates that all columns except the label should be used as features. Each column can be one of the following types:

  • Numeric: values of numeric type integer or float.
  • Array: list of numeric (integer or float) values. Each list element is treated as a separate variable in the model.
  • Dictionary: key-value pairs with numeric (integer or float) values. Each key indicates a separate variable in the model.
  • List: list of integer or string values. Each element is treated as a separate variable in the model.
  • String: string values.

Please note: if a composite distance is also specified, this parameter is ignored.

distance : string, function, or list[list], optional

Function to measure the distance between any two input data rows. This may be one of three types:

  • String: the name of a standard distance function. One of ‘euclidean’, ‘squared_euclidean’, ‘manhattan’, ‘levenshtein’, ‘jaccard’, ‘weighted_jaccard’, ‘cosine’ or ‘transformed_dot_product’.
  • Function: a function handle from the distances module.
  • Composite distance: the weighted sum of several standard distance functions applied to various features. This is specified as a list of distance components, each of which is itself a list containing three items:
    1. list or tuple of feature names (strings)
    2. standard distance name (string)
    3. scaling factor (int or float)

For more information about Turi Create distance functions, please see the distances module.

If ‘distance’ is left unspecified or set to ‘auto’, a composite distance is constructed automatically based on feature types.

method : {‘auto’, ‘ball_tree’, ‘brute_force’, ‘lsh’}, optional

Method for computing nearest neighbors. The options are:

  • auto (default): the method is chosen automatically, based on the type of data and the distance. If the distance is ‘manhattan’ or ‘euclidean’ and the features are numeric or vectors of numeric values, then the ‘ball_tree’ method is used. Otherwise, the ‘brute_force’ method is used.
  • ball_tree: use a tree structure to find the k-closest neighbors to each query point. The ball tree model is slower to construct than the brute force model, but queries are faster than linear time. This method is not applicable for the cosine or transformed_dot_product distances. See Liu, et al (2004) for implementation details.
  • brute_force: compute the distance from a query point to all reference observations. There is no computation time for model creation with the brute force method (although the reference data is held in the model, but each query takes linear time.
  • lsh: use Locality Sensitive Hashing (LSH) to find approximate nearest neighbors efficiently. The LSH model supports ‘euclidean’, ‘squared_euclidean’, ‘manhattan’, ‘cosine’, ‘jaccard’, and ‘transformed_dot_product’ distances. Two options are provided for LSH – num_tables and num_projections_per_table. See the notes below for details.
verbose: bool, optional

If True, print progress updates and model details.

**kwargs : optional

Options for the distance function and query method.

  • leaf_size: for the ball tree method, the number of points in each leaf of the tree. The default is to use the max of 1,000 and n/(2^11), which ensures a maximum tree depth of 12.
  • num_tables: For the LSH method, the number of hash tables constructed. The default value is 20. We recommend choosing values from 10 to 30.
  • num_projections_per_table: For the LSH method, the number of projections/hash functions for each hash table. The default value is 4 for ‘jaccard’ distance, 16 for ‘cosine’ distance and 8 for other distances. We recommend using number 2 ~ 6 for ‘jaccard’ distance, 8 ~ 20 for ‘cosine’ distance and 4 ~ 12 for other distances.
Returns:
out : NearestNeighborsModel

A structure for efficiently computing the nearest neighbors in ‘dataset’ of new query points.

Notes

  • Missing data is not allowed in the ‘dataset’ provided to this function. Please use the turicreate.SFrame.fillna() and turicreate.SFrame.dropna() utilities to handle missing data before creating a nearest neighbors model.
  • Missing keys in sparse vectors are assumed to have value 0.
  • If the features should be weighted equally in the distance calculations but are measured on different scales, it is important to standardize the features. One way to do this is to subtract the mean of each column and divide by the standard deviation.

Locality Sensitive Hashing (LSH)

There are several efficient nearest neighbors search algorithms that work well for data with low dimensions \(d\) (approximately 50). However, most of the solutions suffer from either space or query time that is exponential in \(d\). For large \(d\), they often provide little, if any, improvement over the ‘brute_force’ method. This is a well-known consequence of the phenomenon called The Curse of Dimensionality.

Locality Sensitive Hashing (LSH) is an approach that is designed to efficiently solve the approximate nearest neighbor search problem for high dimensional data. The key idea of LSH is to hash the data points using several hash functions, so that the probability of collision is much higher for data points which are close to each other than those which are far apart.

An LSH family is a family of functions \(h\) which map points from the metric space to a bucket, so that

  • if \(d(p, q) \leq R\), then \(h(p) = h(q)\) with at least probability \(p_1\).
  • if \(d(p, q) \geq cR\), then \(h(p) = h(q)\) with probability at most \(p_2\).

LSH for efficient approximate nearest neighbor search:

  • We define a new family of hash functions \(g\), where each function \(g\) is obtained by concatenating \(k\) functions \(h_1, ..., h_k\), i.e., \(g(p)=[h_1(p),...,h_k(p)]\). The algorithm constructs \(L\) hash tables, each of which corresponds to a different randomly chosen hash function \(g\). There are \(k \cdot L\) hash functions used in total.
  • In the preprocessing step, we hash all \(n\) reference points into each of the \(L\) hash tables.
  • Given a query point \(q\), the algorithm iterates over the \(L\) hash functions \(g\). For each \(g\) considered, it retrieves the data points that are hashed into the same bucket as q. These data points from all the \(L\) hash tables are considered as candidates that are then re-ranked by their real distances with the query data.

Note that the number of tables \(L\) and the number of hash functions per table \(k\) are two main parameters. They can be set using the options num_tables and num_projections_per_table respectively.

Hash functions for different distances:

  • euclidean and squared_euclidean: \(h(q) = \lfloor \frac{a \cdot q + b}{w} \rfloor\) where \(a\) is a vector, of which the elements are independently sampled from normal distribution, and \(b\) is a number uniformly sampled from \([0, r]\). \(r\) is a parameter for the bucket width. We set \(r\) using the average all-pair euclidean distances from a small randomly sampled subset of the reference data.
  • manhattan: The hash function of manhattan is similar with that of euclidean. The only difference is that the elements of a are sampled from Cauchy distribution, instead of normal distribution.
  • cosine: Random Projection is designed to approximate the cosine distance between vectors. The hash function is \(h(q) = sgn(a \cdot q)\), where \(a\) is randomly sampled normal unit vector.
  • jaccard: We use a recently proposed method one permutation hashing by Shrivastava and Li. See the paper [Shrivastava and Li, UAI 2014] for details.

References

Examples

Construct a nearest neighbors model with automatically determined method and distance:

>>> sf = turicreate.SFrame({'X1': [0.98, 0.62, 0.11],
...                       'X2': [0.69, 0.58, 0.36],
...                       'str_feature': ['cat', 'dog', 'fossa']})
>>> model = turicreate.nearest_neighbors.create(sf, features=['X1', 'X2'])

For datasets with a large number of rows and up to about 100 variables, the ball tree method often leads to much faster queries.

>>> model = turicreate.nearest_neighbors.create(sf, features=['X1', 'X2'],
...                                           method='ball_tree')

Often the final determination of a neighbor is based on several distance computations over different sets of features. Each part of this composite distance may have a different relative weight.

>>> my_dist = [[['X1', 'X2'], 'euclidean', 2.],
...            [['str_feature'], 'levenshtein', 3.]]
...
>>> model = turicreate.nearest_neighbors.create(sf, distance=my_dist)