Turi Create  4.0
turi::unity_sketch Class Reference

#include <ml/sketches/unity_sketch.hpp>

Public Member Functions

void construct_from_sarray (std::shared_ptr< unity_sarray_base > uarray, bool background=false, const std::vector< flexible_type > &keys={})
 
bool sketch_ready ()
 
size_t num_elements_processed ()
 
double get_quantile (double quantile)
 
double frequency_count (flexible_type value)
 
std::vector< std::pair< flexible_type, size_t > > frequent_items ()
 
double num_unique ()
 
std::map< flexible_type, std::shared_ptr< unity_sketch_base > > element_sub_sketch (const std::vector< flexible_type > &keys)
 
std::shared_ptr< unity_sketch_base > element_length_summary ()
 
std::shared_ptr< unity_sketch_base > element_summary ()
 
std::shared_ptr< unity_sketch_base > dict_key_summary ()
 
std::shared_ptr< unity_sketch_base > dict_value_summary ()
 
double mean ()
 
double max ()
 
double min ()
 
double numeric_epsilon ()
 
double sum ()
 
double var ()
 
size_t size ()
 
size_t num_undefined ()
 
void cancel ()
 

Detailed Description

Provides a query interface to a collection of statistics about an SArray accumulated via various sketching methods. The unity_sketch object contains a summary of a single SArray (a column of an SFrame). It contains sketched statistics about the Array which can be queried efficiently.

The sketch computation is fast, with complexity approximately linear in the length of the array. Once it completes, every query on the sketch returns nearly instantly.

The sketch's contents vary depending on whether the SArray is a numeric array, a non-numeric (string) array, or a container type (list/vector/dict/recursive).

This is essentially a union of a collection of sketches. Depending on the value type of the SArray, here is what is available in the sketch:

numeric type (int, float):

  • m_numeric_sketch – numeric summary like max/min/var/std/mean/quantile
  • m_discrete_sketch – discrete summary like unique items/counts/frequent_items

string type:

  • m_discrete_sketch – discrete summary like unique items/counts/frequent_items

dictionary type:

  • m_discrete_sketch – discrete summary like unique items/counts/frequent_items
  • m_dict_key_sketch – sketch summary of the flattened dictionary keys. It is a string-type sketch summary in which each key is treated as a string.
  • m_dict_value_sketch – sketch summary of the flattened dictionary values. The value type is inferred by peeking into the first 100 rows of data, which decides whether or not a numeric sketch is used.
  • m_element_sub_sketch – optional. Available only if the user explicitly asks for a sketch summary for a subset of dictionary keys. The sub sketch is of the same type as m_dict_value_sketch.

vector (array) type:

  • m_discrete_sketch – discrete summary like unique items/counts/frequent_items
  • m_element_sketch – sketch summary for all values of the vector as if the values were flattened. The element sketch is of type float (numeric sketch).
  • m_element_sub_sketch – optional. Available if the user asks for a sketch summary for certain columns of the vector value. It is a collection of sketches for the corresponding columns, and the sketch type is numeric.

list (recursive) type:

  • m_discrete_sketch – discrete summary like unique items/counts/frequent_items
  • m_element_sketch – sketch summary for all values of the list as if the values were flattened. The element sketch is of type string: all list values are converted to string and then sketched.
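The per-type breakdown above can be restated as a small lookup. This is only a sketch for orientation: the `ValueType` enum and the `available_sketches` function are hypothetical names introduced here, and the strings simply repeat the member names listed above.

```cpp
#include <string>
#include <vector>

// Hypothetical enum for this example; the library itself does not
// necessarily describe column types this way.
enum class ValueType { Numeric, String, Dict, Vector, List };

// Returns the member sketches the description above lists for each type.
// `with_sub_sketch_keys` models the optional m_element_sub_sketch, which is
// described as present only when the caller asked for keys or indices.
std::vector<std::string> available_sketches(ValueType type,
                                            bool with_sub_sketch_keys = false) {
  std::vector<std::string> out;
  switch (type) {
    case ValueType::Numeric:
      out = {"m_numeric_sketch", "m_discrete_sketch"};
      break;
    case ValueType::String:
      out = {"m_discrete_sketch"};
      break;
    case ValueType::Dict:
      out = {"m_discrete_sketch", "m_dict_key_sketch", "m_dict_value_sketch"};
      break;
    case ValueType::Vector:
    case ValueType::List:
      out = {"m_discrete_sketch", "m_element_sketch"};
      break;
  }
  // Sub sketches are documented for dict and vector columns only.
  const bool supports_sub =
      type == ValueType::Dict || type == ValueType::Vector;
  if (with_sub_sketch_keys && supports_sub) {
    out.push_back("m_element_sub_sketch");
  }
  return out;
}
```

For instance, `available_sketches(ValueType::Dict, true)` yields four entries, while a list column never reports a sub sketch under this reading of the text.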

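The following information is provided exactly:

  • length (size())
  • number of missing values (num_undefined())
  • min (min())
  • max (max())
  • mean (mean())
  • sum (sum())
  • variance (var())

And the following information is provided approximately:

  • number of unique values (num_unique())
  • quantiles (get_quantile())
  • frequent items (frequent_items())
  • frequency count of any value (frequency_count())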

For SArray of type recursive/dict/array, additional sketch information is available: a summary of the element lengths, which can be retrieved by calling element_length_summary().

For SArray of type list, there is a sketch summary for all values inside the list elements. The summary flattens all list values and sketches the flattened values; each value in the list is cast to string first. The summary can be retrieved by calling element_summary().

For SArray of type array (vector), there is a sketch summary for all values inside the vector elements. The summary flattens all vector values and sketches the flattened values. The summary can be retrieved by calling element_summary().

For SArray of type dict, additional sketch summaries over the keys and the values are provided. They can be retrieved by calling dict_key_summary() and dict_value_summary().

For SArray of type dict, the user can also pass a list of dictionary keys to the sketch_summary function, which produces one sub sketch for each key. For example:

>>> sketch = sa.sketch_summary(sub_sketch_keys=["a", "b"])

Then the sub summaries may be retrieved by:

>>> sketch.element_sub_sketch()

Or, for a subset of the keys:

>>> sketch.element_sub_sketch(["a"])

Similarly, for SArray of type vector (array), the user can pass a list of integers, each an index into the vector, to get one sub sketch per index. For example:

>>> sketch = sa.sketch_summary(sub_sketch_keys=[1,3,5])

Then the sub summaries may be retrieved by:

>>> sketch.element_sub_sketch()

Or, for a subset of the indices:

>>> sketch.element_sub_sketch([1,3])
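To make the difference between the flattened element summary and the per-index sub sketches concrete, the following sketch shows which values each one would see for a vector column. It does not call the library: `flatten` and `group_by_index` are hypothetical helpers, and skipping rows that are too short for a requested index is an assumption made here, not documented behavior.

```cpp
#include <cstddef>
#include <map>
#include <vector>

using Row = std::vector<double>;

// Flattens every row into one sequence: the input an element summary
// over all vector values would be computed from.
std::vector<double> flatten(const std::vector<Row>& rows) {
  std::vector<double> out;
  for (const Row& row : rows) out.insert(out.end(), row.begin(), row.end());
  return out;
}

// Groups values by the requested vector indices, one bucket per index:
// the input each sub sketch would be computed from. Rows too short to
// contain an index are skipped for that index (assumption of this sketch).
std::map<std::size_t, std::vector<double>> group_by_index(
    const std::vector<Row>& rows, const std::vector<std::size_t>& indices) {
  std::map<std::size_t, std::vector<double>> out;
  for (std::size_t idx : indices) {
    std::vector<double>& bucket = out[idx];
    for (const Row& row : rows) {
      if (idx < row.size()) bucket.push_back(row[idx]);
    }
  }
  return out;
}
```

With rows {1,2,3}, {4,5,6}, {7,8}, the flattened input has eight values, while the bucket for index 2 holds only 3 and 6.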

Definition at line 136 of file unity_sketch.hpp.

Member Function Documentation

◆ cancel()

void turi::unity_sketch::cancel ( )

Cancels any ongoing sketch computation.

◆ construct_from_sarray()

void turi::unity_sketch::construct_from_sarray ( std::shared_ptr< unity_sarray_base >  uarray,
bool  background = false,
const std::vector< flexible_type > &  keys = {} 
)

Generates all the sketch statistics from an input SArray. If background is true, the sketch will be constructed in the background. While the sketch is being constructed in a background thread, queries can be executed on the sketch, but none of the quality guarantees will apply.
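The consequence for callers is that a statistic read before the sketch is ready reflects only the elements processed so far. The sketch below simulates this with a stand-in class; `mock_sketch`, `advance` and `progress` are hypothetical names, and `advance` merely plays the role of the background thread. It is not the library's implementation.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Stand-in for illustration only; this is NOT turi::unity_sketch. It mimics
// the documented behavior of sketch_ready() and num_elements_processed().
class mock_sketch {
 public:
  explicit mock_sketch(std::vector<double> data) : data_(std::move(data)) {}

  // Simulates the background thread consuming up to `n` more elements.
  void advance(std::size_t n) {
    const std::size_t end = std::min(data_.size(), processed_ + n);
    for (; processed_ < end; ++processed_) sum_ += data_[processed_];
  }

  bool sketch_ready() const { return processed_ == data_.size(); }
  std::size_t num_elements_processed() const { return processed_; }
  double sum() const { return sum_; }  // partial until sketch_ready()

 private:
  std::vector<double> data_;
  std::size_t processed_ = 0;
  double sum_ = 0.0;
};

// Fraction of the input consumed so far, in [0, 1]; useful for progress
// reporting while a background sketch is still running.
double progress(const mock_sketch& s, std::size_t total) {
  return total == 0 ? 1.0
                    : static_cast<double>(s.num_elements_processed()) / total;
}
```

A caller that needs the stated guarantees would check sketch_ready() first and treat anything read earlier as provisional.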

◆ dict_key_summary()

std::shared_ptr<unity_sketch_base> turi::unity_sketch::dict_key_summary ( )

For SArray of dictionary type, returns the sketch summary for the dictionary keys. Only keys that can be converted to string are counted.

◆ dict_value_summary()

std::shared_ptr<unity_sketch_base> turi::unity_sketch::dict_value_summary ( )

For SArray of dictionary type, returns the sketch summary for the dictionary values. Only values that can be converted to float are counted.

◆ element_length_summary()

std::shared_ptr<unity_sketch_base> turi::unity_sketch::element_length_summary ( )

Returns the element length sketch summary if the SArray is of list/vector/dict type; raises an exception otherwise.

◆ element_sub_sketch()

std::map<flexible_type, std::shared_ptr<unity_sketch_base> > turi::unity_sketch::element_sub_sketch ( const std::vector< flexible_type > &  keys)

Returns the sketch summaries for the given keys of a dictionary SArray sketch, or for the given indices of a vector SArray sketch.

Parameters
keys	each entry is either an index into the vector or a key in the dictionary

◆ element_summary()

std::shared_ptr<unity_sketch_base> turi::unity_sketch::element_summary ( )

For SArray of array/list (recursive) type, returns the sketch summary for the list values. The summary only works if the elements can be converted to string; elements that cannot be converted to string are ignored.

◆ frequency_count()

double turi::unity_sketch::frequency_count ( flexible_type  value)

Returns a sketched estimate of the number of occurrences of a given element. This estimate is based on the count sketch. The element must be of the same type as the input SArray; an exception is thrown otherwise.

◆ frequent_items()

std::vector<std::pair<flexible_type, size_t> > turi::unity_sketch::frequent_items ( )

Returns a sketched estimate of the most frequent elements in the SArray based on the SpaceSaving sketch. The only guarantee is that every element which appears in more than 0.01% (a fraction of 0.0001) of the rows of the array will appear in the set of returned elements. Other elements may also appear in the result. The item counts are estimated using the CountSketch.
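For readers unfamiliar with SpaceSaving, here is a minimal textbook version of the algorithm. It is an illustration under stated assumptions, not the implementation used by this class: the `space_saving` class is hypothetical, it works on strings only, and it finds the minimum counter with a linear scan. With k counters over n items, the algorithm guarantees that every item occurring more than n/k times is still monitored at the end, which is the shape of the guarantee quoted above (k = 10000 would correspond to the 0.0001 threshold, though the counter budget actually used is not stated here).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>

// Minimal SpaceSaving summary with a fixed number of counters.
class space_saving {
 public:
  explicit space_saving(std::size_t capacity) : capacity_(capacity) {
    assert(capacity > 0);  // at least one counter is required
  }

  void add(const std::string& item) {
    auto it = counts_.find(item);
    if (it != counts_.end()) {
      ++it->second;  // already monitored: bump its counter
    } else if (counts_.size() < capacity_) {
      counts_[item] = 1;  // a free counter is available
    } else {
      // Evict the item with the smallest counter; the newcomer inherits
      // that count plus one, so counts are never underestimated.
      auto min_it = counts_.begin();
      for (auto i = counts_.begin(); i != counts_.end(); ++i) {
        if (i->second < min_it->second) min_it = i;
      }
      const std::size_t inherited = min_it->second + 1;
      counts_.erase(min_it);
      counts_[item] = inherited;
    }
  }

  bool contains(const std::string& item) const {
    return counts_.count(item) != 0;
  }

  // Upper bound on the true count of a monitored item; 0 if not monitored.
  std::size_t estimate(const std::string& item) const {
    auto it = counts_.find(item);
    return it == counts_.end() ? 0 : it->second;
  }

 private:
  std::size_t capacity_;
  std::unordered_map<std::string, std::size_t> counts_;
};
```

Because evicted counts are inherited, infrequent items can appear in the summary with inflated counts, which is why the returned set may contain elements below the threshold.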

◆ get_quantile()

double turi::unity_sketch::get_quantile ( double  quantile)

Returns a sketched estimate of the value at a particular quantile between 0.0 and 1.0. The quantile is guaranteed to be accurate within 1%: meaning that if you ask for the 0.55 quantile, the returned value is guaranteed to be between the true 0.54 quantile and the true 0.56 quantile. The quantiles are only defined for numeric arrays and this function will throw an exception if called on a sketch constructed for a non-numeric column.
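The 1% guarantee can be checked against exact data as follows. This is a sketch for intuition only: `exact_quantile` and `within_guarantee` are hypothetical helpers, and the floor-based rank convention used to define the "true" quantile is an assumption, since the convention used by the library is not stated here.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Exact quantile of already-sorted data using a floor-based rank
// convention (an assumption of this sketch; other conventions exist).
double exact_quantile(const std::vector<double>& sorted, double q) {
  assert(!sorted.empty());
  q = std::min(1.0, std::max(0.0, q));  // clamp to [0, 1]
  const std::size_t idx =
      static_cast<std::size_t>(std::floor(q * (sorted.size() - 1)));
  return sorted[idx];
}

// True if `estimate` lies between the true (q - eps) and (q + eps)
// quantiles of `data`, i.e. if it satisfies the stated guarantee.
bool within_guarantee(std::vector<double> data, double q, double estimate,
                      double eps = 0.01) {
  std::sort(data.begin(), data.end());
  return exact_quantile(data, q - eps) <= estimate &&
         estimate <= exact_quantile(data, q + eps);
}
```

For the values 0 through 1000, an estimate of 550 for the 0.55 quantile passes this check, while 530 or 570 fall outside the band between the 0.54 and 0.56 quantiles.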

◆ max()

double turi::unity_sketch::max ( )
inline

Returns the max of the values in the sarray. Returns NaN on an empty array. Throws an exception if called on an sarray with non-numeric type.

Definition at line 254 of file unity_sketch.hpp.

◆ mean()

double turi::unity_sketch::mean ( )
inline

Returns the mean of the values in the sarray. Returns 0 on an empty array. Throws an exception if called on an sarray with non-numeric type.

Definition at line 242 of file unity_sketch.hpp.

◆ min()

double turi::unity_sketch::min ( )
inline

Returns the min of the values in the sarray. Returns NaN on an empty array. Throws an exception if called on an sarray with non-numeric type.

Definition at line 266 of file unity_sketch.hpp.

◆ num_elements_processed()

size_t turi::unity_sketch::num_elements_processed ( )

Returns the number of elements processed by the sketch so far. If the sketch is constructed with background == false, this will always return the number of elements of the array. If the sketch is constructed using a background thread, this may return a value between 0 and the length of the array.

◆ num_undefined()

size_t turi::unity_sketch::num_undefined ( )
inline

Returns the number of undefined elements in the input SArray.

Definition at line 319 of file unity_sketch.hpp.

◆ num_unique()

double turi::unity_sketch::num_unique ( )

Returns a sketched estimate of the number of unique values in the SArray based on the Hyperloglog sketch.

◆ numeric_epsilon()

double turi::unity_sketch::numeric_epsilon ( )
inline

Returns the epsilon value used by the numeric sketch. Returns NaN on an empty array. Throws an exception if called on an sarray with non-numeric type.

Definition at line 278 of file unity_sketch.hpp.

◆ size()

size_t turi::unity_sketch::size ( )
inline

Returns the number of elements in the input SArray.

Definition at line 312 of file unity_sketch.hpp.

◆ sketch_ready()

bool turi::unity_sketch::sketch_ready ( )

Returns true if the sketch is complete. If the sketch is constructed with background == false, this will always return true. Otherwise, the sketch is constructed using a background thread and this will return false until the sketch is ready.

◆ sum()

double turi::unity_sketch::sum ( )
inline

Returns the sum of the values in the sarray. Returns 0 on an empty array. Throws an exception if called on an sarray with non-numeric type.

Definition at line 290 of file unity_sketch.hpp.

◆ var()

double turi::unity_sketch::var ( )
inline

Returns the variance of the values in the sarray. Returns 0 on an empty array. Throws an exception if called on an sarray with non-numeric type.

Definition at line 301 of file unity_sketch.hpp.
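The empty-array conventions documented for max(), mean(), min(), sum() and var() can be summarized with a small streaming accumulator. This is a sketch, not the library's code: `numeric_summary` is a hypothetical class, Welford's update is a technique chosen here for the variance, and dividing by n (population variance) is an assumption, since the normalization is not stated above.

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>

// Streaming accumulator for the exact statistics, mirroring the documented
// return values on an empty array: 0 for mean/sum/var, NaN for min/max.
class numeric_summary {
 public:
  void add(double x) {
    ++n_;
    sum_ += x;
    // Welford's update keeps a running mean and sum of squared deviations.
    const double delta = x - mean_;
    mean_ += delta / static_cast<double>(n_);
    m2_ += delta * (x - mean_);
    min_ = (n_ == 1) ? x : std::min(min_, x);
    max_ = (n_ == 1) ? x : std::max(max_, x);
  }

  std::size_t size() const { return n_; }
  double sum() const { return sum_; }
  double mean() const { return n_ == 0 ? 0.0 : mean_; }
  // Population variance (divides by n); an assumption of this sketch.
  double var() const { return n_ == 0 ? 0.0 : m2_ / static_cast<double>(n_); }
  double min() const { return n_ == 0 ? quiet_nan() : min_; }
  double max() const { return n_ == 0 ? quiet_nan() : max_; }

 private:
  static double quiet_nan() {
    return std::numeric_limits<double>::quiet_NaN();
  }

  std::size_t n_ = 0;
  double sum_ = 0.0;
  double mean_ = 0.0;
  double m2_ = 0.0;
  double min_ = 0.0;
  double max_ = 0.0;
};
```

Since min() and max() return NaN on empty input, callers comparing against them should test for NaN explicitly rather than rely on ordinary comparisons.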


The documentation for this class was generated from the following file: