Turi Create  4.0
turi::sketches::quantile_sketch< T, Comparator > Class Template Reference

#include <ml/sketches/quantile_sketch.hpp>

Public Member Functions

 quantile_sketch ()
 
 quantile_sketch (size_t desired_n, double epsilon=0.005, const Comparator &comparator=Comparator())
 
void init (size_t desired_n, double epsilon, const Comparator &comparator=Comparator())
 
size_t sketch_size ()
 
size_t memory_usage ()
 
void add (T t)
 
void finalize ()
 
size_t size () const
 
query (size_t rank) const
 
query_quantile (double quantile) const
 
fast_query (size_t rank) const
 
fast_query_quantile (double quantile) const
 
void combine (const quantile_sketch &other)
 

Detailed Description

template<typename T, typename Comparator = std::less<T>>
class turi::sketches::quantile_sketch< T, Comparator >

This class implements a quantile sketching datastructure as described in Qi Zhang and Wei Wang. A Fast Algorithm for Approximate Quantiles in High Speed Data Streams. SSDBM 2007.

It constructs a low memory sketch from a stream of known length. To summarize the basic idea, it draws samples from the stream and maintains for each sample a lower bound and an upper bound on its rank in the stream.

The known length condition is not strictly true. While the sketch is constructed for particular stream length N, this datastructure does not limit the number of elements can be inserted, but then in which case the accuracy guarantee no longer applies.

Usage is simple:

// construct sketch at a particular size
quantile_sketch<double> sketch(100000, 0.01);
...
insert any number elements into the sketch using
quality guarantees only apply when <100000 elements are inserted.
sketch.add(value)
...
sketch.finalize() // finalize must be called before queries can be made

To query, two family of functions are provided.

The regular query functions take linear time in the size of the sketch, and have accuracy guarantees when <=desired_n elements are inserted (100000 elements in the above example)

sketch.query(rank) // rank is between [0, N-1]
sketch.query_quantile(quantile) // quantile is between [0, 1.0]

The fast query functions take logarithmic time in the size of the sketch, and have no accuracy guarantees.

sketch.fast_query(rank) // rank is between [0, N-1]
sketch.fast_query_quantile(quantile) // quantile is between [0, 1.0]

The sketch structure also support parallel sketching:

Example: If I have 16 data streams, each of size N/16 I can construct 16 quantile_sketches, each initialized with desired_n = N (note. N here! not N/16)

Each sketch can then be ran on each data stream independently. Then all can be combined into one:

quantile_sketch sketch(N)
for each sub-stream sketch:
sketch.combine(sub-stream-sketch)
sketch.finalize()
Template Parameters
Typeto contain in the sketch. type must be less-than-comparable, constructible, copy constructible, assignable.

Technical Details

The basic mechanic of the sketch lies in making a hierarchy of sketches, each having size m_b. The first sketch has error 0. The second sketch has error 1/b, the 3rd sketch has error 2/b and so on. These sketches are treated like a binary sequence:

  • When a buffer of size b has been accumulated, it gets sorted and inserted into sketch 1
  • If sketch 1 exists, the two are merged (no increase in error) and compressed (increase error by 1/b) and the result tries to get placed as sketch 2.
  • If sketch 2 exists, this repeats.

The procedure thus seem like binary addition.

By picking the size b appropriately for a given value of N and , At the end, all the sketches can be merged to get a sketch of error .

Here is where we deviate from the paper slightly. The "compress" procedure has the advantage of decreasing the total amount of memory required, By finishing/finalizing with a simple merge, means that the resultant array could be much bigger than desired. The way the binary addition works, also means the resultant sketch size is somewhat unpredictable. Thus here lies the modification.

We compute the sketch as usual, but sizing "b" to end up with a final merged error of / 2. We then perform one addition compress(/2) phase to compress the final sketch into requiring only O(1/epsilon) memory.

Definition at line 118 of file quantile_sketch.hpp.

Constructor & Destructor Documentation

◆ quantile_sketch() [1/2]

template<typename T, typename Comparator = std::less<T>>
turi::sketches::quantile_sketch< T, Comparator >::quantile_sketch ( )
inline

Default constructor. Does nothing. init() MUST be called before any operations are performed.

Definition at line 125 of file quantile_sketch.hpp.

◆ quantile_sketch() [2/2]

template<typename T, typename Comparator = std::less<T>>
turi::sketches::quantile_sketch< T, Comparator >::quantile_sketch ( size_t  desired_n,
double  epsilon = 0.005,
const Comparator &  comparator = Comparator() 
)
inlineexplicit

Constructs a quantile_sketch with a desired target number of elements, and a desired accuracy. See init() for details.

Definition at line 131 of file quantile_sketch.hpp.

Member Function Documentation

◆ add()

template<typename T, typename Comparator = std::less<T>>
void turi::sketches::quantile_sketch< T, Comparator >::add ( t)
inline

Inserts a value into the sketch. Not safe to use in parallel.

Definition at line 225 of file quantile_sketch.hpp.

◆ combine()

template<typename T, typename Comparator = std::less<T>>
void turi::sketches::quantile_sketch< T, Comparator >::combine ( const quantile_sketch< T, Comparator > &  other)
inline

Merges two unfinalized quantile_sketch into one quantile sketch. Both this and the other quantile sketch must not be finalized(). This allows two quantile sketches to be generated on two disjoint streams of data, and then combined at the end. For quality guarantees to apply, BOTH sketches must be constructed with the final number of elements.

Example: If I have 16 data streams, each of size N/16 I can construct 16 quantile_sketches, each initialized with desired_n = N (note. N here! not N/16)

Each sketch can then be ran on each data stream independently. Then all can be combined into one:

quantile_sketch sketch(N)
for each sub-stream sketch:
sketch.combine(sub-stream-sketch)
sketch.finalize()
Note
The predictions produced by the combined sketch is not generally going to be the same as the sequentially produced sketch

Definition at line 461 of file quantile_sketch.hpp.

◆ fast_query()

template<typename T, typename Comparator = std::less<T>>
T turi::sketches::quantile_sketch< T, Comparator >::fast_query ( size_t  rank) const
inline

Quickly queries for the value at rank k of all the elements inserted. Assuming N elements are inserted. Rank 0 will be the minimum value, rank N-1, the maximum value.

Note that this rank is in relation to the total number of elements inserted. For instance,

quantile_sketch<double> sketch(1000);
...
insert 10,000 elements into the sketch

sketch.fast_query(9999) will be the maximum.

Unlike query(), this function is never guaranteed to provide an epsilon error bound. Instead, this function assumes that each element in the sketch has a point estimate of its rank (the center of its rmin/rmax range), and looks for the element closest to the desired rank.

Note
finalize() must be called prior to using this function.
The complexity of this function is logarithmic in the sketch size.

Definition at line 405 of file quantile_sketch.hpp.

◆ fast_query_quantile()

template<typename T, typename Comparator = std::less<T>>
T turi::sketches::quantile_sketch< T, Comparator >::fast_query_quantile ( double  quantile) const
inline

Quickly returns the value at a quantile. For instance:

sketch.fast_query_quantile(0.01) // returns the 1% quantile
sketch.fast_query_quantile(0.50) // returns the median
sketch.fast_query_quantile(1.00) // returns the max

quantile values below 0 will be interpreted as 0 (hence the minimum). quantile values above 1 will be interpreted as 1 (hence the maximum).

Unlike query_quantile(), this function is never guaranteed to provide an epsilon error bound. Instead, this function assumes that each element in the sketch has a point estimate of its rank (the center of its rmin/rmax range), and looks for the element closest to the desired rank.

Note
finalize() must be called prior to using this function.
The complexity of this function is logarithmic in the sketch size.

Definition at line 430 of file quantile_sketch.hpp.

◆ finalize()

template<typename T, typename Comparator = std::less<T>>
void turi::sketches::quantile_sketch< T, Comparator >::finalize ( )
inline

Finalizes the sketch datastructure. Must be called before queries are made. Once finalize() is called, elements may no longer be inserted.

Definition at line 247 of file quantile_sketch.hpp.

◆ init()

template<typename T, typename Comparator = std::less<T>>
void turi::sketches::quantile_sketch< T, Comparator >::init ( size_t  desired_n,
double  epsilon,
const Comparator &  comparator = Comparator() 
)
inline

Initializes the quantile sketch, resetting it. epsilon is the desired accuracy of the quantiles. An -approximate quantile is an element in the data stream whose approximate rank r' is within [r- n and r + n] i.e. a 0.0001 approximate median in a 1,000,000 element array may have true rank [499,000, and 501,000].

The runtime memory requirements of this sketch is O((1/) (log N)^2) The final memory requirements after finalization is O(1/).

To give a sense of the runtime memory requirements, at = 0.01

* N = 100,000: sketch has size = 45528 bytes
* N = 1,000,000: sketch has size = 96144 bytes
* N = 10,000,000: sketch has size = 123504 bytes
* N = 100,000,000: sketch has size = 316536 bytes
* 

(Increasing N by 10 roughly increases the sketch size by 2. Note that the growth function is a little wierd. this is due to the structure of the algorithm which maintains multi-level sketches, then during the final phase, just merging all those sketches together. This results in somewhat oddly behaved memory utilization as a function of N.)

At N = 1,000,000

* epsilon = 0.1: sketch has size = 12480 bytes
* epsilon = 0.01: sketch has size = 96144 bytes
* epsilon = 0.001: sketch has size = 442776 bytes
* epsilon = 0.0001: sketch has size = 3271440 bytes
* 

(decreasing epsilon by 10 roughly increases the sketch size by 5)

Parameters
desired_nAn UPPER BOUND on the number of elements to be inserted, to guarantee epsilon accuracy. The epsilon accuracy guarantee is only true if you do not insert more than n elements.
epsilonThe desired accuracy

Definition at line 178 of file quantile_sketch.hpp.

◆ memory_usage()

template<typename T, typename Comparator = std::less<T>>
size_t turi::sketches::quantile_sketch< T, Comparator >::memory_usage ( )
inline

Returns the approximate current memory usage of the sketch in bytes.

Definition at line 218 of file quantile_sketch.hpp.

◆ query()

template<typename T, typename Comparator = std::less<T>>
T turi::sketches::quantile_sketch< T, Comparator >::query ( size_t  rank) const
inline

Queries for the value at rank k of all the elements inserted. Assuming N elements are inserted. Rank 0 will be the minimum value, rank N-1, the maximum value.

Note that this rank is in relation to the total number of elements inserted. For instance,

quantile_sketch<double> sketch(1000);
...
insert 10,000 elements into the sketch

sketch.query(9999) will be the maximum.

Note
finalize() must be called prior to using this function.
This function is guaranteed to provide an epsilon accurate value if only the desired number of elements (as set in the constructor) are inserted. In all other cases, this may not.
The complexity of this function can be logarithmic in the sketch size in the best case (appears to be all the time, empirically), or linear in the worst case. Essentially, it first tests the result of the fast_query to see if that fits the bound requirements. If it does not, it takes a full linear search. The fast_query seems to work pretty much all the time on randomized inputs of varying distributions.

Definition at line 295 of file quantile_sketch.hpp.

◆ query_quantile()

template<typename T, typename Comparator = std::less<T>>
T turi::sketches::quantile_sketch< T, Comparator >::query_quantile ( double  quantile) const
inline

Returns the value at a quantile. For instance:

sketch.query_quantile(0.01) // returns the 1% quantile
sketch.query_quantile(0.50) // returns the median
sketch.query_quantile(1.00) // returns the max

quantile values below 0 will be interpreted as 0 (hence the minimum). quantile values above 1 will be interpreted as 1 (hence the maximum).

Note
finalize() must be called prior to using this function.
This function is guaranteed to provide an epsilon accurate value if only the desired number of elements (as set in the constructor) are inserted. In all other cases, this may not.
The complexity of this function can be logarithmic in the sketch size in the best case (appears to be all the time, empirically), or linear in the worst case. Essentially, it first tests the result of the fast_query to see if that fits the bound requirements. If it does not, it takes a full linear search. The fast_query seems to work pretty much all the time on randomized inputs of varying distributions.

Definition at line 376 of file quantile_sketch.hpp.

◆ size()

template<typename T, typename Comparator = std::less<T>>
size_t turi::sketches::quantile_sketch< T, Comparator >::size ( ) const
inline

Returns the number of elements inserted into the sketch.

Definition at line 264 of file quantile_sketch.hpp.

◆ sketch_size()

template<typename T, typename Comparator = std::less<T>>
size_t turi::sketches::quantile_sketch< T, Comparator >::sketch_size ( )
inline

Returns the number of elements stored in the sketch

Definition at line 206 of file quantile_sketch.hpp.


The documentation for this class was generated from the following file: