Turi Create  4.0
turi::sketches::streaming_quantile_sketch< T, Comparator > Class Template Reference

#include <ml/sketches/streaming_quantile_sketch.hpp>

Public Member Functions

 streaming_quantile_sketch (double epsilon=0.005, const Comparator &comparator=Comparator())
 
void init (double epsilon, const Comparator &comparator)
 
size_t sketch_size ()
 
size_t memory_usage ()
 
void add (T t)
 
void finalize ()
 
size_t size () const
 
query (size_t rank)
 
query_quantile (double quantile)
 
fast_query (size_t rank)
 
fast_query_quantile (double quantile)
 
void substream_finalize ()
 
void combine (streaming_quantile_sketch other)
 
void combine_finalize ()
 

Detailed Description

template<typename T, typename Comparator>
class turi::sketches::streaming_quantile_sketch< T, Comparator >

This class implements the streaming quantile sketching datastructure as described in Qi Zhang and Wei Wang. A Fast Algorithm for Approximate Quantiles in High Speed Data Streams. SSDBM 2007.

This sketch is basically a streaming version of the quantile_sketch

It constructs a low memory sketch from a stream of unknown length. To summarize the basic idea, it draws samples from the stream and maintains for each sample a lower bound and an upper bound on its rank in the stream.

Usage:

// construct a sketch at a particular accuracy
quantile_sketch<double> sketch(0.01);
...
insert any number elements into the sketch
sketch.add(value)
...
sketch.finalize() // finalize must be called before queries can be made

To query, two family of functions are provided.

The regular query functions take linear time in the size of the sketch, and have accuracy guarantees

sketch.query(rank) // rank is between [0, N-1]
sketch.query_quantile(quantile) // quantile is between [0, 1.0]

The fast query functions take logarithmic time in the size of the sketch, and have no accuracy guarantees.

sketch.fast_query(rank) // rank is between [0, N-1]
sketch.fast_query_quantile(quantile) // quantile is between [0, 1.0]

The sketch structure also support parallel sketching:

Example: If I have 16 data streams, I can construct 16 streaming_quantile_sketches:

Each sketch can then be ran on each data stream independently. Then all can be combined into one. Note that the process for doing this on the streaming_quantile_sketch is somewhat more complicated. It involves a partial finalization of each substream (sub_stream_finalize() ), before combining. And after combining, a different finalization call must be made (combine_finalize())

// add stuff to each sub stream
for each sub-stream sketch:
sub-stream-sketch.add(...)
// pre-finalize each sub stream
for each sub-stream sketch:
sub-stream-sketch.substream_finalize();
// combine into one sketch
quantile_sketch sketch(0.01)
for each sub-stream sketch:
sketch.combine(sub-stream-sketch)
// finalize the main sketch
Template Parameters
Typeto contain in the sketch. type must be less-than-comparable, constructible, copy constructible, assignable.

Technical Details

We create a sequence of fixed length sketches, with each subsequent sketch sized to sketch twice the amount of data as the previous one. Each sketch is built to have a final finalized error of / 3.

If no sub-streams are used, the finalize() phase will merge all sketches and recompress to introduce an additional error of (2 * / 3)

If sub-streams are used, the sub-stream finalize will merge all the sub-stream sketches to introduce an additional error of ( / 3). Then each of the substream sketches are merged into the final sketch and compressed again to introduce an additional error of ( / 3)

Definition at line 22 of file quantile_sketch.hpp.

Constructor & Destructor Documentation

◆ streaming_quantile_sketch()

template<typename T, typename Comparator>
turi::sketches::streaming_quantile_sketch< T, Comparator >::streaming_quantile_sketch ( double  epsilon = 0.005,
const Comparator &  comparator = Comparator() 
)
inlineexplicit

Constructs a quantile_sketch with a desired target number of elements, and a desired accuracy. See init() for details.

Definition at line 110 of file streaming_quantile_sketch.hpp.

Member Function Documentation

◆ add()

template<typename T, typename Comparator>
void turi::sketches::streaming_quantile_sketch< T, Comparator >::add ( t)
inline

Inserts a value into the sketch. Not safe to use in parallel.

Definition at line 163 of file streaming_quantile_sketch.hpp.

◆ combine()

template<typename T, typename Comparator>
void turi::sketches::streaming_quantile_sketch< T, Comparator >::combine ( streaming_quantile_sketch< T, Comparator >  other)
inline

Merges two substream_finalized quantile_sketch into one quantile sketch. Both this and the other quantile sketch must be substream_finalized(). This allows two quantile sketches to be generated on two disjoint streams of data, and then combined at the end.

Example: If I have 16 data streams, I can construct 16 quantile_sketches,

Each sketch can then be ran on each data stream independently. Then all can be combined into one:

// add stuff to each sub stream
for each sub-stream sketch:
sub-stream-sketch.add(...)
// pre-finalize each sub stream
for each sub-stream sketch:
sub-stream-sketch.substream_finalize();
// combine into one sketch
quantile_sketch sketch(0.01)
for each sub-stream sketch:
sketch.combine(sub-stream-sketch)
// finalize the main sketch
Note
The predictions produced by the combined sketch is not generally going to be the same as the sequentially produced sketch
See also
combine_finalize() substream_finalize()

Definition at line 377 of file streaming_quantile_sketch.hpp.

◆ combine_finalize()

template<typename T, typename Comparator>
void turi::sketches::streaming_quantile_sketch< T, Comparator >::combine_finalize ( )
inline

Stops combining, and finishes the final sketch making it available for querying.

See also
combine() substream_finalize()

Definition at line 390 of file streaming_quantile_sketch.hpp.

◆ fast_query()

template<typename T, typename Comparator>
T turi::sketches::streaming_quantile_sketch< T, Comparator >::fast_query ( size_t  rank)
inline

Quickly queries for the value at rank k of all the elements inserted. Assuming N elements are inserted. Rank 0 will be the minimum value, rank N-1, the maximum value.

Note that this rank is in relation to the total number of elements inserted. For instance,

quantile_sketch<double> sketch(1000);
...
insert 10,000 elements into the sketch

sketch.fast_query(9999) will be the maximum.

Unlike query(), this function is never guaranteed to provide an epsilon error bound. Instead, this function assumes that each element in the sketch has a point estimate of its rank (the center of its rmin/rmax range), and looks for the element closest to the desired rank.

Note
finalize() must be called prior to using this function.
The complexity of this function can be logarithmic in the sketch size in the best case (appears to be all the time, empirically), or linear in the worst case. Essentially, it first tests the result of the fast_query to see if that fits the bound requirements. If it does not, it takes a full linear search. The fast_query seems to work pretty much all the time on randomized inputs of varying distributions.

Definition at line 286 of file streaming_quantile_sketch.hpp.

◆ fast_query_quantile()

template<typename T, typename Comparator>
T turi::sketches::streaming_quantile_sketch< T, Comparator >::fast_query_quantile ( double  quantile)
inline

Quickly returns the value at a quantile. For instance:

sketch.fast_query_quantile(0.01) // returns the 1% quantile
sketch.fast_query_quantile(0.50) // returns the median
sketch.fast_query_quantile(1.00) // returns the max

quantile values below 0 will be interpreted as 0 (hence the minimum). quantile values above 1 will be interpreted as 1 (hence the maximum).

Unlike query_quantile(), this function is never guaranteed to provide an epsilon error bound. Instead, this function assumes that each element in the sketch has a point estimate of its rank (the center of its rmin/rmax range), and looks for the element closest to the desired rank.

Note
finalize() must be called prior to using this function.
The complexity of this function is logarithmic in the sketch size.

Definition at line 311 of file streaming_quantile_sketch.hpp.

◆ finalize()

template<typename T, typename Comparator>
void turi::sketches::streaming_quantile_sketch< T, Comparator >::finalize ( )
inline

Finalizes the sketch datastructure. Must be called before queries are made. Once finalize() is called, elements may no longer be inserted.

Definition at line 182 of file streaming_quantile_sketch.hpp.

◆ init()

template<typename T, typename Comparator>
void turi::sketches::streaming_quantile_sketch< T, Comparator >::init ( double  epsilon,
const Comparator &  comparator 
)
inline

Reinitializes the quantile sketch, resetting it. epsilon is the desired accuracy of the quantiles.

Parameters
epsilonThe desired accuracy

Definition at line 121 of file streaming_quantile_sketch.hpp.

◆ memory_usage()

template<typename T, typename Comparator>
size_t turi::sketches::streaming_quantile_sketch< T, Comparator >::memory_usage ( )
inline

Returns the approximate current memory usage of the sketch in bytes.

Definition at line 151 of file streaming_quantile_sketch.hpp.

◆ query()

template<typename T, typename Comparator>
T turi::sketches::streaming_quantile_sketch< T, Comparator >::query ( size_t  rank)
inline

Queries for the value at rank k of all the elements inserted. Assuming N elements are inserted. Rank 0 will be the minimum value, rank N-1, the maximum value.

Note that this rank is in relation to the total number of elements inserted. For instance,

quantile_sketch<double> sketch(1000);
...
insert 10,000 elements into the sketch

sketch.query(9999) will be the maximum.

Note
finalize() must be called prior to using this function.
The complexity of this function can be logarithmic in the sketch size in the best case (appears to be all the time, empirically), or linear in the worst case. Essentially, it first tests the result of the fast_query to see if that fits the bound requirements. If it does not, it takes a full linear search. The fast_query seems to work pretty much all the time on randomized inputs of varying distributions.

Definition at line 234 of file streaming_quantile_sketch.hpp.

◆ query_quantile()

template<typename T, typename Comparator>
T turi::sketches::streaming_quantile_sketch< T, Comparator >::query_quantile ( double  quantile)
inline

Returns the value at a quantile. For instance:

sketch.query_quantile(0.01) // returns the 1% quantile
sketch.query_quantile(0.50) // returns the median
sketch.query_quantile(1.00) // returns the max

quantile values below 0 will be interpreted as 0 (hence the minimum). quantile values above 1 will be interpreted as 1 (hence the maximum).

Note
finalize() must be called prior to using this function.
The complexity of this function is linear in the sketch size.

Definition at line 254 of file streaming_quantile_sketch.hpp.

◆ size()

template<typename T, typename Comparator>
size_t turi::sketches::streaming_quantile_sketch< T, Comparator >::size ( ) const
inline

Returns the number of elements inserted into the sketch.

Definition at line 207 of file streaming_quantile_sketch.hpp.

◆ sketch_size()

template<typename T, typename Comparator>
size_t turi::sketches::streaming_quantile_sketch< T, Comparator >::sketch_size ( )
inline

Returns the number of elements stored in the sketch

Definition at line 138 of file streaming_quantile_sketch.hpp.

◆ substream_finalize()

template<typename T, typename Comparator>
void turi::sketches::streaming_quantile_sketch< T, Comparator >::substream_finalize ( )
inline

Finalizes the sketch datastructure, in preparation for combining with other streaming quantile sketches. Once substream_finalize() is called, elements may no longer be inserted.

See also
combine_finalize() combine()

Definition at line 323 of file streaming_quantile_sketch.hpp.


The documentation for this class was generated from the following files: