Turi Create
4.0
|
#include <ml/sketches/quantile_sketch.hpp>
Public Member Functions | |
quantile_sketch () | |
quantile_sketch (size_t desired_n, double epsilon=0.005, const Comparator &comparator=Comparator()) | |
void | init (size_t desired_n, double epsilon, const Comparator &comparator=Comparator()) |
size_t | sketch_size () |
size_t | memory_usage () |
void | add (T t) |
void | finalize () |
size_t | size () const |
T | query (size_t rank) const |
T | query_quantile (double quantile) const |
T | fast_query (size_t rank) const |
T | fast_query_quantile (double quantile) const |
void | combine (const quantile_sketch &other) |
This class implements a quantile sketching datastructure as described in Qi Zhang and Wei Wang. A Fast Algorithm for Approximate Quantiles in High Speed Data Streams. SSDBM 2007.
It constructs a low memory sketch from a stream of known length. To summarize the basic idea, it draws samples from the stream and maintains for each sample a lower bound and an upper bound on its rank in the stream.
The known length condition is not strictly true. While the sketch is constructed for particular stream length N, this datastructure does not limit the number of elements can be inserted, but then in which case the accuracy guarantee no longer applies.
Usage is simple:
To query, two family of functions are provided.
The regular query functions take linear time in the size of the sketch, and have accuracy guarantees when <=desired_n elements are inserted (100000 elements in the above example)
The fast query functions take logarithmic time in the size of the sketch, and have no accuracy guarantees.
The sketch structure also support parallel sketching:
Example: If I have 16 data streams, each of size N/16 I can construct 16 quantile_sketches, each initialized with desired_n = N (note. N here! not N/16)
Each sketch can then be ran on each data stream independently. Then all can be combined into one:
Type | to contain in the sketch. type must be less-than-comparable, constructible, copy constructible, assignable. |
The basic mechanic of the sketch lies in making a hierarchy of sketches, each having size m_b. The first sketch has error 0. The second sketch has error 1/b, the 3rd sketch has error 2/b and so on. These sketches are treated like a binary sequence:
The procedure thus seem like binary addition.
By picking the size b appropriately for a given value of N and , At the end, all the sketches can be merged to get a sketch of error .
Here is where we deviate from the paper slightly. The "compress" procedure has the advantage of decreasing the total amount of memory required, By finishing/finalizing with a simple merge, means that the resultant array could be much bigger than desired. The way the binary addition works, also means the resultant sketch size is somewhat unpredictable. Thus here lies the modification.
We compute the sketch as usual, but sizing "b" to end up with a final merged error of / 2. We then perform one addition compress(/2) phase to compress the final sketch into requiring only O(1/epsilon) memory.
Definition at line 118 of file quantile_sketch.hpp.
|
inline |
Default constructor. Does nothing. init() MUST be called before any operations are performed.
Definition at line 125 of file quantile_sketch.hpp.
|
inlineexplicit |
Constructs a quantile_sketch with a desired target number of elements, and a desired accuracy. See init() for details.
Definition at line 131 of file quantile_sketch.hpp.
|
inline |
Inserts a value into the sketch. Not safe to use in parallel.
Definition at line 225 of file quantile_sketch.hpp.
|
inline |
Merges two unfinalized quantile_sketch into one quantile sketch. Both this and the other quantile sketch must not be finalized(). This allows two quantile sketches to be generated on two disjoint streams of data, and then combined at the end. For quality guarantees to apply, BOTH sketches must be constructed with the final number of elements.
Example: If I have 16 data streams, each of size N/16 I can construct 16 quantile_sketches, each initialized with desired_n = N (note. N here! not N/16)
Each sketch can then be ran on each data stream independently. Then all can be combined into one:
Definition at line 461 of file quantile_sketch.hpp.
|
inline |
Quickly queries for the value at rank k of all the elements inserted. Assuming N elements are inserted. Rank 0 will be the minimum value, rank N-1, the maximum value.
Note that this rank is in relation to the total number of elements inserted. For instance,
sketch.fast_query(9999) will be the maximum.
Unlike query(), this function is never guaranteed to provide an epsilon error bound. Instead, this function assumes that each element in the sketch has a point estimate of its rank (the center of its rmin/rmax range), and looks for the element closest to the desired rank.
Definition at line 405 of file quantile_sketch.hpp.
|
inline |
Quickly returns the value at a quantile. For instance:
quantile values below 0 will be interpreted as 0 (hence the minimum). quantile values above 1 will be interpreted as 1 (hence the maximum).
Unlike query_quantile(), this function is never guaranteed to provide an epsilon error bound. Instead, this function assumes that each element in the sketch has a point estimate of its rank (the center of its rmin/rmax range), and looks for the element closest to the desired rank.
Definition at line 430 of file quantile_sketch.hpp.
|
inline |
Finalizes the sketch datastructure. Must be called before queries are made. Once finalize() is called, elements may no longer be inserted.
Definition at line 247 of file quantile_sketch.hpp.
|
inline |
Initializes the quantile sketch, resetting it. epsilon is the desired accuracy of the quantiles. An -approximate quantile is an element in the data stream whose approximate rank r' is within [r- n and r + n] i.e. a 0.0001 approximate median in a 1,000,000 element array may have true rank [499,000, and 501,000].
The runtime memory requirements of this sketch is O((1/) (log N)^2) The final memory requirements after finalization is O(1/).
To give a sense of the runtime memory requirements, at = 0.01
* N = 100,000: sketch has size = 45528 bytes * N = 1,000,000: sketch has size = 96144 bytes * N = 10,000,000: sketch has size = 123504 bytes * N = 100,000,000: sketch has size = 316536 bytes *
(Increasing N by 10 roughly increases the sketch size by 2. Note that the growth function is a little wierd. this is due to the structure of the algorithm which maintains multi-level sketches, then during the final phase, just merging all those sketches together. This results in somewhat oddly behaved memory utilization as a function of N.)
At N = 1,000,000
* epsilon = 0.1: sketch has size = 12480 bytes * epsilon = 0.01: sketch has size = 96144 bytes * epsilon = 0.001: sketch has size = 442776 bytes * epsilon = 0.0001: sketch has size = 3271440 bytes *
(decreasing epsilon by 10 roughly increases the sketch size by 5)
desired_n | An UPPER BOUND on the number of elements to be inserted, to guarantee epsilon accuracy. The epsilon accuracy guarantee is only true if you do not insert more than n elements. |
epsilon | The desired accuracy |
Definition at line 178 of file quantile_sketch.hpp.
|
inline |
Returns the approximate current memory usage of the sketch in bytes.
Definition at line 218 of file quantile_sketch.hpp.
|
inline |
Queries for the value at rank k of all the elements inserted. Assuming N elements are inserted. Rank 0 will be the minimum value, rank N-1, the maximum value.
Note that this rank is in relation to the total number of elements inserted. For instance,
sketch.query(9999) will be the maximum.
Definition at line 295 of file quantile_sketch.hpp.
|
inline |
Returns the value at a quantile. For instance:
quantile values below 0 will be interpreted as 0 (hence the minimum). quantile values above 1 will be interpreted as 1 (hence the maximum).
Definition at line 376 of file quantile_sketch.hpp.
|
inline |
Returns the number of elements inserted into the sketch.
Definition at line 264 of file quantile_sketch.hpp.
|
inline |
Returns the number of elements stored in the sketch
Definition at line 206 of file quantile_sketch.hpp.