The original paper can be found at
https://github.com/tdunning/t-digest/blob/master/docs/t-digest-paper/histo.pdf
This implementation maintains a t-digest by collecting new points in a buffer that is occasionally sorted and merged
into a sorted array containing the previously computed centroids.
This can be very fast because the cost of sorting and merging is amortized over several insertions. If
we keep N centroids in total and the input buffer holds k points, then the amortized cost per insertion is roughly
N/k + log k
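
As an illustration of this buffer-and-merge pattern, here is a minimal, hypothetical sketch (not the actual MergingDigest code). It uses a uniform per-centroid weight limit of total/compression in place of the real quantile-dependent bound, and it allocates a fresh output array on each merge for brevity, unlike the actual implementation discussed below.

```java
import java.util.Arrays;

// Minimal sketch of the buffer-then-merge strategy; names are illustrative.
class MergeSketch {
    private double[] mean = new double[0];    // sorted centroid means
    private double[] weight = new double[0];  // matching centroid weights
    private double total = 0;                 // total weight seen so far

    private final double[] buffer;            // unsorted incoming points
    private int bufferCount = 0;
    private final double compression;

    MergeSketch(double compression, int bufferSize) {
        this.compression = compression;
        this.buffer = new double[bufferSize];
    }

    void add(double x) {
        buffer[bufferCount++] = x;
        if (bufferCount == buffer.length) {
            mergeBuffer();                    // cost amortized over k insertions
        }
    }

    void mergeBuffer() {
        if (bufferCount == 0) {
            return;
        }
        Arrays.sort(buffer, 0, bufferCount);  // O(k log k)
        total += bufferCount;
        double sizeLimit = total / compression;

        // Merge the sorted buffer with the sorted centroids, greedily folding
        // neighbors together while a centroid stays under sizeLimit: O(N + k).
        double[] newMean = new double[mean.length + bufferCount];
        double[] newWeight = new double[mean.length + bufferCount];
        int n = 0, i = 0, j = 0;
        while (i < mean.length || j < bufferCount) {
            double m, w;
            if (j >= bufferCount || (i < mean.length && mean[i] <= buffer[j])) {
                m = mean[i];
                w = weight[i];
                i++;
            } else {
                m = buffer[j];
                w = 1;
                j++;
            }
            if (n > 0 && newWeight[n - 1] + w <= sizeLimit) {
                // fold into the previous centroid: weighted average of means
                newMean[n - 1] = (newMean[n - 1] * newWeight[n - 1] + m * w)
                        / (newWeight[n - 1] + w);
                newWeight[n - 1] += w;
            } else {
                newMean[n] = m;
                newWeight[n] = w;
                n++;
            }
        }
        mean = Arrays.copyOf(newMean, n);
        weight = Arrays.copyOf(newWeight, n);
        bufferCount = 0;
    }
}
```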
The two terms even out when N/k = log k, and balancing costs like this is often a good place to start when optimizing an
algorithm. For different values of the compression factor, the following table shows the estimated asymptotic
number of centroids N and the suggested buffer size k:
| Compression | N   | k  |
|-------------|-----|----|
| 50          | 78  | 25 |
| 100         | 157 | 42 |
| 200         | 314 | 73 |
Sizing considerations for t-digest
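
As a rough check on these numbers, the balance point N/k = log k can be solved numerically for k. The snippet below is a hypothetical sketch that assumes the log is a natural log; with that assumption it approximately reproduces the suggested k values in the table.

```java
// Solve N/k = ln k for k with a simple fixed-point iteration.
public class BalanceK {
    public static void main(String[] args) {
        int[] centroidCounts = {78, 157, 314};   // N values from the table
        for (int n : centroidCounts) {
            double k = 10;                       // any positive starting guess
            for (int iter = 0; iter < 100; iter++) {
                k = n / Math.log(k);             // iterate k = N / ln k
            }
            System.out.printf("N = %d -> k ~ %.0f%n", n, k);
        }
    }
}
```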
The virtues of this kind of t-digest implementation include:
- No allocation is required after initialization
- The data structure automatically compresses existing centroids when possible
- No object overhead is incurred for centroids since data is kept in primitive arrays
The current implementation takes the liberty of using ping-pong buffers to implement the merge, which incurs
a substantial memory penalty. The complexity of an in-place merge was not considered worthwhile, however,
because even with this overhead the memory cost is less than 40 bytes per centroid, which is much less than half
of what the AVLTreeDigest uses. Speed tests are not yet complete, so it is uncertain whether the merge
strategy is faster than the tree strategy.
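
Since an in-place merge was judged not worthwhile, each merge can simply write into a second, preallocated pair of arrays and swap them afterwards. The sketch below is hypothetical (illustrative names, centroid folding omitted); it only shows the ping-pong swap and how the per-centroid memory roughly adds up.

```java
// Two preallocated pairs of arrays trade roles on every merge, so nothing is
// allocated after construction and no in-place merge is needed.
class PingPongCentroids {
    private double[] mean, weight;            // active, sorted centroids
    private double[] spareMean, spareWeight;  // written during the next merge
    private int count = 0;

    PingPongCentroids(int maxCentroids) {
        // Four double arrays of length N come to 4 * 8 = 32 bytes per centroid
        // slot before counting the input buffer, consistent with the
        // under-40-bytes figure quoted above.
        mean = new double[maxCentroids];
        weight = new double[maxCentroids];
        spareMean = new double[maxCentroids];
        spareWeight = new double[maxCentroids];
    }

    // Merge a sorted buffer of unit-weight points into the spare arrays and
    // swap roles. A real merge would also fold neighboring points into
    // centroids so the result always fits in maxCentroids; this sketch
    // assumes it fits.
    void merge(double[] sortedBuffer, int bufferCount) {
        int i = 0, j = 0, n = 0;
        while (i < count || j < bufferCount) {
            if (j >= bufferCount || (i < count && mean[i] <= sortedBuffer[j])) {
                spareMean[n] = mean[i];
                spareWeight[n] = weight[i];
                i++;
            } else {
                spareMean[n] = sortedBuffer[j];
                spareWeight[n] = 1;
                j++;
            }
            n++;
        }
        double[] tm = mean; mean = spareMean; spareMean = tm;         // swap means
        double[] tw = weight; weight = spareWeight; spareWeight = tw; // swap weights
        count = n;
    }
}
```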