Adaptive String Dictionary Compression In In Memory

Notes

This paper discusses dictionary compression of string columns in in-memory data stores. They note that most columns are strings, and that many columns have only a few values, but some have large dictionaries.

They list out a few schemes:

Uncompressed
Bit compression
Hu-Tucker (order-preserving Huffman)
N-gram compression.
Re-pair

All of these do well depending on the data.

For each scheme, they lay out a prediction model for the variant, and then build out a compression manager, which gets statistics on the space and time for compression of the column based on the set size, which tunes itself to trade time for space.

If memory pressure is high, favor more compression If memory pressure is low, favor less compression

They plot out a curve where variants would fit for this column and then pick the one that takes the least time.

Takashi's Notes

Explorer

Adaptive String Dictionary Compression In In Memory

Notes

Graph View