Storage and Retrieval
Data Structures That Power Your Database
Imagine this key-value database:
#!/bin/bash

# Append a "key,value" record to the end of the log file
db_set () {
    echo "$1,$2" >> database
}

# Find every record for the key, strip the "key," prefix, keep the last one
db_get () {
    grep "^$1," database | sed -e "s/^$1,//" | tail -n 1
}
db_set stores a key and its value. It runs in O(1) time, since it simply appends to the file.
db_get fetches the most recent value for a key. It runs in O(n) time, since it must scan the whole file for occurrences of the key and keep only the last one.
We can do better by indexing:
Hash Index
We can create an in-memory hash map where every key is mapped to a byte offset in the data file. When you append a new key-value pair to the file, you also update the hash map to point at the offset you just wrote. This is essentially what Bitcask, the default storage engine in Riak, does.
We now have O(1) average reads.
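As a rough sketch of this design (a hypothetical HashIndexDB class, reusing the "key,value" log format from the bash example above; crash recovery is left out):

import os

class HashIndexDB:
    """Append-only log plus an in-memory hash map of key -> byte offset,
    a sketch of the Bitcask-style design described above."""
    def __init__(self, path="database"):
        self.path = path
        self.index = {}          # key -> byte offset of the latest value
        open(path, "a").close()  # create the log file if missing

    def set(self, key, value):
        with open(self.path, "ab") as f:
            offset = f.tell()                 # where this record starts
            f.write(f"{key},{value}\n".encode())
        self.index[key] = offset              # point at the newest value

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        with open(self.path, "rb") as f:
            f.seek(offset)                    # one seek, then one read
            line = f.readline().decode().rstrip("\n")
        return line.split(",", 1)[1]

db = HashIndexDB()
db.set("42", "hello")
db.set("42", "world")
print(db.get("42"))  # -> "world" (the most recent value wins)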
Implementation details:
- File Format
- Text formats are not particularly compact. Binary formats are better.
- Deleting records
- We can delete a record by appending a special deletion record (a tombstone) to the data file. The tombstone tells the merging process to discard any previous values for the deleted key (see the compaction sketch after this list).
- Crash recovery
- Take a snapshot of the hashmap on disk intermittently, so when the database crashes, restarting it won’t take so long.
- Partially written records
- Calculate a checksum for every record, allowing corrupted parts of the log to be detected and ignored.
- Concurrency control
- Writes should be strictly sequential, so there should be one writer thread that appends, with multiple reader threads.
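For illustration, a minimal compaction pass over one segment, using a hypothetical string tombstone marker (real engines use a flag in the binary record format):

TOMBSTONE = "__tombstone__"  # hypothetical marker for deleted keys

def compact(segment_path, compacted_path):
    """Rewrite a segment keeping only the latest value per key and
    dropping keys whose latest record is a tombstone (a sketch)."""
    latest = {}
    with open(segment_path) as f:
        for line in f:
            key, value = line.rstrip("\n").split(",", 1)
            latest[key] = value               # later records win
    with open(compacted_path, "w") as out:
        for key, value in latest.items():
            if value != TOMBSTONE:            # deleted keys disappear
                out.write(f"{key},{value}\n")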
Hash Index in practice
Pros
Appending is much faster than random writes; a random-access write can be ~100x slower than a sequential one.
Concurrency and crash recovery are much easier, since incomplete writes can be thrown away easily with checksums.
Merging avoids data fragmentation, which slows reads.
Cons
The hash table must fit in memory; on-disk hash indexes are much slower.
Hash indexes are useless for range queries, since the keys are stored in no particular order. A query like the one below cannot use a hash index and requires a full table scan, whereas a tree-like index keeps keys sorted and can serve it directly.
SELECT *
FROM table
WHERE income > 30000;
SSTables
SSTables (Sorted String Tables) are like our previous log segments, but with the added requirement that the sequence of key-value pairs is sorted by key.
Segments can be merged and compacted using a mergesort-like algorithm: read the input files side by side, look at the first key in each file, copy the lowest key to the output file, and repeat.
Since segments are written over time, the same key may appear in several input files; when it does, keep the value from the most recent segment and discard the older ones.
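A sketch of that merge, using Python's heapq.merge for the mergesort-style walk (segments given as sorted lists, oldest first):

import heapq
from itertools import groupby
from operator import itemgetter

def merge_segments(segments):
    # Tag each (key, value) pair with its segment number so that,
    # for equal keys, entries from older segments sort first.
    tagged = (
        ((key, i, value) for key, value in seg)
        for i, seg in enumerate(segments)
    )
    # heapq.merge does the side-by-side mergesort walk described above.
    for key, group in groupby(heapq.merge(*tagged), key=itemgetter(0)):
        *_, (_, _, value) = group  # keep only the newest segment's value
        yield key, value

old = [("apple", 1), ("banana", 2)]
new = [("banana", 20), ("cherry", 3)]
print(list(merge_segments([old, new])))
# -> [('apple', 1), ('banana', 20), ('cherry', 3)]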
To find a particular key in the file, you don't need a complete index of all keys. Because the keys are sorted, a sparse in-memory index (one key every few kilobytes, with its byte offset) is enough: find the two indexed keys that bracket the one you want and scan just that byte range.
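For example, with a hypothetical sparse index holding one entry every 4 KB (the keys and offsets here are made up):

import bisect

# One (key, byte offset) entry every few KB, possible only
# because the SSTable is sorted by key.
sparse_index = [("apple", 0), ("handbag", 4096), ("parsley", 8192)]
keys = [k for k, _ in sparse_index]

def offset_range(key):
    """Byte range of the SSTable file that could contain `key`."""
    i = bisect.bisect_right(keys, key) - 1   # last indexed key <= `key`
    start = sparse_index[max(i, 0)][1]
    end = sparse_index[i + 1][1] if i + 1 < len(sparse_index) else None
    return start, end  # scan just this slice of the file

print(offset_range("lemon"))  # -> (4096, 8192)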
Thus, the whole engine works like this:
- When a write comes in, add it to an in-memory balanced tree data structure (for example, a red-black tree). This in-memory tree is sometimes called a memtable.
- When the memtable gets bigger than some threshold (typically a few megabytes), write it out to disk as an SSTable file. This can be done efficiently because the tree already maintains the key-value pairs sorted by key. The new SSTable file becomes the most recent segment of the database. While the SSTable is being written out to disk, writes can continue to a new memtable instance.
- In order to serve a read request, first try to find the key in the memtable, then in the most recent on-disk segment, then in the next-older segment, etc.
- From time to time, run a merging and compaction process in the background to combine segment files and to discard overwritten or deleted values.
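As a toy sketch of these four steps (a plain dict plus sorted lists standing in for a real balanced memtable and real on-disk segments; compaction is left out):

import bisect

class MiniLSM:
    """Toy memtable + SSTable flush, following the steps above."""
    def __init__(self, threshold=4):
        self.memtable = {}           # in-memory, holds the newest writes
        self.segments = []           # "on-disk" SSTables, newest last
        self.threshold = threshold

    def set(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.threshold:
            # Flush: write the pairs out sorted by key as a new segment.
            self.segments.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:               # 1. check the memtable
            return self.memtable[key]
        for seg in reversed(self.segments):    # 2. newest segment first
            keys = [k for k, _ in seg]
            i = bisect.bisect_left(keys, key)  # binary search: keys sorted
            if i < len(keys) and keys[i] == key:
                return seg[i][1]
        return None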
LSM-Trees
SSTables can be combined into a Log-Structured Merge-Tree (LSM-tree): a cascade of SSTables that is merged and compacted in the background. This is the design behind LevelDB and RocksDB.
B-Trees
- B-trees are balanced trees (like red-black trees) that break the database down into fixed-size pages, where each page holds sorted keys plus references to child pages. Because the keys stay sorted, B-trees support range operators (<, >, <=, >=) and are cache friendly.
- B-trees take up a significant amount of space alongside the table.
- An update can fail partway through (for example, while splitting a page), so a WAL (Write-Ahead Log) is often kept alongside the tree for crash recovery.
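A minimal sketch of how a lookup descends a B-tree, with a simplified Page class standing in for fixed-size disk pages:

import bisect

class Page:
    """A B-tree page: sorted keys plus either child pages or values."""
    def __init__(self, keys, children=None, values=None):
        self.keys = keys          # sorted keys within this page
        self.children = children  # len(keys) + 1 children (interior page)
        self.values = values      # one value per key (leaf page)

def btree_get(page, key):
    while page.children is not None:          # descend interior pages
        i = bisect.bisect_right(page.keys, key)
        page = page.children[i]
    i = bisect.bisect_left(page.keys, key)    # binary search in the leaf
    if i < len(page.keys) and page.keys[i] == key:
        return page.values[i]
    return None

# Two-level example: a root page with one separator key.
leaf1 = Page(keys=[1, 3], values=["a", "c"])
leaf2 = Page(keys=[5, 9], values=["e", "i"])
root = Page(keys=[5], children=[leaf1, leaf2])
print(btree_get(root, 9))  # -> "i"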
Other Indexing Structures
It is common to add a secondary index to a table, indexing something other than the primary key. This is especially useful on read-heavy tables (think an accounts table, or a transactions table).
Some databases, like Postgres and MS SQL Server, allow partial indexes, where an index only covers the rows matching some predicate. This keeps the index smaller and can be very useful in certain cases.
Multi-column Indexes
You can index multiple columns at a time, which is common for geospatial data (for example, PostGIS with a GiST (quadtree or R-tree) index).
Here’s an example of a 2D query:
SELECT *
FROM restaurants
WHERE latitude > 51.4946 AND latitude < 51.5079
AND longitude > -0.1162 AND longitude < -0.1004;
Full-text search and fuzzy indexes
Lucene supports fuzzy search for words within a certain edit distance, and stores its term dictionary in an LSM-tree-like structure.
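Edit distance counts the single-character insertions, deletions, and substitutions needed to turn one word into another; a standard dynamic-programming sketch:

def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    prev = list(range(len(b) + 1))   # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        cur = [i]                    # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                  # delete from a
                           cur[j - 1] + 1,               # insert into a
                           prev[j - 1] + (ca != cb)))    # substitute
        prev = cur
    return prev[-1]

print(edit_distance("kitten", "sitting"))  # -> 3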
Keeping everything in memory
Disk offers two advantages over RAM: it is cheaper (roughly 10x per byte), and it is durable, so its contents survive a power-off or a crash.
RAM is becoming far cheaper, which allows in-memory Key-value stores, like Memcached and Redis to be doable.
In-memory key-value databases still write to disk (normally in the form of an append-only log) in order to allow for backups and recovery. They are fast because they serve reads purely from memory, yet they keep some durability guarantees because of that log. And because in-memory stores like Redis don't have to represent their data structures in a disk-friendly form, structures that are hard to implement on disk, such as queues and sets, come built in.
Transaction Processing or Analytics?
- OLTP (Transaction Processing Systems)
- You read and write small amounts of data per request.
- OLAP (Analytics Systems)
- Data Lakes for storing less frequently accessed data.
- ETL (Extract, Transform, Load) is used to get data into data warehouses for business analysts.
Stars and Snowflakes: Schemas for Analytics
Data warehouses are usually modeled in a simple, formulaic style known as a star schema. At the center of the schema is a fact table, where each row represents an event that occurred at a particular time.
A fact table might look like this:
date_key  product_sk  store_sk  promotion_sk  customer_sk  quantity  net_price  discount_price
140102    31          3         NULL          NULL         1         2.49       2.49
140102    69          5         19            NULL         3         14.99      9.99
140102    74          3         23            191          1         4.49       3.89
And it might have other tables like this product table it points to:
product_sk  sku     description  brand       category
30          OK4012  Bananas      Freshmax    Fresh fruit
31          KA9511  Fish food    Aquatech    Pet supplies
32          AB1234  Croissant    Dealicious  Bakery
This is called a star schema because, when the relationships are visualized, the fact table sits in the middle with a foreign key pointing to each surrounding dimension table, like the rays of a star.
Column-Oriented Storage
Let’s say we want to query a table for lots of rows but only a few columns; say all purchases in 2013.
We’ll need the date, the product_sk, the quantity, and price. How would we do this efficiently?
In most RDBMSs, data is stored and laid out row by row. Even when the relevant columns are indexed, the engine must load each whole row from disk into memory before parsing and filtering it, which is wasteful when a query touches only a few columns.
Column-oriented storage holds the values from each column together, instead.
Instead of something like this, where data is accessed by row:
[
# date_key, product_sk, store_sk
[140102, 69, 4],
[140102, 69, 5],
[140102, 74, 3]
]
Imagine one list per column, each holding all of that column's values:
date_keys = [140102, 140102, 140102]
product_sks = [69, 69, 74]
store_sks = [4,5,3]
To reassemble a row, take the entry at the same position in every column list: the kth item of one column belongs to the same row as the kth item of every other column.
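Using the lists above:

# Row i is just position i in every column list:
rows = list(zip(date_keys, product_sks, store_sks))
print(rows[1])  # -> (140102, 69, 5)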
Run-Length Encoding
This scheme also opens up compression opportunities, for example bitmap encoding, which in turn lends itself well to Run-Length Encoding.
Let's take our product_sk column and encode it as bitmaps instead:
product_sks = [69, 69, 74]
product_sk_69 = [1, 1, 0]
product_sk_74 = [0, 0, 1]
We have two distinct products, so we generate one bitmap per value: one for 69 and one for 74. Now, let's say we want to run a query like this:
SELECT date_key
FROM fact_table
WHERE product_sk = 69;
All we have to do is scan the product_sk_69 bitmap to see that the first two rows are set, and fetch those rows.
If we want to do something like this:
SELECT date_key
FROM fact_table
WHERE product_sk IN (69, 74);
We can bitwise OR the relevant bitmaps:
product_sk_69_or_74 = [x | y for x, y in zip(product_sk_69, product_sk_74)]
# This is [1, 1, 1]
We can then fetch the rows whose bits are set in the result.
In short, column storage makes very efficient use of CPU cycles, since compressed column chunks fit in the CPU cache and can be processed in tight loops.
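As for the Run-Length Encoding mentioned earlier: a bitmap for a rare product is mostly zeros, and long runs of identical bits collapse to almost nothing. A minimal sketch:

from itertools import groupby

def run_length_encode(bitmap):
    """Collapse runs of identical bits into (bit, run length) pairs."""
    return [(bit, len(list(run))) for bit, run in groupby(bitmap)]

print(run_length_encode([1, 1, 0]))         # -> [(1, 2), (0, 1)]
print(run_length_encode([0] * 1000 + [1]))  # -> [(0, 1000), (1, 1)]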
Sort Order in Column Storage
To sort data in column storage, we can't sort each column independently, since that would break the positional correspondence between columns; whole rows must be sorted together. A wise DB admin may choose to sort rows by a frequently queried column, like date, so that queries over a date range are faster. They may also pick second and third sort keys to break ties, but these yield smaller gains.
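A small illustration with made-up rows, sorting by date first and product_sk second:

# Hypothetical rows: (date_key, product_sk, store_sk)
rows = [(140103, 74, 3), (140102, 69, 5), (140102, 31, 3)]

# Sort whole rows: date_key is the first sort key, product_sk the second.
rows.sort(key=lambda r: (r[0], r[1]))

# Re-derive the column lists, keeping positions aligned across columns:
date_keys, product_sks, store_sks = map(list, zip(*rows))
print(date_keys)  # -> [140102, 140102, 140103]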
Writing to Column-Oriented Storage
Updating a value in place is impossible with sorted, compressed columns. However, the LSM-tree approach works here: all writes go to an in-memory store first, and are later merged into sorted structures and written out to disk.
Aggregation: Data Cubes and Materialized Views
Views allow for faster reads in SQL databases. If we often run the same set of queries, it's wasteful to recompute them from scratch each time. We can create a view, a virtual table that acts as a shortcut for writing a specific query, or a materialized view, which is an actual copy of the query's results, persisted to disk.
On a write to a column that a materialized view depends on, the materialized view needs to update itself, so they are more common on read-heavy workloads.
A special case of this is a data cube or OLAP cube: a grid of aggregates grouped by dimension. It gathers a set of materialized aggregates along specific dimensions, letting users query along more dimensions than a single materialized view would allow. This can be a noticeable boost for some queries.
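As a toy illustration, a two-dimensional cube over made-up fact rows, pre-aggregating net_price by (date, product):

from collections import defaultdict

# Hypothetical fact rows: (date_key, product_sk, net_price)
facts = [(140102, 31, 2.49), (140102, 69, 14.99), (140103, 31, 2.49)]

# The cube: one pre-computed aggregate per (date, product) cell.
cube = defaultdict(float)
for date_key, product_sk, net_price in facts:
    cube[(date_key, product_sk)] += net_price

# Queries along a dimension reduce to summing one row/column of the grid:
sales_on_140102 = sum(v for (d, _), v in cube.items() if d == 140102)
print(sales_on_140102)  # -> 17.48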