The DBMS’ optimizer uses an internal cost model to estimate the cost of a query plan. This gives a heuristic for choosing which query plan is better without actually running it.
These estimates are derived from internal statistics that the DBMS stores about tables, attributes, and indexes in its catalog. For open-source systems, they can be found in information_schema or a similar table.
For a relation \(R\), the DBMS stores the number of tuples (\(N_R\)) and the number of distinct values of each attribute \(A\) (\(V(A, R)\)).
The selection cardinality \(SC(A, R)\) is the average number of records with a given value for attribute \(A\): \(SC(A, R) = N_R / V(A, R)\).
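As a quick illustration, here is a minimal sketch of this formula in Python; the statistics (\(N_R = 10{,}000\), \(V(A,R) = 100\)) are made up for the example.

```python
# A minimal sketch of the selection-cardinality formula, using made-up
# statistics (N_R = 10,000 tuples, V(A, R) = 100 distinct values).

def selection_cardinality(n_r: int, v_a_r: int) -> float:
    """SC(A, R) = N_R / V(A, R): the average number of records per value of A,
    assuming values of A are uniformly distributed."""
    return n_r / v_a_r

# A predicate like `A = constant` is estimated to match ~100 tuples.
print(selection_cardinality(10_000, 100))  # 100.0
```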
Complex Predicates
Join Estimation
For a join of \(R\) and \(S\) on a common attribute \(A\), the estimated result size is:
\[estSize \approx \frac{N_R \cdot N_S}{\max(V(A,R),\, V(A,S))}\]
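A small sketch of this estimate, again with hypothetical statistics for \(R\) and \(S\):

```python
# A minimal sketch of the join-size estimate above, using made-up statistics.
# For an equi-join of R and S on attribute A:
#   estSize ≈ N_R * N_S / max(V(A, R), V(A, S))

def estimate_join_size(n_r: int, n_s: int, v_a_r: int, v_a_s: int) -> float:
    return n_r * n_s / max(v_a_r, v_a_s)

# Example: R has 10,000 tuples, S has 2,000 tuples,
# and A has 100 distinct values in R and 50 in S.
print(estimate_join_size(10_000, 2_000, 100, 50))  # 200000.0
```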
We don’t want to store statistics for every value, since that is expensive. We thus want to find a good, compact summary of the data.
Histograms:
Histograms work well for uniformly distributed data, but real-world data rarely is. One optimization is to group values into buckets to reduce the histogram’s size, though this can introduce inaccuracies when the distribution within a bucket is skewed.
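The sketch below shows an equi-width histogram (one possible bucketing scheme) and how skew inside a bucket throws off a range estimate; the data and bucket count are invented for the example.

```python
# A minimal sketch of an equi-width histogram used for range estimates.
# The data and bucket count are made up for the example.

class EquiWidthHistogram:
    def __init__(self, values, num_buckets):
        self.lo = min(values)
        self.width = (max(values) - self.lo) / num_buckets
        self.counts = [0] * num_buckets
        for v in values:
            # Clamp the maximum value into the last bucket.
            idx = min(int((v - self.lo) / self.width), num_buckets - 1)
            self.counts[idx] += 1

    def estimate_range(self, lo, hi):
        """Estimate how many values fall in [lo, hi), assuming values are
        uniformly distributed *within* each bucket."""
        total = 0.0
        for i, count in enumerate(self.counts):
            b_lo = self.lo + i * self.width
            b_hi = b_lo + self.width
            overlap = max(0.0, min(hi, b_hi) - max(lo, b_lo))
            total += count * (overlap / self.width)
        return total

# Skewed data: 900 copies of the value 1, plus the values 2..101.
data = [1] * 900 + list(range(2, 102))
hist = EquiWidthHistogram(data, num_buckets=10)
print(hist.estimate_range(1, 2))  # ~90.9, but the true count is 900 (skew hidden inside a bucket)
```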
Sampling:
Modern DBMSs use sampling to estimate predicate selectivity. This is simple: taking more samples gives a more accurate estimate, although the result can still be skewed for some datasets.
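A minimal sketch of the idea, with a hypothetical in-memory table and predicate (a real DBMS samples tuples or pages from its own storage):

```python
# A minimal sketch of sampling-based selectivity estimation. The table and
# predicate are hypothetical; a real DBMS samples tuples/pages from storage.
import random

def estimate_selectivity(table, predicate, sample_size=1_000):
    """Estimate the fraction of tuples satisfying `predicate` from a random sample."""
    sample = random.sample(table, min(sample_size, len(table)))
    matches = sum(1 for row in sample if predicate(row))
    return matches / len(sample)

# Example: estimate how many rows have value > 900 in a 100,000-row table.
table = [{"id": i, "value": random.randint(0, 1000)} for i in range(100_000)]
selectivity = estimate_selectivity(table, lambda r: r["value"] > 900)
print(f"estimated matching rows: {selectivity * len(table):.0f}")  # ~10,000 expected
```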
The basic blueprint for a plan search algorithm is to enumerate candidate plans (join orderings, join algorithms, and access methods for each table) and estimate the cost of each. It’s important to pick the best access method for each table accessed in the query, as sketched below.
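As a rough illustration, the sketch below compares a sequential scan against an unclustered index scan using simplified, made-up cost formulas; it is not any real system’s cost model.

```python
# A minimal sketch of picking an access method from estimated costs. The cost
# formulas are simplified illustrations, not any real DBMS's cost model.

def choose_access_method(num_pages, num_tuples, selectivity, has_index):
    seq_cost = num_pages  # a sequential scan reads every page once
    if not has_index:
        return "seq_scan", seq_cost
    # Unclustered index: assume ~1 random page fetch per matching tuple,
    # plus a fixed cost (3 here) for traversing the index itself.
    idx_cost = 3 + selectivity * num_tuples
    return ("index_scan", idx_cost) if idx_cost < seq_cost else ("seq_scan", seq_cost)

# A highly selective predicate favors the index scan...
print(choose_access_method(num_pages=1_000, num_tuples=100_000,
                           selectivity=0.001, has_index=True))  # ('index_scan', 103.0)
# ...while a non-selective one favors the sequential scan.
print(choose_access_method(num_pages=1_000, num_tuples=100_000,
                           selectivity=0.5, has_index=True))    # ('seq_scan', 1000)
```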
For multiway joins, the number of alternative plans grows rapidly as the number of joined tables increases. For an n-way join, the number of different ways to order the joins grows with the Catalan numbers (roughly \(4^n\)). This can be cut down with heuristics, like considering only left-deep join trees, which allow the DBMS to pipeline data and maintain only a single join table in memory at a time.
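The sketch below brute-forces all left-deep join orders for three tables and keeps the cheapest; the cardinalities, fixed join selectivity, and cost model are invented, and a real optimizer would use dynamic programming with heuristics rather than enumerating every permutation.

```python
# A minimal sketch of enumerating left-deep join orders and keeping the
# cheapest one. The cardinalities, the fixed join selectivity, and the cost
# model (sum of intermediate result sizes) are all made up.
from itertools import permutations

CARD = {"R": 10_000, "S": 2_000, "T": 500}  # estimated base-table cardinalities

def left_deep_cost(order, selectivity=0.01):
    """Sum the estimated sizes of the intermediate results produced as each
    table is joined onto the running left-deep plan."""
    cost = 0.0
    result_size = CARD[order[0]]
    for table in order[1:]:
        result_size = result_size * CARD[table] * selectivity
        cost += result_size
    return cost

best = min(permutations(CARD), key=left_deep_cost)
print(best, left_deep_cost(best))  # joining the two smallest tables first wins
```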
The DBMS can handle nested sub-queries in a few ways: it can rewrite the query to de-correlate and/or flatten the sub-query into a join, or it can decompose the sub-query, run it separately, and store its result for the outer query to use.