ParadeDB

Tokenization: splitting text into searchable chunks
Stemming: reducing words to root, e.g. running → run
inverted indexes: data structure for full text search
Aggregation: statistics over a dataset, counts, sums, averages
Facet: Aggregation over full text search
Elastic DSL: DSL used by elastic to search

ParadeDB: Elasticsearch alternative built on Postgres:

pg_search: Full text search with BM25
pg_analytics: Read data lakes (S3) and table formats (iceberg)

Why Migrate?

Data reliability
Freshness
No schema changes
Fast reindexing in update-heavy scenarios

What is Full Text Search?

Query documents by presence of keywords or phases Simple or Complex

Indexing: Tokenize and store text in an inverted index
Querying: Finding text in the index that matches a query

Full Text Search in Postgres

Beforehand:

LIKE operator
ts_vector/ts_rank + GIN index
pg_tgrm (pg_trigram)

ts_vector stores the tokenized, stemmed representation of text. ts_rank lets you rank them using TF_IDF. GIN indexes are an inverted index of ts_vector columns, improving query performance.

pg_tgrmturns text into trigrams, so cheese → (che, hee, ees, ese).

What is Postgres FTS Missing?

BM25 Relevance (Best match 25), kinda like TF-IDF that considers relevance of documents
More powerful tokenizers and token filters
Elastic DSL search
Fast facets and aggregations

BM25

Improvement on the TF-IDF algorithm

Longer documents have more terms, but these aren’t necessarily more relevant. BM25 takes care of document length.
More common words are considered less important.

What is pg_search

Rust extension that uses Tantivy to add FTS to postgres.
Tantivy is a FTS library in rust with inverted indexes and columnar storage

Tantivy is Fast

Immutable segment-based design
Mmapped files for fast I/O
Rust and SIMD

How pg_search is Build

Custom FTS operator @@@
Custom postgres index
Query builder API
custom scan

SELECT * from mock_items WHERE id < 10 AND description @@@ 'keyboard'; -- searches for keyboards

Custom Index

It’s a BM25 index on any text field. It’s also meant to be covering, so over multiple columns.

Query Builder API

Queries can use the query builder API:

SELECT * FROM mock_items WHERE id => paradedb.boolean(
	should => ARRAY[
	paradedb.boost(query => paradedb.parse('description:shoes'), boost => 2.0), paradedb.term(field => 'description', value => 'running)
	]
);

Custom Scan

Allows for 3 optimizations for FTS
Predicate pushdown
BM25 scoring
Fast facets + aggregations

Predicate Pushdown

SELECT * from mock_items WHERE description @@@ 'keyboard' AND rating < 5.
Without a custom scan, postgres performs separate scans over description and rating, even if these are both indexed

Scoring Projection

SELECT * FROM mock_items WHERE description @@@ 'keyboard' lets you project a score column into the result

Tantivy has fast fields, which is columnar storage, so postgres can do visibility checking on batches
Even faster facets with DataFusion

Takashi's Notes

Explorer

ParadeDB

ParadeDB

Why Migrate?

What is Full Text Search?

Full Text Search in Postgres

What is Postgres FTS Missing?

BM25

What is pg_search

Tantivy is Fast

How pg_search is Build

Custom Index

Query Builder API

Custom Scan

Predicate Pushdown

Scoring Projection

Facets Are Slow

Graph View

Table of Contents