Chapter 8: Specialized Databases โ
"A specialized database does one thing 10ร faster than a general-purpose database. The engineering question is always: does your workload actually need 10ร?"
Mind Map โ
Prerequisites
This chapter assumes familiarity with storage engine fundamentals (Ch01) and PostgreSQL internals (Ch05). Review those first if needed.
Time-Series Databases โ
Time-series workloads share three characteristics: data arrives with a timestamp, it is almost always appended (rarely updated), and queries are predominantly range-based (e.g., "last 24 hours"). These properties call for different storage optimizations than OLTP databases.
ClickHouse: Columnar OLAP at CDN Scale โ
ClickHouse stores data column-by-column rather than row-by-row. A query touching only 3 columns of a 100-column table reads 3% of the data a row-store would read. Combined with vectorized execution (SIMD CPU instructions processing 256 bits at once), ClickHouse achieves extraordinary scan throughput.
MergeTree Engine โ
ClickHouse's primary storage engine is MergeTree โ a family of LSM-like engines that append data in sorted parts and merge them in the background.
-- ClickHouse: web analytics events table
CREATE TABLE page_views (
timestamp DateTime,
session_id UUID,
user_id UInt64,
page_url String,
referrer String,
country LowCardinality(String), -- dictionary encoding for low-cardinality
duration_ms UInt32,
is_bounce UInt8
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp) -- one partition per month
ORDER BY (timestamp, session_id) -- primary sort key (also the sparse index)
TTL timestamp + INTERVAL 2 YEAR; -- auto-delete rows older than 2 years
-- Insert at high throughput: use batch inserts (100K+ rows per INSERT)
-- ClickHouse amortizes merge overhead across large batches
-- Materialized view: pre-aggregate by hour for dashboard queries
CREATE MATERIALIZED VIEW hourly_stats
ENGINE = SummingMergeTree()
ORDER BY (hour, country, page_url)
AS SELECT
toStartOfHour(timestamp) AS hour,
country,
page_url,
count() AS views,
sum(duration_ms) AS total_duration,
countIf(is_bounce = 1) AS bounces
FROM page_views
GROUP BY hour, country, page_url;ClickHouse Benchmarks โ
| Metric | Value | Notes |
|---|---|---|
| Ingestion rate | 4M rows/second | Single node, batch inserts, uncompressed |
| Compression ratio | 5โ15ร | LZ4 default; ZSTD for higher ratio |
| Scan throughput | 2โ100 GB/s | Depends on query selectivity and column count |
| Query latency (1B rows, aggregation) | 100msโ2s | Single node, indexed sort key |
| Storage efficiency | 10โ50 bytes/row | After compression (vs 100โ500 bytes in PostgreSQL) |
TimescaleDB: PostgreSQL for Time-Series โ
TimescaleDB is a PostgreSQL extension that adds time-series capabilities: automatic partitioning (hypertables), native compression, and continuous aggregates.
-- Create a hypertable: PostgreSQL table + automatic time-based partitioning
SELECT create_hypertable('sensor_readings', 'time',
chunk_time_interval => INTERVAL '1 week' -- one partition per week
);
-- Native compression: 10-30ร compression ratio
ALTER TABLE sensor_readings SET (
timescaledb.compress,
timescaledb.compress_orderby = 'time DESC',
timescaledb.compress_segmentby = 'sensor_id'
);
-- Compress chunks older than 7 days
SELECT add_compression_policy('sensor_readings', INTERVAL '7 days');
-- Continuous aggregate: materializes query results, refreshes automatically
CREATE MATERIALIZED VIEW sensor_hourly
WITH (timescaledb.continuous) AS
SELECT
sensor_id,
time_bucket('1 hour', time) AS bucket,
AVG(value) AS avg_val,
MAX(value) AS max_val,
MIN(value) AS min_val
FROM sensor_readings
GROUP BY sensor_id, bucket;
SELECT add_continuous_aggregate_policy('sensor_hourly',
start_offset => INTERVAL '2 days',
end_offset => INTERVAL '1 hour',
schedule_interval => INTERVAL '30 minutes'
);
-- Query the continuous aggregate (fast: hits materialized data)
SELECT bucket, avg_val
FROM sensor_hourly
WHERE sensor_id = 'sensor-42'
AND bucket >= NOW() - INTERVAL '7 days'
ORDER BY bucket DESC;TimescaleDB Benchmarks:
| Metric | Value | Notes |
|---|---|---|
| Ingestion rate | 1.3M rows/second | Single node, via COPY or multi-row INSERT |
| Compression ratio | 10โ30ร | Native columnar compression on chunks |
| Query speedup vs vanilla PostgreSQL | 10โ100ร | For time-range aggregations on compressed chunks |
| Continuous aggregate query | 1โ50ms | Pre-materialized; near-instant |
InfluxDB: Purpose-Built TSDB โ
InfluxDB 3.0 (IOx engine) is a rewrite in Rust using Apache Arrow and Parquet for storage, enabling SQL queries via Apache DataFusion. The older InfluxDB 1.x and 2.x use the TSM (Time-Structured Merge tree) storage engine.
-- InfluxDB 3.0: SQL interface over IOx
SELECT
date_trunc('hour', time) AS hour,
mean(temperature) AS avg_temp,
max(temperature) AS max_temp
FROM weather_readings
WHERE location = 'NYC'
AND time >= now() - INTERVAL '24 hours'
GROUP BY hour
ORDER BY hour DESC;Ingestion Benchmarks Comparison:
| Database | Ingestion Rate (single node) | Compression | Best For |
|---|---|---|---|
| ClickHouse | 4M rows/second | 10โ50ร | OLAP analytics, CDN-scale events |
| TimescaleDB | 1.3M rows/second | 10โ30ร | PostgreSQL teams, mixed OLTP+TS |
| InfluxDB 3.0 | 1.2M rows/second | 5โ20ร | IoT, metrics, pure time-series |
When PostgreSQL Is Enough โ
Search Engines โ
Elasticsearch: Inverted Index at Scale โ
Elasticsearch stores documents in an inverted index โ a map from terms to the list of documents containing that term. This enables full-text search across billions of documents in milliseconds.
BM25 Ranking considers term frequency in the document, inverse document frequency (rarer terms score higher), and document length normalization. It outperforms TF-IDF for most text search tasks.
// Elasticsearch: product search with filters and aggregations
{
"query": {
"bool": {
"must": [
{
"multi_match": {
"query": "wireless headphones",
"fields": ["title^3", "description", "tags"],
"type": "best_fields",
"fuzziness": "AUTO"
}
}
],
"filter": [
{ "term": { "in_stock": true } },
{ "range": { "price": { "gte": 20, "lte": 300 } } }
]
}
},
"aggs": {
"brands": {
"terms": { "field": "brand.keyword", "size": 10 }
},
"price_ranges": {
"range": {
"field": "price",
"ranges": [
{ "to": 50 }, { "from": 50, "to": 150 }, { "from": 150 }
]
}
}
},
"from": 0, "size": 20,
"sort": [{ "_score": "desc" }, { "sales_rank": "asc" }]
}Meilisearch: Sub-50ms Typo-Tolerant Search โ
Meilisearch maintains its entire index in RAM (memory-mapped files with mmap). This enables <50ms search latency at the cost of memory requirements proportional to index size.
// Meilisearch: instant search with typo tolerance
const client = new MeiliSearch({ host: 'http://localhost:7700' });
const index = client.index('products');
// Configure relevance and filtering
await index.updateSettings({
searchableAttributes: ['title', 'description', 'tags'],
filterableAttributes: ['category', 'price', 'in_stock'],
sortableAttributes: ['price', 'sales_rank'],
typoTolerance: {
enabled: true,
minWordSizeForTypos: { oneTypo: 4, twoTypos: 8 }
}
});
// Search: returns results in < 50ms for < 10M documents
const results = await index.search('wireles headphone', {
filter: 'in_stock = true AND price < 200',
sort: ['sales_rank:asc'],
facets: ['category', 'brand'],
limit: 20
});Typesense: Raft-Replicated In-Memory Search โ
Typesense uses Raft consensus for a distributed, fault-tolerant cluster of in-memory search nodes. Features native vector search (hybrid BM25 + ANN), multi-tenant isolation, and a strict schema.
Search Latency Benchmarks โ
| Engine | Dataset | p50 Latency | p99 Latency | Notes |
|---|---|---|---|---|
| Elasticsearch | 10K docs | 5ms | 15ms | Index warm, single shard |
| Elasticsearch | 1M docs | 10ms | 50ms | Single shard |
| Elasticsearch | 1B docs | 100ms | 500ms | Multi-shard cluster |
| Meilisearch | 10K docs | 1ms | 5ms | All in RAM |
| Meilisearch | 1M docs | 5ms | 20ms | RAM-mapped |
| Meilisearch | 1B docs | Not recommended | โ | Memory requirement too high |
| Typesense | 10K docs | 1ms | 5ms | In-memory |
| Typesense | 1M docs | 3ms | 15ms | In-memory |
| PostgreSQL (GIN/tsvector) | 1M docs | 10ms | 100ms | Good for simple FTS |
When PostgreSQL Full-Text Search Is Enough โ
-- PostgreSQL full-text search with GIN index
-- Handles up to ~5M documents well; beyond that, Elasticsearch wins
-- Add tsvector column for fast FTS
ALTER TABLE products ADD COLUMN search_vector tsvector;
UPDATE products SET search_vector =
to_tsvector('english',
coalesce(title, '') || ' ' ||
coalesce(description, '') || ' ' ||
coalesce(tags, '')
);
CREATE INDEX idx_products_fts ON products USING GIN(search_vector);
-- Trigger to keep search_vector current
CREATE OR REPLACE FUNCTION update_search_vector() RETURNS trigger AS $$
BEGIN
NEW.search_vector := to_tsvector('english',
coalesce(NEW.title, '') || ' ' ||
coalesce(NEW.description, '') || ' ' ||
coalesce(NEW.tags, ''));
RETURN NEW;
END $$ LANGUAGE plpgsql;
CREATE TRIGGER trg_search_vector BEFORE INSERT OR UPDATE ON products
FOR EACH ROW EXECUTE FUNCTION update_search_vector();
-- Search query with ranking
SELECT id, title,
ts_rank(search_vector, query) AS rank
FROM products,
to_tsquery('english', 'wireless & headphone') query
WHERE search_vector @@ query
AND in_stock = true
ORDER BY rank DESC
LIMIT 20;Graph Databases โ
Neo4j: Property Graph Model โ
Neo4j stores data as nodes (entities) and relationships (typed, directed edges), both with properties. The key innovation: index-free adjacency โ each node stores direct pointers to its adjacent nodes, eliminating the need for B-tree index lookups during traversal.
-- Cypher: find Alice's second-degree connections who liked a post
MATCH (alice:User {name: 'Alice'})-[:FOLLOWS]->(friend:User)-[:FOLLOWS]->(fof:User)
WHERE NOT (alice)-[:FOLLOWS]->(fof)
AND alice <> fof
RETURN fof.name, COUNT(friend) AS mutual_friends
ORDER BY mutual_friends DESC
LIMIT 10;
-- Fraud detection: find accounts sharing 3+ attributes (IP, device, email pattern)
MATCH (a:Account)-[:USES]->(attr)<-[:USES]-(b:Account)
WHERE a.id <> b.id
WITH a, b, COUNT(attr) AS shared_attrs
WHERE shared_attrs >= 3
RETURN a.id, b.id, shared_attrs
ORDER BY shared_attrs DESC;When Graph Beats Relational Joins โ
The performance advantage of graph traversal over SQL JOINs grows with traversal depth. For deep relationship queries (3+ hops), graph databases are dramatically faster because they follow direct pointers rather than performing index lookups and hash joins.
| Traversal Depth | SQL (self-join) | Neo4j | Winner |
|---|---|---|---|
| 1 hop (direct friends) | 2ms | 2ms | Tie |
| 2 hops (friends of friends) | 20ms | 5ms | Neo4j (4ร) |
| 3 hops | 300ms | 8ms | Neo4j (37ร) |
| 4 hops | 30s+ | 15ms | Neo4j (2000ร) |
| 5+ hops | Minutes/timeout | 50ms | Neo4j |
Benchmark data: LinkedIn recommendation engine, 10M user graph, Neotechnology (2013).
Use Cases for Graph Databases โ
| Use Case | Why Graph Wins | Example Query |
|---|---|---|
| Social network recommendations | Multi-hop traversal (FOAF, common interests) | "People you may know" |
| Fraud detection | Find connected rings of suspicious accounts | Shared device/IP/card chains |
| Knowledge graph | Arbitrary relationship types, semantic queries | "What drugs interact with X?" |
| Supply chain tracking | Bill of materials, dependency graphs | "What components use this chip?" |
| Access control (RBAC/ABAC) | Hierarchical permissions, role inheritance | "Does user have access via role chain?" |
Vector Databases (2026) โ
As of 2026, vector search is standard infrastructure for AI-augmented products. Every major RDBMS (PostgreSQL via pgvector, MongoDB Atlas, MySQL 9 Innovation) has native vector support. The question is no longer whether to add vector search, but which tier of implementation matches your scale.
Index Algorithms: HNSW vs IVFFlat โ
HNSW (Hierarchical Navigable Small World) โ the production standard:
- Multi-layer graph: top layers are sparse long-range links, bottom layers are dense short-range links
- Query: start at top layer, greedily navigate toward query point, descend layers
- No training step โ add vectors incrementally
- Excellent recall (>95% at typical settings), fast queries
- Memory: ~1.5โ2ร raw embedding size
- Tune with
m(graph connectivity, default 16) andef_construction(build quality, default 64)
IVFFlat (Inverted File Flat) โ lighter-weight alternative:
- Training phase: cluster embeddings into N centroids via k-means
- Query: find M nearest centroids, then search exhaustively within them
- Lower memory than HNSW, but requires training data upfront
- Recall degrades if data distribution shifts after training
For new deployments: default to HNSW. Use IVFFlat only when memory is severely constrained.
pgvector: Integrated Vector Search โ
pgvector 0.8+ is production-grade. All major cloud PostgreSQL providers support CREATE EXTENSION vector natively. Used in production by OpenAI, Supabase, and Neon.
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE product_embeddings (
product_id BIGINT PRIMARY KEY,
category TEXT,
embedding vector(1536) -- OpenAI text-embedding-3-small dimensions
);
-- HNSW index: m=16 (connectivity), ef_construction=64 (build quality)
CREATE INDEX ON product_embeddings
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
-- Set query-time search width (higher = better recall, slower)
SET hnsw.ef_search = 100; -- default 40; raise for higher recall needs
-- Semantic search with pre-filter (applied before HNSW when selective)
SELECT
product_id,
category,
1 - (embedding <=> '[0.1, 0.2, ...]'::vector) AS cosine_similarity
FROM product_embeddings
WHERE category = 'electronics'
ORDER BY embedding <=> '[0.1, 0.2, ...]'::vector
LIMIT 10;Dedicated Vector Databases (2026 State) โ
| System | Language | Index | Filtering | Hybrid Search | Best For |
|---|---|---|---|---|---|
| pgvector 0.8+ | โ (PostgreSQL ext.) | HNSW, IVFFlat | Full SQL WHERE | tsvector + ANN | PostgreSQL teams, < 10M vectors |
| Qdrant | Rust | HNSW | Payload filters (rich JSON) | Sparse + dense | Complex filtered ANN, low latency |
| Weaviate | Go | HNSW | GraphQL | Native BM25 + ANN (RRF) | Hybrid semantic + keyword |
| Milvus | Go/C++ | HNSW + IVF + DiskANN | Scalar + JSON | Yes | Distributed, 1B+ vectors |
| Pinecone | Managed SaaS | Proprietary | Metadata filters | Sparse + dense | Production SaaS, zero ops overhead |
| Chroma | Python | HNSW | Metadata | Basic | Local dev, prototyping |
Qdrant performance note (2026): Written in Rust with a dedicated HNSW implementation, Qdrant benchmarks show 10โ25% lower p99 latency than equivalent pgvector HNSW at 1Mโ10M vectors on the same hardware. However, the operational overhead of a second system often outweighs the latency difference for teams already on PostgreSQL.
Scale-Based Selection โ
Hybrid Search: BM25 + ANN โ
Pure vector search misses exact keyword matches ("iPhone 15 Pro Max SKU"). Hybrid search combines keyword scoring (BM25) with vector similarity and merges via Reciprocal Rank Fusion (RRF):
def hybrid_search(query: str, top_k: int = 20) -> list:
# Run both in parallel
bm25_results = keyword_search(query, top=100) # returns [(id, bm25_score)]
ann_results = vector_search(embed(query), top=100) # returns [(id, distance)]
# Reciprocal Rank Fusion: score = 1/(k + rank), k=60 standard
scores = {}
for rank, (doc_id, _) in enumerate(bm25_results):
scores[doc_id] = scores.get(doc_id, 0) + 1 / (60 + rank + 1)
for rank, (doc_id, _) in enumerate(ann_results):
scores[doc_id] = scores.get(doc_id, 0) + 1 / (60 + rank + 1)
return sorted(scores, key=scores.get, reverse=True)[:top_k]Weaviate has RRF built-in. For pgvector: combine tsvector GIN search with HNSW in two parallel queries and merge in application code.
RAG Architecture with pgvector โ
# Retrieval-Augmented Generation with pgvector
import openai
import psycopg2
def embed(text: str) -> list[float]:
response = openai.embeddings.create(
model="text-embedding-3-small",
input=text
)
return response.data[0].embedding
def retrieve(query: str, top_k: int = 5) -> list[dict]:
query_embedding = embed(query)
conn = psycopg2.connect(DATABASE_URL)
with conn.cursor() as cur:
cur.execute("""
SELECT document_id, chunk_text,
1 - (embedding <=> %s::vector) AS similarity
FROM document_chunks
ORDER BY embedding <=> %s::vector
LIMIT %s
""", (query_embedding, query_embedding, top_k))
return [{"id": r[0], "content": r[1], "similarity": r[2]}
for r in cur.fetchall()]
def answer(question: str) -> str:
context_docs = retrieve(question, top_k=5)
context = "\n\n".join(d["content"] for d in context_docs)
response = openai.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": f"Context:\n{context}"},
{"role": "user", "content": question}
]
)
return response.choices[0].message.contentDuckDB: In-Process OLAP (2026) โ
DuckDB 1.x (current: v1.5.3, May 2026; v2.0 planned September 2026) is an in-process analytical SQL engine โ no server, no configuration, no dependencies. It embeds inside your Python/Go/Node application like SQLite, but executes columnar OLAP queries with vectorized SIMD execution.
When DuckDB Replaces a Spark Cluster โ
For analytical jobs over local files or cloud object storage, DuckDB eliminates the need for a distributed cluster in most cases:
| Workload | Previous Approach | DuckDB Approach |
|---|---|---|
| Aggregate 10GB CSV/Parquet | Spark job (5โ15 min setup) | SELECT * FROM read_parquet('s3://...') โ seconds |
| Ad-hoc analytics on data lake | Athena/Presto (cost/latency) | DuckDB + Iceberg extension โ free, local |
| ETL between formats | Pandas + custom code | DuckDB SQL + COPY TO 'output.parquet' |
| Analytical queries in test | SQLite (row-store, slow) | DuckDB (column-store, fast) |
-- DuckDB: analyze Parquet files directly from S3 without ETL
-- Install AWS extension: INSTALL httpfs; LOAD httpfs;
SELECT
date_trunc('month', event_ts) AS month,
country,
COUNT(*) AS events,
SUM(revenue) AS total_revenue
FROM read_parquet('s3://data-lake/events/2026-*/*.parquet')
WHERE event_type = 'purchase'
GROUP BY month, country
ORDER BY month DESC, total_revenue DESC;
-- DuckDB: read Apache Iceberg table (built-in since v1.0)
-- INSTALL iceberg; LOAD iceberg;
SELECT * FROM iceberg_scan('s3://catalog/my_table')
WHERE event_date >= '2026-01-01';DuckDB Key Features (v1.5, 2026) โ
| Feature | Details |
|---|---|
| VARIANT type | JSON-like flexible column for semi-structured data (added v1.5) |
| GEOMETRY type | Geospatial queries without extensions |
| Iceberg support | Native Apache Iceberg reader (no Spark needed) |
| Quack protocol | Client-server mode for remote connections (v1.5) |
| Arrow integration | Zero-copy data exchange with pandas/polars |
| Persistent mode | Single .duckdb file โ embeds into any app |
DuckDB for OLTP teams
If your team runs PostgreSQL for OLTP, DuckDB is the lowest-friction analytical layer: read Parquet snapshots or connect to S3 directly. No cluster management. Run locally or in a Lambda/Cloud Run function. Not a replacement for ClickHouse at petabyte scale, but replaces 80% of Spark jobs.
Apache Iceberg: The Open Table Format Standard (2026) โ
Apache Iceberg has won the open table-format competition. As of 2026, it has broader engine support than Delta Lake or Apache Hudi across the data ecosystem.
Engine Support (2026) โ
| Engine | Iceberg Support |
|---|---|
| Apache Spark | Native (96% adoption in 2025 survey) |
| Apache Flink | Native (32% adoption) |
| Trino / Presto | Native (61% adoption) |
| Snowflake | Native (Iceberg tables GA) |
| BigQuery | Native (managed Iceberg) |
| AWS Athena | Native |
| DuckDB | Extension (v1.0+) |
| Delta Lake | Conversion supported, but separate ecosystem |
Why Iceberg Won โ
Iceberg solved problems that Hive-partitioned Parquet tables could not:
- ACID at file level: snapshot isolation, atomic partition swaps, rollback
- Schema evolution without rewrites: add/rename/drop columns without rewriting files
- Hidden partitioning: partition logic in catalog, not query WHERE clauses
- Time travel: query data as it existed at any past snapshot
- Partition evolution: change partition scheme without rewriting data
Apache Polaris: Vendor-Neutral Catalog โ
Apache Polaris graduated in February 2026 as a vendor-neutral REST catalog for Iceberg, with fine-grained access control (row/column level). It is now the standard catalog interface across cloud providers.
-- DuckDB: query Iceberg table with time travel
SELECT COUNT(*) FROM iceberg_scan(
's3://warehouse/sales',
version => 5 -- snapshot ID for time travel
);
-- Spark: Iceberg with schema evolution (no rewrite)
ALTER TABLE catalog.db.events
ADD COLUMN user_agent STRING; -- instant, no file rewrite
-- Query the new column (NULL for rows written before the schema change)
SELECT user_agent, COUNT(*) FROM catalog.db.events
WHERE event_date = '2026-01-01'
GROUP BY user_agent;Iceberg vs Delta Lake (2026)
Iceberg is the neutral choice โ no single vendor controls it. Delta Lake is Databricks-originated and works best within the Databricks ecosystem (Unity Catalog). Snowflake, BigQuery, AWS, and most independent tools have standardized on Iceberg. For new lakehouses not tied to Databricks: default to Iceberg.
Edge & Serverless Databases (2026) โ
A new category emerged between 2023โ2026: databases designed for the compute edge, where latency to a centralized data center is unacceptable and cold-start time matters.
The Embedded Replica Pattern โ
Turso/libSQL pioneered the embedded replica pattern for SQLite: the database file runs inside the application process (or VM), providing microsecond read latency. Writes go to a remote primary and sync to the local replica asynchronously.
Application Process
โโโ SQLite embedded replica (reads: microseconds)
โโโ Remote sync โ Turso primary (writes: ~10ms)
Benefits:
- Read latency: from ~1ms (network) to ~50ยตs (local disk/RAM)
- Isolation: each tenant gets their own SQLite file (SaaS multi-tenancy)
- No connection pool: process-local SQLite needs no poolEdge Database Options (2026) โ
| System | Engine | Model | Latency | Best For |
|---|---|---|---|---|
| Turso | libSQL (SQLite fork) | Embedded replica + remote primary | ยตs reads, ~10ms writes | Edge SaaS, multi-tenant isolation |
| Cloudflare D1 | SQLite | Per-Worker DB (regional) | 1โ5ms | Cloudflare Workers apps |
| Neon | PostgreSQL | Serverless (branch per env) | 5โ50ms cold, <5ms warm | PR preview envs, dev/test parity |
| PlanetScale | MySQL (Vitess) | Serverless | 5โ20ms | MySQL-native, enterprise ($39/mo min) |
Neon and Databricks
Neon was acquired by Databricks in May 2025. Post-acquisition, storage cost fell 80% ($1.75 โ $0.35/GB-month) and compute fell 15โ25%. The free plan doubled to 100 compute unit-hours/month. Neon's "instant branch" feature (branch a database in milliseconds for each PR) is widely used in CI/CD pipelines as of 2026.
When to Use Edge Databases โ
- Use Turso/libSQL: per-tenant SQLite isolation in a SaaS app, Cloudflare Workers stack, or edge function needing sub-millisecond reads
- Use Neon: PostgreSQL with branch-per-PR workflows, serverless compute that scales to zero
- Use Cloudflare D1: exclusively within the Cloudflare Workers ecosystem
- Do not use edge DBs for: large analytical workloads, joins across tenants, >100GB datasets per tenant
Decision Matrix โ
Use this table to choose a specialized database for your workload:
| Workload | Volume | Recommended DB | Reasoning |
|---|---|---|---|
| Application metrics, dashboards | < 10M rows/day | PostgreSQL + TimescaleDB | Same cluster, SQL, continuous aggregates |
| CDN/web analytics | > 100M events/day | ClickHouse | Columnar, 4M rows/sec ingestion, MergeTree |
| IoT sensor data, monitoring | 1Mโ100M rows/day | TimescaleDB or InfluxDB | Hypertables, compression, native time functions |
| Full-text search, autocomplete | < 5M docs | PostgreSQL tsvector + GIN | No additional system |
| Full-text search, relevance scoring | 5Mโ100M docs | Elasticsearch or Typesense | Inverted index, BM25, faceting |
| Instant search, typo tolerance | < 50M docs | Meilisearch | Sub-50ms, zero-config |
| Social graph, fraud rings | Any | Neo4j | Index-free traversal, multi-hop performance |
| Recommendation engine (graph) | < 100M nodes | Neo4j Community | Open source |
| Semantic search, RAG | < 10M embeddings | pgvector HNSW | Integrated with existing PostgreSQL |
| Semantic search, RAG | 10Mโ100M embeddings | Qdrant or Weaviate | Purpose-built ANN, payload filtering |
| Semantic search + keyword hybrid | Any | Weaviate or Typesense | Native BM25 + ANN, RRF built-in |
| Semantic search, production SaaS | > 100M embeddings | Pinecone Serverless | Fully managed, petabyte scale |
| Ad-hoc analytics on data lake | Any | DuckDB | In-process, Iceberg/Parquet native, no cluster |
| Data lake table format | Petabyte+ | Apache Iceberg | Broadest engine support (Spark/Flink/Trino/Snowflake) |
| Edge / per-tenant isolation | < 10GB per tenant | Turso (libSQL) | Embedded replica, microsecond reads |
| Serverless PostgreSQL | Variable | Neon | Branch per PR, scale to zero, 80% cheaper storage (2025) |
Monitoring Specialized Databases โ
Specialized databases have unique failure modes. Monitor these engine-specific metrics.
| Engine | Key Metrics | Tool | Critical Alert |
|---|---|---|---|
| ClickHouse | MergeTree parts count, Query duration p99, Delayed inserts, Replication queue | system.metrics, system.query_log | Parts > 300 per partition (merge backlog) |
| Elasticsearch | Cluster health (green/yellow/red), JVM heap %, Search latency p99, Unassigned shards | _cluster/health, _cat/nodes | Cluster status != green or JVM heap > 85% |
| Neo4j | Page cache hit ratio, Transaction count, Bolt connections, Store size | Neo4j Metrics API, Prometheus exporter | Page cache hit ratio < 95% |
| pgvector | Index build time, Recall rate, Probes per query | pg_stat_user_indexes, EXPLAIN ANALYZE | Recall < 95% at target latency (increase ef_search) |
-- ClickHouse: check merge health
SELECT table, count() AS parts, sum(rows) AS total_rows
FROM system.parts
WHERE active AND database = 'default'
GROUP BY table
ORDER BY parts DESC;
-- Elasticsearch: cluster health (via curl)
-- curl -s localhost:9200/_cluster/health?prettySpecialized DB Failure Modes
ClickHouse: too many parts = merge can't keep up with inserts โ insert delays. Elasticsearch: JVM GC pauses cause cluster instability โ size heap to 50% of RAM, max 31GB. Neo4j: cold page cache after restart โ warm with traversal queries before accepting traffic.
Case Study: Cloudflare's Analytics Pipeline with ClickHouse โ
Cloudflare operates one of the world's largest CDN networks, handling 50+ million HTTP requests per second across 300+ data centers. Every HTTP request generates a log event.
The Challenge: Cloudflare's analytics product lets customers query their traffic data in real-time โ "show me 5xx errors by data center, last 15 minutes" โ across petabytes of data. Traditional OLAP databases (Redshift, BigQuery) could not deliver sub-second query latency at this scale without cost-prohibitive caching.
Why ClickHouse: Cloudflare evaluated ClickHouse in 2018 and found it delivered 100โ1000ร lower query latency than their existing Hadoop + Presto pipeline for typical analytics queries. The columnar MergeTree engine's ability to skip irrelevant columns and time-range partitions meant that most customer queries touched < 1% of stored data.
Architecture (2023):
Key Technical Decisions:
- Kafka engine in ClickHouse: ClickHouse reads directly from Kafka topics via a materialized view pipeline โ no ETL process needed
- ReplicatedMergeTree: each shard is replicated 2ร using ClickHouse's built-in replication (via ZooKeeper/ClickHouse Keeper)
- Projection feature: pre-sorts a subset of columns in a different order, enabling fast aggregations on secondary sort keys without full scans
PARTITION BY toYYYYMMDD(timestamp): monthly or daily partitioning allows instantDROP PARTITIONfor data expiration instead of slow DELETE operations
Performance Numbers (2023):
- 50M+ HTTP events/second ingested continuously
- Typical customer analytics query: 50โ500ms (scanning billions of rows)
- Dashboard page load: < 100ms (uses pre-computed materialized views)
- Storage: ~50 bytes/event after ClickHouse LZ4 compression (vs ~500 bytes raw JSON in S3)
Lesson: ClickHouse's power comes not just from columnar storage but from the combination of columnar + partitioning + materialized views + efficient compression. A ClickHouse table without proper partitioning and materialized views for common query patterns will still be slow.
Related Chapters โ
| Chapter | Relevance |
|---|---|
| Ch01 โ Database Landscape | Storage engine fundamentals, LSM vs B-tree |
| Ch05 โ PostgreSQL in Production | pgvector extension, pg_stat_statements |
| Ch07 โ NoSQL at Scale | Redis as time-series cache layer |
| System Design Ch09 | SQL databases in system design |
| System Design Ch10 | NoSQL and specialized DBs in system design |
| Ch05 โ PostgreSQL in Production | pgvector HNSW tuning for AI workloads |
Common Mistakes โ
| Mistake | Why It Happens | Impact | Fix |
|---|---|---|---|
| Using Elasticsearch as a primary datastore | "It stores documents and is queryable" | No transactions, limited write throughput, complex shard management for OLTP | Use Elasticsearch as a search projection only; maintain the source of truth in PostgreSQL or MongoDB |
| Using ClickHouse for OLTP workloads | "It's a database, right?" | MergeTree is designed for batch inserts; frequent single-row updates cause merge backlog | ClickHouse is analytics-only; route OLTP writes to PostgreSQL/MySQL, CDC to ClickHouse |
| pgvector without HNSW index (defaults to brute force) | Not reading the setup docs | Every vector search is O(N); query time grows linearly with dataset size | Always CREATE INDEX ... USING hnsw after inserting embeddings; IVFFlat is the alternative |
| Storing location data without TTL in a geospatial DB | "We might need history" | Driver/user location tables grow unbounded; query performance degrades | Separate current-location (with TTL) from historical-location (time-partitioned) storage |
Practice Questions โ
Beginner โ
ClickHouse vs PostgreSQL for Analytics: Your application stores order data in PostgreSQL (50M rows). Business intelligence queries like "total revenue by product category per month for the last 2 years" take 45 seconds. A colleague suggests moving the orders table to ClickHouse. What would you expect the query time to be in ClickHouse, and what are the trade-offs of maintaining a separate ClickHouse instance?
Model Answer
ClickHouse would likely reduce this query to 1โ5 seconds (10โ45ร speedup) due to columnar storage (only reading category, revenue, date columns), vectorized execution, and efficient compression. Trade-offs: (1) operational overhead of a second database system; (2) data synchronization (need to stream changes from PostgreSQL to ClickHouse via CDC/Debezium or Kafka); (3) analytics queries run on slightly stale data (minutes lag). For a team without ClickHouse expertise, consider TimescaleDB (PostgreSQL extension) as a lower-overhead alternative.Vector Database Selection: A startup building a customer support chatbot wants to add semantic search over 50,000 support articles. They currently use PostgreSQL. Should they add pgvector or deploy a dedicated vector database like Pinecone?
Model Answer
pgvector is the correct choice at this scale. 50,000 vectors at 1536 dimensions = ~300MB of embedding data โ trivially fits in PostgreSQL shared_buffers. pgvector's HNSW index will return results in < 10ms with > 95% recall. Benefits: no additional system to operate, vectors stored alongside metadata in same database, ACID transactions, full SQL filtering. Pinecone is warranted at 10M+ embeddings with strict latency SLAs (< 10ms p99) or when you need zero operational overhead.
Intermediate โ
Cassandra Tombstone Investigation: A Cassandra cluster's read latency has increased from 5ms to 500ms over 3 months on the
user_activitytable. The table tracks user actions with frequent deletes (expired sessions, privacy deletions). Diagnose and fix the problem.Model Answer
Tombstone accumulation is the most likely cause. Diagnosis: `nodetool tablehistograms keyspace.user_activity` โ look for high tombstone counts (> 100 per read is problematic). Fix: (1) Switch to TTL-based expiration instead of explicit DELETEs โ this removes tombstones at compaction time after `gc_grace_seconds`. (2) Use TWCS (TimeWindowCompactionStrategy) which naturally expires old time windows. (3) Reduce `gc_grace_seconds` to match your compaction schedule (risky: must ensure all replicas have received the tombstone before it's GC'd). (4) Run a full compaction immediately to clear existing tombstones: `nodetool compact keyspace user_activity`.Search Engine Selection: You are building a SaaS product catalog search for a B2B marketplace with 2M products. Requirements: typo tolerance, faceted filtering (category, price, brand), sub-100ms p99, and the ability to tune relevance per customer. Compare Elasticsearch vs Typesense vs Meilisearch for this use case.
Model Answer
At 2M products: all three are viable. Meilisearch: easiest setup, excellent typo tolerance, sub-50ms out of the box. Limitation: less flexible relevance tuning, per-customer configuration requires multi-tenancy design. Typesense: similar simplicity, Raft replication for HA, good multi-tenancy. Elasticsearch: most flexible relevance tuning (per-customer boosting, scripted scoring), best for complex B2B requirements. More operational complexity. Recommendation: if per-customer relevance tuning is critical, Elasticsearch. If simplicity and instant search are the priority and relevance is uniform, Meilisearch or Typesense.
Advanced โ
Hybrid Search Architecture: Your e-commerce platform needs to implement search that combines keyword search (exact product codes, brand names) with semantic search (natural language queries like "affordable laptop for students"). Design a hybrid search architecture using both an inverted index and vector search, handling re-ranking and explaining latency trade-offs at 10M products.
Model Answer
Hybrid search (reciprocal rank fusion): (1) Run BM25 keyword search (Elasticsearch or pgvector full-text) โ get top 100 results with BM25 scores. (2) Run ANN vector search (pgvector HNSW or Weaviate) โ get top 100 results with cosine similarity scores. (3) Merge via Reciprocal Rank Fusion: score = 1/(k + rank_bm25) + 1/(k + rank_ann), where k=60 is standard. (4) Return top 20 merged results. Latency budget at 10M products: BM25 query = 20ms, ANN query = 30ms (run in parallel) โ merge = 2ms โ total = ~52ms. For sub-50ms p99, pre-filter by in-stock status before ANN (pgvector WHERE clause filters before HNSW traversal if selectivity is high). Weaviate natively supports hybrid search with RRF built-in.
References โ
- ClickHouse Documentation โ MergeTree Engine
- TimescaleDB Documentation โ Hypertables
- Elasticsearch: The Definitive Guide
- pgvector GitHub โ HNSW Index
- Qdrant Documentation โ HNSW Parameters
- Neo4j โ Index-Free Adjacency
- Cloudflare Blog โ ClickHouse at Cloudflare
- Pinecone โ Vector Database Fundamentals
- ANN Benchmarks โ Comparison of ANN Algorithms
- "Designing Data-Intensive Applications" โ Kleppmann, Ch. 3 (Storage & Retrieval)
- InfluxDB IOx Architecture
- DuckDB Documentation โ In-process OLAP, v1.5 release notes
- Apache Iceberg Documentation โ Open table format specification
- Apache Polaris โ REST Catalog โ Vendor-neutral Iceberg catalog (graduated Feb 2026)
- Turso / libSQL Documentation โ Edge SQLite with embedded replicas
- Neon โ Serverless PostgreSQL โ Branch-per-PR PostgreSQL (Databricks, May 2025)
- ANN Benchmarks โ Vector search algorithm comparison
- pgvector HNSW Documentation โ pgvector 0.8+ parameters

Comments powered by Giscus. Enable GitHub Discussions on the repo to activate.