Skip to contentSkip to content
0/47 chapters completed (0%)

Chapter 8: Specialized Databases โ€‹

"A specialized database does one thing 10ร— faster than a general-purpose database. The engineering question is always: does your workload actually need 10ร—?"

Mind Map โ€‹

Prerequisites

This chapter assumes familiarity with storage engine fundamentals (Ch01) and PostgreSQL internals (Ch05). Review those first if needed.


Time-Series Databases โ€‹

Time-series workloads share three characteristics: data arrives with a timestamp, it is almost always appended (rarely updated), and queries are predominantly range-based (e.g., "last 24 hours"). These properties call for different storage optimizations than OLTP databases.

ClickHouse: Columnar OLAP at CDN Scale โ€‹

ClickHouse stores data column-by-column rather than row-by-row. A query touching only 3 columns of a 100-column table reads 3% of the data a row-store would read. Combined with vectorized execution (SIMD CPU instructions processing 256 bits at once), ClickHouse achieves extraordinary scan throughput.

MergeTree Engine โ€‹

ClickHouse's primary storage engine is MergeTree โ€” a family of LSM-like engines that append data in sorted parts and merge them in the background.

sql
-- ClickHouse: web analytics events table
CREATE TABLE page_views (
  timestamp    DateTime,
  session_id   UUID,
  user_id      UInt64,
  page_url     String,
  referrer     String,
  country      LowCardinality(String),  -- dictionary encoding for low-cardinality
  duration_ms  UInt32,
  is_bounce    UInt8
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)   -- one partition per month
ORDER BY (timestamp, session_id)   -- primary sort key (also the sparse index)
TTL timestamp + INTERVAL 2 YEAR;  -- auto-delete rows older than 2 years

-- Insert at high throughput: use batch inserts (100K+ rows per INSERT)
-- ClickHouse amortizes merge overhead across large batches

-- Materialized view: pre-aggregate by hour for dashboard queries
CREATE MATERIALIZED VIEW hourly_stats
ENGINE = SummingMergeTree()
ORDER BY (hour, country, page_url)
AS SELECT
  toStartOfHour(timestamp) AS hour,
  country,
  page_url,
  count() AS views,
  sum(duration_ms) AS total_duration,
  countIf(is_bounce = 1) AS bounces
FROM page_views
GROUP BY hour, country, page_url;

ClickHouse Benchmarks โ€‹

MetricValueNotes
Ingestion rate4M rows/secondSingle node, batch inserts, uncompressed
Compression ratio5โ€“15ร—LZ4 default; ZSTD for higher ratio
Scan throughput2โ€“100 GB/sDepends on query selectivity and column count
Query latency (1B rows, aggregation)100msโ€“2sSingle node, indexed sort key
Storage efficiency10โ€“50 bytes/rowAfter compression (vs 100โ€“500 bytes in PostgreSQL)

TimescaleDB: PostgreSQL for Time-Series โ€‹

TimescaleDB is a PostgreSQL extension that adds time-series capabilities: automatic partitioning (hypertables), native compression, and continuous aggregates.

sql
-- Create a hypertable: PostgreSQL table + automatic time-based partitioning
SELECT create_hypertable('sensor_readings', 'time',
  chunk_time_interval => INTERVAL '1 week'  -- one partition per week
);

-- Native compression: 10-30ร— compression ratio
ALTER TABLE sensor_readings SET (
  timescaledb.compress,
  timescaledb.compress_orderby = 'time DESC',
  timescaledb.compress_segmentby = 'sensor_id'
);

-- Compress chunks older than 7 days
SELECT add_compression_policy('sensor_readings', INTERVAL '7 days');

-- Continuous aggregate: materializes query results, refreshes automatically
CREATE MATERIALIZED VIEW sensor_hourly
WITH (timescaledb.continuous) AS
SELECT
  sensor_id,
  time_bucket('1 hour', time) AS bucket,
  AVG(value) AS avg_val,
  MAX(value) AS max_val,
  MIN(value) AS min_val
FROM sensor_readings
GROUP BY sensor_id, bucket;

SELECT add_continuous_aggregate_policy('sensor_hourly',
  start_offset => INTERVAL '2 days',
  end_offset   => INTERVAL '1 hour',
  schedule_interval => INTERVAL '30 minutes'
);

-- Query the continuous aggregate (fast: hits materialized data)
SELECT bucket, avg_val
FROM sensor_hourly
WHERE sensor_id = 'sensor-42'
  AND bucket >= NOW() - INTERVAL '7 days'
ORDER BY bucket DESC;

TimescaleDB Benchmarks:

MetricValueNotes
Ingestion rate1.3M rows/secondSingle node, via COPY or multi-row INSERT
Compression ratio10โ€“30ร—Native columnar compression on chunks
Query speedup vs vanilla PostgreSQL10โ€“100ร—For time-range aggregations on compressed chunks
Continuous aggregate query1โ€“50msPre-materialized; near-instant

InfluxDB: Purpose-Built TSDB โ€‹

InfluxDB 3.0 (IOx engine) is a rewrite in Rust using Apache Arrow and Parquet for storage, enabling SQL queries via Apache DataFusion. The older InfluxDB 1.x and 2.x use the TSM (Time-Structured Merge tree) storage engine.

sql
-- InfluxDB 3.0: SQL interface over IOx
SELECT
  date_trunc('hour', time) AS hour,
  mean(temperature) AS avg_temp,
  max(temperature) AS max_temp
FROM weather_readings
WHERE location = 'NYC'
  AND time >= now() - INTERVAL '24 hours'
GROUP BY hour
ORDER BY hour DESC;

Ingestion Benchmarks Comparison:

DatabaseIngestion Rate (single node)CompressionBest For
ClickHouse4M rows/second10โ€“50ร—OLAP analytics, CDN-scale events
TimescaleDB1.3M rows/second10โ€“30ร—PostgreSQL teams, mixed OLTP+TS
InfluxDB 3.01.2M rows/second5โ€“20ร—IoT, metrics, pure time-series

When PostgreSQL Is Enough โ€‹


Search Engines โ€‹

Elasticsearch: Inverted Index at Scale โ€‹

Elasticsearch stores documents in an inverted index โ€” a map from terms to the list of documents containing that term. This enables full-text search across billions of documents in milliseconds.

BM25 Ranking considers term frequency in the document, inverse document frequency (rarer terms score higher), and document length normalization. It outperforms TF-IDF for most text search tasks.

json
// Elasticsearch: product search with filters and aggregations
{
  "query": {
    "bool": {
      "must": [
        {
          "multi_match": {
            "query": "wireless headphones",
            "fields": ["title^3", "description", "tags"],
            "type": "best_fields",
            "fuzziness": "AUTO"
          }
        }
      ],
      "filter": [
        { "term": { "in_stock": true } },
        { "range": { "price": { "gte": 20, "lte": 300 } } }
      ]
    }
  },
  "aggs": {
    "brands": {
      "terms": { "field": "brand.keyword", "size": 10 }
    },
    "price_ranges": {
      "range": {
        "field": "price",
        "ranges": [
          { "to": 50 }, { "from": 50, "to": 150 }, { "from": 150 }
        ]
      }
    }
  },
  "from": 0, "size": 20,
  "sort": [{ "_score": "desc" }, { "sales_rank": "asc" }]
}

Meilisearch maintains its entire index in RAM (memory-mapped files with mmap). This enables <50ms search latency at the cost of memory requirements proportional to index size.

javascript
// Meilisearch: instant search with typo tolerance
const client = new MeiliSearch({ host: 'http://localhost:7700' });
const index = client.index('products');

// Configure relevance and filtering
await index.updateSettings({
  searchableAttributes: ['title', 'description', 'tags'],
  filterableAttributes: ['category', 'price', 'in_stock'],
  sortableAttributes: ['price', 'sales_rank'],
  typoTolerance: {
    enabled: true,
    minWordSizeForTypos: { oneTypo: 4, twoTypos: 8 }
  }
});

// Search: returns results in < 50ms for < 10M documents
const results = await index.search('wireles headphone', {
  filter: 'in_stock = true AND price < 200',
  sort: ['sales_rank:asc'],
  facets: ['category', 'brand'],
  limit: 20
});

Typesense uses Raft consensus for a distributed, fault-tolerant cluster of in-memory search nodes. Features native vector search (hybrid BM25 + ANN), multi-tenant isolation, and a strict schema.

Search Latency Benchmarks โ€‹

EngineDatasetp50 Latencyp99 LatencyNotes
Elasticsearch10K docs5ms15msIndex warm, single shard
Elasticsearch1M docs10ms50msSingle shard
Elasticsearch1B docs100ms500msMulti-shard cluster
Meilisearch10K docs1ms5msAll in RAM
Meilisearch1M docs5ms20msRAM-mapped
Meilisearch1B docsNot recommendedโ€”Memory requirement too high
Typesense10K docs1ms5msIn-memory
Typesense1M docs3ms15msIn-memory
PostgreSQL (GIN/tsvector)1M docs10ms100msGood for simple FTS

When PostgreSQL Full-Text Search Is Enough โ€‹

sql
-- PostgreSQL full-text search with GIN index
-- Handles up to ~5M documents well; beyond that, Elasticsearch wins

-- Add tsvector column for fast FTS
ALTER TABLE products ADD COLUMN search_vector tsvector;
UPDATE products SET search_vector =
  to_tsvector('english',
    coalesce(title, '') || ' ' ||
    coalesce(description, '') || ' ' ||
    coalesce(tags, '')
  );
CREATE INDEX idx_products_fts ON products USING GIN(search_vector);

-- Trigger to keep search_vector current
CREATE OR REPLACE FUNCTION update_search_vector() RETURNS trigger AS $$
BEGIN
  NEW.search_vector := to_tsvector('english',
    coalesce(NEW.title, '') || ' ' ||
    coalesce(NEW.description, '') || ' ' ||
    coalesce(NEW.tags, ''));
  RETURN NEW;
END $$ LANGUAGE plpgsql;

CREATE TRIGGER trg_search_vector BEFORE INSERT OR UPDATE ON products
FOR EACH ROW EXECUTE FUNCTION update_search_vector();

-- Search query with ranking
SELECT id, title,
       ts_rank(search_vector, query) AS rank
FROM products,
     to_tsquery('english', 'wireless & headphone') query
WHERE search_vector @@ query
  AND in_stock = true
ORDER BY rank DESC
LIMIT 20;

Graph Databases โ€‹

Neo4j: Property Graph Model โ€‹

Neo4j stores data as nodes (entities) and relationships (typed, directed edges), both with properties. The key innovation: index-free adjacency โ€” each node stores direct pointers to its adjacent nodes, eliminating the need for B-tree index lookups during traversal.

cypher
-- Cypher: find Alice's second-degree connections who liked a post
MATCH (alice:User {name: 'Alice'})-[:FOLLOWS]->(friend:User)-[:FOLLOWS]->(fof:User)
WHERE NOT (alice)-[:FOLLOWS]->(fof)
  AND alice <> fof
RETURN fof.name, COUNT(friend) AS mutual_friends
ORDER BY mutual_friends DESC
LIMIT 10;

-- Fraud detection: find accounts sharing 3+ attributes (IP, device, email pattern)
MATCH (a:Account)-[:USES]->(attr)<-[:USES]-(b:Account)
WHERE a.id <> b.id
WITH a, b, COUNT(attr) AS shared_attrs
WHERE shared_attrs >= 3
RETURN a.id, b.id, shared_attrs
ORDER BY shared_attrs DESC;

When Graph Beats Relational Joins โ€‹

The performance advantage of graph traversal over SQL JOINs grows with traversal depth. For deep relationship queries (3+ hops), graph databases are dramatically faster because they follow direct pointers rather than performing index lookups and hash joins.

Traversal DepthSQL (self-join)Neo4jWinner
1 hop (direct friends)2ms2msTie
2 hops (friends of friends)20ms5msNeo4j (4ร—)
3 hops300ms8msNeo4j (37ร—)
4 hops30s+15msNeo4j (2000ร—)
5+ hopsMinutes/timeout50msNeo4j

Benchmark data: LinkedIn recommendation engine, 10M user graph, Neotechnology (2013).

Use Cases for Graph Databases โ€‹

Use CaseWhy Graph WinsExample Query
Social network recommendationsMulti-hop traversal (FOAF, common interests)"People you may know"
Fraud detectionFind connected rings of suspicious accountsShared device/IP/card chains
Knowledge graphArbitrary relationship types, semantic queries"What drugs interact with X?"
Supply chain trackingBill of materials, dependency graphs"What components use this chip?"
Access control (RBAC/ABAC)Hierarchical permissions, role inheritance"Does user have access via role chain?"

Vector Databases (2026) โ€‹

As of 2026, vector search is standard infrastructure for AI-augmented products. Every major RDBMS (PostgreSQL via pgvector, MongoDB Atlas, MySQL 9 Innovation) has native vector support. The question is no longer whether to add vector search, but which tier of implementation matches your scale.

Index Algorithms: HNSW vs IVFFlat โ€‹

HNSW (Hierarchical Navigable Small World) โ€” the production standard:

  • Multi-layer graph: top layers are sparse long-range links, bottom layers are dense short-range links
  • Query: start at top layer, greedily navigate toward query point, descend layers
  • No training step โ€” add vectors incrementally
  • Excellent recall (>95% at typical settings), fast queries
  • Memory: ~1.5โ€“2ร— raw embedding size
  • Tune with m (graph connectivity, default 16) and ef_construction (build quality, default 64)

IVFFlat (Inverted File Flat) โ€” lighter-weight alternative:

  • Training phase: cluster embeddings into N centroids via k-means
  • Query: find M nearest centroids, then search exhaustively within them
  • Lower memory than HNSW, but requires training data upfront
  • Recall degrades if data distribution shifts after training

For new deployments: default to HNSW. Use IVFFlat only when memory is severely constrained.

pgvector 0.8+ is production-grade. All major cloud PostgreSQL providers support CREATE EXTENSION vector natively. Used in production by OpenAI, Supabase, and Neon.

sql
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE product_embeddings (
  product_id BIGINT PRIMARY KEY,
  category   TEXT,
  embedding  vector(1536)   -- OpenAI text-embedding-3-small dimensions
);

-- HNSW index: m=16 (connectivity), ef_construction=64 (build quality)
CREATE INDEX ON product_embeddings
  USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);

-- Set query-time search width (higher = better recall, slower)
SET hnsw.ef_search = 100;  -- default 40; raise for higher recall needs

-- Semantic search with pre-filter (applied before HNSW when selective)
SELECT
  product_id,
  category,
  1 - (embedding <=> '[0.1, 0.2, ...]'::vector) AS cosine_similarity
FROM product_embeddings
WHERE category = 'electronics'
ORDER BY embedding <=> '[0.1, 0.2, ...]'::vector
LIMIT 10;

Dedicated Vector Databases (2026 State) โ€‹

SystemLanguageIndexFilteringHybrid SearchBest For
pgvector 0.8+โ€” (PostgreSQL ext.)HNSW, IVFFlatFull SQL WHEREtsvector + ANNPostgreSQL teams, < 10M vectors
QdrantRustHNSWPayload filters (rich JSON)Sparse + denseComplex filtered ANN, low latency
WeaviateGoHNSWGraphQLNative BM25 + ANN (RRF)Hybrid semantic + keyword
MilvusGo/C++HNSW + IVF + DiskANNScalar + JSONYesDistributed, 1B+ vectors
PineconeManaged SaaSProprietaryMetadata filtersSparse + denseProduction SaaS, zero ops overhead
ChromaPythonHNSWMetadataBasicLocal dev, prototyping

Qdrant performance note (2026): Written in Rust with a dedicated HNSW implementation, Qdrant benchmarks show 10โ€“25% lower p99 latency than equivalent pgvector HNSW at 1Mโ€“10M vectors on the same hardware. However, the operational overhead of a second system often outweighs the latency difference for teams already on PostgreSQL.

Scale-Based Selection โ€‹

Hybrid Search: BM25 + ANN โ€‹

Pure vector search misses exact keyword matches ("iPhone 15 Pro Max SKU"). Hybrid search combines keyword scoring (BM25) with vector similarity and merges via Reciprocal Rank Fusion (RRF):

python
def hybrid_search(query: str, top_k: int = 20) -> list:
    # Run both in parallel
    bm25_results = keyword_search(query, top=100)   # returns [(id, bm25_score)]
    ann_results = vector_search(embed(query), top=100)  # returns [(id, distance)]

    # Reciprocal Rank Fusion: score = 1/(k + rank), k=60 standard
    scores = {}
    for rank, (doc_id, _) in enumerate(bm25_results):
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (60 + rank + 1)
    for rank, (doc_id, _) in enumerate(ann_results):
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (60 + rank + 1)

    return sorted(scores, key=scores.get, reverse=True)[:top_k]

Weaviate has RRF built-in. For pgvector: combine tsvector GIN search with HNSW in two parallel queries and merge in application code.

RAG Architecture with pgvector โ€‹

python
# Retrieval-Augmented Generation with pgvector
import openai
import psycopg2

def embed(text: str) -> list[float]:
    response = openai.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding

def retrieve(query: str, top_k: int = 5) -> list[dict]:
    query_embedding = embed(query)
    conn = psycopg2.connect(DATABASE_URL)
    with conn.cursor() as cur:
        cur.execute("""
            SELECT document_id, chunk_text,
                   1 - (embedding <=> %s::vector) AS similarity
            FROM document_chunks
            ORDER BY embedding <=> %s::vector
            LIMIT %s
        """, (query_embedding, query_embedding, top_k))
        return [{"id": r[0], "content": r[1], "similarity": r[2]}
                for r in cur.fetchall()]

def answer(question: str) -> str:
    context_docs = retrieve(question, top_k=5)
    context = "\n\n".join(d["content"] for d in context_docs)
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Context:\n{context}"},
            {"role": "user", "content": question}
        ]
    )
    return response.choices[0].message.content

DuckDB: In-Process OLAP (2026) โ€‹

DuckDB 1.x (current: v1.5.3, May 2026; v2.0 planned September 2026) is an in-process analytical SQL engine โ€” no server, no configuration, no dependencies. It embeds inside your Python/Go/Node application like SQLite, but executes columnar OLAP queries with vectorized SIMD execution.

When DuckDB Replaces a Spark Cluster โ€‹

For analytical jobs over local files or cloud object storage, DuckDB eliminates the need for a distributed cluster in most cases:

WorkloadPrevious ApproachDuckDB Approach
Aggregate 10GB CSV/ParquetSpark job (5โ€“15 min setup)SELECT * FROM read_parquet('s3://...') โ€” seconds
Ad-hoc analytics on data lakeAthena/Presto (cost/latency)DuckDB + Iceberg extension โ€” free, local
ETL between formatsPandas + custom codeDuckDB SQL + COPY TO 'output.parquet'
Analytical queries in testSQLite (row-store, slow)DuckDB (column-store, fast)
sql
-- DuckDB: analyze Parquet files directly from S3 without ETL
-- Install AWS extension: INSTALL httpfs; LOAD httpfs;
SELECT
  date_trunc('month', event_ts) AS month,
  country,
  COUNT(*) AS events,
  SUM(revenue) AS total_revenue
FROM read_parquet('s3://data-lake/events/2026-*/*.parquet')
WHERE event_type = 'purchase'
GROUP BY month, country
ORDER BY month DESC, total_revenue DESC;

-- DuckDB: read Apache Iceberg table (built-in since v1.0)
-- INSTALL iceberg; LOAD iceberg;
SELECT * FROM iceberg_scan('s3://catalog/my_table')
WHERE event_date >= '2026-01-01';

DuckDB Key Features (v1.5, 2026) โ€‹

FeatureDetails
VARIANT typeJSON-like flexible column for semi-structured data (added v1.5)
GEOMETRY typeGeospatial queries without extensions
Iceberg supportNative Apache Iceberg reader (no Spark needed)
Quack protocolClient-server mode for remote connections (v1.5)
Arrow integrationZero-copy data exchange with pandas/polars
Persistent modeSingle .duckdb file โ€” embeds into any app

DuckDB for OLTP teams

If your team runs PostgreSQL for OLTP, DuckDB is the lowest-friction analytical layer: read Parquet snapshots or connect to S3 directly. No cluster management. Run locally or in a Lambda/Cloud Run function. Not a replacement for ClickHouse at petabyte scale, but replaces 80% of Spark jobs.


Apache Iceberg: The Open Table Format Standard (2026) โ€‹

Apache Iceberg has won the open table-format competition. As of 2026, it has broader engine support than Delta Lake or Apache Hudi across the data ecosystem.

Engine Support (2026) โ€‹

EngineIceberg Support
Apache SparkNative (96% adoption in 2025 survey)
Apache FlinkNative (32% adoption)
Trino / PrestoNative (61% adoption)
SnowflakeNative (Iceberg tables GA)
BigQueryNative (managed Iceberg)
AWS AthenaNative
DuckDBExtension (v1.0+)
Delta LakeConversion supported, but separate ecosystem

Why Iceberg Won โ€‹

Iceberg solved problems that Hive-partitioned Parquet tables could not:

  • ACID at file level: snapshot isolation, atomic partition swaps, rollback
  • Schema evolution without rewrites: add/rename/drop columns without rewriting files
  • Hidden partitioning: partition logic in catalog, not query WHERE clauses
  • Time travel: query data as it existed at any past snapshot
  • Partition evolution: change partition scheme without rewriting data

Apache Polaris: Vendor-Neutral Catalog โ€‹

Apache Polaris graduated in February 2026 as a vendor-neutral REST catalog for Iceberg, with fine-grained access control (row/column level). It is now the standard catalog interface across cloud providers.

sql
-- DuckDB: query Iceberg table with time travel
SELECT COUNT(*) FROM iceberg_scan(
  's3://warehouse/sales',
  version => 5  -- snapshot ID for time travel
);

-- Spark: Iceberg with schema evolution (no rewrite)
ALTER TABLE catalog.db.events
ADD COLUMN user_agent STRING;  -- instant, no file rewrite

-- Query the new column (NULL for rows written before the schema change)
SELECT user_agent, COUNT(*) FROM catalog.db.events
WHERE event_date = '2026-01-01'
GROUP BY user_agent;

Iceberg vs Delta Lake (2026)

Iceberg is the neutral choice โ€” no single vendor controls it. Delta Lake is Databricks-originated and works best within the Databricks ecosystem (Unity Catalog). Snowflake, BigQuery, AWS, and most independent tools have standardized on Iceberg. For new lakehouses not tied to Databricks: default to Iceberg.


Edge & Serverless Databases (2026) โ€‹

A new category emerged between 2023โ€“2026: databases designed for the compute edge, where latency to a centralized data center is unacceptable and cold-start time matters.

The Embedded Replica Pattern โ€‹

Turso/libSQL pioneered the embedded replica pattern for SQLite: the database file runs inside the application process (or VM), providing microsecond read latency. Writes go to a remote primary and sync to the local replica asynchronously.

Application Process
โ”œโ”€โ”€ SQLite embedded replica (reads: microseconds)
โ””โ”€โ”€ Remote sync โ†’ Turso primary (writes: ~10ms)

Benefits:
- Read latency: from ~1ms (network) to ~50ยตs (local disk/RAM)
- Isolation: each tenant gets their own SQLite file (SaaS multi-tenancy)
- No connection pool: process-local SQLite needs no pool

Edge Database Options (2026) โ€‹

SystemEngineModelLatencyBest For
TursolibSQL (SQLite fork)Embedded replica + remote primaryยตs reads, ~10ms writesEdge SaaS, multi-tenant isolation
Cloudflare D1SQLitePer-Worker DB (regional)1โ€“5msCloudflare Workers apps
NeonPostgreSQLServerless (branch per env)5โ€“50ms cold, <5ms warmPR preview envs, dev/test parity
PlanetScaleMySQL (Vitess)Serverless5โ€“20msMySQL-native, enterprise ($39/mo min)

Neon and Databricks

Neon was acquired by Databricks in May 2025. Post-acquisition, storage cost fell 80% ($1.75 โ†’ $0.35/GB-month) and compute fell 15โ€“25%. The free plan doubled to 100 compute unit-hours/month. Neon's "instant branch" feature (branch a database in milliseconds for each PR) is widely used in CI/CD pipelines as of 2026.

When to Use Edge Databases โ€‹

  • Use Turso/libSQL: per-tenant SQLite isolation in a SaaS app, Cloudflare Workers stack, or edge function needing sub-millisecond reads
  • Use Neon: PostgreSQL with branch-per-PR workflows, serverless compute that scales to zero
  • Use Cloudflare D1: exclusively within the Cloudflare Workers ecosystem
  • Do not use edge DBs for: large analytical workloads, joins across tenants, >100GB datasets per tenant

Decision Matrix โ€‹

Use this table to choose a specialized database for your workload:

WorkloadVolumeRecommended DBReasoning
Application metrics, dashboards< 10M rows/dayPostgreSQL + TimescaleDBSame cluster, SQL, continuous aggregates
CDN/web analytics> 100M events/dayClickHouseColumnar, 4M rows/sec ingestion, MergeTree
IoT sensor data, monitoring1Mโ€“100M rows/dayTimescaleDB or InfluxDBHypertables, compression, native time functions
Full-text search, autocomplete< 5M docsPostgreSQL tsvector + GINNo additional system
Full-text search, relevance scoring5Mโ€“100M docsElasticsearch or TypesenseInverted index, BM25, faceting
Instant search, typo tolerance< 50M docsMeilisearchSub-50ms, zero-config
Social graph, fraud ringsAnyNeo4jIndex-free traversal, multi-hop performance
Recommendation engine (graph)< 100M nodesNeo4j CommunityOpen source
Semantic search, RAG< 10M embeddingspgvector HNSWIntegrated with existing PostgreSQL
Semantic search, RAG10Mโ€“100M embeddingsQdrant or WeaviatePurpose-built ANN, payload filtering
Semantic search + keyword hybridAnyWeaviate or TypesenseNative BM25 + ANN, RRF built-in
Semantic search, production SaaS> 100M embeddingsPinecone ServerlessFully managed, petabyte scale
Ad-hoc analytics on data lakeAnyDuckDBIn-process, Iceberg/Parquet native, no cluster
Data lake table formatPetabyte+Apache IcebergBroadest engine support (Spark/Flink/Trino/Snowflake)
Edge / per-tenant isolation< 10GB per tenantTurso (libSQL)Embedded replica, microsecond reads
Serverless PostgreSQLVariableNeonBranch per PR, scale to zero, 80% cheaper storage (2025)

Monitoring Specialized Databases โ€‹

Specialized databases have unique failure modes. Monitor these engine-specific metrics.

EngineKey MetricsToolCritical Alert
ClickHouseMergeTree parts count, Query duration p99, Delayed inserts, Replication queuesystem.metrics, system.query_logParts > 300 per partition (merge backlog)
ElasticsearchCluster health (green/yellow/red), JVM heap %, Search latency p99, Unassigned shards_cluster/health, _cat/nodesCluster status != green or JVM heap > 85%
Neo4jPage cache hit ratio, Transaction count, Bolt connections, Store sizeNeo4j Metrics API, Prometheus exporterPage cache hit ratio < 95%
pgvectorIndex build time, Recall rate, Probes per querypg_stat_user_indexes, EXPLAIN ANALYZERecall < 95% at target latency (increase ef_search)
sql
-- ClickHouse: check merge health
SELECT table, count() AS parts, sum(rows) AS total_rows
FROM system.parts
WHERE active AND database = 'default'
GROUP BY table
ORDER BY parts DESC;

-- Elasticsearch: cluster health (via curl)
-- curl -s localhost:9200/_cluster/health?pretty

Specialized DB Failure Modes

ClickHouse: too many parts = merge can't keep up with inserts โ†’ insert delays. Elasticsearch: JVM GC pauses cause cluster instability โ†’ size heap to 50% of RAM, max 31GB. Neo4j: cold page cache after restart โ†’ warm with traversal queries before accepting traffic.


Case Study: Cloudflare's Analytics Pipeline with ClickHouse โ€‹

Cloudflare operates one of the world's largest CDN networks, handling 50+ million HTTP requests per second across 300+ data centers. Every HTTP request generates a log event.

The Challenge: Cloudflare's analytics product lets customers query their traffic data in real-time โ€” "show me 5xx errors by data center, last 15 minutes" โ€” across petabytes of data. Traditional OLAP databases (Redshift, BigQuery) could not deliver sub-second query latency at this scale without cost-prohibitive caching.

Why ClickHouse: Cloudflare evaluated ClickHouse in 2018 and found it delivered 100โ€“1000ร— lower query latency than their existing Hadoop + Presto pipeline for typical analytics queries. The columnar MergeTree engine's ability to skip irrelevant columns and time-range partitions meant that most customer queries touched < 1% of stored data.

Architecture (2023):

Key Technical Decisions:

  • Kafka engine in ClickHouse: ClickHouse reads directly from Kafka topics via a materialized view pipeline โ€” no ETL process needed
  • ReplicatedMergeTree: each shard is replicated 2ร— using ClickHouse's built-in replication (via ZooKeeper/ClickHouse Keeper)
  • Projection feature: pre-sorts a subset of columns in a different order, enabling fast aggregations on secondary sort keys without full scans
  • PARTITION BY toYYYYMMDD(timestamp): monthly or daily partitioning allows instant DROP PARTITION for data expiration instead of slow DELETE operations

Performance Numbers (2023):

  • 50M+ HTTP events/second ingested continuously
  • Typical customer analytics query: 50โ€“500ms (scanning billions of rows)
  • Dashboard page load: < 100ms (uses pre-computed materialized views)
  • Storage: ~50 bytes/event after ClickHouse LZ4 compression (vs ~500 bytes raw JSON in S3)

Lesson: ClickHouse's power comes not just from columnar storage but from the combination of columnar + partitioning + materialized views + efficient compression. A ClickHouse table without proper partitioning and materialized views for common query patterns will still be slow.


ChapterRelevance
Ch01 โ€” Database LandscapeStorage engine fundamentals, LSM vs B-tree
Ch05 โ€” PostgreSQL in Productionpgvector extension, pg_stat_statements
Ch07 โ€” NoSQL at ScaleRedis as time-series cache layer
System Design Ch09SQL databases in system design
System Design Ch10NoSQL and specialized DBs in system design
Ch05 โ€” PostgreSQL in Productionpgvector HNSW tuning for AI workloads

Common Mistakes โ€‹

MistakeWhy It HappensImpactFix
Using Elasticsearch as a primary datastore"It stores documents and is queryable"No transactions, limited write throughput, complex shard management for OLTPUse Elasticsearch as a search projection only; maintain the source of truth in PostgreSQL or MongoDB
Using ClickHouse for OLTP workloads"It's a database, right?"MergeTree is designed for batch inserts; frequent single-row updates cause merge backlogClickHouse is analytics-only; route OLTP writes to PostgreSQL/MySQL, CDC to ClickHouse
pgvector without HNSW index (defaults to brute force)Not reading the setup docsEvery vector search is O(N); query time grows linearly with dataset sizeAlways CREATE INDEX ... USING hnsw after inserting embeddings; IVFFlat is the alternative
Storing location data without TTL in a geospatial DB"We might need history"Driver/user location tables grow unbounded; query performance degradesSeparate current-location (with TTL) from historical-location (time-partitioned) storage

Practice Questions โ€‹

Beginner โ€‹

  1. ClickHouse vs PostgreSQL for Analytics: Your application stores order data in PostgreSQL (50M rows). Business intelligence queries like "total revenue by product category per month for the last 2 years" take 45 seconds. A colleague suggests moving the orders table to ClickHouse. What would you expect the query time to be in ClickHouse, and what are the trade-offs of maintaining a separate ClickHouse instance?

    Model Answer ClickHouse would likely reduce this query to 1โ€“5 seconds (10โ€“45ร— speedup) due to columnar storage (only reading category, revenue, date columns), vectorized execution, and efficient compression. Trade-offs: (1) operational overhead of a second database system; (2) data synchronization (need to stream changes from PostgreSQL to ClickHouse via CDC/Debezium or Kafka); (3) analytics queries run on slightly stale data (minutes lag). For a team without ClickHouse expertise, consider TimescaleDB (PostgreSQL extension) as a lower-overhead alternative.
  2. Vector Database Selection: A startup building a customer support chatbot wants to add semantic search over 50,000 support articles. They currently use PostgreSQL. Should they add pgvector or deploy a dedicated vector database like Pinecone?

    Model Answer pgvector is the correct choice at this scale. 50,000 vectors at 1536 dimensions = ~300MB of embedding data โ€” trivially fits in PostgreSQL shared_buffers. pgvector's HNSW index will return results in < 10ms with > 95% recall. Benefits: no additional system to operate, vectors stored alongside metadata in same database, ACID transactions, full SQL filtering. Pinecone is warranted at 10M+ embeddings with strict latency SLAs (< 10ms p99) or when you need zero operational overhead.

Intermediate โ€‹

  1. Cassandra Tombstone Investigation: A Cassandra cluster's read latency has increased from 5ms to 500ms over 3 months on the user_activity table. The table tracks user actions with frequent deletes (expired sessions, privacy deletions). Diagnose and fix the problem.

    Model Answer Tombstone accumulation is the most likely cause. Diagnosis: `nodetool tablehistograms keyspace.user_activity` โ€” look for high tombstone counts (> 100 per read is problematic). Fix: (1) Switch to TTL-based expiration instead of explicit DELETEs โ€” this removes tombstones at compaction time after `gc_grace_seconds`. (2) Use TWCS (TimeWindowCompactionStrategy) which naturally expires old time windows. (3) Reduce `gc_grace_seconds` to match your compaction schedule (risky: must ensure all replicas have received the tombstone before it's GC'd). (4) Run a full compaction immediately to clear existing tombstones: `nodetool compact keyspace user_activity`.
  2. Search Engine Selection: You are building a SaaS product catalog search for a B2B marketplace with 2M products. Requirements: typo tolerance, faceted filtering (category, price, brand), sub-100ms p99, and the ability to tune relevance per customer. Compare Elasticsearch vs Typesense vs Meilisearch for this use case.

    Model Answer At 2M products: all three are viable. Meilisearch: easiest setup, excellent typo tolerance, sub-50ms out of the box. Limitation: less flexible relevance tuning, per-customer configuration requires multi-tenancy design. Typesense: similar simplicity, Raft replication for HA, good multi-tenancy. Elasticsearch: most flexible relevance tuning (per-customer boosting, scripted scoring), best for complex B2B requirements. More operational complexity. Recommendation: if per-customer relevance tuning is critical, Elasticsearch. If simplicity and instant search are the priority and relevance is uniform, Meilisearch or Typesense.

Advanced โ€‹

  1. Hybrid Search Architecture: Your e-commerce platform needs to implement search that combines keyword search (exact product codes, brand names) with semantic search (natural language queries like "affordable laptop for students"). Design a hybrid search architecture using both an inverted index and vector search, handling re-ranking and explaining latency trade-offs at 10M products.

    Model Answer Hybrid search (reciprocal rank fusion): (1) Run BM25 keyword search (Elasticsearch or pgvector full-text) โ†’ get top 100 results with BM25 scores. (2) Run ANN vector search (pgvector HNSW or Weaviate) โ†’ get top 100 results with cosine similarity scores. (3) Merge via Reciprocal Rank Fusion: score = 1/(k + rank_bm25) + 1/(k + rank_ann), where k=60 is standard. (4) Return top 20 merged results. Latency budget at 10M products: BM25 query = 20ms, ANN query = 30ms (run in parallel) โ†’ merge = 2ms โ†’ total = ~52ms. For sub-50ms p99, pre-filter by in-stock status before ANN (pgvector WHERE clause filters before HNSW traversal if selectivity is high). Weaviate natively supports hybrid search with RRF built-in.

References โ€‹

Comments powered by Giscus. Enable GitHub Discussions on the repo to activate.

Built with VitePress + Dracula Theme