May 27, 2026 · 11 min read
Time-Series Databases: How They Store Billions of Points in a Couple of Bytes Each
Traditional B-tree databases collapse under append-only telemetry. This is a deep dive into how TSDBs ingest millions of points per second — WAL, MemTables, LSM compaction, and the Gorilla compression that squeezes a 16-byte data point down to ~1.37 bytes. Covers cardinality, data modeling, and how to pick between Prometheus, VictoriaMetrics, TimescaleDB, and InfluxDB.

TL;DR
- A TSDB treats time as a first-class index. It optimizes for fast sequential writes and window-based range scans, not in-place transactional updates.
- Traditional B-tree / OLTP engines choke on append-only telemetry: random write overhead, lock contention, and poor compression of repetitive values.
- The architecture is an LSM-style write path — WAL for durability, an in-memory MemTable, then flush to immutable compressed segments, with background compaction merging them.
- Compression is the magic trick: delta-of-delta for timestamps, XOR (Gorilla) for floats, RLE for repeats. Facebook's Gorilla paper squeezed a 16-byte point to ~1.37 bytes — a 12x reduction. Source: Gorilla, VLDB 2015
- The number one way to destroy a TSDB is high cardinality — never put unbounded IDs into indexed tags.
Why time-series data is its own problem
Time-series data is one of the fastest-growing data categories around, and it has a shape unlike anything a transactional database was designed for. Every metric, every trace, every sensor reading is a point stamped with a time, arriving in a relentless append-only stream.
The defining traits are consistent across domains:
- Data is time-indexed at high resolution (millisecond or nanosecond).
- Write throughput matters far more than transactional updates — you almost never go back and edit a reading from three hours ago.
- Queries are overwhelmingly window-based: "p99 latency over the last 5 minutes," "average temperature over 30 days."
- Recent data is hot, old data is cold — and eventually downsampled or dropped entirely.
Once you internalize that shape, the architectural choices a TSDB makes stop looking exotic and start looking inevitable.
Why traditional databases fall over
Reach for Postgres or MySQL first — that's the reasonable instinct, and it works until it doesn't. Here's where it breaks.
Write performance. OLTP engines lean on B-tree indexes tuned for frequent in-place updates and point lookups with ACID guarantees. Feed them a continuous, time-ordered firehose and B-trees suffer random disk-write overhead and lock contention. You're paying for a machine built to mutate records when all you ever do is append them.
Storage bloat. Without columnar compression tuned for slowly-changing values, storing repetitive metric data balloons on disk. Relational row formats don't naturally collapse near-identical consecutive values, so your storage bill climbs faster than your insight does.
Query collapse. Let a relational table accumulate a few billion timestamped rows and range-scan performance falls off a cliff unless you've carefully time-partitioned. The planner ends up evaluating millions of scattered rows for a single analytical query, exhausting memory and crawling.
💡 The tell: if your "metrics table" already needs manual partitioning, a covering index per query shape, and a nightly DELETE job, you've rebuilt a worse TSDB by hand.
How a TSDB is built
The write path and the read path
Incoming samples take a deliberate route designed around durability and speed:
On the write path, every sample is appended to a Write-Ahead Log (WAL) for crash recovery, while concurrently landing in an in-memory mutable buffer — the MemTable or "Head Block" — sorted by series key and time.
On the read path, the query planner uses late materialization and data-skipping indexes (MinMax ranges, Bloom filters) to prune entire blocks before decompressing them. You don't pay the CPU cost of decompressing data you were never going to return.
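The MinMax pruning described above can be sketched in a few lines. This is a minimal illustration, not any particular engine's planner: the `Block` class, its field names, and the one-hour block boundaries are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """Hypothetical block descriptor: only this metadata is held in memory."""
    name: str
    min_ts: int  # earliest timestamp stored in the block (seconds)
    max_ts: int  # latest timestamp stored in the block (seconds)


def blocks_to_read(blocks, query_start, query_end):
    """Keep only blocks whose [min_ts, max_ts] overlaps the inclusive query window."""
    return [b for b in blocks if b.max_ts >= query_start and b.min_ts <= query_end]


blocks = [
    Block("block-00", 0, 3599),
    Block("block-01", 3600, 7199),
    Block("block-02", 7200, 10799),
]

# The query covers seconds 4000..5000, so only block-01 has to be decompressed.
selected = blocks_to_read(blocks, 4000, 5000)
print([b.name for b in selected])  # ['block-01']
```

The other two blocks are rejected by comparing four integers, which is the whole point: the expensive decompression step only runs on data that can actually match.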
WAL, MemTables, and compaction
This is essentially an LSM (Log-Structured Merge) tree specialized for time. Once the MemTable crosses a size threshold, it's flushed to disk as an immutable, highly compressed segment. Because segments are immutable, writes never block on locks the way an in-place B-tree update would.
The cost is fragmentation — many small files accumulate. So background compaction threads continuously merge small segments into larger optimized ones, dropping expired or deleted data along the way. This is the engine's housekeeping, and it's also where a lot of the I/O and CPU cost lives (more on that under challenges).
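To make the WAL, MemTable, flush, and compaction cycle concrete, here is a toy sketch. It assumes everything lives in Python lists and dicts (a real engine writes files), and the `ToyTSDB` class, its method names, and the tiny flush threshold are invented for illustration.

```python
import json


class ToyTSDB:
    """Toy LSM-style write path; in-memory stand-ins replace real files."""

    def __init__(self, flush_threshold=4):
        self.wal = []        # stand-in for an append-only log file
        self.memtable = {}   # mutable buffer: (series, ts) -> value
        self.segments = []   # immutable, sorted tuples of (series, ts, value)
        self.flush_threshold = flush_threshold

    def write(self, series, ts, value):
        self.wal.append(json.dumps([series, ts, value]))  # 1. log for crash recovery
        self.memtable[(series, ts)] = value                # 2. buffer in memory
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Freeze the buffer into an immutable segment sorted by (series, ts).
        segment = tuple(sorted((s, t, v) for (s, t), v in self.memtable.items()))
        self.segments.append(segment)
        self.memtable.clear()
        self.wal.clear()  # simplification: flushed data no longer needs replaying

    def compact(self, retain_from_ts=0):
        # Merge every segment into one, dropping points older than the cutoff.
        merged = {}
        for segment in self.segments:  # later segments win on duplicate keys
            for series, ts, value in segment:
                if ts >= retain_from_ts:
                    merged[(series, ts)] = value
        self.segments = [tuple(sorted((s, t, v) for (s, t), v in merged.items()))]


db = ToyTSDB(flush_threshold=2)
for ts in range(6):
    db.write("cpu", ts, float(ts))

segments_before = len(db.segments)
db.compact(retain_from_ts=2)
print(segments_before, len(db.segments), len(db.segments[0]))  # 3 1 4
```

Six writes with a threshold of two produce three small segments; compaction merges them into one and drops the two expired points along the way.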
Compression: the part that feels like cheating
This is where TSDBs earn their keep. Because time-series values change slowly and predictably, you can apply algorithms that would be useless on random data.
| Technique | Applies to | How it works |
|---|---|---|
| Delta-of-delta | Timestamps | Stores the change in the interval. A perfectly regular interval has a delta-of-delta of zero — encodable in a single bit. |
| XOR (Gorilla) | Floats | XORs each value against the previous one; near-identical values produce mostly-zero results, so only the meaningful middle bits are stored. |
| RLE | Repeating values | Stores "value × count" instead of N copies. |
| Bit-packing | Booleans / small ints | Packs many values into the bits of a single word. |
The Gorilla paper from Facebook is the canonical reference here. By combining delta-of-delta timestamps with XOR'd floats, it compressed each 16-byte (timestamp, value) point down to an average of 1.37 bytes — roughly a 12x reduction, which is what let Gorilla keep its working set in memory. Source: Gorilla, VLDB 2015
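The XOR step is easy to see on real bit patterns. The sketch below only measures how many bits differ between consecutive floats; it leaves out the control bits and block layout that the Gorilla format wraps around them, and the helper names are my own.

```python
import struct


def float_bits(x):
    """Reinterpret a float as its 64-bit IEEE 754 bit pattern."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]


def xor_window(prev, cur):
    """Return (leading_zeros, meaningful_bits, trailing_zeros) of prev XOR cur."""
    x = float_bits(prev) ^ float_bits(cur)
    if x == 0:
        return (64, 0, 0)  # identical value: no payload bits are needed
    leading = 64 - x.bit_length()
    trailing = (x & -x).bit_length() - 1  # index of the lowest set bit
    return (leading, 64 - leading - trailing, trailing)


print(xor_window(12.0, 12.0))  # (64, 0, 0)
print(xor_window(12.0, 24.0))  # (11, 1, 52)
```

Going from 12.0 to 24.0 changes only the exponent, so a single meaningful bit survives the XOR. An encoder stores that short window plus a little bookkeeping instead of all 64 bits.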
The detail I find genuinely delightful: the paper found that ~96% of all timestamps could be compressed to a single bit, because that fraction of points arrive at perfectly regular intervals (delta-of-delta = 0). Source: The Morning Paper — Gorilla
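A small sketch shows why regular intervals collapse to zeros. It computes the delta-of-delta values only, without the variable-length bit encoding a real implementation layers on top; the function names and sample timestamps are illustrative, and it assumes at least two timestamps.

```python
def delta_of_delta(timestamps):
    """Return (first timestamp, first delta, list of delta-of-delta values)."""
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    dods = [b - a for a, b in zip(deltas, deltas[1:])]
    return timestamps[0], deltas[0], dods


def decode(first, first_delta, dods):
    """Rebuild the original timestamps from the encoded form."""
    out = [first, first + first_delta]
    delta = first_delta
    for d in dods:
        delta += d
        out.append(out[-1] + delta)
    return out


# A 60-second scrape interval with one sample arriving a second late.
ts = [1000, 1060, 1120, 1180, 1241, 1300]
first, first_delta, dods = delta_of_delta(ts)
print(dods)                                  # [0, 0, 1, -2]
print(decode(first, first_delta, dods) == ts)  # True
```

Every on-schedule sample becomes a zero, which is the case that fits in a single bit. Only the late sample and the one after it need a few extra bits.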
Sharding and replication
To scale, data is sharded by time (and sometimes by tag space). Time-partitioning means all incoming writes hit the "now" partition while reads only touch partitions overlapping the queried window. Replication across nodes then buys high availability, fault tolerance, and zero-downtime upgrades.
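Time-partitioning comes down to integer arithmetic on the timestamp. The sketch below assumes fixed one-hour partitions keyed by their start time; the constant and function names are hypothetical, and real engines often use variable or configurable partition widths.

```python
PARTITION_SECONDS = 3600  # assumed partition width: one hour


def partition_for(ts):
    """Map a timestamp to the start time of the partition that owns it."""
    return ts - ts % PARTITION_SECONDS


def partitions_for_range(start, end):
    """List the partition start times overlapping the inclusive range [start, end]."""
    return list(range(partition_for(start), partition_for(end) + 1, PARTITION_SECONDS))


print(partition_for(7300))               # 7200
print(partitions_for_range(3500, 7300))  # [0, 3600, 7200]
```

A write at "now" always maps to the newest partition, while a read only has to open the partitions returned for its window.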
Data modeling: where you win or lose
The schema decisions you make on day one decide whether your TSDB scales or implodes. The core split is between tags and fields:
- Tags / labels — indexed metadata you filter and group by (hostname, region, service).
- Fields — the unindexed raw measurements (the actual numbers).
You then choose a schema shape:
- A narrow schema (one metric per row) scales well with sparse data and high cardinality.
- A wide schema (multiple metrics per row) simplifies joins between related measurements and cuts per-row overhead.
High cardinality: the cardinal sin
Cardinality is the total number of unique series created by the combinatorial product of your tag values. Track three regions × four services × ten hosts and you have 120 series — fine. Put user_id or trace_id into an indexed tag and you have millions, churning constantly. That's a series explosion, and it saturates the RAM holding your index mappings while shredding query performance.
⚠️ Warning: Never associate high-cardinality values — trace IDs, IP addresses, session IDs, container UUIDs — with indexed tags. Store them as unindexed fields. Build hierarchical tag structures to aggregate without losing context.
This single rule prevents more production TSDB outages than any other.
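The arithmetic behind a series explosion is worth running once. This sketch gives the worst-case series count for one metric as the product of distinct values per tag; the one-million `user_id` figure is an assumed number for illustration.

```python
from math import prod


def max_series(distinct_values_per_tag):
    """Upper bound on series for one metric: product of distinct values per tag."""
    return prod(distinct_values_per_tag.values())


bounded = {"region": 3, "service": 4, "host": 10}
print(max_series(bounded))  # 120

# Adding an unbounded identifier as a tag multiplies everything by its size.
exploded = {**bounded, "user_id": 1_000_000}
print(max_series(exploded))  # 120000000
```

Real workloads rarely hit the full product because not every combination occurs, but the bound shows how one unbounded tag dominates everything else.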
Querying time-series data
Queries are almost always window-based, and the engine is built around that assumption:
- Time-range queries scan contiguous ranges and use time-partitioned chunks to skip irrelevant blocks.
- Aggregations and rollups — sums, averages, percentiles, min/max over time buckets — power most dashboards.
- Real-time analytics work because the freshest data sits uncompressed in the in-memory Head Block; queries merge it on the fly with older compressed on-disk blocks for sub-second latency.
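The rollups in the list above reduce to grouping points by a time bucket and aggregating each group. Here is a minimal in-memory sketch of that idea; the function name and sample data are illustrative, and a real engine would run this over compressed columnar blocks rather than Python tuples.

```python
from collections import defaultdict
from statistics import mean


def bucket_avg(points, bucket_seconds):
    """Average (timestamp, value) points per fixed-width time bucket."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % bucket_seconds].append(value)
    return {start: mean(values) for start, values in sorted(buckets.items())}


points = [(0, 1.0), (10, 3.0), (20, 5.0), (60, 10.0), (70, 20.0)]
print(bucket_avg(points, 60))  # {0: 3.0, 60: 15.0}
```

The same grouping, run once and written back to storage, is also how downsampling turns 10-second points into coarser summaries.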
SQL vs PromQL vs Flux
The query language is one of the biggest differentiators between engines.
| Language | Used by | Sweet spot |
|---|---|---|
| SQL | TimescaleDB, QuestDB, InfluxDB 3 | Familiarity, rich tooling, joins with relational data |
| PromQL | Prometheus, VictoriaMetrics | Real-time alerting, rate calculations over range vectors |
| Flux | Legacy InfluxDB 2.x | Functional scripting — powerful but a steep curve |
Worth flagging: Flux is on the way out. InfluxDB's 3.x line is built around native SQL via the Apache Arrow Flight SQL interface, explicitly removing the need for a v2-era domain-specific language like Flux (Source: AWS, Timestream for InfluxDB 3). If you're starting fresh, don't invest in Flux.
Picking a TSDB
There's no single "best" engine — there's the right fit for your workload. Here's how the major players line up.
| | Prometheus | VictoriaMetrics | TimescaleDB | InfluxDB 3 |
|---|---|---|---|---|
| Model | Pull / scrape | Push or pull (PromQL-compatible) | PostgreSQL extension | Purpose-built columnar |
| Query language | PromQL | PromQL / MetricsQL | SQL | SQL / InfluxQL |
| Best for | K8s metrics + alerting | High-cardinality, long-term Prom storage | SQL joins with business data | Extreme write throughput, IoT |
| Storage | Local single-node | Disk-backed inverted index | Hypertables on Postgres | Parquet on object storage (S3) |
| High cardinality | ⚠️ Weak point | ✅ Built for it | ⚠️ Gets expensive | ✅ Unlimited (3.x) |
| ACID / joins | ❌ | ❌ | ✅ Full | Limited |
A few notes from verifying the current state of each:
- Prometheus uses a pull-based model to scrape metric endpoints and runs primarily as a single-node local store for operational alerting and short-term retention.
- VictoriaMetrics uses a disk-backed inverted index and TSID architecture, which makes it exceptionally resilient to high cardinality and series churn while staying PromQL-compatible — hence its popularity as Prometheus long-term storage.
- TimescaleDB extends PostgreSQL, turning ordinary tables into time-partitioned hypertables with full ACID and native joins to relational data. Note the branding: Timescale the company rebranded to TigerData in June 2025, but the open-source extension you install is still called TimescaleDB. Source: TigerData announcement
- InfluxDB 3 is a ground-up rewrite in Rust on the FDAP stack (Flight, DataFusion, Arrow, Parquet), replacing the old TSM engine, persisting to object storage, and making SQL the primary language. Source: InfluxData / BigDATAwire
- For finance, QuestDB and kdb+ dominate where nanosecond precision and ultra-low latency for trading signals are non-negotiable.
Scaling and optimization
Four levers do most of the heavy lifting once you're in production:
- Retention policies (TTL). Old data loses relevance, so drop entire aged-out partitions from disk rather than running expensive row-by-row DELETE queries.
- Downsampling. Roll granular raw data (10-second intervals) into coarse summaries (1-hour or 1-day averages). You keep the long-term trend and shed billions of rows.
- Compression tuning. Delta-of-delta encoding on timestamps plus RLE/XOR on values can shrink a 16-byte point (a 64-bit timestamp plus a 64-bit value) to a couple of bytes, roughly an order of magnitude in storage savings on regular-interval data.
- Storage tiering. Move data from fast, expensive hot nodes to cheap cold storage (S3-compatible object storage), and scale horizontally by distributing partitions and queries across a cluster.
Here's the kind of compression you're realistically looking at on regular-interval data:
| Raw point | After delta-of-delta + XOR | Reduction |
|---|---|---|
| 16 bytes (ts + value) | ~1.37 bytes (Gorilla avg) | ~12x |
| 16 bytes (random-ish) | 2–5 bytes (typical) | 3–8x |
Source: Gorilla, VLDB 2015; VictoriaMetrics compression analysis
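To put the first row of the table in capacity-planning terms, here is the back-of-the-envelope arithmetic. The ingest rate of one million points per second is an assumption chosen for round numbers, not a benchmark; the byte sizes are the Gorilla figures quoted above.

```python
POINTS_PER_SECOND = 1_000_000  # assumed ingest rate, for illustration only
SECONDS_PER_DAY = 86_400
RAW_BYTES = 16        # 64-bit timestamp + 64-bit float
GORILLA_BYTES = 1.37  # average per point reported in the Gorilla paper

points_per_day = POINTS_PER_SECOND * SECONDS_PER_DAY
raw_tb = points_per_day * RAW_BYTES / 1e12
compressed_tb = points_per_day * GORILLA_BYTES / 1e12
ratio = RAW_BYTES / GORILLA_BYTES

print(f"raw: {raw_tb:.2f} TB/day, compressed: {compressed_tb:.2f} TB/day, ratio: {ratio:.1f}x")
# raw: 1.38 TB/day, compressed: 0.12 TB/day, ratio: 11.7x
```

At that rate the difference is more than a terabyte per day, which is why the "~12x" figure matters far more than its size suggests.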
Challenges and limitations
No free lunch. The same design that makes TSDBs fast creates its own failure modes.
- High cardinality is the recurring villain — high-churn environments generate millions of unique series metadata entries that saturate index RAM and bottleneck the CPU when merging scattered streams.
- Storage costs run away if you skip downsampling and tiering. Raw time-series grows indefinitely by default.
- Query complexity — large cross-measurement joins or sparse string searches can force full scans if the column isn't in the sort index or a skip index. Query paths go pathological and time out.
- Operational tuning — WAL checkpoints, LSM compaction, and index merging continuously consume disk I/O and CPU, which can cause ingest lag or backpressure during traffic spikes. Compaction is a feature and a liability.
Where this is heading
Three trends are reshaping the category:
- Cloud-native, compute/storage separation. InfluxDB 3 persisting Parquet directly to S3-compatible object storage is the template — near-unlimited scale, pay for what you store.
- AI and predictive analytics. TSDBs are growing into feature generation and ML, blending time-series with vector similarity search (e.g. Timescale's pgvectorscale) for failure prediction and intelligent trend analysis.
- Edge and real-time. Lightweight TSDBs run directly on IoT devices to buffer and aggregate locally, cutting the bandwidth needed to ship telemetry back to the cloud.
Production checklist
- Model tags vs fields deliberately — indexed metadata in tags, raw numbers in fields. Decide narrow vs wide up front.
- Cap your cardinality — never index unbounded IDs (trace, session, IP, UUID). Watch active series count like a hawk.
- Set retention and downsampling policies from day one — don't wait for the disk to fill.
- Pick the query language you'll actually live with — SQL for joins and tooling, PromQL for alerting. Skip Flux on greenfield work.
- Match the engine to the workload — Prometheus for K8s alerting, VictoriaMetrics for high-cardinality long-term storage, TimescaleDB for SQL joins, InfluxDB/QuestDB for write-heavy IoT and finance.
- Tier your storage — hot nodes for recent data, object storage for cold, before costs spiral.
- Budget for compaction — monitor I/O and ingest lag during traffic spikes; compaction is a background tax you must plan around.
- Replicate for availability — at least two nodes/regions for anything you'd page on.
Conclusion
Whenever I've reached for a TSDB, the win has come from one mental shift: treating time as a first-class dimension instead of just another column. Once you do that, the whole architecture — append-only WAL, in-memory Head Block, LSM compaction, Gorilla-style compression — stops being a pile of unfamiliar acronyms and starts reading like a single coherent answer to "how do I ingest millions of points a second and still query them in milliseconds?"
If you're starting out, don't over-engineer it. Pick the engine that matches your dominant query shape, model your tags and fields carefully, and put a hard ceiling on cardinality before anything else. Add tiering and downsampling once the data volume actually justifies them. Get those fundamentals right and a TSDB will quietly decode the past well enough to help you predict the future — which is, ultimately, the whole point.
FAQ
What is a time-series database used for?
A TSDB is optimized for timestamped data — infrastructure and app metrics, observability signals, IoT sensor readings, and financial tick data. It prioritizes fast sequential writes and range scans over the in-place updates a transactional database is built for.
Why can't I just use Postgres or MySQL for time-series data?
You can, up to a point. But B-tree indexes optimized for in-place updates suffer random-write overhead and lock contention under continuous append-only ingest, and relational row storage compresses repetitive metrics poorly. Range scans over billions of rows degrade without aggressive time-partitioning.
What is high cardinality and why does it kill a TSDB?
Cardinality is the number of unique series produced by every combination of your tag values. Putting unbounded values like user IDs, trace IDs, or container UUIDs into indexed tags creates a 'series explosion' that saturates the index memory and slows queries to a crawl.
How does a TSDB compress data so aggressively?
It exploits the fact that time-series data changes slowly and predictably. Timestamps use delta-of-delta encoding (a regular interval collapses to a single bit), floating-point values use XOR-based Gorilla compression, and repeating values use RLE. Facebook's Gorilla paper reported ~12x compression, around 1.37 bytes per point.
Which TSDB should I use?
Prometheus for Kubernetes metrics and alerting; VictoriaMetrics for high-cardinality, long-term Prometheus storage; TimescaleDB (now part of TigerData) when you need SQL joins against relational business data; InfluxDB or QuestDB for extreme write throughput and IoT-scale telemetry.
What's the difference between downsampling and retention?
Retention drops whole time partitions once they age past a TTL — no row-by-row DELETE. Downsampling rolls granular raw data up into coarser summaries (e.g. 10-second points into 1-hour averages) so you keep long-term trends without keeping every raw point.