
🧱 What are DataParts?

In TuringDB, graphs are versioned as a sequence of commits. Each commit represents a snapshot of the graph’s state at a given point in time. But under the hood, each commit is composed of DataParts, the fundamental unit of storage in TuringDB’s architecture.

📦 DataParts Explained

  • Every commit is partitioned into multiple DataParts
  • Nodes and edges are stored within DataParts
  • Once written, a DataPart is immutable
  • Commits reference a collection of DataParts, both newly written and inherited from previous commits
πŸ–ΌοΈ Visualization: Imagine a commit as a big box. Inside it, multiple internal boxes labeled DataPart 1, DataPart 2, etc., each storing a portion of the graph.

⚑ Why DataParts?

TuringDB is fundamentally a read-optimized analytical graph database, but DataParts are our answer to achieving high-performance parallel batch writes and data imports, especially for large-scale ingestion workloads.

🔄 Benefits

  • Write Parallelism: Multiple threads or processes can write concurrently to their own private DataPart, without coordination, synchronization, or locking overhead.
  • Batch Import Performance: Ingesting millions of nodes and edges becomes scalable and efficient, even in a system built for sub-millisecond analytics.
  • Snapshot Safety: Each commit references a set of immutable DataParts, allowing us to maintain consistent snapshots and rollback history without duplication.
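The write-parallelism benefit above can be sketched as follows: each worker fills its own private part, so no locks or shared state are needed, and the parts are simply collected at commit time. The function `build_part` and the batch contents are illustrative assumptions, not TuringDB's API.

```python
from concurrent.futures import ThreadPoolExecutor

def build_part(batch):
    # Each worker writes only into its own local part: no shared state,
    # no locking, no coordination between writers.
    return {"nodes": list(batch)}

batches = [range(0, 5), range(5, 10), range(10, 15)]
with ThreadPoolExecutor(max_workers=3) as pool:
    # Each batch becomes one private DataPart-like object
    parts = list(pool.map(build_part, batches))

total = sum(len(p["nodes"]) for p in parts)
print(total)  # 15
```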

🧠 How TuringDB Uses DataParts

Each time you add new data or modify existing node/edge properties:
  • TuringDB creates a new DataPart to store the changes.
  • It reuses existing DataParts from the parent commit whenever possible.
  • This leads to efficient incremental storage: only new or changed data consumes additional memory.
Commit 1
 β”œβ”€β”€ DataPart 1
 β”œβ”€β”€ DataPart 2
 β”œβ”€β”€ DataPart 3
 └── DataPart 4

Commit 2
 β”œβ”€β”€ [references] DataPart 1
 β”œβ”€β”€ [references] DataPart 2
 β”œβ”€β”€ [references] DataPart 3
 β”œβ”€β”€ [references] DataPart 4
 └── [adds]   DataPart 5
DataParts in Commits
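The diagram above can be sketched as a small function: a child commit references its parent's immutable parts and adds only one new part for the delta. The names here (`make_child_commit`, the `"p1"`..`"p5"` payloads) are illustrative, not TuringDB internals.

```python
def make_child_commit(parent_parts, new_data):
    new_part = {"data": new_data}      # only the delta is stored
    return parent_parts + [new_part]   # parent parts are referenced, not copied

commit1 = [{"data": "p1"}, {"data": "p2"}, {"data": "p3"}, {"data": "p4"}]
commit2 = make_child_commit(commit1, "p5")

print(len(commit2))              # 5
print(commit2[0] is commit1[0])  # True: same object, shared, not duplicated
```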
🔒 Like git objects, DataParts are immutable and sharable, enabling:
  • Deduplication of unchanged data
  • Consistent time-travel queries
  • Audit-friendly storage history
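The git analogy suggests content addressing: identical part contents hash to the same key, so an unchanged part is stored once and shared by every commit that references it. This is a sketch of the idea; TuringDB's actual deduplication mechanism may differ.

```python
import hashlib

store = {}  # content-addressed part store: hash -> bytes

def put_part(content: bytes) -> str:
    key = hashlib.sha256(content).hexdigest()
    store.setdefault(key, content)  # no-op if this part already exists
    return key

k1 = put_part(b"nodes: a, b")
k2 = put_part(b"nodes: a, b")  # unchanged part referenced by a later commit
print(k1 == k2)    # True: same content, same key
print(len(store))  # 1: stored only once
```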

πŸ“ Tuning for Performance

TuringDB can efficiently read and traverse graphs with up to 200 DataParts per commit. However, for optimal read performance, we aim to consolidate down to a single DataPart per commit.
The fewer the DataParts, the faster the reads, due to improved locality, reduced CPU cache misses, and minimized lookup overhead.

🧭 Roadmap: Intelligent DataPart Merging

We are actively developing policies and algorithms to intelligently merge DataParts in the background. The goal is to:
  • Automatically compact multiple DataParts into fewer ones
  • Detect hot paths and frequently accessed subgraphs
  • Optimize for query throughput and storage locality
In the future, commits will start as multiple DataParts for fast ingestion and converge toward compact forms for analytical speed, combining the best of both worlds.
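Compaction itself can be sketched under one stated assumption: each part keeps its node ids sorted, so merging many small parts into a single larger one is a k-way merge. This illustrates the goal of the roadmap above, not TuringDB's actual merging algorithm.

```python
import heapq

def compact(parts):
    # Merge several sorted parts into one sorted part for better
    # locality and fewer per-part lookups at read time.
    return list(heapq.merge(*parts))

small_parts = [[1, 4], [2, 5], [3, 6]]  # three small ingestion-time parts
merged = compact(small_parts)
print(merged)  # [1, 2, 3, 4, 5, 6]
```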

💡 Summary

| Feature | Benefit |
| --- | --- |
| Immutable DataParts | Safe versioning and reuse |
| Parallel write ingestion | High-performance batch processing |
| Shared storage across commits | Lower memory usage, fast snapshots |
| Merge roadmap | Compact layout for ultimate read speed |
TuringDB uses DataParts to balance high-speed writes, versioned safety, and read-optimized performance, all in a single, cohesive engine.
  • ClickHouse Parts: a similar model used in high-performance columnar stores to enable immutability, versioning, and efficient compaction.