ddia-systems


Designing Data-Intensive Applications Framework

A principled approach to building reliable, scalable, and maintainable data systems. Apply these principles when choosing databases, designing schemas, architecting distributed systems, or reasoning about consistency and fault tolerance.

Core Principle

Data outlives code. Applications are rewritten, languages change, frameworks come and go -- but data and its structure persist for decades. Every architectural decision must prioritize the long-term correctness, durability, and evolvability of the data layer above all else.
The foundation: Most applications are data-intensive, not compute-intensive. The hard problems are the amount of data, its complexity, and the speed at which it changes. Understanding the trade-offs between consistency, availability, partition tolerance, latency, and throughput is what separates robust systems from fragile ones.

Scoring

Goal: 10/10. When reviewing or designing data architectures, rate them 0-10 based on adherence to the principles below. A 10/10 means deliberate trade-off choices for data models, storage engines, replication, partitioning, transactions, and processing pipelines; lower scores indicate accidental complexity or ignored failure modes. Always provide the current score and specific improvements needed to reach 10/10.

The DDIA Framework

Seven domains for reasoning about data-intensive systems:

1. Data Models and Query Languages

Core concept: The data model shapes how you think about the problem. Relational, document, and graph models each impose different constraints and enable different query patterns.
Why it works: Choosing the wrong data model forces application code to compensate for representational mismatch, adding accidental complexity that compounds over time.
Key insights:
  • Relational models excel at many-to-many relationships and ad-hoc queries
  • Document models excel at one-to-many relationships and data locality
  • Graph models excel at highly interconnected data with recursive traversals
  • Schema-on-write (relational) catches errors early; schema-on-read (document) offers flexibility
  • Polyglot persistence -- use different stores for different access patterns -- is often the right answer
  • Impedance mismatch between objects and relations is a real cost; document models reduce it for self-contained aggregates
Code applications:

| Context | Pattern | Example |
| --- | --- | --- |
| User profiles with nested data | Document model for self-contained aggregates | Store profile, addresses, and preferences in one MongoDB document |
| Social network connections | Graph model for relationship traversal | Neo4j Cypher query `MATCH (a)-[:FOLLOWS*2]->(b)` for friend-of-friend |
| Financial ledger with joins | Relational model for referential integrity | PostgreSQL with foreign keys between accounts, transactions, and entries |
| Mixed access patterns | Polyglot persistence | PostgreSQL for transactions + Elasticsearch for full-text search + Redis for caching |
See: references/data-models.md
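The impedance-mismatch point above can be made concrete with a small sketch (plain Python, hypothetical names and data): the same user profile stored normalized, where the application must reassemble the aggregate, versus as one self-contained document.

```python
# Relational (normalized) form: addresses reference the user by foreign
# key, and the application -- or a SQL join -- reassembles the aggregate.
users = {1: {"name": "Ada"}}
addresses = [
    {"user_id": 1, "city": "London", "kind": "home"},
    {"user_id": 1, "city": "Cambridge", "kind": "work"},
]

def load_profile_relational(user_id):
    # Application-side join: one lookup plus a scan over the address rows.
    profile = dict(users[user_id])
    profile["addresses"] = [a for a in addresses if a["user_id"] == user_id]
    return profile

# Document form: the whole aggregate lives in one record, so a read that
# always wants the full profile needs no join and gets data locality.
user_doc = {
    "name": "Ada",
    "addresses": [
        {"city": "London", "kind": "home"},
        {"city": "Cambridge", "kind": "work"},
    ],
}

profile = load_profile_relational(1)
```

The trade-off cuts the other way for many-to-many data: once addresses are shared between users, the document form duplicates them and the relational form wins.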

2. Storage Engines

Core concept: Storage engines make a fundamental trade-off between read performance and write performance. Log-structured engines (LSM trees) optimize writes; page-oriented engines (B-trees) balance reads and writes.
Why it works: Understanding the internals of your database's storage engine lets you predict performance characteristics, choose appropriate indexes, and avoid pathological workloads.
Key insights:
  • LSM trees: append-only writes, periodic compaction, excellent write throughput, higher read amplification
  • B-trees: in-place updates, predictable read latency, write amplification from page splits
  • Write amplification means one logical write causes multiple physical writes -- critical for SSDs with limited write cycles
  • Column-oriented storage dramatically improves analytical query performance through compression and vectorized processing
  • In-memory databases are fast not because they avoid disk, but because they avoid encoding overhead
Code applications:

| Context | Pattern | Example |
| --- | --- | --- |
| High write throughput | LSM-tree engine | Cassandra or RocksDB for time-series ingestion at 100K+ writes/sec |
| Mixed read/write OLTP | B-tree engine | PostgreSQL B-tree indexes for transactional workloads with point lookups |
| Analytical queries on large datasets | Column-oriented storage | ClickHouse or Parquet files for scanning billions of rows with few columns |
| Low-latency caching | In-memory store | Redis for sub-millisecond lookups; Memcached for simple key-value caching |
See: references/storage-engines.md
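A toy model helps fix the LSM read/write trade-off: writes land in a memtable, full memtables flush to immutable sorted runs, reads probe newest-first (read amplification), and compaction merges runs. This is a sketch only; real engines add write-ahead logs, bloom filters, and leveled compaction.

```python
class MiniLSM:
    def __init__(self, memtable_limit=4):
        self.memtable = {}              # mutable, absorbs all writes
        self.sstables = []              # immutable sorted runs, newest last
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value      # pure in-memory write, no seek
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        # Read amplification: check memtable, then each run newest-first.
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.sstables):
            for k, v in run:
                if k == key:
                    return v
        return None

    def compact(self):
        # Merge all runs into one; iterating oldest-first means newer
        # runs overwrite duplicate keys, then old runs are discarded.
        merged = {}
        for run in self.sstables:
            merged.update(dict(run))
        self.sstables = [sorted(merged.items())]

db = MiniLSM()
for i in range(10):
    db.put(f"k{i}", i)
db.put("k0", 99)   # overwrite lands in the memtable, shadowing old runs
db.compact()
```

Note where the costs show up: `put` never touches existing data (great write throughput), while `get` may scan several runs until compaction catches up.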

3. Replication

Core concept: Replication keeps copies of data on multiple machines for fault tolerance, scalability, and latency reduction. The core challenge is handling changes to replicated data consistently.
Why it works: Every replication strategy trades off between consistency, availability, and latency. Making this trade-off explicit prevents subtle data anomalies that surface only under load or failure.
Key insights:
  • Single-leader replication: simple, strong consistency possible, but the leader is a bottleneck and single point of failure
  • Multi-leader replication: better write availability across data centers, but conflict resolution is complex
  • Leaderless replication: highest availability, uses quorum reads/writes, but requires careful conflict handling
  • Replication lag causes read-your-writes violations, monotonic read violations, and causality violations
  • Synchronous replication guarantees durability but increases latency; asynchronous replication risks data loss on leader failure
  • CRDTs and last-writer-wins are conflict resolution strategies with very different correctness guarantees
Code applications:

| Context | Pattern | Example |
| --- | --- | --- |
| Read-heavy web app | Single-leader with read replicas | PostgreSQL primary + read replicas behind pgBouncer for read scaling |
| Multi-region writes | Multi-leader replication | CockroachDB or Spanner for geo-distributed writes with bounded staleness |
| Shopping cart availability | Leaderless with merge | DynamoDB with last-writer-wins or application-level merge for cart conflicts |
| Collaborative editing | CRDTs for conflict-free merging | Yjs or Automerge for real-time collaborative document editing |
See: references/replication.md
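The quorum rule behind leaderless replication (w + r > n) can be sketched in a few lines. Writes reach only w of n replicas; reads query r and keep the highest-versioned copy. Because the read and write sets must overlap, every read sees at least one up-to-date replica. A sketch only -- real systems add read repair, hinted handoff, and per-key version vectors rather than a global counter.

```python
import random

class QuorumStore:
    def __init__(self, n=3, w=2, r=2):
        assert w + r > n, "read and write quorums must overlap"
        self.replicas = [{} for _ in range(n)]   # key -> (version, value)
        self.w, self.r = w, r
        self.version = 0

    def write(self, key, value):
        # A write succeeds once w replicas acknowledge; the rest may lag.
        self.version += 1
        for rep in random.sample(self.replicas, self.w):
            rep[key] = (self.version, value)

    def read(self, key):
        # Query r replicas and keep the freshest copy. Since w + r > n,
        # at least one queried replica holds the latest committed write.
        hits = [rep[key] for rep in random.sample(self.replicas, self.r)
                if key in rep]
        return max(hits)[1] if hits else None

store = QuorumStore(n=3, w=2, r=2)
store.write("cart", ["book"])
store.write("cart", ["book", "pen"])
```

With n=3, w=2, r=2, any read set of two replicas intersects any write set of two, so `store.read("cart")` returns the second write no matter which replicas were sampled.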

4. Partitioning

Core concept: Partitioning (sharding) distributes data across multiple nodes so that each node handles a subset of the total data, enabling horizontal scaling beyond a single machine.
Why it works: Without partitioning, a single node becomes the bottleneck for storage capacity and throughput. Effective partitioning distributes load evenly and avoids hotspots.
Key insights:
  • Key-range partitioning supports efficient range scans but risks hotspots on sequential keys
  • Hash partitioning distributes load evenly but destroys sort order and makes range queries expensive
  • Secondary indexes can be partitioned locally (each partition has its own index) or globally (index partitioned separately)
  • Local secondary indexes require scatter-gather queries; global secondary indexes require cross-partition updates
  • Hotspots can occur even with hash partitioning if a single key is extremely popular (celebrity problem)
  • Rebalancing strategies: fixed number of partitions, dynamic splitting, or proportional to node count
Code applications:

| Context | Pattern | Example |
| --- | --- | --- |
| Time-series data | Key-range partitioning by time + source | Partition by `(sensor_id, date)` to avoid write hotspot on current day |
| User data at scale | Hash partitioning on user ID | Cassandra consistent hashing on `user_id` for even distribution |
| Global search index | Global secondary index | Elasticsearch index sharded independently from primary data store |
| Celebrity/hot-key problem | Key splitting with random suffix | Append random digit to hot partition key, fan-out reads across 10 sub-partitions |
See: references/partitioning.md
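The hot-key splitting pattern from the table can be sketched as follows. The partition count and fan-out of 10 are arbitrary choices for illustration; note the cost the pattern imposes, since every read becomes a scatter-gather over all sub-keys.

```python
import hashlib
import random

NUM_PARTITIONS = 8
HOT_FANOUT = 10

def partition_for(key):
    # Stable hash (not Python's randomized hash()) so routing is consistent
    # across processes and restarts.
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

partitions = [dict() for _ in range(NUM_PARTITIONS)]

def write_hot(key, value):
    # Spread a celebrity key across HOT_FANOUT sub-keys; each write picks
    # a random suffix, so load fans out over several partitions.
    sub_key = f"{key}#{random.randrange(HOT_FANOUT)}"
    bucket = partitions[partition_for(sub_key)]
    bucket.setdefault(sub_key, []).append(value)

def read_hot(key):
    # Scatter-gather: a read must collect every sub-key and merge.
    out = []
    for i in range(HOT_FANOUT):
        sub_key = f"{key}#{i}"
        out.extend(partitions[partition_for(sub_key)].get(sub_key, []))
    return out

for event in range(100):
    write_hot("celebrity_timeline", event)
```

Only split keys that are actually hot; applying the suffix trick everywhere turns every point read into ten.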

5. Transactions and Consistency

Core concept: Transactions provide safety guarantees (ACID) that simplify application code by letting you pretend failures and concurrency don't exist -- within the transaction's scope.
Why it works: Without transactions, every piece of application code must handle partial failures, race conditions, and concurrent modifications. Transactions move this complexity into the database where it can be handled correctly once.
Key insights:
  • Isolation levels are a spectrum: read uncommitted, read committed, snapshot isolation (repeatable read), serializable
  • Most databases default to read committed or snapshot isolation -- not serializable -- and application developers must understand the anomalies this permits
  • Write skew occurs when two transactions read the same data, make decisions based on it, and write different records -- no row-level lock prevents this
  • Serializable snapshot isolation (SSI) provides full serializability with optimistic concurrency -- no blocking, but aborts on conflict
  • Two-phase locking provides serializability but causes contention and deadlocks under high concurrency
  • Distributed transactions (two-phase commit) are expensive and fragile; avoid them when possible by designing around single-partition operations
Code applications:

| Context | Pattern | Example |
| --- | --- | --- |
| Account balance transfer | Serializable transaction | `BEGIN; UPDATE accounts SET balance = balance - 100 WHERE id = 1; UPDATE accounts SET balance = balance + 100 WHERE id = 2; COMMIT;` |
| Inventory reservation | SELECT FOR UPDATE to prevent write skew | `SELECT stock FROM items WHERE id = X FOR UPDATE` before decrementing |
| Read-heavy dashboards | Snapshot isolation for consistent reads | PostgreSQL MVCC provides point-in-time snapshot without blocking writers |
| Cross-service operations | Saga pattern instead of distributed transactions | Compensating transactions: charge card, reserve inventory; on failure, refund card |
See: references/transactions.md
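Write skew is easiest to see in a simulation. Below, two transactions each read the same snapshot of an on-call roster (a hypothetical scenario in the spirit of the classic on-call doctors example), decide based on it, and write different rows -- so row-level locks never conflict, yet the invariant breaks.

```python
# Invariant: at least one doctor must remain on call.
on_call = {"alice": True, "bob": True}

def txn_go_off_call(db, doctor):
    snapshot = dict(db)                  # consistent snapshot at txn start
    others = sum(snapshot.values()) - 1  # doctors remaining if we leave
    writes = {}
    if others >= 1:                      # decision based on the snapshot
        writes[doctor] = False           # write touches only our own row
    return writes

# Under snapshot isolation, both transactions read before either commits:
# each sees two doctors on call and concludes it is safe to leave.
w1 = txn_go_off_call(on_call, "alice")
w2 = txn_go_off_call(on_call, "bob")
on_call.update(w1)
on_call.update(w2)

# Invariant broken: nobody is on call, yet no two writes touched one row.
nobody_on_call = not any(on_call.values())
```

Fixes mirror the table above: run at serializable isolation (SSI would abort one transaction), or materialize the conflict by locking the rows the decision depends on, e.g. `SELECT ... FOR UPDATE` on the whole roster.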

6. Batch and Stream Processing

Core concept: Batch processing transforms bounded datasets in bulk; stream processing transforms unbounded event streams continuously. Both are forms of derived data computation.
Why it works: Separating the system of record (source of truth) from derived data (caches, indexes, materialized views) allows each to be optimized independently and rebuilt from the source when requirements change.
Key insights:
  • MapReduce is conceptually simple but operationally awkward; dataflow engines (Spark, Flink) generalize it with arbitrary DAGs
  • Event sourcing stores every state change as an immutable event, enabling full audit trails and temporal queries
  • Change data capture (CDC) turns database writes into a stream that downstream systems can consume
  • Stream-table duality: a stream is the changelog of a table; a table is the materialized state of a stream
  • Exactly-once semantics in stream processing require idempotent operations or transactional output
  • Time windowing (tumbling, hopping, session) is essential for aggregating unbounded streams
Code applications:

| Context | Pattern | Example |
| --- | --- | --- |
| Daily analytics pipeline | Batch processing with Spark | Read day's events from S3, aggregate metrics, write to data warehouse |
| Real-time fraud detection | Stream processing with Flink | Consume payment events from Kafka, apply rules within 5-second tumbling windows |
| Syncing search index | Change data capture | Debezium captures PostgreSQL WAL changes, publishes to Kafka, Elasticsearch consumer updates index |
| Audit trail / event replay | Event sourcing | Store `OrderPlaced`, `OrderShipped`, `OrderRefunded` events; rebuild current state by replaying |
See: references/batch-stream.md
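Tumbling-window aggregation can be sketched without a stream engine: each event-time timestamp maps to exactly one non-overlapping window keyed by its start. The 5-second window and the payment events below are illustrative; a real engine such as Flink adds watermarks and late-event handling on top of this core idea.

```python
from collections import defaultdict

WINDOW_SECONDS = 5

def window_start(ts):
    # Tumbling windows partition the time axis: each timestamp belongs
    # to exactly one window [start, start + WINDOW_SECONDS).
    return ts - (ts % WINDOW_SECONDS)

def tumbling_counts(events):
    """events: iterable of (timestamp_seconds, payment_amount)."""
    counts = defaultdict(lambda: {"n": 0, "total": 0.0})
    for ts, amount in events:
        w = window_start(ts)
        counts[w]["n"] += 1
        counts[w]["total"] += amount
    return dict(counts)

stream = [(1, 10.0), (2, 99.0), (4, 5.0), (6, 250.0), (9, 1.0), (11, 7.0)]
per_window = tumbling_counts(stream)
```

Hopping windows would differ only in that one event lands in several overlapping windows; session windows would group by gaps between events instead of fixed boundaries.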

7. Reliability and Fault Tolerance

Core concept: Faults are inevitable; failures are not. A reliable system continues operating correctly even when individual components fail. Design for faults, not against them.
Why it works: Hardware fails, software has bugs, humans make mistakes. Systems that assume perfect operation are brittle. Systems that expect and handle faults gracefully are resilient.
Key insights:
  • A fault is one component deviating from spec; a failure is the system as a whole stopping. Fault tolerance prevents faults from becoming failures
  • Hardware faults are random and independent; software faults are correlated and systematic (more dangerous)
  • Human error is the leading cause of outages -- design systems that minimize opportunity for mistakes and maximize ability to recover
  • Timeouts are the fundamental fault detector in distributed systems -- but choosing the right timeout is hard (too short causes false positives, too long delays recovery)
  • Safety properties (nothing bad happens) must always hold; liveness properties (something good eventually happens) may be temporarily violated
  • Byzantine fault tolerance is rarely needed outside blockchain -- most systems assume non-Byzantine (crash-stop or crash-recovery) models
Code applications:

| Context | Pattern | Example |
| --- | --- | --- |
| Service communication | Timeouts + retries with exponential backoff | `retry(max=3, backoff=exponential(base=1s, max=30s))` with jitter |
| Leader election | Consensus algorithm (Raft/Paxos) | etcd or ZooKeeper for distributed lock and leader election |
| Data pipeline reliability | Idempotent operations + checkpointing | Kafka consumer commits offset only after successful processing |
| Graceful degradation | Circuit breaker pattern | Hystrix/Resilience4j: open circuit after 50% failures in 10-second window |
See: references/fault-tolerance.md
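The `retry(...)` entry in the table is pseudocode; a minimal Python rendering with capped exponential backoff and full jitter might look like this. Parameter names are illustrative, not a real library API, and the sleep function is injectable so the behavior can be tested without waiting.

```python
import random
import time

def retry(fn, max_attempts=3, base=1.0, cap=30.0, sleep=time.sleep):
    """Call fn(); on failure wait uniformly in [0, min(cap, base * 2^attempt)]."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise                          # retries exhausted: surface the fault
            backoff = min(cap, base * (2 ** attempt))
            sleep(random.uniform(0, backoff))  # full jitter decorrelates clients

# Usage: a flaky call that succeeds on the third attempt.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient fault")
    return "ok"

waits = []   # capture the backoff delays instead of actually sleeping
result = retry(flaky, max_attempts=3, base=1.0, cap=30.0, sleep=waits.append)
```

Only retry operations that are idempotent, or the retry itself becomes a fault: a timed-out write may have succeeded, and resending it duplicates the effect.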

Common Mistakes

| Mistake | Why It Fails | Fix |
| --- | --- | --- |
| Choosing a database based on popularity | Different engines have fundamentally different trade-offs | Match storage engine characteristics to your actual read/write patterns |
| Ignoring replication lag | Users see stale data, phantom reads, or lost updates | Implement read-your-writes consistency; use monotonic read guarantees |
| Using distributed transactions everywhere | Two-phase commit is slow and fragile; the coordinator is a single point of failure | Design for single-partition operations; use sagas for cross-service coordination |
| Hash partitioning everything | Destroys range query ability; some workloads need sorted access | Use key-range partitioning for time-series; composite keys for locality |
| Assuming serializable isolation | Most databases default to weaker isolation; write skew bugs appear in production | Check your database's actual default isolation level; use explicit locking where needed |
| Conflating batch and stream | Batch tools on streaming data add latency; stream tools on bounded data waste complexity | Match processing model to data boundedness and latency requirements |
| Treating all faults as recoverable | Some failures (data corruption, Byzantine) require fundamentally different handling | Classify faults and design specific recovery strategies for each class |

Quick Diagnostic

| Question | If No | Action |
| --- | --- | --- |
| Can you explain why you chose this database over alternatives? | Decision was based on familiarity, not requirements | Evaluate data model fit, read/write ratio, consistency needs, and scaling path |
| Do you know your database's default isolation level? | You may have concurrency bugs you haven't found yet | Check documentation; test for write skew and phantom read scenarios |
| Is your replication strategy explicitly chosen (not defaulted)? | You have implicit assumptions about consistency and durability | Document trade-offs: sync vs async, failover behavior, lag tolerance |
| Can your system handle a hot partition key? | A single popular entity can bring down the cluster | Add key-splitting strategy or application-level load shedding for hot keys |
| Do you separate your system of record from derived data? | Schema changes or new features require migrating everything | Introduce CDC or event sourcing to decouple source from derived stores |
| Are your timeouts and retries tuned, not defaulted? | You get cascading failures or unnecessary delays | Measure p99 latency; set timeouts above p99 but below cascade threshold |
| Have you tested failover in production conditions? | Your recovery plan is theoretical, not validated | Run chaos engineering experiments: kill leaders, partition networks, fill disks |

Reference Files

  • data-models.md: Relational vs document vs graph models, schema-on-read vs schema-on-write, query languages, polyglot persistence
  • storage-engines.md: LSM trees vs B-trees, write amplification, compaction, column-oriented storage, in-memory databases
  • replication.md: Single-leader, multi-leader, leaderless replication, replication lag, conflict resolution, CRDTs
  • partitioning.md: Key-range vs hash partitioning, secondary indexes, rebalancing, request routing, hotspots
  • transactions.md: ACID, isolation levels, write skew, two-phase locking, SSI, distributed transactions
  • batch-stream.md: MapReduce, dataflow engines, event sourcing, CDC, stream-table duality, exactly-once semantics
  • fault-tolerance.md: Faults vs failures, reliability metrics, timeouts, consensus, safety and liveness guarantees

Further Reading

This skill is based on Martin Kleppmann's comprehensive guide to the principles and practicalities of data systems. For the complete treatment, with detailed diagrams and research references, read the book itself: Designing Data-Intensive Applications (O'Reilly, 2017).

About the Author

Martin Kleppmann is a researcher in distributed systems and a former software engineer at LinkedIn and Rapportive. He is a Senior Research Associate at the University of Cambridge and has worked extensively on CRDTs, Byzantine fault tolerance, and local-first software. Designing Data-Intensive Applications (2017) has become the definitive reference for engineers building data systems, praised for making complex distributed systems concepts accessible through clear explanations and practical examples. Kleppmann's research focuses on data consistency, decentralized collaboration, and ensuring correctness in distributed systems. He is also known for his conference talks and educational writing that bridge the gap between academic research and industrial practice.