From Amazon

Comments

Dynamo builds a highly available and scalable data store that sacrifices consistency under certain failure scenarios while remaining eventually consistent. The system is complex, so the paper covers a huge number of details, yet it cannot specify all of them and some interesting parts seem to be missing. What supporting techniques does Dynamo rely on? How does it compare to other existing distributed key-value stores? What extensions or future work are possible?

This paper assumes that security concerns can be ignored because the system is for internal use only. In theory, security problems are still possible; at the very least, how would an administrator detect Sybil attacks?

The authors tried three partitioning schemes. The later ones decouple partitioning from partition placement, and placement can be changed at run time. Consequently, strategy 3 is more efficient, recovers faster, and makes the key space easier to archive. However, while the key space is divided equally into Q partitions and every node takes Q/S tokens, the depiction of strategy 3 in Figure 7 is a little confusing: the authors should specify that the range each node is responsible for is SIMILAR IN SIZE to the others, not exactly EQUAL IN SIZE.

1. Problem

  • How to build a highly available key-value storage system (to be used in production by applications with demanding requirements)?
  • Assumptions:
    1. Query model: Simple KV store with read/write operations on small objects (<= 1 MB).
    2. ACID properties: Weak consistency and no isolation guarantees for high availability.
    3. Efficiency: Commodity machines. Stringent latency requirements specified by SLAs.
      • SLA (service level agreement): a contract in which client and server agree on clear performance bounds for a given request rate
        • An SLA should NOT be stated in terms of mean/median response time (customers with longer histories require more processing, and their poor experience would be hidden by the average)
        • Dynamo: bounds the 99.9th percentile of the latency distribution, a target chosen via cost-benefit analysis; see the percentile sketch after this list
        • Can be customized.
    4. No security concerns: because of internal use.
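
A quick illustration of why the SLA is stated at the 99.9th percentile rather than the mean. The latency numbers below are made up for illustration; this is not data from the paper.

```python
# Hypothetical latency samples: most requests are fast, a few are slow.
latencies_ms = [12] * 990 + [300] * 10
ordered = sorted(latencies_ms)

mean = sum(ordered) / len(ordered)
p999 = ordered[round(0.999 * len(ordered)) - 1]   # 999th of 1000 ordered samples

print(f"mean latency = {mean:.1f} ms")   # ~14.9 ms -- looks healthy
print(f"99.9th pct   = {p999} ms")       # 300 ms  -- what the SLA actually bounds
```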

2. Challenges

  • How to achieve a strongly consistent data access interface?
    • Data replication algorithms. Strong consistency and high data availability cannot be achieved simultaneously. Traditionally, consistency comes first.
  • How to increase availability on failure-prone commodity clusters?
    • Optimistic replication techniques -> conflicting changes to be resolved -> WHEN and WHO?
      • WHEN: during writes or during reads?
        • Traditionally, conflicts are resolved during writes, keeping reads simple.
        • Dynamo: rejecting writes would hurt the customer experience, so writes are never rejected (always writable) and conflict resolution is pushed to reads.
      • WHO: the data store or the application?
        • the data store: limited choices, essentially only "last write wins"
        • the application/client: more flexible, since it understands the data; see the sketch after this list
  • Others
    • Incremental scalability
    • Symmetry (P2P)
    • Decentralization
    • Heterogeneity
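
To make the WHO choice concrete, here is a minimal sketch (with hypothetical names, not Dynamo's code) contrasting data-store-side "last write wins" with application-side merging:

```python
from dataclasses import dataclass

@dataclass
class Version:
    value: set          # e.g. a set of items the application stores under one key
    timestamp: float    # wall-clock time of the write

def lww(a: Version, b: Version) -> Version:
    """Data-store-side resolution: 'last write wins' keeps only the newer
    version, silently dropping whatever the older write contained."""
    return a if a.timestamp >= b.timestamp else b

def app_merge(a: Version, b: Version) -> Version:
    """Application-side resolution (hypothetical): the client knows the value
    is a set, so it can union the divergent versions instead of dropping one."""
    return Version(a.value | b.value, max(a.timestamp, b.timestamp))

v1 = Version({"x"}, timestamp=1.0)
v2 = Version({"y"}, timestamp=2.0)
print(lww(v1, v2).value)        # {'y'}       -- the 'x' write is lost
print(app_merge(v1, v2).value)  # {'x', 'y'}  -- both writes survive
```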

3. Solution

  • Key ideas
    • Sacrifice consistency under certain failure scenarios while remaining eventually consistent
    • Object versioning (see the vector-clock sketch below)
    • Application-assisted conflict resolution
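
A minimal sketch of object versioning with vector clocks, assuming a toy clock represented as a dict from node name to counter (node names and API are illustrative, not Dynamo's actual implementation):

```python
def increment(clock: dict, node: str) -> dict:
    """Return a new clock with this node's counter bumped (done on each write
    the node coordinates)."""
    new = dict(clock)
    new[node] = new.get(node, 0) + 1
    return new

def descends(a: dict, b: dict) -> bool:
    """True if clock `a` is equal to or newer than `b` on every node, i.e.
    version `a` subsumes version `b` (syntactic reconciliation is possible)."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def concurrent(a: dict, b: dict) -> bool:
    """Neither version descends from the other: the divergence must be handed
    back to the application to reconcile."""
    return not descends(a, b) and not descends(b, a)

v1 = increment({}, "A")     # {"A": 1}
v2 = increment(v1, "A")     # {"A": 2} -- descends from v1, so v1 can be dropped
v3 = increment(v1, "B")     # {"A": 1, "B": 1} -- written on another node
print(descends(v2, v1))     # True
print(concurrent(v2, v3))   # True -> get() returns both, the client reconciles
```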

3.1 System Architecture

Summary of problems, techniques, and advantages:

| Problem | Technique | Advantage |
| --- | --- | --- |
| Partitioning | Consistent hashing | Incremental scalability |
| High availability for writes | Vector clocks with reconciliation during reads | Version size is decoupled from update rates |
| Handling temporary failures | Sloppy quorum and hinted handoff | Provides high availability and durability guarantee when some of the replicas are not available |
| Recovering from permanent failures | Anti-entropy using Merkle trees | Synchronizes divergent replicas in the background |
| Membership and failure detection | Gossip-based membership protocol and failure detection | Preserves symmetry and avoids having a centralized registry for storing membership and node liveness information |
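
As a rough illustration of the anti-entropy row above: two replicas can compare Merkle (hash) trees and exchange data only for the key ranges whose hashes differ. The sketch below uses a single small tree over a handful of keys and a simplified comparison; Dynamo keeps a separate tree per key range.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.md5(data).digest()

def build_merkle(leaves):
    """Build the tree as a list of levels, leaves first; each parent hashes its
    two children. This toy assumes the leaf count is a power of two."""
    levels = [[h(leaf) for leaf in leaves]]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([h(prev[i] + prev[i + 1]) for i in range(0, len(prev), 2)])
    return levels

def diverging_leaves(t1, t2):
    """If the roots match, the replicas are in sync and no data is exchanged.
    Otherwise this sketch simply compares leaf hashes; a real implementation
    walks down the tree so only differing subtrees are visited."""
    if t1[-1] == t2[-1]:
        return []
    return [i for i, (a, b) in enumerate(zip(t1[0], t2[0])) if a != b]

replica_a = [b"k1=v1", b"k2=v2", b"k3=v3", b"k4=v4"]
replica_b = [b"k1=v1", b"k2=STALE", b"k3=v3", b"k4=v4"]
print(diverging_leaves(build_merkle(replica_a), build_merkle(replica_b)))  # [1]
```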

  • System Interface
    • get(key) returns a single object, or a list of conflicting objects along with a context
    • put(key, context, object) // the context encodes system metadata about the object, such as its version
    • Both context and object are opaque to Dynamo
    • The key is hashed with MD5 to yield a 128-bit identifier that determines the responsible storage nodes
  • Partitioning
    • A variant of consistent hashing with virtual nodes: each physical node is assigned multiple positions (tokens) on the ring; see the ring sketch after this list.
  • Replication
    • The coordinator node for a key stores it locally and replicates it at the next N-1 distinct nodes clockwise on the ring.
    • The N physical machines responsible for a key form its preference list.
  • Data Versioning
    • Vector clocks: each put() creates a new, immutable version; divergent versions are detected via vector clocks and reconciled during reads (see the sketch under Key ideas).
  • (Successful) Execution of put() and get()
    • HTTP request
    • How to select a node? two ways:
      1. a generic load balancer routes the request to a random node in the ring; if that node is not in the top N of the requested key's preference list, it forwards the request to the first of those top N nodes
      2. a partition-aware client library routes the request directly to the coordinator
    • R: min number of nodes that must participate in a successful read.
    • W: min number of nodes that must participate in a successful write.
    • R+W>N -> quorum-like system
    • How to define a successful write?
      • The coordinator writes the new version locally and sends it to the remaining replicas; the write succeeds once at least W-1 of them respond (W nodes in total, counting the coordinator). See the quorum sketch after this list.
  • (Unsuccessful) Execution of put() and get()
    • Sloppy quorum: reads and writes operate on the first N healthy nodes in the preference list, which may not always be the first N nodes on the ring.
    • Hinted handoff: a node that accepts a replica on behalf of an unreachable node stores it with a hint and delivers it back once the intended node recovers.
  • Membership
    • DHT ring with virtual nodes; a gossip-based protocol propagates membership changes and node-to-token mappings, so every node can route requests directly (zero-hop DHT).
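
The ring sketch referenced above: a minimal consistent-hashing ring with virtual nodes that builds a key's preference list of N distinct physical nodes. The token scheme (random tokens per virtual node) and all names are illustrative assumptions, not Dynamo's exact code.

```python
import bisect
import hashlib

class Ring:
    """Consistent-hashing ring with virtual nodes (a sketch, not Dynamo's code).
    Each physical node gets `tokens_per_node` positions on the ring; a key is
    stored on the first N distinct physical nodes clockwise from its hash."""

    def __init__(self, tokens_per_node: int = 8):
        self.tokens_per_node = tokens_per_node
        self.ring = []                 # sorted list of (position, physical_node)

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)  # MD5, as in the notes

    def add_node(self, node: str):
        for i in range(self.tokens_per_node):
            bisect.insort(self.ring, (self._hash(f"{node}#vnode{i}"), node))

    def preference_list(self, key: str, n: int = 3) -> list:
        """First N distinct physical nodes clockwise from hash(key); the first
        one acts as the coordinator, the rest hold replicas."""
        start = bisect.bisect(self.ring, (self._hash(key),))
        nodes = []
        for offset in range(len(self.ring)):
            node = self.ring[(start + offset) % len(self.ring)][1]
            if node not in nodes:
                nodes.append(node)
            if len(nodes) == n:
                break
        return nodes

ring = Ring()
for node in ["nodeA", "nodeB", "nodeC", "nodeD"]:
    ring.add_node(node)
print(ring.preference_list("some-key", n=3))  # e.g. ['nodeC', 'nodeA', 'nodeD'], order depends on hash positions
```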
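And the quorum sketch referenced above: a toy coordinator that writes locally, counts W-1 replica acknowledgements for put(), and reads from R replicas for get(). Replicas are plain in-memory dicts; versioning, sloppy quorum, and hinted handoff are all omitted, so this only illustrates the R/W/N accounting.

```python
class Replica:
    """In-memory stand-in for a storage node (illustrative only)."""
    def __init__(self, name: str):
        self.name = name
        self.store = {}

def put(key, value, preference_list, n=3, w=2):
    """Coordinator = first node in the preference list. It writes locally, then
    the write is reported successful once at least W-1 of the other replicas
    acknowledge (W nodes in total, counting the coordinator)."""
    coordinator, others = preference_list[0], preference_list[1:n]
    coordinator.store[key] = value      # local write
    acks = 0
    for replica in others:              # sent in parallel in a real system
        replica.store[key] = value
        acks += 1
        if acks >= w - 1:
            return True                 # quorum reached; the rest would be updated asynchronously
    return acks >= w - 1

def get(key, preference_list, n=3, r=2):
    """A read succeeds once R of the N replicas respond; divergent versions
    would all be returned to the caller (versioning omitted in this sketch)."""
    replicas = preference_list[:n]
    return [rep.store.get(key) for rep in replicas[:r]]   # first R responses

nodes = [Replica(f"node{i}") for i in range(3)]
put("cart:42", {"item"}, nodes, n=3, w=2)
print(get("cart:42", nodes, n=3, r=2))  # [{'item'}, {'item'}]
```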

3.2 Implementation

3.3 Lessons Learned

  • Dynamo’s configurations
    • Business logic specific reconciliation
    • Timestamp based reconciliation
    • High performance read engine
  • N, R, W
    • N -> durability of each object
    • W, R -> availability, durability, and consistency
    • Dynamo: (3,2,2)
  • Performance or Durability?
    • Durability can be traded for performance: writes go to an in-memory object buffer that a writer thread periodically flushes to disk, which lowers the 99.9th-percentile latency by a factor of 5 (at the risk of losing buffered writes on a crash); see the buffered-write sketch after this list.
  • Load balance
    1. T random tokens per node and partition by token value.
      • Partitioning and partition placement are intertwined, which causes problems: a) bootstrapping a new node is slow, because it must acquire its key ranges from other nodes and this runs at low priority so as not to affect customer-facing performance; b) key ranges change when nodes join or leave, so the Merkle (hash) trees must be recalculated; c) because token positions are random, there is no easy way to snapshot/archive the entire key space.
    2. T random tokens per node and equal sized partitions.
    3. Q/S tokens per node, equal-sized partitions: decouples partitioning from placement, giving faster bootstrapping/recovery and easy archiving (see the strategy-3 sketch after this list).
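
The buffered-write sketch referenced above (a toy with assumed names, not Dynamo's storage engine): put() only appends to an in-memory buffer and a background writer thread flushes it later, which is why latency improves at the cost of durability.

```python
import threading
import time
from collections import deque

class BufferedStore:
    """Sketch of the performance-vs-durability trade-off: put() only appends to
    an in-memory buffer, and a background writer thread flushes it to 'disk'
    periodically. A crash before a flush loses the buffered writes."""

    def __init__(self, flush_interval: float = 1.0):
        self.buffer = deque()
        self.disk = {}                          # stand-in for durable storage
        self.lock = threading.Lock()
        threading.Thread(target=self._writer, args=(flush_interval,),
                         daemon=True).start()

    def put(self, key, value):
        with self.lock:
            self.buffer.append((key, value))    # returns fast: no disk I/O here

    def _writer(self, interval: float):
        while True:
            time.sleep(interval)
            with self.lock:
                while self.buffer:
                    key, value = self.buffer.popleft()
                    self.disk[key] = value      # the only durable write

store = BufferedStore(flush_interval=0.1)
store.put("k", "v")
time.sleep(0.3)                                 # give the writer thread a chance
print(store.disk)                               # {'k': 'v'}
```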
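And the strategy-3 sketch referenced above: Q equal-sized partitions over the MD5 space, with Q/S partitions assigned to each of S nodes. The values of Q and S and the round-robin placement are illustrative choices, not the paper's.

```python
# Strategy 3: the hash space is split into Q equal-sized partitions and each of
# the S nodes takes Q/S of them. Placement can change without re-partitioning.
Q, S = 12, 4
SPACE = 2 ** 128                   # MD5 output space
partition_size = SPACE // Q        # equal-sized partitions, fixed at runtime

nodes = [f"node{i}" for i in range(S)]
assignment = {p: nodes[p % S] for p in range(Q)}   # Q/S = 3 partitions per node

def owner(key_hash: int) -> str:
    """Map a key's 128-bit hash to the node owning its partition."""
    return assignment[min(key_hash // partition_size, Q - 1)]

# Each node is responsible for exactly Q/S partitions in this toy placement.
print([sum(1 for n in assignment.values() if n == node) for node in nodes])  # [3, 3, 3, 3]
```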

4. Conclusion

  • “Always writable” asynchronous replication
  • Updates may not propagate to all replicas
  • put() creates a new and immutable version of data
  • get() may return multiple versions of data
  • Reconcile divergent versions
  • When: during reads
  • Who: the system itself (syntactic reconciliation) or the client application (semantic reconciliation)
