Snapshot Sync Proposal

Author: SeongChan Lee <sclee@codechain.io>

Created: 2019-10-08

1. Motivation

The current CodeChain 2.0 implementation requires new nodes to replay the entire history of the blockchain from the genesis block and synchronize with the best chain’s state. This initial sync process consumes a large amount of resources, and as the blockchain keeps growing, so do the resources it requires.

We introduce snapshot sync in CodeChain to reduce the resources used during the initial block synchronization process such as time, CPU, network bandwidth, disk space, and disk I/O usage.

2. Design

2.1. Non-goal

This proposal doesn’t attempt to solve the nothing-at-stake long-range attack using a fabricated snapshot/history, which is an inherent vulnerability in a PoS network. Unlike Bitcoin, most PoS consensus algorithms are not fork-agnostic, so we assume that a fresh node has trusted bootstrap full nodes.

2.2. Overall sync process

  1. A fresh node knows the snapshot_block_header, the header of the block at which a snapshot was taken; it gives the block hash (snapshot_block_hash) and the state root (snapshot_state_root). The header can be safely obtained from a trusted source given the snapshot_block_hash.

  2. Download a snapshot

    1. A snapshot is made of multiple chunks. Start by downloading the chunk whose Merkle root is the snapshot_state_root:
      GetSnapshotChunks([snapshot_block_hash, [merkle_root]])
    2. Build a Merkle tree from a downloaded chunk to reconstruct the chunk root. The chunk can be verified by comparing the reconstructed Merkle root with the Merkle root we know.
    3. Not all paths in a chunk from a Merkle root end with a Leaf node. They can end with a Branch node since the chunk doesn’t contain a full tree. The Branch nodes contain Merkle roots of child chunks. The Merkle roots are queued to fetch further chunks.
    4. Dequeue a Merkle root from the queue and repeat the previous steps with it until the queue is empty (see the sketch after this list).
  3. The state is now restored. Start block synchronization from the snapshot_block_header to keep up with the tip.
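A minimal sketch of this chunk download loop in Rust. The Chunk struct, the fetch closure (which would send GetSnapshotChunks and wait for the response), and the reconstruct_root helper are illustrative assumptions, not the actual CodeChain types:

use std::collections::VecDeque;

// Hypothetical types for illustration: a 256-bit Merkle root and a downloaded chunk.
type MerkleRoot = [u8; 32];

struct Chunk {
    // Terminal nodes of the chunk, as received from a peer.
    terminal_nodes: Vec<Vec<u8>>,
    // Merkle roots found in terminal Branch nodes; they identify child chunks.
    child_roots: Vec<MerkleRoot>,
}

// Restores the state below `snapshot_state_root`. `fetch` downloads the chunk for a
// Merkle root (e.g. via GetSnapshotChunks) and `reconstruct_root` rebuilds the chunk's
// Merkle root from its terminal nodes.
fn restore_snapshot(
    snapshot_state_root: MerkleRoot,
    mut fetch: impl FnMut(MerkleRoot) -> Chunk,
    reconstruct_root: impl Fn(&Chunk) -> MerkleRoot,
) {
    let mut queue = VecDeque::from([snapshot_state_root]);
    while let Some(root) = queue.pop_front() {
        let chunk = fetch(root);
        // Verify the chunk by comparing the reconstructed root with the expected one.
        assert_eq!(reconstruct_root(&chunk), root, "chunk verification failed");
        // Terminal Branch nodes point to child chunks; queue them for further fetching.
        queue.extend(chunk.child_roots.iter().copied());
        // ... persist chunk.terminal_nodes into the local state database here ...
    }
}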

2.3. Snapshot

[Figure: a diagram of a state snapshot and its chunks. It is a smaller version (children = 4, D = 3) for illustrative purposes.]

2.3.1. Chunk

A chunk(root, D) is the set of all nodes in a Merkle tree that satisfy both of these conditions (see the sketch below):

  • A node N is a descendant of the root
  • depth(root) <= depth(N) < depth(root) + D

The nodes of chunk(root, D) satisfy the following condition:

  • The Merkle root of chunk(root, D) is the root.
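The depth conditions above can be written down directly. A minimal sketch in Rust, encoding only the depth bounds (the function names are illustrative, and the "descendant of the root" check is omitted):

// The proposed chunk depth; see the discussion of D below.
const D: usize = 4;

// A descendant N of `root` belongs to chunk(root, D) iff
// depth(root) <= depth(N) < depth(root) + D.
fn within_chunk_depth(node_depth: usize, root_depth: usize) -> bool {
    root_depth <= node_depth && node_depth < root_depth + D
}

// A Branch node at the deepest allowed level is a terminal node of the chunk;
// its children are the roots of child chunks. (Leaf nodes inside the chunk are
// terminal as well.)
fn at_deepest_chunk_level(node_depth: usize, root_depth: usize) -> bool {
    node_depth == root_depth + D - 1
}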

In this format, we can verify each chunk individually by reconstructing the Merkle tree of the chunk and comparing its Merkle root with the known Merkle root. If you verify chunks from the trusted root toward the leaves, you extend the trust toward the leaves, which means you can verify the entire state incrementally.

[Figure: a chunk compared with a Merkle subtree]

The figure above illustrates how a chunk and a subtree are different. A chunk is a generalization of a Merkle subtree, and thus, all subtrees are also chunks. A subtree contains all nodes of a tree that can be reached from a subtree root, while a chunk contains nodes that can reconstruct a chunk root by building a Merkle tree.

To avoid confusion, we call nodes at the end of a path in a chunk ‘terminal nodes’ (red ones in the figure) and the others, ‘non-terminal nodes’ (blue ones in the figure). A node in a Merkle tree is either a Branch node or a Leaf node. A Leaf node is a node that doesn’t have a child. Therefore, all non-terminal nodes are Branch nodes, but a terminal node can be either a Leaf or a Branch.

The depth of a chunk, which can be retrieved during reconstruction, cannot exceed the predefined constant D. D should neither be too large nor too small. A large D makes a chunk bigger, which will be a great burden for a node to process. A small D makes a chunk smaller, which will cut a state into more chunks and increase network and storage overheads. Thus, D = 4 is proposed.

If D = 4, a chunk can have at most 16^4 = 65536 children. If every node is a Branch, then the chunk’s size will be about (32 bytes per Merkle root * 16 Merkle roots per Branch) * 65536 = 32MB. However, a Leaf node doesn’t have a size limit by itself, so a chunk might grow indefinitely. In practice, Leaf nodes usually contain a UTXO, which has an upper bound on its size (imposed by the block limit), so this won’t be a problem under normal circumstances with trusted bootstrap nodes. (TODO: should we have mitigation?)

We only store terminal nodes in a chunk to reduce duplication, since non-terminal nodes can be reconstructed while building the Merkle tree from the chunk.

2.3.2. Comparison to other snapshot sync methods

  • Bitcoin AssumeUTXO: The snapshot proof is not a part of the consensus rule, so you have to sync blocks and replay them anyway.
  • Parity’s Warp Sync: The state is not verified incrementally while downloading chunks unless you trust the snapshot manifest, which is not a part of the consensus rule.
  • Go Ethereum’s Fast Sync: It is implemented by interactively querying the nodes of a state, which isn’t designed for static content distribution. Chunks, in contrast, can easily be distributed over conventional CDNs, BitTorrent, or IPFS.

2.4. Taking a snapshot

Start with a state root as a chunk root r. Next, retrieve all nodes that are either a Leaf L with depth(r) <= depth(L) < depth(r) + D, or a Branch B with depth(B) = depth(r) + D - 1. These are the terminal nodes of the chunk, and the set of these nodes becomes chunk(r, D). The children of terminal Branch nodes become new chunk roots, and chunks are made from their subtrees recursively:

// Pseudocode for taking a snapshot of a state.
roots <- set { state_root }
while let Some(root) = roots.pop_any() {
  let c = chunk(root, D);        // collect the terminal nodes within D levels of `root`
  assert!(root == merklize(c));  // the chunk must reconstruct its own Merkle root
  store(c);                      // persist the chunk (terminal nodes only)
  roots.insert(all children of terminal Branches in c);  // roots of further chunks
}

2.4.1. Recommendations for snapshot taking interval

In an ideal situation, snapshots would be available for every block. However, there are issues that prevent this. Storage is one such issue: snapshotting tasks might not be able to catch up with the block-producing job due to limited disk throughput. To mitigate this, a node takes a snapshot periodically. The period should be neither too long nor too short; the period p needs to be chosen to minimize the cost function E(snapshot cost) / p + E(block cost) * p. (TBD: the costs should be measured after the implementation)
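If the two expected costs are treated as constants, this cost function has a closed-form minimizer, which serves only as a sanity check on the shape of the trade-off: for f(p) = E(snapshot cost) / p + E(block cost) * p, setting f'(p) = -E(snapshot cost) / p^2 + E(block cost) = 0 gives p = sqrt(E(snapshot cost) / E(block cost)). The actual costs remain to be measured, as noted above.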

To maximize the availability of snapshots, the snapshot period should be aligned across nodes by default, so that different nodes take their snapshots at the same blocks.

Furthermore, the consensus module may have requirements as to which blocks are suitable for snapshots. For example, the current Tendermint PoS module implementation reads the state of the previous term when a term closes; therefore, the snapshot must be taken at a term close to avoid having to look up the state of the previous term.

For these reasons, the consensus module would suggest which blocks are suitable for snapshotting. Tendermint PoS may use term id % P to decide the blocks, where P is the period in which snapshots are taken (TBD).

2.5. Messages and formats

The message format follows this spec: https://github.com/CodeChain-io/codechain/blob/master/spec/Block-Synchronization-Extension.md

2.5.1. Chunk format

compress([terminal_node, ...])

An RLP-encoded list of terminal nodes, compressed with snappy.
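A minimal sketch of this encoding in Rust, assuming the rlp and snap crates purely for illustration (the actual implementation may use different RLP/snappy bindings):

// Chunk body = snappy-compressed RLP list of terminal nodes.
fn encode_chunk(terminal_nodes: &[Vec<u8>]) -> Vec<u8> {
    // RLP-encode the terminal nodes as a list...
    let rlp_bytes = rlp::encode_list::<Vec<u8>, _>(terminal_nodes);
    // ...then compress the encoding with snappy.
    snap::raw::Encoder::new()
        .compress_vec(&rlp_bytes)
        .expect("compressing an in-memory buffer should not fail")
}

fn decode_chunk(compressed: &[u8]) -> Result<Vec<Vec<u8>>, String> {
    // Decompress first, then decode the RLP list back into terminal nodes.
    let rlp_bytes = snap::raw::Decoder::new()
        .decompress_vec(compressed)
        .map_err(|e| format!("snappy: {:?}", e))?;
    rlp::Rlp::new(&rlp_bytes)
        .as_list::<Vec<u8>>()
        .map_err(|e| format!("rlp: {:?}", e))
}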

2.5.2. Deprecated messages

GetStateHead, StateHead, GetStateChunk, and StateChunk are deprecated in favor of GetSnapshotChunks and SnapshotChunks.

2.5.3. GetSnapshotChunks([block_hash, [merkle_root,…]])

Identifier: 0x0A

2.5.4. SnapshotChunks([[merkle_root, chunk], …])

Identifier: 0x0B

The response to GetSnapshotChunks([block_hash, [merkle_root,...]]). A [merkle_root, chunk] pair may be omitted when the chunk for the requested Merkle root is not available.
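An illustrative shape of the two messages as Rust types (the names and field types here are assumptions for the sketch, not the actual CodeChain definitions; the real encoding follows the Block-Synchronization-Extension spec linked above):

type BlockHash = [u8; 32];
type MerkleRoot = [u8; 32];

enum SnapshotMessage {
    // 0x0A: request the chunks rooted at `merkle_roots` of the snapshot block.
    GetSnapshotChunks { block_hash: BlockHash, merkle_roots: Vec<MerkleRoot> },
    // 0x0B: response; a (merkle_root, chunk) pair may be omitted when the
    // requested chunk is not available.
    SnapshotChunks(Vec<(MerkleRoot, Vec<u8>)>),
}

fn message_id(message: &SnapshotMessage) -> u8 {
    match message {
        SnapshotMessage::GetSnapshotChunks { .. } => 0x0a,
        SnapshotMessage::SnapshotChunks(_) => 0x0b,
    }
}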

See more

  • CodeChain Snapshot Sync Proposal, 2019-10-16 CodeChain TechTalk - slide

Hi. We are B-Harvest, validating on Cosmos Network and others.
We would like to point out some practical elements related to this issue.

  1. How do you incentivize nodes to provide such chunks?
  • No node will want to run this chunk-providing function because it is very resource-consuming.
  • Validators could be given the duty to provide a signature on each chunk.
  • Several decentralized entities could run a “chunk downloading center” across the globe, funded by the community fund.
  2. How can we trust each chunk?
  • Chunks can be floating around the network, but their origin is unknown, so no chunk is trustable.
  • Validators could add their signature on a chunk to declare that it is genuine.
  • Those signatures could be combined with signatures from different validators to strengthen the trustworthiness of a chunk.
  3. How does the “height” play a role in this chunking?
  • Users will want to download and validate a subset of historical data.
  • In that case, the system should be able to split the group of chunks into predefined height ranges.

I think this feature is a great approach with good utility. We hope to keep following your development. Thank you!

Hi, B-Harvest! Thank you for the feedback!

I’d like to let you know that the snapshot sync was developed as an optimization of the conventional block sync (it doesn’t replace all of its functionality yet). Therefore, it has the same incentives as the block sync, which is currently nothing. However, the snapshot sync is less resource-taxing than the full block sync, so it naturally carries more incentive than the conventional block sync. Thus, I think it is the conventional block sync that we should incentivize to maintain the soundness of the chain. I haven’t heard of any blockchain that particularly incentivizes block sync; let me know if you have heard of one.

Also, I want to let you know that taking snapshots is not only for validators. Any node can take a snapshot of the state and provide it (the same is true for the conventional block sync).

I think this is beyond what this proposal should cover. It’ll be IPFS or something in the end.

Chunks are just a subset of a Merkle tree. If you can trust the entire Merkle tree because you trust the Merkle root, then you can trust chunks in a similar manner. You don’t get random chunks floating around the network: you start by getting the specific chunk corresponding to the Merkle root in the trusted header, and you extend the trust towards the leaves after you verify each chunk.

I don’t understand what you mean by “height” and “subset of historical data”. If by “height” you mean the “depth” in this proposal, it’s just a limiting factor that prevents a chunk from growing too much; it’s the same ‘depth’ as in the tree data structure. If by “subset of historical data” you mean something like Ethereum SPV, i.e. sparsely populated trees, I think that is also worth another proposal.

Why is merkle_root included in the chunk response message? I think it can be reconstructed from the chunk.

Yes, right. It is duplicated information; I put it in there as a premature optimization. I think we can eliminate it while matching the response format with the other messages.

Issues with the Snapshot implementation

There were two significant patches required to make CodeChain use the snapshot sync. The first one is the reward calculation, which is patched here. The intermediate reward of the nth term was being calculated at the (n+1)th term, and this required reading all the blocks included in the nth term at the (n+1)th term’s end. The linked PR fixes this by saving the nth term’s intermediate reward to the state and reading that state item at the (n+1)th term’s end.

The second problem is retrieving the validator set of a block, which is patched here. In the previous implementation, the block’s header contained the signatures for the parent block, and the list of signers was recorded in the state trie of the grandparent block. This was not compatible with our snapshot sync implementation, so we added an additional state item called CurrentValidator. It contains the valid signers of the current block, so we can get the list of validators by simply reading the previous block’s state item.

There is only one unsolved problem that we found, which is related to the timelock of assets. The current timelock implementation expects the transaction that created an asset to exist when the time-locked asset is spent. However, this is not always true once snapshot synchronization is introduced. We came up with an idea that can partially solve this (it works when there are no time-locked assets at the era’s end), but it’s not implemented yet.

Snapshot Experiment Result

The benchmark was taken on the corgi network at block #3197000. Since measuring the exact time and resources required for a full block synchronization is not a trivial task, it is extrapolated from measurements taken while synchronizing the first few thousand blocks. The estimated time required for the full synchronization is 10h 40m. The snapshot synchronization took 3m 46s in this setting, so the initial setup became about 169x faster.

Measuring the improvement in space consumption requires a bit of calculation. The database constructed with the full block synchronization required 29GB of space, while the initial database constructed with the snapshot synchronization required 2.8MB. Considering only the initial database size, we made it about 10605x smaller.

However, the initial database size is not the most relevant metric in a real-world situation. In a typical setting, nodes that started with the snapshot sync go back to the full block sync mode after the snapshot sync is done, so we have to take the space consumption after the snapshot sync is finished into account. In the corgi network, the total block size in a term is currently about 2.4MB. There are not many state items changing in the corgi network, so the space consumption of the state trie is negligible compared to the blocks. The length of a term is 1 hour, so there are 720 terms in one month. Therefore, the database grows at roughly 1.7GB/month (= 720 * 2.4MB / 1024). If we assume that nodes reset their database with snapshot synchronization every month, we get a 17x improvement in space.

The cost of this improvement is the increased space requirement for the providers of the snapshots. In the default setting, a snapshot is taken every term and remains on disk for 1 month. In the configuration used for the benchmark, the generated snapshot occupied 2.3MB (3849 files). There are 720 terms in a month, so the total space overhead required for maintaining the snapshots is about 1.6GB. For nodes that were synchronized with a full block synchronization, this overhead is about 5% of the total space.

In conclusion, the snapshot synchronization reduces the initial setup time from 10h 40m to 3m 46s, which is 169 times faster. For nodes that reset their database with snapshot synchronization once a month, the space requirement is reduced from 29GB to 1.7GB, a 17x reduction. The overhead of maintaining one month’s worth of snapshots is roughly 1.6GB, which is about 5% of the entire database. Also note that these improvements grow as the blockchain grows, so the numbers will only get better over time.