On November 7, 2021 at around 4:00 UTC, a replica in subnet jtdsg fell behind and was unable to catch up again to the latest blockheight. Only after the finalization rate of that subnet was reduced to 0.5 via a subnet management proposal was the replica eventually able to catch up on November 10 at 11:00 UTC.
This blog post explains the underlying issues that caused the replica to not be able to catch up and why slowing down the finalization rate helped to resolve the issue, as well as medium- and longer-term changes to the replica software that will address these issues so that the subnet can operate again with a faster finalization rate.
Catching up to a subnet
When a replica falls behind, there are two mechanisms for it to catch up again. First, it can execute all state changes from the outdated state height it has available until it catches up with the consensus state height of the healthy replicas. This, however, works only if the replica fell not too far behind. In such cases, a second mechanism comes into play: the replica can fetch the state for the latest checkpoint height via the state sync protocol. These checkpoints are created at a regular interval of 500 blocks together with the catch up package (CUP), and are sufficient to (re-)join a subnet as a replica.
The incident in detail
Leading up to the incident, the state got so big that synching the state took longer to complete than a more recent state to sync to becoming available. Thus, the replica could never get the most recent state and got stuck in a loop trying to sync up to the latest state.
One of the main reasons that state sync struggled on subnet jtdsg in particular is that it has a subnet state of roughly 160GB (around the time of the incident), which makes it the Internet Computer’s largest subnet. For comparison, the next largest subnet state is around 65GB, and the NNS subnet is around 5GB.
Before state sync can begin, the healthy replicas need to compute the so-called manifest of the checkpoint, which is, roughly speaking, a special hash of the checkpoint that will be included in the CUP so that catching up replicas are able to verify the authenticity of the state they are catching up to. This step currently scales linearly with the size of the state, and took 5-7 minutes during the incident.
During state sync, a catching up replica reuses any chunks that did not change from its (outdated) checkpoint and only fetches new or modified chunks from the healthy replicas. During the incident, this process lasted around 9 minutes and consisted almost exclusively of copying unchanged chunks from the old checkpoint.
Given the durations of these two steps, the rejoining replica only had the checkpoint available 14 or more minutes after the healthy replicas had started computing the manifest. In comparison, the CUP interval is 500 blocks, hence at a finalization rate of 1 block per second, there was a new checkpoint every 500 seconds ≈ 8 minutes.
Immediate fix: Reducing finalization rate
As an immediate fix, an NNS proposal to reduce the finalization rate in the affected subnet from 1 block per second to 0.5 was made. The affected replica managed to catch up shortly after the proposal was accepted and executed.
The change addressed the issue in two ways. First, a reduced finalization rate doubles the time between checkpoints, hence when a state sync completes it is not already several checkpoints old. This gives more time to the state sync protocol to complete.
Second, the last step of a successful rejoin consists of the rejoining replica executing the latest blocks. On the healthy replicas, the rate of execution is limited by the finalization rate, while on the rejoining replica it is only limited by its own hardware. By limiting the finalization rate on the healthy replicas, the rejoining replica is better able to use this as a means to catch up.
While reducing the finalization rate fixed the issue in the short term, achieving a consistent finalization rate of 1 block per second remains the goal, hence more long-term solutions are needed.
Medium-term changes: Optimizing state sync and manifest computation
A number of improvements to the Internet Computer Protocol and/or implementation are being rolled out to all subnets that address the issue more fundamentally.
One set of optimizations targets the step of computing hashes of individual chunks of the state. We compute these hashes at two places in the state sync process. On healthy replicas, we need to compute the manifest (representing a special hash of the checkpoint). On the rejoining replica, we verify all chunks of the state against the manifest when copying or fetching. In both these cases, the bottleneck was the CPU. By making use of parallelism it was possible to speed up these components by a factor of 2x-3x according to initial tests. Now the limiting factor is disk IO. Parallelizing the manifest computation is included in the version that is currently being rolled out, parallelizing validating chunks against the manifest is scheduled for the next release.
Another set of optimizations target the issue of failing state syncs due to the healthy replicas deleting old checkpoints. A state sync can only continue as long as the healthy replicas still have that checkpoint. For a healthy replica, only the most recent checkpoint is needed for its operation, hence it can freely delete old checkpoints. While the recent incident did not trigger aborted state syncs, it is a problem that can be caused by large subnet states, especially if a replica is joining without an existing state to copy chunks from. In those cases, the entire state needs to be transmitted over the network. Our changes include keeping checkpoints on disk for longer, as well as improved caching behavior by reusing chunks retrieved during earlier, aborted state syncs.
More upcoming changes
We have a slate of additional optimizations that are being worked on but didn’t make the most recent replica version proposal.
To further speed up manifest computation, we will propose to not recompute every chunk hash for each checkpoint. Instead, we will reuse the hashes from earlier manifest computations if the corresponding state chunks did not change. This makes the time complexity of manifest computations proportional to the number of changes since the last checkpoint, rather than the full state size. In a similar vein, we are looking into omitting the hashing of chunks in the rejoining replica in cases where the same hash was computed as part of an earlier state sync.
We are also optimizing parts of the code that need to copy files on disk, as this is where the bottleneck often lies. This is particularly relevant to a rejoining replica with a fairly recent checkpoint already on disk, which was exactly the situation of the recent incident.
Thirdly, we are thinking about how to optimize the logic of when a replica attempts to catch up by replaying blocks vs. triggering a state sync. By not attempting both, we’d reduce the load on the underlying hardware.
With the first part of the parallelization optimizations included in the most recently blessed release and the second part being targeted for the next release, we keep monitoring the effect that our changes have on the efficiency of state syncs. If the parallelization changes bring the effect we measured in testing, we hope to be able to make a proposal to restore the finalization rate of subnet jtdsg after they have been rolled out. Otherwise we will keep iterating by rolling out further changes outlined above or resulting from discussions with the community.