The twelve minutes problem

When a block in a DAG is mined on top of a very old state, by some faulty or malicious miner, validating it might consume too many resources. Basically, validating a block amounts to rolling back the current state, undoing all updates that happened outside its context (:= the set of blocks in its past), and checking whether its own txns are consistent with the UTXO set of its context. If the block is mined on an old part of the DAG, this process is prohibitive. This problem came to be known among the core devs as the twelve minutes problem, which is a much more descriptive name than the opaque "CPU attack".

To be clear, this is not an attack on consensus, as this outlier block wouldn’t cause a reorganization of the ordering of events (assuming you use a secure DAG algorithm such as GHOSTDAG etc.). But our insistence on checking that every block in the DAG is valid implies that we must roll back and effectively simulate a reorg in order to calculate the UTXO set per the context of each and every block, even though we are only interested in the UTXO set of the entire DAG (aka of the virtual block of the DAG); this is both inefficient and opens up a CPU attack vector, even for attackers with a very small hash rate.
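To make the rollback concrete, here is a minimal sketch of restoring a block’s context state by undoing the diffs of all blocks outside its past. The names and the plain-dict UTXO model are hypothetical stand-ins, not the actual node code:

```python
# Hypothetical sketch: each block outside the context carries a diff recording
# the outputs it added and the outputs it spent (removed). Restoring a block's
# context means undoing all of these, so the cost grows linearly with how far
# back in the DAG the block was mined.

def context_utxo(virtual_utxo, diffs_outside_context):
    """Roll the virtual UTXO set back to a block's context."""
    utxo = dict(virtual_utxo)  # outpoint -> output; copy so the virtual stays intact
    for diff in diffs_outside_context:
        for outpoint in diff["added"]:   # undo outputs created outside the context
            utxo.pop(outpoint, None)
        utxo.update(diff["removed"])     # restore outputs spent outside the context
    return utxo
```

An honest tip references a recent context, so this loop is short; a block mined on a months-old state forces the node to walk it over every block in between.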

This is a problem unique to blockDAGs, because in blockchains you only care about the state of a block if it is part of the canonical chain, which usually implies that the calculation of its state is straightforward.

This problem admits three possible solutions:

  1. Use a [small] finality window. This is the simplest solution, and the “dirtiest” one—it caps the depth of the reorg. However, we want the finality window size to depend on assumptions on the network and on social consensus, rather than on our algorithmic ability to reorg. More on this here [link to be added]. The same goes for a pruning window. So this is not the preferred option.
  2. Allow invalid blocks. I.e., disable the check of whether a block is valid, and, instead, just compute the state of the UTXO set of the virtual block of the DAG. This is essentially going with the data availability (DA) model, where blocks’ data are not validated in the base consensus [other post expanding on this coming soon].
  3. Allow invalid Red blocks. This is my preferred option. I propose a new rule: block X is invalid if its transactions are inconsistent with the UTXO set of its past, or if it points at an invalid block Y such that Y is in X’s blue set (i.e., the blue set of the DAG that consists of X.past). Specifically, we allow block X to point at block Y and skip checking Y’s validity in case Y is not Blue according to X. In other words, you only need to validate your own block’s Blue blocks. The cost savings come from the fact that the diff between the state of the virtual and the state of new Blue blocks is small (read: constant), by construction. Note that it could be the case that X1 and X2 both point at an invalid block Y, but only X2 is invalidated by that – in case Y is in the blue set of X2 but not in that of X1. Note further that it’s okay for block Z to point at X2 and not check whether Y – and, by implication, X2 – is valid or not, unless X2 is in the blue set of Z (in which case the state of the DAG Z.past would have a small diff from the states of the DAGs Y.past and X2.past).
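As a sketch of the proposed rule, with a hypothetical `Block` type and a `blue_of` oracle standing in for GHOSTDAG’s blue-set computation (none of this is the actual codebase):

```python
# Sketch of option 3's validity rule. blue_of(b) is assumed to return the set
# of Blue blocks of the DAG b.past, as computed by GHOSTDAG; txns_consistent
# stands for "this block's txns are consistent with the UTXO set of its past".

class Block:
    def __init__(self, parents, txns_consistent):
        self.parents = parents
        self.txns_consistent = txns_consistent

def is_valid(block, blue_of, cache=None):
    cache = {} if cache is None else cache
    if block not in cache:
        cache[block] = block.txns_consistent and all(
            is_valid(parent, blue_of, cache)
            for parent in block.parents
            if parent in blue_of(block)  # Red parents are never validated
        )
    return cache[block]
```

With an invalid Y, a block X1 that sees Y as Red stays valid while a block X2 that sees Y as Blue is invalidated; a Z that sees X2 as Red remains valid too, matching the X1/X2/Z example above.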

Intuitively, there are two extremes. One is the DA model (option 2), where you do not validate blocks at all, which is equivalent to saying that you calculate only the UTXO set of the virtual block; I’ll expand on the challenges of this in a separate post, but for now note that it seems incompatible with SPV mode support. The other extreme is to validate all blocks, i.e., calculate the UTXO set of every block, as in our current design, and then you are open to CPU attacks on the same order of magnitude as the finality window (option 1). The middle ground I propose (option 3) is to validate the UTXO set of every Blue block; it supports SPV nodes (which calculate the set of Blue blocks via headers) and avoids CPU attacks.


As usual, thanks to Alex Tran for helping out with the write-up!

This post describes an implementation vulnerability in a given consensus definition, and the solutions it suggests are changes to that consensus definition.

In my opinion, this is going two steps at once, which is not necessarily required.
Before the current consensus definition is declared infeasible, a few questions should be asked.

  1. Is the “twelve minutes problem” a requirement bug or implementation bug?
  2. Can this problem be solved by implementation/engineering optimizations?
  3. Does the suggested new consensus completely eliminate the problem?
  4. Should this problem not occur, would the current consensus be preferred over the suggested solutions?

Let us dig somewhat deeper into the details.

Requirement vs. Implementation

If the root cause of the problem is not the definition of the required consensus algorithm, but the specific implementation under which it was detected, there is no reason to change the requirement (i.e. the consensus algorithm), but only to solve the bug.

In this specific case, most of the time was consumed due to a wrong implementation of the algorithm. Had the right implementation been used in the first place, the problem would have been referred to as the (much less impressive) “just-a-bit-more-than-a-second problem”.
Not that this is no problem in a system running at a 1 block/sec rate; but it puts the problem into its right proportions. Had this happened to a block even deeper in the past, the time could have increased from 1 sec to a few seconds. This is indeed a problem, but it makes sense that there might be other means to solve it.
Which brings us to the next section.

Optimization

Now that we understand that the calculation of UTXO diffs for tens of thousands of blocks is within the boundaries of a few seconds, it seems that there might be implementation solutions.

For example, suppose that we support up to 10^6 blocks in the past. In that case, using the fixed-yet-naive implementation would involve 10^6 UTXO diff calculations (which, based roughly on ~1 sec for ~40,000 blocks, would take about 25 sec). However, we can optimize the implementation by saving the accumulated diff of each group of 1000 blocks. Calculating each such group diff takes (roughly, based on the same assumptions) ~25 ms, and needs to happen only once every ~1000 sec.
Then, when there’s a need to calculate the UTXO set of 999,999 blocks ago, it only involves 1998 diff calculations: 999 diffs of the accumulated groups, and then 999 more diffs of blocks within the last group.
All in all, this should take about 50 ms.

(Of course, the diffs are applied in DAG form and not as a plain chain, which will require some modifications; the above only presents the idea.)

Theoretically, this might work, but it depends on the actual expected size of the UTXO diffs and on whether they exceed the RAM limits. Still, this is an example of an elegant solution that does not involve any change in the consensus definition.
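A toy version of this idea, with simplified key→value diffs on a plain chain (illustrative only, not the node’s data structures):

```python
# Toy sketch of the group-diff idea on a plain chain: block_diffs[i] undoes the
# i-th most recent block, group_diffs[g] is the accumulated undo-diff of the
# g-th most recent full group of M blocks. A value of None means "delete key".

def apply_diff(state, diff):
    for key, value in diff.items():
        if value is None:
            state.pop(key, None)
        else:
            state[key] = value
    return state

def restore(state, block_diffs, group_diffs, depth, M):
    """Undo `depth` blocks: whole-group diffs first, then single-block diffs,
    so only depth // M + depth % M diffs are applied instead of `depth`."""
    state = dict(state)
    groups, rest = divmod(depth, M)
    for g in range(groups):
        state = apply_diff(state, group_diffs[g])
    for i in range(rest):
        state = apply_diff(state, block_diffs[groups * M + i])
    return state
```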

Eliminating the Problem

While the proposed suggestions solve the problem as it appeared in practice, we should understand whether they completely prevent it from appearing. If there might be other circumstances where even the new consensus suggestion requires the calculation of tens of thousands of blocks or more, then the problem is not solved.
For example, if the new consensus suggestion still allows DAG reorgs that might involve this number of blocks, then the solution is incomplete, at most.

Is it Better?

This is a question I do not feel qualified to answer.
If the new suggested consensus has other reasons to be preferred over the current one, then it makes sense to adopt it (even if it does not completely solve the problem).
If it doesn’t, then it is probably not a good idea to update the original requirement just because of some engineering issues that might be solved at the implementation level.

Is it?

For that matter, I believe that if algorithm A is better than algorithm B to some extent but not in a meaningful way, while also being more complex, then the simpler algorithm wins.

a. The consensus security really stems from the robustness of the UTXO set of the virtual block. So validating red blocks may not be a requirement bug per se, but it does highlight a requirement redundancy.

b. I believe it is not an implementation bug, and that any implementation would consume \Omega(N) (where N is the finality window) either in time or in space. Of course, a good implementation would take a few seconds, as you said, and not twelve minutes, but even that would (i) consume linear resources, and (ii) probably still be prohibitive (aside from being redundant, as there is no direct gain from this computation).

The solution you described does not seem to solve the problem; recall that we maintain diffs from the virtual (= from the most updated state) and not from the genesis, so keeping a UTXO diff snapshot every 1000 blocks would require updating all of these diffs with every new update.

c. Does the new suggested consensus completely eliminate the problem? Yes, it allows for an implementation that is guaranteed to consume O(1) resources, in expectation. Indeed, under the new rule, only real reorgs (and not simulated ones) require rolling back the state; the probability of a reorg at depth N is O(e^{-c\cdot N}), and it consumes O(N) resources, and therefore the expected resource consumption is O(1).
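Spelling the expectation argument out: a reorg of depth N occurs with probability O(e^{-c\cdot N}) and costs O(N) to process, so

E[cost] \le \sum_{N\ge 1} O(N\cdot e^{-c\cdot N}) = O(1),

since \sum_{N\ge 1} N\cdot x^N converges for any x = e^{-c} < 1.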

Ideally, our node should be able to support a real reorg at any depth above the finality window; I believe it’s a separate discussion, because deep (exponentially unlikely) reorgs justify reverting to other measures, such as reading snapshots of the (full) UTXO set from disk etc.

Thanks @hashdag for the answers.

I can see that your answers are highly theoretical/research focused, while the problem still, in my opinion, lies in the engineering/implementation domain.

  1. Consuming \Omega(N) resources is usually not a problem on modern machines, unless N is enormously large or the constant (i.e. the resources per single item) is considerably high.
    Therefore, supporting, say, up to 10^6 blocks (I have no idea what a reasonable number for that is), where the space required per block is 1 kilobyte, consumes less than 1GB of memory, which is acceptable for common machines, unless there are additional resources taken by other parts of the system.
    Using the same logic, if a single operation requires 1000 CPU instructions, then the number of instructions required for 10^6 blocks is 10^9, which takes less than 0.5 sec on most modern machines.
    The only important thing is to limit the size of N, which should be done by the definition of finality.

  2. Regarding the concern about the need to update the “group diffs” (i.e., the accumulated diff of each group of M blocks, e.g. 1000 blocks): there is no need to update them with every block; there is no reason not to do it every M blocks. It does not really matter if you start counting from the genesis or from the virtual tip. I have argued that getting to the deepest block requires going over all the group diffs and then over the block diffs within the last group. So if the first group diff is not at the virtual tip but somewhere deeper, just go over the block diffs from the virtual tip to the first group diff, then over all group diffs, and finally over the block diffs within that group. So instead of being limited by 2*M it will be limited by 3*M (assuming that M=\sqrt N). Not really different.

  3. The expected value in this context is irrelevant. It is somewhat like the statistician who drowned in a swimming pool that is one foot deep on average.
    If a given situation might kill the process, then the fact that it rarely happens is irrelevant. For example, suppose that our machine cannot exceed 4GB of RAM, and that we have to choose between two competing algorithms: one requires 2GB all the time, while the other requires only 1GB most of the time but, for 1 sec per hour, requires 36GB. It is easy to see that the expected usage of the second algorithm is only 1.01GB, which is much less than the 2GB of the first. Nevertheless, the first algorithm will run smoothly, while the second will crash out of memory within an hour.
    In our case, the resource is CPU time rather than memory, but it’s the same idea.
    Validating each new block requires some steady, acceptable amount of CPU. On the other hand, if we have to validate all those blocks at once during a reorg, then the CPU it consumes might be lethal to the system.
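For concreteness, the worst-case bound from point 2 with illustrative numbers (M = \sqrt N; these figures are examples, not measurements):

```python
# Worst-case diff applications with fixed-position group diffs of size M:
# up to M block diffs from the virtual tip to the first group boundary,
# up to N/M group diffs, and up to M block diffs inside the target's group.

import math

def worst_case_applications(N, M):
    return M + math.ceil(N / M) + M

N = 10**6
M = math.isqrt(N)  # M = sqrt(N) = 1000
# worst_case_applications(N, M) == 3 * M == 3000, vs. N == 1,000,000 naively
```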
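The arithmetic behind the 1.01GB figure in point 3:

```python
# Average vs. peak memory for the hypothetical second algorithm: 1GB for 3599
# of every 3600 seconds, 36GB for the remaining second.

SECONDS_PER_HOUR = 3600
avg_gb = (1 * (SECONDS_PER_HOUR - 1) + 36 * 1) / SECONDS_PER_HOUR
peak_gb = 36
# avg_gb is about 1.01, well under the first algorithm's steady 2GB,
# yet peak_gb far exceeds the machine's 4GB limit.
```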

I’m sure you are aware that there is no difference between theory and practice, in theory.

  1. Okay, fair point. I do want to highlight that even 0.5 sec might be prohibitive, because it comes in addition to the steady, ordinary verification that the node must process.
    Also, ideally the finality point should not be decided according to nodes’ capabilities, but rather according to assumptions on network connectivity and social consensus.

  2. I’m not sure that group diffs save anything on the cost, asymptotically, but I’ll put this kind of analysis aside for the boring sake of practicality. Regardless, the point remains: if the diffs are computed with respect to the genesis (or the finality point), then they are deterministic and fixed, whereas if they are computed with respect to the virtual, then they must be constantly updated, bringing us back to a linear number of operations per block instead of a constant.

  3. Agreed that expectation in and of itself is not informative enough, so I’ll be more explicit: the probability of a reorg a few minutes backwards is infinitesimally small, unless there’s a huge network split (malicious or due to some internet corona) or a new 51% attacker has joined the system. Yes, our nodes must know how to handle these events as well, but then it’s okay to revert to other measures, such as slower verification of new blocks, mining empty blocks, reading snapshots from disk, etc.

I agree with @hashdag. With the right engineering, there shouldn’t be a problem handling a deep reorg. Yes, it’ll be slow and very annoying, but the node shouldn’t crash, and it will eventually catch up.
If we pre-calculate the UTXO sets of potential deep reorgs with very low probability, it means we are open to CPU attacks. The cost of the CPU attack is O(blockRate*finalityWindowDuration). If we want to keep the cost constant, it means that every time we increase the block rate, we need to decrease the finality window size.
But anyway, I think that we need more accurate benchmarks in order to better evaluate the damage of this attack.

I do not agree with “slow and annoying”: it can mean the network halts for a few minutes, and if that happens often enough it can even halt for hours.

Now, assuming some finality window and some implementation, you might say “we solved this in 0.5 ms” and it’ll be fine. But I do think it is definitely a real problem with a large enough finality window.

I believe that step 2 doesn’t change any consensus definitions, and actually matches the ideas behind the GhostDAG ordering rules more closely (IMHO, from my understanding).

Did some benchmarks on Bitcoin Core:
ran invalidateblock on 0000000000000000000aebe57993d2526e33b1361e9cce566f16368fba971ba3 (11,012 blocks ago),
i.e., from 0000000000000000000aebe57993d2526e33b1361e9cce566f16368fba971ba3 (height=620011, date='2020-03-03T16:21:11Z') back to 00000000000000000010ef34db7a2711710bbbc025d6780d48b42edfdc3b6458 (height=608999, date='2019-12-20T20:09:11Z'),

and it took 20:34 minutes to undo the UTXO set back to that point.

First of all, this is way better than I thought. Under the Bitcoin assumptions (one sequential chain with 10 minutes per block) this is more than great.

But for Kaspa: if we assume a Kaspa block is 1/3 of a Bitcoin block (11,012*3 blocks) and assume 1 block per second (33,036 seconds), we get a 9-hour “reorg”. Meaning that if there’s a block pointing to another block from 9 hours ago, Bitcoin Core’s code would take 20 minutes to adjust the UTXO set for it.
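The extrapolation, spelled out with the assumptions stated above:

```python
# Scaling the Bitcoin Core benchmark to Kaspa, under the stated assumptions:
# a Kaspa block holds ~1/3 of a Bitcoin block's transactions, 1 block/sec.

btc_blocks = 11_012
kaspa_blocks = btc_blocks * 3           # 33,036 Kaspa-sized blocks
hours_of_history = kaspa_blocks / 3600  # ~9.2 hours of DAG history
# A block pointing that far back would trigger roughly the measured
# 20:34 of UTXO-rollback work.
```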

If these benchmarks are applicable to Kaspa’s code, then any finality window of a day or more opens an attack vector of halting the network for 20 minutes and more.

I also generated 11,012 empty blocks on regtest and ran invalidateblock on them, and it took ~10 seconds, so the time seems to be relative to the size / number of transactions.

Re “slow and annoying” vs halting the network: perhaps calculating a deep reorg state should be done on a different thread of execution, with controlled allocation of CPU resources; the only requirement here is that block verification time \ll block interval.

Re “change in consensus definition”: indeed, technically it’s a change of the consensus rules, but conceptually the consensus algorithm is totally indifferent to a block’s state.

So I did some benchmarks with the improved restoreUTXO function that @mikez wrote, but this time I tried it with fuller blocks (1000 outputs per block) and made a chain of 1000 of them.
In this setup it took about 65 MB to save the UTXO diffs in RAM, so for 86,400 blocks (a whole day) it would take about 5.6 GB. This is why I wanted to try an approach where we load all of the UTXO diffs from disk, so we can save memory.
With that, restoreUTXO took 4.7 seconds for 1000 blocks (when loading the diffs from disk).
If we assume a DAG width of 10 and 86,400 blocks per day (1 block per second), a restoreUTXO for the first block of the day will have to apply 8,640 diffs, which means it’ll take about 40 seconds.
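The scaling behind these numbers, using the measured figures above:

```python
# Extrapolating the measured ~65MB and ~4.7s per 1,000 full blocks to a whole
# day of Kaspa blocks, assuming DAG width 10 and 1 block per second.

mb_per_1000_blocks = 65
sec_per_1000_diffs = 4.7
blocks_per_day = 86_400
dag_width = 10

ram_gb_for_a_day = mb_per_1000_blocks * blocks_per_day / 1000 / 1000  # ~5.6 GB
chain_diffs_to_apply = blocks_per_day // dag_width                    # 8,640
restore_seconds = sec_per_1000_diffs * chain_diffs_to_apply / 1000    # ~40.6 s
```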