Updated survey of the data growth in kaspad

All in all, we store three components: full header data above the pruning block, the UTXO set of the pruning block, and a proof of correctness for the UTXO set.

We have a fancy pruning mechanism that allows us to remove old block data. At full capacity the size of a block payload is bounded by 100kB and the size of a block header is bounded by 100 + 32\cdot\log_2(\mbox{past size}) bytes. In the distant future, when we have a trillion blocks in the network (this will take about 30 thousand years at one block per second), we will have \log_2(\mbox{past size})\le 40, so let us assume henceforth that \log_2(\mbox{past size})\le 40 forever (being mindful that the following analysis is “only” good for the coming 30 thousand years). This means that the header size is bounded by (100+32\cdot40)=1380 bytes, which is just below 1.5kB. We store three days' worth of full block data, which, at a rate of one block per second, accumulates to about 26GB (note that this bound assumes all blocks are at maximum capacity; it makes no assumption about the average number of txns per block).
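For concreteness, here is the back-of-the-envelope arithmetic as a small Python sketch. Everything in it is just the assumptions stated above (100kB payload bound, \log_2(\mbox{past size})\le 40, one block per second, all blocks full); nothing is measured from a real node:

```python
# Back-of-the-envelope bound on the three-day full-block window, under the
# assumptions above: 100kB payload bound, log2(past size) <= 40, one block
# per second, all blocks at maximum capacity.

PAYLOAD_BOUND = 100_000            # bytes per block payload
HEADER_BOUND = 100 + 32 * 40       # bytes per header = 1380, just below 1.5kB
BLOCKS_PER_DAY = 24 * 60 * 60      # one block per second

window_bytes = 3 * BLOCKS_PER_DAY * (PAYLOAD_BOUND + HEADER_BOUND)
print(f"header bound: {HEADER_BOUND} bytes")               # 1380 bytes
print(f"three-day window: {window_bytes / 1e9:.1f} GB")    # ~26.3 GB
```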

The UTXO correctness proof requires that we keep an additional \log_2(\mbox{number of blocks in the network}) headers (not full blocks). Using again the assumption \log_2(\mbox{past size})\le 40, this adds about 60kB of data, which is completely negligible. Currently we store all block headers, since removing them without accidentally removing headers required for the proof takes some care and our dev team hasn’t gotten around to it yet; this is a purely technical issue which will be resolved in the near future. (There is another detail I swept under the rug: we also have to store the headers of all pruning blocks, i.e., one header per day. While this technically grows at a rate of O(n\cdot\log n), the constant is ridiculously small: it is bounded by 1.5kB/day, which comes to about 550kB a year.)
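Continuing the same back-of-the-envelope arithmetic, again as a sketch using the rounded 1.5kB header bound from above (the constants are the ones stated in this post, not measured values):

```python
# Header overhead beyond the three-day window, using the rounded 1.5kB bound.
HEADER_KB = 1.5                   # rounded-up header bound, in kB

proof_headers = 40                # log2(number of blocks in the network) <= 40
print(f"UTXO proof headers: {proof_headers * HEADER_KB:.0f} kB")                # 60 kB

pruning_headers_per_year = 365    # one pruning block header per day
print(f"pruning headers: {pruning_headers_per_year * HEADER_KB:.0f} kB/year")   # ~550 kB/year
```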

The only thing that grows linearly is the pruning block UTXO set itself. It currently requires a fixed-size field for every unspent output in the network. It is hard to predict how fast this set grows, as it heavily depends on user behavior. We intend to resolve this in the future by means of cryptographic accumulators. An accumulator is a way to represent a large set so succinctly that it is impossible to recover the set itself (due to information-theoretic compression bounds), yet it is still possible to verify that an element is in the set given a proof. This means that every user will only need to store (proofs of) their own unspent outputs, and the nodes will only have to verify these proofs against the accumulator, which is much smaller than the actual number of unspent outputs. The sizes of the accumulator and the proofs depend on the exact solution we choose.
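To make the accumulator idea a bit more concrete, here is a minimal Python sketch of one possible flavor, a Merkle-tree-style membership proof. This is only an illustration of the pattern “the node stores the accumulator, the user stores their own proof”; it is not the construction kaspad will actually adopt (which, as said, is still undecided), it ignores how proofs are updated as the set changes, and all names in it are made up for the example:

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_root_and_proof(leaves: list[bytes], index: int):
    """Build a Merkle root over `leaves` and an authentication path for leaves[index]."""
    level = [h(leaf) for leaf in leaves]
    proof = []
    while len(level) > 1:
        if len(level) % 2 == 1:           # duplicate the last node on odd-sized levels
            level.append(level[-1])
        sibling = index ^ 1
        proof.append((index % 2, level[sibling]))
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        index //= 2
    return level[0], proof

def verify(root: bytes, leaf: bytes, proof: list[tuple[int, bytes]]) -> bool:
    """A node checks a membership proof against the accumulator (the root) alone."""
    acc = h(leaf)
    for is_right, sibling in proof:
        acc = h(sibling + acc) if is_right else h(acc + sibling)
    return acc == root

# The node stores only `root`; the user stores their own output and its proof.
utxos = [b"utxo-0", b"utxo-1", b"utxo-2", b"utxo-3"]
root, proof = build_root_and_proof(utxos, 2)
assert verify(root, b"utxo-2", proof)
assert not verify(root, b"utxo-spent", proof)
```

The point of the sketch is the asymmetry: the node keeps only a constant-size root, while the proof each user keeps grows only logarithmically with the size of the set.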