Sync Monitor / Mining Rule Engine

coderofstuff · February 1, 2025, 6:12am

Introduction

This forum post summarizes some of the discussions with @michaelsutton, @hashdag and @someone235, which can be found in the "Sync Monitor / Mining Rule Engine" Google Doc and on Telegram (@kasparnd).

Here we describe two pieces of related work:

1. Creating a sync monitor/manager such that nodes can continue mining during a network halt under certain good conditions. This monitor answers the question “Should the node mine?”
2. Creating a mining rule engine that allows mining behavior to adjust to conditions of the node or the network. This rule engine answers the question “How should this node mine?”

Sync Monitor

This monitor determines whether a node should mine given its observable current state. The rusty kaspa node allows mining if (1) the node has at least one peer and (2) the sink is recent. These conditions are sufficient when the network is operating normally, that is, when the network is producing blocks.

What happens if network mining halts for long enough that all nodes fail the “is sink recent?” condition? Recovering from this halted state would require manual intervention before mining can continue.

We know the network is operating normally if it is producing blocks. We also know the expected number of blocks per second (1 block per second on mainnet, soon to become 10 BPS). Using these two pieces of information, we can determine how fast a node’s sync rate is. Specifically:

sync_rate = (observed_blocks_per_second) / (expected_blocks_per_second)

We can formally define that the “network is operating normally” if the sync_rate is above some threshold. If it falls below this threshold, then regardless of the recency of the sink, the node should mine to bring the sync_rate back up.

So, the proposed updated rule for the sync monitor is as follows:
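As a rough illustration of the sync_rate formula above, here is a minimal Python sketch that counts block arrivals over a sliding time window. The class name `SyncRateTracker`, the 10-second window and the method names are assumptions made for this example; they are not taken from the rusty kaspa codebase.

```python
from collections import deque


class SyncRateTracker:
    """Hypothetical helper: observed vs. expected block rate over a sliding window."""

    def __init__(self, expected_bps: float, window_seconds: float = 10.0):
        self.expected_bps = expected_bps      # e.g. 1 on mainnet today
        self.window_seconds = window_seconds  # assumed window length
        self._arrivals = deque()              # arrival times, in seconds

    def record_block(self, now: float) -> None:
        """Call whenever the node accepts a new block."""
        self._arrivals.append(now)

    def sync_rate(self, now: float) -> float:
        """sync_rate = observed_blocks_per_second / expected_blocks_per_second."""
        # Drop arrivals that have fallen out of the window.
        cutoff = now - self.window_seconds
        while self._arrivals and self._arrivals[0] <= cutoff:
            self._arrivals.popleft()
        observed_bps = len(self._arrivals) / self.window_seconds
        return observed_bps / self.expected_bps


tracker = SyncRateTracker(expected_bps=1.0, window_seconds=10.0)
for t in range(1, 11):          # one block per second for 10 seconds
    tracker.record_block(float(t))

healthy = tracker.sync_rate(now=10.0)   # 10 blocks in the window
slowing = tracker.sync_rate(now=15.0)   # only 5 blocks left in the window
halted = tracker.sync_rate(now=100.0)   # no blocks in the window
```

With an expected rate of 1 BPS, the three readings come out as 1.0, 0.5 and 0.0, so a halt shows up as a sync_rate that decays toward zero as the window empties.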

The node is connected to at least one peer AND
(The sink is recent OR the sync rate is below threshold)

With this rule, we cover the scenario where the network halts. If you can think of more scenarios that should be covered, please feel free to suggest them below.
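The proposed rule can be written as a single boolean expression. The sketch below is only an illustration: the function name, its parameters and the example threshold of 0.5 are hypothetical, since no concrete threshold value is proposed here.

```python
def should_mine(peer_count: int, sink_is_recent: bool,
                sync_rate: float, sync_rate_threshold: float) -> bool:
    """Proposed sync monitor rule (illustrative, names are assumptions)."""
    has_peer = peer_count >= 1
    # A sync rate below the threshold is read as "the network is not
    # operating normally", so mining is allowed even with a stale sink.
    rate_below_threshold = sync_rate < sync_rate_threshold
    return has_peer and (sink_is_recent or rate_below_threshold)


THRESHOLD = 0.5  # hypothetical value, for illustration only

normal = should_mine(3, True, 1.0, THRESHOLD)        # normal operation
isolated = should_mine(0, True, 1.0, THRESHOLD)      # no peers
halted = should_mine(3, False, 0.0, THRESHOLD)       # stale sink, network halted
catching_up = should_mine(3, False, 0.9, THRESHOLD)  # stale sink, network healthy
```

Only the third case differs from the existing rule: a node with a stale sink is allowed to mine because the observed block rate shows that the network itself has stalled.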

Mining Rule Engine

This rule engine determines how a given node should mine. It will be a new addition to the node so that it can attempt to recover from some scenarios, such as performance degradation, that are encountered through the course of a node’s operation. As this is new functionality, I will only describe some of the ideas for scenarios we may care about and some possible recovery methods.

Recovery Methods

Consider pointing only to blue blocks
- Useful when header processing takes longer than some threshold (e.g. 0.5 seconds) and red blocks constantly appear in the mergeset, which hints that these red blocks might be the cause of the performance issue.
Consider mining empty blocks (no transactions)
- Useful if something about the transactions is causing the error, as with the BadMerkleRoot errors during the freeze described in "Kaspa 20 minute BlockDAG freeze post-mortem" by Ori Newman (Medium). That post-mortem also mentions this recovery method towards the end.
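One way to picture the two recovery methods above is as switches on a mining policy that is applied when a block template is built. This is a simplified sketch: `MiningPolicy`, `apply_policy` and the `(block_hash, is_blue)` pairs are hypothetical stand-ins, not the node's actual template-building code.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class MiningPolicy:
    """Hypothetical switches for the two recovery methods."""
    blue_only: bool = False     # point only to blue blocks
    empty_blocks: bool = False  # mine blocks with no transactions


def apply_policy(policy: MiningPolicy,
                 candidates: List[Tuple[str, bool]],
                 mempool_txs: List[str]) -> Tuple[List[str], List[str]]:
    """Return the (blocks to point to, transactions to include) for a template."""
    # Keep a candidate if it is blue, or if red blocks are still allowed.
    blocks = [h for h, is_blue in candidates if is_blue or not policy.blue_only]
    # Empty-block mode drops every transaction from the template.
    txs = [] if policy.empty_blocks else list(mempool_txs)
    return blocks, txs


candidates = [("a", True), ("b", False), ("c", True)]  # "b" is red
standard = apply_policy(MiningPolicy(), candidates, ["tx1"])
blue_only = apply_policy(MiningPolicy(blue_only=True), candidates, ["tx1"])
empty = apply_policy(MiningPolicy(empty_blocks=True), candidates, ["tx1"])
```

Keeping the two switches independent means they could be combined if both symptoms appear at once, although whether that is desirable is an open design question.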

Mining Scenarios

Scenario	Mining Behavior
Node is functioning normally	Node will mine with the standard behavior (as it works today in mainnet)
Node is in IBD and the sink timestamp is recent enough	Node will mine with the standard behavior (as it works today in mainnet)
Node is in IBD and the sink timestamp is not recent enough	No mining is allowed
Node is receiving multiple blocks that fail validation with BadMerkleRoot	If the node continuously receives ONLY blocks that fail with a BadMerkleRoot error, switch to mining only empty blocks (no transactions).
Network mining has halted	With the updated sync monitor logic above tracking the sync rate, this would be a completely unexpected scenario and would require manual intervention.
Too many red blocks being merged causing performance degradation	Allow a maximum of only 1 red block to be merged (or do not merge any red blocks) until node performance settles down.
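The table above could be encoded as a small decision function. The sketch below is one possible reading of it: the `NodeState` fields, the `Behavior` names and especially the order in which conditions are checked are assumptions, since no precedence between scenarios is specified here. The halted-network row is left out because it is described as requiring manual intervention.

```python
from dataclasses import dataclass
from enum import Enum


class Behavior(Enum):
    STANDARD = "standard mining"
    NO_MINING = "no mining"
    EMPTY_BLOCKS = "empty blocks only"
    LIMIT_RED_BLOCKS = "merge at most one red block"


@dataclass
class NodeState:
    """Hypothetical snapshot of what the rule engine observes."""
    in_ibd: bool = False
    sink_is_recent: bool = True
    only_bad_merkle_root_blocks: bool = False
    red_merges_degrading_performance: bool = False


def choose_behavior(state: NodeState) -> Behavior:
    # Assumed precedence: refuse to mine first, then apply recovery methods.
    if state.in_ibd and not state.sink_is_recent:
        return Behavior.NO_MINING
    if state.only_bad_merkle_root_blocks:
        return Behavior.EMPTY_BLOCKS
    if state.red_merges_degrading_performance:
        return Behavior.LIMIT_RED_BLOCKS
    return Behavior.STANDARD


normal = choose_behavior(NodeState())
ibd_recent = choose_behavior(NodeState(in_ibd=True, sink_is_recent=True))
ibd_stale = choose_behavior(NodeState(in_ibd=True, sink_is_recent=False))
bad_merkle = choose_behavior(NodeState(only_bad_merkle_root_blocks=True))
red_heavy = choose_behavior(NodeState(red_merges_degrading_performance=True))
```

New scenarios would then amount to a new field on the state, a new behavior, and a decision about where the check sits in the precedence order.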

I encourage the reader to propose more scenarios that may need to be handled or recovery methods to handle such scenarios.