Error(Engine(NotProposer(Mismatch and Block(InvalidGasLimit(OutOfBounds #203
Yes, I tried to sync an additional node with …
… blockGasLimit when syncing, I think, because that makes no sense.
I believe the error … Also, I'm not sure, but the calling of the non-existent function could be related:

```solidity
function reportBenign(address, uint256) public {
}
```
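For reference, the reporting interface that the AuRa engine expects from a reporting validator-set contract looks roughly like this (a sketch based on the documented interface; the signatures in the actually deployed POSDAO contracts may differ):

```solidity
pragma solidity ^0.5.0;

// Sketch of the reporting hooks an AuRa "reporting" validator-set contract is
// expected to expose. The engine calls these when it observes benign (e.g. a
// skipped step) or malicious behaviour; if the deployed contract doesn't
// implement reportBenign, that engine call has nothing to hit, which would
// match the "non-existent function" observation above.
interface IReportingValidatorSet {
    function reportBenign(address validator, uint256 blockNumber) external;
    function reportMalicious(address validator, uint256 blockNumber, bytes calldata proof) external;
}
```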
Do you encounter … I don't think it's related to upgrading to posdao, because your logs show that the error occurs on block … In general, the Parity version you are using could have some issues with …
@d10r Could you also attach your config.toml for the node? I tried to resync an additional test archive node with …
Is a block gas limit contract configured in the chain spec? If it is, is the contract deployed? No idea regarding the …
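For context, the block gas limit contract that the AuRa engine can be pointed at from the chain spec only needs a single getter, roughly like the sketch below (illustrative; the actual contract deployed on tau1, if any, may look different):

```solidity
pragma solidity ^0.5.0;

// Minimal sketch of a block gas limit contract as the engine queries it: the
// engine reads blockGasLimit() for each block and uses the returned value as
// that block's gas limit. If the spec points at an address where no such
// contract is deployed yet, the call yields nothing useful, which could
// explain gas-limit-related warnings while syncing old blocks.
contract BlockGasLimitExample {
    uint256 public limit = 10000000; // illustrative fixed limit

    function blockGasLimit() public view returns (uint256) {
        return limit;
    }
}
```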
Yes, @d10r said that it is fine. The warning in the logs only appears when syncing.
Since it is not critical and doesn't break anything, I guess we could leave that as it is, right?
It's hard to say because …
I guess the gas limit thing is not critical, no.
@d10r did you deploy the …
This reverts the configurable step duration map: https://github.com/poanetwork/parity-ethereum/tree/afck-revert-step-d
@varasev, @afck thx for helping with the investigation. I just tried with branch afck-revert-step-d; that didn't help.
I finally had the opportunity to analyze it a bit further and had a few interesting findings.
I then looked at the logs of the 2 nodes active as validators at that time.
Up to block 5134387, block authorship alternated between 0x6cca and 0x57b2.
At that point, a validator set switch to 0x6cca only was signalled:
and applied a few blocks later.
Next, 0x57b2 came back online and started authoring blocks again at 5134469. The chain kept running after that (it's currently at 5372354). However, no more validator set change was signalled, although many epoch changes have happened since. This leaves me with some question marks:
@varasev does that information give you an idea of what's going on? Btw. if you want to take a closer look yourself:
Regarding this message:
The …
As far as I know, the message …
I think everything happened as follows:
It downloaded some skipped blocks from … But then something went wrong... When you call … Could you attach the full logs for the validator nodes between blocks …? @afck, could you please look into …
It means that …
If …
Also, please clarify what exactly happened that time with the second node: was it turned off, or did it just lose connection with the external world? My assumption for the latter case is that the second node could have continued producing blocks, but on its own chain. Then, after the connection was reestablished, the blocks were reorganized and the node went back to the first node's chain, but at that moment something went wrong there. The full logs for that block range would be helpful to understand what happened.
That's what was deployed: https://github.com/lab10-coop/posdao-contracts/tree/dh-staking-ui

I can confirm that both nodes were running … Your theory of what may have happened is consistent with the logs as far as I can see; in fact there was also a reorg on …

Here are the logs: …

You may wonder why both nodes were restarted during that time period. That needs a bit more background: … Maybe my way of handling this was a bit sloppy, but check the timestamps: I really just wanted to get this done and out of my head before starting into the new year with a few days off with family :-)

It wouldn't bother me much to just rewind the chain again in order to fix tau1. But it looks to me like there's some bug in the AuRa consensus which allows this to happen and which is maybe worth figuring out before upgrading the mainnet. With posdao, validator set changes become much more dynamic and automated - this may expose engine issues which were already there before, but were less likely to occur.
Below I describe what I see in the logs regarding a possible reason why …

The second node produced block … Then after …

We see that the second node, after it started, lived on its own fork for some short time, because it logs …

For the blocks … And for the same blocks we see the same message on the second node: … But the block hashes on both nodes for those blocks are the same: …

On the second node, before the block … But on the first node, before the block … And before the block … Maybe at that moment the validator nodes were connected to the bootnode (because we see that the block hashes match), but they weren't connected to each other for some reason? (because we see …)

One more thing (maybe this answers some of the questions above): on the first node the block … The second node, before it was restarted, imported the block … We see the message … Since the second node didn't produce any blocks after …

On the first node, in turn, the new validator set [0x6ccaa51f295652dc33f4d8ce12379eac3594f3d2] was not finalized, because we don't see … After the second node was restarted at …
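To make the finalization steps referenced above easier to follow, this is roughly the (non-reporting) validator-set contract interface that AuRa drives - a sketch based on the documented interface, not the exact POSDAO code: the contract announces a new set via InitiateChange, and the change only takes effect once the engine calls finalizeChange() from the system address.

```solidity
pragma solidity ^0.5.0;

// Sketch of the validator-set flow discussed in the log analysis above:
// 1. The contract emits InitiateChange(parentHash, newSet) to signal a new set.
// 2. The engine waits until that block is confirmed by the *current* validators
//    and then calls finalizeChange() from the system address.
// 3. Only after finalizeChange() does getValidators() return the new set.
// If step 2 never completes (e.g. one of two validators is offline), the change
// stays pending, which matches the "not finalized" observation above.
interface IValidatorSet {
    event InitiateChange(bytes32 indexed parentHash, address[] newSet);

    function getValidators() external view returns (address[] memory);
    function finalizeChange() external; // callable only by the system address
}
```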
@afck Am I right that …
That is very plausible. Still, I don't see why that order of events ended up in the current situation: a validator set change request in limbo and the impossibility of syncing a new node to that chain.
As I wrote here, I believe the reason for the non-existent … The reason for … For the blocks …
For the second node we see … The first validator (0x6cca) for the block … The weird thing is that the block … It's hard to say why the author of … Then we see that the next blocks (5134471, 5134472, and so on) are produced in the right order. Just in case: could you check the authors of …
Anyway, I think the reason for such a collision is that the chain had only two validators and one of them was offline, so it was hard (or buggy?) for the node to follow the AuRa rules because of the … It cannot be ruled out that AuRa still has some bugs and is not well tested for the case when two nodes lose connection to each other, taking into account the still-open issues for the same error: https://github.com/paritytech/parity-ethereum/issues?utf8=%E2%9C%93&q=is%3Aissue+is%3Aopen+Error%28Engine%28NotProposer%28Mismatch Also, as noted in this issue, time desynchronization on the nodes can be one of the reasons. It would be interesting to try to reproduce the case for two validators, when the second of them is turned off and then …
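For readers less familiar with AuRa, the expected block author is derived purely from time and the validator set, roughly as in the sketch below (illustrative pseudo-contract code; the real logic lives in the Rust engine). With only two validators and one of them offline, half of all steps have no online proposer, and any disagreement about the current validator set or the clock immediately produces Error(Engine(NotProposer(Mismatch ...))).

```solidity
pragma solidity ^0.5.0;

// Illustrative sketch of AuRa's proposer selection: time is divided into fixed
// steps, and the validator at index (step % validators.length) is the only one
// allowed to author the block for that step.
library AuRaSchedule {
    function expectedAuthor(
        address[] memory validators,
        uint256 unixTimestamp,
        uint256 stepDurationSeconds
    ) internal pure returns (address) {
        uint256 step = unixTimestamp / stepDurationSeconds;
        return validators[step % validators.length];
    }
}
```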
Regarding my above note:
I repeated the same case with https://github.com/varasev/test-block-reward/tree/87476afef319b729edd2646672c483c84adeaa4c (see the node's logs after block #20): https://github.com/varasev/test-block-reward/blob/87476afef319b729edd2646672c483c84adeaa4c/scripts/watch.js#L119-L127 - here at block … (I used Parity 2.6.5-beta)
It is the same as in your second validator's logs (lines 227-233) - the network for the second node stopped because the validator set was finalized, and after that there were no available validators anymore (since the second validator wasn't connected to the first validator). I haven't yet tried to reproduce the case for two validators, when the second of them is turned off and then …
@d10r Could you please attach the full bootnode's logs for the same block range (5134350...5134500)? If you have more than one bootnode, please attach the logs of all of them.
You are right - the first validator initially had a port …
For the second validator the port …
I wonder whether it has anything to do with openethereum#11107 (comment).
The "benign misbehavior" is not really about the connection status, but about skipped blocks (but I don't see the "
That's my understanding as well.
When I tried on Wednesday, I got lost in the code, so let me try step-by-step:
So it looks to me like that log message can only appear for a block in which the … Edit: Sorry, that was only …
So I guess …
I tried to reproduce a similar case, but unfortunately didn't see the same behaviour. The test scenario (requires Parity >= 2.6.5-beta):

```bash
git clone -b two-validators-bug https://github.com/varasev/test-block-reward
cd test-block-reward
npm i
npm start
```

At the block … To repeat the test, perform … I'll also try to test that with https://github.com/poanetwork/parity-ethereum/releases/tag/parity-v2.5.9-posdao-v0.1.2 (this will require …
blockGasLimit and Error(Engine(NotProposer(Mismatch
I believe the main reason for this collision is that the validator set was not finalized. If the finalization had been successful, we wouldn't see the wrong order of block authors (for blocks 5134468-5134470) and wouldn't see the …

Again: if we look inside the 0x6cca.log, we will see that the validator set finalization happened on block … Note that the block … And further the chain uses the block … So, the finalization happened in the block …

The question is why the finalization didn't happen in the chain after the second node was restarted and reorged to the block … It seems the first node somehow remembered that … Or maybe the first validator (…
Or the reason was just clock desynchronization on the nodes: openethereum#10607 (comment)
I tried to reproduce a similar case for … The test scenario:

```bash
git clone -b two-validators-bug-parity-2-5-9 https://github.com/varasev/test-block-reward
cd test-block-reward
npm i
mkdir parity-ethereum
cd parity-ethereum
wget https://github.com/poanetwork/parity-ethereum/releases/download/parity-v2.5.9-posdao-v0.1.2/parity-macos.zip
unzip parity-macos.zip
cd ..
npm start
```

At the block … To repeat the test, perform …

@d10r So, I think in your case it's better to rewind the chain as you suggested and then add one more validator, so that the chain has three validators and is more sustainable. Also, it's better to make sure the nodes are time-synchronized.
@afck Do you mean …? I tried to launch the test scenario (you can repeat it): #203 (comment) - it shows that … The … For …
I hadn't spotted that … Clock de-synchronisation is rather unlikely. Both nodes were running on the same machine at the time (thus even network-induced delays were minimal).
They match.
node
Unfortunately, the logs of the connected bootnode and RPC node were already cut off at a later point (journalctl limits). So for tau1 I'll just rewind again. Btw. I've never seen … Thanks to your failed reproduction test I feel a bit more comfortable.
@varasev: Yes, I meant the …
I just tried to do that - it works fine, but only for a node without state pruning (i.e. which has …). In order to rewind, launch … Short docs for …
@varasev that's weird. When I try to reset, I get this:
This is with a config for rpc with trace (e.g. for explorer):
Does it work for you with a node configured for AuRa?
I tried … Has your node always worked with …
Yes, this was with … I just tried with https://github.com/varasev/test-block-reward and Parity 2.6.5-beta. Works for me too.
Ok, found it. Looks like …
Yeah, really. Last time I tried to rewind 50 blocks, but now I retried with 65 blocks and see the limit as well.
@afck I've now built Parity from upstream master (2.8.0-beta - commit 9c94dcb) and tried whether it can sync ARTIS tau1 with POSDAO active. It could, but once it's synced, it tends to get stuck with this error msg:
Afterwards, it's stuck. The blocks it gets stuck at are correct ones (not from some stale branch) - I checked the block hashes. E.g. the one from the err msg pasted here is this one. There's a pattern for when it happens: it's always 2 blocks before an epoch change. Does that give you any ideas? When running Parity 2.5.9 with posdao changes from the poa repo on this chain, I've never seen this happen.
It happens on the last two blocks of the epoch because of this condition: https://github.com/lab10-coop/posdao-contracts/blob/fafc47ac803dcd271f75dea33ef61a8c5ad628bf/contracts/TxPermission.sol#L224 I think maybe we need to turn off that check on …
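I don't have the exact code from that line in front of me, but the shape of such a check is roughly as in the sketch below (illustrative, with assumed names and an assumed condition, not a copy of TxPermission.sol): a permission contract can refuse ordinary transactions during the last couple of blocks of a staking epoch, which is exactly where the freshly synced upstream node gets stuck.

```solidity
pragma solidity ^0.5.0;

// Illustrative sketch of an epoch-end transaction filter. The function name
// and return values follow Parity's transaction permission contract interface;
// stakingEpochEndBlock and the "last two blocks" condition are assumptions for
// illustration only.
contract TxPermissionSketch {
    uint256 public stakingEpochEndBlock; // assumed: updated elsewhere each epoch

    uint32 internal constant NONE = 0x00;
    uint32 internal constant ALL = 0xffffffff;

    function allowedTxTypes(
        address /*sender*/,
        address /*to*/,
        uint256 /*value*/,
        uint256 /*gasPrice*/,
        bytes memory /*data*/
    ) public view returns (uint32 typesMask, bool cache) {
        // Disallow ordinary transactions in the final two blocks of the epoch.
        if (block.number + 2 > stakingEpochEndBlock) {
            return (NONE, false);
        }
        return (ALL, false);
    }
}
```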
Error(Engine(NotProposer(Mismatch and Block(InvalidGasLimit(OutOfBounds
@d10r To temporarily avoid the bug, you can upgrade the …
As far as I understand, the block gas limit contract breaks light client compatibility… and maybe also warp sync? |
@afck maybe we missed something in upstream? Something that we have in …
Just tried without using warp sync; it behaves the same. I've now tried this Parity binary with … So, I'll just upgrade …
So, you see the error …
Which exact version of the Parity binary do you mean here?
Although, the …
Have you tried with …
right
9c94dcb8 (which was the HEAD of master a few days ago), built myself in release config, because there wasn't yet a beta release with all posdao PRs.
yes. I don't have an explanation either.
No. Will try and report back.
Tried in …
In this case, it happens at block 21 (reproducible).
@varasev couldn't it be caused by some refactoring, e.g. this, which the latest Parity relies upon?
Doesn't the …
@d10r I tried to launch …
for all the nodes in the config directory. The tests use our latest POSDAO contracts from https://github.com/poanetwork/posdao-contracts. Were these steps to reproduce correct on my side?
@d10r Yes, sorry, seems that's the reason for the error. Your …
Thx @varasev. I suspected something like this, but knowing exactly gives more peace of mind.
@d10r reports the issues related to our Parity fork (in the aura-pos branch):

1. When syncing, the log is full of such msgs: … I get those err messages only when a client is catching up. As soon as it has synced to the last block, they stop. I wonder if it's something to be ignored on nodes catching up (it would probably make sense then to suppress the err msg altogether) or if it signals a real problem.

2. I'm also having trouble with syncing nodes getting stuck, like this: … I was able to successfully sync "normal" full nodes. This is happening with a node configured as an archive node (for blockscout).