Error(Engine(NotProposer(Mismatch and Block(InvalidGasLimit(OutOfBounds #203

varasev opened this issue Jan 6, 2020 · 50 comments

varasev commented Jan 6, 2020

@d10r reports the following issues related to our Parity fork (in the aura-pos branch):

1. When syncing, the log is full of such msgs:

2019-12-30 23:12:12 UTC IO Worker #3 ERROR client  Failed to call contract at 0x83c8ce4c7ddf5192504e1e4199f3aa816d3ea6c167fd395895521ded263b5754's child
2019-12-30 23:12:12 UTC IO Worker #3 ERROR engine  Failed to call blockGasLimit. Not changing the block gas limit. "Failed to call contract at 0x83c8ce4c7ddf5192504e1e4199f3aa816d3ea6c167fd395895521ded263b5754\'s child"

I get those err messages only when a client is catching up. As soon as it synced to the last block, they stop:

2020-01-02 10:40:50  Failed to call blockGasLimit. Not changing the block gas limit. "Failed to call contract at 0x40a71faff37edc4a54887a8dd536ed33628cef049108913701e77e96c098d44d\'s child"
2020-01-02 10:40:50  Failed to call contract at 0x858181bfc32f973ecc316aaf1b8197ee20c3f2ca7f4d0d7df6577c4bf3af8705's child
2020-01-02 10:40:50  Failed to call blockGasLimit. Not changing the block gas limit. "Failed to call contract at 0x858181bfc32f973ecc316aaf1b8197ee20c3f2ca7f4d0d7df6577c4bf3af8705\'s child"
2020-01-02 10:40:50  Failed to call contract at 0x42fc4cdd0bc406380f4eeb0166384e37221fc62f984d5eab2c9fcad17896fdf6's child
2020-01-02 10:40:50  Failed to call blockGasLimit. Not changing the block gas limit. "Failed to call contract at 0x42fc4cdd0bc406380f4eeb0166384e37221fc62f984d5eab2c9fcad17896fdf6\'s child"
2020-01-02 10:40:50  Failed to call contract at 0xd42bb8b60c5de4b09bf10d7ce487efba989a2b66fec1c7cfed7e084ad33d113b's child
2020-01-02 10:40:50  Failed to call blockGasLimit. Not changing the block gas limit. "Failed to call contract at 0xd42bb8b60c5de4b09bf10d7ce487efba989a2b66fec1c7cfed7e084ad33d113b\'s child"
2020-01-02 10:40:56  Imported #5161056 0x793f…d0e3 (0 txs, 0.00 Mgas, 4 ms, 0.57 KiB) + another 15 block(s) containing 0 tx(s)
2020-01-02 10:41:00  Imported #5161057 0xa800…8357 (0 txs, 0.00 Mgas, 4 ms, 0.57 KiB)
2020-01-02 10:41:06  Imported #5161058 0x0aea…61e2 (0 txs, 0.00 Mgas, 4 ms, 0.57 KiB)
2020-01-02 10:41:10  Imported #5161059 0xcfa2…0a83 (0 txs, 0.00 Mgas, 4 ms, 0.57 KiB)
2020-01-02 10:41:13     2/25 peers   89 KiB chain 9 MiB db 0 bytes queue 105 KiB sync  RPC:  0 conn,    0 req/s,    0 µs

I wonder if it's something to be ignored on nodes catching up (would probably make sense then to suppress the err msg altogether) or if it signals a real problem.

2. I'm also having trouble with syncing nodes getting stuck, like this:

Jan 02 10:26:17 rpc-trace parity[7383]: 2020-01-02 10:26:17  Failed to call contract at 0x05942b556e1c3b563061d80f1c5f77475aed6e153dd38c94eb3e5d8559c249ab's child
Jan 02 10:26:17 rpc-trace parity[7383]: 2020-01-02 10:26:17  Failed to call blockGasLimit. Not changing the block gas limit. "Failed to call contract at 0x05942b556e1c3b563061d80f1c5f77475aed6e153dd38c94eb3e5d8559c249ab\'s child"
Jan 02 10:26:17 rpc-trace parity[7383]: 2020-01-02 10:26:17  Validator 0x57b2…fef7 could not be reported Transaction error (Transaction has invalid signature: Crypto error (Invalid address).)
Jan 02 10:26:17 rpc-trace parity[7383]: 2020-01-02 10:26:17  Stage 4 block verification failed for #5134470 (0x0594…49ab)
Jan 02 10:26:17 rpc-trace parity[7383]: Error: Error(Engine(NotProposer(Mismatch { expected: 0x6ccaa51f295652dc33f4d8ce12379eac3594f3d2, found: 0x57b2b56d6cdc609fcdd086a70b02031e72eefef7 })), State { next_error: None, backtrace: InternalBacktrace { backtrace: None } })
Jan 02 10:26:17 rpc-trace parity[7383]: 2020-01-02 10:26:17
Jan 02 10:26:17 rpc-trace parity[7383]: Bad block detected: Error(Engine(NotProposer(Mismatch { expected: 0x6ccaa51f295652dc33f4d8ce12379eac3594f3d2, found: 0x57b2b56d6cdc609fcdd086a70b02031e72eefef7 })), State { next_error: None, backtrace: InternalBacktrace { backtrace: None } })
Jan 02 10:26:17 rpc-trace parity[7383]: RLP: f90249f90244a02ee0274b8aeb6d561013394848d638a70b481700c752fe48f9938e73ecada1eda01dcc4de8dec75d7aab85b567b6ccd41ad312451b948a7413f0a142fd40d493479457b2b56d6cdc609fcdd086a70b02031e72eefef7a0f7d9878c722a2a3d47d6424e5b033e8bd772443a2c00f7a126950f74f581c579a056e81f171bcc55a6ff8345e692c0f86e5b48e01b996cadc001622fb5e363b421a056e81f171bcc55a6ff8345e692c0f86e5b48e01b996cadc001622fb5e363b421b901000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000090fffffffffffffffffffffffffffffffd834e58868398968080845e0bb1f39fde830205098f5061726974792d457468657265756d86312e33362e30826c698412cf2397b84180129169720b99cbf931fdd8ded13dd4b2c11db3d29fa57e01766f4f17eeee1520797bc855674c0a2d1880b9b52bb48d6b5ade00e0fff315c981b9f8e0e8537b01c0c0
Jan 02 10:26:17 rpc-trace parity[7383]: Header: Header { parent_hash: 0x2ee0274b8aeb6d561013394848d638a70b481700c752fe48f9938e73ecada1ed, timestamp: 1577824755, number: 5134470, author: 0x57b2b56d6cdc609fcdd086a70b02031e72eefef7, transactions_root: 0x56e81f171bcc55a6ff8345e692c0f86e5b48e01b996cadc001622fb5e363b421, uncles_hash: 0x1dcc4de8dec75d7aab85b567b6ccd41ad312451b948a7413f0a142fd40d49347, extra_data: [222, 131, 2, 5, 9, 143, 80, 97, 114, 105, 116, 121, 45, 69, 116, 104, 101, 114, 101, 117, 109, 134, 49, 46, 51, 54, 46, 48, 130, 108, 105], state_root: 0xf7d9878c722a2a3d47d6424e5b033e8bd772443a2c00f7a126950f74f581c579, receipts_root: 0x56e81f171bcc55a6ff8345e692c0f86e5b48e01b996cadc001622fb5e363b421, log_bloom: 0x00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000, gas_used: 0, gas_limit: 10000000, difficulty: 340282366920938463463374607431768211453, seal: [[132, 18, 207, 35, 151], [184, 65, 128, 18, 145, 105, 114, 11, 153, 203, 249, 49, 253, 216, 222, 209, 61, 212, 178, 193, 29, 179, 210, 159, 165, 126, 1, 118, 111, 79, 23, 238, 238, 21, 32, 121, 123, 200, 85, 103, 76, 10, 45, 24, 128, 185, 181, 43, 180, 141, 107, 90, 222, 0, 224, 255, 243, 21, 201, 129, 185, 248, 224, 232, 83, 123, 1]], hash: Some(0x05942b556e1c3b563061d80f1c5f77475aed6e153dd38c94eb3e5d8559c249ab) }

I was able to successfully sync "normal" full nodes. This is happening with a node configured as an archive node (for Blockscout).

varasev commented Jan 6, 2020

I get those err messages only when a client is catching up. As soon as it synced to the last block, they stop.

Yes, I tried to sync an additional node with posdao-test-setup and saw the same messages about blockGasLimit while syncing. It seems that something is wrong in the gas_limit_override function:

fn gas_limit_override(&self, header: &Header) -> Option<U256> {
- which, I think, shouldn't try to call blockGasLimit when syncing, because that makes no sense.

I'm also having trouble with syncing nodes getting stuck, like this

I believe the error Error(Engine(NotProposer(Mismatch is not related to blockGasLimit and may be related to one of these issues:

https://github.com/paritytech/parity-ethereum/issues?utf8=%E2%9C%93&q=is%3Aissue+is%3Aopen+Error%28Engine%28NotProposer%28Mismatch

https://github.com/paritytech/parity-ethereum/issues?q=is%3Aissue+Error%28Engine%28NotProposer%28Mismatch+is%3Aclosed

Also, I'm not sure, but calling a nonexistent reportBenign function could be a reason for Error(Engine(NotProposer(Mismatch. You can try adding an empty reportBenign to the ValidatorSet contract and resyncing:

function reportBenign(address, uint256) public {
}

This is happening with a node configured as an archive node (for Blockscout).
I'm now trying again with it isolated (allowing a connection to a bootnode only).

Do you encounter Error(Engine(NotProposer(Mismatch at the same block #5134470 each time you resync the archive node? (I see its number in your logs.)

I don't think it's related to upgrading to posdao, because your logs show that the error occurs at block 5134470, which is far beyond the hard-fork block 5007800 defined in your spec https://github.com/lab10-coop/tau1/blob/master/spec.json

In general, the Parity version you are using could have some issues with Error(Engine(NotProposer(Mismatch because it is a bit outdated (v2.5.9) compared to the latest upstream release (v2.6.8 at the moment). We are trying to get the POSDAO features into the upstream repo (because it's hard to maintain our own fork) and have succeeded with some of them (including randomness in openethereum#10946): https://hackmd.io/@poaafck/UpstreamPosdao Now we're waiting for approval of openethereum#11245 and openethereum#10928

varasev commented Jan 6, 2020

@d10r Could you also attach your node's TOML config? I tried to resync an additional test archive node with posdao-test-setup, but didn't catch Error(Engine(NotProposer(Mismatch.

afck commented Jan 7, 2020

Is a block gas limit contract configured in the chain spec? If it is, is the contract deployed?
I do think that syncing nodes need to call blockGasLimit as well: They need to verify that the block creator used the correct limit, as returned by the contract. Otherwise they need to dismiss the block as invalid.
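For illustration, here is a minimal sketch of that check done externally over RPC, assuming the gas limit contract exposes the blockGasLimit() getter referenced in the log messages above; the contract address and RPC URL are placeholders, and ethers.js v5 is used only as an example client (the engine of course does this internally in Rust):

// Minimal sketch (not the engine's code): re-check one block's gas limit via RPC.
// Assumptions: ethers.js v5, a local RPC endpoint, and a gas limit contract exposing
// blockGasLimit() (the getter named in the log messages above); the address is a placeholder.
const { ethers } = require("ethers");

const RPC_URL = "http://localhost:8545"; // placeholder
const GAS_LIMIT_CONTRACT = "0x0000000000000000000000000000000000000000"; // placeholder

async function checkBlockGasLimit(blockNumber) {
  const provider = new ethers.providers.JsonRpcProvider(RPC_URL);
  const contract = new ethers.Contract(
    GAS_LIMIT_CONTRACT,
    ["function blockGasLimit() view returns (uint256)"],
    provider
  );
  const block = await provider.getBlock(blockNumber);
  // The engine presumably evaluates the contract in the parent block's state,
  // so query it at blockNumber - 1.
  const expected = await contract.blockGasLimit({ blockTag: blockNumber - 1 });
  const ok = expected.eq(block.gasLimit);
  console.log(`block ${blockNumber}: header gas limit ${block.gasLimit}, contract says ${expected}, ${ok ? "ok" : "mismatch"}`);
}

checkBlockGasLimit(5134470).catch(console.error);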

No idea regarding the Mismatch yet. It sounds like there's some new bug in computing the expected proposer for a given step? Possibly related to the configurable step duration?

varasev commented Jan 7, 2020

Is a block gas limit contract configured in the chain spec? If it is, is the contract deployed?

Yes, @d10r said that it is fine. The warning in the logs only appears when syncing.

They need to verify that the block creator used the correct limit, as returned by the contract. Otherwise they need to dismiss the block as invalid.

Since it is not critical and doesn't break anything, I guess we could leave it as is, right?

No idea regarding the Mismatch yet. It sounds like there's some new bug in computing the expected proposer for a given step? Possibly related to the configurable step duration?

It's hard to say, because the Lab10 team doesn't use a configurable step duration in their spec. Let's wait for @d10r's answer about whether the error is observed after each resync or only sometimes.

afck commented Jan 7, 2020

I guess the gas limit thing is not critical, no.
Does the chain spec contain the correct starting block for the gas limit contract, i.e. a block at which the contract was already deployed?
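For reference, a quick way to check this from the spec file - a sketch only, since the exact parameter name may differ between the fork and upstream (later upstream releases call it blockGasLimitContractTransitions under the authorityRound params):

// Sketch: print the block-gas-limit-contract transition map from a chain spec,
// to confirm the contract only becomes active at a block where it is already deployed.
// The key name (blockGasLimitContractTransitions) is the one used by later upstream
// releases; the fork's spec may use a different name.
const fs = require("fs");

const spec = JSON.parse(fs.readFileSync("spec.json", "utf8"));
const aura = spec.engine && spec.engine.authorityRound && spec.engine.authorityRound.params;

console.log("blockGasLimitContractTransitions:",
  aura ? aura.blockGasLimitContractTransitions : undefined);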

varasev commented Jan 7, 2020

@d10r did you deploy the blockGasLimit contract before the block 5007800?

afck commented Jan 8, 2020

This reverts the configurable step duration map: https://github.com/poanetwork/parity-ethereum/tree/afck-revert-step-d
It's my prime suspect regarding the Mismatch bug. Could you try whether it syncs successfully with this branch?

d10r commented Jan 14, 2020

@varasev, @afck thx for helping with the investigation.

I just tried with branch afck-revert-step-d, that didn't help.
Also, I have to correct my statement about "normal" (non-archive) full nodes syncing correctly. It turns out that's the case only when using snapshots. If --no-warp is set, syncing fails just like in the archive node configuration.

@d10r did you deploy the blockGasLimit contract before the block 5007800?

yes

I finally had the opportunity to analyze it a bit further and made a few interesting findings.
For a start, here's a log extract of a node trying to sync (with the Failed to call ... entries removed):

2020-01-14 13:00:19  Applying validator set change signalled at block 5132400
2020-01-14 13:00:20  Block gas limit was changed from 10000000 to 2000000.
2020-01-14 13:00:20  Block gas limit was changed from 2000000 to 10000000.
2020-01-14 13:00:20  Signal for transition within contract. New list: [0x6ccaa51f295652dc33f4d8ce12379eac3594f3d2, 0x57b2b56d6cdc609fcdd086a70b02031e72eefef7]
2020-01-14 13:00:20  Applying validator set change signalled at block 5133100
2020-01-14 13:00:20  Block gas limit was changed from 10000000 to 2000000.
2020-01-14 13:00:20  Block gas limit was changed from 2000000 to 10000000.
2020-01-14 13:00:20  Signal for transition within contract. New list: [0x6ccaa51f295652dc33f4d8ce12379eac3594f3d2, 0x57b2b56d6cdc609fcdd086a70b02031e72eefef7]
2020-01-14 13:00:20  Applying validator set change signalled at block 5133800
2020-01-14 13:00:21  Syncing #5134310 0xfeed…85c9   856.20 blk/s   50.0 tx/s    6.9 Mgas/s      0+28367 Qed  #5162684    3/25 peers   5 MiB chain 10 MiB db 42 MiB queue 14 MiB sync  RPC:  0 conn,    0 req/
s,    0 µs
2020-01-14 13:00:21  Signal for transition within contract. New list: [0x6ccaa51f295652dc33f4d8ce12379eac3594f3d2]
2020-01-14 13:00:21  Applying validator set change signalled at block 5134431
2020-01-14 13:00:21  Validator 0x57b2…fef7 could not be reported Transaction error (Transaction has invalid signature: Crypto error (Invalid address).)
2020-01-14 13:00:21  Stage 4 block verification failed for #5134470 (0x0594…49ab)
Error: Error(Engine(NotProposer(Mismatch { expected: 0x6ccaa51f295652dc33f4d8ce12379eac3594f3d2, found: 0x57b2b56d6cdc609fcdd086a70b02031e72eefef7 })), State { next_error: None, backtrace: InternalBacktrac
e { backtrace: None } })
2020-01-14 13:00:21  
Bad block detected: Error(Engine(NotProposer(Mismatch { expected: 0x6ccaa51f295652dc33f4d8ce12379eac3594f3d2, found: 0x57b2b56d6cdc609fcdd086a70b02031e72eefef7 })), State { next_error: None, backtrace: Int
ernalBacktrace { backtrace: None } })
RLP: f90249f90244a02ee0274b8aeb6d561013394848d638a70b481700c752fe48f9938e73ecada1eda01dcc4de8dec75d7aab85b567b6ccd41ad312451b948a7413f0a142fd40d493479457b2b56d6cdc609fcdd086a70b02031e72eefef7a0f7d9878c722a
2a3d47d6424e5b033e8bd772443a2c00f7a126950f74f581c579a056e81f171bcc55a6ff8345e692c0f86e5b48e01b996cadc001622fb5e363b421a056e81f171bcc55a6ff8345e692c0f86e5b48e01b996cadc001622fb5e363b421b90100000000000000000
0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000000000000000090fffffffffffffffffffffffffffffffd834e58868398968080845e0bb1f39fde830205098f5061726974792d457468657265756d86312e33362e
30826c698412cf2397b84180129169720b99cbf931fdd8ded13dd4b2c11db3d29fa57e01766f4f17eeee1520797bc855674c0a2d1880b9b52bb48d6b5ade00e0fff315c981b9f8e0e8537b01c0c0
Header: Header { parent_hash: 0x2ee0274b8aeb6d561013394848d638a70b481700c752fe48f9938e73ecada1ed, timestamp: 1577824755, number: 5134470, author: 0x57b2b56d6cdc609fcdd086a70b02031e72eefef7, transactions_ro
ot: 0x56e81f171bcc55a6ff8345e692c0f86e5b48e01b996cadc001622fb5e363b421, uncles_hash: 0x1dcc4de8dec75d7aab85b567b6ccd41ad312451b948a7413f0a142fd40d49347, extra_data: [222, 131, 2, 5, 9, 143, 80, 97, 114, 10
5, 116, 121, 45, 69, 116, 104, 101, 114, 101, 117, 109, 134, 49, 46, 51, 54, 46, 48, 130, 108, 105], state_root: 0xf7d9878c722a2a3d47d6424e5b033e8bd772443a2c00f7a126950f74f581c579, receipts_root: 0x56e81f1
71bcc55a6ff8345e692c0f86e5b48e01b996cadc001622fb5e363b421, log_bloom: 0x0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000, gas_used: 0, gas_limit: 10000000, difficulty: 340282366920938463463374607431768211453, seal: [[132, 18, 207, 35, 151], [184, 65, 128, 18, 145, 105, 114, 11, 153, 203, 249, 49, 253, 216, 222, 209, 61, 212, 178, 193, 29, 179, 210, 159, 165, 126, 1, 118, 111, 79, 23, 238, 238, 21, 32, 121, 123, 200, 85, 103, 76, 10, 45, 24, 128, 185, 181, 43, 180, 141, 107, 90, 222, 0, 224, 255, 243, 21, 201, 129, 185, 248, 224, 232, 83, 123, 1]], hash: Some(0x05942b556e1c3b563061d80f1c5f77475aed6e153dd38c94eb3e5d8559c249ab) }
Uncles: 
Transactions:

2020-01-14 13:00:26  Syncing #5134469 0x2ee0…a1ed    32.00 blk/s    1.6 tx/s    0.2 Mgas/s      0+    0 Qed  #5126248    1/25 peers   5 MiB chain 10 MiB db 0 bytes queue 5 KiB sync  RPC:  0 conn,    0 req/s,    0 µs
2020-01-14 13:00:31  Syncing #5134469 0x2ee0…a1ed     0.00 blk/s    0.0 tx/s    0.0 Mgas/s      0+    0 Qed  #5126210    1/25 peers   5 MiB chain 10 MiB db 0 bytes queue 5 KiB sync  RPC:  0 conn,    0 req/s,    0 µs

I then looked at the logs of the 2 nodes active as validators at that time.
As can be seen above, the epoch duration is set to 700 blocks, with the previous epoch change taking place at 5133800. That's where a validator set change is signalled.
The next epoch change was due at 5134500. However, before reaching that, a validator set change is signalled at 5134431. This change was probably caused by one of the validators going offline for a few minutes during that epoch. Here's a list of the block authors around there:

5134380: 0x6ccaa51f295652dc33f4d8ce12379eac3594f3d2
5134381: 0x57b2b56d6cdc609fcdd086a70b02031e72eefef7
5134382: 0x6ccaa51f295652dc33f4d8ce12379eac3594f3d2
5134383: 0x57b2b56d6cdc609fcdd086a70b02031e72eefef7
5134384: 0x6ccaa51f295652dc33f4d8ce12379eac3594f3d2
5134385: 0x57b2b56d6cdc609fcdd086a70b02031e72eefef7
5134386: 0x6ccaa51f295652dc33f4d8ce12379eac3594f3d2
5134387: 0x57b2b56d6cdc609fcdd086a70b02031e72eefef7
5134388: 0x6ccaa51f295652dc33f4d8ce12379eac3594f3d2
5134389: 0x6ccaa51f295652dc33f4d8ce12379eac3594f3d2
5134390: 0x6ccaa51f295652dc33f4d8ce12379eac3594f3d2
5134391: 0x6ccaa51f295652dc33f4d8ce12379eac3594f3d2
5134392: 0x6ccaa51f295652dc33f4d8ce12379eac3594f3d2
...
5134462: 0x6ccaa51f295652dc33f4d8ce12379eac3594f3d2
5134463: 0x6ccaa51f295652dc33f4d8ce12379eac3594f3d2
5134464: 0x6ccaa51f295652dc33f4d8ce12379eac3594f3d2
5134465: 0x6ccaa51f295652dc33f4d8ce12379eac3594f3d2
5134466: 0x6ccaa51f295652dc33f4d8ce12379eac3594f3d2
5134467: 0x6ccaa51f295652dc33f4d8ce12379eac3594f3d2
5134468: 0x6ccaa51f295652dc33f4d8ce12379eac3594f3d2
5134469: 0x57b2b56d6cdc609fcdd086a70b02031e72eefef7
5134470: 0x57b2b56d6cdc609fcdd086a70b02031e72eefef7
5134471: 0x6ccaa51f295652dc33f4d8ce12379eac3594f3d2
5134472: 0x57b2b56d6cdc609fcdd086a70b02031e72eefef7
5134473: 0x6ccaa51f295652dc33f4d8ce12379eac3594f3d2
5134474: 0x57b2b56d6cdc609fcdd086a70b02031e72eefef7
5134475: 0x6ccaa51f295652dc33f4d8ce12379eac3594f3d2

Up to 5134387, block authorship alternated between 0x6cca and 0x57b2.
Then 0x57b2 went offline until 5134469. During that period, 0x6cca correctly reported 0x57b2 as misbehaving:

Dec 31 20:32:40 tau1 parity[25380]: 2019-12-31 20:32:40 UTC IO Worker #0 WARN engine  Reported benign validator misbehaviour 0x57b2…fef7

Then a validator set switch to 0x6cca only was signalled:

Dec 31 20:32:40 tau1 parity[25380]: 2019-12-31 20:32:40 UTC IO Worker #0 INFO engine  Signal for transition within contract. New list: [0x6ccaa51f295652dc33f4d8ce12379eac3594f3d2]
Dec 31 20:32:40 tau1 parity[25380]: 2019-12-31 20:32:40 UTC IO Worker #0 INFO import  Imported #5134431 0x89f3…bb7b (1 txs, 0.07 Mgas, 0 ms, 0.68 KiB)

and applied a few blocks later:

Dec 31 20:36:58 tau1 parity[25579]: 2019-12-31 20:36:58 UTC Verifier #2 INFO engine  Applying validator set change signalled at block 5134431
Dec 31 20:36:58 tau1 parity[25579]: 2019-12-31 20:36:58 UTC Verifier #2 INFO import  Imported #5134455 0x71e9…3b75 (1 txs, 0.21 Mgas, 291 ms, 0.94 KiB)

Next, 0x57b2 came back online and started authoring blocks again at 5134469.
That's also the block at which syncing stops.

The chain kept running after that (it's currently at 5372354). However, no more validator set changes were signalled, although many epoch changes have happened since.
When querying the ValidatorSet contract with getValidators(), it still returns both 0x6cca and 0x57b2. That seems to suggest that the signalled change to 0x6cca was never finalized.

This leaves me with some question marks:

  • Is this kind of validator set change signalling, triggered by reporting, expected behaviour in posdao?
  • What is the reason for the last validator set switch not finalizing? I'd have expected that to happen as soon as 0x57b2 came back online.
  • How can it be that nodes (bootnode and RPC node) which were syncing live as it happened accepted this order of events, while those trying to catch up later reject it? Is there any possible explanation for that other than a bug in the consensus code?
  • How could the switch to a one-node validator set even be finalized at 5134455? Doesn't that need 2f+1 nodes in AuRa?
  • How can the active validator set be different from the last one applied?
  • Why did 0x57b2 author 2 consecutive blocks 5134469 and 5134470? The block timestamps show no gap (5 second deltas), suggesting it was the expected order.

@varasev does that information give you an idea for what's going on?
Is there anything else I could check?

Btw. if you want to take a closer look yourself:

varasev commented Jan 15, 2020

Regarding this message:

Dec 31 20:32:40 tau1 parity[25380]: 2019-12-31 20:32:40 UTC IO Worker #0 WARN engine Reported benign validator misbehaviour 0x57b2…fef7

Benign reports cannot be a reason for removing a validator from the validator set (at least, we don't use reportBenign in our contracts). Please clarify exactly which contracts you used at that moment. Do you have a reportBenign function in your ValidatorSet contract?

It is weird that we see

Dec 31 20:32:40 tau1 parity[25380]: 2019-12-31 20:32:40 UTC IO Worker #0 INFO engine  Signal for transition within contract. New list: [0x6ccaa51f295652dc33f4d8ce12379eac3594f3d2]
Dec 31 20:32:40 tau1 parity[25380]: 2019-12-31 20:32:40 UTC IO Worker #0 INFO import  Imported #5134431 0x89f3…bb7b (1 txs, 0.07 Mgas, 0 ms, 0.68 KiB)

at the same time (20:32:40) as the benign report (an explanation is below).

As far as I know, the message Signal for transition within contract appears when the node catches the InitiateChange event.
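As an illustration, these events can also be listed directly over RPC - a sketch assuming the standard Parity validator-set event signature InitiateChange(bytes32 indexed parentHash, address[] newSet), ethers.js v5, and a placeholder ValidatorSet address:

// Sketch: list InitiateChange events in the problematic block range, to map the
// "Signal for transition within contract" log messages to concrete blocks.
const { ethers } = require("ethers");

const provider = new ethers.providers.JsonRpcProvider("http://localhost:8545"); // placeholder
const validatorSet = new ethers.Contract(
  "0x0000000000000000000000000000000000000000", // placeholder ValidatorSet address
  ["event InitiateChange(bytes32 indexed parentHash, address[] newSet)"],
  provider
);

async function main() {
  const events = await validatorSet.queryFilter(validatorSet.filters.InitiateChange(), 5134350, 5134500);
  for (const e of events) {
    console.log(`block ${e.blockNumber}: new set = ${e.args.newSet.join(", ")}`);
  }
}

main().catch(console.error);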

varasev commented Jan 15, 2020

I think everything happened as follows:

  1. At some point (block 5134388) 0x57b2 went offline for some reason.

  2. Then, at the end of the staking epoch, 0x57b2 was removed from the validator set as malicious by the Random contract (because the Random contract requires all validators to be online during the last collection round of the staking epoch) - this was approximately at block 5134431 - at that block an InitiateChange event with New list: [0x6ccaa51f295652dc33f4d8ce12379eac3594f3d2] was emitted.

    Since the network only had two validators and one of them was offline, the validator set couldn't be finalized, so we don't see the Applying validator set change log message right after InitiateChange is emitted.

  3. Then, a bit later, 0x57b2 came back online at block ~5134455 and wrote to its logs that it was ready to apply the validator set change:

Dec 31 20:36:58 tau1 parity[25579]: 2019-12-31 20:36:58 UTC Verifier #2 INFO engine  Applying validator set change signalled at block 5134431
Dec 31 20:36:58 tau1 parity[25579]: 2019-12-31 20:36:58 UTC Verifier #2 INFO import  Imported #5134455 0x71e9…3b75 (1 txs, 0.21 Mgas, 291 ms, 0.94 KiB)

It downloaded the skipped blocks from 5134455 to 5134469, and at that moment the network had two of its two validators online, so Parity should have called ValidatorSet.finalizeChange for the New list: [0x6ccaa51f295652dc33f4d8ce12379eac3594f3d2].

But then something went wrong...

When you call getValidators(), you still see the list of two validators: the validator set was not changed because finalizeChange was not called for some reason.

Could you attach the full logs for the validator nodes between blocks 5134350 and 5134500?

@afck could you please look into Parity's code to check whether the Applying validator set change signalled at block N log message guarantees that the ValidatorSet.finalizeChange function is called at block N (not at some later block)? As far as I understand, ARTIS uses our old fork from here: https://github.com/poanetwork/parity-ethereum/tree/parity-v2.5.9-posdao-v0.1.2

varasev commented Jan 15, 2020

However, no more validator set changes were signalled, although many epoch changes have happened since.

It means that ValidatorSet.emitInitiateChange is not called because the finalizeChange for the previous InitiateChange was not called by miners.

What does ValidatorSet.validatorsToBeFinalized() return at the moment? I see it returns ["0x6ccaA51F295652dC33F4d8ce12379eac3594f3D2"]

If finalizeChange had been called, _finalizeValidators would be empty by now, but ValidatorSet.validatorsToBeFinalized() shows it is not empty.
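A small sketch of that check over RPC (ethers.js v5, placeholder ValidatorSet address; the ABI fragments are assumptions - in particular, the return type of validatorsToBeFinalized() may differ between contract versions, so adjust it to the deployed ABI):

// Sketch: compare the active validator set with the pending (not yet finalized) one.
const { ethers } = require("ethers");

const provider = new ethers.providers.JsonRpcProvider("http://localhost:8545"); // placeholder
const validatorSet = new ethers.Contract(
  "0x0000000000000000000000000000000000000000", // placeholder ValidatorSet address
  [
    "function getValidators() view returns (address[])",
    // assumed return type - adjust to the deployed contract's ABI if it differs
    "function validatorsToBeFinalized() view returns (address[])",
  ],
  provider
);

async function main() {
  console.log("getValidators():          ", await validatorSet.getValidators());
  // A non-empty result here means an InitiateChange was signalled but
  // finalizeChange has not been called for it yet.
  console.log("validatorsToBeFinalized():", await validatorSet.validatorsToBeFinalized());
}

main().catch(console.error);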

varasev commented Jan 15, 2020

Also, please clarify what exactly happened to the second node at that time: was it turned off, or did it just lose its connection to the outside world?

My assumption for the latter case is that the second node could have continued producing blocks on its own chain. Then, after the connection was reestablished, the blocks were reorganized and the node went back to the first node's chain, but at that moment something went wrong there.

The full logs for that block range would be helpful for understanding what happened.

d10r commented Jan 15, 2020

That's what was deployed: https://github.com/lab10-coop/posdao-contracts/tree/dh-staking-ui
Forked from claim-reward branch, added epoch rewards (for stakers and for a sustainability fund), changed MAX_VALIDATORS from 19 to 7.

I can confirm that both nodes were running parity-v2.5.9-posdao-v0.1.2.

Your theory of what may have happened is consistent with the logs as far as I can see; in fact, there was also a reorg on 0x57b2.

Here are the logs:
0x6cca.log
0x57b2.log

You may wonder why both nodes were restarted during that time period. That needs a bit more background:
When we upgraded tau1 to posdao, 4 previous validators were set as initial validators. The upgrade went fine, all looked good for several days. One validator was kicked out of the set after some downtime (router issues), but that was it.
Then one of us started testing the staking logic. He set up an additional node with a new mining key. However, he made a copy-and-paste mistake in the node config, setting different addresses for unlocking and for the engine signer. Next he added that node to the candidates. Since the initial validators were still without stake, adding a staked pool had the effect of the other validators (except one, because the flag FIRST_VALIDATOR_IS_UNREMOVABLE was set) being removed from the active validator set. However, with one of the 2 remaining validators failing to author blocks, the chain couldn't finalize anymore. It forked into 3 branches.
Since this was during the Christmas holidays, I didn't initially have access to the misconfigured node, there was no state to be lost, and the priority was just to have the testnet repaired without much effort or delay, so I decided to "rewind" the chain by a few epochs (= hours in this config) and re-sync all nodes to that new branch.
In order to do so, I first isolated those two validator nodes, such that they couldn't fetch higher blocks and get stuck on a dead fork that way.
The reboots you see in the log were done in order to re-configure the validator nodes to leave that isolation. By that time the new chain was longer than the previous one and I considered it safe to have them listen on public ports again, and to sync bootnodes, rpc etc.
This did indeed all work out as expected. However, I did not initially notice that those downtimes had again caused issues, because none of the nodes got stuck or forked away. I only noticed when trying to re-sync the archive node - because with that config, warp sync is off.

Maybe my way to handle this was a bit sloppy, but check the timestamps: I really just wanted to get this done and out of my head before starting into the new year with a few days off with family :-)

It wouldn't bother me much to just rewind the chain again in order to fix tau1. But it looks to me like there's some bug in the AuRa consensus which allows this to happen and which is maybe worth figuring out before upgrading the mainnet. With posdao, validator set changes become much more dynamic and automated - this may expose engine issues which were already there before, but were less likely to occur.

varasev commented Jan 16, 2020

Below I describe what I see in the logs regarding a possible reason why finalizeChange was not called. Regarding blocks 5134469 and 5134470, I haven't analyzed the logs yet - I will try to do that a bit later.

The second node produced block 5134387 and then was stopped by you. Then you stopped the first node after it produced the next block 5134388.

Then after 46 seconds you started the first node again, but the second node was still offline and you started it ~11 minutes later.

We see that, after it started, the second node lived on its own fork for a short time: it logs Reported benign validator misbehaviour 0x6cca…f3d2 (meaning it doesn't see the first node) and produces block 5134388 with a different hash, 0xef49…bc4b (versus the hash 0x96ff…1d5d on the first node), but then within a second it reorgs to the right hash (Reorg to #5134388 0x96ff…1d5d) and downloads the skipped blocks.

For the blocks 5134454 and 5134455 we see on the first node the message Reported benign validator misbehaviour 0x57b2…fef7 (this means the first node still doesn't see the second one).

And for the same blocks we see the same message on the second node: Reported benign validator misbehaviour 0x6cca…f3d2 (this means the second node still doesn't see the first one).

But the block hashes on both nodes for those blocks are the same: Imported #5134454 0xc723…d6ae and Imported #5134455 0x71e9…3b75 (how could the hashes match if the nodes were not connected to each other?)

On the second node, before the block 5134455, we see Applying validator set change signalled at block 5134431, and after that the second node doesn't produce any blocks for some reason (a possible reason is described below).

But on the first node before the block 5134455 we see the same message: Applying validator set change signalled at block 5134431.

And before the block 5134455 we see that both nodes were not connected to each other (because both display Reported benign validator misbehaviour).

Maybe at that moment the validator nodes were connected to bootnode (because we see that the block hashes match), but they weren't connected to each other for some reason? (because we see Reported benign validator misbehaviour on both nodes)

One more thing (maybe answers some questions above):

On the first node the block 5134455 was imported twice with different hashes (first 0xfee0…3fae and then 0x71e9…3b75). The first time, the block was produced by the first node itself (because it appears under an IO Worker message); the second time it was produced by the second node (because it appears under a Verifier message).

Before its restart, the second node imported the block 5134455 with the hash 0x71e9…3b75, but after restarting it reorged to the block 5134455 with hash 0xfee0…3fae from the first node: 2019-12-31 20:39:07 UTC Reorg to #5134455 0xfee0…3fae (0x71e9…3b75 #5134454 0xc723…d6ae )

We see the message Applying validator set change signalled at block 5134431 on both nodes for Imported #5134455 0x71e9…3b75, but don't see the same message for the block 5134455 with hash 0xfee0…3fae on the first node.

Since the second node didn't produce any blocks after Imported #5134455 0x71e9…3b75, I guess finalizeChange for the new validator set [0x6ccaa51f295652dc33f4d8ce12379eac3594f3d2] was called at block #5134455 0x71e9…3b75 (on the second node) - that's why, on the second node, we don't see Imported ... messages after block 5134455 (before the second node was restarted).

On the first node, in turn, the new validator set [0x6ccaa51f295652dc33f4d8ce12379eac3594f3d2] was not finalized because we don't see Applying validator set change signalled at block 5134431 message for the block #5134455 0xfee0…3fae.

After the second node was restarted at 20:38:57, it reorged to #5134455 0xfee0…3fae (at which finalizeChange was not called, because the validator nodes weren't connected to each other) and both nodes continued producing blocks.

varasev commented Jan 16, 2020

@afck Am I right that an IO Worker message in the logs tells us that the block was produced by our own validator node, and a Verifier message tells us about a block that was produced by another validator?

d10r commented Jan 16, 2020

Maybe at that moment the validator nodes were connected to bootnode (because we see that the block hashes match), but they weren't connected to each other for some reason?

That is very plausible.
If I remember right, the nodes were restarted with a changed p2p port, in order to get them connected to a bootnode again. Before that, they were in an isolation config, connected only to each other. It took them a few blocks to get re-connected.
0x57b2 was restarted twice, in both cases followed by a re-org.

Still, I don't see why that order of events ended up with the status quo of a validator set change request in limbo and the impossibility of syncing a new node to that chain.

varasev commented Jan 16, 2020

As I wrote here, I believe the reason for the missing finalizeChange call is that it was actually called on the second node at block #5134455 0x71e9…3b75, but wasn't called on the first node, whose block #5134455 0xfee0…3fae was chosen as the main one. Since the chain switched to the block #5134455 0xfee0…3fae, finalizeChange ended up never being called.

The reason for Error(Engine(NotProposer(Mismatch { expected: 0x6ccaa51f295652dc33f4d8ce12379eac3594f3d2, found: 0x57b2b56d6cdc609fcdd086a70b02031e72eefef7 })) at block 5134470 is that the newly syncing node expects 5134470 to be produced by 0x6cca instead of 0x57b2. The validator set is [0x6cca, 0x57b2], so since 5134468 was produced by 0x6cca, block 5134469 should be produced by 0x57b2 and the next block 5134470 should be produced by 0x6cca.
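For illustration, a sketch of that expectation - AuRa's usual round-robin rule, where the step is the block timestamp divided by the step duration and the expected author is validators[step % validators.length]; the 5-second step duration, the validator ordering, and the RPC URL are assumptions here:

// Sketch of AuRa's expected-proposer rule, checked against actual block authors.
const { ethers } = require("ethers");

const provider = new ethers.providers.JsonRpcProvider("http://localhost:8545"); // placeholder
const STEP_DURATION = 5; // seconds - assumed from the 5-second block deltas mentioned above
const validators = [
  "0x6ccaa51f295652dc33f4d8ce12379eac3594f3d2",
  "0x57b2b56d6cdc609fcdd086a70b02031e72eefef7",
]; // assumed ordering of the two-validator set

async function main() {
  for (const n of [5134468, 5134469, 5134470, 5134471]) {
    const block = await provider.getBlock(n);
    const step = Math.floor(block.timestamp / STEP_DURATION);
    const expected = validators[step % validators.length];
    console.log(`block ${n}: expected ${expected}, actual ${block.miner.toLowerCase()}`);
  }
}

main().catch(console.error);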

For blocks 5134469 and 5134470, we see the following picture in the logs:

...
20:38:30 - Imported #5134465 (produced by 0x6cca in 10 seconds, because 0x57b2 is offline)
20:38:40 - Imported #5134466 (produced by 0x6cca in 10 seconds, because 0x57b2 is offline)
20:38:50 - Imported #5134467 (produced by 0x6cca in 10 seconds, because 0x57b2 is offline)
20:39:00 - Imported #5134468 (produced by 0x6cca in 10 seconds, because 0x57b2 is offline)

20:39:01 - at this moment 0x57b2 starts

20:39:08 - Imported #5134469 (produced by 0x57b2, but this fact is not written into logs)
20:39:16 - Imported #5134470 (produced by 0x57b2. It still doesn't see the first validator 0x6cca)

20:39:20 - Imported #5134471 (produced by 0x6cca in 5 seconds)
20:39:25 - Imported #5134472 (produced by 0x57b2 in 5 seconds)
...

For the second node we see 20:39:16 UTC Reported benign validator misbehaviour 0x6cca…f3d2 for Imported #5134470 0x0594…49ab. It means that at that moment the second validator (0x57b2) still didn't see the first one (0x6cca).

For block 5134469 the first validator (0x6cca) expects the second one (0x57b2) to produce the block, because 0x6cca produced the previous block 5134468. For block 5134469 the first validator doesn't show Reported benign validator misbehaviour 0x57b2…fef7, so it seems the first validator was connected to the second at that block.

The weird thing is that the block 5134470 should have been produced by 0x6cca, but in fact it was produced by 0x57b2. I guess the reason is that the second validator has Reported benign validator misbehaviour 0x6cca…f3d2 for the block 5134470. So the first validator had seen the second one, but the second validator hadn't seen the first one at the moment of block 5134470. Somehow this conflict was resolved in favor of the second validator.

It's hard to say why the author of 5134470 is 0x57b2, but this is definitely related to the node's restart and sync.

Then we see that the next blocks (5134471, 5134472 and so on) are produced in the right order.

Just in case: could you check the authors of 5134469 and 5134470 separately on the first node, and then on the second one? I mean open RPC on the first node and check there, then do the same on the second node.

varasev commented Jan 16, 2020

Anyway, I think the reason for such a collision is that the chain had only two validators and one of them was offline, so it was hard (or buggy?) for the nodes to follow the AuRa rules through the online-offline switches. I heard somewhere that AuRa should run at least three validators to be sustainable.

It is also possible that AuRa still has some bugs and is not well tested for the case where two nodes lose connection to each other, considering the still-open issues for the same error: https://github.com/paritytech/parity-ethereum/issues?utf8=%E2%9C%93&q=is%3Aissue+is%3Aopen+Error%28Engine%28NotProposer%28Mismatch

Also, as noted in this issue, time desynchronization between the nodes can be one of the reasons.

It would be interesting to try to reproduce the case with two validators where the second one is turned off, then InitiateChange is emitted to switch to a one-item validator set (containing the first validator), and then the second validator is turned on again, to see what happens with finalization and the order of block authors. I will try to do that with the simple setup https://github.com/varasev/test-block-reward

varasev commented Jan 16, 2020

Regarding my above note:

Since the second node didn't produce any blocks after Imported #5134455 0x71e9…3b75, I guess finalizeChange for the new validator set [0x6ccaa51f295652dc33f4d8ce12379eac3594f3d2] was called at block #5134455 0x71e9…3b75 (on the second node) - that's why, on the second node, we don't see Imported ... messages after block 5134455 (before the second node was restarted).

I repeated the same case with https://github.com/varasev/test-block-reward/tree/87476afef319b729edd2646672c483c84adeaa4c (see the node's logs after block #20): https://github.com/varasev/test-block-reward/blob/87476afef319b729edd2646672c483c84adeaa4c/scripts/watch.js#L119-L127 - here, at block #20, our one-validator network switches to a nonexistent validator, and then we see the following picture in the logs:

(I used Parity 2.6.5-beta)

2020-01-16 17:07:55  Applying validator set change signalled at block 21
2020-01-16 17:07:55  Running AuRa with a single validator implies instant finality. Use a database?
2020-01-16 17:07:55  Running AuRa with a single validator implies instant finality. Use a database?
2020-01-16 17:07:55  Imported #21 0x53c6…1a96 (1 txs, 0.03 Mgas, 2 ms, 0.71 KiB)
2020-01-16 17:08:11     1/25 peers      9 KiB chain  223 KiB db  0 bytes queue    1 KiB sync  RPC:  0 conn,    0 req/s,    0 µs
2020-01-16 17:08:41     1/25 peers      9 KiB chain  223 KiB db  0 bytes queue    1 KiB sync  RPC:  0 conn,    0 req/s,    0 µs
2020-01-16 17:09:11     1/25 peers      9 KiB chain  223 KiB db  0 bytes queue    1 KiB sync  RPC:  0 conn,    0 req/s,    0 µs
2020-01-16 17:09:41     1/25 peers      9 KiB chain  223 KiB db  0 bytes queue    1 KiB sync  RPC:  0 conn,    0 req/s,    0 µs
2020-01-16 17:10:11     1/25 peers      9 KiB chain  223 KiB db  0 bytes queue    1 KiB sync  RPC:  0 conn,    0 req/s,    0 µs
2020-01-16 17:10:41     1/25 peers      9 KiB chain  223 KiB db  0 bytes queue    1 KiB sync  RPC:  0 conn,    0 req/s,    0 µs
2020-01-16 17:11:11     1/25 peers      9 KiB chain  223 KiB db  0 bytes queue    1 KiB sync  RPC:  0 conn,    0 req/s,    0 µs

It is the same as in your second validator's logs (lines 227-233) - the network stopped for the second node because the validator set was finalized and after that there were no available validators anymore (since the second validator wasn't connected to the first one).

I haven't yet tried to reproduce the case with two validators where the second one is turned off, then InitiateChange is emitted to switch to a one-item validator set (containing the first validator), and then the second validator is turned on again. I will do that a bit later.

varasev commented Jan 17, 2020

@d10r Could you please attach the full bootnode logs for the same block range (5134350...5134500)? If you have more than one bootnode, please attach the logs from all of them.

varasev commented Jan 17, 2020

If I remember right, the nodes were restarted with a changed p2p port

You are right - the first validator initially used port 30333, but then the port was changed to 30303:

Dec 31 20:34:42 tau1 parity[25579]: 2019-12-31 20:34:42 UTC IO Worker #0 INFO network  Public node URL: enode://a0e417fc9d3494bbd81a3b7c35deb0a564011a28b1b99a35bd6a0471d246af54809a851a849b2c85a9963d5a360a22c4aeb306737ccec4739f5fc01182527cff@5.9.14.80:30303

For the second validator, port 30343 was unchanged.

afck commented Jan 17, 2020

I wonder whether it has anything to do with openethereum#11107 (comment).
Anyway, I agree that this looks like a bug in the consensus code.

how could the hashes match if the nodes were not connected to each other?

The "benign misbehavior" is not really about the connection status, but about skipped blocks (but I don't see the "step gap" message), something mysterious about an invalid seal (I don't understand the comments there, but it sounds like it might be related), something about invalid empty steps, or wrong proposer.

Am I right that IO Worker message in logs tells us that the block was produced by our validator node, and the Verifier message tells us about block that was produced by another validator?

That's my understanding as well.

whether the Applying validator set change signalled at block N log message guarantees that the ValidatorSet.finalizeChange function is called at block N (not at some later block)?

When I tried on Wednesday, I got lost in the code, so let me try step-by-step:

So it looks to me like that log message can only appear for a block in which finalizeChange has been called, or in which the contract address was changed.

Edit: Sorry, that was only emitInitiateChange! finalizeChange is actually called in SafeContract::on_epoch_begin, which is called in AuthorityRound::on_new_block, which is called in OpenBlock::new, which is called in many places. The relevant one is probably this one in Client::prepare_open_block, where is_epoch_begin is set if Chain::epoch_transition is Some. I assume that is the case if Chain::insert_epoch_transition has been called for that block before. That happens in the constructor and here in Importer::check_epoch_end, which is called in Importer::commit_block, which is called in two import methods.

Importer::check_epoch_end calls AuthorityRound::is_epoch_end

So I guess finalizeChange is called in the first block from whose perspective the one containing the InitiateChange event is finalized?
This should all use block hashes, so a fork or reorg should not confuse that mechanism; but it's all pretty convoluted, so I'm not sure about anything here…

varasev commented Jan 17, 2020

I haven't yet tried to reproduce the case with two validators where the second one is turned off, then InitiateChange is emitted to switch to a one-item validator set (containing the first validator), and then the second validator is turned on again. I will do that a bit later.

I tried to reproduce the similar case, but unfortunately didn't see the same behaviour.

The test scenario (requires Parity >= 2.6.5-beta):

git clone -b two-validators-bug https://github.com/varasev/test-block-reward
cd test-block-reward
npm i
npm start

At block 10 the second node is turned off, then at block 15 the InitiateChange event is emitted (to switch to a one-item validator set), and then at block 25 the second node starts again (and finalization happens). The logs will be available in parity_data1/log, parity_data2/log, and in the console (via scripts/watch.js).

To repeat the test, run npm restart. To stop and clean up, run npm run clear.

I'll also try to test that with https://github.com/poanetwork/parity-ethereum/releases/tag/parity-v2.5.9-posdao-v0.1.2 (this will require a spec.json in a different format).

varasev changed the title from "Some questions regarding blockGasLimit and Error(Engine(NotProposer(Mismatch" to "Error(Engine(NotProposer(Mismatch" on Jan 17, 2020

varasev commented Jan 17, 2020

I believe the main reason for this collision is that the validator set was not finalized. If the finalization had been successful, we wouldn’t see the wrong order of block authors (for blocks 5134468-5134470) and wouldn’t see the Mismatch error for block 5134470.

Again:

If we look inside 0x6cca.log, we see that the validator set finalization happened at block #5134455 0x71e9…3b75, produced by the second validator 0x57b2 (see line 548).

Note that the block 5134455 had already been produced by the first validator before that (with hash 0xfee0…3fae) - see line 542.

And from then on the chain uses the block #5134455 0xfee0…3fae, not the block #5134455 0x71e9…3b75, because we see that the second validator reorged to #5134455 0xfee0…3fae (see 0x57b2.log, line 275).

So the finalization happened in the block #5134455 0x71e9…3b75 (I wrote about that here), but didn't happen in the block #5134455 0xfee0…3fae.

The question is: why didn't the finalization happen in the chain after the second node was restarted and reorged to the block #5134455 0xfee0…3fae?

It seems the first node somehow remembered that finalizeChange had already been called (for the block #5134455 0x71e9…3b75 - see line 548 in 0x6cca.log), so it didn't call finalizeChange again at block ~5134469 when the second node came back online.

Or maybe the first validator (0x6cca) shouldn't have taken into account the validator set change application that happened at block #5134455 0x71e9…3b75 (produced by 0x57b2), because of the benign misbehaviour of 0x57b2 (see lines 547-548 in 0x57b2.log).

varasev commented Jan 17, 2020

Or the reason was simply clock desynchronization between the nodes: openethereum#10607 (comment)

varasev commented Jan 17, 2020

I haven't yet tried to reproduce the case with two validators where the second one is turned off, then InitiateChange is emitted to switch to a one-item validator set (containing the first validator), and then the second validator is turned on again. I will do that a bit later.

I tried to reproduce a similar case with parity-v2.5.9-posdao-v0.1.2, but unfortunately didn't see the same behaviour either.

The test scenario:

git clone -b two-validators-bug-parity-2-5-9 https://github.com/varasev/test-block-reward
cd test-block-reward
npm i
mkdir parity-ethereum
cd parity-ethereum
wget https://github.com/poanetwork/parity-ethereum/releases/download/parity-v2.5.9-posdao-v0.1.2/parity-macos.zip
unzip parity-macos.zip
cd ..
npm start

At block 10 the second node is turned off, then at block 15 the InitiateChange event is emitted (to switch to a one-item validator set), and then at block 100 the second node starts again (and finalization happens). The logs will be available in parity_data1/log, parity_data2/log, and in the console (via scripts/watch.js).

To repeat the test, run npm restart. To stop and clean up, run npm run clear.

@d10r So, I think in your case it's better to rewind the chain as you suggested and then add one more validator, so that the chain has three validators and is more sustainable. Also, it's better to make sure the nodes' clocks are synchronized.

varasev commented Jan 18, 2020

So it looks to me like that log message can only appear for a block in which the emitInitiateChange has been called, or in which the contract address was changed.

@afck Do you mean the Applying validator set change signalled at block N message?

I tried to launch the test scenario (you can repeat it): #203 (comment) - it shows that the Applying validator set change signalled ... message definitely appears when finalizeChange is called.

emitInitiateChange is called (and thus the InitiateChange event is emitted) much earlier. For example, in that test case the InitiateChange event is emitted at block 15, and then, after the second node starts again at block 100, the Applying validator set change signalled ... message appears.

For the InitiateChange event there is another message: Signal for transition within contract ...

d10r commented Jan 18, 2020

I hadn't spotted that 0x6cca saw two versions of block #5134455 and that the rejected version was the one triggering the message Applying validator set change. This could be an important hint.

Clock de-synchronisation is rather unlikely. Both nodes were running on the same machine at the time (thus even network-induced delays were minimal).

Just in case: could you check the authors of 5134469 and 5134470 separately on the first node, and then on the second one? I mean open RPC on the first node and check there, then do the same on the second node.

They match.
node 0x6cca:

root@tau1:~# curl -s -X POST -H "Content-Type: application/json" --data '{"jsonrpc":"2.0","method":"eth_getBlockByNumber","params":["0x4E5885", false],"id":1}' http://localhost:8545 | jq '.result.author'
"0x57b2b56d6cdc609fcdd086a70b02031e72eefef7"
root@tau1:~# curl -s -X POST -H "Content-Type: application/json" --data '{"jsonrpc":"2.0","method":"eth_getBlockByNumber","params":["0x4E5886", false],"id":1}' http://localhost:8545 | jq '.result.author'
"0x57b2b56d6cdc609fcdd086a70b02031e72eefef7"

node 0x57b2:

root@tau1:~# curl -s -X POST -H "Content-Type: application/json" --data '{"jsonrpc":"2.0","method":"eth_getBlockByNumber","params":["0x4E5885", false],"id":1}' http://localhost:8585 | jq '.result.author'
"0x57b2b56d6cdc609fcdd086a70b02031e72eefef7"
root@tau1:~# curl -s -X POST -H "Content-Type: application/json" --data '{"jsonrpc":"2.0","method":"eth_getBlockByNumber","params":["0x4E5886", false],"id":1}' http://localhost:8585 | jq '.result.author'
"0x57b2b56d6cdc609fcdd086a70b02031e72eefef7"

Unfortunately, the logs of the connected bootnode and RPC node were already cut off at a later point (journalctl limits).

So, for tau1 I'll just rewind again. Btw. I've never seen parity db reset <blockNr> working with AuRa. Has it worked for you?

Thanks to your failed reproduction test I feel a bit more comfortable.
Sure, we can avoid running the chain with such a low validator count. But since posdao has full authority over the validator set, I can imagine several scenarios where the validator set is quickly and automatically shrunk to a problematic level anyway, e.g. caused by widespread network issues, a DoS attack on the chain causing block authoring timeouts, some client bug, etc.
Which leaves the question of whether in this case I was just very (un)lucky with the timing of node restarts to end up in this situation. Because - as far as I understand - the reason for the missing application of the validator set change signalled at #5134431 is still a mystery, and it shouldn't have happened even with this unfortunate order of events.
I'll take a snapshot of the status quo in case we get an idea for something else to test.

afck commented Jan 20, 2020

@varasev: Yes, I meant the Applying validator set change signalled at block N message, and you're right that emitInitiateChange is called earlier, not in the same block. The message appears when the block with the InitiateChange event is finalized, I think, so several blocks later.

varasev commented Jan 20, 2020

Btw. I've never seen parity db reset <blockNr> working with AuRa. Has it worked for you?

I just tried to do that - it works fine, but only for a node without state pruning (i.e. one that has --pruning=archive or pruning = "archive" in its config).

In order to rewind, launch parity db reset <numberOfBlocks> --config ./node.toml, where <numberOfBlocks> is the number of blocks to remove. For example, if the node is currently at block 1000 and you specify <numberOfBlocks> as 100, its state will be rewound to block 900. (Actually, <numberOfBlocks> must be less than or equal to 64.)

Short docs for db reset command: https://wiki.parity.io/CLI-Sub-commands.html#db-reset

d10r commented Jan 22, 2020

@varasev that's weird. When I try to reset, I get this:

$ ./parity -c node.toml db reset 100
Loading config file from node.toml
2020-01-22 12:30:58  Configured for tau1.artis using AuthorityRound engine
2020-01-22 12:30:58  unable to get mut ref for engine for shutdown.
Attempting to reset to block with pruned state

This is with a config for an RPC node with tracing (e.g. for the explorer):

...
[footprint]
tracing = "on"
pruning = "archive"
fat_db = "on"
...

Does it work for you with a node configured for AuRa?

varasev commented Jan 22, 2020

I tried db reset with https://github.com/varasev/test-block-reward and Parity v2.6.5-beta, but I didn't set tracing = "on" and fat_db = "on" in [footprint] (I set only pruning = "archive"). I will try https://github.com/varasev/test-block-reward with https://github.com/poanetwork/parity-ethereum/releases/tag/parity-v2.5.9-posdao-v0.1.2 and tracing = "on", pruning = "archive", fat_db = "on" a bit later.

Has your node always worked with pruning = "archive"? Or did you launch it with state pruning (the default) and only later set pruning = "archive"?

d10r commented Jan 22, 2020

Yes, this was with pruning = "archive" enabled from the start.

I just tried with https://github.com/varasev/test-block-reward and Parity 2.6.5-beta. Works for me too.
Now I'm trying to figure out what the relevant difference is.

d10r commented Jan 22, 2020

Ok, found it. Looks like NUM can't be larger than 64.
However, it seems to be possible to rewind further by just executing it repeatedly.
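A small helper along those lines (a sketch only - the parity binary path and config file are placeholders, and the node itself must be stopped while running it):

// Sketch: rewind more than 64 blocks by repeating `parity db reset`,
// which is capped at 64 blocks per invocation as found above.
const { execSync } = require("child_process");

const PARITY = "./parity";  // placeholder path to the parity binary
const CONFIG = "node.toml"; // placeholder config file
let blocksToRewind = 200;   // total number of blocks to rewind

while (blocksToRewind > 0) {
  const step = Math.min(blocksToRewind, 64); // db reset refuses NUM > 64
  execSync(`${PARITY} -c ${CONFIG} db reset ${step}`, { stdio: "inherit" });
  blocksToRewind -= step;
}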

varasev commented Jan 23, 2020

Ok, found it. Looks like NUM can't be larger than 64.

Yeah, indeed. Last time I tried to rewind 50 blocks, but now I retried with 65 blocks and hit the limit as well.

d10r commented Feb 4, 2020

@afck I've now built Parity from upstream master (2.8.0-beta - commit 9c94dcb) and tried whether it can sync ARTIS tau1 with POSDAO active.

It could, but once it's synced, it tends to get stuck with this error msg:

2020-02-03 20:51:00  Imported #5142895 0x2999…a46b (0 txs, 0.00 Mgas, 4 ms, 0.57 KiB)
2020-02-03 20:51:05  Imported #5142896 0x6f3c…be7d (0 txs, 0.00 Mgas, 4 ms, 0.57 KiB)
2020-02-03 20:51:10  Imported #5142897 0x3969…8af6 (0 txs, 0.00 Mgas, 4 ms, 0.57 KiB)
2020-02-03 20:51:15  Stage 1 block verification failed for 0x5b20…c50b: Block(InvalidGasLimit(OutOfBounds { min: Some(10000000), max: Some(10000000), found: 2000000 }))
2020-02-03 20:51:15  
Bad block detected: Invalid gas limit: Value 2000000 out of bounds. Min=10000000, Max=10000000
RLP: f90249f90244a0396906e5483d3ee900dc6b9435e11c174ae053007fef83887e2a39575eea8af6a01dcc4de8dec75d7aab85b567b6ccd41ad312451b948a7413f0a142fd40d49347948e1ca34f5718996f2531464f1323f959d8a8dd21a06121da59a41d12fe6730fdf9a863a1422830a372c839ed28f42aba3a5c5a9607a056e81f171bcc55a6ff8345e692c0f86e5b48e01b996cadc001622fb5e363b421a056e81f171bcc55a6ff8345e692c0f86e5b48e01b996cadc001622fb5e363b421b901000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000090fffffffffffffffffffffffffffffffe834e7972831e848080845e3879b39fde830205098f5061726974792d457468657265756d86312e33362e30826c698412d81857b841d861f1750fb6e125906dd8eefcb1ed5ffda3589a742e4635d5b21409fb502df146342c04538ed150afc278480be233c90e8555c2fed0c132fefab35d82fe2c1c00c0c0
Header: Header { parent_hash: 0x396906e5483d3ee900dc6b9435e11c174ae053007fef83887e2a39575eea8af6, timestamp: 1580759475, number: 5142898, author: 0x8e1ca34f5718996f2531464f1323f959d8a8dd21, transactions_root: 0x56e81f171bcc55a6ff8345e692c0f86e5b48e01b996cadc001622fb5e363b421, uncles_hash: 0x1dcc4de8dec75d7aab85b567b6ccd41ad312451b948a7413f0a142fd40d49347, extra_data: [222, 131, 2, 5, 9, 143, 80, 97, 114, 105, 116, 121, 45, 69, 116, 104, 101, 114, 101, 117, 109, 134, 49, 46, 51, 54, 46, 48, 130, 108, 105], state_root: 0x6121da59a41d12fe6730fdf9a863a1422830a372c839ed28f42aba3a5c5a9607, receipts_root: 0x56e81f171bcc55a6ff8345e692c0f86e5b48e01b996cadc001622fb5e363b421, log_bloom: 0x00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000, gas_used: 0, gas_limit: 2000000, difficulty: 340282366920938463463374607431768211454, seal: [[132, 18, 216, 24, 87], [184, 65, 216, 97, 241, 117, 15, 182, 225, 37, 144, 109, 216, 238, 252, 177, 237, 95, 253, 163, 88, 154, 116, 46, 70, 53, 213, 178, 20, 9, 251, 80, 45, 241, 70, 52, 44, 4, 83, 142, 209, 80, 175, 194, 120, 72, 11, 226, 51, 201, 14, 133, 85, 194, 254, 208, 193, 50, 254, 250, 179, 93, 130, 254, 44, 28, 0]], hash: Some(0x5b207419adbfe6d34bc63a1f82f9196c827e5341912cc7c71ea45b5acc59c50b) }
Uncles: 
Transactions:2020-02-03 20:51:20    #5007914    0/25 peers   437 KiB chain 50 MiB db 0 bytes queue 7 MiB sync  RPC:  0 conn,    0 req/s,    0 µs
2020-02-03 20:51:50    #5007914    0/25 peers   437 KiB chain 50 MiB db 0 bytes queue 7 MiB sync  RPC:  0 conn,    0 req/s,    0 µs
2020-02-03 20:52:20    #5007914    0/25 peers   437 KiB chain 50 MiB db 0 bytes queue 7 MiB sync  RPC:  0 conn,    0 req/s,    0 µs

Afterwards, it's stuck.
What's weird is how I can fix it: by rewinding one block with db reset 1. If I then restart Parity, it manages to sync to the latest block again and continues from there until it gets stuck again.

The blocks it gets stuck at are correct ones (not from some stale branch) - I checked the block hashes. E.g. the one from the error message pasted here is this one.

There's a pattern for when it happens:
5140098
5141498
5142898

That's always 2 blocks before an epoch change. Does that give you any ideas?

When running Parity 2.5.9 with posdao changes from the poa repo on this chain, I've never seen this happen.

@varasev

varasev commented Feb 4, 2020

That's always 2 blocks before an epoch change

It happens on the last two blocks of the epoch because of this condition: https://github.com/lab10-coop/posdao-contracts/blob/fafc47ac803dcd271f75dea33ef61a8c5ad628bf/contracts/TxPermission.sol#L224

I think we may need to disable the InvalidGasLimit check when blockGasLimitContractTransitions is activated, or to force its bounds to the value returned by the contract.
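
For reference, here is a minimal Solidity sketch of the kind of condition linked above. It is not the exact contract code: the interface/contract names and the direct stakingContract reference are illustrative, and the two limits (10,000,000 normally, 2,000,000 on the last two blocks of the staking epoch) are taken from the values in the log in the previous comment.

pragma solidity ^0.5.0;

// Sketch only: the real TxPermission resolves the staking contract via the
// ValidatorSet contract; a direct reference is used here for brevity.
interface IStakingAuRa {
    function stakingEpochEndBlock() external view returns(uint256);
}

contract TxPermissionSketch {
    IStakingAuRa public stakingContract;

    uint256 internal constant BLOCK_GAS_LIMIT = 10000000;        // normal limit
    uint256 internal constant BLOCK_GAS_LIMIT_REDUCED = 2000000; // last two blocks of an epoch

    // Returns the reduced limit on the last two blocks of the staking epoch.
    // If the verifying node and the block author evaluate this condition
    // differently, verification fails with InvalidGasLimit(OutOfBounds).
    function blockGasLimit() public view returns(uint256) {
        uint256 epochEndBlock = stakingContract.stakingEpochEndBlock();
        if (block.number == epochEndBlock - 1 || block.number == epochEndBlock) {
            return BLOCK_GAS_LIMIT_REDUCED;
        }
        return BLOCK_GAS_LIMIT;
    }
}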

@varasev varasev changed the title Error(Engine(NotProposer(Mismatch Error(Engine(NotProposer(Mismatch and Block(InvalidGasLimit(OutOfBounds Feb 4, 2020
@varasev

varasev commented Feb 4, 2020

@d10r To temporarily avoid the bug, you can upgrade the TxPermission contract so that the blockGasLimit function always returns 10,000,000.
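
A minimal sketch of that temporary workaround, assuming the upgradeable-contract setup of posdao-contracts (the contract name here is hypothetical, the value is the one the verifying nodes expect per the log above):

pragma solidity ^0.5.0;

// Hypothetical hotfix sketch: an upgraded implementation whose blockGasLimit()
// is unconditional, so the reduced limit is never applied near the epoch end.
contract TxPermissionHotfixSketch {
    function blockGasLimit() public pure returns(uint256) {
        return 10000000;
    }
}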

@afck

afck commented Feb 4, 2020

As far as I understand, the block gas limit contract breaks light client compatibility… and maybe also warp sync?
(I don't know how it could be implemented in a light client-compatible way… we'd somehow need to include a proof of the gas limit in the header.)
But that still doesn't explain this bug: all involved nodes are full nodes, aren't they?

@varasev

varasev commented Feb 4, 2020

When running Parity 2.5.9 with posdao changes from the poa repo on this chain, I've never seen this happen.

@afck maybe we missed something in upstream? Something that we have in aura-pos.

@d10r

d10r commented Feb 4, 2020

Just tried without using warp sync, behaves the same.
Only full nodes involved, yes.

I've now tried this Parity binary with posdao-test-setup.
When using the version of the posdao-contracts we have deployed, it also gets stuck.
With the latest version of posdao-contracts, the issue seems to be gone.

So, I'll just upgrade posdao-contracts.

@varasev

varasev commented Feb 5, 2020

Just tried without using warp sync, behaves the same.
Only full nodes involved, yes.

So you see the Block(InvalidGasLimit(OutOfBounds error regardless of whether the --no-warp CLI flag (or warp = false in the TOML config) is used, right?

I've now tried this Parity binary with posdao-test-setup.
When using the version of the posdao-contracts we have deployed, it also gets stuck.

Which exact version of the Parity binary do you mean here?

With the latest version of posdao-contracts, the issue seems to be gone.
So, I'll just upgrade posdao-contracts.

Although, the TxPermission.blockGasLimit() implementations in your contract and in the original contract are the same:

Yours: https://github.com/lab10-coop/posdao-contracts/blob/fafc47ac803dcd271f75dea33ef61a8c5ad628bf/contracts/TxPermission.sol#L219-L228

Original: https://github.com/poanetwork/posdao-contracts/blob/c1604515a09bf15042e9b4c4cfb23a1ff44771ab/contracts/TxPermission.sol#L220-L229

@afck

afck commented Feb 5, 2020

Have you tried with --pruning archive?

@d10r

d10r commented Feb 5, 2020

So you see the Block(InvalidGasLimit(OutOfBounds error regardless of whether the --no-warp CLI flag (or warp = false in the TOML config) is used, right?

right

Which exact version of the Parity binary do you mean here?

9c94dcb8 (which was the HEAD of master a few days ago), which I built myself in release config because there wasn't yet a beta release with all the posdao PRs.

Although, the TxPermission.blockGasLimit() implementations in your contract and in the original contract are the same

yes. I don't have an explanation either.

Have you tried with --pruning archive?

No. Will try and report back.

@d10r

d10r commented Feb 5, 2020

Tried in posdao-test-setup with --pruning=archive. Didn't make a difference.
Here's the Parity log of an active validator node:

2020-02-05 11:06:16  IO Worker #2 INFO import  Imported #19 0xc73f…0518 (0 txs, 0.00 Mgas, 3 ms, 0.55 KiB)
2020-02-05 11:06:16  IO Worker #2 TRACE miner  update_sealing: imported internally sealed block
2020-02-05 11:06:20  IO Worker #0 TRACE miner  update_sealing
2020-02-05 11:06:20  IO Worker #0 TRACE miner  requires_reseal: sealing enabled
2020-02-05 11:06:20  IO Worker #0 TRACE engine  step_proposer: Fetched proposer for step 395224295: 0x522d…89b2
2020-02-05 11:06:20  IO Worker #0 TRACE engine  Not preparing block: not a proposer for step 395224295. (Our address: 0x75df…9441)
2020-02-05 11:06:20  IO Worker #0 TRACE miner  requires_reseal: should_disable_sealing=false; forced=true, has_local=false, internal=NotReady, had_requests=false
2020-02-05 11:06:20  IO Worker #0 TRACE engine  step_proposer: Fetched proposer for step 395224295: 0x522d…89b2
2020-02-05 11:06:20  IO Worker #0 TRACE engine  Not preparing block: not a proposer for step 395224295. (Our address: 0x75df…9441)
2020-02-05 11:06:20  Verifier #6 TRACE engine  Setting gas limit to 10000000 for block 20.
2020-02-05 11:06:20  Verifier #6 TRACE engine  Multi ValidatorSet retrieved for block 0.
2020-02-05 11:06:20  Verifier #6 DEBUG engine  Expected topics for header 0x9607…193a: [0x55252fa6eee4741b4e24a74a70e9c11fd2c2281df8d6ea13126ff845f7825c89, 0xc73fa0c0a60df13349f0a26055233d03848a4b82a2ef587df0faa36e2f860518]
2020-02-05 11:06:20  Verifier #6 TRACE engine  Multi ValidatorSet retrieved for block 0.
2020-02-05 11:06:20  Verifier #6 TRACE miner  chain_new_blocks
2020-02-05 11:06:20  Verifier #6 TRACE miner  minimal_gas_price: recalibrating...
2020-02-05 11:06:20  Verifier #6 DEBUG miner  minimal_gas_price: Got gas price! 1000000000
2020-02-05 11:06:20  Verifier #6 TRACE miner  update_sealing
2020-02-05 11:06:20  Verifier #6 TRACE miner  requires_reseal: sealing enabled
2020-02-05 11:06:20  Verifier #6 TRACE engine  step_proposer: Fetched proposer for step 395224295: 0x522d…89b2
2020-02-05 11:06:20  Verifier #6 TRACE engine  Not preparing block: not a proposer for step 395224295. (Our address: 0x75df…9441)
2020-02-05 11:06:20  Verifier #6 TRACE miner  requires_reseal: should_disable_sealing=false; forced=true, has_local=false, internal=NotReady, had_requests=false
2020-02-05 11:06:20  Verifier #6 TRACE engine  step_proposer: Fetched proposer for step 395224295: 0x522d…89b2
2020-02-05 11:06:20  Verifier #6 TRACE engine  Not preparing block: not a proposer for step 395224295. (Our address: 0x75df…9441)
2020-02-05 11:06:20  Verifier #6 INFO import  Imported #20 0x9607…193a (0 txs, 0.00 Mgas, 8 ms, 0.55 KiB)
2020-02-05 11:06:24  IO Worker #0 TRACE miner  update_sealing
2020-02-05 11:06:24  IO Worker #0 TRACE miner  requires_reseal: sealing enabled
2020-02-05 11:06:24  IO Worker #0 TRACE engine  step_proposer: Fetched proposer for step 395224296: 0xbbca…4c78
2020-02-05 11:06:24  IO Worker #0 TRACE engine  Not preparing block: not a proposer for step 395224296. (Our address: 0x75df…9441)
2020-02-05 11:06:24  IO Worker #0 TRACE miner  requires_reseal: should_disable_sealing=false; forced=true, has_local=false, internal=NotReady, had_requests=false
2020-02-05 11:06:24  IO Worker #0 TRACE engine  step_proposer: Fetched proposer for step 395224296: 0xbbca…4c78
2020-02-05 11:06:24  IO Worker #0 TRACE engine  Not preparing block: not a proposer for step 395224296. (Our address: 0x75df…9441)
2020-02-05 11:06:28  IO Worker #2 TRACE miner  update_sealing
2020-02-05 11:06:28  IO Worker #2 TRACE miner  requires_reseal: sealing enabled
2020-02-05 11:06:28  IO Worker #2 TRACE engine  step_proposer: Fetched proposer for step 395224297: 0x75df…9441
2020-02-05 11:06:28  IO Worker #2 TRACE miner  requires_reseal: should_disable_sealing=false; forced=true, has_local=false, internal=Ready, had_requests=false
2020-02-05 11:06:28  IO Worker #2 TRACE engine  step_proposer: Fetched proposer for step 395224297: 0x75df…9441
2020-02-05 11:06:28  IO Worker #2 TRACE miner  update_sealing: preparing a block
2020-02-05 11:06:28  IO Worker #2 TRACE miner  prepare_block: No existing work - making new block
2020-02-05 11:06:28  IO Worker #2 TRACE engine  Setting gas limit to 10000000 for block 21.
2020-02-05 11:06:28  IO Worker #2 ERROR miner  Failed to prepare engine transactions for new block: Engine(Custom("Randomness error in advance(): LoadFailed(DecodeFailed(Error(Msg(\"please ensure the contract and method you\\\'re calling exist! failed to decode empty bytes. if you\\\'re using jsonrpc this is likely due to jsonrpc returning `0x` in case contract or method don\\\'t exist\"), State { next_error: None, backtrace: InternalBacktrace })))")). This is likely an error in chain specification or on-chain consensus smart contracts.

In this case, it happens at block 21 (reproducibly).
I also tried to first run a few epochs with Parity 2.5.9, then switch to 2.8.0. Again, it got stuck at epoch_start + 21.
The settings are:

STAKING_EPOCH_DURATION=76
STAKE_WITHDRAW_DISALLOW_PERIOD=10
COLLECT_ROUND_LENGTH=38

@varasev couldn't it be caused by some refactoring, e.g. this, which the latest Parity relies upon?

@afck

afck commented Feb 5, 2020

Doesn't the --pruning=archive log look different?
This looks like the gas limit contract now works, but this time the randomness contract failed in the first reveal phase.

@varasev

varasev commented Feb 5, 2020

couldn't it be caused by some refactoring, e.g. this, which the latest Parity relies upon?

No, I don't think so. (Possibly yes, see my comment below.) Please post the full steps to reproduce here so that we can repeat the error.

@varasev

varasev commented Feb 6, 2020

@d10r I tried to launch posdao-test-setup with the latest Parity and with our aura-pos v0.1.4. Both work fine, and I didn't manage to reproduce the error you reported in the last comment. When testing, I set

[footprint]
tracing = "on"
fat_db = "on"
pruning = "archive"

for all the nodes in the config directory.

The tests use our latest POSDAO contracts from https://github.com/poanetwork/posdao-contracts.

Were these the correct steps to reproduce on my side?

@varasev

varasev commented Feb 7, 2020

@varasev couldn't it be caused by some refactoring, e.g. this, which the latest Parity relies upon?

@d10r Yes, sorry, it seems that's the reason for the error. Your RandomAuRa contract doesn't have a revealNumber function. We renamed revealSecret to revealNumber in poanetwork/posdao-contracts@ec39011, which is why upstream Parity doesn't see the revealNumber function in your contract.
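
Roughly, the difference looks like this (an illustrative sketch; the interface names and parameter names are assumptions, see the linked commit for the exact signatures):

pragma solidity ^0.5.0;

// What the older, already-deployed RandomAuRa exposes:
interface IRandomAuRaOld {
    function revealSecret(uint256 _secret) external;
}

// What upstream Parity now addresses by ABI name; a contract deployed before
// the rename doesn't have this function, so the engine's call returns empty
// data ("failed to decode empty bytes" in the log above).
interface IRandomAuRaNew {
    function revealNumber(uint256 _number) external;
}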

@d10r

d10r commented Feb 12, 2020

Thanks @varasev. I suspected something like this, but knowing the exact cause gives more peace of mind.
