-
Notifications
You must be signed in to change notification settings - Fork 96
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimized broadcast #309 #405
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also make sure you apply rustfmt; the formatting is off in some places. (And Clippy in the end: Both are required by our CI.)
src/broadcast/broadcast.rs
Outdated
let p = self.echos.get(self.our_id()).unwrap().clone(); | ||
// Could also have another error type if it panics. | ||
let (_, p) = self.echos.get(self.our_id()).unwrap(); | ||
let p = p.clone().unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should not assume that we have our own Echo
here. (See my earlier comment, which probably became outdated.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will pass p
as an argument here.
@afck the |
That's weird: Do the Broadcast tests pass? If they do, but the higher-level tests fail, that shows that we need to extend the Broadcast tests (#396). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(I also couldn't find the cause for the failing tests so far.)
src/broadcast/broadcast.rs
Outdated
@@ -28,14 +29,23 @@ pub struct Broadcast<N> { | |||
coding: Coding, | |||
/// If we are the proposer: whether we have already sent the `Value` messages with the shards. | |||
value_sent: bool, | |||
/// Whether we have already multicast `Echo`. | |||
/// Whether we have already send `Echo` left nodes and right nodes who haven't sent `CanDecode`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo: "sent
"
Maybe: "Whether we have already sent `Echo` to all nodes who haven't sent `CanDecode`.
"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment is still missing a "to". How do you visualise "left" and "right" on a circular list, I wonder? Can you have a one-dimensional counterpart of these terms instead? Possibly predecessor and successor?
src/broadcast/broadcast.rs
Outdated
/// Whether we have already output a value. | ||
decided: bool, | ||
/// The proofs we have received via `Echo` messages, by sender ID. | ||
echos: BTreeMap<N, Proof<Vec<u8>>>, | ||
/// /// Estimate as to how many nodes we think are faulty. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Duplicate "///
"
I'd also clarify that this is only for performance, not for liveness and correctness.
Maybe: "Number of faulty nodes to optimize performance for.
"?
src/broadcast/broadcast.rs
Outdated
@@ -81,6 +91,7 @@ impl<N: NodeIdT> Broadcast<N> { | |||
let data_shard_num = netinfo.num_nodes() - parity_shard_num; | |||
let coding = | |||
Coding::new(data_shard_num, parity_shard_num).map_err(|_| Error::InvalidNodeCount)?; | |||
let g = netinfo.num_faulty(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's call this fault_estimate
, too, instead of g
.
src/broadcast/broadcast.rs
Outdated
if self.echos.get(self.our_id()) == Some(&p) { | ||
|
||
if let Some(EchoContent::Full(proof)) = self.echos.get(self.our_id()) { | ||
if *proof == p { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess we should also check whether the hash matches, if we only have an EchoContent::Hash
. So I'd actually just compare the hashes whenever this is Some(echo_content)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we only have an EchoContent::Hash
, then we can't return from the if let
statement right? We need to update the echos
map with the actual full proof. So it would have to be
match self.echos.get(self.our_id()) {
Some(val) if val.hash() != p.root_hash() => {
return Ok(Fault::new(sender_id.clone(), FaultKind::MultipleValues).into())
}
Some(EchoContent::Full(proof)) if *proof == p => {
warn!(
"Node {:?} received Value({:?}) multiple times from {:?}.",
self.our_id(),
HexProof(&p),
sender_id
);
return Ok(Step::default());
},
_ => ()
};
i.e. return ()
if we have a Some(EchoContent::Hash)
that matches with p.root_hash()
or None
so that the rest of the function continues.
src/broadcast/broadcast.rs
Outdated
.unwrap_or(&BTreeSet::new()) | ||
.clone(); | ||
values.insert(sender_id.clone()); | ||
self.can_decodes.insert(*hash, values); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self.can_decodes
.entry(*hash)
.or_default()
.insert(sender_id.clone());
src/broadcast/broadcast.rs
Outdated
if !self.netinfo.is_validator() { | ||
return Ok(Step::default()); | ||
} | ||
let echo_msg = Message::Echo(p.clone()); | ||
let step: Step<_> = Target::All.message(echo_msg).into(); | ||
let mut step = Step::default(); | ||
// `N - 2f + g` node ids to the left of our_id (excluding our_id and proposer_id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The proposer isn't currently excluded, is it?
(I think it's better that way: That causes additional complications and should probably be a separate PR, if at all.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No its not 😅 Forgot to change the comment.
@afck I think i found the problem. The current implementation uses Since |
Right, I hadn't thought of that! Slightly more optimized: Only send a full What do you think, @vkomenda? |
.take(self.netinfo.num_correct() - self.netinfo.num_faulty() + self.fault_estimate) | ||
.skip(1); | ||
for id in left { | ||
let msg = Target::Node(id.clone()).message(echo_msg.clone()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This cloning of message content makes me wish for a different Target
. At least a target that allows sending the same content to a list of nodes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree; maybe Target
should simply have the two variants AllExcept
(blacklist) and Nodes
(whitelist). That should cover all our cases.
Anyway, let's do that later, in a separate issue: #408
What if the new variant allowed sending the same message content to a given list of nodes plus any observers not on that list? This would allow to reduce cloning of message content inside Broadcast although later that content would have to be cloned anyway when sending it to each of those nodes. But at least we would have eliminated With such a new message variant, we would change the sender queue slightly to marshal the content to the correct list of nodes that includes any observers not known to Broadcast. |
Yes, that's true, especially for broadcast, where some of the messages are relatively large. And I don't even think they necessarily have to be cloned later: The application logic can just serialize it once (instead of the current duplicated serialization), and then send those same bytes to all the recipients. No need to clone anything. On the other hand, maybe it's not Anyway, I think that's beyond the scope of this PR and needs some design and discussion first. I'd just add |
Take care to filter the non-observers from |
So just to clarify, what you guys are suggesting is having an additional
|
No, just But that's a good point: I'm not sure whether we have the information in |
My mistake. I thought you meant something like But again, I don't think |
Right, that sounds like my initial reading of the new
Anyhow, if |
Another thing we could possibly do would be to have an additional function in the fn is_node_validator(&self) -> bool; Every implementation of the |
Sounds dubious. |
I don't think so. I'd prefer if
|
You've answered a slightly different question but I'm with you on this one :)
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good!
I think the only things that are left to do now are rebasing on master, running the simulation with and without this optimization to verify that it's worth the added complexity, and updating the module documentation.
I'm wondering whether we should remove Target::All
now, since it's equivalent to AllExcept(Vec::new())
. (But probably not in this PR; it's already huge!)
@afck ran the simulation with the new code. I'm getting ~20% improvement in average message size handled by a node for message sizes > 100 bytes. I think the reason for this sub optimal improvement is because in many instances, a node receives |
Awesome, thanks! That makes sense and I think 20% already means it's worth making this optimization! 🎉 |
Added a few minor optimizations to ensure more
Note: This could lead to possible retransmissions of These are the message size improvements that I got with and without the optimization: |
I don't agree with those two changes: |
You are right. I overcomplicated the issue for no reason. I just needed to send Should I reset the changes and force push the fix or revert and apply the fix on top of the commit? |
Sure, feel free to force-push to remove the latest commit. |
b469f9e
to
9281f40
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! 👍
Most of my comments are just nit-picks and no blockers for merging.
But I think can_decode_sent
does need to be stored by hash for the algorithm to be correct.
.take(self.netinfo.num_correct() - self.netinfo.num_faulty() + self.fault_estimate) | ||
.skip(1); | ||
for id in left { | ||
let msg = Target::Node(id.clone()).message(echo_msg.clone()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree; maybe Target
should simply have the two variants AllExcept
(blacklist) and Nodes
(whitelist). That should cover all our cases.
Anyway, let's do that later, in a separate issue: #408
src/broadcast/broadcast.rs
Outdated
let mut step = Step::default(); | ||
|
||
// Upon receiving `N - 2f` `Echo`s with this root hash, send `CanDecode` | ||
if !self.can_decode_sent && self.count_echos_full(&hash) >= self.coding.data_shard_count() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to store can_decode_sent
by hash?
Because we might have to send more than one of them, e.g. if the proposer is faulty, it might send a value v1 with hash h1 to N - f nodes and value v2 with hash h2 to f + 1 nodes, with the overlapping nodes all malicious. So you may end up receiving f + 1 Echo
s with h2, even though in the end v1 will get accepted. In that case, you should send CanDecode
with both hashes eventually.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah you are right. Will fix this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, thank you! 🎉
@vkomenda: Please also take a final look, and merge if you're happy with it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The tests failed for me! tests/dynamic_honey_badger.proptest-regressions
contains:
cc 53b9134103d3f553062b32b38cdd359c8d5e3c68ab1311afcc888f2f9edff3ba # shrinks to cfg = TestConfig { dimension: NetworkDimension { size: 14, faulty: 0 }, total_txs: 20, batch_size: 10, contribution_size: 1, seed: [241, 107, 204, 130, 53, 111, 85, 7, 89, 3, 240, 65, 183, 26, 167, 233] }
We need to find out whether this is due to the Broadcast optimization before merging.
Yet another failing
|
Great, thanks! Yes, we should focus on the failing broadcast test first then, since it's probably the reason for the failing DHB test. 👍 |
@vkomenda: If I put that line in |
Strange... That is my |
@afck the test is failing because the time_limit was exceeded. Removed the time_limit and the tests passed. |
It make sense that it takes a bit longer: We are sending a few more messages now. @vkomenda: Are you using @pawanjay176's current |
Yes. I'm also running the tests with |
That failing test actually passes without |
You're right, thanks! I can reproduce it now. |
OK, I found the bug: When sending the full --- a/src/broadcast/broadcast.rs
+++ b/src/broadcast/broadcast.rs
@@ -134,6 +134,7 @@ impl<N: NodeIdT> Broadcast<N> {
if !self.netinfo.is_node_validator(sender_id) {
return Err(Error::UnknownSender);
}
+ debug!("{}: Handling {:?} from {:?}", self, &message, sender_id);
match message {
Message::Value(p) => self.handle_value(sender_id, p),
Message::Echo(p) => self.handle_echo(sender_id, p),
@@ -443,21 +444,20 @@ impl<N: NodeIdT> Broadcast<N> {
let echo_msg = Message::Echo(p);
let mut step = Step::default();
- if let Some(senders) = self.can_decodes.get(hash) {
- // Remaining node ids to the right of our_id
- // after arranging all node ids in a circular list.
- let right = self
- .netinfo
- .all_ids()
- .cycle()
- .skip_while(|x| *x != self.our_id())
- .skip(self.netinfo.num_correct() - self.netinfo.num_faulty() + self.fault_estimate)
- .take_while(|x| *x != self.our_id());
- let msgs = right
- .filter(|id| !senders.contains(id))
- .map(|id| Target::Node(id.clone()).message(echo_msg.clone()));
- step.messages.extend(msgs);
- }
+ let senders = self.can_decodes.get(hash);
+ // Remaining node ids to the right of our_id
+ // after arranging all node ids in a circular list.
+ let right = self
+ .netinfo
+ .all_ids()
+ .cycle()
+ .skip_while(|x| *x != self.our_id())
+ .skip(self.netinfo.num_correct() - self.netinfo.num_faulty() + self.fault_estimate)
+ .take_while(|x| *x != self.our_id());
+ let msgs = right
+ .filter(|id| senders.map_or(true, |s| !s.contains(id)))
+ .map(|id| Target::Node(id.clone()).message(echo_msg.clone()));
+ step.messages.extend(msgs);
Ok(step)
} |
(With this change, it passed one hour of the broadcast tests in a loop. 🎉) |
Awesome! Will push the change. |
It would eventually have failed with normal crypto, too, just not with the same seed. The crypto implementations create different keys, so the order of the keys is different, which changes iteration order in some places, I think. |
Ohh I thought it would be same for same seeds. Makes sense now. |
WIP for #309