-
Notifications
You must be signed in to change notification settings - Fork 234
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Silently drop PDUs we fail to parse from federation, instead of failing entire transaction #17893
base: develop
Are you sure you want to change the base?
Silently drop PDUs we fail to parse from federation, instead of failing entire transaction #17893
Conversation
changelog.d/17893.doc
Outdated
@@ -0,0 +1 @@ | |||
Fix a bug where all messages from a server could be blocked because of one bad event. Contributed by @morguldir |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there more context for the issue you ran into specifically? Some issue that should be linked?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't think there's an issue for it, but basically unless the room is v11, the version key is actually unprotected. Unfortunately there was no cs api protection for redacting the create event in conduwuit, so a silly bot creator accidentally redacted everything in the room, making that particular room version fall back to "1" like the spec says
Once i finally returned 200 OK, all the pdus and edus that had been failing started to trickle in again at last
I think matrix.org was also in the room at the time, so it's similarly affected
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume you're talking about the behavior described here:
Room Version 11
[New in this version] [...] The
m.room.create
event now keeps the entirecontent
property.
And the m.room.create
event defines the content.room_version
.
For my own reference, this was added in MSC2176 and included in room version 11 via MSC3820.
I see how this PR allows valid events from a transaction to be received while ignoring invalid ones.
In the scenario where a room falls back to a v1 room, does the room_version
we have stored in Synapse get updated?
If the room_version
doesn't get updated, this just makes sure that the server can receive the final "valid" events in the room before it broke (people started sending v1 events). Is there an issue tracking whether Synapse should update the room_version
after the m.room.create
redaction?
If the room_version
does get updated to v1, then events are only accepted according to the order received (before/after receiving the redaction forcing the fallback).
Perhaps we should have a test for this scenario?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately there was no cs api protection for redacting the create event
Does this exist in Synapse somewhere already?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cs api errors with
synapse/synapse/handlers/message.py
Lines 1855 to 1856 in 4be3bd4
if original_event.type == EventTypes.Create: | |
raise AuthError(403, "Redacting create events is not permitted") |
i tried to send a redaction from construct (conduwuit would update the room version and fail to send the redaction as well), when it was a partial state room it seemed to error somewhat, but for a normal room it didn't seem like anything happened, the event didn't show in my timeline, but it was persisted on disk without errors.
I did an initial sync too, and the create event was still there, so i assume it just gets ignored? 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the scenario where a room falls back to a v1 room, does the room_version we have stored in Synapse get updated?
Also to be clear, the scenario is mostly on the conduwuit side in this case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there more context for the issue you ran into specifically? Some issue that should be linked?
Seeing the sytest failure now there actually is an for it already, matrix-org/synapse#7543 / #7543
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good find! I've linked the issues in the PR description at the top
The Sytest says this:
# Room version 6 states that homeservers should strictly enforce canonical JSON
# on PDUs. Test that a transaction to `send` with a PDU that has bad data will
# be handled properly.
#
# This enforces that the entire transaction is rejected if a single bad PDU is
# sent. It is unclear if this is the correct behavior or not.
#
# See https://github.com/matrix-org/synapse/issues/7543
I feel like given the nature of how PUT /_matrix/federation/v1/send/{txnId}
is spec'ed, it makes sense that some PDU's can fail to process while others succeed. The response is literally a map of events to any potential errors when processing that specific event.
Since the Sytest test also says "It is unclear if this is the correct behavior or not.", my interpretation could be correct. Probably should get a few more eyeballs here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@morguldir I had this on the To-Discuss for the weekly meeting today but we ran out of time to discuss. This is the first thing on the board to discuss next week though 🤞
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, thanks for the reviews so far 🙏
@@ -56,7 +56,11 @@ | |||
SynapseError, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@morguldir Are you up for adding a test for this? Probably something in tests/federation/test_federation_server.py
, make a request like this and then assert that the other PDU's in the transaction besides the corrupted one were persisted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is b6b5d81 what you had in mind? See also matrix-org/complement#743
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh, I completely missed your Complement test in my initial review. The Synapse test is looking good 👍
changelog.d/17893.doc
Outdated
@@ -0,0 +1 @@ | |||
Fix a bug where all messages from a server could be blocked because of one bad event. Contributed by @morguldir |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume you're talking about the behavior described here:
Room Version 11
[New in this version] [...] The
m.room.create
event now keeps the entirecontent
property.
And the m.room.create
event defines the content.room_version
.
For my own reference, this was added in MSC2176 and included in room version 11 via MSC3820.
I see how this PR allows valid events from a transaction to be received while ignoring invalid ones.
In the scenario where a room falls back to a v1 room, does the room_version
we have stored in Synapse get updated?
If the room_version
doesn't get updated, this just makes sure that the server can receive the final "valid" events in the room before it broke (people started sending v1 events). Is there an issue tracking whether Synapse should update the room_version
after the m.room.create
redaction?
If the room_version
does get updated to v1, then events are only accepted according to the order received (before/after receiving the redaction forcing the fallback).
Perhaps we should have a test for this scenario?
Co-authored-by: Eric Eastwood <[email protected]>
Comment updates, docstring, remove extra logs Co-authored-by: Eric Eastwood <[email protected]>
event_id
field
event_id
fieldevent_id
field
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is good to merge in the meantime while matrix-org/matrix-spec#2027 is figured out.
It only improves today's situation, which is for Synapse to fail with a 500 upon receiving such transactions.
If the above spec issue ends up deciding that such events should be valid, then a future PR can modify the behaviour to accept these events. But we should still fix the 500 now.
Seeing that matrix-org/complement#743 only adds a new test, it's fine to merge this PR now.
changelog.d/17893.bugfix
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've created a spec issue to clarify whether events for v3+ rooms containing an event_id
field should be considered valid: matrix-org/matrix-spec#2027
event_id
fieldevent_id
field, instead of entire transaction
event_id
field, instead of entire transactionThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@morguldir it seems that this sytest test is currently failing with this PR. That is due to the test expecting the homeserver to return a 400 when a bad event is included in a transaction, whereas we're now just silently dropping them (and returning a 200).
Could you create a PR to update that test please? I consider that a blocking change to merging this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, perl didn't live up to the bad rumors so far :D, added in matrix-org/sytest#1391
Although according to halfie it sounds like you need to push the branch to sytest if you want CI to pass matrix-org/synapse#12754 👀
# If we see an `event_id` in a newer room version, then it's an invalid event | ||
# and we should reject it. | ||
assert ( | ||
"event_id" not in event_dict |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The assert has always been there, this just adds an error message and a comment
I'm retracting my approval at the moment as discussion has continued on matrix-org/matrix-spec#2027 (comment) and is trending away from this PR's approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Err sorry, I obviously need more coffee. The exact opposite is true of what I said in #17893 (comment)
Actually, it's partially true. This PR does two things, the latter of which the title fails to mention:
- Silently drop PDUs we fail to parse from federation
- Report an error back to the homeserver if an
event_id
was provided in an incoming v3+ room event.
Discussion in #17893 (comment) is advocating to also silently drop events which contain an event_id
field. So dropping these lines and removing the second change from this PR.
@morguldir curious for your thoughts on that change?
My reasoning for the error was that synapse would yell at you loudly before with a 500 for the whole transaction before, but there's not really anything another server can do with that information (a loud error) Returning an error for that specific pdu at least lets the sending server output a warning or similar From the discussion in #7543, i think the main problem is that most of the time it's impossible to calculate the event id, but here it was already provided. I'm not sure if there's a drawback to informing about the event not being processed In the end i'm mostly worried about returning a 200 response every time though, regardless if it includes the error or not |
Related matrix-org/synapse#7543 / #7543
Complement test: matrix-org/complement#743
Pull Request Checklist