
fix: mls 1on1 race condition [WPB-15395] #3237

Merged
merged 6 commits into release/candidate on Jan 22, 2025

Conversation

@Garzas (Contributor) commented Jan 21, 2025

Bug WPB-15395: [Android] 1:1 conversation throwing decryption errors after MLS was enabled for the team


PR Submission Checklist for internal contributors

  • The PR Title

    • conforms to the style of semantic commit messages supported in Wire's GitHub Workflow
    • contains a reference JIRA issue number like SQPIT-764
    • answers the question: If merged, this PR will: ...
  • The PR Description

    • is free of optional paragraphs and you have filled in the relevant parts to the best of your ability

What’s New in This PR?

This PR addresses a race condition during 1:1 MLS conversation creation and improves logging throughout the MLS-related codebase.

Issues

  • Race condition: When a user opens a 1:1 conversation at the same time a slow sync process is attempting to (re)join the same MLS conversation, we could end up with errors like:
    • ConversationAlreadyExists
    • Cannot execute operation because a pending commit exists
  • Lack of logs: Troubleshooting such concurrency issues was complicated by insufficient logging in some MLS-related operations.

Causes

  • Concurrent initiation of MLS commits:
    • The slow sync step attempts to join existing MLS conversations.
    • At the same time, the UI notifies that a conversation is open (NotifyConversationIsOpenUseCase), which also tries to resolve or establish a 1:1 MLS group.
    • Because both operations could run in parallel, we ended up with a “pending commit” blocking any subsequent commits.
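The collision described above can be modeled in a few lines of Kotlin. This is a toy model, not kalium code; all names here are illustrative. MLS allows only one pending (uncommitted) commit per group at a time, so whichever initiator comes second hits the error from the bug report:

```kotlin
// Toy model of an MLS group's single-pending-commit constraint.
class MlsGroupModel {
    private var pendingCommit = false

    @Synchronized
    fun startCommit() {
        // A second commit attempt while one is pending fails, mirroring
        // the "pending commit" error seen in the field.
        check(!pendingCommit) { "Cannot execute operation because a pending commit exists" }
        pendingCommit = true
    }

    @Synchronized
    fun mergeCommit() {
        pendingCommit = false
    }
}

fun main() {
    val group = MlsGroupModel()
    group.startCommit()                              // slow sync begins (re)joining the group
    val result = runCatching { group.startCommit() } // UI concurrently tries to establish the same 1:1
    println(result.exceptionOrNull()?.message)       // "Cannot execute operation because a pending commit exists"
}
```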

Solutions

  1. Waiting for Slow Sync

    • In NotifyConversationIsOpenUseCaseImpl, we now explicitly wait for SlowSyncStatus to reach Complete before reevaluating the protocol for a 1:1 conversation.
    • This ensures no concurrent MLS commits are triggered during slow sync.
  2. Improved Logging

    • Added detailed log statements (kaliumLogger.d(...)) in JoinExistingMLSConversationUseCase, MLSConversationDataSource, and NotifyConversationIsOpenUseCaseImpl.
    • These logs help to trace when a commit bundle is sent or processed, and when establishing/joining self and 1:1 MLS groups.
  3. Failure Handling (Wipe on Add Member Failure)

    • In MLSConversationDataSource, if adding members to an MLS group fails, we attempt a conversation wipe.
    • This helps to avoid a stuck state if the MLS conversation remains in a “pending commit” state and can’t be recovered via normal commits.
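Solution 1 above can be sketched as follows. This is a minimal stand-in, not the actual `NotifyConversationIsOpenUseCaseImpl`: kalium observes a `SlowSyncStatus` flow, which is modeled here with a `CountDownLatch` to keep the example dependency-free.

```kotlin
import java.util.concurrent.CountDownLatch
import kotlin.concurrent.thread

// Sketch: the use case blocks until slow sync reports completion before
// re-evaluating the 1:1 protocol, so it cannot race the slow-sync MLS join.
class NotifyConversationIsOpenSketch(private val slowSyncComplete: CountDownLatch) {
    fun invoke(resolveOneOnOneProtocol: () -> Unit) {
        slowSyncComplete.await()   // wait for SlowSyncStatus to reach Complete
        resolveOneOnOneProtocol()  // only now touch the MLS group
    }
}

fun main() {
    val slowSync = CountDownLatch(1)
    val events = mutableListOf<String>()
    val ui = thread {
        NotifyConversationIsOpenSketch(slowSync).invoke { events.add("resolve 1:1") }
    }
    events.add("slow sync: join existing MLS conversations")
    slowSync.countDown() // slow sync completes; only then does the UI proceed
    ui.join()
    println(events)      // [slow sync: join existing MLS conversations, resolve 1:1]
}
```

The latch guarantees the ordering: the protocol re-evaluation always observes a finished slow sync, which is exactly the serialization the fix introduces.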

@Garzas Garzas self-assigned this Jan 21, 2025
@echoes-hq echoes-hq bot added the echoes: product-roadmap/bug Work contributing to resolve a bug not critical enough to have raised an incident. label Jan 21, 2025

github-actions bot commented Jan 21, 2025

Test Results

3 347 tests  ±0   3 239 ✅ ±0   5m 25s ⏱️ +13s
  572 suites ±0     108 💤 ±0 
  572 files   ±0       0 ❌ ±0 

Results for commit 4afcd7c. ± Comparison against base commit ff16a2c.

♻️ This comment has been updated with latest results.

@MohamadJaara (Member) commented:
question: in this case, do we need it to wait? IMO just returning without doing anything is enough, since slow sync will migrate this 1:1 anyway.


datadog-wireapp bot commented Jan 21, 2025

Datadog Report

Branch report: fix/mls-1on1-race-condition
Commit report: ae022e4
Test service: kalium-jvm

✅ 0 Failed, 3239 Passed, 108 Skipped, 1m 1.45s Total Time

@yamilmedina (Contributor) left a comment:

Looks good to me; it would just be great to add tags to the logs.


codecov-commenter commented Jan 21, 2025

Codecov Report

Attention: Patch coverage is 75.00000% with 10 lines in your changes missing coverage. Please review.

Project coverage is 54.52%. Comparing base (ff16a2c) to head (4afcd7c).

Files with missing lines Patch % Lines
...nc/receiver/conversation/MLSWelcomeEventHandler.kt 60.00% 3 Missing and 1 partial ⚠️
...re/conversation/NotifyConversationIsOpenUseCase.kt 81.25% 2 Missing and 1 partial ⚠️
...um/logic/feature/conversation/ConversationScope.kt 0.00% 2 Missing ⚠️
...conversation/JoinExistingMLSConversationUseCase.kt 80.00% 1 Missing ⚠️
Additional details and impacted files
@@                Coverage Diff                 @@
##           release/candidate    #3237   +/-   ##
==================================================
  Coverage              54.52%   54.52%           
==================================================
  Files                   1250     1250           
  Lines                  36524    36554   +30     
  Branches                3696     3698    +2     
==================================================
+ Hits                   19913    19932   +19     
- Misses                 15213    15222    +9     
- Partials                1398     1400    +2     
Files with missing lines Coverage Δ
...gic/data/conversation/MLSConversationRepository.kt 84.05% <100.00%> (+0.16%) ⬆️
...conversation/JoinExistingMLSConversationUseCase.kt 80.95% <80.00%> (+0.46%) ⬆️
...um/logic/feature/conversation/ConversationScope.kt 0.00% <0.00%> (ø)
...re/conversation/NotifyConversationIsOpenUseCase.kt 90.32% <81.25%> (-9.68%) ⬇️
...nc/receiver/conversation/MLSWelcomeEventHandler.kt 85.93% <60.00%> (-3.16%) ⬇️

... and 2 files with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ff16a2c...4afcd7c. Read the comment docs.

@@ -454,7 +456,15 @@ internal class MLSConversationDataSource(
retryOnStaleMessage = true,
allowPartialMemberList = false,
cipherSuite = cipherSuite
).map { Unit }
).onFailure {
Review comment (Member):
Isn't this code path also called when adding users to an existing conversation? We certainly don't want to wipe any conversation then. We only want to do that when we fail to establish the initial MLS group.
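One way to scope the wipe as the reviewer suggests is to gate it on whether the failed commit was establishing the initial group. This is a hypothetical sketch: the `Either` type, `addMembersSketch`, and the `establishingNewGroup` flag are stand-ins, not kalium's actual API.

```kotlin
// Minimal Either stand-in for kalium's functional result type.
sealed class Either<out L, out R> {
    data class Left<out L>(val value: L) : Either<L, Nothing>()
    data class Right<out R>(val value: R) : Either<Nothing, R>()
}

// Wipe only when the *initial* group establishment fails, never when
// adding members to an already-established conversation.
fun addMembersSketch(
    establishingNewGroup: Boolean,
    sendCommitBundle: () -> Either<String, Unit>,
    wipeConversation: () -> Unit
): Either<String, Unit> {
    val result = sendCommitBundle()
    if (result is Either.Left && establishingNewGroup) {
        // A failed initial commit can leave the group stuck in a
        // "pending commit" state; wiping allows a clean retry later.
        wipeConversation()
    }
    return result
}
```

Per the commit history below, the unconditional wipe on member-add failure was in fact reverted later in this PR.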

resolveConversationIfOneOnOne(event.conversationId)
}
.flatMapLeft {
if (it is MLSFailure.ConversationAlreadyExists) {
Review comment (Member):
Maybe add a warning here then? "Discarding welcome since the conversation already exists"
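The suggested change, discarding a welcome message when the conversation already exists (for example because the race resolved it first), might look like the sketch below. Types and names are illustrative, not kalium's exact API.

```kotlin
// Stand-in for kalium's MLS failure hierarchy.
sealed class MLSFailure {
    object ConversationAlreadyExists : MLSFailure()
    data class Other(val reason: String) : MLSFailure()
}

// Returns true when the failure is recoverable by simply dropping the
// welcome; logs a warning instead of propagating the error.
fun handleWelcomeFailure(failure: MLSFailure, warn: (String) -> Unit): Boolean =
    if (failure is MLSFailure.ConversationAlreadyExists) {
        warn("Discarding welcome since the conversation already exists")
        true  // recovered: treat as success
    } else {
        false // unrelated failure: keep propagating
    }
```

This matches the `flatMapLeft` branch in the diff above: `ConversationAlreadyExists` is swallowed, everything else still fails the handler.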

@Garzas Garzas requested a review from typfel January 21, 2025 19:08
@Garzas Garzas added this pull request to the merge queue Jan 22, 2025
Merged via the queue into release/candidate with commit 9bf7218 Jan 22, 2025
21 checks passed
@Garzas Garzas deleted the fix/mls-1on1-race-condition branch January 22, 2025 06:47
github-actions bot pushed a commit that referenced this pull request Jan 22, 2025
* fix: race condition during 1on1 mls creation, more logs

* scope init fix

* test fix

* added ignore on welcome message when conv already exist, reverted wipe on member join

* added warning
github-merge-queue bot pushed a commit with the same message on Jan 24, 2025.

Co-authored-by: Jakub Żerko <[email protected]>
Labels
echoes: product-roadmap/bug Work contributing to resolve a bug not critical enough to have raised an incident. 🚨 Potential breaking changes 👕 size: S type: bug / fix 🐞
5 participants