Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Element-web's change to sync based presence breaks synapse process #16039

Closed
MTRNord opened this issue Jul 31, 2023 · 18 comments
Closed

Element-web's change to sync based presence breaks synapse process #16039

MTRNord opened this issue Jul 31, 2023 · 18 comments
Labels
O-Frequent Affects or can be seen by most users regularly or impacts most users' first experience S-Minor Blocks non-critical functionality, workarounds exist. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.

Comments

@MTRNord
Copy link
Contributor

MTRNord commented Jul 31, 2023

Description

Prior issue over at element-web which has been closed as being a synapse issue rather than element-web's: element-hq/element-web#25874

Since element-web switched to doing presence using /sync vs using the presence API my synapse is using extremely high amounts of CPU due to presence requests happening. It also seems to kill both fed senders from the amount of work they have to do (I assume it is just the metrics endpoint timing out since kubernetes doesnt see them as broken).

image
image

The suspicion it being caused by the element-web change is because the peaks align perfectly with the time I start my element-web clients.

Steps to reproduce

  • I believe just opening develop.element.io is enough.

Homeserver

matrix.midnightthoughts.space

Synapse Version

{"server_version":"1.87.0","python_version":"3.11.4"}

Installation Method

Docker (matrixdotorg/synapse)

Database

Postgresql 15 server with active + standby replication nodes

Workers

Single process

Platform

The server is running on a mixed arch (x86_64+arm64) kubernetes cluster.

Configuration

The config is at https://git.nordgedanken.dev/kubernetes/gitops/src/branch/main/apps/production/synapse-midnightthoughts.yaml#L30-L41 mainly there are 4 (3 non working) appservices

Relevant log output

Attached as a file as the log is too long for a comment

Anything else that would be useful to know?

There is an element-web rageshake available hooked to the other issue.

The pasted log is the last hour. Specifically interesting is the high amount of replication around presence_set_state.

@MTRNord
Copy link
Contributor Author

MTRNord commented Jul 31, 2023

@realtyem
Copy link
Contributor

Hey there, did you happen to catch how long ago this started?

@MTRNord
Copy link
Contributor Author

MTRNord commented Jul 31, 2023

It got especially bad after the 25.7.23 where it made a major overall increase:

image
image

Interesting enough the major peaks for cpu usage seem to be on the sync rather than presence. But I am not sure if thats because it includes the presence into it since it was set via sync?

image

Because in terms of request amount the peaks are presence replication and not sync:

image

Also worth noting is that fed sender seems to heavily be doing a single DB op now:

image

Not sure if that one is related or not.

If needed I could hook up the tracing too and export a trace maybe.

@MTRNord
Copy link
Contributor Author

MTRNord commented Jul 31, 2023

Also that db op seems to miss cache quite a lot:

image

@MTRNord
Copy link
Contributor Author

MTRNord commented Jul 31, 2023

And I guess part of the issue seems that the fed sender worker is not sending presence out anymore?

image

@MTRNord
Copy link
Contributor Author

MTRNord commented Jul 31, 2023

Since I have a sliding sync proxy too #15980 may be related too?

@HarHarLinks
Copy link
Contributor

this might also be not just simply be the different api used, but could be due to a conflict between different clients. i've noticed some strange behaviour with element desktop nightly idling on windows (yellow activity indicator) and then going online from schildichat android 1.6.5.sc70-test5 F-Droid which leads to the activity indicator to oscillate between yellow and green.

@cremesk
Copy link
Contributor

cremesk commented Aug 2, 2023

i see exactly the same problem on envs.net (without sliding sync proxy). that's why I temporarily disabled presence.

image
image

@davidmehren
Copy link

We also noticed significantly increased CPU usage of the federation-sender and generic workers along with about 10x the database load after matrix-org/matrix-react-sdk#11223 was merged and landed in Element-Nightly.

image

@t3chguy
Copy link
Member

t3chguy commented Aug 2, 2023

Sounds like Synapse may have a special-case for no set_presence even though the spec says parameter omission should be same as online

If this parameter is omitted then the client is automatically marked as online when it uses this API.

@t3chguy
Copy link
Member

t3chguy commented Aug 2, 2023

@HarHarLinks that was also reported as element-hq/element-web#25900

@DMRobertson
Copy link
Contributor

Sounds like Synapse may have a special-case for no set_presence even though the spec says parameter omission should be same as online

This is not the case as far as I can see:

set_presence = parse_string(
request,
"set_presence",
default="online",
allowed_values=self.ALLOWED_PRESENCE,
)

@rda0
Copy link

rda0 commented Aug 2, 2023

We also noticed significantly increased CPU usage of the presence and generic workers (serving ^/_matrix/client/(r0|v3)/sync$ + (^|&)since= and ^/_matrix/client/(api/v1|r0|v3)/events$), increased nginx request rate and db transactions.

  • Since: 01.08.23 13:20 UTC (15:20 local time)
  • Synapse 1.87.0
  • no sliding sync proxy

Firefox_Screenshot_2023-08-02T09-29-34 314Z

Firefox_Screenshot_2023-08-02T09-29-09 041Z

Firefox_Screenshot_2023-08-02T09-29-04 714Z

It looks good after disabling presence:

presence:
  enabled: false

@actx-1
Copy link

actx-1 commented Aug 3, 2023

this might also be not just simply be the different api used, but could be due to a conflict between different clients. i've noticed some strange behaviour with element desktop nightly idling on windows (yellow activity indicator) and then going online from schildichat android 1.6.5.sc70-test5 F-Droid which leads to the activity indicator to oscillate between yellow and green.

I've noticed the same thing here. Specifically, when I have one instance of Element idle while I'm using another one, they seem to argue about whether I'm online or away, and get constant sync requests returned only containing presence updated by the other instance. Closing the idle instance stopped the activity indicator flickering.

@t3chguy
Copy link
Member

t3chguy commented Aug 4, 2023

This isn't an issue with Element Web. Any client sending set_presence=unavailable running at the same time as one not setting set_presence or sending set_presence=online will flip-flop like this as Synapse will keep toggling the user's presence at each alternating /sync call - this is a bug and not conforming to the spec. See #16057

@seal61
Copy link

seal61 commented Aug 4, 2023

Even after I updated element to 1.11.38 where this issue should be fixed I still see a lot of those errors in the synapse log, that could be related to presence:

tail /var/log/synapse/homeserver.log 2023-08-04 12:51:21,199 - synapse.storage.txn - 722 - WARNING - GET-41319781 - [TXN OPERROR] {update_presence-680f10} could not serialize access due to concurrent update 0/5

Any ideas how to solve this without disabling presence at all?

Applying the element update at least really fixed the high cpu load for me.

@MTRNord
Copy link
Contributor Author

MTRNord commented Aug 4, 2023

One observation I made since ele-web reverted the change on their side is that synapse seems to not free the resources fully up until you restart it. Both memory stayed high and cpu only partially dropped. The federation-sender workers only recovered after a restart (I assume that cleared the pending queue for replication)

@erikjohnston erikjohnston added S-Minor Blocks non-critical functionality, workarounds exist. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. O-Frequent Affects or can be seen by most users regularly or impacts most users' first experience labels Aug 9, 2023
@clokep
Copy link
Member

clokep commented Aug 9, 2023

I believe this is a duplicate of #16057.

@clokep clokep closed this as completed Aug 9, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
O-Frequent Affects or can be seen by most users regularly or impacts most users' first experience S-Minor Blocks non-critical functionality, workarounds exist. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.
Projects
None yet
Development

No branches or pull requests