Starting a discussion.
Currently we run sonar once every 300s and exfiltrate via curl at a random point within a 240s window, allowing for ~6 failed exfiltration attempts before the data are dropped on the floor. Curl sets up an HTTPS connection for the exfiltration, sends the data, and takes the connection down again. On the really big iron the connection goes via some cluster-wide proxy. This already accounts for ~7 incoming connections to naic-monitor.uio.no per second with Betzy, Fram, Saga, Fox, and the ML nodes reporting.
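For a rough sense of what a higher cadence means on the ingest side, here is a back-of-the-envelope calculation; the node count is an assumption, inferred from the reported ~7 connections/s at the 300s cadence:

```python
# Back-of-the-envelope connection rates at naic-monitor.uio.no.
# NODES is an assumption, inferred from ~7 connections/s at a 300 s cadence
# (7 * 300 ~= 2100 reporting nodes across Betzy, Fram, Saga, Fox, ML nodes).
NODES = 2100

for interval_s in (300, 30):
    rate = NODES / interval_s
    print(f"{interval_s:>3} s cadence -> ~{rate:.0f} new HTTPS connections/s")

# 300 s cadence -> ~7 new HTTPS connections/s
#  30 s cadence -> ~70 new HTTPS connections/s
```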
If we were to increase the frequency to once every 30s, as is done in a number of other places (Trailblazing Turtle, and also the slurm-monitor that our partners at Simula have built), then we probably need to rethink this architecture:
firing up several shells, sonar, and curl for every invocation is a fair bit of work
setting up a new HTTPS connection for every invocation seems like a bit much
we want to be sure that data are represented in a compact form
the connection load on naic-monitor.uio.no may grow to the point where it can no longer serve them all well
I'm not much of an expert on these things; we could probably throw hardware at the problem, but we can also rethink a bit how it's done. Some options:
sonar could remain resident on the node, daemon-like, and be driven by a simple script that triggers ps, sysinfo, and sacct invocations at appropriate times (see the first sketch after this list)
sonar could perform its own exfiltration over a semi-permanent or at least reused connection
there could be an on-cluster MQTT or similar broker acting as a message intermediary so that there are fewer connections to naic-monitor (see the second sketch after this list)
intra-cluster messages from sonar to the broker could be optimized with respect to representation and encryption
we could choose a compact message format, e.g. protobuf or a bespoke binary encoding, rather than JSON or CSV
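To make the first two options concrete, here is a minimal sketch of a resident driver that runs sonar subcommands on a schedule and posts their output over a single reused HTTPS connection. The ingest URL, the intervals, and the retry behavior are illustrative assumptions, not the real configuration:

```python
# Minimal sketch (not the current design): a resident driver that runs sonar
# subcommands on a schedule and POSTs their output over a reused HTTPS
# connection.  URL, intervals, and retry behavior are illustrative assumptions.
import subprocess
import time
import requests

INGEST_URL = "https://naic-monitor.uio.no/sonar/v0/add"  # hypothetical endpoint
PS_INTERVAL_S = 30          # the sample cadence under discussion
SYSINFO_EVERY_N = 120       # take a sysinfo sample every N-th ps sample

session = requests.Session()  # keeps the TLS connection alive between POSTs

def run_and_post(cmd):
    """Run a sonar subcommand and ship its output over the shared session."""
    out = subprocess.run(cmd, capture_output=True, check=True).stdout
    try:
        session.post(INGEST_URL, data=out, timeout=10)
    except requests.RequestException:
        pass  # a real driver would queue the sample and retry, as curl does today

n = 0
while True:
    run_and_post(["sonar", "ps"])
    if n % SYSINFO_EVERY_N == 0:
        run_and_post(["sonar", "sysinfo"])
    n += 1
    time.sleep(PS_INTERVAL_S)
```

Compared with the current cron + curl setup this avoids the process startup and the per-sample TLS handshake, at the cost of keeping a process alive on every node.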
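And a sketch of the broker-based options, assuming an on-cluster MQTT broker (hypothetical host and topic names) and a bespoke packed record purely to illustrate the size difference versus JSON; a real deployment might prefer protobuf and would also want TLS between node and broker:

```python
# Sketch of the broker-based options: publish compact binary samples to an
# on-cluster MQTT broker over one long-lived connection.  Broker host, topic
# layout, and record layout are assumptions; TLS to the broker is omitted here.
import json
import struct
import time
import paho.mqtt.client as mqtt   # assumed library; paho-mqtt >= 2.0

BROKER = "mqtt.cluster.local"      # hypothetical on-cluster broker
TOPIC = "sonar/betzy/c1-23/ps"     # hypothetical <prefix>/<cluster>/<node>/<kind>

# One process sample: timestamp, job id, pid, cpu%, resident memory (KiB).
RECORD = struct.Struct("<QIIfQ")   # fixed layout, 28 bytes per sample

sample = dict(time=int(time.time()), job=123456, pid=4242,
              cpu_pct=97.5, rss_kib=2_097_152)
packed = RECORD.pack(sample["time"], sample["job"], sample["pid"],
                     sample["cpu_pct"], sample["rss_kib"])
print(len(packed), "bytes packed vs", len(json.dumps(sample).encode()), "bytes as JSON")

client = mqtt.Client(mqtt.CallbackAPIVersion.VERSION2)
client.connect(BROKER, port=1883, keepalive=60)   # one connection per node, kept open
client.loop_start()
client.publish(TOPIC, packed, qos=1).wait_for_publish()
client.loop_stop()
client.disconnect()
```

The broker (or a small bridge next to it) would then be the only thing talking to naic-monitor.uio.no, batching and forwarding samples, so the connection count at the ingest host no longer scales with the number of nodes.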
I like the idea of leveraging existing libraries/protocol solutions as the intermediary, so that something that has been written for this purpose manages the connections and also compresses the data for us.