Consequences of increased invocation frequency #239

Open
lars-t-hansen opened this issue Jan 31, 2025 · 1 comment
Labels
question Further information is requested

Comments

@lars-t-hansen
Collaborator

lars-t-hansen commented Jan 31, 2025

Starting a discussion.

Currently we run sonar once every 300s and exfiltrate via curl randomly within a 240s window, allowing for ~6 failed exfiltrations before the data are dropped on the floor. Curl sets up an https connection for the exfiltration, sends the data, and takes the connection down again. On the really big iron the connection goes via some cluster-wide proxy. This already accounts for ~7 incoming connections to naic-monitor.uio.no per second with Betzy, Fram, Saga, Fox, and the ml nodes reporting.
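As a rough back-of-the-envelope check of what a 10x higher sampling frequency would mean for the connection rate, here is a minimal sketch. It assumes each node opens exactly one connection per sampling interval and that connections are spread evenly over time; the node count is not a measured figure, it is simply inferred from the ~7 connections/s quoted above.

```python
def connections_per_second(nodes: int, interval_s: float) -> float:
    """Average new connections per second if each node opens one per interval."""
    return nodes / interval_s


# Assumption: node count inferred from ~7 connections/s at a 300 s interval.
estimated_nodes = round(7 * 300)  # roughly 2100 reporting nodes

current = connections_per_second(estimated_nodes, 300)
proposed = connections_per_second(estimated_nodes, 30)
print(f"{current:.0f}/s now, {proposed:.0f}/s at a 30 s interval")
```

Under these assumptions the rate scales linearly with frequency, so one-connection-per-sample would put the server at roughly 70 incoming connections per second.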

If we were to increase the frequency to once every 30s, as is done in a number of other places (trailblazing turtle and also the slurm-monitor that our partners at Simula have built), then we probably need to rethink this architecture:

  • firing up several shells, sonar, and curl for every invocation is a fair bit of work
  • setting up a new https connection for every invocation seems like a bit too much
  • we want to be sure that data are represented in a compact form
  • the load of connections on naic-monitor.uio.no may be such that it is not able to serve them all well
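To get a feel for the compactness point in the list above, here is a small sketch comparing a JSON encoding with a fixed binary layout for a single record. The record and its field names are hypothetical and are not sonar's actual schema; the binary layout is just one possible bespoke encoding, chosen for illustration.

```python
import json
import struct

# Hypothetical sample record; field names are illustrative, not sonar's schema.
sample = {"timestamp": 1738281600, "pid": 12345, "cpu_pct": 87.5, "mem_kb": 204800}

# Compact JSON (no whitespace between tokens).
as_json = json.dumps(sample, separators=(",", ":")).encode("utf-8")

# Fixed layout, little-endian, no padding:
# uint64 timestamp, uint32 pid, float32 cpu_pct, uint64 mem_kb
RECORD = struct.Struct("<QIfQ")
as_binary = RECORD.pack(
    sample["timestamp"], sample["pid"], sample["cpu_pct"], sample["mem_kb"]
)

print(f"json: {len(as_json)} bytes, binary: {len(as_binary)} bytes")
```

The binary form drops the repeated field names, which is where most of the JSON size goes, at the cost of needing a schema agreed on by both ends. General-purpose compression of batched JSON would narrow the gap, so the numbers here are only indicative.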

I'm not much of an expert on these things. We can probably throw hardware at the problem, but we can also rethink a bit how it's done. Some options:

  • sonar could remain resident on the node, daemon-like, and be driven by a simple script to trigger ps, sysinfo, sacct invocations at appropriate times
  • sonar could perform its own exfiltration over a semi-permanent or at least reused connection
  • there could be an on-cluster mqtt or similar broker to act as a message intermediary so that there are fewer connections to naic-monitor
  • intra-cluster messages from sonar to the broker could be optimized wrt representation, encryption
  • we could choose to use a compact message format, e.g. protobuf or a bespoke binary encoding, rather than json or csv
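The first two options above (a resident process that does its own exfiltration) could look roughly like the following sketch. Everything here is an assumption for illustration: `run_resident`, `sample`, and `send` are hypothetical names, and the default backlog limit of 6 only mirrors the "~6 failed exfiltrations" tolerance mentioned earlier. The transport is deliberately left behind the `send` callable, so it could be a reused HTTPS connection or a broker client.

```python
import time
from typing import Callable, List


def run_resident(
    sample: Callable[[], bytes],
    send: Callable[[bytes], bool],
    interval_s: float,
    iterations: int,
    max_pending: int = 6,
    sleep: Callable[[float], None] = time.sleep,
) -> List[bytes]:
    """Sample once per interval, retry unsent records, bound the backlog.

    Returns the records still unsent when the loop ends.
    """
    pending: List[bytes] = []
    for _ in range(iterations):
        pending.append(sample())
        # Flush oldest-first; stop at the first failure and retry next round.
        while pending and send(pending[0]):
            pending.pop(0)
        # Drop the oldest records so a long outage cannot grow memory unbounded.
        if len(pending) > max_pending:
            del pending[: len(pending) - max_pending]
        sleep(interval_s)
    return pending
```

Because the process stays resident, `send` can hold on to one long-lived connection instead of setting up and tearing down a new one per sample, and no shells or curl processes need to be started for each invocation.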
@lars-t-hansen lars-t-hansen added the question Further information is requested label Jan 31, 2025
@bast
Member

bast commented Jan 31, 2025

I like the idea of leveraging existing libraries/protocol solutions to act as an intermediary, where something that has been written for this manages the connections and also compresses data for us.
