Starting a discussion.
Currently we run sonar once every 300s and exfiltrate via curl at a random point within a 240s window, allowing for ~6 failed exfiltration attempts before the data are dropped on the floor. Curl sets up an HTTPS connection for the exfiltration, sends the data, and takes the connection down again. On the really big iron the connection goes via some cluster-wide proxy. This already accounts for ~7 incoming connections to naic-monitor.uio.no per second with Betzy, Fram, Saga, Fox, and the ML nodes reporting.
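For a rough sense of what a higher cadence means on the ingest side, here is a back-of-the-envelope calculation; the node count is an assumption, inferred from the reported ~7 connections/s at the 300s cadence:

```python
# Back-of-the-envelope connection rates at naic-monitor.uio.no.
# NODES is an assumption, inferred from ~7 connections/s at a 300 s cadence
# (7 * 300 ~= 2100 reporting nodes across Betzy, Fram, Saga, Fox, ML nodes).
NODES = 2100

for interval_s in (300, 30):
    rate = NODES / interval_s
    print(f"{interval_s:>3} s cadence -> ~{rate:.0f} new HTTPS connections/s")

# 300 s cadence -> ~7 new HTTPS connections/s
#  30 s cadence -> ~70 new HTTPS connections/s
```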
If we were to increase the frequency to once every 30s, as is done in a number of other places (Trailblazing Turtle, and also the slurm-monitor that our partners at Simula have built), then we probably need to rethink this architecture:
firing up several shells, sonar, and curl for every invocation is a fair bit of work
setting up a new HTTPS connection for every invocation seems like a bit much
we want to be sure that data are represented in a compact form
the connection load on naic-monitor.uio.no may grow to the point where it can no longer serve them all well
I'm not much of an expert on these things; we could probably throw hardware at the problem, but we can also rethink a bit how it's done. Some options:
sonar could remain resident on the node, daemon-like, and be driven by a simple script that triggers ps, sysinfo, and sacct invocations at appropriate times (see the first sketch after this list)
sonar could perform its own exfiltration over a semi-permanent or at least reused connection
there could be an on-cluster MQTT or similar broker acting as a message intermediary so that there are fewer connections to naic-monitor (see the second sketch after this list)
intra-cluster messages from sonar to the broker could be optimized with respect to representation and encryption
we could choose a compact message format, e.g. protobuf or a bespoke binary encoding, rather than JSON or CSV
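To make the first two options concrete, here is a minimal sketch of a resident driver that runs sonar subcommands on a schedule and posts their output over a single reused HTTPS connection. The ingest URL, the intervals, and the retry behavior are illustrative assumptions, not the real configuration:

```python
# Minimal sketch (not the current design): a resident driver that runs sonar
# subcommands on a schedule and POSTs their output over a reused HTTPS
# connection.  URL, intervals, and retry behavior are illustrative assumptions.
import subprocess
import time
import requests

INGEST_URL = "https://naic-monitor.uio.no/sonar/v0/add"  # hypothetical endpoint
PS_INTERVAL_S = 30          # the sample cadence under discussion
SYSINFO_EVERY_N = 120       # take a sysinfo sample every N-th ps sample

session = requests.Session()  # keeps the TLS connection alive between POSTs

def run_and_post(cmd):
    """Run a sonar subcommand and ship its output over the shared session."""
    out = subprocess.run(cmd, capture_output=True, check=True).stdout
    try:
        session.post(INGEST_URL, data=out, timeout=10)
    except requests.RequestException:
        pass  # a real driver would queue the sample and retry, as curl does today

n = 0
while True:
    run_and_post(["sonar", "ps"])
    if n % SYSINFO_EVERY_N == 0:
        run_and_post(["sonar", "sysinfo"])
    n += 1
    time.sleep(PS_INTERVAL_S)
```

Compared with the current cron + curl setup this avoids the process startup and the per-sample TLS handshake, at the cost of keeping a process alive on every node.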
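And a sketch of the broker-based options, assuming an on-cluster MQTT broker (hypothetical host and topic names) and a bespoke packed record purely to illustrate the size difference versus JSON; a real deployment might prefer protobuf and would also want TLS between node and broker:

```python
# Sketch of the broker-based options: publish compact binary samples to an
# on-cluster MQTT broker over one long-lived connection.  Broker host, topic
# layout, and record layout are assumptions; TLS to the broker is omitted here.
import json
import struct
import time
import paho.mqtt.client as mqtt   # assumed library; paho-mqtt >= 2.0

BROKER = "mqtt.cluster.local"      # hypothetical on-cluster broker
TOPIC = "sonar/betzy/c1-23/ps"     # hypothetical <prefix>/<cluster>/<node>/<kind>

# One process sample: timestamp, job id, pid, cpu%, resident memory (KiB).
RECORD = struct.Struct("<QIIfQ")   # fixed layout, 28 bytes per sample

sample = dict(time=int(time.time()), job=123456, pid=4242,
              cpu_pct=97.5, rss_kib=2_097_152)
packed = RECORD.pack(sample["time"], sample["job"], sample["pid"],
                     sample["cpu_pct"], sample["rss_kib"])
print(len(packed), "bytes packed vs", len(json.dumps(sample).encode()), "bytes as JSON")

client = mqtt.Client(mqtt.CallbackAPIVersion.VERSION2)
client.connect(BROKER, port=1883, keepalive=60)   # one connection per node, kept open
client.loop_start()
client.publish(TOPIC, packed, qos=1).wait_for_publish()
client.loop_stop()
client.disconnect()
```

The broker (or a small bridge next to it) would then be the only thing talking to naic-monitor.uio.no, batching and forwarding samples, so the connection count at the ingest host no longer scales with the number of nodes.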
I like the idea of leveraging existing libraries/protocol solutions as the intermediary, so that something that has been written for this purpose manages the connections and also compresses the data for us.