Investigate using Oracle Cloud bare metal for green reviews cluster #166
As well as progressing the migration, we want to reduce the footprint of the cluster - see #67. These two goals conflict because the lowest-spec Oracle Cloud bare metal instance (BM.Standard3.64) has much higher CPU and memory than the lowest-spec Equinix Metal instance we use (m3.small.x86).
As I understand it, this is because each Oracle bare metal instance is a physical server that would otherwise host multiple VMs. My proposal is that we migrate the cluster with all nodes, including the benchmarking node, hosted on VMs. Kepler will run in estimation mode because RAPL is not available, but the metrics are the same, so we can continue pipeline development. Once the pipeline can provision the benchmark node on demand, we can replace the VM with a bare metal instance. @leonardpahlke @nikimanoledaki @AntonioDiTuri @locomundo @dipankardas011 What do you think of this approach? Any concerns?
This sounds like a good idea. We could use KVM with https://github.com/dmacvicar/terraform-provider-libvirt or similar to create the VMs, and perhaps something like Talos Linux to bootstrap the cluster easily. A second option would be to reserve the server for just a short timeframe once a day or so and do all benchmarking as a batch. To maintain the batch we would need a queue running, some kind of key-value store or similar, which could be filled by the CI/CD pipeline over the day. Measurements in persistent storage (down the road), the Grafana dashboard, Flux, and other tooling that we would like to host all the time could be served from a minimal non-bare-metal instance. This could be a cheaper and more resource-efficient solution since we may only need to run the bare metal machine for 1h a day or 1h every two days. However, it means we do not have quick response times (we could set the bare metal machine to spin up every few hours to improve this, or even run it all the time if that becomes a problem down the road). Quick draw-up:
Thank you @leonardpahlke. I've used KVM in the past but not with libvirt, so I'll do some reading up on that.
I do like this option, and batching the work so we only run the high-spec node for 1-2 hours a day. Adding an in-cluster queue is nice, and we can also batch the 3 configurations of Falco we currently test. With Equinix it takes about 10 minutes to boot a node and I expect similar for Oracle. I don't see the delay as a problem since generating the review results is not time sensitive. For the other nodes we can use VMs, and burstable instances may even be an option if they provide sufficient performance.
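To make the libvirt option more concrete, here is a minimal sketch of what provisioning a VM on the bare metal host with terraform-provider-libvirt might look like. The image URL, names, and sizes are placeholders for illustration, not decisions:

```hcl
# Sketch only: assumes libvirtd runs on the Oracle bare metal host and that we
# boot the VMs from an Ubuntu cloud image. All names and sizes are illustrative.
terraform {
  required_providers {
    libvirt = {
      source = "dmacvicar/libvirt"
    }
  }
}

provider "libvirt" {
  uri = "qemu:///system"
}

# Root disk for the internal node, cloned from the Ubuntu cloud image.
resource "libvirt_volume" "internal_node" {
  name   = "green-reviews-internal.qcow2"
  pool   = "default"
  source = "https://cloud-images.ubuntu.com/noble/current/noble-server-cloudimg-amd64.img"
  format = "qcow2"
}

# The internal node VM; the benchmark node could be defined the same way until
# it moves back to a dedicated bare metal instance.
resource "libvirt_domain" "internal_node" {
  name   = "green-reviews-internal"
  memory = 8192 # MiB
  vcpu   = 4

  disk {
    volume_id = libvirt_volume.internal_node.id
  }

  network_interface {
    network_name = "default"
  }
}
```

If we go with Talos Linux instead, the image source would change but the libvirt resources would stay roughly the same.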
Thanks @rossf7 for putting together the material and @leonardpahlke for guidance.
Thanks @AntonioDiTuri, yes we should investigate using a managed service for the queue as well as the long-term storage. Ideally we would provision the whole cluster on demand. My main doubt is how we integrate with LFX Insights, because we need more details there. Also, if we want an interactive UI like a Grafana dashboard, we'd need to host it somewhere. This is why I like the idea from @leonardpahlke in #166 to have an architecture diagram for the full pipeline including LFX Insights. I think it will help with making long-term decisions about the architecture.
Currently the green reviews cluster consists of physical servers from Equinix Metal using credits donated to CNCF.
Equinix Metal is being sunset in June 2026 and we're investigating whether we can use Oracle Cloud Infrastructure as an alternative.
Goals
Non-Goals
Region Selection
We have access to the IAD region, US East (Ashburn), which is linked to the credits donated to CNCF.
We investigated using the AMS (Amsterdam) region but that is not possible right now. Figures for renewable energy and PUE are published in the latest 2024 report.
Source: https://www.oracle.com/a/ocom/docs/corporate/citizenship/clean-cloud-oci.pdf
Bare Metal Instances
Oracle Cloud bare metal instances have access to RAPL and are compatible with our stack (Ubuntu 24.04 / k3s / Kepler).
Unlike Equinix Metal, there is no ACPI power monitor, so Kepler cannot measure idle power. This is not critical since we are primarily interested in measuring dynamic power.
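To spell out the tradeoff: a node's draw splits roughly as $P_{\text{node}} = P_{\text{idle}} + P_{\text{dynamic}}$, where Kepler attributes the dynamic part to workloads using RAPL counters, while the idle part needs a platform-level meter such as the ACPI sensor on Equinix. Since the reviews compare workload footprints, losing the idle term mainly affects absolute node totals rather than the per-workload measurements.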
Oracle Cloud uses the term Shape for instance types. The smallest bare metal shape is sufficient for our current use case, although its spec is much higher than that of the current Equinix Metal instances we use.
https://docs.oracle.com/en-us/iaas/Content/Compute/References/computeshapes.htm#bm-standard
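As a rough sketch, provisioning the smallest bare metal shape with the OCI Terraform provider could look like the snippet below (the variables and image OCID are placeholders, not values from our tenancy):

```hcl
# Sketch only: all OCIDs are placeholders and would come from our OCI tenancy.
resource "oci_core_instance" "benchmark_node" {
  availability_domain = var.availability_domain  # an AD in the IAD / US East (Ashburn) region
  compartment_id      = var.compartment_ocid
  shape               = "BM.Standard3.64"        # smallest bare metal shape

  source_details {
    source_type = "image"
    source_id   = var.ubuntu_2404_image_ocid     # Ubuntu 24.04 image OCID
  }

  create_vnic_details {
    subnet_id = var.subnet_ocid
  }
}
```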
VM Instances
The benchmarking node will continue to be a physical server so that it has access to RAPL.
For the control plane and the internal node running Flux and Prometheus, we could use virtual machines, including burstable instances, since a physical server is not necessary.
https://docs.oracle.com/en-us/iaas/Content/Compute/References/burstable-instances.htm
This option is not possible with Equinix Metal. The downside would be that energy measurements for components running on these nodes would be in estimation mode.
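For illustration, a burstable VM for one of these nodes might be declared like this (the flex shape, sizes, and baseline are assumptions for the sketch, not decisions):

```hcl
# Sketch only: burstable instances are flex VM shapes with a baseline below 100%.
resource "oci_core_instance" "internal_node" {
  availability_domain = var.availability_domain
  compartment_id      = var.compartment_ocid
  shape               = "VM.Standard.E4.Flex"

  shape_config {
    ocpus                     = 2
    memory_in_gbs             = 16
    baseline_ocpu_utilization = "BASELINE_1_8"  # 12.5% baseline, can burst to the full 2 OCPUs
  }

  source_details {
    source_type = "image"
    source_id   = var.ubuntu_2404_image_ocid
  }

  create_vnic_details {
    subnet_id = var.subnet_ocid
  }
}
```

Whether a 1/8 baseline is enough for Flux and Prometheus would need testing; BASELINE_1_2 is the other burstable option.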
EDIT: Updated region selection section.