
Investigate using Oracle Cloud bare metal for green reviews cluster #166

Open
rossf7 opened this issue Feb 26, 2025 · 5 comments

@rossf7
Contributor

rossf7 commented Feb 26, 2025

Currently the green reviews cluster consists of physical servers from Equinix Metal using credits donated to CNCF.

Equinix Metal is being sunset in June 2026 and we're investigating whether we can use Oracle Cloud Infrastructure as an alternative.

Goals

  • Continue using bare metal servers with RAPL enabled for collecting energy metrics
  • Retire Equinix Metal servers and user accounts
  • Select Oracle Cloud region primarily powered by renewable energy
  • Reduce footprint of green reviews cluster where possible

Non-Goals

Region Selection

We have access to the IAD region US East (Ashburn) which is linked to the credits donated to CNCF.

We investigated using the AMS (Amsterdam) region, but that is not possible right now. Renewable energy and PUE figures for each region are published in the latest (2024) report.

Source: https://www.oracle.com/a/ocom/docs/corporate/citizenship/clean-cloud-oci.pdf

Bare Metal Instances

Oracle Cloud bare metal instances have access to RAPL and are compatible with our stack (Ubuntu 24.04 / k3s / Kepler).

Unlike Equinix Metal, there is no ACPI power monitor, so Kepler cannot measure idle power. This is not critical, since we are primarily interested in measuring dynamic power.
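For context on why RAPL alone is enough for dynamic power: Kepler derives power by sampling the RAPL energy counters (exposed in microjoules under `/sys/class/powercap/intel-rapl`) and dividing the delta by the sampling interval. A minimal sketch of that calculation, including the wraparound handling the counters need (function and parameter names here are hypothetical, not Kepler's actual code):

```python
def dynamic_power_watts(start_uj: int, end_uj: int,
                        interval_s: float, max_range_uj: int) -> float:
    """Average power over a sampling interval from two RAPL energy
    readings (microjoules). RAPL counters wrap at max_range_uj, so a
    negative delta means the counter rolled over during the interval."""
    delta_uj = end_uj - start_uj
    if delta_uj < 0:  # counter wrapped around
        delta_uj += max_range_uj
    return delta_uj / 1_000_000 / interval_s
```

For example, a 10 J delta over a 1 s interval corresponds to 10 W of average power.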

Oracle Cloud uses the term "shape" for instance types. The smallest bare metal shape is sufficient for our current use case, although its spec is much higher than that of the Equinix Metal instances we currently use.

| Shape | OCPU | Memory (GB) | Local Disk | Max Network Bandwidth | Max VNICs: Linux | Max VNICs: Windows |
| --- | --- | --- | --- | --- | --- | --- |
| BM.Standard3.64 | 64 | 1024 | Block storage only | 2 x 50 Gbps | 256 | 129 (1 on the first physical NIC, 128 on the second) |

https://docs.oracle.com/en-us/iaas/Content/Compute/References/computeshapes.htm#bm-standard

VM Instances

The benchmarking node will continue to be a physical server to have access to RAPL.

For the control plane and the internal node running Flux and Prometheus, a physical server is not necessary, so we could use virtual machines, including burstable instances.

https://docs.oracle.com/en-us/iaas/Content/Compute/References/burstable-instances.htm

This option is not possible with Equinix Metal. The downside would be that energy measurements for components running on these nodes would be in estimation mode.

EDIT: Updated region selection section.

@rossf7 rossf7 self-assigned this Feb 26, 2025
@rossf7 rossf7 moved this from Backlog to In progress in Green Reviews Feb 26, 2025
@rossf7 rossf7 changed the title Investigate using OCI bare metal for green reviews cluster Investigate using Oracle Cloud bare metal for green reviews cluster Feb 26, 2025
@rossf7
Contributor Author

rossf7 commented Mar 5, 2025

As well as progressing the migration we want to reduce the footprint of the cluster - see #67

These two goals conflict because the lowest-spec Oracle Cloud bare metal shape (BM.Standard3.64) has much higher CPU and memory than the lowest-spec Equinix Metal instances we use (m3.small.x86).

  • 64 OCPU vs 1 x Intel Xeon E-2378G (8 cores @ 2.80GHz)
  • 1024 GB RAM vs 64 GB

AIUI this is because an Oracle bare metal instance is a physical server that would otherwise host multiple VMs.

My proposal is that we migrate the cluster with all nodes, including the benchmarking node, hosted on VMs. Kepler will run in estimation mode because RAPL is not available; however, the metrics are the same, so we can continue pipeline development.

Once the pipeline can provision the benchmark node on demand we can replace the VM with a bare metal instance.
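As a sketch of what "provision the benchmark node on demand" could look like from the pipeline's point of view (all function names here are hypothetical placeholders, not OCI or Terraform APIs), the key point is that teardown happens even if a benchmark fails:

```python
def run_benchmark_batch(provision, run_job, deprovision, jobs):
    """Bring up the bare metal node only for the duration of the queued
    benchmark jobs, then tear it down so it is not billed (or burning
    energy) while idle."""
    node = provision()              # e.g. a Terraform apply / instance launch
    try:
        return [run_job(node, job) for job in jobs]
    finally:
        deprovision(node)           # always tear down, even on failure
```

The `try`/`finally` is the load-bearing part: the high-spec node must never be left running after a failed batch.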

@leonardpahlke @nikimanoledaki @AntonioDiTuri @locomundo @dipankardas011 WDYT to this approach? Any concerns?

@leonardpahlke
Member

This sounds like a good idea. We could use KVM via https://github.com/dmacvicar/terraform-provider-libvirt or similar to create the VMs, and perhaps something like Talos Linux to bootstrap the cluster easily.

A second option would be to reserve the server for just a short timeframe, once a day or so, and do all benchmarking as a batch. To maintain the batch we would need a queue running: some kind of key-value store or similar, which could be filled via the CI/CD pipeline over the day. Measurements in persistent storage (down the road), the Grafana dashboard, Flux, and other tooling that we would like to host all the time could be served from a minimal non-bare-metal instance. This could be a cheaper and more resource-efficient solution, since we may only need to run the bare metal machine for 1h a day or 1h every two days. However, it means we would not have quick response times (we could have the bare metal machine spin up every few hours to improve this, or even run it all the time if that becomes a problem down the road). Quick draw-up:

[image: quick architecture sketch]
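The queue idea above can be sketched as a tiny in-memory model (a real deployment would back this with the key-value store mentioned, filled by the CI/CD pipeline; all names here are hypothetical): jobs accumulate over the day, and the batch runner drains everything in one go when the bare metal node comes up.

```python
from dataclasses import dataclass, field

@dataclass
class BenchmarkQueue:
    """Toy model of the proposed batch queue."""
    jobs: list = field(default_factory=list)

    def enqueue(self, project: str, config: str) -> None:
        """Called by CI/CD whenever a new benchmark run is requested."""
        self.jobs.append((project, config))

    def drain(self) -> list:
        """Take the whole accumulated batch when the node boots."""
        batch, self.jobs = self.jobs, []
        return batch
```

Draining returns everything and empties the queue, so each boot of the bare metal machine processes exactly the work that accumulated since the last batch.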

@rossf7
Contributor Author

rossf7 commented Mar 5, 2025

Thank you @leonardpahlke I've used KVM in the past but not with libvirt. I'll do some reading up on that.

> A second option would be to reserve the server just for a short timeframe once a day or so and do all benchmarking as a batch. To maintain the batch we would need to have a queue running: some kind of key-value store or similar.

I do like this option of batching the work so we only run the high-spec node for 1-2 hours a day. Adding an in-cluster queue is nice, and we can also batch the three configurations of Falco we currently test.

With Equinix it takes about 10 minutes to boot a node, and I expect similar for Oracle. I don't see the delay as a problem, as generating the review results is not time-sensitive.

For the other nodes we can use VMs and burstable may even be an option if they provide sufficient performance.
https://docs.oracle.com/en-us/iaas/Content/Compute/References/burstable-instances.htm

@AntonioDiTuri
Contributor

Thanks @rossf7 for putting together the material and @leonardpahlke for guidance.
It looks like a good plan. I am wondering if we can use a managed Oracle queue service like https://www.oracle.com/cloud/queue/ to avoid keeping even a VM up all the time. What do you think?

@rossf7
Contributor Author

rossf7 commented Mar 5, 2025

> It looks like a good plan. I am wondering if we can use a managed Oracle queue service like https://www.oracle.com/cloud/queue/ to avoid keeping even a VM up all the time.

Thanks @AntonioDiTuri yes we should investigate using a managed service for the queue as well as the long term storage.

Ideally we would provision the whole cluster on demand. My main doubt is how we integrate with LFX Insights because we need more details there. Also if we want to have an interactive UI like a Grafana dashboard we'd need to host it somewhere.

This is why I like the idea from @leonardpahlke in #166 to have an architecture diagram for the full pipeline including LFX Insights. I think it will help with making long term decisions about the architecture.

Projects
Status: In progress