
Investigate using Oracle Cloud bare metal for green reviews cluster #166

Open
rossf7 opened this issue Feb 26, 2025 · 5 comments

@rossf7
Contributor

rossf7 commented Feb 26, 2025

Currently the green reviews cluster consists of physical servers from Equinix Metal using credits donated to CNCF.

Equinix Metal is being sunset in June 2026 and we're investigating whether we can use Oracle Cloud Infrastructure as an alternative.

Goals

  • Continue using bare metal servers with RAPL enabled for collecting energy metrics
  • Retire Equinix Metal servers and user accounts
  • Select Oracle Cloud region primarily powered by renewable energy
  • Reduce footprint of green reviews cluster where possible

Non-Goals

Region Selection

We have access to the IAD region US East (Ashburn) which is linked to the credits donated to CNCF.

We investigated using the AMS (Amsterdam) region, but that is not possible right now. Renewable energy and PUE figures for each region are published in the latest (2024) report.

Source: https://www.oracle.com/a/ocom/docs/corporate/citizenship/clean-cloud-oci.pdf

Bare Metal Instances

Oracle Cloud bare metal instances have access to RAPL and are compatible with our stack (Ubuntu 24.04 / k3s / Kepler).

Unlike Equinix Metal, there is no ACPI power monitor, so Kepler cannot measure idle power. This is not critical, since we are primarily interested in measuring dynamic power.
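For context on why RAPL alone is enough for dynamic power: Kepler derives power by sampling the RAPL energy counters (exposed in microjoules under `/sys/class/powercap/intel-rapl`) and dividing the delta by the sampling interval. A minimal sketch of that calculation, including the wraparound handling the counters need (function and parameter names here are hypothetical, not Kepler's actual code):

```python
def dynamic_power_watts(start_uj: int, end_uj: int,
                        interval_s: float, max_range_uj: int) -> float:
    """Average power over a sampling interval from two RAPL energy
    readings (microjoules). RAPL counters wrap at max_range_uj, so a
    negative delta means the counter rolled over during the interval."""
    delta_uj = end_uj - start_uj
    if delta_uj < 0:  # counter wrapped around
        delta_uj += max_range_uj
    return delta_uj / 1_000_000 / interval_s
```

For example, a 10 J delta over a 1 s interval corresponds to 10 W of average power.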

Oracle Cloud uses the term "shape" for instance types. The smallest bare metal shape is sufficient for our current use case, although its spec is much higher than that of the Equinix Metal instances we currently use.

| Shape | OCPU | Memory (GB) | Local Disk | Max Network Bandwidth | Max VNICs: Linux | Max VNICs: Windows |
| --- | --- | --- | --- | --- | --- | --- |
| BM.Standard3.64 | 64 | 1024 | Block storage only | 2 x 50 Gbps | 256 | 129 (1 on the first physical NIC, 128 on the second) |

https://docs.oracle.com/en-us/iaas/Content/Compute/References/computeshapes.htm#bm-standard

VM Instances

The benchmarking node will continue to be a physical server to have access to RAPL.

For the control plane and the internal node running Flux and Prometheus, a physical server is not necessary, so we could use virtual machines, including burstable instances.

https://docs.oracle.com/en-us/iaas/Content/Compute/References/burstable-instances.htm

This option is not possible with Equinix Metal. The downside would be that energy measurements for components running on these nodes would be in estimation mode.

EDIT: Updated region selection section.

@rossf7 rossf7 self-assigned this Feb 26, 2025
@rossf7 rossf7 moved this from Backlog to In progress in Green Reviews Feb 26, 2025
@rossf7 rossf7 changed the title Investigate using OCI bare metal for green reviews cluster Investigate using Oracle Cloud bare metal for green reviews cluster Feb 26, 2025
@rossf7
Contributor Author

rossf7 commented Mar 5, 2025

As well as progressing the migration we want to reduce the footprint of the cluster - see #67

These two goals conflict because the lowest-spec Oracle Cloud bare metal shape (BM.Standard3.64) has much higher CPU and memory than the lowest-spec Equinix Metal instances we use (m3.small.x86).

  • 64 OCPU vs 1 x Intel Xeon E-2378G (8 cores @ 2.80GHz)
  • 1024 GB RAM vs 64 GB

AIUI this is because an Oracle bare metal instance is a physical server that would otherwise host multiple VMs.

My proposal is that we migrate the cluster with all nodes, including the benchmarking node, hosted on VMs. Kepler will run in estimation mode because RAPL is not available; however, the metrics are the same, so we can continue pipeline development.

Once the pipeline can provision the benchmark node on demand we can replace the VM with a bare metal instance.
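As a sketch of what "provision the benchmark node on demand" could look like from the pipeline's point of view (all function names here are hypothetical placeholders, not OCI or Terraform APIs), the key point is that teardown happens even if a benchmark fails:

```python
def run_benchmark_batch(provision, run_job, deprovision, jobs):
    """Bring up the bare metal node only for the duration of the queued
    benchmark jobs, then tear it down so it is not billed (or burning
    energy) while idle."""
    node = provision()              # e.g. a Terraform apply / instance launch
    try:
        return [run_job(node, job) for job in jobs]
    finally:
        deprovision(node)           # always tear down, even on failure
```

The `try`/`finally` is the load-bearing part: the high-spec node must never be left running after a failed batch.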

@leonardpahlke @nikimanoledaki @AntonioDiTuri @locomundo @dipankardas011 WDYT to this approach? Any concerns?

@leonardpahlke
Member

This sounds like a good idea. We could use KVM via https://github.com/dmacvicar/terraform-provider-libvirt or similar to create the VMs, and perhaps something like Talos Linux to bootstrap the cluster easily.

A second option would be to reserve the server for just a short timeframe, once a day or so, and do all benchmarking as a batch. To maintain the batch we would need a queue running: some kind of key-value store or similar, which could be filled via the CI/CD pipeline over the day. Measurements in persistent storage (down the road), the Grafana dashboard, Flux, and other tooling that we would like to host all the time could be served from a minimal non-bare-metal instance. This could be a cheaper and more resource-efficient solution, since we may only need to run the bare metal machine for 1h a day or 1h every two days. However, it means we would not have quick response times (we could have the bare metal machine spin up every few hours to improve this, or even run it all the time if that becomes a problem down the road). Quick draw-up:

[image: quick architecture sketch]
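The queue idea above can be sketched as a tiny in-memory model (a real deployment would back this with the key-value store mentioned, filled by the CI/CD pipeline; all names here are hypothetical): jobs accumulate over the day, and the batch runner drains everything in one go when the bare metal node comes up.

```python
from dataclasses import dataclass, field

@dataclass
class BenchmarkQueue:
    """Toy model of the proposed batch queue."""
    jobs: list = field(default_factory=list)

    def enqueue(self, project: str, config: str) -> None:
        """Called by CI/CD whenever a new benchmark run is requested."""
        self.jobs.append((project, config))

    def drain(self) -> list:
        """Take the whole accumulated batch when the node boots."""
        batch, self.jobs = self.jobs, []
        return batch
```

Draining returns everything and empties the queue, so each boot of the bare metal machine processes exactly the work that accumulated since the last batch.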

@rossf7
Contributor Author

rossf7 commented Mar 5, 2025

Thank you @leonardpahlke I've used KVM in the past but not with libvirt. I'll do some reading up on that.

> A second option would be to reserve the server just for a short timeframe once a day or so and do all benchmarking as a batch. To maintain the batch we would need to have a queue running: some kind of key-value store or similar.

I do like this option of batching the work so we only run the high-spec node for 1-2 hours a day. Adding an in-cluster queue is nice, and we can also batch the three configurations of Falco we currently test.

With Equinix it takes about 10 minutes to boot a node, and I expect similar for Oracle. I don't see the delay as a problem, as generating the review results is not time-sensitive.

For the other nodes we can use VMs and burstable may even be an option if they provide sufficient performance.
https://docs.oracle.com/en-us/iaas/Content/Compute/References/burstable-instances.htm

@AntonioDiTuri
Contributor

Thanks @rossf7 for putting together the material and @leonardpahlke for guidance.
It looks like a good plan. I am wondering if we can use a managed Oracle queue service like https://www.oracle.com/cloud/queue/ to avoid keeping even a VM up all the time. What do you think?

@rossf7
Contributor Author

rossf7 commented Mar 5, 2025

> It looks like a good plan. I am wondering if we can use a managed Oracle queue service like https://www.oracle.com/cloud/queue/ to avoid keeping even a VM up all the time.

Thanks @AntonioDiTuri yes we should investigate using a managed service for the queue as well as the long term storage.

Ideally we would provision the whole cluster on demand. My main doubt is how we integrate with LFX Insights because we need more details there. Also if we want to have an interactive UI like a Grafana dashboard we'd need to host it somewhere.

This is why I like the idea from @leonardpahlke in #166 to have an architecture diagram for the full pipeline including LFX Insights. I think it will help with making long term decisions about the architecture.

Projects
Status: In progress