Bottlerocket-update-operator Deep Dive

Authors: Tianhao Geng (@gthao313) & John McBride (@jpmcb)

Bottlerocket-update-operator (affectionately nicknamed “Brupop”) is a dedicated Kubernetes controller suite for keeping your Bottlerocket hosts up to date with the latest releases! Now, Kubernetes operators who manage clusters with Bottlerocket nodes can be confident that they are getting the latest releases with the latest security upgrades and newest features.

Installation

Before we can install the update operator, we need to install cert-manager to the cluster:

kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.8.2/cert-manager.yaml

Cert-manager is needed to generate certificates that are used by the Brupop API server to ensure secure and trusted connections are being made from individual upgrade agents.

We can then install the Bottlerocket update operator using the recommended configuration defined in the yaml manifest:

kubectl apply -f ./bottlerocket-update-operator.yaml

This yaml file can be found at the root of the code repository or as a static file in the newest release. This will create the required namespace, custom resource definition, roles, deployments, service accounts, etc.

Label nodes

By default, the update operator is constrained to limit its pods to Bottlerocket nodes based on their labels. This way, Kubernetes operators can declaratively define the Bottlerocket nodes that exist for a cluster. These labels are not applied to nodes automatically and need to be set on each Bottlerocket node using kubectl. The agent relies on each node's updater components and schedules its pods based on their supported interface. The node indicates its updater interface version in a label called bottlerocket.aws/updater-interface-version.

The most current version of the updater interface is 2.0.0. Here is the sample command to label a single node:

kubectl label node NODE_NAME bottlerocket.aws/updater-interface-version=2.0.0

If all nodes in the cluster are running Bottlerocket, you can label all at the same time by running the following command:

kubectl label node $(kubectl get nodes -o jsonpath='{.items[*].metadata.name}') bottlerocket.aws/updater-interface-version=2.0.0
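The scheduling constraint described above amounts to a label lookup. Here is a minimal Rust sketch of that check, assuming a node's labels are available as a plain string map. The helper names `updater_interface_version` and `is_managed_node` are made up for illustration and are not Brupop APIs.

```rust
use std::collections::BTreeMap;

/// Label key named in the text above.
const UPDATER_INTERFACE_LABEL: &str = "bottlerocket.aws/updater-interface-version";

/// Returns the updater interface version a node advertises, if any.
/// (Hypothetical helper for illustration; not part of Brupop.)
fn updater_interface_version(labels: &BTreeMap<String, String>) -> Option<&str> {
    labels.get(UPDATER_INTERFACE_LABEL).map(String::as_str)
}

/// A node is eligible for an agent pod only when it carries the label
/// with the expected interface version.
fn is_managed_node(labels: &BTreeMap<String, String>, expected: &str) -> bool {
    updater_interface_version(labels) == Some(expected)
}
```

A node without the label, or with a different interface version, is simply skipped, which is why unlabeled Bottlerocket nodes never receive updates.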

Observing state

The current state of the cluster from the perspective of the update operator can be summarized by querying the Kubernetes API for BottlerocketShadow objects, the Custom Resource for the update operator. This view will inform you of the current Bottlerocket version of each node managed by the update operator, as well as the current ongoing update status of any node with an update in-progress.

Deep dive

Before we get too deep into the update operator, let’s explore what makes Bottlerocket a unique solution for container orchestrators like Kubernetes. Bottlerocket is a Linux-based open-source operating system that is purpose-built by Amazon Web Services for running containers. Bottlerocket focuses on security and maintainability, providing a reliable, consistent, and safe platform for container-based workloads. This is a reflection of what we've learned building operating systems and services at Amazon. Bottlerocket includes only the essential software required to run containers, and is built with standard open-source components. Bottlerocket-specific additions focus on reliable updates and on the API. Instead of making configuration changes manually, you can change settings with an API call, and these changes are automatically migrated through updates.

Some notable security features include:

Immutable rootfs backed by dm-verity

Bottlerocket uses dm-verity for its root filesystem image. This provides transparent integrity checking of the underlying block device using a cryptographic digest. The root filesystem is marked as read-only and cannot be directly modified by userspace processes. The kernel is configured to restart if corruption is detected. That allows the system to fail fast if the underlying block device is unexpectedly modified.

Stateless tmpfs for /etc

Bottlerocket uses tmpfs, a memory-backed filesystem, for /etc.

Direct modification of system configuration files such as /etc/resolv.conf is not supported. This makes OS updates more reliable, as it is not necessary to account for local edits that might have changed the behavior of system services. It also makes it harder for an attacker to modify these files in a way that persists across a reboot.

SELinux enabled in enforcing mode

Bottlerocket runs SELinux in enforcing mode by default and does not allow users to disable it.

SELinux is a Linux Security Module (LSM) that provides a mechanism for mandatory access control (MAC). Processes that run as root with full capabilities are still subject to the mandatory policy restrictions.

The policy in Bottlerocket has the following objectives:

  1. Prevent most components from directly modifying the API settings.
  2. Block most components from modifying the container archives saved on disk.
  3. Stop containers from directly modifying the layers for other running containers.

The policy is currently aimed at hardening the OS against persistent threats. Continued enhancements to the policy will focus on mitigating the impact of OS vulnerabilities.

No shell or interpreters installed

Bottlerocket does not have a shell and interpreted languages such as Python are not available as packages.

Shells and interpreters enable administrators to write code that combines other programs on the system in new ways. However, these properties can also be exploited by an attacker to pivot from a vulnerability that grants local code execution. The lack of a shell also serves as a forcing function to ensure that new code for the OS is written in a preferred language such as Rust or Go. These languages offer built-in protection against memory safety issues such as buffer overflows.

Automated security updates

Bottlerocket is designed for reliable security updates that can be applied through automation (such as Brupop).

This is achieved through the following mechanisms:

  • Two partition sets and an active/passive flip to swap OS images (you might know this as A/B, or seamless, OS updates)
  • Declarative API with modeled settings for runtime configuration
  • Variants to silo backwards-incompatible or breaking changes

Using partition sets and modeled settings removes the dependency on correct local state for reliable updates. There is no package manager database or shared filesystem tree that can become corrupted and make the process non-deterministic.
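To make the active/passive flip concrete, here is a toy Rust model of two partition sets. It is only a sketch: the `Slot` and `PartitionSets` types and the version strings are invented for illustration, and the real mechanism operates on disk partitions rather than strings. The `roll_back` method mirrors the idea that the previous image stays intact and can be booted again.

```rust
/// The two partition sets; exactly one is active at a time.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Slot {
    A,
    B,
}

impl Slot {
    fn other(self) -> Slot {
        match self {
            Slot::A => Slot::B,
            Slot::B => Slot::A,
        }
    }
}

/// Simplified model of which OS image each partition set holds.
#[derive(Debug)]
struct PartitionSets {
    active: Slot,
    version_a: String,
    version_b: String,
}

impl PartitionSets {
    /// Version held by the currently active set.
    fn active_version(&self) -> &str {
        match self.active {
            Slot::A => &self.version_a,
            Slot::B => &self.version_b,
        }
    }

    /// Write the new image to the passive set, then flip the active marker.
    /// The previously active image is left untouched.
    fn apply_update(&mut self, new_version: &str) {
        let passive = self.active.other();
        match passive {
            Slot::A => self.version_a = new_version.to_string(),
            Slot::B => self.version_b = new_version.to_string(),
        }
        self.active = passive;
    }

    /// Flip back to the untouched previous image, e.g. after a failed boot.
    fn roll_back(&mut self) {
        self.active = self.active.other();
    }
}
```

Because an update only ever writes to the passive set, a failed or interrupted update cannot damage the image the host is currently running.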

Our philosophy is that the right time for an unexpected major version update to the kernel or orchestrator agent is "never." For this reason, any major updates to these parts are included in a new variant of Bottlerocket. Variant crossgrading is not currently supported.

A glance at the Bottlerocket API system

Bottlerocket is intended to be an API-first operating system: direct user interaction with Bottlerocket is usually through the API. The API server handles requests, while the remaining components make sure the system is up to date and that requests are persisted and applied correctly. Unlike more traditional operating systems, which rely on users modifying configuration files, Bottlerocket uses the API system as the bridge between the user and the underlying system. It aims to simplify common configuration, improve reliability, and reduce the need for the user to log in and debug.

The data store is a key/value store that serves as the central storage location for the API system and tools using the API.

The update API

Users can access the API through the apiclient binary available in Bottlerocket which may be accessed through a control-channel like SSM or the admin container. Rust code can also use the apiclient library to make requests to the Unix-domain socket of the apiserver.

An important feature that Brupop uses here is the Update API. The Bottlerocket API includes methods for checking and starting system updates. apiclient knows how to handle those update APIs for you. Let’s look at a few common uses of the apiclient when applying updates:

To see what updates are available

apiclient update check

If an update is available, it will show up in the chosen_update field. The available_updates field will show the full list of available versions, including older versions, because Bottlerocket supports safely rolling back.

To apply the latest update

apiclient update apply

The next time you reboot, you'll start up in the new version, and system configuration will be automatically migrated.

To reboot right away

apiclient reboot

The system will automatically roll back if it's unable to boot.

Brupop uses many of these mechanisms to apply updates automatically to Bottlerocket nodes in your cluster. Now, let’s take a closer look at Brupop itself.

The Update Operator Design

The Bottlerocket update operator is a Kubernetes operator that coordinates Bottlerocket updates on hosts in a Kubernetes cluster.

Brupop aims to:

  • Control the rate at which updates are deployed: because Bottlerocket requires a reboot to update, the workloads on updating hosts are redistributed to the rest of the cluster, and we don't want to overload the cluster or exhaust its resources.
  • Safely abort any work being performed on nodes before they are updated: before any node is updated, the system tells Kubernetes not to schedule additional pods there. Then we safely remove existing pods using an operation called a "drain".
  • Prevent dangerous updates from spreading across all hosts by catching errors: if something goes wrong during an update, Brupop backs off instead of continuing, which avoids potentially bricking the entire cluster.

Keep in mind that the end goal is to safely use the Bottlerocket API on each of our hosts to perform updates, just as if we were using the apiclient through the admin container on an individual host.

System overview and architecture

Brupop consists of three primary components: the controller, the agent, and the apiserver.

The controller orchestrates updates across your cluster, while the agent is responsible for periodically querying for Bottlerocket updates, draining the node, and performing the update when asked by the controller. The agent performs all cluster object mutation operations via the Brupop API Server, which performs additional authorization using the Kubernetes TokenReview API and certificates generated by Cert-Manager — ensuring that any request associated with a node is being made by the agent pod running on that node.

When installed, the Bottlerocket update operator starts a controller deployment on one node, an agent daemon set on every Bottlerocket node, and an Update Operator API Server deployment. All components together collaborate to drive each Bottlerocket node through an update when updates are available.

K8s controllers, in a nutshell

To understand Brupop’s design, one must understand the Kubernetes controller pattern (https://kubernetes.io/docs/concepts/architecture/controller/). State in Kubernetes is managed by a “controller” which executes a control loop. A control loop has a target desired state and a measured current state. Think of a thermostat, with a set temperature and a current temperature: you change the temperature you desire on your thermostat (the spec), and the HVAC system works in a loop to bring your house's temperature (the status) to the one you configured. The controller performs the actions necessary to drive the current state towards the desired state.

In Kubernetes, every abstraction is stored in the backend as an object which has a spec representing the desired state of the object, as well as a status representing the current state. Controllers use these objects to perform the features expected of the cluster.
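The thermostat analogy can be written down as a tiny control loop. The Rust sketch below is illustrative only. The `Spec`, `Status`, and `Action` types are invented for the analogy and have nothing to do with Kubernetes client libraries, and each pass is assumed to move the temperature by exactly one degree.

```rust
/// What the control loop decides to do on one pass.
#[derive(Debug, PartialEq)]
enum Action {
    Heat,
    Cool,
    Nothing,
}

/// Desired state, set by the user.
struct Spec {
    target_temp: i32,
}

/// Observed state, reported by the system.
struct Status {
    current_temp: i32,
}

/// One pass of the loop: compare spec with status and pick an action.
fn reconcile(spec: &Spec, status: &Status) -> Action {
    if status.current_temp < spec.target_temp {
        Action::Heat
    } else if status.current_temp > spec.target_temp {
        Action::Cool
    } else {
        Action::Nothing
    }
}

/// Run the loop until the observed state matches the desired state.
/// Each heating or cooling step moves the temperature by one degree.
/// Returns the number of passes that changed something.
fn run_until_converged(spec: &Spec, status: &mut Status) -> u32 {
    let mut passes = 0;
    loop {
        match reconcile(spec, status) {
            Action::Heat => status.current_temp += 1,
            Action::Cool => status.current_temp -= 1,
            Action::Nothing => return passes,
        }
        passes += 1;
    }
}
```

The important property is that `reconcile` never remembers what it did last time. It only looks at spec and status, so the loop can be restarted at any point and still converge.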

The Brupop Custom Resource

A resource is an endpoint in the Kubernetes API that stores a collection of API objects of a certain kind. A custom resource is an extension of the Kubernetes API that is not necessarily available in a default Kubernetes installation. Kubernetes provides users with a mechanism for defining custom objects to be managed by the API.

In Brupop, the update status of each node running Bottlerocket is tracked in the “shadow” custom resource, which we call “BottlerocketShadow” or “brs” for short.

Bottlerocket shadow

The BottlerocketShadow maintains status information for each node.

The spec contains information about the desired state for the node. In particular, the spec contains information about the desired Bottlerocket version and desired state in the Brupop state machine.

The status contains information about the desired latest version for the node, as well as the current state machine state.

pub enum BottlerocketShadowState {
    /// Nodes in this state are waiting for new updates to become available. 
    /// This is both the starting, terminal and recovery state in the update process.
    Idle,
    /// Nodes in this state have staged a new update image, have installed 
    /// the new image, and have updated the partition table
    /// to mark it as the new active image.
    StagedAndPerformedUpdate,
    /// Nodes in this state have used the kubernetes cordon and drain APIs to remove
    /// running pods, have un-cordoned the node to allow work to be scheduled, and
    /// have rebooted after performing an update.
    RebootedIntoUpdate,
    /// Nodes in this state are monitoring to ensure that the node seems healthy
    /// before marking the update as complete.
    MonitoringUpdate,
    /// Nodes in this state have crashed due to Bottlerocket Update API call failure.
    ErrorReset,
}
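Read top to bottom, the states above suggest a cycle that starts and ends at Idle. The following Rust sketch spells out that cycle. The transition order is inferred from the enum's ordering and comments, and routing every failure to ErrorReset is an assumption of this sketch, so treat it as illustrative rather than as Brupop's exact logic.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum BottlerocketShadowState {
    Idle,
    StagedAndPerformedUpdate,
    RebootedIntoUpdate,
    MonitoringUpdate,
    ErrorReset,
}

impl BottlerocketShadowState {
    /// Next state when the current step succeeds (assumed ordering).
    fn on_success(self) -> Self {
        match self {
            Self::Idle => Self::StagedAndPerformedUpdate,
            Self::StagedAndPerformedUpdate => Self::RebootedIntoUpdate,
            Self::RebootedIntoUpdate => Self::MonitoringUpdate,
            Self::MonitoringUpdate => Self::Idle,
            // Idle is described as the recovery state, so recovery lands there.
            Self::ErrorReset => Self::Idle,
        }
    }

    /// Next state when the current step fails.
    /// Assumption for this sketch: any failure moves the node to ErrorReset.
    fn on_failure(self) -> Self {
        Self::ErrorReset
    }
}
```

Under these assumptions a healthy update takes four successful steps to return to Idle, and a failed step takes two (ErrorReset, then Idle).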

The controller component

The controller component is a control loop responsible for managing the state of the entire cluster. The controller holds the state-transition logic of our state machine. For each BottlerocketShadow, the controller reads the state in status to determine where the node currently is, and then sets the state in spec to the next state to drive the update forward. The default behavior is to update one node at a time. Users may set the MAX_CONCURRENT_UPDATE environment variable in the controller’s container to allow multiple nodes to update at the same time.
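One way to picture the concurrency limit is as a budget of update slots. The Rust sketch below picks which idle nodes may begin updating given that budget. The `Node` and `NodeState` types and the `nodes_to_start` function are hypothetical simplifications for this post, and the real controller's selection logic may differ.

```rust
/// Simplified view of a node's update state for this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum NodeState {
    Idle,
    Updating,
}

/// Hypothetical summary of what the controller knows about a node.
struct Node {
    name: String,
    state: NodeState,
    update_available: bool,
}

/// Pick the nodes that may start updating without exceeding the limit.
/// `max_concurrent` plays the role of MAX_CONCURRENT_UPDATE.
fn nodes_to_start(nodes: &[Node], max_concurrent: usize) -> Vec<&str> {
    // Nodes already mid-update consume slots.
    let in_progress = nodes
        .iter()
        .filter(|n| n.state == NodeState::Updating)
        .count();
    let free_slots = max_concurrent.saturating_sub(in_progress);

    nodes
        .iter()
        .filter(|n| n.state == NodeState::Idle && n.update_available)
        .take(free_slots)
        .map(|n| n.name.as_str())
        .collect()
}
```

With a limit of 1 and one node already updating, the function returns nothing, which matches the default one-node-at-a-time behavior.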

The agent component

The agent component is another control loop, responsible for managing the state of an individual node. The agent gathers system information using Bottlerocket’s local API, which it reaches through a Unix socket mounted into the agent container. The agent updates the BottlerocketShadow status associated with its node to include this gathered information, as well as the current update state machine state. The agent then uses the BottlerocketShadow’s spec to determine whether it needs to make a state transition in the state machine, and executes it.

When an update occurs, the agent cordons the node to prevent new pods from being scheduled to it, and then drains it of any running pods. After the update reboot, the node is uncordoned.

All operations which write data to the Kubernetes API are routed through the apiserver. This includes updates to the BottlerocketShadow status and commands to drain and cordon the node, or uncordon it later.
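The ordering and routing described above can be summarized in a few lines of Rust. This is only a sketch: the `Step` and `Route` names are invented here, and sending the reboot to the node's local Bottlerocket API is inferred from the earlier description of the update API rather than taken from the agent's code.

```rust
/// Steps the agent performs around an update reboot, in order.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Step {
    Cordon,
    Drain,
    Reboot,
    Uncordon,
}

/// Where a step's request is sent.
#[derive(Debug, PartialEq, Eq)]
enum Route {
    /// Kubernetes writes go through the Brupop apiserver.
    BrupopApiserver,
    /// Host operations go to the node's local Bottlerocket API socket.
    LocalBottlerocketApi,
}

/// Cordon and drain happen before the reboot; uncordon happens after it.
const UPDATE_REBOOT_SEQUENCE: [Step; 4] =
    [Step::Cordon, Step::Drain, Step::Reboot, Step::Uncordon];

/// Decide which API a step is routed to.
fn route_for(step: Step) -> Route {
    match step {
        Step::Cordon | Step::Drain | Step::Uncordon => Route::BrupopApiserver,
        Step::Reboot => Route::LocalBottlerocketApi,
    }
}
```

Keeping the routing decision in one place makes it easy to see that the agent itself never writes to the Kubernetes API directly.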

The API server component

The apiserver component performs all write requests to the Kubernetes API on behalf of the Brupop agent. This is because Kubernetes uses role-based access control (RBAC). If RBAC were used to give the agents direct access to modify their own Shadow objects, agents would have the ability to write to the status of all Shadow objects. To solve this issue, we chose to channel BottlerocketShadow writes through a privileged web API. The apiserver checks request headers for a special token which is mounted into the agent pods by Kubernetes, then uses the Kubernetes TokenReview API to assert that the requesting agent is in fact associated with the target BottlerocketShadow. Further, we use certificates generated by cert-manager to ensure calls to the API come from trusted sources.
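The core of that check is a comparison between the node behind the token and the node behind the target object. Here is a minimal Rust sketch of the decision, assuming the token has already been reviewed and resolved to a node name. `ReviewedToken`, its `node_name` field, and `authorize_write` are hypothetical names for this post. A real TokenReview response reports user information, and mapping it to a node takes extra lookups that are omitted here.

```rust
/// The parts of a reviewed token this sketch cares about.
struct ReviewedToken {
    /// Whether Kubernetes accepted the token.
    authenticated: bool,
    /// Node the requesting agent pod runs on (simplifying assumption:
    /// already resolved from the token's identity).
    node_name: String,
}

#[derive(Debug, PartialEq)]
enum AuthError {
    Unauthenticated,
    WrongNode {
        token_node: String,
        target_node: String,
    },
}

/// Allow a write only when the caller's node owns the target shadow object.
fn authorize_write(token: &ReviewedToken, shadow_node: &str) -> Result<(), AuthError> {
    if !token.authenticated {
        return Err(AuthError::Unauthenticated);
    }
    if token.node_name != shadow_node {
        return Err(AuthError::WrongNode {
            token_node: token.node_name.clone(),
            target_node: shadow_node.to_string(),
        });
    }
    Ok(())
}
```

This per-object check is what RBAC alone could not express in the scenario above: an agent may write to its own node's shadow object and to no other.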

Metrics endpoint

Brupop exposes Prometheus-style service metrics via web servers hosted on the Brupop API Server and Controller components. These are exposed via HTTP at the standard /metrics endpoint. Currently, the endpoints surface request metrics from the API server, a metric representing the current Bottlerocket version, and a metric representing the current state of each node managed by Brupop. Annotations are available in the default yaml manifest which enable metrics to be automatically scraped into Prometheus-like metric stores available on the cluster.
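For readers who have not seen the Prometheus text exposition format, the Rust sketch below renders a per-node state gauge the way a /metrics endpoint might serve it. The metric name `example_node_state` and its label names are placeholders chosen for this example, not the names Brupop actually exports.

```rust
/// Render one gauge sample per (node, state) pair in the Prometheus
/// text exposition format. Metric and label names are placeholders.
fn render_state_metric(metric: &str, help: &str, samples: &[(&str, &str)]) -> String {
    let mut out = String::new();
    // Metadata lines describing the metric.
    out.push_str(&format!("# HELP {} {}\n", metric, help));
    out.push_str(&format!("# TYPE {} gauge\n", metric));
    // One sample line per node; `{{` and `}}` emit literal braces.
    for (node, state) in samples {
        out.push_str(&format!(
            "{}{{node=\"{}\",state=\"{}\"}} 1\n",
            metric, node, state
        ));
    }
    out
}
```

Calling `render_state_metric("example_node_state", "Update state of each node", &[("node-a", "Idle")])` yields two comment lines followed by `example_node_state{node="node-a",state="Idle"} 1`.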

Monitoring cluster history and metrics with Prometheus

A sample configuration is provided which demonstrates a Prometheus deployment into a cluster that is configured to gather metrics data from the update operator. Once Prometheus is running in the cluster, you can use the Prometheus UI to visualize the cluster's history.

Using port-forward via kubectl:

kubectl port-forward pods/PROMETHEUS-POD-NAME 9090:9090

You can then point a web browser to localhost:9090 to visualize the update operator metrics and see updates in real time.

Contact us

Please feel free to contact us via GitHub if you have any questions or feature requests! We are always happy to hear from our community. You can also join the Bottlerocket community meeting if you’d like to hear from us directly.