diff --git a/arch.rst b/arch.rst
index d961842..b79c2f3 100644
--- a/arch.rst
+++ b/arch.rst
@@ -42,8 +42,8 @@ Platform-as-a-Service (PaaS).
 
 Aether supports this combination by implementing both the RAN and the
 user plane of the Mobile Core on-prem, as cloud-native workloads
-co-located on the Aether cluster. This is often referred to as local
-breakout because it enables direct communication between mobile
+co-located on the Aether cluster. This is often referred to as *local
+breakout* because it enables direct communication between mobile
 devices and edge applications without data traffic leaving the
 enterprise. This scenario is depicted in :numref:`Figure %s
 `, which does not name the edge applications, but
@@ -62,7 +62,7 @@ example.
 
 The approach includes both edge (on-prem) and centralized (off-prem)
 components. This is true for edge apps, which often have a centralized
-counterpart running in a commodity cloud. It is also true for the
+counterpart running in a commodity cloud. It is also true for the 5G
 Mobile Core, where the on-prem User Plane (UP) is paired with a
 centralized Control Plane (CP). The central cloud shown in this
 figure might be private (i.e., operated by the enterprise), public
 (i.e.,
@@ -72,9 +72,9 @@ cloud).
 
 Also shown in :numref:`Figure %s ` is a centralized
 *Control and Management Platform*. This represents all the
 functionality needed to offer Aether as a managed service, with
 system administrators using a portal exported by this platform to
 operate the
-underlying infrastructure and services. The rest of this book is about
-everything that goes into implementing that *Control and Management
-Platform*.
+underlying infrastructure and services within their enterprise. The
+rest of this book is about everything that goes into implementing that
+*Control and Management Platform*.
 
 2.1 Edge Cloud
 --------------
 
@@ -112,8 +112,8 @@
 the SD-Fabric), are deployed as a set of microservices, but details
 about the functionality implemented by these containers are otherwise
 not critical to this discussion. For our purposes, they are
 representative of any cloud native workload. (The interested reader is
-referred to our 5G and SDN books for more information about the
-internal working of SD-RAN, SD-Core, and SD-Fabric.)
+referred to our companion 5G and SDN books for more information about
+the internal working of SD-RAN, SD-Core, and SD-Fabric.)
 
 .. _reading_5g:
 .. admonition:: Further Reading
 
@@ -151,8 +151,8 @@ Platform (AMP).
 
 Each SD-Core CP controls one or more SD-Core UPs, as specified by
 3GPP, the standards organization responsible for 5G. Exactly how CP
 instances (running centrally) are paired with UP instances (running at
-the edges) is a configuration-time decision, and depends on the degree
-of isolation the enterprise sites require. AMP is responsible for
+the edges) is a runtime decision, and depends on the degree of
+isolation the enterprise sites require. AMP is responsible for
 managing all the centralized and edge subsystems (as introduced in the
 next section).
 
@@ -173,12 +173,12 @@
 we started with in :numref:`Figure %s ` of Chapter 1).\ [#]_
 This is because, while each ACE site usually corresponds to a physical
 cluster built out of bare-metal components, each of the SD-Core CP
 subsystems shown in :numref:`Figure %s ` is actually
-deployed as a logical Kubernetes cluster on a commodity cloud. The
+deployed in a logical Kubernetes cluster on a commodity cloud. The
 same is true for AMP. Aether’s centralized components are able to run
 in Google Cloud Platform, Microsoft Azure, and Amazon’s AWS. They
 also run as an emulated cluster implemented by a system like
 KIND—Kubernetes in Docker—making it possible for developers to run
-these components on a laptop.
+these components on their laptop.
 
 .. [#] Confusingly, Kubernetes adopts generic terminology, such as
    “cluster” and “service”, and gives it very specific meaning. In
@@ -190,8 +190,8 @@ these components on a laptop.
    potentially thousands of such logical clusters. And as we'll
    see in a later chapter, even an ACE edge site sometimes hosts
    more than one Kubernetes cluster (e.g., one running production
-   services and one used for development and testing of new
-   services).
+   services and one used for trial deployments of new services).
 
 2.3 Control and Management
 --------------------------
 
@@ -304,7 +303,7 @@ both physical and virtual resources.
 
 2.3.2 Lifecycle Management
 ~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-Lifecycle Management is the process of integrating fixed, extended,
+Lifecycle Management is the process of integrating debugged, extended,
 and refactored components (often microservices) into a set of
 artifacts (e.g., Docker containers and Helm charts), and subsequently
 deploying those artifacts to the operational cloud. It includes a
@@ -368,7 +367,7 @@
 the cloud offers to end users. Thus, we can generalize the figure so
 Runtime Control mediates access to any of the underlying microservices
 (or collections of microservices) the cloud designer wishes to make
 publicly accessible, including the rest of AMP! In effect, Runtime
-Control implements an abstraction layer, codified with programmatic
+Control implements an abstraction layer, codified with a programmatic
 API.
 
 Given this mediation role, Runtime Control provides mechanisms to
@@ -434,7 +433,7 @@
 operators a way to both read (monitor) and write (control) various
 parameters of a running system. Connecting those two subsystems is how
 we build closed loop control.
 
-A third example is even more ambiguous. Lifecycle management usually
+A third example is even more nebulous. Lifecycle management usually
 takes responsibility for *configuring* each component, while runtime
 control takes responsibility for *controlling* each component.
 
 Where you draw the line between configuration and control is somewhat
diff --git a/intro.rst b/intro.rst
index ce47136..d786090 100644
--- a/intro.rst
+++ b/intro.rst
@@ -72,6 +72,12 @@
 perspective on the problem. We return to the confluence of enterprise,
 cloud, access technologies later in this chapter, but we start by
 addressing the terminology challenge.
 
+.. _reading_aether:
+.. admonition:: Further Reading
+
+   `Aether: 5G-Connected Edge Cloud
+   `__.
+
 1.1 Terminology
 ---------------
 
@@ -107,7 +113,7 @@ terminology.
 * **OSS/BSS:** Another Telco acronym (Operations Support System,
   Business Support System), referring to the subsystem that
   implements both operational logic (OSS) and business logic
-  (BSS). Usually the top-most component in the overall O&M
+  (BSS). It is usually the top-most component in the overall O&M
   hierarchy.
 
 * **EMS:** Yet another Telco acronym (Element Management System),
 
@@ -164,23 +170,23 @@ terminology.
 * **Continuous Integration / Continuous Deployment (CI/CD):** An
   approach to Lifecycle Management in which the path from
   development (producing new functionality) to testing, integration,
-  and ultimately deployment is an automated pipeline. Typically
-  implies continuously making small incremental changes rather than
-  performing large disruptive upgrades.
+  and ultimately deployment is an automated pipeline. CI/CD
+  typically implies continuously making small incremental changes
+  rather than performing large disruptive upgrades.
 
 * **DevOps:** An engineering discipline (usually implied by CI/CD)
   that balances feature velocity against system stability. It is a
   practice typically associated with container-based (also known as
-  *cloud native*) systems, and typified by *Site Reliability
+  *cloud native*) systems, as typified by *Site Reliability
   Engineering (SRE)* practiced by cloud providers like Google.
 * **In-Service Software Upgrade (ISSU):** A requirement that a
   component continue running during the deployment of an upgrade,
   with minimal disruption to the service delivered to
-  end-users. Generally implies the ability to incrementally roll-out
-  (and roll-back) an upgrade, but is specifically a requirement on
-  individual components (as opposed to the underlying platform used
-  to manage a set of components).
+  end-users. ISSU generally implies the ability to incrementally
+  roll-out (and roll-back) an upgrade, but is specifically a
+  requirement on individual components (as opposed to the underlying
+  platform used to manage a set of components).
 
 * **Monitoring & Logging:** Collecting data from system components to
   aid in management decisions. This includes diagnosing faults, tuning
 
@@ -188,10 +194,10 @@ terminology.
   and provisioning additional capacity.
 
 * **Analytics:** A program (often using statistical models) that
-  produces additional insights (value) from raw data. Can be used to
-  close a control loop (i.e., auto-reconfigure a system based on
+  produces additional insights (value) from raw data. It can be used
+  to close a control loop (i.e., auto-reconfigure a system based on
   these insights), but could also be targeted at a human operator
-  (that subsequently takes some action).
+  that subsequently takes some action.
 
 Another way to talk about operations is in terms of stages, leading to
 a characterization that is common for traditional network devices:
 
@@ -301,9 +307,9 @@ manageable:
 
   majority of configuration involves initializing software parameters,
   which is more readily automated.
-* Cloud native implies a set best-practices for addressing many of the
-  FCAPS requirements, especially as they relate to availability and
-  performance, both of which are achieved through horizontal
+* Cloud native implies a set of best-practices for addressing many of
+  the FCAPS requirements, especially as they relate to availability
+  and performance, both of which are achieved through horizontal
   scaling. Secure communication is also typically built into cloud RPC
   mechanisms.
 
@@ -319,17 +325,19 @@
 monitoring data in a uniform way, and (d) continually integrating and
 deploying individual microservices as they evolve over time.
 
 Finally, because a cloud is infinitely programmable, the system being
-managed has the potential to change substantially over time.\ [#]_ This
-means that the cloud management system must itself be easily extended
-to support new features (as well as the refactoring of existing
-features). This is accomplished in part by implementing the cloud
-management system as a cloud service, but it also points to taking
-advantage of declarative specifications of how all the disaggregated
-pieces fit together. These specifications can then be used to generate
-elements of the management system, rather than having to manually
-recode them. This is a subtle issue we will return to in later
-chapters, but ultimately, we want to be able to auto-configure the
-subsystem responsible for auto-configuring the rest of the system.
+managed has the potential to change substantially over time.\ [#]_
+This means that the cloud management system must itself be easily
+extended to support new features (as well as the refactoring of
+existing features). This is accomplished in part by implementing the
+cloud management system as a cloud service, which means we will see a
+fair number of recursive dependencies throughout this book. It also
+points to taking advantage of declarative specifications of how all
+the disaggregated pieces fit together. These specifications can then
+be used to generate elements of the management system, rather than
+having to manually recode them. This is a subtle issue we will return
+to in later chapters, but ultimately, we want to be able to
+auto-configure the subsystem responsible for auto-configuring the rest
+of the system.
 
 .. [#] For example, compare the two services Amazon offered ten
    years ago (EC2 and S3) with the well over 100 services available on
 
@@ -371,13 +379,19 @@ identifies the technology we assume.
 ~~~~~~~~~~~~~~~~~~~~~~~
 
 The assumed hardware building blocks are straightforward. We start
-with bare-metal servers and switches, built using merchant
-silicon. These might, for example, be ARM or x86 processor chips and
+with bare-metal servers and switches, built using merchant silicon
+chips. These might, for example, be ARM or x86 processor chips and
 Tomahawk or Tofino switching chips, respectively. The bare-metal boxes
 also include a bootstrap mechanism (e.g., BIOS for servers and ONIE
 for switches), and a remote device management interface (e.g., IPMI or
 Redfish).
 
+.. _reading_redfish:
+.. admonition:: Further Reading
+
+   Distributed Management Task Force (DMTF) `Redfish
+   `__.
+
 A physical cloud cluster is then constructed with the hardware
 building blocks arranged as shown in :numref:`Figure %s
 `: one or more racks of servers connected by a leaf-spine
 switching
@@ -397,11 +411,11 @@ that software running on the servers controls the switches.
 
 software components, which we describe next. Collectively, all the
 hardware and software components shown in the figure form the
 *platform*. Where we draw the line between what's *in the platform*
-and what runs *on top of the platform* will become clear in later
-chapters, but the summary is that different mechanisms will be
-responsible for (a) bringing up the platform and prepping it to host
-workloads, and (b) managing the various workloads that need to be
-deployed on that platform.
+and what runs *on top of the platform*, and why it is important, will
+become clear in later chapters, but the summary is that different
+mechanisms will be responsible for (a) bringing up the platform and
+prepping it to host workloads, and (b) managing the various workloads
+that need to be deployed on that platform.
 
 1.3.2 Server Virtualization
 
@@ -415,7 +429,7 @@ resources, all running on the commodity processors in the cluster:
 
 2. Kubernetes instantiates and interconnects containers.
 
 3. Helm charts specify how collections of related containers are
-   interconnected.
+   interconnected to build applications.
 
 These are all well known and ubiquitous, and so we only summarize them
 here. Links to related information for anyone who is not familiar
diff --git a/preface.rst b/preface.rst
index b6dd7ba..02955ff 100644
--- a/preface.rst
+++ b/preface.rst
@@ -11,21 +11,21 @@ job of it.
 
 The answer, we believe, is that the cloud is becoming ubiquitous in
 another way, as it moves from hundreds of datacenters to tens of
 thousands of enterprises. And while it is clear that the commodity
-cloud providers will happily manage those edge clusters as a logical
+cloud providers are eager to manage those edge clusters as a logical
 extension of their datacenters, they do not have a lock on the
 know-how for making that happen.
 
 This book lays out a roadmap that a small team of engineers followed
-over a course of a year to stand-up and operationalize a hybrid cloud
-spanning a dozen enterprises, and hosting a non-trivial cloud native
-service (5G connectivity in our case, but that’s just an example). The
-team was able to do this by leveraging 20+ open source components,
-but selecting those components is just a start. There were dozens of
-technical decisions to make along the way, and a few thousand lines of
-configuration code to write. We believe this is a repeatable exercise,
-which we report in this book. (And the code for those configuration
-files is open source, for those that want to pursue the topic in more
-detail.)
+over the course of a year to stand-up and operationalize a hybrid
+cloud that spans a dozen enterprises, and hosts a non-trivial cloud
+native service (5G connectivity in our case, but that’s just an
+example). The team was able to do this by leveraging 20+ open source
+components, but selecting those components is just a start. There were
+dozens of technical decisions to make along the way, and a few
+thousand lines of configuration code to write. We believe this is a
+repeatable exercise, which we report in this book. (And the code for
+those configuration files is open source, for those that want to
+pursue the topic in more detail.)
 
 Our roadmap may not be the right one for all circumstances, but it
 does shine a light on the fundamental challenges and trade-offs
 
@@ -41,8 +41,8 @@
 How to operationalize a computing system is a question that’s as old
 as the field of *Operating Systems*. Operationalizing a cloud is just
 today’s version of that fundamental problem, which has become all the
 more interesting as we move up the stack, from managing *devices* to
-managing *services*. The fact that this topic is both timely and
-foundational are among the reasons it is worth studying.
+managing *services*. That this topic is both timely and foundational
+is among the reasons it is worth studying.
 
 Guided Tour of Open Source
 
@@ -80,11 +80,11 @@
 Sunay for his influence on its overall design. Suchitra Vemuri's
 insights into testing and quality assurance were also invaluable.
 
 This book is still very much a work-in-progress, and we will happily
-acknowledge anyone that provides feedback. Please send us your
+acknowledge everyone who provides feedback. Please send us your
 comments using the `Issues Link `__. Also see the
 `Wiki `__ for the TODO
-list we're working on.
+list we're currently working on.
 
 | Larry Peterson, Scott Baker, Andy Bavier, Zack Williams, and Bruce Davie
 | October 2021
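The arch.rst changes above describe connecting the Monitoring subsystem to Runtime Control to "build closed loop control". That pattern can be sketched as follows; this is a minimal illustration under stated assumptions, not Aether's actual API, and every class, method, and parameter name here is hypothetical.

```python
# Sketch of closed-loop control: Monitoring collects raw data, an
# Analytics routine derives an insight from it, and Runtime Control
# writes a parameter back into the running system. All names are
# illustrative, not part of any real Aether (AMP) interface.

from dataclasses import dataclass, field
from statistics import mean


@dataclass
class Monitoring:
    """Collects raw data points (e.g., utilization) from a component."""
    samples: list = field(default_factory=list)

    def record(self, value: float) -> None:
        self.samples.append(value)


@dataclass
class RuntimeControl:
    """Mediates writes to a running component's control parameters."""
    params: dict = field(default_factory=dict)

    def set_param(self, key: str, value) -> None:
        self.params[key] = value


def analytics_step(monitoring: Monitoring, control: RuntimeControl,
                   threshold: float = 0.8) -> None:
    """Close the loop: turn raw data into an insight, then act on it."""
    utilization = mean(monitoring.samples)
    if utilization > threshold:
        # Insight: the component is overloaded, so scale out by one.
        control.set_param("replicas",
                          control.params.get("replicas", 1) + 1)


mon = Monitoring()
ctl = RuntimeControl()
ctl.set_param("replicas", 2)
for load in (0.7, 0.9, 0.95):   # observed per-interval utilization
    mon.record(load)
analytics_step(mon, ctl)        # mean load 0.85 exceeds 0.8: scale out
print(ctl.params["replicas"])   # -> 3
```

The same insight could instead be surfaced to a human operator, as the Analytics definition in intro.rst notes; the loop is "closed" only when the write-back is automated.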