diff --git a/README.md b/README.md index 0323636..7955ee4 100644 --- a/README.md +++ b/README.md @@ -40,4 +40,4 @@ mkdocs serve Then just point the browser to [http://127.0.0.1:8000](http://127.0.0.1:8000). -[docs]: https://docs.cocos.ai +[docs]: https://docs.cocos.ultraviolet.rs diff --git a/docs/agent.md b/docs/agent.md index f0cd92d..08e6005 100644 --- a/docs/agent.md +++ b/docs/agent.md @@ -1,17 +1,19 @@ # Agent -The agent is responsible for the life cycle of the computation, i.e., running the computation and sending events about the status of the computation within the TEE. The agent is found inside the VM (TEE), and each computation within the TEE has its own agent. When a computation run request is sent from from the manager, manager creates a VM where the agent is found and sends the computation manifest to the agent. +The agent is responsible for the life cycle of the computation, i.e., running the computation and sending events about the status of the computation within the TEE. The agent is found inside the VM (TEE), and each computation within the TEE has its own agent. When a computation run request is sent from the manager, manager creates a VM where the agent is found and sends the computation manifest to the agent. The picture below shows where the Agent runs in the Cocos system, helping us better understand its role. ![Agent](./img/agent.png){ align=center } ## StateMachine + - Orchestrates the overall flow of the computation. - Transitions between states based on received events. - Defines valid state transitions and associated functions. ### States + - `idle`: Initial state, waiting for the computation to start. - `receivingManifest`: Receives the initial computation manifest. - `receivingAlgorithm`: Receives the algorithm for the computation. @@ -21,6 +23,7 @@ The picture below shows where the Agent runs in the Cocos system, helping us bet - `complete`: All results have been consumed, computation lifecycle ends. 
### Events + - `start`: Triggers the computation startup process. - `manifestReceived`: Indicates computation manifest has been received. - `algorithmReceived`: Indicates the algorithm has been received. @@ -30,7 +33,7 @@ The picture below shows where the Agent runs in the Cocos system, helping us bet ## Agent Events -As the computation in the agent undergoes different operations, it sends events to the manager so that the user can monitor the computation from either the UI or other client. Events sent to the manager based on the agent state as defined by the statemachine. +As the computation in the agent undergoes different operations, it sends events to the manager so that the user can monitor the computation from either the UI or other client. Events sent to the manager are based on the agent state as defined by the statemachine. ## Vsock Connection Between Agent & Manager diff --git a/docs/architecture.md b/docs/architecture.md index 0363d30..c792fff 100644 --- a/docs/architecture.md +++ b/docs/architecture.md @@ -3,15 +3,15 @@ CocosAI system is running on the host, and it's main goal is to enable: - Programatic creation of enclaves (TEEs) -- Gest OS and system enviroment withn the enclave VMs +- Guest OS and system environment within the enclave VMs - Monitoring of enclaves - In-enclave SW manager agent -- Ectyped data trensfer into the enclave and computation execution +- Encrypted data transfer into the enclave and computation execution - Result retrieval via encrypted channel to an authorized party - Providing of HW measurement and attestation report - Enablement of vTPM and [DICE](https://trustedcomputinggroup.org/accurately-attest-the-integrity-of-devices-with-dice/) integrity checks (root chain of trust) in order to ensure secure boot of the TEEs -These features are implemented by several independed components of CocosAI system: +These features are implemented by several independent components of CocosAI system: 1. Manager 2.
Agent @@ -21,7 +21,7 @@ These features are implemented by several independed components of CocosAI syste ![Cocos Arch](./img/arch.png){ align=center } - >**N.B.** CocosAI open-source project does not provide Computation Management service. It is usually a cloud component, used to define a Computation (i.e. define computation metadata, like participant list, algorithm and data providers, result recipients, etc...). Ultraviolet provide commercial product Prism, a multi-party computation platform, that implements multi-tenant and scalable Computation Management service, running in the cloud or on premise, and capable to connect and control CocosAI system running on the TEE host. + >**N.B.** CocosAI open-source project does not provide Computation Management service. It is usually a cloud component, used to define a Computation (i.e. define computation metadata, like participants list, algorithm and data providers, result recipients, etc...). Ultraviolet provides commercial product Prism, a multi-party computation platform, that implements multi-tenant and scalable Computation Management service, running in the cloud or on premise, and capable to connect and control CocosAI system running on the TEE host. ## Manager @@ -29,13 +29,13 @@ Manager is a gRPC client that listens to requests sent through gRPC and sends th ## Agent -Agent defines firmware which goes into the TEE and is used to control and monitor computation within TEE and enable secure and encrypted communication with outside world (in order to fetch the data and provide the result of the computation). The Agent contains a gRPC server that listens for requests from gRPC clients. Communication between the Manager and Agent is done via vsock. The Agent sends events to the Manager via vsock, which then forwards these via gRPC. Agent contains a gRPC server that exposes useful functions that can be accessed by other gRPC clients such as the CLI. 
+Agent defines firmware which goes into the TEE and is used to control and monitor computation within TEE and enable secure and encrypted communication with the outside world (in order to fetch the data and provide the result of the computation). The Agent contains a gRPC server that listens for requests from gRPC clients. Communication between the Manager and Agent is done via vsock. The Agent sends events to the Manager via vsock, which then forwards these via gRPC. Agent contains a gRPC server that exposes useful functions that can be accessed by other gRPC clients such as the CLI. ## EOS EOS, or Enclave Operating System, is ... ## CLI -CoCoS CLI is used to access the agent within the secure enclave. CLI communicates to agent using gRPC, with funcitons such as algo to provide the algorithm to be run, data to provide the data to be used in the computation, and run to start the computation. It also has functions to fetch and validate the attestation report of the enclave. +CoCoS CLI is used to access the agent within the secure enclave. CLI communicates to agent using gRPC, with functions such as algo to provide the algorithm to be run, data to provide the data to be used in the computation, and run to start the computation. It also has functions to fetch and validate the attestation report of the enclave. For more information on CLI, please refer to [CLI docs](./cli.md). diff --git a/docs/computation.md b/docs/computation.md index b2270b5..b2a3131 100644 --- a/docs/computation.md +++ b/docs/computation.md @@ -1,24 +1,26 @@ # Computation -Computation in CocosAI is any execution of a program (Algorithm) or an data set (Data), that can be one data file, or a lot of files comping from different parties. + +Computation in CocosAI is any execution of a program (Algorithm) on a data set (Data), that can be one data file, or a lot of files coming from different parties. 
Computations are multi-party, meaning that program and data providers can be different parties that do not want to expose their intellectual property to other parties participating in the computation. `Computation` is a structure that holds all the necessary information needed to execute the computation securely (list of participants, execution backend - i.e. where computation will be executed, role of each participant, cryptographic certificates, etc...). ## Computation Roles -Computation is multi-party, i.e. has multiple participants. Each of the users that participate in the computation can have one of the follwoing roles: -1. **Computation Owner** - user that created the `Computation` and that defines who will participate in it and with wich role (by inviting other users to the Computation) -2. **Algorithm Provider** - user that will provide th actual program to be executed -3. **Data Provider** - user that will provide a data on which algorithm will be executed, i.e. data which algorithm will process +Computation is multi-party, i.e. has multiple participants. Each of the users that participate in the computation can have one of the following roles: + +1. **Computation Owner** - user that created the `Computation` and that defines who will participate in it and with which role (by inviting other users to the Computation) +2. **Algorithm Provider** - user that will provide the actual program to be executed +3. **Data Provider** - user that will provide data on which the algorithm will be executed, i.e. data which algorithm will process 4. **Result Recipient** - user that will recieve result after the processing -One user can have several roles - for example, Algorithm Provider can also be a Result Recipient. +One user can have several roles - for example, an Algorithm Provider can also be a Result Recipient. ## Computation Manifest -Computation Manifest represent that Computation description and is sent upon `run` command to the Manager as a JSON. 
+Computation Manifest represents the Computation description and is sent upon `run` command to the Manager as a JSON. Manager fetches the Computation Manifest and sends it into the TEE to Agent, via vsock. The first thing that Agent does upon boot, is that it fetches the Computation Manifest and reads it. For this Manifest, Agent understands who are the participants in the computation adn with wich role, i.e. from whom it can accept the connections and what data they will send. Agent also learns from the Manifest what algorithm is used and how many datasets will be provided. This way it knows when it received all necessary files to start the execution. Finally, Agent learns from the Manifest to whom it needs to send the Result of the computation. - diff --git a/docs/index.md b/docs/index.md index 0f2daf8..05c2dc5 100644 --- a/docs/index.md +++ b/docs/index.md @@ -2,7 +2,7 @@ CocosAI (Confidential Computing System for AI) is a SW system for enabling confidential and privacy-preserving AI/ML, i.e. execution of model training and algorithm inference on confidential data sets. Privacy-preservation is considered a “holy grail” of AI. It opens many possibilities, among which is a collaborative, trustworthy AI. -CocosAI leverages Confidential Computing, a novel paradigm based on specialized HW CPU extensions for producting secure encrypted enclaves in memory (Trusted Execution Enviroments, or TEEs), thus isloalting confidential data and programs from the rest of the SW running on the hos +CocosAI leverages Confidential Computing, a novel paradigm based on specialized HW CPU extensions for producing secure encrypted enclaves in memory (Trusted Execution Environments, or TEEs), thus isolating confidential data and programs from the rest of the SW running on the host. The final product enables data scientists to train AI and ML models on confidential data that is never revealed, and can be used for Secure Multi-Party Computation (SMPC).
AI/ML on combined data sets that come from different sources will unlock huge value. @@ -13,7 +13,7 @@ The final product enables data scientists to train AI and ML models on confident CoCoS.ai is enabling the following features: - TEE enablement, deployment and monitoring -- In-enclave agent, netowrking controller and other system software +- In-enclave agent, networking controller and other system software - Encrypted asynchronous data transfer and result delivery - API for programmable platform manipulation - HW and SW supported attestation with verification tools @@ -23,4 +23,4 @@ CoCoS.ai is enabling the following features: CocosAI is published under liberal [Apache-2.0](https://github.com/ultravioletrs/cocos/blob/main/LICENSE) open-source license. ## GitHub -CcosAI can be downlaoded from its [GitHub repository](https://github.com/ultravioletrs/cocos) +CocosAI can be downloaded from its [GitHub repository](https://github.com/ultravioletrs/cocos) diff --git a/docs/manager.md b/docs/manager.md index 2b1c66a..963f693 100644 --- a/docs/manager.md +++ b/docs/manager.md @@ -5,11 +5,11 @@ Manager runs on the TEE-capable host (AMD SEV-SNP, Intel SGX or Intel TDX) and h 1. To deploy the well-prepared TEE upon the `start` command and upload the necessary configuration into it (command line arguments, TLS certificates, etc...) 2. To monitor deployed TEE and provide remot logs -Manager expsoses and API for control, based on gRPC, and is controlled by Computation Management service. Manager acts as the client of Computation Management service and connects to it upon the start via TLS-encoded gRPC connection. +Manager exposes an API for control, based on gRPC, and is controlled by Computation Management service. Manager acts as the client of Computation Management service and connects to it upon the start via TLS-encoded gRPC connection. -Computation Management service is used to to cnfigure computation metadata. 
Once a computation is created by a user and the invited users have uploaded their public certificates (used later for identification and data exchange in the enclave), a run request is sent. The Manager is responsible for creating the TEE in which computation will be ran and managing the computation lifecycle. +Computation Management service is used to configure computation metadata. Once a computation is created by a user and the invited users have uploaded their public certificates (used later for identification and data exchange in the enclave), a run request is sent. The Manager is responsible for creating the TEE in which computation will be run and managing the computation lifecycle. -Communication to between Computation Management cloud and the Manager is done via gRPC, while communication between Manager and Agent is done via [Virtio Vsock](https://wiki.qemu.org/Features/VirtioVsock). Vsock is used to send Agent events from the computation in the Agent to the Manager. The Manager then sends the events back to Computation Mangement cloud via gRPC, and these are visible to the end user. +Communication between Computation Management cloud and the Manager is done via gRPC, while communication between Manager and Agent is done via [Virtio Vsock](https://wiki.qemu.org/Features/VirtioVsock). Vsock is used to send Agent events from the computation in the Agent to the Manager. The Manager then sends the events back to Computation Management cloud via gRPC, and these are visible to the end user. The picture below shows where the Manager runs in the Cocos system, helping us better understand its role. @@ -17,9 +17,9 @@ The picture below shows where the Manager runs in the Cocos system, helping us b ## Manager <> Agent -When TEE is booted, and Agent is autmatically deployed and is used for outside communication with the enclave (via the API) and for computation orchestration (data and algorithm upload, start of the computation and retrieval of the result).
+When TEE is booted, an Agent is automatically deployed and is used for outside communication with the enclave (via the API) and for computation orchestration (data and algorithm upload, start of the computation and retrieval of the result). -Agent is a gRPC server, and CLI is a gRPC client of the Agent. The Manager sends the Computation Manifest to the Agent via vsock and the Agent runs the computation, according to the Computation Manifest, while sending evnets back to manager on the status. The Manager then sends the events it receives from agent via vsock to Computation Mangement cloud through gRPC. +Agent is a gRPC server, and CLI is a gRPC client of the Agent. The Manager sends the Computation Manifest to the Agent via vsock and the Agent runs the computation, according to the Computation Manifest, while sending events back to manager on the status. The Manager then sends the events it receives from agent via vsock to Computation Management cloud through gRPC. ## Setup and Test Manager <> Agent @@ -47,7 +47,9 @@ sudo apt install qemu-kvm Create `img` directory in `cmd/manager`. Create `tmp` directory in `cmd/manager`. #### Add V-sock + The necessary kernel modules must be loaded on the hypervisor. + ```shell sudo modprobe vhost_vsock ls -l /dev/vhost-vsock @@ -56,10 +58,9 @@ ls -l /dev/vsock # crw-rw-rw- 1 root root 10, 121 Jan 16 12:05 /dev/vsock ``` - ### Prepare Cocos HAL -Cocos HAL for Linux is framework for building custom in-enclave Linux distribution. Use the instructions in [Readme](https://github.com/ultravioletrs/cocos/blob/main/hal/linux/README.md). +Cocos HAL for Linux is a framework for building a custom in-enclave Linux distribution. Use the instructions in [Readme](https://github.com/ultravioletrs/cocos/blob/main/hal/linux/README.md). Once the image is built copy the kernel and rootfs image to `cmd/manager/img` from `buildroot/output/images/bzImage` and `buildroot/output/images/rootfs.cpio.gz` respectively.
#### Test VM Creation @@ -96,10 +97,13 @@ qemu-system-x86_64 \ -monitor pty \ -monitor unix:monitor,server,nowait ``` + Once the VM is booted press enter and on the login use username `root`. #### Build and Run Agent + Agent is started automatically in the VM. + ```sh # List running processes and use 'grep' to filter for processes containing 'agent' in their names. ps aux | grep cocos-agent @@ -138,7 +142,6 @@ MANAGER_QEMU_OVMF_VARS_FILE=/usr/share/OVMF/OVMF_VARS.fd NB: we set environment variables that we will use in the shell process where we run `manager`. - ## Deployment To start the service, execute the following shell script (note a server needs to be running see [here](../test/computations/README.md)): @@ -163,7 +166,7 @@ MANAGER_QEMU_ENABLE_SEV=false \ ./build/cocos-manager ``` -To enable [AMD SEV](https://www.amd.com/en/developer/sev.html) support, start manager like this +To enable [AMD SEV](https://www.amd.com/en/developer/sev.html) support, start manager like this ```sh MANAGER_GRPC_URL=localhost:7001 @@ -181,6 +184,7 @@ NB: To verify that the manager successfully launched the VM, you need to open th ```bash go run ./test/computations/main.go ``` + and in the second the manager by executing (with the environment variables of choice): ```bash @@ -196,6 +200,7 @@ ps aux | grep qemu-system-x86_64 ``` You should get something similar to this + ``` darko 324763 95.3 6.0 6398136 981044 ? 
Sl 16:17 0:15 /usr/bin/qemu-system-x86_64 -enable-kvm -machine q35 -cpu EPYC -smp 4,maxcpus=64 -m 4096M,slots=5,maxmem=30G -drive if=pflash,format=raw,unit=0,file=/usr/share/OVMF/OVMF_CODE.fd,readonly=on -drive if=pflash,format=raw,unit=1,file=img/OVMF_VARS.fd -device virtio-scsi-pci,id=scsi,disable-legacy=on,iommu_platform=true -drive file=img/focal-server-cloudimg-amd64.img,if=none,id=disk0,format=qcow2 -device scsi-hd,drive=disk0 -netdev user,id=vmnic,hostfwd=tcp::2222-:22,hostfwd=tcp::9301-:9031,hostfwd=tcp::7020-:7002 -device virtio-net-pci,disable-legacy=on,iommu_platform=true,netdev=vmnic,romfile= -nographic -monitor pty ``` @@ -217,7 +222,7 @@ If the `ps aux | grep qemu-system-x86_64` give you something like this darko 13913 0.0 0.0 0 0 pts/2 Z+ 20:17 0:00 [qemu-system-x86] ``` -means that the a QEMU virtual machine that is currently defunct, meaning that it is no longer running. More precisely, the defunct process in the output is also known as a ["zombie" process](https://en.wikipedia.org/wiki/Zombie_process). +means that the QEMU virtual machine is currently defunct, meaning that it is no longer running. More precisely, the defunct process in the output is also known as a ["zombie" process](https://en.wikipedia.org/wiki/Zombie_process). You can troubleshoot the VM launch procedure by running directly `qemu-system-x86_64` command. When you run `manager` with `MANAGER_LOG_LEVEL=info` env var set, it prints out the entire command used to launch a VM. The relevant part of the log might look like this @@ -244,4 +249,3 @@ pkill -f qemu-system-x86_64 The pkill command is used to kill processes by name or by pattern. The -f flag to specify that we want to kill processes that match the pattern `qemu-system-x86_64`. It sends the SIGKILL signal to all processes that are running `qemu-system-x86_64`. If this does not work, i.e.
if `ps aux | grep qemu-system-x86_64` still outputs `qemu-system-x86_64` related process(es), you can kill the unwanted process with `kill -9 `, which also sends a SIGKILL signal to the process. -