Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rust runtime and connectors: phase one #1167

Merged
merged 14 commits into from
Sep 12, 2023
Merged

Commits on Sep 1, 2023

  1. runtime protocol updates: $internal and validation network ports

    * Remove NetworkPort from Request.Validate of the runtime protocols.
    
    This is now an internal detail negotiated during validation. It's outcome is
    still captured in the built flow specs, but it's not validated by the connector
    (and isn't actually known until after validation completes).
    
    * Breaking: remove use of `Any` for internal field and switch to bytes.
    
    Any has major compatibility issues in its JSON encoding between Go and
    Rust. For one, the pbjson_types serde imlementation for Any does not
    match the protobuf spec (`typeUrl` instead of `@type`). For another,
    Go `jsonpb` decoding requires that @type dynamically resolves to a
    concrete Go type which is unmarshalled. Neat, but that's not the
    intended purpose of the internal field, which is really a pass-through
    piggyback for runtime-internal structured data.
    
    So, instead have internal be regular bytes. Also use `$internal` as the
    JSON name to make the special nature of this field clearer.
    Add and update tooling in Rust / Go to make it easier to work with
    internal fields.
    
    For the moment, filter out `$internal` fields in the connector-proxy.
    We can remove this once these protocol changes have propagated to
    connectors.
    
    * Remove most of the internal BuildAPI, leaving only its Config
      message.
    
    * Derive: update Response.Spec to exactly match that of the capture &
      materialize protocols. We should further extract the various
      Response.Spec messages into a common message with the same tags and
      names. I avoided doing so just now to reduce churn.
    
    * Add explicit zeroing in Rust of the `config_json` fields. `sops`
      decryption will shortly happen from Rust, and this provides a tight
      bound on how long decrypted credentials can be resident in memory.
    jgraettinger committed Sep 1, 2023
    Configuration menu
    Copy the full SHA
    dae5e19 View commit details
    Browse the repository at this point in the history
  2. crates/sources: rework Loader to be Send + Sync

    For use in a tokio::Runtime, which requires these traits.
    
    This is basically using a BoxFuture instead of LocalBoxFuture, and
    making the internal `tables` field thread-safe.
    
    It remains the case that Loader does not evaluate in parallel -- only
    concurently.
    jgraettinger committed Sep 1, 2023
    Configuration menu
    Copy the full SHA
    bbd3344 View commit details
    Browse the repository at this point in the history
  3. derive-sqlite/typescript: containerize and update for protocol updates

    Both connectors are updated for protocol changes, especially handling of
    internal fields and new helper routines.
    
    `derive-typescript` is further updated to build outside of the root
    Cargo workspace, as a stand-alone Dockerized binary.
    jgraettinger committed Sep 1, 2023
    Configuration menu
    Copy the full SHA
    ba2bdc0 View commit details
    Browse the repository at this point in the history
  4. crates/validation: update Connectors + ControlPlane traits

    * Make `Connectors` and `ControlPlane` Send + Sync so it can be used in
      tokio::Runtime.
    
    * Have `Connectors` take and receive top-level Request/Response so that
      `internal` fields may be provided. This is used for threading-through
      log level as well as for resolving flow.NetworkPorts.
    
    * Remove `inspect_image` from Connectors, as that is no longer required
      (related network ports are instead carried by the Response.internal field).
    
    * Remove all dependencies on build_api::Config. All we actually need is
      the `build_id` and `project_root`, so use those instead.
    
    * Some use of destructuring is updated due to new zero-ing Drop
      implementations for message types.
    
    * Add a new explicit validation that generated file URLs are valid.
      (They sometimes weren't in some prior build configurations).
    jgraettinger committed Sep 1, 2023
    Configuration menu
    Copy the full SHA
    2c88f15 View commit details
    Browse the repository at this point in the history

Commits on Sep 12, 2023

  1. crates/async-process: add input_output

    For passing input to a process while also accumulating its output.
    jgraettinger committed Sep 12, 2023
    Configuration menu
    Copy the full SHA
    52efa51 View commit details
    Browse the repository at this point in the history
  2. crates/runtime: support for connector images

    Add `unseal` module which performs `sops` decryptions, matching our
    existing Go implementation.
    
    Add `container` module which manages an image container instance,
    similar to our Go implementation.
    
    Containers have an explicit, random, and recognizable container name
    like `fc_a50cfebf` instead of docker-generated pet names like `bold_bose`.
    This allows us to inspect containers that we start.
    
    Containers do NOT use host port-mappings, as the Go implementation does.
    Instead a runtime::Container description is returned which includes the
    containers allocated (host-local) IP address as well as the
    network-ports inspected from its container image.
    
    There are a variety of startup races that Docker doesn't do a great job
    of managing -- there is no precise means of knowing, through docker,
    when it's safe to ask what a started container's IP address is, because
    Docker does network setup as an async background task. It IS reliable to
    wait until the internal container process has definititively started,
    which we can do by waiting for _something_ to arrive over its forward
    stderr. So, we have flow-connector-init write a single whitespace byte
    on startup and expect to read it from the host.
    
    Next, building on module `container`, add an `image_connector` adapter
    which manages the lifecyle of an RPC stream which may start and restart
    *many* container instances over its lifetime.
    
    Refactor deriev::Middleware into a common runtime::Runtime which
    provides all three (capture, derive, materialize) of the core Runtime
    RPCs. We'll extend this further with shuffling services in the future.
    
    Capture and materialize have minimal implementations which dispatch to an
    image connector.
    
    Derivations are re-worked to transparently map a `derive: using: typescript`
    into an image connector of `ghcr.io/estuary/derive-typescript:dev`.
    The hacky `flowctl` flow for running these derivations is removed.
    
    Also refactor TaskService & TaskRuntime into a thinner TaskServie and
    reusable TokioContext. TokioContext makes it easy to build an isolated
    tokio::Runtime with that dispatches its tracing events to an ops::Log handler,
    and has the required non-synchronous Drop semantics.
    jgraettinger committed Sep 12, 2023
    Configuration menu
    Copy the full SHA
    ddf0484 View commit details
    Browse the repository at this point in the history
  3. crates/build: remove CGO build API and refactor routines

    The CGO build API is removed entirely. Instead, the role of the `build`
    crate is now to hold common routines shared across the `agent` and
    `flowctl` crates for building Flow specifications into built catalog
    specs.
    
    It also provides general-purpose implementations of sources::Fetcher
    and validation::Connectors, but not validation::ControlPlane.
    jgraettinger committed Sep 12, 2023
    Configuration menu
    Copy the full SHA
    035fe1b View commit details
    Browse the repository at this point in the history
  4. crates/flowctl: raw build/json-schema and build-crate refactors

    Overhaul the `local_specs` module atop the refactored `build` crate.
    What remains is intended to be opinionated wrappers of routines provide
    by `build`.
    
    Don't immediately fail un-credentialed users. This allows invocations
    that don't require authenticated control-plane access to succeed.
    
    Initialize to a WARN log-level by default (so that "you're not
    authenticated" warnings actually show up).
    
    Add a `raw build` subcommand which performs a build::managed_build().
    This is a replacement for `flowctl-go api build`.
    
    Add a `raw json-schema` subcommand to replace `flowctl-go json-schema`.
    jgraettinger committed Sep 12, 2023
    Configuration menu
    Copy the full SHA
    2431586 View commit details
    Browse the repository at this point in the history
  5. crates/agent: perform in-process managed builds

    Instead of shelling out to `flowctl-go`.
    
    This is a fairly stright forward drop-in replacement with
    `build::managed_build`, dispatched to an isolated runtime::TokioContext
    so that in-process tracing events can also be gathered.
    
    However, this means the Go logrus package is no longer responsible for
    text-formatting logs and we need to present structured instances of
    `ops::Log` to the control-plane and UI. Eventually we'd like to directly
    pass-through structured logs into the `internal.log_lines` table, but
    for now I've introduced a reasonable ANSI-colored rendering of
    deeply structured logs (that, IMO, improves on logrus).
    
    When we're ready, we can rip this out and trivially pass-through JSON
    encodings of ops::Logs.
    jgraettinger committed Sep 12, 2023
    Configuration menu
    Copy the full SHA
    0d81819 View commit details
    Browse the repository at this point in the history
  6. Configuration menu
    Copy the full SHA
    eedd881 View commit details
    Browse the repository at this point in the history
  7. go: use flowctl raw build for catalog builds

    bindings.BuildCatalog() remains, but is now a simple wrapper around
    `flowctl raw build`.
    
    bindings.CatalogJSONSchema and `flowctl-go json-schema` are removed.
    jgraettinger committed Sep 12, 2023
    Configuration menu
    Copy the full SHA
    cc21dce View commit details
    Browse the repository at this point in the history
  8. Configuration menu
    Copy the full SHA
    473a6a7 View commit details
    Browse the repository at this point in the history
  9. crates/sources: fix some subtle bugs in spec indirection and merging

    Merging specs into source tables and indirecting specs are
    transformations done by `flowctl` as part of manipulating a user's local
    catalog specs. It's not used in the control-plane or data-plane.
    As such, in my testing I found some aspects were a bit fast-and-lose.
    
    Bug 1) sources::extend_from_catalog() was not properly attaching the fragment
    scopes to entities merged into tables::Sources from a models::Catalog.
    
    This matters because those scopes may be used to idenfity JSON schemas
    to bundle when later inline-ing the schema of a collection, so they need
    to be correct. Update the routine to be much more pedantic about
    producing correct scopes.
    
    Bug 2) sources::indirect_large_files() wasn't producing tables::Imports
    for the resources that it chose to indirect. This *also* matters if
    those sources are later inlined again, because JSON schema bundling
    relies on correct imports to identify schemas to bundle.
    
    Fixing both of these bugs allows `flowctl catalog pull-specs` to
    correctly pull down, indirect, then inline, validate, and generate files
    for a control-plane derivation (where it was failing previously).
    
    Extend sources scenario test coverage to fully compare each fixture for
    equality after fully round-tripping through indirection.
    
    Bonuses, that got wrapped up in this changeset and which I'm not
    bothering to pull out right now:
    
    * build::Fetcher now uses a thirty-second timeout for all fetches.
      In combination with the connectors timeout, this means catalog builds
      are now guaranteed to complete in a bounded time frame.
    
    * Move flowctl::local_specs::into_catalog() =>
      sources::merge::into_catalog().
    
    * When running with RUST_LOG=debug, `flowctl` will now persist a SQLite
      build database for examination or shipping to Estuary support.
    jgraettinger committed Sep 12, 2023
    Configuration menu
    Copy the full SHA
    9a6e209 View commit details
    Browse the repository at this point in the history
  10. flowctl: support docker desktop and missing flow-connector-init

    When running on non-linux OS's, ask docker to publish all ports
    (including the init port), then extract and pass around the mapping
    of container ports to mapped host addresses.
    
    Also fix a flow-connector-init startup race where it writes to stderr
    before its port has been bound.
    jgraettinger committed Sep 12, 2023
    Configuration menu
    Copy the full SHA
    e751641 View commit details
    Browse the repository at this point in the history