Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PSC-STM-B3: Finish PSC Kinesis Stream Transformer #251

Open
tiredpixel opened this issue Feb 28, 2024 · 3 comments
Open

PSC-STM-B3: Finish PSC Kinesis Stream Transformer #251

tiredpixel opened this issue Feb 28, 2024 · 3 comments
Assignees

Comments

@tiredpixel
Copy link
Contributor

The main work recently (and monthly import) have involved running the bulk transformer, which transforms from the S3 files produced from buffering the Kinesis stream.

This means the app which consumes from Kinesis directly hasn’t been run or updated recently.

  • Extend Kinesis Stream consuming to allow consuming from multiple shards
  • Update the app to follow the format of our other apps
  • Add tests

Estimate: 4 hours

@tiredpixel
Copy link
Contributor Author

The existing code works for consuming from multiple Kinesis shards. However, the manner by which it does that isn't optimal:

  • fetch list of shards
  • for each shard in stream [A]
    • fetch batch of 50 records (max)
    • for each record in batch
      • process record (statement)
      • save sequence number/pointer for shard
  • sleep 1s
  • goto (A)

There are a number of considerations with this approach:

  1. Despite sleeping being according to Kinesis recommendations, so as not to read the stream too frequently, it means catching up a shard takes longer, even for iterations which return no records.
  2. Interleaving between shards like this means that by the time the first shard is returned to, the iterator might have timed out (which happens after 5m).
  3. There isn't a separate thread for each shard (contrary to Kinesis recommendations), so processing these records cannot be done in parallel.
  4. It's not possible to start a separate process to consume only one shard, so processing these records cannot be done in parallel and the program cannot be scaled horizontally.

Extending to support additional threads or processes likely wouldn't be too much work; however, multi-threading hasn't always been smooth-sailing with existing bulk data (i.e. non-stream) transformations, and I'm concerned this could lead to more conflicts when writing to Elasticsearch resulting in program crashes.

Despite these limitations, the existing approach is likely good enough for us at present, because we're using only a single shard per stream, and even a single shard is able to cope with a far higher throughput than we're able to cope with, given how long it takes to process each statement. Not only that, but using multiple shards affects event order, and this would have to be considered carefully for our use case, especially given that statements are generally order-sensitive.

@tiredpixel
Copy link
Contributor Author

Kinesis Quotas and Limits

Data stream throughput

Provisioned mode

There is no upper limit. Maximum throughput depends on the number of shards provisioned for the stream. Each shard can support up to 1 MB/sec or 1,000 records/sec write throughput or up to 2 MB/sec or 2,000 records/sec read throughput. […]

https://docs.aws.amazon.com/streams/latest/dev/service-sizes-and-limits.html

This is approximately 3 orders of magnitude faster on write than we're currently utilising.

@tiredpixel
Copy link
Contributor Author

In keeping with recent work on other parts of the program, I'm not writing extra formal tests for this. I am not convinced of the benefit of doing so, especially as keeping to the previous pattern would result in calls to Kinesis and other external services being stubbed (i.e. not actually executed live) anyway. I note there are some existing tests checking some overall calls, but extending these would be significant work, and I'm unpersuaded about the merit of doing so considering other details of the project, codebase, and roadmap.

@tiredpixel tiredpixel moved this from In Progress to In Testing in Open Ownership Register and BODS pipelines Apr 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

1 participant