Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PSC-STM-B1: Add Storage for PSC Kinesis pointers per shard #249

Open
tiredpixel opened this issue Feb 28, 2024 · 2 comments
Open

PSC-STM-B1: Add Storage for PSC Kinesis pointers per shard #249

tiredpixel opened this issue Feb 28, 2024 · 2 comments
Assignees

Comments

@tiredpixel
Copy link
Contributor

This is similar to Task T3, where a stream pointer will need to be stored and retrieved when processing from the Kinesis Stream.

A slight complexity is that stream pointers actually exist per shard as opposed to for the whole stream, so it is necessary for each shard consumer to do this. In practice, we currently only use a single shard, so for an initial release it would suffice to assume this.

Estimate: 4 hours

@tiredpixel
Copy link
Contributor Author

Clarification from PDF about meaning of PSC-STM-B section:

Part B: Continuous Transformation

Goal
The goal is to replace the bulk transformation from S3 for processing PSC records with transforming them immediately.

@tiredpixel
Copy link
Contributor Author

It was found that the existing Transformer PSC streaming code already handled storage of stream pointers, including for multiple shards. Some experiments with this code were made, and things seem to be working as they ought with regards to handling the Kinesis stream itself.

Note that there isn't a single thread per shard, contrary to Kinesis recommendations, but since we're currently using a single shard per stream and in consideration of the potential of race conditions within BODS statement publishing (e.g. Transformer PSC bulk import), this seems fine for now.

Various logging was added, since there wasn't any visibility into what part of the stream was being played, nor record-level logging such as was already added to Ingester PSC. It should now be far easier to see whether things are working as they ought, and what's currently happening, even if there is no data currently to be consumed.

D, [2024-04-29T12:06:47.409687 #1] DEBUG -- : [shardId-000000000000] LAG: 7767s | N: 0
D, [2024-04-29T12:06:48.486908 #1] DEBUG -- : [shardId-000000000000] LAG: 7409s | N: 0
D, [2024-04-29T12:06:49.856137 #1] DEBUG -- : [shardId-000000000000] LAG: 7306s | N: 50
I, [2024-04-29T12:06:49.856372 #1]  INFO -- : [shardId-000000000000] [49649995681699993522081054639667922658832751255758045186] /company/12623074/persons-with-significant-control/individual/ljaZYMgpUOgr00UWT0FC8xPM99k
I, [2024-04-29T12:06:49.856460 #1]  INFO -- : [shardId-000000000000] [49649995681699993522081054640331622933801182672671735810] /company/08182390/persons-with-significant-control/individual/ab8m2iBBRbWU7TpxFUl-c73FcUk
I, [2024-04-29T12:06:49.856552 #1]  INFO -- : [shardId-000000000000] [49649995681699993522081054640528677842398367228148842498] /company/09790717/persons-with-significant-control/individual/xDMA8UvAMcdwQBM441ypFqRjQ3M
I, [2024-04-29T12:06:49.856606 #1]  INFO -- : [shardId-000000000000] [49649995681699993522081054640758373748125146771343015938] /company/09790842/persons-with-significant-control/individual/MZkzYP2Bu86mzv4C9m7hb1Zt-mc
I, [2024-04-29T12:06:49.856686 #1]  INFO -- : [shardId-000000000000] [49649995681699993522081054640934876917788882630850117634] /company/09790919/persons-with-significant-control/individual/U-w3feL6dBsjrs-WQyoBOBiLDYI
I, [2024-04-29T12:06:49.856738 #1]  INFO -- : [shardId-000000000000] [49649995681699993522081054641157319268597974398996054018] /company/08999723/persons-with-significant-control/individual/2BtFZsQnJusqG5V9vB6i9Bw5GXg

Note that there is a 1s delay between each fetch of the stream. Although this makes things work slower than they could otherwise, this is in keeping with Kinesis recommendations. Despite this, a 24h stream with few-to-no events could be caught up within a few minutes, so this should be okay. The current lag of each shard was added to logging, to make this more apparent:

D, [2024-04-29T12:12:18.712305 #1] DEBUG -- : [shardId-000000000000] LAG: 396s | N: 0
D, [2024-04-29T12:12:19.784994 #1] DEBUG -- : [shardId-000000000000] LAG: 129s | N: 0
D, [2024-04-29T12:12:20.845756 #1] DEBUG -- : [shardId-000000000000] LAG: 0s | N: 0
D, [2024-04-29T12:12:21.904611 #1] DEBUG -- : [shardId-000000000000] LAG: 0s | N: 0

@tiredpixel tiredpixel moved this from In Progress to In Testing in Open Ownership Register and BODS pipelines Apr 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

1 participant