
Backfill missing and modified Littlepay customer_funding_source data #3528

Open · charlie-costanzo opened this issue Nov 5, 2024 · 2 comments

charlie-costanzo commented Nov 5, 2024

“Littlepay would like to resolve the customer funding sources that are missing (or are missing the principal customer id) by creating a "catch up" file of them and posting it to our data feed bucket.

This would mean that we will have an extra file put in the bucket (potentially more than one) with the missing or incomplete records (per merchant of course), it will be time-stamped the date it is produced and should just be read in in order as is normal.”

Two cases:
  • Mismatched participant_id inside customer_funding_source
  • No funding_source_id inside customer_funding_source

As requested by Akos

@charlie-costanzo (Member Author) commented:

I don't think this will work, because the Airflow operator will only sync the most recent single file to our GCS bucket from the most recent partition in the Littlepay S3 bucket:

  • The Airflow operator littlepay_raw_sync.py (link) syncs the most recent data from the Littlepay S3 buckets
  • It only takes a single file from the most recent partition, per get_latest_file (link to line in operator, link to source file)
  • But will it work if all of the updated rows arrive in a single new file alongside our otherwise-expected new data?
  • We do handle new or updated data via hashing and deduplication
  • Example in stg_littlepay__customer_funding_source (link): hashing here, deduplication using this macro in the file here
  • However, the hash above only catches full-row duplicates, so it won't exclude the previous rows with missing data in favor of the new rows
  • BUT! I believe that this QUALIFY ROW_NUMBER() partitioned by participant_id, funding_source_id, customer_id and ordered by littlepay_export_ts DESC will exclude them (link); see the first sketch below
  • On the join, the new funding_source_id values will get picked up in fct_payment_rides_v2; see the second sketch below
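To make the last two bullets concrete, here is a minimal sketch (BigQuery-style SQL) of the two dedup steps, using the column names from this comment. Treat it only as an illustration: the real logic lives in stg_littlepay__customer_funding_source and the shared dedup macro, and the source table name here is a placeholder.

```sql
-- Minimal sketch of the two dedup steps described above (BigQuery syntax).
-- Table name and structure are placeholders; the real logic lives in
-- stg_littlepay__customer_funding_source and the shared dedup macro.
WITH hashed AS (
    SELECT
        t.*,
        -- Full-row hash: only identifies exact duplicates, so an old row
        -- missing principal_customer_id and its corrected replacement
        -- both survive this step.
        TO_HEX(MD5(TO_JSON_STRING(t))) AS _content_hash
    FROM customer_funding_source AS t  -- placeholder for the raw source table
),

exact_dupes_removed AS (
    SELECT * EXCEPT (_row_num)
    FROM (
        SELECT
            *,
            ROW_NUMBER() OVER (
                PARTITION BY _content_hash
                ORDER BY littlepay_export_ts
            ) AS _row_num
        FROM hashed
    )
    WHERE _row_num = 1
)

-- Keep only the most recently exported row per funding source + customer,
-- so the corrected "catch up" rows win over older rows with missing data.
SELECT *
FROM exact_dupes_removed
WHERE TRUE  -- BigQuery expects WHERE/GROUP BY/HAVING alongside QUALIFY
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY participant_id, funding_source_id, customer_id
    ORDER BY littlepay_export_ts DESC
) = 1
```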

tl;dr: it looks like we need to receive one file, but if we do, then the QUALIFY handling mentioned in the bullets above should prioritize the new data and the new principal_customer_id values.
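And a purely hypothetical sketch of the downstream join mentioned in the final bullet: once staging keeps the corrected rows, a join on funding_source_id resolves principal_customer_id for rides that previously had no match. The rides-side table and join keys below are assumptions for illustration, not the actual fct_payment_rides_v2 definition.

```sql
-- Hypothetical, simplified join; NOT the real fct_payment_rides_v2 SQL.
-- The rides-side model name and join keys are assumptions for illustration.
SELECT
    rides.participant_id,
    rides.funding_source_id,
    funding.customer_id,
    funding.principal_customer_id   -- newly populated after the backfill
FROM some_rides_staging_model AS rides                 -- placeholder name
LEFT JOIN deduped_customer_funding_source AS funding   -- output of the sketch above
    ON rides.participant_id = funding.participant_id
    AND rides.funding_source_id = funding.funding_source_id
```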

@charlie-costanzo (Member Author) commented:

Sent an email to Akos about this:

The new data actually won't get picked up by our pipeline if it's sent as multiple files at the same time. So it seems like there are two options:

  • See if everything can be published as a single file
  • Publish the data as multiple files more than one hour apart. Our pipeline looks for new data every hour and will only take the most recent file each hour. If the files can be published in a scattered way like this, they should all be picked up.

Would it be possible to send a sample file ahead of the publishing of the modified historical data? We have assumptions about how the pipeline will handle the new rows, but we would love to test them beforehand to see exactly how the data will be differentiated.
