
Backfill missing and modified Littlepay customer_funding_source data #3528

Open · charlie-costanzo opened this issue Nov 5, 2024 · 2 comments

charlie-costanzo commented Nov 5, 2024

“Littlepay would like to resolve the customer funding sources that are missing (or are missing the principal customer id) by creating a "catch up" file of them and posting it to our data feed bucket.

This would mean that we will have an extra file put in the bucket (potentially more than one) with the missing or incomplete records (per merchant of course), it will be time-stamped the date it is produced and should just be read in in order as is normal.”

Two cases:
  • Mismatched participant_id inside customer_funding_source
  • No funding_source_id inside customer_funding_source

As requested by Akos

@charlie-costanzo (Member Author) commented:

I don't think this will work, because the Airflow operator will only sync the most recent single file to our GCS bucket from the most recent partition in the Littlepay S3 bucket:

  • The Airflow operator littlepay_raw_sync.py (link) syncs the most recent data from the Littlepay S3 buckets
  • It only takes a single file from the most recent partition, per get_latest_file (link to line in operator, link to source file)
  • But will it work if all of the updated rows arrive in a single new file alongside our otherwise-expected new data?
  • We do handle new or updated data via hashing and deduplication
  • Example in stg_littlepay__customer_funding_source (link): hashing here, deduplication using this macro in the file here
  • However, the hash above only catches full-row duplicates, so it won't exclude the previous rows with missing data in favor of the new rows
  • BUT! I believe that this QUALIFY ROW_NUMBER() partitioned by participant_id, funding_source_id, customer_id and ordered by littlepay_export_ts DESC will exclude them (link); see the first sketch below
  • On the join, the new funding_source_id values will get picked up in fct_payment_rides_v2; see the second sketch below
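To make the last two bullets concrete, here is a minimal sketch (BigQuery-style SQL) of the two dedup steps, using the column names from this comment. Treat it only as an illustration: the real logic lives in stg_littlepay__customer_funding_source and the shared dedup macro, and the source table name here is a placeholder.

```sql
-- Minimal sketch of the two dedup steps described above (BigQuery syntax).
-- Table name and structure are placeholders; the real logic lives in
-- stg_littlepay__customer_funding_source and the shared dedup macro.
WITH hashed AS (
    SELECT
        t.*,
        -- Full-row hash: only identifies exact duplicates, so an old row
        -- missing principal_customer_id and its corrected replacement
        -- both survive this step.
        TO_HEX(MD5(TO_JSON_STRING(t))) AS _content_hash
    FROM customer_funding_source AS t  -- placeholder for the raw source table
),

exact_dupes_removed AS (
    SELECT * EXCEPT (_row_num)
    FROM (
        SELECT
            *,
            ROW_NUMBER() OVER (
                PARTITION BY _content_hash
                ORDER BY littlepay_export_ts
            ) AS _row_num
        FROM hashed
    )
    WHERE _row_num = 1
)

-- Keep only the most recently exported row per funding source + customer,
-- so the corrected "catch up" rows win over older rows with missing data.
SELECT *
FROM exact_dupes_removed
WHERE TRUE  -- BigQuery expects WHERE/GROUP BY/HAVING alongside QUALIFY
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY participant_id, funding_source_id, customer_id
    ORDER BY littlepay_export_ts DESC
) = 1
```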

tl;dr: it looks like we need to receive one file, but if we do, then the QUALIFY handling mentioned in the bullets above should prioritize the new data and the new principal_customer_id values.
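And a purely hypothetical sketch of the downstream join mentioned in the final bullet: once staging keeps the corrected rows, a join on funding_source_id resolves principal_customer_id for rides that previously had no match. The rides-side table and join keys below are assumptions for illustration, not the actual fct_payment_rides_v2 definition.

```sql
-- Hypothetical, simplified join; NOT the real fct_payment_rides_v2 SQL.
-- The rides-side model name and join keys are assumptions for illustration.
SELECT
    rides.participant_id,
    rides.funding_source_id,
    funding.customer_id,
    funding.principal_customer_id   -- newly populated after the backfill
FROM some_rides_staging_model AS rides                 -- placeholder name
LEFT JOIN deduped_customer_funding_source AS funding   -- output of the sketch above
    ON rides.participant_id = funding.participant_id
    AND rides.funding_source_id = funding.funding_source_id
```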

@charlie-costanzo (Member Author) commented:

Sent an email to Akos about this:

The new data actually won't get picked up by our pipeline if it's sent as multiple files at the same time. So it seems like there are two options:

  • See if everything can be published as a single file
  • Publish the data as multiple files more than one hour apart. Our pipeline looks for new data every hour and will only take the most recent file each hour. If the files can be published in a scattered way like this, they should all be picked up.

Would it be possible to send a sample file ahead of the publishing of the modified historical data? We have assumptions about how the pipeline will handle the new rows, but we would love to test them beforehand to see exactly how the data will be differentiated.
