“Littlepay would like to resolve the customer funding sources that are missing (or are missing the principal customer id) by creating a "catch up" file of them and posting it to our data feed bucket.
This would mean that we will have an extra file put in the bucket (potentially more than one) with the missing or incomplete records (per merchant of course), it will be time-stamped the date it is produced and should just be read in in order as is normal.”
Two cases:
Mismatched participant_id inside customer_funding_source
No funding_source_id inside customer_funding_source
As requested by Akos
I don't think this will work, because the Airflow operator only syncs the most recent single file from the most recent partition in the Littlepay S3 bucket to our GCS bucket:
Airflow operator littlepay_raw_sync.py (link) syncs most recent data from Littlepay S3 buckets
This only takes a single file from the most recent partition, per get_latest_file (link to line in operator, link to source file)
But, will it work if all of the updated rows are in a single new file, alongside the new data we otherwise expect?
We do handle new or updated data via hashing and deduplication
Example in stg_littlepay__customer_funding_source (link): hashing here, deduplication using this macro in the file here
However, the hash above only catches full-row duplicates, so it seems like it won't exclude the previous rows with missing data in favor of the new rows
BUT! I believe that this QUALIFY ROW_NUMBER(), partitioned by participant_id, funding_source_id, customer_id and ordered by littlepay_export_ts DESC, will do exactly that (link); see the sketch after this list
On the join, new funding_source_id values will get picked up in fct_payment_rides_v2
tl;dr: it looks like we need to receive the catch-up data as a single file, but if we do, then the handling described in the QUALIFY bullet above should prioritize the new data and new principal_customer_id values via that QUALIFY statement
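For reference, here is a minimal sketch of the two deduplication layers described above, assuming BigQuery SQL. The source table name, column list, and hash construction are illustrative (the real staging model does this via macros), but the QUALIFY logic mirrors what the bullet above describes for stg_littlepay__customer_funding_source:

```sql
-- Minimal sketch (not the actual staging model) of the two dedup layers discussed above.
-- Table and column names other than the key/ordering columns are illustrative.
WITH source AS (

    SELECT
        participant_id,
        funding_source_id,
        customer_id,
        principal_customer_id,
        littlepay_export_ts,
        -- Layer 1: full-row hash. This only collapses exact duplicates, so a
        -- re-sent row with a newly populated principal_customer_id hashes
        -- differently and is NOT removed here.
        TO_BASE64(MD5(CONCAT(
            COALESCE(participant_id, ''),
            COALESCE(funding_source_id, ''),
            COALESCE(customer_id, ''),
            COALESCE(principal_customer_id, ''),
            CAST(littlepay_export_ts AS STRING)
        ))) AS _content_hash

    FROM external_littlepay.customer_funding_source  -- hypothetical source name

),

deduped_exact AS (

    -- Keep one copy of each identical row
    SELECT * EXCEPT (_row_num)
    FROM (
        SELECT
            *,
            ROW_NUMBER() OVER (
                PARTITION BY _content_hash
                ORDER BY littlepay_export_ts
            ) AS _row_num
        FROM source
    )
    WHERE _row_num = 1

)

-- Layer 2: keep only the most recently exported version of each
-- (participant_id, funding_source_id, customer_id) key, so a later
-- "catch up" row supersedes an earlier incomplete one.
SELECT *
FROM deduped_exact
WHERE 1 = 1  -- BigQuery expects a WHERE/GROUP BY/HAVING alongside QUALIFY
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY participant_id, funding_source_id, customer_id
    ORDER BY littlepay_export_ts DESC
) = 1
```

Under this logic, a catch-up row only wins if its littlepay_export_ts is later than the original row's, which is why it matters that the corrected records arrive in a later-timestamped file that the sync actually picks up.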
The new data actually won't get picked up by our pipeline if it's sent as multiple files at the same time. So it seems like there are two options:
See if everything can be published as a single file
Publish the data as multiple files more than one hour apart. Our pipeline looks for new data every hour and will only take the most recent file each hour, so if the files are published in a staggered way, they should all be picked up
Would it be possible to send a sample file ahead of the publishing of the modified historical data? We have assumptions about how the pipeline will handle the new rows, but would love to test them beforehand to see exactly how the data will be differentiated.