Store twitter API responses #194
Conversation
@ri72miieop is attempting to deploy a commit to the theexgenesis' projects Team on Vercel. A member of the Team first needs to authorize it.
```ts
const filename = `${record.originator_id}/${record.timestamp}__${record.type}.json`;

// Upload file with explicit error handling
const { error: uploadError } = await supabase.storage
```
does this make a massive number of files? would it be better to save batches?
I am already processing them in batches (default of 150); afaict there's no batching API for uploading the data, right?
(I may have misunderstood what you meant by batches... if so pls let me know)
I meant saving a batch in a single file, wyt?
I think that would be worse. This way there's one file per record, and all the data for that record is contained in that file for later access. Imagine we need to reprocess all files: we'd simply go through all the folders and grab the latest item from each; if we batch them, we'd need to pick & choose the records from the files we grabbed. Reprocessing a specific record is also straightforward: just use the ID to get to the proper folder and pick the latest item.
We end up with more files this way, but I am not seeing any negative consequences from it, and I see some positive ones:
- if we want to move files to a "cold storage" system we could move all except the latest (in case it may need to be reprocessed); this way it's easy and straightforward
- downloading all the most recent responses (for example to have an updated dev db) would cost less traffic
- it's easy to find the most recent record(s)
- when we get a file we have all the data related to it at that specific time, so we don't need to do any kind of processing over the file
if I am missing any negative consequence or limitation please let me know tho!
Note: Just to be clear, as it is, a single file will have several entities (tweet, user, media, urls, etc.) related to that item.
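To make the layout concrete, here is a minimal sketch (not code from this PR) of the "grab the latest item from each folder" flow described above. It assumes the twitter_api_files bucket from the PR description, the `<originator_id>/<timestamp>__<type>.json` naming from the diff, and that timestamps sort lexicographically; the actual implementation may differ.

```ts
import { createClient } from "@supabase/supabase-js";

const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_ROLE_KEY!,
);

// Fetch the most recent stored API response for one record (folder = originator_id).
async function latestRecord(originatorId: string): Promise<unknown | null> {
  // Filenames start with the timestamp, so sorting by name descending
  // puts the newest file first (assuming timestamps sort lexicographically).
  const { data: files, error } = await supabase.storage
    .from("twitter_api_files")
    .list(originatorId, { limit: 1, sortBy: { column: "name", order: "desc" } });
  if (error) throw error;
  if (!files || files.length === 0) return null;

  const { data: blob, error: downloadError } = await supabase.storage
    .from("twitter_api_files")
    .download(`${originatorId}/${files[0].name}`);
  if (downloadError) throw downloadError;
  if (!blob) return null;

  return JSON.parse(await blob.text());
}
```

Listing with `limit: 1` and `sortBy: { column: "name", order: "desc" }` is what makes "pick the latest per folder" a single storage call.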
```sql
ADD COLUMN IF NOT EXISTS updated_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP;

DROP TRIGGER IF EXISTS update_tweets_updated_at ON public.tweets;
CREATE TRIGGER update_tweets_updated_at
```
did you check this doesn't slow things down when inserting a large archive?
It would be nice to have these kinds of benchmarks... I did NOT check, but it's a simple trigger which updates a column, it shouldn't have much of an effect, right?
Yeah I wouldn't expect it to be too bad, but I've been surprised by how slow RLS makes things for example
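One cheap way to get a ballpark answer on a dev database: time a bulk upsert into public.tweets with the trigger in place, drop the trigger, and time it again. The sketch below is illustrative only (not code from this PR); buildSampleTweetRows is a hypothetical helper that would return rows shaped like public.tweets, and the real archive-import path may not go through supabase-js at all.

```ts
import { createClient } from "@supabase/supabase-js";

const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_ROLE_KEY!, // service role so RLS doesn't dominate the timing
);

// Time upserting `rows` into public.tweets in batches; returns elapsed seconds.
async function timeUpsert(rows: Record<string, unknown>[], batchSize = 1000): Promise<number> {
  const start = performance.now();
  for (let i = 0; i < rows.length; i += batchSize) {
    const { error } = await supabase.from("tweets").upsert(rows.slice(i, i + batchSize));
    if (error) throw error;
  }
  return (performance.now() - start) / 1000;
}

// const rows = buildSampleTweetRows(50_000); // hypothetical: rows shaped like public.tweets
// console.log(`first pass (inserts): ${await timeUpsert(rows)}s`);
// console.log(`second pass (updates): ${await timeUpsert(rows)}s`);
```

Comparing the second pass with and without the trigger isolates its cost, assuming it is a BEFORE UPDATE trigger (as the name suggests) and so fires on the upsert's conflict path.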
Three resolved review threads (marked outdated) on supabase/migrations/20241204170257_TES_send_data_to_community_archive.sql
```sql
CREATE OR REPLACE FUNCTION private.tes_process_account_records()
```
until now accounts have included only rows from people who uploaded, and includes stats calculations for each row in the table - this changes things. Any chance we could have a scraped_account table, or twitter_user instead, at least until we figure out how to redesign it?
Sure, I can import them into a different table for now, but the tweets need a valid account_id to be inserted, so if I import the accounts into a different table I can't also import into the tweets table. Maybe this is a point we should talk about Monday.
Ah right, I'll think about it, let's figure it out on Monday
…_to_storage: removed duplicated function to update the updated_at column; changed "tes_invoke_edge_function_process_temporary_data" to "tes_invoke_edge_function_move_data_to_storage" to be clearer; throw exception on schedule_data_moving when invocation isn't successful
removed tables account and profile and changed them into a view that only shows data from people that uploaded to the community archive
…into TES * 'TES' of https://github.com/ri72miieop/community-archive: db: make idempotent; db: comment remote schema; db: unique rls names; db: unique rls policy names; db: if exists; db: comment out first rls mig; db: revert global activity summary to update correctly
…into TES (conflicts: supabase/migrations/20241204170257_TES_send_data_to_community_archive.sql)
…and profile (all_account, all_profile)
…fferent archives); safer way to drop table profile and account without a cascade;
Fixes and improvements of PR "Store twitter API responses #194"
Changes made:
- archive_id changed to allow nulls (tweets, profile, followers, following, likes)
- created column updated_at (tweets, profile, followers, following, likes)
- new table tes_blocked_scraping_users to disable users from uploading data
- new table temporary_data to record the data to process
- new storage folder twitter_api_files to store the files
- cron job to process the data and update each respective table (running once every 5 min)
- cron job calling an edge function that stores the file on Supabase storage (running every min)

NOTE: we need to set up the env vars for edge functions
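For reference, here is a minimal sketch of what the storage-moving edge function described above could look like. Several details are assumptions layered on top of this thread: the temporary_data columns beyond originator_id/timestamp/type, the batch size of 150, the upsert: true option, and how processed rows are marked are all guesses, and the PR's actual function may differ.

```ts
// Sketch of an edge function that moves pending temporary_data rows to the
// twitter_api_files bucket using the <originator_id>/<timestamp>__<type>.json
// naming from the diff. Assumptions are noted inline; not the PR's actual code.
import { createClient } from "npm:@supabase/supabase-js@2";

Deno.serve(async (_req) => {
  const supabase = createClient(
    Deno.env.get("SUPABASE_URL")!,               // the kind of env vars the NOTE refers to
    Deno.env.get("SUPABASE_SERVICE_ROLE_KEY")!,
  );

  const { data: records, error } = await supabase
    .from("temporary_data")
    .select("*")
    .limit(150); // the discussion above mentions a default batch size of 150

  if (error) return new Response(error.message, { status: 500 });

  for (const record of records ?? []) {
    const filename = `${record.originator_id}/${record.timestamp}__${record.type}.json`;
    const { error: uploadError } = await supabase.storage
      .from("twitter_api_files")
      .upload(filename, JSON.stringify(record), {
        contentType: "application/json",
        upsert: true, // assumption: overwrite if the same record is processed twice
      });
    if (uploadError) return new Response(uploadError.message, { status: 500 });
    // Marking the row as processed (or deleting it) is omitted here because
    // that part isn't shown in this thread.
  }

  return new Response(JSON.stringify({ moved: records?.length ?? 0 }), {
    headers: { "Content-Type": "application/json" },
  });
});
```

SUPABASE_URL and SUPABASE_SERVICE_ROLE_KEY are typically available to hosted edge functions by default; any additional secrets the function needs would have to be set explicitly, which is presumably what the NOTE is about.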