
Store twitter API responses #194

Merged (28 commits, Jan 24, 2025)
Conversation

ri72miieop (Contributor):
Changes made:

  • archive_id changed to allow NULLs (tweets, profile, followers, following, likes)

  • added an updated_at column (tweets, profile, followers, following, likes)

  • new table tes_blocked_scraping_users to block specific users from uploading data

  • new table temporary_data to record incoming data awaiting processing

  • new storage folder twitter_api_files to store the files

  • cron job to process the data and update each respective table (runs once every 5 minutes)

  • cron job calling an edge function that stores the files in Supabase Storage (runs every minute; see the sketch after this list)

    • this edge function works as a manager: if there is work to be done it calls the processing edge function up to 5 times, otherwise it calls it only once
    • each call can move up to 150 files

NOTE: we need to set up the env vars for edge functions
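
For illustration, here is a minimal sketch of how the every-minute job could be wired up with pg_cron and pg_net on Supabase. The job name is taken from the commit list below; the edge function URL, endpoint name, and the way the service-role key is read are assumptions, not the exact migration in this PR.

-- Hypothetical sketch: invoke the "move data to storage" edge function every minute.
select cron.schedule(
  'tes_invoke_edge_function_move_data_to_storage',  -- job name (per the commit list below)
  '* * * * *',                                      -- every minute
  $$
  select net.http_post(
    url     := 'https://<project-ref>.supabase.co/functions/v1/schedule_data_moving',  -- assumed endpoint
    headers := jsonb_build_object(
      'Content-Type',  'application/json',
      'Authorization', 'Bearer ' || current_setting('app.settings.service_role_key')   -- assumed secret plumbing
    ),
    body    := '{}'::jsonb
  );
  $$
);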


vercel bot commented Dec 30, 2024

@ri72miieop is attempting to deploy a commit to the theexgenesis' projects Team on Vercel.

A member of the Team first needs to authorize it.

const filename = `${record.originator_id}/${record.timestamp}__${record.type}.json`;

// Upload file with explicit error handling
// (the diff is truncated here; the continuation below is an assumed completion)
const { error: uploadError } = await supabase.storage
  .from('twitter_api_files')
  .upload(filename, JSON.stringify(record.data), { contentType: 'application/json' });
if (uploadError) throw uploadError;
TheExGenesis (Owner):

does this make a massive number of files? would it be better to save batches?

ri72miieop (Contributor, Author):

I am already processing them in batches (default of 150); as far as I can tell there's no batching API for uploading the data, right?

(I may have misunderstood what you meant by batches... if so, please let me know.)

TheExGenesis (Owner):

I meant saving a whole batch in a single file, what do you think?

ri72miieop (Contributor, Author):

I think that would be worse. This way there is one file per record, and all the data for that record is contained in the file for later access. Imagine we need to reprocess all files: we would simply go through all folders and grab the latest item from each; if we batched them, we would need to pick and choose the records from the files we grabbed. Reprocessing a specific record is also straightforward: just use the ID to get the proper folder and pick the latest item.

We end up with more files this way, but I am not seeing any negative consequences, and I do see some positive ones:

  • if we want to move files to a "cold storage" system we could move everything except the latest item per folder (in case it needs to be reprocessed); this layout makes that easy and straightforward
  • downloading all the most recent responses (for example, to build an updated dev DB) would cost less traffic
  • it is easy to find the most recent record(s)
  • when we get a file we have all the data related to it at that specific point in time, so we don't need to do any processing over the file

If I am missing any negative consequence or limitation, please let me know though!

Note: just to be clear, a single file will contain several entities (tweet, user, media, URLs, etc.) related to that item.
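
To make the "grab the latest item from each folder" idea concrete, here is a hypothetical query against Supabase's storage.objects metadata table. The bucket name comes from this PR; the query itself is an illustration, not code from the change.

-- Hypothetical sketch: most recent stored response per originator.
-- Object names follow the "<originator_id>/<timestamp>__<type>.json" scheme above.
SELECT DISTINCT ON (split_part(name, '/', 1))
       split_part(name, '/', 1) AS originator_id,
       name,
       created_at
FROM storage.objects
WHERE bucket_id = 'twitter_api_files'
ORDER BY split_part(name, '/', 1), created_at DESC;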

ALTER TABLE public.tweets  -- table name assumed from the trigger below; the diff is truncated here
ADD COLUMN IF NOT EXISTS updated_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP;

DROP TRIGGER IF EXISTS update_tweets_updated_at ON public.tweets;
CREATE TRIGGER update_tweets_updated_at
BEFORE UPDATE ON public.tweets
FOR EACH ROW EXECUTE FUNCTION update_updated_at_column();  -- function name assumed; see sketch below
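
For context, the function such a trigger executes is typically a few lines of PL/pgSQL. This is a generic sketch; the actual name and schema in the migration may differ (the commit list below mentions a duplicated version of this function being removed).

-- Generic updated_at trigger function (names are assumptions).
CREATE OR REPLACE FUNCTION update_updated_at_column()
RETURNS trigger AS $$
BEGIN
  NEW.updated_at = CURRENT_TIMESTAMP;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;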
TheExGenesis (Owner):

did you check this doesn't slow things down when inserting a large archive?

ri72miieop (Contributor, Author):

It would be nice to have these kinds of benchmarks... I did NOT check, but it's a simple trigger that updates a column; it shouldn't have much of an effect, right?
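
A quick way to get such a benchmark would be to bulk-load a sample archive twice, once with the trigger disabled, in a throwaway environment. This is a hypothetical sketch (psql's \timing plus standard ALTER TABLE ... DISABLE TRIGGER), not something run for this PR.

-- Hypothetical micro-benchmark sketch (do not run in production).
\timing on
ALTER TABLE public.tweets DISABLE TRIGGER update_tweets_updated_at;
-- bulk-load the sample archive here and note the elapsed time
ALTER TABLE public.tweets ENABLE TRIGGER update_tweets_updated_at;
-- bulk-load the same sample again and compare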

TheExGenesis (Owner):

Yeah, I wouldn't expect it to be too bad, but I've been surprised by how slow RLS makes things, for example.




CREATE OR REPLACE FUNCTION private.tes_process_account_records()
TheExGenesis (Owner):

Until now, accounts has included only rows from people who uploaded, and it includes stats calculations for each row in the table; this changes things. Any chance we could have a scraped_account table, or twitter_user, at least until we figure out how to redesign it?

ri72miieop (Contributor, Author):

Sure, I can import them into a different table for now, but tweets need a valid account_id to be inserted, so if I import the accounts into a different table I can't also import into the tweets table. Maybe this is a point we should talk about on Monday.
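
The constraint behind this is presumably a foreign key along these lines (a sketch; the actual constraint and column names in the schema may differ):

-- Hypothetical sketch of the constraint that ties tweets to accounts.
ALTER TABLE public.tweets
  ADD CONSTRAINT tweets_account_id_fkey
  FOREIGN KEY (account_id) REFERENCES public.account (account_id);

Inserting scraped accounts into a separate table would leave tweets.account_id pointing at rows that don't exist in public.account, so those tweet inserts would fail.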

TheExGenesis (Owner):

Ah right, I'll think about it, let's figure it out on Monday

(A comment here was marked as outdated.)

supabase/functions/process_temporary_data/index.ts (outdated review thread, resolved)
ri72miieop and others added 14 commits January 4, 2025 20:09
…_to_storage

removed duplicated function to update updated_at column
changed "tes_invoke_edge_function_process_temporary_data" to "tes_invoke_edge_function_move_data_to_storage" to be clearer
throw exception on schedule_data_moving when invocation isn't successful
removed tables account and profile and changed them into a view that only shows data from people who uploaded to the community archive
…into TES

* 'TES' of https://github.com/ri72miieop/community-archive:
  db: make idempotent
  db: comment remote schema
  db: unique rls names
  db: uniquerls policy names
  db: if exists
  db: comment out first rls mig
  db: revert global activity summary to update correctly
…into TES

# Conflicts:
#	supabase/migrations/20241204170257_TES_send_data_to_community_archive.sql

vercel bot commented Jan 21, 2025

The latest updates on your projects:

Name: community-archive | Status: ✅ Ready | Updated (UTC): Jan 22, 2025 6:21pm




CREATE OR REPLACE FUNCTION private.tes_process_account_records()

(Another comment on this line was marked as outdated.)

Resolved review threads: sql/permissions/02_table_permissions.sql, sql/tables/01_all_account.sql
TheExGenesis merged commit fab5af5 into TheExGenesis:main on Jan 24, 2025. 3 checks passed.
TheExGenesis added a commit that referenced this pull request Jan 26, 2025
Fixes and improvements of PR "Store twitter API responses #194"