
Store twitter API responses #194

Merged (28 commits, Jan 24, 2025)
Conversation

ri72miieop (Contributor):
Changes made:

  • archive_id changed to allow NULLs (tweets, profile, followers, following, likes)

  • added an updated_at column (tweets, profile, followers, following, likes)

  • new table tes_blocked_scraping_users to block specific users from uploading data

  • new table temporary_data to record incoming data awaiting processing

  • new storage folder twitter_api_files to store the files

  • cron job to process the data and update each respective table (runs once every 5 minutes)

  • cron job calling an edge function that stores the files in Supabase Storage (runs every minute; see the sketch after this list)

    • this edge function works as a manager: if there is work to be done it calls the processing edge function up to 5 times, otherwise it calls it only once
    • each call can move up to 150 files

NOTE: we need to set up the env vars for edge functions
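
For illustration, here is a minimal sketch of how the every-minute job could be wired up with pg_cron and pg_net on Supabase. The job name is taken from the commit list below; the edge function URL, endpoint name, and the way the service-role key is read are assumptions, not the exact migration in this PR.

-- Hypothetical sketch: invoke the "move data to storage" edge function every minute.
select cron.schedule(
  'tes_invoke_edge_function_move_data_to_storage',  -- job name (per the commit list below)
  '* * * * *',                                      -- every minute
  $$
  select net.http_post(
    url     := 'https://<project-ref>.supabase.co/functions/v1/schedule_data_moving',  -- assumed endpoint
    headers := jsonb_build_object(
      'Content-Type',  'application/json',
      'Authorization', 'Bearer ' || current_setting('app.settings.service_role_key')   -- assumed secret plumbing
    ),
    body    := '{}'::jsonb
  );
  $$
);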


vercel bot commented Dec 30, 2024

@ri72miieop is attempting to deploy a commit to the theexgenesis' projects Team on Vercel.

A member of the Team first needs to authorize it.

const filename = `${record.originator_id}/${record.timestamp}__${record.type}.json`;

// Upload file with explicit error handling
// (the diff is truncated here; the continuation below is an assumed completion)
const { error: uploadError } = await supabase.storage
  .from('twitter_api_files')
  .upload(filename, JSON.stringify(record.data), { contentType: 'application/json' });
if (uploadError) throw uploadError;
TheExGenesis (Owner):

does this make a massive number of files? would it be better to save batches?

ri72miieop (Contributor, Author):

I am already processing them in batches (default of 150); as far as I can tell there's no batching API for uploading the data, right?

(I may have misunderstood what you meant by batches... if so, please let me know.)

TheExGenesis (Owner):

I meant saving a whole batch in a single file, what do you think?

ri72miieop (Contributor, Author):

I think that would be worse. This way there is one file per record, and all the data for that record is contained in the file for later access. Imagine we need to reprocess all files: we would simply go through all folders and grab the latest item from each; if we batched them, we would need to pick and choose the records from the files we grabbed. Reprocessing a specific record is also straightforward: just use the ID to get the proper folder and pick the latest item.

We end up with more files this way, but I am not seeing any negative consequences, and I do see some positive ones:

  • if we want to move files to a "cold storage" system we could move everything except the latest item per folder (in case it needs to be reprocessed); this layout makes that easy and straightforward
  • downloading all the most recent responses (for example, to build an updated dev DB) would cost less traffic
  • it is easy to find the most recent record(s)
  • when we get a file we have all the data related to it at that specific point in time, so we don't need to do any processing over the file

If I am missing any negative consequence or limitation, please let me know though!

Note: just to be clear, a single file will contain several entities (tweet, user, media, URLs, etc.) related to that item.
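
To make the "grab the latest item from each folder" idea concrete, here is a hypothetical query against Supabase's storage.objects metadata table. The bucket name comes from this PR; the query itself is an illustration, not code from the change.

-- Hypothetical sketch: most recent stored response per originator.
-- Object names follow the "<originator_id>/<timestamp>__<type>.json" scheme above.
SELECT DISTINCT ON (split_part(name, '/', 1))
       split_part(name, '/', 1) AS originator_id,
       name,
       created_at
FROM storage.objects
WHERE bucket_id = 'twitter_api_files'
ORDER BY split_part(name, '/', 1), created_at DESC;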

ALTER TABLE public.tweets  -- table name assumed from the trigger below; the diff is truncated here
ADD COLUMN IF NOT EXISTS updated_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP;

DROP TRIGGER IF EXISTS update_tweets_updated_at ON public.tweets;
CREATE TRIGGER update_tweets_updated_at
BEFORE UPDATE ON public.tweets
FOR EACH ROW EXECUTE FUNCTION update_updated_at_column();  -- function name assumed; see sketch below
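
For context, the function such a trigger executes is typically a few lines of PL/pgSQL. This is a generic sketch; the actual name and schema in the migration may differ (the commit list below mentions a duplicated version of this function being removed).

-- Generic updated_at trigger function (names are assumptions).
CREATE OR REPLACE FUNCTION update_updated_at_column()
RETURNS trigger AS $$
BEGIN
  NEW.updated_at = CURRENT_TIMESTAMP;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;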
TheExGenesis (Owner):

did you check this doesn't slow things down when inserting a large archive?

ri72miieop (Contributor, Author):

It would be nice to have these kinds of benchmarks... I did NOT check, but it's a simple trigger that updates a column; it shouldn't have much of an effect, right?
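
A quick way to get such a benchmark would be to bulk-load a sample archive twice, once with the trigger disabled, in a throwaway environment. This is a hypothetical sketch (psql's \timing plus standard ALTER TABLE ... DISABLE TRIGGER), not something run for this PR.

-- Hypothetical micro-benchmark sketch (do not run in production).
\timing on
ALTER TABLE public.tweets DISABLE TRIGGER update_tweets_updated_at;
-- bulk-load the sample archive here and note the elapsed time
ALTER TABLE public.tweets ENABLE TRIGGER update_tweets_updated_at;
-- bulk-load the same sample again and compare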

TheExGenesis (Owner):

Yeah, I wouldn't expect it to be too bad, but I've been surprised by how slow RLS makes things, for example.




CREATE OR REPLACE FUNCTION private.tes_process_account_records()
TheExGenesis (Owner):

Until now, accounts has included only rows from people who uploaded, and it includes stats calculations for each row in the table; this changes things. Any chance we could have a scraped_account table, or twitter_user, at least until we figure out how to redesign it?

ri72miieop (Contributor, Author):

Sure, I can import them into a different table for now, but tweets need a valid account_id to be inserted, so if I import the accounts into a different table I can't also import into the tweets table. Maybe this is a point we should talk about on Monday.
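
The constraint behind this is presumably a foreign key along these lines (a sketch; the actual constraint and column names in the schema may differ):

-- Hypothetical sketch of the constraint that ties tweets to accounts.
ALTER TABLE public.tweets
  ADD CONSTRAINT tweets_account_id_fkey
  FOREIGN KEY (account_id) REFERENCES public.account (account_id);

Inserting scraped accounts into a separate table would leave tweets.account_id pointing at rows that don't exist in public.account, so those tweet inserts would fail.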

TheExGenesis (Owner):

Ah right, I'll think about it, let's figure it out on Monday

(A comment here was marked as outdated.)

supabase/functions/process_temporary_data/index.ts (outdated review thread, resolved)
ri72miieop and others added 14 commits January 4, 2025 20:09
…_to_storage

removed duplicated function to update updated_at column
changed "tes_invoke_edge_function_process_temporary_data" to "tes_invoke_edge_function_move_data_to_storage" to be clearer
throw exception on schedule_data_moving when invocation isn't successful
removed tables account and profile and changed them into a view that only shows data from people who uploaded to the community archive
…into TES

* 'TES' of https://github.com/ri72miieop/community-archive:
  db: make idempotent
  db: comment remote schema
  db: unique rls names
  db: uniquerls policy names
  db: if exists
  db: comment out first rls mig
  db: revert global activity summary to update correctly
…into TES

# Conflicts:
#	supabase/migrations/20241204170257_TES_send_data_to_community_archive.sql

vercel bot commented Jan 21, 2025

The latest updates on your projects:

Name: community-archive | Status: ✅ Ready | Updated (UTC): Jan 22, 2025 6:21pm




CREATE OR REPLACE FUNCTION private.tes_process_account_records()

(Another comment on this line was marked as outdated.)

Resolved review threads: sql/permissions/02_table_permissions.sql, sql/tables/01_all_account.sql
TheExGenesis merged commit fab5af5 into TheExGenesis:main on Jan 24, 2025. 3 checks passed.
TheExGenesis added a commit that referenced this pull request Jan 26, 2025
Fixes and improvements of PR "Store twitter API responses #194"