
Analysis page slowed down by get_all_datasets() #391

Open

stijn-uva opened this issue Apr 9, 2020 · 1 comment

stijn-uva (Member) commented Apr 9, 2020

All pages that include analysis/common/functions.php run get_all_datasets() on load. This function runs a number of queries to retrieve statistics about all datasets in TCAT, notably the number of tweets (COUNT(t.id)).

For very large query bins (50 million+ tweets), a COUNT(*) query can take quite a while to complete (over a minute in some worst-case scenarios I've seen), which greatly slows down the page. This would not be a huge issue if the result could be cached, but for active bins the count changes on every scrape, so it cannot be cached properly. As a result the analysis page becomes very hard to use, since virtually everything on it is slowed down.

In this scenario it would be better to cache the number of tweets another way. Since the tweets tables are only manipulated in a limited number of places in the code, this statistic (and possibly others) could be stored in another table or as a column in tcat_query_bins, and incremented or decremented as tweets are inserted or deleted. This would require a small amount of extra processing while inserting/deleting tweets but would make for a much improved user experience. It could also be kept up to date automatically with MySQL triggers, as sketched below.
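A minimal sketch of what the trigger variant could look like. The num_tweets column, the example_bin bin and its example_bin_tweets table are all placeholders rather than actual TCAT schema, and it is assumed the bin name lives in a querybin column on tcat_query_bins:

-- Hypothetical counter column; not part of the current schema
ALTER TABLE tcat_query_bins ADD COLUMN num_tweets BIGINT NOT NULL DEFAULT 0;

-- One-off backfill so the counter starts from the real count
UPDATE tcat_query_bins
SET num_tweets = (SELECT count(id) FROM example_bin_tweets)
WHERE querybin = 'example_bin';

-- Keep the counter in sync on insert/delete (one pair of triggers per bin)
CREATE TRIGGER example_bin_count_ins AFTER INSERT ON example_bin_tweets
FOR EACH ROW UPDATE tcat_query_bins
    SET num_tweets = num_tweets + 1 WHERE querybin = 'example_bin';

CREATE TRIGGER example_bin_count_del AFTER DELETE ON example_bin_tweets
FOR EACH ROW UPDATE tcat_query_bins
    SET num_tweets = num_tweets - 1 WHERE querybin = 'example_bin';

Since the triggers fire per row, the counter stays exact through scrapes and purges; the trade-off is one extra UPDATE on tcat_query_bins per inserted or deleted tweet.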

Alternatively, one could use the approximate row count returned by e.g. SHOW TABLE STATUS, but this is only an estimate (for InnoDB tables it can be off by a significant margin) and may give misleading results.
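For reference, a sketch of that approach (placeholder table name; the Rows column holds the estimate):

SHOW TABLE STATUS LIKE 'example_bin_tweets';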

niczem self-assigned this Apr 16, 2020
xmacex (Contributor) commented Feb 8, 2021

Running plain SQL directly on the database takes no time at all,

SELECT count(id)
FROM a_juicy_collection_of_tasty_tweets;

yields

+-----------+
| count(id) |
+-----------+
|  31736732 |
+-----------+
1 row in set (0.00 sec)

Or alternatively from the metadata

SELECT table_rows
FROM information_schema.tables
WHERE table_name = 'a_juicy_collection_of_tasty_tweets';

yields

+------------+
| TABLE_ROWS |
+------------+
|   31732985 |
+------------+
1 row in set (0.00 sec)

Similarly, fetching min(created_at) and max(created_at) takes no time at all.
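For example, against the same table as above:

SELECT min(created_at), max(created_at)
FROM a_juicy_collection_of_tasty_tweets;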

(Just some observations, in case somebody (myself?) wants to refactor get_all_datasets() or its replacement for this purpose.)
