
Analysis page slowed down by get_all_datasets() #391

Open

stijn-uva opened this issue Apr 9, 2020 · 1 comment

stijn-uva (Member) commented Apr 9, 2020

All pages that include analysis/common/functions.php run get_all_datasets() on load. This function runs a number of queries to retrieve statistics about all datasets in TCAT, notably the number of tweets (COUNT(t.id)).

For very large query bins (50 million+ tweets), a COUNT(*) query can take quite a while to complete (over a minute in some worst-case scenarios I've seen), which greatly slows down the page. This would not be a huge issue if the result could be cached, but for active bins the count changes on every scrape, so it cannot be cached properly. As a result the analysis page becomes very hard to use, since virtually everything on it is slowed down.

In this scenario it would be better to cache the number of tweets another way. Since the tweets tables are only manipulated in a limited number of places in the code, this statistic (and possibly others) could be stored in another table or as a column in tcat_query_bins, and incremented or decremented as tweets are inserted or deleted. This would require a small amount of extra processing while inserting/deleting tweets but would make for a much improved user experience. It could also be kept up to date automatically with MySQL triggers, as sketched below.
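A minimal sketch of what the trigger variant could look like. The num_tweets column, the example_bin bin and its example_bin_tweets table are all placeholders rather than actual TCAT schema, and it is assumed the bin name lives in a querybin column on tcat_query_bins:

-- Hypothetical counter column; not part of the current schema
ALTER TABLE tcat_query_bins ADD COLUMN num_tweets BIGINT NOT NULL DEFAULT 0;

-- One-off backfill so the counter starts from the real count
UPDATE tcat_query_bins
SET num_tweets = (SELECT count(id) FROM example_bin_tweets)
WHERE querybin = 'example_bin';

-- Keep the counter in sync on insert/delete (one pair of triggers per bin)
CREATE TRIGGER example_bin_count_ins AFTER INSERT ON example_bin_tweets
FOR EACH ROW UPDATE tcat_query_bins
    SET num_tweets = num_tweets + 1 WHERE querybin = 'example_bin';

CREATE TRIGGER example_bin_count_del AFTER DELETE ON example_bin_tweets
FOR EACH ROW UPDATE tcat_query_bins
    SET num_tweets = num_tweets - 1 WHERE querybin = 'example_bin';

Since the triggers fire per row, the counter stays exact through scrapes and purges; the trade-off is one extra UPDATE on tcat_query_bins per inserted or deleted tweet.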

Alternatively, one could use the approximate row count returned by e.g. SHOW TABLE STATUS, but this is only an estimate (for InnoDB tables it can be off by a significant margin) and may give misleading results.
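For reference, a sketch of that approach (placeholder table name; the Rows column holds the estimate):

SHOW TABLE STATUS LIKE 'example_bin_tweets';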

niczem self-assigned this Apr 16, 2020
xmacex (Contributor) commented Feb 8, 2021

Running plain SQL directly on the database takes no time at all,

SELECT count(id)
FROM a_juicy_collection_of_tasty_tweets;

yields

+-----------+
| count(id) |
+-----------+
|  31736732 |
+-----------+
1 row in set (0.00 sec)

Or alternatively from the metadata

SELECT table_rows
FROM information_schema.tables
WHERE table_name = 'a_juicy_collection_of_tasty_tweets';

yields

+------------+
| TABLE_ROWS |
+------------+
|   31732985 |
+------------+
1 row in set (0.00 sec)

Similarly, fetching min(created_at) and max(created_at) takes no time at all.
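For example, against the same table as above:

SELECT min(created_at), max(created_at)
FROM a_juicy_collection_of_tasty_tweets;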

(Just some observations, in case somebody (myself?) wants to refactor get_all_datasets() or its replacement for this purpose.)
