
Inconsistent results when scraping the same timerange, place, keywords multiple times #428

Open
annika-stechemesser opened this issue Aug 29, 2022 · 11 comments


@annika-stechemesser

Hello,

I used gtrendsR to scrape two search terms combined with an "or" connection (covid+corona) in one place (geocode US-CT-533) for the time range 2020-04-01 to 2021-07-01. Here is my line of code:

local_trends <- gtrends(keyword = "covid+corona", geo = local_geo, time = "2020-04-01 2021-07-01")$interest_over_time

I ran it multiple times and noticed that every time I got different results (see plot below). How is this possible, given that none of the parameters changed and the time range is in the past? Also, none of the versions I got with gtrends exactly matches the data I see in the browser when I put these inputs into the search.

Can you explain what is going on here and advise me on how to get the correct data?

Thanks very much!

[attached plot: differing time series across runs]

@eddelbuettel
Collaborator

eddelbuettel commented Aug 29, 2022

I think we have seen this before, and it is explained as "well, they reserve the right to answer that way", since what we hit is not a fully defined API :-/ Maybe Google subsamples, and you found a query that shows that? Edit: Never mind!

But I better let @PMassicotte chime in...

@PMassicotte
Owner

That is strange; I cannot reproduce the problem on my side.

library(gtrendsR)
library(ggplot2)

l <- list()

v <- 1:6
for (i in v) {
  df <- gtrends(keyword = "covid+corona", geo = "US", time = "2020-04-01 2021-07-01")$interest_over_time
  df$run <- paste("Run#", i)
  l[[i]] <- df
}

df <- do.call(rbind, l)

ggplot(df, aes(x = date, y = hits, color = run)) +
  geom_line()

Created on 2022-08-29 with reprex v2.0.2

@PMassicotte
Owner

New try using the exact same geo code as yours:

library(gtrendsR)
library(ggplot2)

l <- list()

v <- 1:6
for (i in v) {
  df <- gtrends(keyword = "covid+corona", geo = "US-CT-533", time = "2020-04-01 2021-07-01")$interest_over_time
  df$run <- paste("Run#", i)
  l[[i]] <- df
}

df <- do.call(rbind, l)

ggplot(df, aes(x = date, y = hits, color = run)) +
  geom_line()

Created on 2022-08-29 with reprex v2.0.2

@PMassicotte
Owner

@annika-stechemesser Can you try my code and see if you have the same results?

@annika-stechemesser
Author

annika-stechemesser commented Aug 29, 2022

If I run your code and loop through multiple scrapes without waiting, the results match up, although my graph looks slightly different from yours, for example. I ran my various scrapes with a larger time delay between them; maybe that's it? I will try to run them spread out over a few hours and see what I get. Thanks a lot for the help!

[attached plot: runs without delay match each other]
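For reference, the spread-out version of that loop can be sketched like this (the delay_minutes value is just an illustrative choice, not something prescribed by gtrendsR):

```r
library(gtrendsR)

# Query the same keyword/geo/time window several times, waiting between
# requests, to check whether Google Trends serves a different sample later.
delay_minutes <- 60  # illustrative spacing; adjust as needed
runs <- list()

for (i in 1:3) {
  runs[[i]] <- gtrends(
    keyword = "covid+corona",
    geo = "US-CT-533",
    time = "2020-04-01 2021-07-01"
  )$interest_over_time
  if (i < 3) Sys.sleep(delay_minutes * 60)  # Sys.sleep() takes seconds
}

# If any pair of runs differs, a new sample was served in between.
identical(runs[[1]]$hits, runs[[2]]$hits)
```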

@JBleher
Contributor

JBleher commented Aug 29, 2022

Google provides the following information: https://support.google.com/trends/answer/4365533?hl=en
According to Google, there are two types of samples one can access:

  1. “Real-time data is a sample covering the last seven days.”
  2. “Non-real-time data is a separate sample from real-time data and goes as far back as 2004 and up to 36 hours before your search.”
Appendix B of https://www.sciencedirect.com/science/article/abs/pii/S2452306221001210 may be an interesting read as well.

@JBleher
Contributor

JBleher commented Aug 29, 2022

Also the medium article by Simon Rogers is telling: https://medium.com/google-news-lab/what-is-google-trends-data-and-what-does-it-mean-b48f07342ee8

Our hypothesis is that samples from the full Google Trends dataset are not retaken for each query. However, we suspect that the sample taken from the full dataset could be based on an in-memory database somewhere on a Google Trends server instance, so that queries to Google Trends can be processed faster. If different IP addresses are routed to different instances, there might be different in-memory samples that give different results. Also, if instances are shut down, renewed, or the routing of traffic changes, the in-memory database may have to be resampled from the full Google Trends dataset.
We therefore assume that the result from Google Trends does not depend on the IP address per se. More precisely, we think it depends on the instance your query is routed to. This would also explain the inconsistencies in Google Trends data reported across time by Behnen, Kessler, Kruse, Schoenmakers, Zerr, and Gómez (2020), since in modern cloud services, instances are scaled up and down dynamically, depending on traffic.

@annika-stechemesser
Author

Thank you @JBleher, these comments have been really helpful. Running the code with a ~24h break gave different time series (see below). The same run in a non-delayed loop still gives the same data. I am not sure what to do with that statistically, but it does not seem to be a problem with gtrendsR. Thank you!

[attached plot: runs ~24h apart give different time series]

@JBleher
Contributor

JBleher commented Aug 31, 2022

On a positive note, the time series you are querying seems to be calculated from enough search volume that the variation induced by different samples is rather small.

@annika-stechemesser
Author

Do you have any advice on how to force being served a new sample? I tried changing my IP address and deleting cookies manually in the browser, but none of that has worked so far. I would just like to see the full range of variation for my request but am pretty unclear on how to get it... Does the cookie-URL parameter have anything to do with it? Thank you!

@JBleher
Contributor

JBleher commented Sep 1, 2022

You may be able to use different servers in different locations. Lists of free proxy servers that you could use can be found on the internet, or you could use the Tor network. However, be aware that some of these servers may be used by other people to circumvent rate limits, so you will still need to slow down your requests and have some try/catch logic to handle potentially empty data sets...
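That retry logic might be sketched as below. The proxy line is an assumption on my part: gtrendsR exposes setHandleParameters() for routing through a proxy, and the host/port values here are placeholders, not real servers. The fetch_trends() helper is likewise just an illustrative name.

```r
library(gtrendsR)

# Optional: route requests through a proxy (placeholder host/port).
# setHandleParameters(proxyhost = "proxy.example.com", proxyport = 8080)

fetch_trends <- function(keyword, geo, time, retries = 3, wait_s = 60) {
  for (attempt in seq_len(retries)) {
    res <- tryCatch(
      gtrends(keyword = keyword, geo = geo, time = time)$interest_over_time,
      error = function(e) NULL
    )
    # Retry on errors or empty results (e.g. after hitting a rate limit),
    # pausing between attempts to avoid hammering the endpoint.
    if (!is.null(res) && nrow(res) > 0) return(res)
    Sys.sleep(wait_s)
  }
  stop("No data returned after ", retries, " attempts")
}

df <- fetch_trends("covid+corona", "US-CT-533", "2020-04-01 2021-07-01")
```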
