This issue concerns the following notebook: https://github.com/sailuh/perceive/blob/master/Crawlers/Twitter/cve_twitter_extraction.ipynb
Saving as JSON
Currently, neither the `Stream` code block nor the `Search` code block stores any of the data objects on disk; the tweets live only in the variable `searched_tweets_srch`. We need to address this so that the tweets are stored as JSON files. Add a new code block just below that does the following:

* Add a `save_path` variable with the path to a folder where the contents will be saved.
* Use that variable as input to a new function you will create, named `save_tweets_json(save_path)`, which stores each tweet as a .json file. For example, if you download 200 tweets, the folder should then contain 200 files (see the sketch after this list).
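For concreteness, here is a minimal sketch of what the new code block could look like. It assumes the tweets are Tweepy `Status` objects (or raw JSON dicts) passed in explicitly, so the extra `tweets` parameter and the example `save_path` value are my assumptions, not something already in the notebook; the file name convention is the one discussed below:

```python
import json
import os

# Hypothetical folder; adjust to wherever the tweets should live.
save_path = "data/tweets/"

def save_tweets_json(save_path, tweets):
    """Store each tweet as its own .json file under save_path."""
    os.makedirs(save_path, exist_ok=True)
    for tweet in tweets:
        # Tweepy Status objects keep the raw API payload in ._json;
        # fall back to the object itself if it is already a dict.
        data = tweet._json if hasattr(tweet, "_json") else tweet
        # Naming convention discussed in the Data Model section:
        # <screen_name>_<id>.json
        file_name = "{}_{}.json".format(data["user"]["screen_name"], data["id"])
        with open(os.path.join(save_path, file_name), "w") as out:
            json.dump(data, out)

# e.g. save_tweets_json(save_path, searched_tweets_srch)
```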
To save the files, we need to adopt a file naming convention. The name is usually best chosen as an ID that we can easily trace back to the original tweet, just by pasting it into a URL and viewing the tweet in the browser. Which brings us to the next task:
Data Model
In order to pick the desired file name, we need to understand the data model. Currently, the notebook does not explain the data model, and only hints at it in passing (Ctrl+F `URL 1` and `URL 2`). You should add the necessary references to the notebook and explain overall what the API looks like with respect to our interests. For example, upon a quick look at the data you provided me, this appears to be the format of each tweet. Consider the existing tweet we discussed previously:
https://twitter.com/patrickwardle/status/912254053849079808
It appears the URL format is of the form `/<user>/status/<some_id>`.
Upon a quick inspection of the example above, it seems the fields `'screen_name': 'exposurebball'` and `'id': 962589493164433415` could be "plugged into" `<user>` and `<some_id>`. Indeed, if you try to do so, you will be able to open the said tweet in your browser: https://twitter.com/youngbloodelite/status/962589493164433415
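To make the reconstruction concrete, a tiny helper along these lines (the name `tweet_url` is hypothetical, and it assumes each tweet is a raw JSON dict with `screen_name` nested under the `user` object) would rebuild the browser URL from the stored fields:

```python
def tweet_url(tweet):
    # Rebuild the browser URL from a raw tweet dict (hypothetical helper);
    # assumes 'screen_name' is nested under the 'user' object.
    return "https://twitter.com/{}/status/{}".format(
        tweet["user"]["screen_name"], tweet["id"])

# e.g. tweet_url(data) -> 'https://twitter.com/.../status/962589493164433415'
```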
However, I am not yet clear whether this "reconstruction process" would work for *all* the tweets we download. You should assess whether this is possible, perhaps by trying to understand what the `status` word in the URL means and checking whether all of our tweets are "statuses". Depending on your findings, it may suffice to name each file in the form `<screen_name>_<id>` to disambiguate the files and avoid storing duplicate tweets.
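One possible way to carry out that assessment, sketched under the same raw-JSON-dict assumption (the helper name is mine), is to count how many downloaded tweets lack the two fields the URL needs:

```python
def count_unreconstructable(tweets):
    # Tweets missing either field cannot be mapped back to a /status/ URL.
    missing = [t for t in tweets
               if "id" not in t or "screen_name" not in t.get("user", {})]
    print("{} of {} tweets cannot be reconstructed".format(
        len(missing), len(tweets)))
    return missing
```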
Misc
Please follow up on this issue here instead of on Slack, now that it has been formalized. Make sure you read through and fully understand how to submit pull requests in the format we use in this repo: https://github.com/sailuh/perceive/blob/master/CONTRIBUTING.md