---
title: "`{tweetio}`"
output:
  github_document:
    html_preview: true
    toc: true
    toc_depth: 2
  html_document:
    keep_md: yes
always_allow_html: yes
editor_options:
  chunk_output_type: console
---
<!-- README.Rmd generates README.md. -->
```{r, echo=FALSE}
knitr::opts_chunk$set(
  # collapse = TRUE,
  fig.align = "center",
  comment = "#>",
  fig.path = "man/figures/",
  message = FALSE,
  warning = FALSE
)
options(width = 150)
```
<!-- badges: start -->
[![Gitter](https://badges.gitter.im/tweetio/community.svg)](https://gitter.im/tweetio/community?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge)
[![Lifecycle](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://www.tidyverse.org/lifecycle/#experimental)
[![AppVeyor build status](https://ci.appveyor.com/api/projects/status/github/knapply/tweetio?branch=master&svg=true)](https://ci.appveyor.com/project/knapply/tweetio)
[![Travis-CI Build Status](https://travis-ci.org/knapply/tweetio.svg?branch=master)](https://travis-ci.org/knapply/tweetio)
[![Codecov test coverage](https://codecov.io/gh/knapply/tweetio/branch/master/graph/badge.svg)](https://codecov.io/gh/knapply/tweetio?branch=master)
[![GitHub last commit](https://img.shields.io/github/last-commit/knapply/tweetio.svg)](https://github.com/knapply/tweetio/commits/master)
[![License: GPL v3](https://img.shields.io/badge/License-GPLv3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0)
[![Depends](https://img.shields.io/badge/Depends-GNU_R>=3.3-blue.svg)](https://www.r-project.org/)
[![CRAN status](https://www.r-pkg.org/badges/version/tweetio)](https://cran.r-project.org/package=tweetio)
[![GitHub code size in bytes](https://img.shields.io/github/languages/code-size/knapply/tweetio.svg)](https://github.com/knapply/tweetio)
<!-- [![HitCount](http://hits.dwyl.io/knapply/tweetio.svg)](http://hits.dwyl.io/knapply/tweetio) -->
<!-- badges: end -->
<!-- [![R build status](https://github.com/knapply/tweetio/workflows/R-CMD-check/badge.svg)](https://github.com/knapply/tweetio/actions?workflow=R-CMD-check) -->
# Introduction
`{tweetio}`'s goal is to enable safe, efficient I/O and transformation of Twitter data. Whether the data came from the Twitter API, a database dump, or some other source, `{tweetio}`'s job is to get them into R and ready for analysis.
`{tweetio}` is __not__ a competitor to [`{rtweet}`](https://rtweet.info/): it is not interested in collecting Twitter data. That said, it definitely attempts to complement `{rtweet}` by emulating its data frame schema because...
1. It's incredibly easy to use.
2. It's far more efficient to analyze than a key-value structure mirroring the raw data.
3. It'd be a waste not to maximize compatibility with tools built specifically around `{rtweet}`'s data frames.
# Installation
You'll need a C++ compiler. If you're using Windows, that means [Rtools](https://cran.r-project.org/bin/windows/Rtools/).
```{r, eval=FALSE}
if (!requireNamespace("remotes", quietly = TRUE)) install.packages("remotes")
remotes::install_github("knapply/tweetio")
```
# Usage
```{r}
library(tweetio)
```
`{tweetio}` uses [`{data.table}`](https://rdatatable.gitlab.io/data.table/) internally for performance and stability reasons, but if you're a [`{tidyverse}`](https://www.tidyverse.org/) fan who's accustomed to dealing with `tibble`s, you can set an option so that `tibble`s are _always_ returned.
Because `tibble`s have an incredibly informative and user-friendly `print()` method, we'll set the option for examples. Note that if the `{tibble}` package is not installed, this option is ignored.
```{r}
options(tweetio.as_tibble = TRUE)
```
You can check on all available `{tweetio}` options using `tweetio_options()`.
```{r}
tweetio_options()
```
<!-- # What's New? -->
<!-- ## Easy Access to Twitter-disclosed Information Operations Archives -->
<!-- ```{r} -->
<!-- io_campaign_metadata -->
<!-- ``` -->
## Simple Example
First, we'll save a stream of tweets using `rtweet::stream_tweets()`.
```{r}
temp_file <- tempfile(fileext = ".json")
rtweet::stream_tweets(timeout = 15, parse = FALSE,
                      file_name = temp_file)
```
We can then pass the file path to `tweetio::read_tweets()` to efficiently parse the data into an `{rtweet}`-style data frame.
```{r}
tiny_rtweet_stream <- read_tweets(temp_file)
tiny_rtweet_stream
```
## Performance
`rtweet::parse_stream()` is totally sufficient for smaller files (as long as the returned data are valid JSON), but `tweetio::read_tweets()` is _much_ faster.
```{r}
small_rtweet_stream <- "inst/example-data/api-stream-small.json.gz"
res <- bench::mark(
  rtweet  = rtweet::parse_stream(small_rtweet_stream),
  tweetio = tweetio::read_tweets(small_rtweet_stream),
  check = FALSE,
  filter_gc = FALSE
)
res[, 1:9]
```
With bigger files, using `rtweet::parse_stream()` is no longer realistic, especially if any of the JSON lines are invalid.
```{r, echo=FALSE, eval=FALSE}
# big_tweet_stream_path <- "~/ufc-tweet-stream.json.gz"
#
# raw_lines <- readLines(big_tweet_stream_path)
# valid_lines <- purrr::map_lgl(raw_lines,
# ~ jsonify::validate_json(.x) && jsonlite::validate(.x))
# writeLines(raw_lines[valid_lines][1:100000], "inst/example-data/ufc-tweet-stream.json")
# R.utils::gzip("inst/example-data/ufc-tweet-stream.json",
# "inst/example-data/ufc-tweet-stream.json.gz",
# remove = FALSE, overwrite = TRUE)
```
```{r}
big_tweet_stream_path <- "inst/example-data/ufc-tweet-stream.json.gz"
temp_file <- tempfile(fileext = ".json")
R.utils::gunzip(big_tweet_stream_path, destname = temp_file, remove = FALSE)
c(`compressed MB`   = file.size(big_tweet_stream_path) / 1e6,
  `decompressed MB` = file.size(temp_file) / 1e6)
```
```{r}
res <- bench::mark(
  rtweet  = rtweet_df  <- rtweet::parse_stream(big_tweet_stream_path),
  tweetio = tweetio_df <- tweetio::read_tweets(big_tweet_stream_path),
  filter_gc = FALSE,
  check = FALSE,
  iterations = 1
)
res[, 1:9]
```
Not only is `tweetio::read_tweets()` more efficient in both time and memory, it also successfully parses much more of the data.
```{r}
`rownames<-`(
  vapply(list(tweetio_df = tweetio_df, rtweet_df = rtweet_df), dim, integer(2L)),
  c("nrow", "ncol")
)
```
## Data Dumps
A common practice for handling social media data at scale is to store them in search engine databases like Elasticsearch, but it's (unfortunately) possible that you'll need to work with data dumps.
I've encountered two flavors of these dumps (which may arrive as gzip files or ZIP archives):
1. .jsonl: newline-delimited JSON
2. .json: the complete contents of a database dump packed in a JSON array
Working with these dumps directly has three unfortunate consequences:
1. Packages that were purpose-built to work directly with `{rtweet}`'s data frames can't play along with your data.
2. You're going to waste most of your time (and memory) getting data into R that you're not going to use.
3. The data are _very_ tedious to restructure in R (lists of lists of lists of lists of lists...).
`{tweetio}` solves this by parsing everything and building the data frames at the C++ level, including handling GZIP files and ZIP archives for you.
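Both flavors go through the same entry point. A minimal sketch, assuming hypothetical dump files (the paths below are placeholders):

```{r, eval=FALSE}
# hypothetical file paths; compression is handled transparently
dump_df <- read_tweets("data-dump.jsonl.gz") # newline-delimited JSON
dump_df <- read_tweets("data-dump.json")     # JSON array dump
```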
# Spatial Tweets
If you have `{sf}` installed, you can use `as_tweet_sf()` to keep only those tweets that contain valid bounding box polygons or points.
```{r}
tweet_sf <- as_tweet_sf(tweetio_df)
tweet_sf[, "geometry"]
```
There are currently four columns that can potentially hold spatial geometries:
1. `"bbox_coords"`
2. `"quoted_bbox_coords"`
3. `"retweet_bbox_coords"`
4. `"geo_coords"`
You can select which one is used to build your `sf` object via the `geom_col=` parameter (default: `"bbox_coords"`).
```{r}
as_tweet_sf(tweetio_df,
            geom_col = "quoted_bbox_coords")[, "geometry"]
```
You can also build _all_ the supported bounding boxes by setting `geom_col=` to `"all"`.
```{r}
all_bboxes <- as_tweet_sf(tweetio_df, geom_col = "all")
all_bboxes[, c("which_geom", "geometry")]
```
From there, you can easily use the data like any other `{sf}` object.
```{r, fig.width=12, fig.height=8}
library(ggplot2)

world <- rnaturalearth::ne_countries(returnclass = "sf")
world <- world[world$continent != "Antarctica", ]

ggplot(all_bboxes) +
  geom_sf(fill = "white", color = "lightgray", data = world) +
  geom_sf(aes(fill = which_geom, color = which_geom),
          alpha = 0.15, size = 1, show.legend = TRUE) +
  coord_sf(crs = 3857) +
  scale_fill_viridis_d() +
  scale_color_viridis_d() +
  theme(legend.title = element_blank(), legend.position = "top",
        panel.background = element_rect(fill = "#daf3ff"))
```
# Tweet Networks
If you want to analyze tweet networks and have `{igraph}` or `{network}` installed, you can get started immediately using `tweetio::as_tweet_igraph()` or `tweetio::as_tweet_network()`.
```{r}
tweet_df <- tweetio_df[1:1e4, ]
as_tweet_igraph(tweet_df)
as_tweet_network(tweet_df)
```
If you want to take advantage of all the metadata available, you can set `all_status_data` and/or `all_user_data` to `TRUE`.
```{r}
as_tweet_igraph(tweet_df,
                all_user_data = TRUE, all_status_data = TRUE)

as_tweet_network(tweet_df,
                 all_user_data = TRUE, all_status_data = TRUE)
```
## Two-Mode Networks
You can also build two-mode networks by setting `target_class=` to `"hashtag"`, `"url"`, or `"media"`.
* Returned `<igraph>`s will be set as bipartite following `{igraph}`'s convention of a `logical` vertex attribute specifying each partition. Accounts are always `TRUE`.
* Returned `<network>`s will be set as bipartite following `{network}`'s convention of ordering the "actors" first, and setting the network-level attribute of "bipartite" as the number of "actors". Accounts are always the "actors".
If bipartite, the returned objects are always set as undirected.
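As a quick sanity check of these conventions (a sketch, assuming `{igraph}` and `{network}` are installed):

```{r, eval=FALSE}
g <- as_tweet_igraph(tweet_df, target_class = "hashtag")
igraph::is_bipartite(g)   # TRUE: vertices carry a logical `type` attribute
table(igraph::V(g)$type)  # accounts are TRUE, hashtags are FALSE

net <- as_tweet_network(tweet_df, target_class = "hashtag")
network::get.network.attribute(net, "bipartite") # the number of "actors" (accounts)
```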
### Users to Hashtags
```{r}
as_tweet_igraph(tweet_df, target_class = "hashtag")
as_tweet_network(tweet_df, target_class = "hashtag")
```
### Users to URLs
```{r}
as_tweet_igraph(tweet_df, target_class = "url")
as_tweet_network(tweet_df, target_class = "url")
```
### Users to Media
```{r}
as_tweet_igraph(tweet_df, target_class = "media")
as_tweet_network(tweet_df, target_class = "media")
```
## `<proto_net>`
You're not stuck going directly to `<igraph>`s or `<network>`s, though. Under the hood, `as_tweet_igraph()` and `as_tweet_network()` use `as_proto_net()` to build a `<proto_net>`: a list of edge and node data frames.
```{r}
as_proto_net(tweetio_df,
             all_status_data = TRUE, all_user_data = TRUE)
```
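From there you can hand the pieces to any graph tooling you like. A sketch, assuming the `<proto_net>`'s elements are named `edges` and `nodes` (check `names()` on your object first):

```{r, eval=FALSE}
pn <- as_proto_net(tweet_df)
names(pn) # inspect the list's structure

# assemble an igraph yourself from the edge and node data frames
g <- igraph::graph_from_data_frame(pn$edges, vertices = pn$nodes)
```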
# Progress
### Supported Data Inputs
- [x] Twitter API streams: .json, .json.gz
- [x] API to Elasticsearch data dump (JSON Array): .json, .json.gz
- [x] API to Elasticsearch data dump (line-delimited JSON): .jsonl, .jsonl.gz
### Supported Data Outputs
- [x] CSV
- [x] Excel
- [x] Gephi-friendly GraphML
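If you'd rather write these files yourself, general-purpose packages work too. A DIY sketch (not `{tweetio}`'s own writers; flat formats can't hold list columns, so drop them first):

```{r, eval=FALSE}
flat_df <- as.data.frame(tweetio_df)
flat_df <- flat_df[, !vapply(flat_df, is.list, logical(1))] # drop list columns

data.table::fwrite(flat_df, "tweets.csv")   # CSV
writexl::write_xlsx(flat_df, "tweets.xlsx") # Excel

g <- as_tweet_igraph(tweetio_df)
igraph::write_graph(g, "tweets.graphml", format = "graphml") # open in Gephi
```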
### Structures
- [x] `{rtweet}`-style data frames
- [x] Spatial Tweets via `{sf}`
- [x] Tweet networks via `{igraph}`
- [x] Tweet networks via `{network}`
# Shout Outs
The [`{rtweet}`](https://rtweet.info/) package __spoils R users rotten__, in the best possible way. The underlying data carpentry is so seamless that the user doesn't need to know anything about the horrors of Twitter data, which is pretty amazing. If you use `{rtweet}`, you probably owe [Michael Kearney](https://twitter.com/kearneymw) some [citations](https://github.com/mkearney/rtweet_citations).
`{tweetio}` uses a combination of C++ via [`{Rcpp}`](http://www.rcpp.org/), the [`rapidjson`](http://rapidjson.org/) C++ library (made available by [`{rapidjsonr}`](https://cran.r-project.org/web/packages/rapidjsonr/index.html)), [`{jsonify}`](https://cran.r-project.org/web/packages/jsonify/index.html) for an R-level interface to `rapidjson`, [`{RcppProgress}`](https://cran.r-project.org/web/packages/RcppProgress/index.html), and __R's not-so-secret super weapon__: [`{data.table}`](https://rdatatable.gitlab.io/data.table/).
Major inspiration was taken from [`{ndjson}`](https://gitlab.com/hrbrmstr/ndjson), particularly its use of [`Gzstream`](https://www.cs.unc.edu/Research/compgeom/gzstream/).
# Environment
```{r}
sessionInfo()
```