-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathrstudioconf-wordcloud.qmd
120 lines (85 loc) · 3.86 KB
/
rstudioconf-wordcloud.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
---
title: "#rstudioconf2022 Twitter wordcloud"
author: "Haley Fox"
date: August 5, 2022
format: html
editor: visual
---
### Objective
Scrape tweets using #rstudioconf2022 using the Twitter API and `rtweet`. Create a wordcloud from those tweets using `wordcloud`.
Load libraries and read in data
```{r}
#| label: load-packages
#| message: false
if (!require(librarian)) {
install.packages("librarian")
library(librarian)
}
librarian::shelf(tidyverse,
rtweet,
wordcloud,
RColorBrewer)
```
### Pull tweets with the hastag #rstudioconf2022.
Getting authorized to use the Twitter API is typically fairly simple. Here is a great [tutorial](https://cran.r-project.org/web/packages/rtweet/vignettes/auth.html){target="_blank"} about setting up a Twitter development account and connecting to the Twitter API in RStudio.
```{r}
#| label: read-tweets
#| code-overflow: wrap
auth_as("wordle-auth") #this is an old authorization I had previously setup
# pull last 1,000 tweets with rstudioconf2022, do not include retweets, and only include tweets in English
tweets <- search_tweets(q = "#rstudioconf2022", n = 5000, include_rts = FALSE, lang = "en")
# view data
head(tweets$text)
# clean data
tweets_clean <- tweets %>%
mutate(text = gsub('\\b\\w{1,3}\\b', '', text)) %>% #remove words only 1-3 letters long
mutate(text = gsub("https\\S*", "", text)) %>% #remove links
mutate(text = gsub("@\\S*", "", text)) %>% #remove @ symbol
mutate(text = gsub("amp", "", text)) %>% #remove &
mutate(text = gsub("[\r\n]", "", text)) %>% #remove when jumping to new line
mutate(text = gsub("[[:punct:]]", "", text)) #remove emojis
```
Format tweets so each word is separated.
```{r}
#| label: read-tweets
#| code-overflow: wrap
# separate tweets into individual words
tweets_list <- strsplit(tweets_clean$text, split = " ")
# convert list to dataframe
words_df <- data.frame(matrix(unlist(tweets_list), nrow=1399, byrow=TRUE),stringsAsFactors=FALSE)
```
Create frequency table for words.
```{r}
#| label: freq-table
# convert dataframe from wide to long format
words_col <- words_df %>%
pivot_longer(c(1:23)) %>%
select(-name) %>%
rename(word = value)
# make lowercase
words_col$word <- tolower(words_col$word)
# create frequency table
words_freq <- plyr::count(words_col, 'word')
# remove empty space still included with high frequency (will mess up wordcloud because size and color are based on frequency)
words_freq <- words_freq %>%
filter(freq < 5000)
# remove some words that aren't very informative so they don't show up in the wordcloud
words_freq1 <- words_freq %>%
filter(!word %in% c("this","that","with","about","from","like","they","their","sure","would","through","even","when","only","some","also","been","into","after","next","will","what","before","over","have","your","were","being","during","such","there","then","much","very","because","other","then","just","really","something","these"))
```
Create wordcloud number 1 - not great visualization because "rstudioconf" and "rstats" are at such a greater frequency than the other words.
```{r}
#| label: word-cloud-1
wordcloud(words = words_freq1$word, freq = words_freq1$freq, min.freq = 20, max.words = 200, random.order=FALSE, rot.per=0.35, colors=brewer.pal(8, "Dark2"))
```
Change frequencies and make a new wordcloud that's much better!
```{r}
#| label: word-cloud-2
# change frequencies so anthing over 1000 is set to 150, anything from 400 - 1000 is set to 125, and anything from 100 - 400 is set to 110.
words_freq2 <- words_freq1 %>%
mutate(new_freq = ifelse(freq > 1000, 150,
ifelse(freq > 400 & freq < 1000, 125,
ifelse(freq > 100 & freq < 400, 110, freq))))
# make wordcloud
wordcloud(words = words_freq2$word, freq = words_freq2$new_freq, min.freq = 20, max.words = 200, random.order=FALSE, rot.per=0.35, scale=c(4,.09), colors=brewer.pal(8, "Dark2"))
```