rstudioconf-wordcloud.qmd

---
title: "#rstudioconf2022 Twitter wordcloud"
author: "Haley Fox"
date: August 5, 2022
format: html
editor: visual
---

### Objective
Scrape tweets using #rstudioconf2022 using the Twitter API and `rtweet`. Create a wordcloud from those tweets using `wordcloud`.

Load libraries and read in data

```{r}
#| label: load-packages
#| message: false

if (!require(librarian)) {
  install.packages("librarian")
  library(librarian)
}
librarian::shelf(tidyverse,
                 rtweet,
                 wordcloud,
                 RColorBrewer)
```


### Pull tweets with the hastag #rstudioconf2022.

Getting authorized to use the Twitter API is typically fairly simple. Here is a great [tutorial](https://cran.r-project.org/web/packages/rtweet/vignettes/auth.html){target="_blank"} about setting up a Twitter development account and connecting to the Twitter API in RStudio.

```{r}
#| label: read-tweets
#| code-overflow: wrap

auth_as("wordle-auth") #this is an old authorization I had previously setup

# pull last 1,000 tweets with rstudioconf2022, do not include retweets, and only include tweets in English
tweets <- search_tweets(q = "#rstudioconf2022", n = 5000, include_rts = FALSE, lang = "en")

# view data
head(tweets$text)

# clean data
tweets_clean <- tweets %>% 
  mutate(text = gsub('\\b\\w{1,3}\\b', '', text)) %>% #remove words only 1-3 letters long
  mutate(text = gsub("https\\S*", "", text)) %>% #remove links
  mutate(text = gsub("@\\S*", "", text)) %>%  #remove @ symbol
  mutate(text = gsub("amp", "", text)) %>% #remove &
  mutate(text = gsub("[\r\n]", "", text)) %>%  #remove when jumping to new line
  mutate(text = gsub("[[:punct:]]", "", text)) #remove emojis

```


Format tweets so each word is separated.

```{r}
#| label: read-tweets
#| code-overflow: wrap

# separate tweets into individual words
tweets_list <- strsplit(tweets_clean$text, split = " ")

# convert list to dataframe
words_df <- data.frame(matrix(unlist(tweets_list), nrow=1399, byrow=TRUE),stringsAsFactors=FALSE)
```


Create frequency table for words.

```{r}
#| label: freq-table

# convert dataframe from wide to long format
words_col <- words_df %>% 
  pivot_longer(c(1:23)) %>% 
  select(-name) %>% 
  rename(word = value)

# make lowercase
words_col$word <- tolower(words_col$word)

# create frequency table
words_freq <- plyr::count(words_col, 'word')

# remove empty space still included with high frequency (will mess up wordcloud because size and color are based on frequency)
words_freq <- words_freq %>% 
  filter(freq < 5000)

# remove some words that aren't very informative so they don't show up in the wordcloud
words_freq1 <- words_freq %>% 
  filter(!word %in% c("this","that","with","about","from","like","they","their","sure","would","through","even","when","only","some","also","been","into","after","next","will","what","before","over","have","your","were","being","during","such","there","then","much","very","because","other","then","just","really","something","these"))
```


Create wordcloud number 1 - not great visualization because "rstudioconf" and "rstats" are at such a greater frequency than the other words.

```{r}
#| label: word-cloud-1
wordcloud(words = words_freq1$word, freq = words_freq1$freq, min.freq = 20, max.words = 200, random.order=FALSE, rot.per=0.35, colors=brewer.pal(8, "Dark2"))
```


Change frequencies and make a new wordcloud that's much better!

```{r}
#| label: word-cloud-2

# change frequencies so anthing over 1000 is set to 150, anything from 400 - 1000 is set to 125, and anything from 100 - 400 is set to 110.
words_freq2 <- words_freq1 %>% 
  mutate(new_freq = ifelse(freq > 1000, 150,
         ifelse(freq > 400 & freq < 1000, 125,
         ifelse(freq > 100 & freq < 400, 110, freq))))

# make wordcloud
wordcloud(words = words_freq2$word, freq = words_freq2$new_freq, min.freq = 20, max.words = 200, random.order=FALSE, rot.per=0.35, scale=c(4,.09),  colors=brewer.pal(8, "Dark2"))
```