This script attaches state and region data to each tweet, a prerequisite for any analysis of the regional distribution of terms. It also tabulates how many tweets each CSV file contains per search term, which characterizes the data being analyzed and may help explain the eventual geographic distribution.
After running the setup code and the code that creates the join_tweets_to_states function, simply replace the names of my CSV files, such as “plurals” in the “Second Person Plurals” section, with the names of your own. For join_tweets_to_states to work, you will need an API key from the Census Bureau, as mentioned in the previous section.
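If you have not yet stored a key, tidycensus can write it to your .Renviron in a single step; a minimal sketch (the key string is a placeholder):

library(tidycensus)
census_api_key("YOUR-CENSUS-API-KEY", install = TRUE) # saves the key to .Renviron for future sessions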
# Script-specific options or packages
library(skimr)
join_tweets_to_states <- function(tweets) {
  # Takes a data frame result from an rtweet::search_tweets() query
  # and maps the tweets with lat/lng values to US states
  #
  # Requires a US Census API Key: https://api.census.gov/data/key_signup.html
  # that is added to the .Renviron file with tidycensus::census_api_key()
  library(tidyverse, quietly = TRUE) # data manipulation
  library(tidycensus)                # us state geometries
  library(sf)                        # spatial joins and coordinate handling

  # Get the spatial geometries for each US state/territory
  states_sf <-
    get_acs(geography = "state",        # states
            variables = "B01003_001",   # total population
            geometry = TRUE) %>%        # spatial geometry
    select(state_name = NAME, geometry) # only keep state name and geometry columns

  # Align the CRS for tweets to the census geometries
  tweets_sf <-
    tweets %>%                         # original data frame
    filter(lat != "") %>%              # only keep tweets with lat/lng values
    st_as_sf(coords = c("lng", "lat"), # map x/long, y/lat values to coords
             crs = st_crs(states_sf))  # align coordinate reference system

  # Map tweet coords to census state geometries
  tweets_states_sf <- st_join(tweets_sf, states_sf)

  tweets_states_df <-
    tweets_states_sf %>%     # spatial features object
    as_tibble() %>%          # convert to tibble (remove spatial features)
    filter(state_name != "") # keep only US states/territories

  return(tweets_states_df) # return the final data frame
}
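As an illustration of how the function is meant to be called once one of the datasets below has been read in (the plurals_states name is hypothetical):

plurals_states <- join_tweets_to_states(plurals) # attach state names to each geotagged tweet
plurals_states %>%
  count(state_name, search_term, sort = TRUE) # tally terms by state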
plurals <- read_csv(file = "../data/original/twitter/plurals.csv") # read dataset from disk
## Warning: One or more parsing issues, see `problems()` for details
plurals %>%
  count(search_term, sort = TRUE) # preview search term counts
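To see exactly what triggered the parsing warning above, readr’s problems() function reports the affected rows and columns; for example:

problems(plurals) # rows and columns readr could not parse as expected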
sale <- read_csv(file = "../data/original/twitter/sale.csv") # read dataset from disk
## Warning: One or more parsing issues, see `problems()` for details
sale %>%
  count(search_term, sort = TRUE) # preview
shoes <- read_csv(file = "../data/original/twitter/shoes.csv") # read dataset from disk
shoes %>%
  count(search_term, sort = TRUE) # preview
tonix <- read_csv(file = "../data/original/twitter/tonix.csv") # read dataset from disk
tonix %>%
  count(search_term, sort = TRUE) # preview
roads <- read_csv(file = "../data/original/twitter/roads.csv") # read dataset from disk
roads %>%
  count(search_term, sort = TRUE) # preview
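Since skimr is loaded in the setup chunk, a quick structural overview of any of these datasets is also available; a sketch:

skim(roads) # column types, missing values, and summary statistics per variable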
Running this script outputs the number of tweets gathered for each search term and ensures that the census information can be used throughout the rest of the project.