Data visualization / Visualisation de données

Tidy Tuesdays

Below, you will find some of the graphs (and accompanying code) I made as part of the R4DS weekly visualization challenges. All of the entries I made for #TidyTuesday can be found on my Github page.

25 February 2020

On this week’s TidyTuesday challenge, we worked with data on vaccination rates across the USA. I decided to make a graph of vaccination rates by county. See below for the final product. Start by loading the necessary packages and the measles data.

# packages
library(tidyverse)
library(pacman)
library(tidyverse)

# data
measles <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-25/measles.csv')

Let’s explore the data.

summary(measles$mmr) # distribution of vaccination rates
summary(measles$overall) # distribution of vaccination rates

weird <- measles %>% 
  filter(overall == -1 | mmr == -1)# who are the "-1" ? Apparently they're missing data.
table(weird$state) # some have missing data for "mmr", some have missing data for "overall" -- we will use this later

length(unique(measles$county)) # how many unique counties?
table(measles$year, useNA = "always") # which years?

Then, we need to do some data clean-up.

measles_plot_viz <- measles %>% 
  distinct(state, name, county, overall, mmr, lat, lng, enroll) %>% # remove duplicates
  mutate(drop = ifelse(mmr >= 0 & overall >= 0, "drop", "keep")) %>% 
  filter(lng < 0) %>% # remove schools outside of the continent
  gather(type, value, overall:mmr) %>% 
  filter(value >= 0) %>% 
  filter(!(drop == "drop" & type == "overall"))

Finally, let’s draw the map! I will be using the maps package to do so. I had help from Kieran Healy’s book during this step, as well as Timo Grossenbacher’s blog.

# load packages
library(maps)
library(mapproj)

# 1. load the county data
us_counties <- map_data("county") %>% 
  mutate(state_county = str_c(region, subregion, sep = "_")) 
head(us_counties)

# 2. check the distribution of vacc. rate
summary(measles_plot_viz$value)

# 3. draw a plain county map to see if it's working
p <- ggplot(mapping = aes(x = long, y = lat, group = group), data = us_counties)
p1 <- p + geom_polygon(color = "white", size = 0.1) +
  coord_map(projection = "albers", lat0 = 39, lat1 = 45) +
  guides(fill = FALSE) +
  theme_minimal()

## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.

p1

Now that we know the map is working at the vaccination data makes sense, we can plot vaccination rates across the U.S. But first, for the map’s caption, we need to check which type of vaccination rate is plotted in which state:

table(measles_plot_viz$type, measles_plot_viz$state)

##          
##           Arizona Arkansas California Colorado Connecticut Florida Idaho
##   mmr        1171      567      14068     1501         586       0     0
##   overall       0        0          0        0           0    2667   467
##          
##           Illinois  Iowa Maine Massachusetts Michigan Minnesota Missouri
##   mmr         7115     0   357           946        0      1570      645
##   overall        0  1163     0             0     2351         0        0
##          
##           Montana New Jersey New York North Carolina North Dakota  Ohio
##   mmr         520          0     4137              0          360  2919
##   overall       0       1993        0           2083            0     0
##          
##           Oklahoma Oregon Pennsylvania Rhode Island South Dakota Tennessee
##   mmr            0    806         1930            0          373         0
##   overall     1052      0            0          215            0      1152
##          
##           Texas  Utah Vermont Washington Wisconsin
##   mmr       810   603     255       1977         0
##   overall     0     0       0          0      2497

Now we can plot:

# 4. draw map with vacc. data
p <- ggplot()
p1 <- p + 
  geom_polygon(mapping = aes(x = long, y = lat, group = group), 
               data = us_counties, 
               fill = "#f5f5f2", 
               color = "black", 
               size = 0.1) + # add county lines
  geom_point(mapping = aes(x = lng, y = lat, color = value), 
             data = subset(measles_plot_viz, value > 50), 
             size = 0.6) + # add school dots
  coord_map(projection = "albers", lat0 = 39, lat1 = 45) + # make the map more "curvy"
  theme_void() + # remove grid and axes
  theme(legend.position = "bottom",
        plot.title = element_text(hjust = 0.5, 
                                  vjust = 0.5, 
                                  size = 15),
        plot.background = element_rect(fill = "#f5f5f2", 
                                       color = NA),
        plot.caption = element_text(hjust = 0.5, 
                                    size = 10),
        plot.margin = margin(t = 1, r = 1, b = 1, l = 1, 
                             unit = "cm"),
        legend.margin = margin(t = 2, r = 2, b = 5, l = 2, 
                               unit = "mm")) + # modify plot title/caption/background/legend
  scale_color_viridis_c(option = "plasma", 
                        direction = -1, 
                        name = "Vaccination rate",
                        guide = guide_legend(direction = "horizontal",
                                             title.position = "top",
                                             title.hjust = 0.5)) + # change legend title
  labs(title = "Schools' Vaccination Rate in 31 U.S. States",
       caption = "Measles, Mumps, and Rubella (MMR) vaccination\nrate shown in FL, ID, IA, MI, NJ, NC, OK, RI, TN, WI.\nElsewhere, the overall vaccination rate is shown.") # deifne title and subtitle
p1

# save plot using ggsave(filename = "name.png", plot = last_plot(), width = 15, height = 15)

9 June 2020

For this week,s TidyTuesday, I create a graph that reports the names and achievements of African-Americans in the Arts and Entertainment. I chose a bar chart, where each bar corresponds to a decade of African-American accomplishments. Start by loading the necessary packages and the data on African-American firsts.

library(tidyverse)
library(extrafont)

firsts <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-06-09/firsts.csv')

# Let's explore some variables 
table(firsts$category)

Now, we will recode the data. We will focus on the person variable, which contains the information on who the achiever is, and what they’ve accomplished. Once we’ve cleaned up the person variable, we will create a categorical variable for the decade of achievement, and collapse everyone (and their accomplishment) from the same decade together into one observation. The last thing we will do is insert a new line every 20 characters. If we do not insert the new line, the text will run too wide in the graph.

dat <- firsts %>% 
  mutate(person_sh = str_remove_all(person, "\\(.*\\)"), # remove everything in parenthesis
         person_sh = str_remove_all(person, "\\(.*"), # remove everything in parenthesis
         person_sh = str_remove_all(person_sh, "\\[.*\\]"), # remove brackets and their content
         person_sh = str_trim(person_sh, "right"), # remove white space at the end of names
         person_complete = str_c(person_sh, ", ", accomplishment), # create complete name + accomplishment title
         word_count = str_count(person_complete), # count number of words in complete names
         count = 1,
         person_complete = str_replace_all(person_complete, "African-American", "Af-Am"), # shorten for visualization purposes
         twenty_yr = ifelse(year <= 1780, '1760s-70s', 
                            ifelse(year >= 1781 & year <= 1800, "1780s-90s",
                                          ifelse(year >= 1801 & year <= 1820, "1800s-10s",
                                                 ifelse(year >= 1821 & year <= 1840, "1820s-30s",
                                                        ifelse(year >= 1841 & year <= 1860, "1840s-50s",
                                                               ifelse(year >= 1861 & year <= 1880, "1860s-70s",
                                                                      ifelse(year >= 1881 & year <= 1900, "1880s-90s",
                                                                             ifelse(year >= 1901 & year <= 1920, "1900s-10s",
                                                                                    ifelse(year >= 1921 & year <= 1940, "1920s-30s",
                                                                                           ifelse(year >= 1941 & year <= 1960, "1940s-50s",
                                                                                                  ifelse(year >= 1961 & year <= 1980, "1960s-70s",
                                                                                                         ifelse(year >= 1981 & year <= 2000, "1980s-90s", "2000s-10s"))))))))))))) %>% # create categorical variable for twenty years 
  group_by(twenty_yr, category) %>% 
  mutate(persons_20 = paste0(person_complete, collapse = "; ")) %>% # collapse strings for each twenty years and category
  ungroup() %>% 
  group_by(category) %>%
  mutate(count_cat = cumsum(count)) %>%
  ungroup() %>% 
  distinct(twenty_yr, count_cat, category, persons_20) %>% 
  group_by(twenty_yr, category) %>% 
  arrange(count_cat) %>% 
  filter(row_number() == n()) 

dat$persons_20 <- gsub("(.{15,}?)\\s", "\\1\n", dat$persons_20) # insert new line every 15 characters

Let’s make the graph! I was inspired by G. Karamanis’s work to do this graph. You can check out his work here. I decided to plot accomplishments in the Arts and Entertainment only, but you could do the same with science, education, sports, etc.

# plot, Arts only
viz <- dat %>% filter(category == "Arts & Entertainment")

p <- ggplot(viz, aes(x = twenty_yr, y = count_cat))
p1 <- p + 
  scale_y_continuous(name = "", 
                     limits = c(-700,700)) +
  scale_x_discrete(name = "",
                   expand = c(0.2, 0)) +
  annotate("text", x = viz$twenty_yr, # add names 
           y = viz$count_cat, 
           label = viz$persons_20, 
           size = 2.5, 
           color = "white",
           family = "Arial Narrow") +
  annotate("text", x = "1760s-70s", # add title 
           y = 400, 
           label = "First\nAfrican Americans\nin Arts & Entertainment", 
           size = 15, 
           hjust = 0,
           color = "white",
           family = "Andale Mono") +
  theme_void() +
  coord_cartesian(clip = "off") +
  theme(plot.background = element_rect(fill = 'grey11', colour = 'grey11'),
        plot.margin = margin(200, 40, 100, 30), # play with margins
        axis.text.x = element_text(colour = "gray69", # modify x axis 
                                   size = 15,
                                   family = "Andale Mono"))
p1