Webscraping

Using dplyr and ggplot for Visualization

library(ggplot2)
library(stringr)
library(forcats)
library(gganimate)
library(dplyr)
library(tidytext)
library(plotly)
library(huxtable)

I am still getting used to the idea of a few of these functions so if this seems all over the place, there is a reason for that!

I came across this post on r/dataisbeautiful and I was instantly convinced to give it a shot!

The method I used was a little bit convoluted and hacky. What I am going to show here is how to read data off of a website and then show you how the data looks, and how to plot it in the way we would like.

Further down the line when I am more comfortable with the complex, I can make a more detailed tutorial.

For now, we will be downloading the file hosted here

# First read in the csv 
doctorwho <- 
  read.csv("doctorwho_ratings.csv")
head(doctorwho)
Season EpisodeNumber Doctor Title Ratings Quality
1 1 Ninth Rose 7.5 Good
1 2 Ninth The End of the World 7.6 Good
1 3 Ninth The Unquiet Dead 7.5 Good
1 4 Ninth Aliens of London 7   Regular
1 5 Ninth World War Three 7   Regular
1 6 Ninth Dalek 8.7 Great

From our data set we can see that we have information on the following:

  • Season Number

  • Episode Number

  • Episode Title

  • IMDb User Rating

  • Categorical Rating

We can do a few things with this data. First, lets see which doctor has the most episodes.

doctorwho %>% 
  group_by(Doctor) %>% 
  count(sort=T) |> 
  hux() |> 
  theme_article()
Doctor n
Eleventh 55
Tenth 42
Twelfth 27
Thirteenth 22
Ninth 13

The eleventh doctor wins out here!

Next it might be fun to see which season had the highest rating and then graph that.

# These are the custom colors we will be using for our ratings
dwcolor <- 
  c("Bad"="red2", "Garbage" ="dodgerblue2", 
    "Great" ="greenyellow", "Regular"="darkorange1", 
    "Good" ="gold1")

doctorwho |> 
  group_by(Season) |> 
  ggplot(aes(as.factor(Season),Ratings)) + 
  geom_point(aes(color = Quality),
             alpha = .2) +
  stat_summary(
    fun = "mean",
    geom = "point"
  ) + 
  theme_minimal() +
  theme(legend.position = "none",
        axis.ticks = element_blank()) +
  labs(x="\nSeason",y="Rating\n",
       title="Doctor Who Average Season Rating\n",
       caption="(based on IMDb User Ratings)") +
  theme(
    plot.caption = 
      element_text(
      face = "italic"),
    plot.title = 
      element_text(
      face = "bold")
  ) +
  #scale_color_manual(values = dwcolor)
  scale_color_discrete()

The takeaway from this is that the last two seasons were rated as the worst with Season 4 being rated as the best.

Another question might be which Doctor has the highest rated episodes?

Luckily we have a column with categorical information that tells us what ratings correspond to “Great”.

Here is the system used: < 5.0 = Garbage < 6.5 = Bad < 7.5 = Regular < 8.5 = Good < 10 = Great

So we could either filter(Ratings>8.5) or we could filter(Quality=="Great").

Grouping by Doctor ensures that the information of who the Doctor was that received that rating.

doctorwho %>% 
  group_by(Doctor,Ratings) %>% 
  filter(Quality=="Great") %>% 
  group_by(Doctor) %>% 
  count() %>% 
  ggplot(aes(fct_reorder(Doctor,n),n,fill=Doctor,label = n)) +
    geom_bar(stat="identity") +
    theme_minimal()+
  coord_flip() +
    theme(
      axis.ticks = element_blank(),
      legend.position = "none",
      plot.caption = element_text(face = "italic"),
      axis.text.x = element_blank(),
      axis.text.y = element_text(face = "bold")
      ) +
    labs(x="",y="",
         title="Number of 'Great' Episodes: Per Doctor\n",
         captions="(based on IMDb User Ratings)") +
  scale_fill_viridis_d(alpha = .6) +
  geom_label(color = "black") 

To be honest, no surprise here either!

Lastly, let’s construct a heatmap of all of the episodes of Doctor Who.

=

# Plotly is a package add-on that will make our graph interactive
# This is how it can be specially formatted
doctorwho_pltly <- 
  doctorwho %>% 
  mutate(text=
           paste0("Season: ", Season, "\n",
                  "Episode: ", EpisodeNumber,"\n",
                  "Title: ", Title, "\n",
                  "Doctor: ",Doctor))

library(ggtext)
p <- 
  doctorwho %>% 
  ggplot(aes(factor(Season),EpisodeNumber,fill=Quality)) +
    geom_tile(colour="black")+
    geom_text(aes(label=Ratings))+
    scale_fill_manual(values=dwcolor)+
  theme_minimal()+
  theme(axis.ticks = element_blank(),
        plot.subtitle = element_markdown()) +
    labs(x="Season\n",
         y="\nEpisode",
         title="Doctor Who IMDb Ratings: 2005-2020",
         subtitle = "Ratings of the <span style='color:#0072B2;'>good</span>
         and <span style='color:#009E73;'>bad</span>")
  
ggplotly(p)