DailyR rvest web scraping
Dear Friends, extracting data from the web is an important skill to have in data science. R provides many packages to ‘scrape’ data. In this post, I use the rvest package to scrape data on the top Premier League scorers from a BBC site.
I’m a huge Liverpool fan and want to check out how teams and players are doing. First, browse the BBC website and inspect the URL. Then use your browser’s inspect feature to examine the page and find the appropriate XPath for the data.
- Use read_html and html_nodes to scrape the data
- Use strsplit to separate the features of each player’s stats
- Use data.table to organize the data
- Use plot_ly to visualize the results
Check out this Video for Step By Step Instructions
Scrape the Data
knitr::opts_chunk$set(echo = TRUE)
library(rvest)

url <- "http://www.bbc.com/sport/football/premier-league/top-scorers" # website to scrape
x_path <- '//*[@id="top-scorers"]/ol' # xpath of the list of top scorers

website <- read_html(url)
top_scorers <- website %>%
  html_nodes(xpath = x_path) %>%
  html_text() # text scraped from website

substring(top_scorers, 1, 400) # inspect first 400 characters
## [1] " Mohamed Salah Liverpool 148 mins per goal 3256 mins played 22 Goals scored 8 Assists Shots on targetTotal 62% 64 104 Pierre-Emerick Aubameyang Arsenal 124 mins per goal 2731 mins played 22 Goals scored 5 Assists Shots on targetTotal 56% 40 72 Sadio Mané Liverpool 140 mins per goal 3085 mins played 22 Goals scored 1 Assists "
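To experiment with the xpath without hitting the live site, the same calls can be run against a small HTML string. This is only a sketch: the markup below is hypothetical and merely imitates an ordered list inside an element with the id top-scorers; the real BBC page may be structured differently.

```r
library(rvest)

# Hypothetical markup for illustration only; not copied from the BBC page.
html <- '<html><body>
  <div id="top-scorers">
    <ol>
      <li>Mohamed Salah Liverpool 22 Goals scored</li>
      <li>Pierre-Emerick Aubameyang Arsenal 22 Goals scored</li>
    </ol>
  </div>
</body></html>'

page <- read_html(html) # read_html() also accepts a literal HTML string

# Same xpath as in the post: the <ol> inside the element with id "top-scorers"
whole_list <- page %>%
  html_nodes(xpath = '//*[@id="top-scorers"]/ol') %>%
  html_text()

# Going one level deeper returns one string per list item
players <- page %>%
  html_nodes(xpath = '//*[@id="top-scorers"]/ol/li') %>%
  html_text()

length(whole_list) # 1: the whole list comes back as a single string
length(players)    # 2: one string per player
```

Selecting the list items directly is one possible way to avoid splitting a single long string afterwards, but it only works if the page really uses one item per player.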
Place the Data in a Data.Table
The data.table package is a great tool for working with data. Check out my post here for further details. Let’s wrangle this data into something that makes sense and is easy to visualize.
library(data.table)
library(pander)

# Note: the separator strings passed to strsplit() must match the runs of
# whitespace that html_text() actually returns; adjust them to the raw text.
top_scorers <- strsplit(top_scorers, " ") # Use the space marker to split the data near player names
top_scorers <- data.table(name = top_scorers[[1]]) # place the results in a data.table
top_scorers$team <- sapply(top_scorers$name, function(x) unlist(strsplit(x, " "))[2]) # use the smaller space marker to split near team names
top_scorers$name <- sapply(top_scorers$name, function(x) unlist(strsplit(x, " "))[1]) # clean up name column, remove everything after the space marker

digits <- sapply(seq_along(top_scorers$team), function(x) {
  as.numeric(unlist(strsplit(gsub("[^\\d ]+", " ", top_scorers$team[x], perl = TRUE), " ")))
}) # extract all the numerical data from the text
digits <- unlist(digits) # turns the per-player vectors into a single vector
digits <- digits[!is.na(digits)] # removes NAs
dim(digits) <- c(7, 25) # reshapes the single vector into a matrix with 7 x 25 dimensions
digits <- data.table(t(digits)) # transpose and convert the matrix into a data.table
colnames(digits) <- c("minutes_per_goal", "minutes_played", "goals_scored", "assists",
                      "shots_on_target_percentage", "shots_on_target", "shot_attempts") # column headers

top_scorers$team <- sapply(top_scorers$team, function(x) unlist(strsplit(x, " "))[1]) # clean up team column, remove everything after the space marker
top_scorers <- cbind(top_scorers, digits) # combine the data.tables
pander(top_scorers[1:5,]) # check out the data for the first 5 players
| name | team | minutes_per_goal | minutes_played |
|---|---|---|---|
| Sergio Agüero | Man City | 118 | 2479 |
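The number-extraction step is easier to follow on a couple of records in isolation. The sketch below uses two strings copied from the scraped output shown earlier and base R only; extract_numbers is a hypothetical helper name introduced here for illustration, not something from the original code.

```r
# Two sample records taken from the scraped text shown earlier
records <- c(
  "Liverpool 148 mins per goal 3256 mins played 22 Goals scored 8 Assists Shots on targetTotal 62% 64 104",
  "Arsenal 124 mins per goal 2731 mins played 22 Goals scored 5 Assists Shots on targetTotal 56% 40 72"
)

# Hypothetical helper: keep only the numbers found in one record
extract_numbers <- function(x) {
  cleaned <- gsub("[^\\d ]+", " ", x, perl = TRUE) # replace each run of non-digit, non-space characters with a space
  tokens <- unlist(strsplit(cleaned, " "))         # split on single spaces, which leaves some empty strings
  as.numeric(tokens[tokens != ""])                 # drop the empty strings and convert to numbers
}

# sapply() returns a 7 x 2 matrix here (one column per record), so transpose it
stats <- t(sapply(records, extract_numbers, USE.NAMES = FALSE))
colnames(stats) <- c("minutes_per_goal", "minutes_played", "goals_scored", "assists",
                     "shots_on_target_percentage", "shots_on_target", "shot_attempts")
stats
```

Each record yields exactly seven numbers, which is why the full pipeline can reshape the flattened vector into seven columns. If a record on the live page ever has a missing stat, that reshape would silently misalign, so checking the length per player first is a sensible precaution.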
Plot the Data
OK, now that we wrangled the data into a data.table, let’s look at the data briefly with a chart. Plotly is a great package that enables users to interact with the chart. Let’s check it out.
library(plotly) # uber web-based interactive graphing tools

top_scorers$team <- as.factor(top_scorers$team) # make teams as.factor

p <- plot_ly(top_scorers, # the data.table
             x = ~ minutes_per_goal,
             y = ~ goals_scored,
             z = ~ assists,
             color = ~ team) %>% # color the markers by team
  add_markers() %>%
  layout(scene = list(xaxis = list(title = 'Minutes per Goal'),
                      yaxis = list(title = 'Goals Scored'),
                      zaxis = list(title = 'Assists')))
p