If you have eight hours to cut down a tree, it is best to spend six hours sharpening your axe and then two hours cutting down the tree.
Anonymous, on the benefits of having good tools
A long time ago, I started my day by opening a tab with three websites: Dilbert, Non Sequitur, and a third comic strip I've forgotten at the moment. It was a great start to the day, because usually at least one of the strips made me laugh, or at least smile.
And it's a great way to develop a different perspective, to see the world a little differently.
I no longer start my days this way, but I still like those strips. And recently I noticed that I had not looked at them in a while.
Unfortunately, it can be hard to pick up where you left off. Have I already seen that strip? Even if you download the images, they are often not named by date. So I decided to simply download the strips again.
Instead of going through each page and downloading the strip manually, I decided to write an R script. While R is usually used for statistics, it is surprisingly flexible. And after working out the structure of the Dilbert pages, it was possible to, well, write a script that downloads each comic strip and provides me with a table containing the date, the image URL, the description text of the strip, and its tags.
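The core idea, before the full script: each strip lives at a predictable URL ending in its date, and the page contains a container div whose attributes hold the image URL, description, and tags. A minimal sketch of that extraction step (the date here is just an example; the selector mirrors the one used in the script below):

library(rvest)

# Fetch a single strip page, addressed by its date (example date).
strip_date <- "2015-08-24"
page <- read_html(paste0("https://dilbert.com/strip/", strip_date))

# The container div carries the interesting data as attributes.
container <- html_nodes(page, paste0('div[class="comic-item-container js_comic_container_', strip_date, '"]'))
image_url <- paste0("http:", html_attr(container, "data-image"))
tags <- html_attr(container, "data-tags")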
The full script, below, is rather shoddy, given that it's my first attempt, but it works. Mostly. The random delays between requests let the script run for a while, but occasionally the server cuts the connection, so you might have to restart it.
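If the dropped connections become annoying, one possible refinement (a sketch I have not battle-tested; read_html_with_retry is a hypothetical helper, not part of the script below) is to wrap the page fetch in tryCatch and retry after a longer pause:

library(rvest)

# Sketch: retry a fetch a few times before giving up.
read_html_with_retry <- function(url, attempts = 3) {
  for (i in seq_len(attempts)) {
    result <- tryCatch(read_html(url), error = function(e) NULL)
    if (!is.null(result)) return(result)
    Sys.sleep(30)  # give the server a breather before retrying
  }
  stop("Could not fetch: ", url)
}

The read_html() call in the main loop could then go through this wrapper instead.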
Still, it’s much, much more efficient than downloading the strips by hand.
library(rvest)
library(xml2)
library(dplyr)
library(readr)      # needed for read_csv()
library(lubridate)
library(beepr)

# Example strip URL: https://dilbert.com/strip/2019-12-14

# Create one subdirectory per year (run once):
# for(i in 1989:2019) {
#   dir.create(paste0("parse/comics/", i))
# }

# First run: initialize the date range and an empty results table.
# startdate <- as_date("1989-04-16")
# stopdate <- as_date("1989-05-16")
# resultsTable <- tribble(~rDate, ~publishDate, ~imageURL, ~description, ~tags)
# names(resultsTable)
# resultsTable

# Later runs: continue from the previously exported table.
continueTableData <- read_csv("parse/dataexport/resultsTable.csv")
continueTableData <- continueTableData %>% select(-X1)  # drop the row-number column added by write.csv()
startdate <- max(continueTableData$rDate) + 1           # resume one day after the last saved strip
# stopdate <- as_date("1990-06-30") # next go yearly
resultsTable <- continueTableData

# https://dilbert.com/strip/1990-05-10
# startdate <- max(resultsTable$rDate) + 1

# Scrape in batches of 30 days, saving the table after each batch.
stopdate <- startdate + 30
# finalstopdate <- startdate + 365 # as_date("1991-12-31")
finalstopdate <- as_date("2019-12-16")
# startdate <- as_date("2019-02-27")

run <- TRUE
while(run) {
  while(startdate <= stopdate) {
    print(startdate)
    htmlSource <- read_html(paste0("https://dilbert.com/strip/", startdate))

    # The container div holds the image URL, description, and tags as attributes.
    divName <- paste0('div[class="comic-item-container js_comic_container_', startdate, '"]')
    imageURL <- htmlSource %>% html_nodes(divName) %>% html_attr("data-image")
    # imageURL <- html_nodes('div[class="comic-item-container js_comic_container_2015-08-24"]') %>% html_attr("data-image")
    imageURL <- paste0("http:", imageURL)
    print(imageURL)

    resultsTable <- resultsTable %>% add_row(
      rDate       = as.character(startdate),
      publishDate = htmlSource %>% html_nodes('meta[property="article:publish_date"]') %>% html_attr("content"),
      imageURL    = imageURL,
      description = htmlSource %>% html_nodes(divName) %>% html_attr("data-description"),
      tags        = htmlSource %>% html_nodes(divName) %>% html_attr("data-tags")
    )

    # Save the image into the year subdirectory, named by date.
    download.file(imageURL,
                  paste0("parse/comics/", substr(startdate, 1, 4), "/dilbert-", startdate, ".jpg"),
                  mode = 'wb')

    startdate <- startdate + 1
    Sys.sleep(sample(2:4, 1))  # random delay between requests
  }

  # Export the table after each batch.
  write.csv(resultsTable, "parse/dataexport/resultsTable.csv")

  if(startdate <= finalstopdate) {
    print("waiting ...")
    Sys.sleep(sample(12:24, 1))  # longer pause between batches
    if(startdate + 30 <= finalstopdate) {
      stopdate <- startdate + 30
    } else {
      stopdate <- finalstopdate
    }
  } else {
    run <- FALSE
    beep(sound = 4, expr = NULL)  # signal that the run is finished
  }
}
Suggestions for improvement greatly appreciated.