library(plyr)
library(dplyr)
library(tidyr)
library(rvest)

Background

Even though hundreds were dying in West Africa, the 2014 Ebola outbreak in was not in the news much in the United States. Only after a few Americans got infected did it become more prevalent in the US news circuits. Now the Ebola situation in West Africa has gone back to obscurity in the United States media. Is this indicative of how the Ebola situation is in Africa or is it media blindness?

This is a Google Trends (screenshot) chart showing interest in Ebola over time. As the outbreak was raging on, there was a uptick in interest in Ebola. There was a major spike in interest in October 2014 and then interest quickly dropped.

Google Searches for Ebola

Google Searches for ‘Ebola’

Data Source:

The data comes from two sources, HDX Beta website and The World Health Organization website. The data from HDX was scraped from the World Health Organization website. The HDX data was outdated and had a few inconsistencies in the data. Although there were issues with the data, the information was already gathered and parsed. It was also easier to handle the data as it was aggregated into 1 file. Another important fact about this dataset was that it contained about 3 months of data not available from the W.H.O. site.

The World Health Organization data was more complete for the timeframe it was covering but was seperated into daily files. This would make for a more tedius time sourcing the data. Ultimately, both datasets were massaged, cleaned, and combined.

Data Collection Process - Attempt 1

Originally, the data collection process was easy. The data was already gathered and processed from the W.H.O. website by the folks at HDX Beta. The data was already in Tidy format and at this point, I believed it was the format I needed the data in. After some analysis of the data, it became apparent that there were issues with the data. There were large gaps in the data and more processing was needed to get the data in the format needed. I used this dataset to get started but eventually notice the inadequecy of the data when plotting.

Data Collection Process - Attempt 2

After verifying the data from HDX represented the same data from the W.H.O., I gathered the data listed on the W.H.O. site.

Dataset Listed by Dates

Dataset Listed by Dates

The data was stored in a datatable which I was not able to access programatically. After several failed attempts to pull the data from datatable, I needed another approach. There was a download button available so I tried to access the data that way. This again was unsuccessful because it was using JavaScript to send information back to the server. There was a second download link that revealed the arguments being passed to the server. I was able to mimic the arguments and was able to issue a request and download the data that was returned.

Data Table

Data Table

link <- "http://apps.who.int/gho/data/node.ebola-sitrep.ebola-summary?lang=en"
ebola_site <- read_html(link)

#Read all the links from the website
link_list <- ebola_site %>% html_nodes("a") %>% html_attr("href")

#Keep only the links we need
link_list <- link_list[33:331]

link_dates <- vector()

#Extract the date portion of the links
link_dates <- substr(link_list, 33, 40)

There was a pattern between all the links. Only the date changed between the links so I iterated through the list of links and read each dataset. I split the link into 3 parts, with the varying date appended into the middle.

pre_link <- 'http://apps.who.int/gho/athena/xmart/data-text.csv?target=EBOLA_MEASURE/CASES,DEATHS&profile=text&filter=COUNTRY:GIN;COUNTRY:UNSPECIFIED;COUNTRY:LBR;COUNTRY:UNSPECIFIED;COUNTRY:SLE;COUNTRY:UNSPECIFIED;LOCATION:-;DATAPACKAGEID:'
post_link <- ';INDICATOR_TYPE:SITREP_CUMULATIVE;INDICATOR_TYPE:SITREP_CUMULATIVE_21_DAYS;SEX:-'

#Convert to dates
date_parts <- as.Date(link_dates, "%Y%m%d")

#Create empty dataframe to store the data.
ebola_scrape = data.frame()

#Construct full link
for (i in length(date_parts)) {
  dl_link <- paste0(pre_link, date_parts, post_link)
}

This method did not fully work because a few datasets had different number of columns. This would eventually cause errors after a few iterations.

for(j in 1:length(dl_link)){
  print(j)
  download.file(dl_link[j], 'DELETEME.csv', mode = 'w')
  temp_df <- read.csv("DELETEME.csv", skip=2, header = TRUE, stringsAsFactors = FALSE)
  #Add a new column with the date of the 'file' 
  temp_df["Date"] = date_parts[j]
  ebola_scrape <- rbind(ebola_scrape, temp_df) #Add new data to dataframe
  Sys.sleep(1) ##Add a delay so I dont hammer website
}

Next step was to download each dataset and write them to individual files. The idea was to process them and then combine the dataset once they had the same file structure. This process worked well but would be tedious to combine the data afterwards.

#Reinitialize the dataframe
ebola_scrape = data.frame()

for(j in 1:length(dl_link)){
  print(j)
  download.file(dl_link[j], paste0('DELETEME_',j,'.csv'), mode = 'w')
  Sys.sleep(1) ##Add a delay so I dont hammer website
}

Data Collection Process - Attempt 3

While processing the data in the previous step, I eventually realized that the links were sending queries back to their server to generate the data I needed. I experimented with the links a few times and eventually settled on the following link.

http://apps.who.int/gho/athena/xmart/data-text.csv?target=EBOLA_MEASURE/CASES,DEATHS&profile=text&filter=COUNTRY:-;LOCATION:-;INDICATOR_TYPE:SITREP_CUMULATIVE;SEX:-

I removed the country filter and the date filter. This query provided all the data in one dataset.

Please note, I have created a few bad queries which resulted in errors being returned.

Data Scrubbing

Reading in both data sources as CSV files

#World Health Organization
ebola_df <- read.csv("data/data-text_all_countries.csv", header = TRUE, stringsAsFactors = FALSE)

#HDX Beta
ebola_df2 <- read.csv("data/ebola_data_db_format.csv", header = TRUE, stringsAsFactors = FALSE)

Filtered the data to the countries with the most serious Ebola infection rate. The other countries had insignificant amount of data compared to the these three countries.

countries <- c('Guinea', 'Liberia', 'Sierra Leone')
ebola_df <- filter(ebola_df, Country %in% countries)
ebola_df2 <- filter(ebola_df2, Country %in% countries)

Keep the columns that were need

ebola <- ebola_df[,c(1,5,6,10,12,14)]

All suspected, probable, and confirmed cases are rolled up into the ‘Total’ case definition. Filtering for only ‘Total’ case definition

ebola_total <- filter(ebola, Case.definition == 'Total')

The ‘Data.package.ID’ column contains the date information that is need. The data is stored as strings and contain extra information. The date needs to be parsed and coverted before it can be used. We are trimming the Data.package.ID to get the YYYY-MM-DD data format and then converting into a date type.

head(ebola_total['Data.package.ID'])
##                           Data.package.ID
## 1                 Data package 2015-12-04
## 2 Data package 2015-12-2828 December 2015
## 3 Data package 2015-09-022 September 2015
## 4     Data package 2015-06-3030 June 2015
## 5                 Data package 2015-09-03
## 6                 Data package 2016-03-09
ebola_total['Data.package.ID'] <- as.Date(substr(ebola_total[,'Data.package.ID'], 14, 23), "%Y-%m-%d")

The data for Guinea, Sierra Leone, and Liberia was split into two dataset. The first dataset is for a 2014 outbreak and a second dataset was for a 2015 outbreak. The two datasets were combined and the infection/death numbers were aggragated per date to give a cumulative number. This also removed ‘Country/Date’ duplicates that was giving trouble me trouble with the motion chart. Google Motion chart needs two columns:

-The first column must be of type ‘string’ and contain the entity names -The second column must contain time values.

grp_by <- ebola_total %>% group_by(Country, Data.package.ID, Ebola.measure) %>%
  summarise(value = sum(Numeric))

Processing the HDX dataset. The ‘Date’ field is converted from character to Date for Google Motion Chart requierments.

ebola_df2['Date'] <- as.Date(ebola_df2[,'Date'], "%Y-%m-%d")

Ignore the confirmed, probably, and suspected data. We will keep the ‘rolled up’ cumulative number for all 3 categories. This is the same as the ‘Total’ in the W.H.O dataset.

cum_death <- 'Cumulative number of confirmed, probable and suspected Ebola deaths'
cum_case <- 'Cumulative number of confirmed, probable and suspected Ebola cases'

indicator <- c(cum_case, cum_death)
ebola_ind <- filter(ebola_df2, Indicator %in% indicator)

After 2014-11-12, the data can be found in the W.H.O dataset. We are keeping only the data that is not available through the W.H.O. dataset

ebola_ind <- filter(ebola_ind, Date <= as.Date("2014-11-12"))
ebola_ind$Indicator[ebola_ind$Indicator==cum_case] <- 'Cases'
ebola_ind$Indicator[ebola_ind$Indicator==cum_death] <- 'Deaths'

Collapsing the data so the Cases and Death information will be contained in one row

ebola_df2_spread <- ebola_ind %>% spread(Indicator, value)

Collapsing the data so the Cases and Death information will be contained in one row

eb_tot_collapse <- grp_by %>% spread(Ebola.measure, value)

Sort and combine the data from the two datasources. The columns needed to be renamed so they could be joined.

sorted <- eb_tot_collapse[order(eb_tot_collapse$Country, eb_tot_collapse$Data.package.ID, decreasing=TRUE),]
sorted <- plyr::rename(sorted, c("Data.package.ID"='Date', 'Number of cases'='Cases', 'Number of deaths'='Deaths'))

sorted <- union(sorted, ebola_df2_spread)
sorted <- sorted[order(sorted$Country, sorted$Date, decreasing=TRUE),]

After sorting the data by Country and Date, I used the lead function to get various values from the previous day (data point). The previous values were used to create various metrics. Since the data points were non consistent in terms of periods between collection, I calculated the average value per day, based on how many days were between each data point. Turns out it was not needed for the motion chart as Google interpolates the data between data points.

sorted <- sorted %>%
  plyr::mutate(prev_country = lead(Country, order_by = Country)) %>%
  plyr::mutate(fatality_rate = Deaths/Cases) %>%
  plyr::mutate(prev_death = lead(Deaths, order_by = Country)) %>%
  plyr::mutate(prev_case = lead(Cases, order_by = Country)) %>%
  plyr::mutate(prev_date = lead(Date, order_by = Country)) %>%
  plyr::mutate(delta_days = as.numeric(Date-prev_date)) %>%
  plyr::mutate(delta_deaths = Deaths-prev_death) %>%
  plyr::mutate(delta_cases = Cases-prev_case) %>%
  #Average Delta (Death/Cases) per day
  plyr::mutate(avg_dlt_death = delta_deaths/delta_days) %>%
  plyr::mutate(avg_dlt_case = delta_cases/delta_days) %>%
  plyr::mutate(avg_dlt_fatality = avg_dlt_death/avg_dlt_case)

Cleaning up some outliers and remove columns we no longer need. I couldn’t get the lead/lag functions to work 100% as expected so some of the data generated were bad. I omitted bad data that was that was caused by the lead function. The calculations are based on the data being grouped by country but the boundry cases were being calculated with different countries, resulting in bad data.

  sorted <- filter(sorted, Country == prev_country)  
  sorted <- sorted[,c(-5,-7,-8,-9,-10)]

In general, cumulative numbers do not decrease, only increase. In this situation, a decrease is possible as some cases are reclassified. I dont think reclassification happens on a large scale, so I limited the dataset to numbers with small negative deltas and larger.

  sorted <- filter(sorted, delta_cases > -5)
  sorted <- filter(sorted, delta_cases < 600)
  sorted <- filter(sorted, delta_deaths > -5)

Remove NaNs from division by 0. Defaulted to 0

sorted[is.na(sorted)] <- 0
  #View(sorted)

Visualization - Google Motion Chart

The data was written to a CSV to be used for creating the Google Motion Charts.
Google Motion Chart for the Ebola outbreak in Africa is located in a seperate page. The Google trends chart for Ebola searches located there as well.

write.csv(sorted, file = "data/ebola_data.csv", row.names=FALSE)

Conclusions

Is the lacking media coverage indicative of the situation in Africa? This is not a straightforward answer. As shown by Google Trends, ebola searches peaked in october 2014 and dropped off shortly after that. Coincedentally, September 30, 2014 was when the first case of Ebola was reported in the united states and the last reported case in the United States was October 23, 2014.

The interest in Ebola dropped sharply in November 2014. During this time, the outbreak in West Africa, although slowing down, was still going strong. The waning interest in Ebola does coincide with the slowing down of infections in west Africa, but the infections in the United States seems to be the primary factor influencing interest and media coverage, not the situation in West Africa.

We can clearly see when the infections are starting to become under control when the cumulative number of infections start to form horizontal lines.

library(ggplot2)

dates <- c("09/30/14", "11/23/14")
us_ebola_date <- as.Date(dates, format = "%m/%d/%y")

ggplot(sorted, aes(x=Date, y=Cases)) +
  geom_point(aes(colour = Country)) + 
  geom_vline(xintercept = as.numeric(us_ebola_date)) +
  xlab('Date') + ylab('Cumulative cases of Ebola')

Here is a scatter plot of the number of new infections over time. The verticle bars represents the first and last infection in the US.

ggplot(sorted, aes(x=Date, y=delta_cases)) +
  geom_point(aes(colour = Country)) + 
  geom_vline(xintercept = as.numeric(us_ebola_date)) + 
  xlab('Date') + ylab('Number of New cases') + 
  ggtitle("New Infections")