19 Visualise And Analyse Time-Oriented Data

Hands-On Exercise for Week 7

Published

July 7, 2023

(First Published: Jul 5, 2023)

19.1 Learning Outcome

We will learn how to create the following visualisations:

plotting a calendar heatmap,
plotting a cycle plot,
plotting a slopegraph, and
plotting a horizon chart

19.2 Getting Started

19.2.1 Install and load the required R libraries

Install and load the required R packages. The names and functions of the new packages used in this exercise are as follows:

scales: provides various functions for scaling and formatting data
viridis: provides colour palettes that are perceptually uniform and work well for representing data in visualisations
gridExtra: provides functions for arranging multiple plots on a page or within a plot
readxl: enables reading data from Microsoft Excel files (.xls and .xlsx) into R
knitr: used for dynamic report generation in R
data.table: offers fast data manipulation and aggregation operations, making it useful for working with large datasets
CGPfunctions: contains a function that automates the process of producing a Tufte-style slopegraph using ggplot2

pacman::p_load(scales, viridis, lubridate, ggthemes, gridExtra, readxl, knitr, data.table, CGPfunctions, ggHoriPlot, tidyverse)

19.3 Plotting Calendar Heatmap

In this section, we will learn how to plot a calendar heatmap programmatically by using the ggplot2 package.

19.3.1 Import the data

The eventlog.csv file consists of 199,999 rows of time-series cyber attack records by country. It is imported into R and assigned to the attacks data frame.

attacks <- read_csv("data/eventlog.csv", show_col_types = FALSE)

19.3.2 Examine the data structure

kable() of the knitr package can be used to preview the first few rows of the imported data frame.

kable(head(attacks))

timestamp	source_country	tz
2015-03-12 15:59:16	CN	Asia/Shanghai
2015-03-12 16:00:48	FR	Europe/Paris
2015-03-12 16:02:26	CN	Asia/Shanghai
2015-03-12 16:02:38	US	America/Chicago
2015-03-12 16:03:22	CN	Asia/Shanghai
2015-03-12 16:03:45	CN	Asia/Shanghai

There are three columns, namely timestamp, source_country and tz.

The timestamp field stores date-time values in POSIXct format.
The source_country field stores the source of the attack as an ISO 3166-1 alpha-2 country code.
The tz field stores the time zone of the source IP address.

19.3.3 Data Preparation

Step 1: Deriving weekday and hour of day fields

Before we can plot the calendar heatmap, two new fields, namely wkday and hour, need to be derived. In this step, we will write a function to perform the task.

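make_hr_wkday <- function(ts, sc, tz) {
  # Parse the timestamps in the time zone of the group;
  # the function is called once per tz, so tz[1] is sufficient
  real_times <- ymd_hms(ts, 
                        tz = tz[1], 
                        quiet = TRUE)
  # Keep the source country and derive the weekday name and hour of day
  dt <- data.table(source_country = sc,
                   wkday = weekdays(real_times),
                   hour = hour(real_times))
  return(dt)
}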

Things to note:

ymd_hms() and hour() are from the lubridate package, and
weekdays() is a base R function.
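
As a minimal sketch of what the helper does for a single record, the snippet below parses the first timestamp shown in the table above together with its time zone, then extracts the weekday and the hour. The variable names ts_example, tz_example and real_time are illustrative only. Note that weekdays() returns names in the session locale, so the English label in the comment assumes an English locale.

```r
library(lubridate)

# Illustrative record taken from the first row of the table above
ts_example <- "2015-03-12 15:59:16"
tz_example <- "Asia/Shanghai"

# Interpret the string as local clock time in the given time zone
real_time <- ymd_hms(ts_example, tz = tz_example, quiet = TRUE)

weekdays(real_time)  # weekday name in the session locale ("Thursday" in English)
hour(real_time)      # 15
```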

Step 2: Deriving the attacks tibble data frame

wkday_levels <- c('Saturday', 'Friday', 
                  'Thursday', 'Wednesday', 
                  'Tuesday', 'Monday', 
                  'Sunday')

attacks <- attacks %>%
  group_by(tz) %>%
  do(make_hr_wkday(.$timestamp, 
                   .$source_country, 
                   .$tz)) %>% 
  ungroup() %>% 
  mutate(wkday = factor(
    wkday, levels = wkday_levels),
    hour  = factor(
      hour, levels = 0:23))

Things to note:

Besides extracting the necessary data into the attacks data frame, mutate() of the dplyr package is used to convert the wkday and hour fields into factors so that they are ordered correctly when plotted.
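
To illustrate why the explicit levels matter, here is a small sketch using base R only. The values in wkday_example are hypothetical; the level order repeats wkday_levels from the code above.

```r
# Same level order as wkday_levels above: Saturday first, Sunday last
wkday_levels <- c("Saturday", "Friday", "Thursday", "Wednesday",
                  "Tuesday", "Monday", "Sunday")

# Hypothetical weekday values in arbitrary order
wkday_example <- factor(c("Monday", "Sunday", "Saturday"),
                        levels = wkday_levels)

levels(wkday_example)      # follows wkday_levels, not alphabetical order
as.integer(wkday_example)  # 6 7 1: positions within wkday_levels
sort(wkday_example)        # Saturday, Monday, Sunday
```

ggplot2 places the first level of a discrete y-axis at the bottom, so with this level order Saturday is drawn at the bottom of the heatmap and Sunday at the top.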

The table below shows the tidy tibble after processing.

kable(head(attacks))

tz	source_country	wkday	hour
Africa/Cairo	BG	Saturday	20
Africa/Cairo	TW	Sunday	6
Africa/Cairo	TW	Sunday	8
Africa/Cairo	CN	Sunday	11
Africa/Cairo	US	Sunday	15
Africa/Cairo	CA	Monday	11

19.3.4 Building a Calendar Heatmap

grouped <- attacks %>% 
  count(wkday, hour) %>% 
  ungroup() %>%
  na.omit()

ggplot(grouped, 
       aes(hour, 
           wkday, 
           fill = n)) + 
  geom_tile(color = "white", 
            size = 0.1) + 
  theme_tufte(base_family = "Helvetica") + 
  coord_equal() +
  scale_fill_gradient(name = "# of attacks",
                      low = "sky blue", 
                      high = "dark blue") +
  labs(x = NULL, 
       y = NULL, 
       title = "Attacks by weekday and time of day") +
  theme(axis.ticks = element_blank(),
        plot.title = element_text(hjust = 0.5),
        legend.title = element_text(size = 8),
        legend.text = element_text(size = 6))

Things to note:

a tibble data frame called grouped is derived by aggregating the attacks by the wkday and hour fields.
a new field called n is derived by using count(), which groups the records by wkday and hour and counts the rows in each group.
na.omit() is used to exclude missing values.
geom_tile() is used to plot tiles (grids) at each x and y position. The color and size arguments are used to specify the border colour and line size of the tiles.
coord_equal() is used to ensure the plot will have an aspect ratio of 1:1.
scale_fill_gradient() is used to create a two-colour gradient (low-high).
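
As a minimal sketch of the aggregation step, the snippet below applies count() to a tiny, made-up table. The object names toy_attacks and toy_grouped and their values are hypothetical and only serve to show where the n field comes from.

```r
library(dplyr)

# Hypothetical records: two on Monday at hour 9, one on Tuesday at hour 14
toy_attacks <- tibble(
  wkday = c("Monday", "Monday", "Tuesday"),
  hour  = c(9, 9, 14)
)

# count() groups by the listed columns and stores the row count in a column named n
toy_grouped <- toy_attacks %>%
  count(wkday, hour)

toy_grouped
# Two rows: (Monday, 9, n = 2) and (Tuesday, 14, n = 1)
```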

19.3.5 Building Multiple Calendar Heatmaps

Step 1: Deriving attack by country object

In order to identify the top 4 countries with the highest number of attacks, we do the following:

count the number of attacks by country,
calculate the percentage of attacks by country, and
save the results in a tibble data frame.

attacks_by_country <- count(
  attacks, source_country) %>%
  # percent() is a function from the scales package
  mutate(percent = percent(n/sum(n))) %>%
  arrange(desc(n))
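
The following sketch shows what percent() returns, using made-up counts. The names n_example and share are hypothetical, and accuracy = 1 is set here only to make the rounding explicit; the code above relies on the default.

```r
library(scales)

# Hypothetical attack counts for three countries
n_example <- c(50, 30, 20)

# percent() turns proportions into formatted character labels
share <- percent(n_example / sum(n_example), accuracy = 1)

share         # "50%" "30%" "20%"
class(share)  # "character"
```

Because the result is a character vector, it is suitable for display but not for numeric sorting, which is why the data frame above is arranged by n rather than by percent.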

Step 2: Preparing the tidy data frame

In this step, we extract the attack records of the top 4 countries from the attacks data frame and save the data in a new tibble data frame (i.e. top4_attacks).

top4 <- attacks_by_country$source_country[1:4]
top4_attacks <- attacks %>%
  filter(source_country %in% top4) %>%
  count(source_country, wkday, hour) %>%
  ungroup() %>%
  mutate(source_country = factor(
    source_country, levels = top4)) %>%
  na.omit()

Step 3: Plotting the multiple calendar heatmaps by using the ggplot2 package

ggplot(top4_attacks, 
       aes(hour, 
           wkday, 
           fill = n)) + 
  geom_tile(color = "white", 
            size = 0.1) + 
  theme_tufte(base_family = "Helvetica") + 
  coord_equal() +
  scale_fill_gradient(name = "# of attacks",
                      low = "sky blue", 
                      high = "dark blue") +
  # One panel per source country, in descending order of attack count
  facet_wrap(~source_country, ncol = 2) +
  labs(x = NULL, y = NULL, 
       title = "Attacks from the top 4 source countries by weekday and time of day") +
  theme(axis.ticks = element_blank(),
        axis.text.x = element_text(size = 7),
        plot.title = element_text(hjust = 0.5),
        legend.title = element_text(size = 8),
        legend.text = element_text(size = 6))

19.4 Plotting Cycle Plot

19.4.1 Import the data

The code below imports arrivals_by_air.xlsx by using read_excel() of the readxl package and saves it as a tibble data frame called air.

air <- read_excel("data/arrivals_by_air.xlsx")

19.4.2 Data Preparation

Next, two new fields called month and year are derived from the Month-Year field.

air$month <- factor(month(air$`Month-Year`), 
                    levels=1:12, 
                    labels=month.abb, 
                    ordered=TRUE) 
air$year <- year(ymd(air$`Month-Year`))
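
As a minimal sketch of this derivation, the snippet below applies the same calls to three made-up dates. The vector dates_example is hypothetical and stands in for the Month-Year field.

```r
library(lubridate)

# Hypothetical Month-Year values
dates_example <- ymd(c("2010-01-01", "2010-12-01", "2011-03-01"))

# Month number converted to an ordered factor labelled Jan to Dec
month_example <- factor(month(dates_example),
                        levels = 1:12,
                        labels = month.abb,
                        ordered = TRUE)
year_example <- year(dates_example)

as.character(month_example)          # "Jan" "Dec" "Mar"
year_example                         # 2010 2010 2011
month_example[1] < month_example[2]  # TRUE: Jan precedes Dec in calendar order
```

Using an ordered factor keeps the months in calendar order rather than alphabetical order when they are later used for grouping and faceting.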

Then, we extract the data for the target country (i.e. Vietnam).

Vietnam <- air %>% 
  select(`Vietnam`, 
         month, 
         year) %>%
  filter(year >= 2010)

Thereafter, we compute the average arrivals for each month across the years.

hline.data <- Vietnam %>% 
  group_by(month) %>%
  summarise(avg_value = mean(`Vietnam`))
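
The sketch below repeats the same group_by() and summarise() pattern on a small, made-up table so that the averages can be checked by hand. The names toy_arrivals, toy_hline and arrivals are hypothetical.

```r
library(dplyr)

# Hypothetical arrivals for two months over two years
toy_arrivals <- tibble(
  month    = c("Jan", "Jan", "Feb", "Feb"),
  year     = c(2010, 2011, 2010, 2011),
  arrivals = c(100, 200, 300, 500)
)

# One average per month, taken across all years
toy_hline <- toy_arrivals %>%
  group_by(month) %>%
  summarise(avg_value = mean(arrivals))

toy_hline
# Feb: 400, Jan: 150 (plain character months sort alphabetically)
```

In the cycle plot, each of these per-month averages becomes the horizontal reference line of the corresponding month panel.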

19.4.3 Generating the Cycle Plot

# Plot the graph
ggplot() + 
  geom_line(data = Vietnam,
            # Treat year as a factor so that the x-axis shows 4-digit year labels
            aes(x = factor(year),
                y = `Vietnam`, 
                group = month), 
            colour = "black") +
  # Draw the average arrivals of each month as a reference line
  geom_hline(aes(yintercept = avg_value), 
             data = hline.data, 
             linetype = 6, 
             colour = "red", 
             size = 0.5) + 
  facet_grid(~month) +
  labs(title = "Visitor arrivals from Vietnam by air, Jan 2010-Dec 2019") +
  xlab("Year") +
  ylab("No. of Visitors") +
  # Rotate the x-axis labels by 90 degrees so that they do not overlap
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  # Show a year label only every 3 years to reduce clutter on the x-axis
  scale_x_discrete(breaks = function(x) x[seq(1, length(x), 3)])
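
To see which labels the breaks function keeps, it can be applied to a plain character vector. In this sketch, years is a stand-in for the discrete x-axis values that ggplot2 passes to the function, and every_third is a hypothetical name given to the same anonymous function.

```r
# Stand-in for the discrete x-axis values that ggplot2 passes to `breaks`
years <- as.character(2010:2019)

# Same function as in the scale_x_discrete() call above
every_third <- function(x) x[seq(1, length(x), 3)]

every_third(years)  # "2010" "2013" "2016" "2019"
```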

19.5 Plotting a Slopegraph

Note:

Before getting started, make sure that CGPfunctions has been installed and loaded into the R environment. Then, refer to Using newggslopegraph to learn more about the function. Lastly, read more about newggslopegraph() and its arguments in the function's reference documentation.

19.5.1 Import the data

rice <- read_csv("data/rice.csv", show_col_types = FALSE)

19.5.2 Generate the Slopegraph

rice %>% 
  mutate(Year = factor(Year)) %>%
  filter(Year %in% c(1961, 1980)) %>%
  newggslopegraph(Year, Yield, Country,
                  Title = "Rice Yield of Top 11 Asian Countries",
                  SubTitle = "1961-1980",
                  Caption = "Prepared by: Dr. Kam Tin Seong")

Things to note:

For effective data visualisation design, factor() is used to convert the value type of the Year field from numeric to factor.
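
The sketch below, in base R with made-up values, shows the effect of the conversion: the years become discrete labels, and the %in% comparison used in filter() still matches numeric years because a factor is compared by its labels. The names year_num and year_fct are hypothetical.

```r
# Hypothetical Year column
year_num <- c(1961, 1970, 1980)
year_fct <- factor(year_num)

levels(year_fct)             # "1961" "1970" "1980": discrete labels
year_fct %in% c(1961, 1980)  # TRUE FALSE TRUE
```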

That's all folks!