Efficient data analysis with data.table

Welcome!

Find the materials here: paocorrales.github.io/intro-datatable

All materials in this course are under the license CC-BY-SA 4.0.

Who are we?

Pao, Elio

Who are you?

Share your name and what’s your favourite ice-cream flavour.

Housekeeping

You will need:

  • R
  • data.table
install.packages("data.table")
  • ggplot2 or any other visualisation package (optional)

What is data.table?

At its core, data.table provides an enhanced version of data.frames that are faster, more memory efficient and can be manipulated using a more concise syntax.

It also provides a whole set of extra functions for reading from and writing to tabular files, reshaping data between long and wide formats, joining datasets and much more.

Why data.table?

  • Fast and efficient
  • Does not have dependencies
  • Syntax is very concise
  • Ensures backwards compatibility

Follow along

If you can, try the code we will show you on the screen.

If you’re stuck or need help, tell us in the chat.

Materials

https://paocorrales.github.io/intro-datatable/

Reference semantics

Most R functions and methods uses copy-on-modify.

This code returns a new tibble that is a copy of my_data with a new column but it doesn’t modify my_data.

my_data |> 
  mutate(new_column = old_column*2)

data.table uses modify-in-place, which means that objects are not copied when modified. This code doesn’t create a new copy of my_data but it rather modifies my_data directly.

my_data[, new_column := old_column*2]

This is similar to the base R code:

my_data$new_column <- data$new_column

The syntax

The general data.table syntax looks like this:

DT[i, j, by]

Where DT is a data.table object, the i argument is used for filtering and joining operations, the j argument can summarise and transform, and the by argument defines the groups to which to apply these operations.

You can read the syntax as “In these rows, do this, grouped by that”.

It is very concise but easy to read (sometimes).

Exercises

Filter some rows

  1. Is your favourite band/artist listed in the data (column sort_name or clean_name)?
Code
rolling_stone[clean_name == "ABBA"]
  1. Who are the “30 under 30” in 2020? That is, bands with an average age under 30 years (column ave_age_at_top_500) that are in the top 30 in 2020 (rank_2020).
Code
rolling_stone[ave_age_at_top_500 < 30 & rank_2020 < 30] 

Work with columns

  1. Are more popular albums on Spotify (column spotify_popularity) higher in the 2003 ranking (column rank_2003)? Compute the correlation between the two columns (hint: there are missing values, so you will need to use use = "complete.obs").
Code
rolling_stone[, cor(spotify_popularity, rank_2003, use = "complete.obs")]
  1. Which rankings are missing in the database?
Code
rolling_stone[, which(!(1:500 %in% rank_2003))]

Operate over colums and groups

  1. How may bands in the Latin genre (column genre) appeared in the raking of 2020?
Code
rolling_stone[genre == "latin", sum(!is.na(rank_2020))]
  1. How many male, female or mixed-gender (column artist_gender) bands were included in the 2020 ranking (column rank_2020)?
Code
rolling_stone[, sum(!is.na(rank_2020)), by = artist_gender]
  1. Add a column with the average of the birth year of each band (columns artist_birth_year_sum and artist_member_count are relevant)?
Code
rolling_stone[, artist_birth_year_mean := artist_birth_year_sum/artist_member_count]
  1. Add a column with the average ranking of each album (columns rank_2003, rank_2012 and rank_2020 and album_id).
Code
rolling_stone[, mean_raking := mean(c(rank_2003, rank_2012, rank_2020), na.rm = TRUE), by = album_id]

Break!

Take 15 minutes.

15:00

Piping operations

  1. Are bands (artists with artist_member_count greater than 1) more successful than solo artists?
Code
rolling_stone[, is_band := artist_member_count > 1] |>
  _[, .(mean_rank_2003 = mean(rank_2003, na.rm = TRUE),
        mean_rank_2012 = mean(rank_2012, na.rm = TRUE),
        mean_rank_2020 = mean(rank_2020, na.rm = TRUE)), 
    by = is_band]

# Notice that due to reference semantics, this operation adds the 
# is_band column to the data.table. You can avoid this by using 
# an expression in the by argument.
rolling_stone |> 
  _[, .(mean_rank_2003 = mean(rank_2003, na.rm = TRUE),
        mean_rank_2012 = mean(rank_2012, na.rm = TRUE),
        mean_rank_2020 = mean(rank_2020, na.rm = TRUE)), 
    by = .(is_band = artist_member_count > 1)]
  1. What is the proportion of albums recorded in a Studio and their mean position in 2020?
Code
rolling_stone[, .(mean_rank_2020 = mean(rank_2020, na.rm = TRUE),
                  N = sum(!is.na(rank_2020))), by = type] |> 
  _[, prop := N/sum(N)]
  1. What is the mean number of years between an artist debut album and the release of their first top 500 album (see the column years_between) for each genre?
Code
rolling_stone[, .(mean_years = mean(years_between, na.rm = TRUE), .N),
                by = genre] |> 
  _[order(mean_years)]

Let’s melt the data

  1. What is the mean rank on each ranking year for gender?

    Code
    rolling_stone |> 
      melt(id.vars = c("release_year", "album_id", "artist_gender"), 
           measure.vars = c("rank_2003", "rank_2012", "rank_2020"), 
           variable.name = "rank_year",
           value.name = "rank") |> 
      _[, .(mean_rank = mean(rank, na.rm = TRUE)), 
        by = .(rank_year, artist_gender)]

Reshape!

  1. What is the proportion of Male to Female artists for each decade in the 2003 ranking?

    Code
    rolling_stone |> 
      _[, .(N = sum(!is.na(rank_2003))), 
        by = .(decade = floor(release_year / 10) * 10, artist_gender)] |> 
      dcast(decade ~ artist_gender, value.var = "N") |> 
      setnafill(fill = 0) |> 
      _[, ratio := Male / (Male + Female)] |> 
      ggplot(aes(decade, ratio)) +
      geom_line()
  2. Has this changed between the different ranking years? (You need to first melt and then dcast)

    Code
    rolling_stone |> 
      melt(id.vars = c("release_year", "album_id", "artist_gender"), 
           measure.vars = c("rank_2003", "rank_2012", "rank_2020"), 
           variable.name = "rank_year",
           value.name = "rank") |> 
      _[, .(N = sum(!is.na(rank))),
        by = .(decade = floor(release_year / 10) * 10, artist_gender, rank_year)] |> 
      dcast(decade + rank_year ~ artist_gender, value.var = "N") |> 
      _[, ratio := Male / (Male + Female)] |> 
      ggplot(aes(decade, ratio)) +
      geom_line(aes(color = rank_year))

How to contribute

It is a great time to join the data.table community!

In 2023-2025, National Science Foundation has provided funds to support expanding the ecosystem of users and contributors around data.table.

Thanks for joining!

paocorrales.github.io/intro-datatable