Efficient data analysis with data.table

Welcome!

Find the materials here: paocorrales.github.io/intro-datatable

All materials in this course are under the license CC-BY-SA 4.0.

Who are we?

Pao, Elio, Kelly

Who are you?

Introduce yourself to your neighbours 👋

Share your name and how many hours it took you to arrive to Salzburg

Housekeeping

You will need:

  • R
  • data.table
install.packages("data.table")
  • ggplot2 or any other visualisation package (optional)

Wi-Fi network name

userr2024

salzburg

TODO-ADD-LATER

What is data.table?

At its core, data.table provides an enhanced version of data.frames that are faster, more memory efficient and can be manipulated using a more concise syntax.

It also provides a whole set of extra functions for reading from and writing to tabular files, reshaping data between long and wide formats, joining datasets and much more.

Why data.table?

  • Fast and efficient
  • Does not have dependencies
  • Syntax is very concise
  • Ensures backwards compatibility

Follow along

If you can, try the code we will show you on the screen.

For the exercises, work in teams and use the sticky notes!

🟪 “I’m stuck and need help!”

🟩 “I finished the exercise”

The syntax

The general data.table syntax looks like this:

DT[i, j, by]

Where DT is a data.table object, the i argument is used for filtering and joining operations, the j argument can summarise and transform, and the by argument defines the groups to which to apply these operations.

You can read the syntax as “In these rows, do this, grouped by that”.

It is very concise but easy to read (sometimes).

Exercises

Filter some rows

  1. Is your favourite band/artist listed in the data?
Code
rolling_stone[clean_name == "your fav band here"]
  1. How many albums got first position the Billboard magazine ranking?
Code
rolling_stone[peak_billboard_position == 1]

Work with columns

  1. Are more popular albums on Spotify (column spotify_popularity) higher in the 2003 ranking? Compute the correlation (hint: there are missing values, so you will need to use use = "complete.obs").
Code
rolling_stone[, cor(spotify_popularity, rank_2003, use = "complete.obs")]
  1. Which rankings are missing in the database?
Code
rolling_stone[, which(!(1:500 %in% rank_2003))]

Operate over colums and groups

  1. How may bands in the Latin genre appeared in the raking of 2020?

    Code
    rolling_stone[genre == "latin", sum(!is.na(rank_2020))]
  2. What the average of the birth year for each artist or band?

    Code
    rolling_stone[, artist_birth_year_mean := artist_birth_year_sum/artist_member_count]
  3. What is the mean raking of each album over the years?

    Code
    rolling_stone[, mean_raking := mean(c(rank_2003, rank_2012, rank_2020), na.rm = TRUE), by = album_id]

Break!

Take 15 minutes.

15:00

Piping operations

  1. Are bands more successful than solo artists?

    Code
    rolling_stone[, is_band := artist_member_count > 1] |>
      _[, .(mean_rank_2003 = mean(rank_2003, na.rm = TRUE),
            mean_rank_2012 = mean(rank_2012, na.rm = TRUE),
            mean_rank_2020 = mean(rank_2020, na.rm = TRUE)), 
        by = is_band]
    
    # Notice that due to reference semantics, this operation adds the 
    # is_band column to the data.table. You can avoid this by using 
    # an expression in the by argument.
    rolling_stone |> 
      _[, .(mean_rank_2003 = mean(rank_2003, na.rm = TRUE),
            mean_rank_2012 = mean(rank_2012, na.rm = TRUE),
            mean_rank_2020 = mean(rank_2020, na.rm = TRUE)), 
        by = .(is_band = artist_member_count > 1)]
  2. What is the proportion of albums recorded in a Studio and their mean position in 2020?

    Code
    rolling_stone[, .(mean_rank_2020 = mean(rank_2020, na.rm = TRUE),
                      N = sum(!is.na(rank_2020))), by = type] |> 
      _[, prop := N/sum(N)]
  3. What is the mean number of years between an artist debut album and the release of their first top 500 album (see the column years_between ) for each genre?

    Code
    rolling_stone[, .(mean_years = mean(years_between, na.rm = TRUE), .N),
                    by = genre] |> 
      _[order(mean_years)]

Let’s melt the data

  1. What is the mean rank on each ranking year for gender?

    Code
    rolling_stone |> 
      melt(id.vars = c("release_year", "album_id", "artist_gender"), 
           measure.vars = c("rank_2003", "rank_2012", "rank_2020"), 
           variable.name = "rank_year",
           value.name = "rank") |> 
      _[, .(mean_rank = mean(rank, na.rm = TRUE)), 
        by = .(rank_year, artist_gender)]

Reshape!

  1. What is the proportion of Male to Female artists for each decade in the 2003 ranking?

    Code
    rolling_stone |> 
      _[, .(N = sum(!is.na(rank_2003))), 
        by = .(decade = floor(release_year / 10) * 10, artist_gender)] |> 
      dcast(decade ~ artist_gender, value.var = "N") |> 
      setnafill(fill = 0) |> 
      _[, ratio := Male / (Male + Female)] |> 
      ggplot(aes(decade, ratio)) +
      geom_line()
  2. Has this changed between the different ranking years? (You need to first melt and then dcast)

    Code
    rolling_stone |> 
      melt(id.vars = c("release_year", "album_id", "artist_gender"), 
           measure.vars = c("rank_2003", "rank_2012", "rank_2020"), 
           variable.name = "rank_year",
           value.name = "rank") |> 
      _[, .(N = sum(!is.na(rank))),
        by = .(decade = floor(release_year / 10) * 10, artist_gender, rank_year)] |> 
      dcast(decade + rank_year ~ artist_gender, value.var = "N") |> 
      _[, ratio := Male / (Male + Female)] |> 
      ggplot(aes(decade, ratio)) +
      geom_line(aes(color = rank_year))

How to contribute

It is a great time to join the data.table community!

In 2023-2025, National Science Foundation has provided funds to support expanding the ecosystem of users and contributors around data.table.

Thanks for joining!

paocorrales.github.io/intro-datatable