Efficient data analysis with data.table

Welcome!

Find the materials here: paocorrales.github.io/intro-datatable

All materials in this course are under the license CC-BY-SA 4.0.

Who are we?

Pao, Elio, Kelly

Who are you?

Introduce yourself to your neighbours 👋

Share your name and how many hours it took you to arrive to Salzburg

Housekeeping

You will need:

R
data.table

install.packages("data.table")

ggplot2 or any other visualisation package (optional)

Wi-Fi network name

userr2024

salzburg

TODO-ADD-LATER

What is data.table?

At its core, data.table provides an enhanced version of data.frames that are faster, more memory efficient and can be manipulated using a more concise syntax.

It also provides a whole set of extra functions for reading from and writing to tabular files, reshaping data between long and wide formats, joining datasets and much more.

Why data.table?

Fast and efficient
Does not have dependencies
Syntax is very concise
Ensures backwards compatibility

Follow along

If you can, try the code we will show you on the screen.

For the exercises, work in teams and use the sticky notes!

🟪 “I’m stuck and need help!”

🟩 “I finished the exercise”

Materials

https://bit.ly/datatable-user2024

The syntax

The general data.table syntax looks like this:

DT[i, j, by]

Where DT is a data.table object, the i argument is used for filtering and joining operations, the j argument can summarise and transform, and the by argument defines the groups to which to apply these operations.

You can read the syntax as “In these rows, do this, grouped by that”.

It is very concise but easy to read (sometimes).

Exercises

Filter some rows

Is your favourite band/artist listed in the data?

Code

rolling_stone[clean_name == "your fav band here"]

How many albums got first position the Billboard magazine ranking?

Code

rolling_stone[peak_billboard_position == 1]

Work with columns

Are more popular albums on Spotify (column spotify_popularity) higher in the 2003 ranking? Compute the correlation (hint: there are missing values, so you will need to use use = "complete.obs").

Code

rolling_stone[, cor(spotify_popularity, rank_2003, use = "complete.obs")]

Which rankings are missing in the database?

Code

rolling_stone[, which(!(1:500 %in% rank_2003))]

Operate over colums and groups

How may bands in the Latin genre appeared in the raking of 2020?
Code
```
rolling_stone[genre == "latin", sum(!is.na(rank_2020))]
```

What the average of the birth year for each artist or band?

Code

rolling_stone[, artist_birth_year_mean := artist_birth_year_sum/artist_member_count]

What is the mean raking of each album over the years?

Code

rolling_stone[, mean_raking := mean(c(rank_2003, rank_2012, rank_2020), na.rm = TRUE), by = album_id]

Break!

Take 15 minutes.

15:00

Piping operations

Are bands more successful than solo artists?

Code

rolling_stone[, is_band := artist_member_count > 1] |>
  _[, .(mean_rank_2003 = mean(rank_2003, na.rm = TRUE),
        mean_rank_2012 = mean(rank_2012, na.rm = TRUE),
        mean_rank_2020 = mean(rank_2020, na.rm = TRUE)), 
    by = is_band]

# Notice that due to reference semantics, this operation adds the 
# is_band column to the data.table. You can avoid this by using 
# an expression in the by argument.
rolling_stone |> 
  _[, .(mean_rank_2003 = mean(rank_2003, na.rm = TRUE),
        mean_rank_2012 = mean(rank_2012, na.rm = TRUE),
        mean_rank_2020 = mean(rank_2020, na.rm = TRUE)), 
    by = .(is_band = artist_member_count > 1)]

What is the proportion of albums recorded in a Studio and their mean position in 2020?

Code

rolling_stone[, .(mean_rank_2020 = mean(rank_2020, na.rm = TRUE),
                  N = sum(!is.na(rank_2020))), by = type] |> 
  _[, prop := N/sum(N)]

What is the mean number of years between an artist debut album and the release of their first top 500 album (see the column years_between ) for each genre?
Code
```
rolling_stone[, .(mean_years = mean(years_between, na.rm = TRUE), .N),
                by = genre] |> 
  _[order(mean_years)]
```

Let’s melt the data

What is the mean rank on each ranking year for gender?

Code

rolling_stone |> 
  melt(id.vars = c("release_year", "album_id", "artist_gender"), 
       measure.vars = c("rank_2003", "rank_2012", "rank_2020"), 
       variable.name = "rank_year",
       value.name = "rank") |> 
  _[, .(mean_rank = mean(rank, na.rm = TRUE)), 
    by = .(rank_year, artist_gender)]

Reshape!

What is the proportion of Male to Female artists for each decade in the 2003 ranking?

Code

rolling_stone |> 
  _[, .(N = sum(!is.na(rank_2003))), 
    by = .(decade = floor(release_year / 10) * 10, artist_gender)] |> 
  dcast(decade ~ artist_gender, value.var = "N") |> 
  setnafill(fill = 0) |> 
  _[, ratio := Male / (Male + Female)] |> 
  ggplot(aes(decade, ratio)) +
  geom_line()

Has this changed between the different ranking years? (You need to first melt and then dcast)

Code

rolling_stone |> 
  melt(id.vars = c("release_year", "album_id", "artist_gender"), 
       measure.vars = c("rank_2003", "rank_2012", "rank_2020"), 
       variable.name = "rank_year",
       value.name = "rank") |> 
  _[, .(N = sum(!is.na(rank))),
    by = .(decade = floor(release_year / 10) * 10, artist_gender, rank_year)] |> 
  dcast(decade + rank_year ~ artist_gender, value.var = "N") |> 
  _[, ratio := Male / (Male + Female)] |> 
  ggplot(aes(decade, ratio)) +
  geom_line(aes(color = rank_year))

How to contribute

It is a great time to join the data.table community!

In 2023-2025, National Science Foundation has provided funds to support expanding the ecosystem of users and contributors around data.table.

data.table on Mastodon: @r_data_table@fosstodon.org
The Raft rdatatable-community.github.io/The-Raft, the data.table blog.
The GitHub repo with 900+ open issues, 100+ open PRs. If you have any time/interest, we could use your help!

Thanks for joining!

paocorrales.github.io/intro-datatable