Find the materials here: paocorrales.github.io/intro-datatable
All materials in this course are licensed under CC-BY-SA 4.0.
Pao, Elio, Kelly
Share your name and how many hours it took you to get to Salzburg
You will need:
Wi-Fi network name
userr2024
salzburg
TODO-ADD-LATER
At its core, data.table provides an enhanced version of data.frame that is faster, more memory efficient, and can be manipulated using a more concise syntax.
It also provides a whole set of extra functions for reading from and writing to tabular files, reshaping data between long and wide formats, joining datasets and much more.
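As a small illustration of the reading and writing functions mentioned above, here is a minimal sketch of a round trip with fwrite() and fread(). The albums table and its values are invented for this example and are not the workshop dataset.

```r
library(data.table)

# Toy data invented for this sketch (not the workshop dataset)
albums <- data.table(
  album = c("A", "B", "C"),
  genre = c("Latin", "Rock", "Latin"),
  rank  = c(10L, 250L, 499L)
)

# Write the table to a temporary CSV file and read it back
path <- tempfile(fileext = ".csv")
fwrite(albums, path)
albums_again <- fread(path)

# fread() returns a data.table, which is also a data.frame
class(albums_again)
```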
If you can, try the code we will show you on the screen.
For the exercises, work in teams and use the sticky notes!
🟪 “I’m stuck and need help!”
🟩 “I finished the exercise”
The general data.table syntax looks like this:
DT[i, j, by]
Where DT is a data.table object, the i argument is used for filtering and joining operations, the j argument can summarise and transform, and the by argument defines the groups to which to apply these operations.
You can read the syntax as “In these rows, do this, grouped by that”.
It is very concise but easy to read (sometimes).
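To make "In these rows, do this, grouped by that" concrete, here is a minimal sketch on a toy table. The column names and values are made up for illustration only.

```r
library(data.table)

# Toy data invented for this sketch
albums <- data.table(
  genre        = c("Latin", "Rock", "Latin", "Rock", "Rock"),
  release_year = c(1975L, 1969L, 1998L, 1971L, 2005L),
  rank         = c(120L, 3L, 410L, 45L, 300L)
)

# i:  keep only the rows released before 2000
# j:  compute the mean rank and count the rows (.N)
# by: do it separately for each genre
result <- albums[release_year < 2000,
                 .(mean_rank = mean(rank), n_albums = .N),
                 by = genre]
result
```

Here the 2005 album is dropped by i before j is evaluated, so each genre ends up summarising two albums.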
spotify_popularity) higher in the 2003 ranking? Compute the correlation (hint: there are missing values, so you will need to use use = "complete.obs").
How many bands in the Latin genre appeared in the ranking of 2020?
What is the average birth year for each artist or band?
What is the mean ranking of each album over the years?
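As a hint for the correlation question, here is a small sketch of how use = "complete.obs" behaves. It uses a toy table with invented columns a and b, not the exercise data.

```r
library(data.table)

# Toy data invented for this sketch; each column has one missing value
toy <- data.table(
  a = c(1, 2, 3, NA, 5),
  b = c(2, 4, 6, 8, NA)
)

# With missing values and the default settings, cor() returns NA
cor_default <- toy[, cor(a, b)]

# use = "complete.obs" keeps only the rows where both values are present
cor_complete <- toy[, cor(a, b, use = "complete.obs")]

cor_default
cor_complete
```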
Take 15 minutes.
Are bands more successful than solo artists?
rolling_stone[, is_band := artist_member_count > 1] |>
_[, .(mean_rank_2003 = mean(rank_2003, na.rm = TRUE),
mean_rank_2012 = mean(rank_2012, na.rm = TRUE),
mean_rank_2020 = mean(rank_2020, na.rm = TRUE)),
by = is_band]
# Notice that due to reference semantics, this operation adds the
# is_band column to the data.table. You can avoid this by using
# an expression in the by argument.
rolling_stone |>
_[, .(mean_rank_2003 = mean(rank_2003, na.rm = TRUE),
mean_rank_2012 = mean(rank_2012, na.rm = TRUE),
mean_rank_2020 = mean(rank_2020, na.rm = TRUE)),
by = .(is_band = artist_member_count > 1)]
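The reference semantics mentioned in the comment above can be seen on a toy table. This is a minimal sketch with invented names (original, modified, member_count). It also shows copy(), which is another way to leave the original table untouched.

```r
library(data.table)

# Toy data invented for this sketch
original <- data.table(
  artist       = c("Solo One", "Band Two", "Band Three"),
  member_count = c(1L, 4L, 3L),
  rank         = c(50L, 20L, 90L)
)

# copy() makes an independent table, so := only changes the copy
modified <- copy(original)[, is_band := member_count > 1]

"is_band" %in% names(modified)  # TRUE
"is_band" %in% names(original)  # FALSE: the original was not modified

# An expression in by groups the rows without adding a column at all
by_band <- original[, .(mean_rank = mean(rank)),
                    by = .(is_band = member_count > 1)]
by_band
```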
What is the proportion of albums recorded in a Studio and their mean position in 2020?
What is the mean number of years between an artist's debut album and the release of their first top 500 album (see the column years_between) for each genre?
What is the mean rank in each ranking year for each gender?
What is the proportion of Male to Female artists for each decade in the 2003 ranking?
Has this changed between the different ranking years? (You need to first melt and then dcast)
rolling_stone |>
melt(id.vars = c("release_year", "album_id", "artist_gender"),
measure.vars = c("rank_2003", "rank_2012", "rank_2020"),
variable.name = "rank_year",
value.name = "rank") |>
_[, .(N = sum(!is.na(rank))),
by = .(decade = floor(release_year / 10) * 10, artist_gender, rank_year)] |>
dcast(decade + rank_year ~ artist_gender, value.var = "N") |>
_[, ratio := Male / (Male + Female)] |>
ggplot(aes(decade, ratio)) +
geom_line(aes(color = rank_year))
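To see what each reshaping step in the pipeline above does, here is a minimal sketch on a tiny table where the intermediate results are easy to inspect. The values are invented. Only the column names mirror the ones used above.

```r
library(data.table)

# Toy data invented for this sketch: one row per album, one column per ranking year
wide <- data.table(
  album_id      = 1:3,
  artist_gender = c("Male", "Female", "Male"),
  rank_2003     = c(5L, NA, 100L),
  rank_2020     = c(12L, 40L, NA)
)

# melt(): wide to long, one row per album and ranking year
long <- melt(wide,
             id.vars = c("album_id", "artist_gender"),
             measure.vars = c("rank_2003", "rank_2020"),
             variable.name = "rank_year",
             value.name = "rank")

# Count the albums that actually appear in each ranking
counts <- long[, .(N = sum(!is.na(rank))),
               by = .(artist_gender, rank_year)]

# dcast(): long to wide, one column per gender
wide_counts <- dcast(counts, rank_year ~ artist_gender, value.var = "N")
wide_counts
```

Once the counts for each gender sit in separate columns, a ratio such as Male / (Male + Female) can be computed in a single j expression.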
In 2023-2025, the National Science Foundation has provided funds to support expanding the ecosystem of users and contributors around data.table.