Find the materials here: paocorrales.github.io/intro-datatable
All materials in this course are under the license CC-BY-SA 4.0.
Pao, Elio
Share your name and your favourite ice-cream flavour.
You will need:
At its core, data.table provides an enhanced version of data.frame that is faster, more memory efficient and can be manipulated using a more concise syntax.
It also provides a whole set of extra functions for reading from and writing to tabular files, reshaping data between long and wide formats, joining datasets and much more.
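As a small illustration of the file reading and writing functions mentioned above, here is a minimal sketch using fwrite() and fread(). The toy table and the temporary file path are made up for this example.

```r
library(data.table)

# Hypothetical toy data, not part of the course materials.
scores <- data.table(id = 1:3, score = c(2.5, 3.0, 4.5))

# Write the table to a temporary CSV file and read it back.
path <- tempfile(fileext = ".csv")
fwrite(scores, path)
scores_again <- fread(path)

# fread() returns a data.table with the same columns as the original.
print(scores_again)
```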
If you can, try the code we will show you on the screen.
If you’re stuck or need help, tell us in the chat.
Most R functions and methods use copy-on-modify.
This code returns a new tibble that is a copy of my_data with a new column, but it doesn’t modify my_data.
data.table uses modify-in-place, which means that objects are not copied when modified. This code doesn’t create a new copy of my_data; instead, it modifies my_data directly.
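The contrast between the two behaviours can be sketched as follows. This is an illustrative example, not the code from the slides: the toy objects my_df and my_data are assumptions, and base R's transform() stands in for a copy-on-modify function.

```r
library(data.table)

# Copy-on-modify: transform() returns a modified copy (toy data, assumed).
my_df <- data.frame(x = 1:3)
new_df <- transform(my_df, y = x * 2)
names(my_df)    # my_df is unchanged and still has only the column x
names(new_df)   # the copy has both x and y

# Modify-in-place: := adds the column to my_data itself, without assignment.
my_data <- data.table(x = 1:3)
my_data[, y := x * 2]
names(my_data)  # my_data now has both x and y
```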
The general data.table syntax looks like this:
DT[i, j, by]
Where DT is a data.table object, the i argument is used for filtering and joining operations, the j argument can summarise and transform, and the by argument defines the groups to which to apply these operations.
You can read the syntax as “In these rows, do this, grouped by that”.
It is very concise but easy to read (sometimes).
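To make the three arguments concrete, here is a minimal sketch on a made-up table (the column names group and value are assumptions for illustration only). It reads as "in the rows where value is greater than 1, compute the mean of value, grouped by group".

```r
library(data.table)

# Hypothetical toy data.
DT <- data.table(group = c("a", "a", "b", "b"),
                 value = c(1, 2, 3, 10))

# i: keep rows with value > 1
# j: compute the mean of value
# by: do it separately for each group
res <- DT[value > 1, .(mean_value = mean(value)), by = group]
print(res)
```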
[...] (sort_name or clean_name)?
[...] (ave_age_at_top_500) that are in the top 30 in 2020 (rank_2020).
[...] (spotify_popularity) higher in the 2003 ranking (column rank_2003)? Compute the correlation between the two columns (hint: there are missing values, so you will need to use use = "complete.obs").
[...] (genre) appeared in the ranking of 2020?
[...] (artist_gender) bands were included in the 2020 ranking (column rank_2020)?
[...] (artist_birth_year_sum and artist_member_count are relevant)?
[...] rank_2003, rank_2012 and rank_2020 and album_id).
Take 15 minutes.
Are bands (artist_member_count greater than 1) more successful than solo artists?
rolling_stone[, is_band := artist_member_count > 1] |>
  _[, .(mean_rank_2003 = mean(rank_2003, na.rm = TRUE),
        mean_rank_2012 = mean(rank_2012, na.rm = TRUE),
        mean_rank_2020 = mean(rank_2020, na.rm = TRUE)),
    by = is_band]
# Notice that due to reference semantics, this operation adds the
# is_band column to the data.table. You can avoid this by using
# an expression in the by argument.
rolling_stone |>
  _[, .(mean_rank_2003 = mean(rank_2003, na.rm = TRUE),
        mean_rank_2012 = mean(rank_2012, na.rm = TRUE),
        mean_rank_2020 = mean(rank_2020, na.rm = TRUE)),
    by = .(is_band = artist_member_count > 1)]
[...] (years_between) for each genre?
What is the mean rank in each ranking year for each gender?
What is the proportion of Male to Female artists for each decade in the 2003 ranking?
Has this changed between the different ranking years? (You need to first melt and then dcast)
rolling_stone |>
  melt(id.vars = c("release_year", "album_id", "artist_gender"),
       measure.vars = c("rank_2003", "rank_2012", "rank_2020"),
       variable.name = "rank_year",
       value.name = "rank") |>
  _[, .(N = sum(!is.na(rank))),
    by = .(decade = floor(release_year / 10) * 10, artist_gender, rank_year)] |>
  dcast(decade + rank_year ~ artist_gender, value.var = "N") |>
  _[, ratio := Male / (Male + Female)] |>
  ggplot(aes(decade, ratio)) +
  geom_line(aes(color = rank_year))
In 2023-2025, the National Science Foundation has provided funds to support expanding the ecosystem of users and contributors around data.table.