Data manipulation

Creating summary tables

In spreadsheet, pivot tables or dynamic tables are powerful tools for summarizing data in different ways. We can create these tables using the group_by() and summarize() functions from the tidyverse package.

Let’s take a look at our report, right after we read the data, we create a summary table with the number of cases and the average body mass for each specie in our data set.

We can use group_by() to group our data by species and summarize() to calculate the number of cases and the average body_mass_g for each grouping.

penguins |> 
  group_by(species) |> 
  summarise(count = n(),
            mean_body_mass = mean(body_mass_g, na.rm = TRUE))
# A tibble: 3 × 3
  species   count mean_body_mass
  <chr>     <int>          <dbl>
1 Adelie      152          3701.
2 Chinstrap    68          3733.
3 Gentoo      124          5076.

The function n() returns the number of cases and the function mean computes the average. The parameter na.rm = TRUE tells the function to remove non available values (NA) before performing the calculation.

Question: What if you don’t use group_by() before summarise()? Let’s try it and discuss what’s going on.

penguins |> 
  summarise(count = n(),
            mean_body_mass = mean(body_mass_g, na.rm = TRUE))
# A tibble: 1 × 2
  count mean_body_mass
  <int>          <dbl>
1   344          4202.

So if we don’t group_by first, we will get a single summary statistic (n and mean in this case) for the whole data set.

Now, we want the average of the body mass in kilograms and with two decimals. We can change our summarise() call to get that calculation.

penguins |> 
  group_by(species) |> 
  summarise(count = n(),
            mean_body_mass = round(mean(body_mass_g/1000, na.rm = TRUE), 2))
# A tibble: 3 × 3
  species   count mean_body_mass
  <chr>     <int>          <dbl>
1 Adelie      152           3.7 
2 Chinstrap    68           3.73
3 Gentoo      124           5.08

First we divide body mass by 1000 to transform from grams to kilograms and then we use the function round() to get only two decimals.

Now we are at the point where we actually want to save this summary information as a variable so we can use it in further analyses and formatting.

So let’s add a variable assignment to that first line:

summary <- penguins |> 
  group_by(species) |> 
  summarise(count = n(),
            mean_body_mass = round(mean(body_mass_g/1000, na.rm = TRUE), 2))

Formatig the table

To make our table look better in a report, we have to change some of its elements. For example, the name of the variables should be informative for a person (at the moment they are useful names for a computer), and we can add a title to it and change the alignment of the numbers.

We can do that with the gt package.

summary |> 
  gt() |> 
  cols_label(
    species = "Specie",
    count = "Count",
    mean_body_mass = "Mean Body Mass"
  ) |> 
  cols_align(
    align = "center",
    columns = c(count, mean_body_mass)
  )
Specie Count Mean Body Mass
Adelie 152 3.70
Chinstrap 68 3.73
Gentoo 124 5.08

We can do a lot more with other functions (or “verbs”) of the tidyverse. Let’s see a couple more

Creating new variables

There are a lot of times in which we want to operate beyond the original data. This is often done by adding a column. We do this with the mutate() function.

We can add a new variable (called prop) representing the proportion of penguins in each species with by using mutate().

penguins |> 
  group_by(species) |> 
  summarise(count = n(),
            mean_body_mass = round(mean(body_mass_g/1000, na.rm = TRUE), 2)) |>
  mutate(prop = count/nrow(penguins))
# A tibble: 3 × 4
  species   count mean_body_mass  prop
  <chr>     <int>          <dbl> <dbl>
1 Adelie      152           3.7  0.442
2 Chinstrap    68           3.73 0.198
3 Gentoo      124           5.08 0.360

Filtering data

In our report we also make plot by species. To do this we need to keep only the data of the specie we want to analyse. We use the filter() function to filter rows in a data frame, this function returns a new data frame with only the rows that satisfy a condition.

We can use logical operators (>, <, >=, <=, ==) to create conditions from numeric variables. These operators are also useful for dates and text strings. Look at the code to see a condition for text type of data.

penguins |> 
  filter(species == "Gentoo")
# A tibble: 124 × 8
   species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <chr>   <chr>           <dbl>         <dbl>             <dbl>       <dbl>
 1 Gentoo  Biscoe           46.1          13.2               211        4500
 2 Gentoo  Biscoe           50            16.3               230        5700
 3 Gentoo  Biscoe           48.7          14.1               210        4450
 4 Gentoo  Biscoe           50            15.2               218        5700
 5 Gentoo  Biscoe           47.6          14.5               215        5400
 6 Gentoo  Biscoe           46.5          13.5               210        4550
 7 Gentoo  Biscoe           45.4          14.6               211        4800
 8 Gentoo  Biscoe           46.7          15.3               219        5200
 9 Gentoo  Biscoe           43.3          13.4               209        4400
10 Gentoo  Biscoe           46.8          15.4               215        5150
# ℹ 114 more rows
# ℹ 2 more variables: sex <chr>, year <dbl>

Activity:

  • Write code to filter all the penguins in the Torgersen island.

  • Now write the code to filter all the penguins with flipper lenght greater than 200.