In spreadsheet, pivot tables or dynamic tables are powerful tools for summarizing data in different ways. We can create these tables using the group_by() and summarize() functions from the tidyverse package.
Let’s take a look at our report, right after we read the data, we create a summary table with the number of cases and the average body mass for each specie in our data set.
We can use group_by() to group our data by species and summarize() to calculate the number of cases and the average body_mass_g for each grouping.
The function n() returns the number of cases and the function mean computes the average. The parameter na.rm = TRUE tells the function to remove non available values (NA) before performing the calculation.
Question: What if you don’t use group_by() before summarise()? Let’s try it and discuss what’s going on.
To make our table look better in a report, we have to change some of its elements. For example, the name of the variables should be informative for a person (at the moment they are useful names for a computer), and we can add a title to it and change the alignment of the numbers.
We can do a lot more with other functions (or “verbs”) of the tidyverse. Let’s see a couple more
Creating new variables
There are a lot of times in which we want to operate beyond the original data. This is often done by adding a column. We do this with the mutate() function.
We can add a new variable (called prop) representing the proportion of penguins in each species with by using mutate().
In our report we also make plot by species. To do this we need to keep only the data of the specie we want to analyse. We use the filter() function to filter rows in a data frame, this function returns a new data frame with only the rows that satisfy a condition.
We can use logical operators (>, <, >=, <=, ==) to create conditions from numeric variables. These operators are also useful for dates and text strings. Look at the code to see a condition for text type of data.
Write code to filter all the penguins in the Torgersen island.
Now write the code to filter all the penguins with flipper lenght greater than 200.
Source Code
---title: "Data manipulation"format: html: toc: true toc-location: left code-tools: source: true highlight-style: tango---```{r setup, include=FALSE}knitr::opts_chunk$set( echo = TRUE, message = FALSE, warning = FALSE)library(tidyverse)library(gt)penguins <- read_csv("data/penguins.csv")```## Creating summary tablesIn spreadsheet, pivot tables or dynamic tables are powerful tools for summarizing data in different ways. We can create these tables using the `group_by()` and `summarize()` functions from the tidyverse package. Let's take a look at our report, right after we read the data, we create a summary table with the number of cases and the average body mass for each specie in our data set.We can use `group_by()` to group our data by **species** and `summarize()` to calculate the number of cases and the average **body_mass_g** for each grouping.```{r}penguins |>group_by(species) |>summarise(count =n(),mean_body_mass =mean(body_mass_g, na.rm =TRUE))```The function `n()` returns the number of cases and the function `mean` computes the average. The parameter `na.rm = TRUE` tells the function to remove non available values (`NA`) before performing the calculation.> Question: What if you don’t use `group_by()` before `summarise()`? Let’s try it and discuss what’s going on.```{r}penguins |>summarise(count =n(),mean_body_mass =mean(body_mass_g, na.rm =TRUE))```So if we don't group_by first, we will get a single summary statistic (n and mean in this case) for the whole data set.Now, we want the average of the body mass in kilograms and with two decimals. We can change our `summarise()` call to get that calculation. ```{r}penguins |>group_by(species) |>summarise(count =n(),mean_body_mass =round(mean(body_mass_g/1000, na.rm =TRUE), 2))```First we divide body mass by 1000 to transform from grams to kilograms and then we use the function `round()` to get only two decimals.Now we are at the point where we actually want to save this summary information as a variable so we can use it in further analyses and formatting.So let's add a variable assignment to that first line:```{r}summary <- penguins |>group_by(species) |>summarise(count =n(),mean_body_mass =round(mean(body_mass_g/1000, na.rm =TRUE), 2))```## Formatig the tableTo make our table look better in a report, we have to change some of its elements. For example, the name of the variables should be informative for a person (at the moment they are useful names for a computer), and we can add a title to it and change the alignment of the numbers.We can do that with the **gt** package.```{r}summary |>gt() |>cols_label(species ="Specie",count ="Count",mean_body_mass ="Mean Body Mass" ) |>cols_align(align ="center",columns =c(count, mean_body_mass) )```We can do a lot more with other functions (or "verbs") of the tidyverse. Let's see a couple more## Creating new variablesThere are a lot of times in which we want to operate beyond the original data. This is often done by adding a column. We do this with the `mutate()` function.We can add a new variable (called `prop`) representing the proportion of penguins in each **species** with by using `mutate()`. ```{r}penguins |>group_by(species) |>summarise(count =n(),mean_body_mass =round(mean(body_mass_g/1000, na.rm =TRUE), 2)) |>mutate(prop = count/nrow(penguins))```## Filtering dataIn our report we also make plot by species. To do this we need to keep only the data of the specie we want to analyse. We use the `filter()` function to filter rows in a data frame, this function returns a new data frame with only the rows that satisfy a condition.We can use logical operators (`>`, `<`, `>=`, `<=`, `==`) to create conditions from numeric variables. These operators are also useful for dates and text strings. Look at the code to see a condition for text type of data.```{r}penguins |>filter(species =="Gentoo")```> Activity: >> * Write code to filter all the penguins in the Torgersen island.>> * Now write the code to filter all the penguins with flipper lenght greater than 200.