---
title: "Introduction to survey data tabulation in R"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Introduction to survey data tabulation in R}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
# R for beginners
If you are new to R, you could have a look at one of the many introductions.
For instance:
* [this](https://rstudio-education.github.io/hopr/) is nice, but also becomes
challenging quite quickly.
* [R for Data Science](https://r4ds.had.co.nz) is the most complete reference.
* [here](https://paulvanderlaken.com/2017/10/18/learn-r/) is another resource,
but already a bit outdated (2017 is a long time in the R world).
* [swirl](https://swirlstats.com/students.html) is an interactive approach.
* [in this reddit](https://www.reddit.com/r/rstats/comments/eh5e4q/what_is_the_best_way_to_learn_r_for_a_complete/) you might still find other resources.
In the beginning, R can be very frustrating and it takes time.
# Setup
Click here to show steps in setup
Load packages needed for this document:
```{r setup}
library(datadaptor)
library(tidyr)
library(stringr)
suppressMessages(library(dplyr))
```
```{r child="../man/fragments/gt_table_stylers.Rmd", include=FALSE}
```
# Survey data
## Toy survey
We'll first define a small toy survey:
### Q1. what's your age? {#Q1}
### Q2. What's your gender? {#Q2}
* male
* female
* non-binary
### Q3. Select all that applies {#Q3}
* I'm curious
* I'm cautious
* I'm lazy
* I don't care
### Q4. Please rate the following items {#Q4}
*Use a scale from 1 = "I really like it" to 5 = "I don't like it at all".*
* Sports
* Watching TV
* Reading
## Survey data in R
The raw data of surveys is often stored in SPSS .sav files.
SPSS files can be imported in R with the [haven](https://haven.tidyverse.org/)
package. This yields labelled
tabular data in data frames containing variables of type `haven::labelled()`.
Another nice introduction to labelled data in R can be found
[here](https://www.pipinghotdata.com/posts/2020-12-23-leveraging-labelled-data-in-r/).
### Labelled variables {#labvars}
Before looking at the whole table, let's first have a look how
one labelled variable (corresponding to one column in tabular survey data) looks in R.
Let's define a toy labelled variable `v`:
```{r}
v <- haven::labelled(
1:3,
label = "Variable label of v",
labels = c(
"value label of value 1" = 1,
"value label of value 2" = 2,
"value label of value 3" = 3
)
)
```
... and then print it:
```{r}
v
```
We can have a different look at the structure of this variable using the function `str()`:
```{r}
str(v)
```
`v` is a labelled integer containing the values
`r glue::glue_collapse(v, sep = ", ", last = " and ")`.
The label information is stored in extra attributes of the variable. First, there's
the **variable label** in the attribute "label", containing the string "`r attr(v, "label")`":
```{r}
attr(v, "label")
```
Second, a labelled variable can contain **value labels** in the attribute "labels":
```{r}
attr(v, "labels")
```
This is a named vector where label strings can be assigned to values, in our case
the `r length(attr(v, "labels"))` strings
`r paste0('"', names(attr(v, "labels")), '"') %>% glue::glue_collapse(sep = ", ", last = " and ")`
are assigned to the `r length(attr(v, "labels"))` values
`r attr(v, "labels") %>% glue::glue_collapse(sep = ", ", last = " and ")`.
In other words, a categorical variable with value
labels corresponds to a numeric vector with a kind of lookup table where a
value is assigned to each of the possible responses.
When a labelled variable inside a `tibble()` is printed, (parts of) the value labels
are printed in brackets ([...]) next to their values (cf. the print output of
`df` [below](#dfdef)):
```{r}
tibble(v)
```
## Making up a dataset of the toy survey {#dfdef}
```{r, include=FALSE}
n_resps <- 10
```
We define a small toy dataframe `df` showing how the responses of `r n_resps`
respondents to the above questionnaire could look like in R.
Click here to show code generating this dataframe
```{r}
set.seed(123)
tibble()
q1 <- haven::labelled(sample(18:88, 10, TRUE), label = c("what's your age?"), labels = NULL)
q2 <- haven::labelled(sample(c(NA, 1:3), 10, TRUE), label = c("What's your gender?"), labels = c(male = 1, female = 2, "non-binary" = 3))
q3_1 <- haven::labelled(sample(0:1, 10, TRUE), label = c("Select all that applies - I'm curious"), labels = c(unselected = 0, selected = 1))
q3_2 <- haven::labelled(sample(0:1, 10, TRUE), label = c("Select all that applies - I'm cautios"), labels = c(unselected = 0, selected = 1))
q3_3 <- haven::labelled(sample(0:1, 10, TRUE), label = c("Select all that applies - I'm lazy"), labels = c(unselected = 0, selected = 1))
q3_4 <- haven::labelled(sample(0:1, 10, TRUE), label = c("Select all that applies - I don't care"), labels = c(unselected = 0, selected = 1))
q4_vallabs <- c("1 = I like it" = 1, "2" = 2, "3" = 3, "4" = 4, "5 = I don`t like it at all" = 5)
q4_answers <-
q4_text <- 'Please rate the following items from 1 = "I really like it" to 5 = "I don`t like it at all'
q4_1 <- haven::labelled(sample(1:5, 10, TRUE), label = paste(q4_text, "Sports", sep = " - "), labels = q4_vallabs)
q4_2 <- haven::labelled(sample(1:5, 10, TRUE), label = paste(q4_text, "Watching TV", sep = " - "), labels = q4_vallabs)
q4_3 <- haven::labelled(sample(1:5, 10, TRUE), label = paste(q4_text, "Reading", sep = " - "), labels = q4_vallabs)
df <- tibble(id = 1:10, q1, q2, q3_1, q3_2, q3_3, q3_4, q4_1, q4_2, q4_3)
```
```{r}
df
```
Each row of the dataset `df` corresponds to one respondent
(specified by the first column `id`). Their answers to the 4 questions are
coded in the other columns. These columns are [labelled variables](#labvars) (of type `haven::labelled()`).
# Question types
Let's have a closer look at the different types of survey questions in this
questionnaire:
## Numeric variable
In addition to the values of age stored in [Q1](#Q1), the variable has the variable label
"`r attr(q1, "label")`":
```{r}
q1
```
## Categorical variable
Question [Q2](#Q2) in the questionnaire asks for a single answer of the pre-defined
choices in the value labels:
```{r}
q2
```
Like `q1`, the variable `q2` also contains numeric values (corresponding
to these value labels) and has a variable label.
If the respondent didn't reply to the question, the information is missing,
which is often stored as `NA` in R (as in the rows
`r which(is.na(q2)) %>% glue::glue_collapse(sep = ", ", last = " and ")`
of `q2`.
## Multiple choice questions
In contrast to [Q2](#Q2), the question [Q3](#Q3) allows for multiple answers. This information
usually isn't stored in a single variable. Instead you can define a variable for every
possible response. For instance, the first response option
"`r attr(q3_1, "label")`" is stored in the variable `q3_1`:
```{r}
q3_1
```
This is a binary numeric variable with values
* 0 (meaning "unselected"), and
* 1 (meaning "selected").
```{r}
attr(q3_1, "labels")
```
## Multiple answers on the same scale
A common pattern in questionnaires is to ask for a rating (or just a numeric
value) of multiple items on the same scale like in [Q4](#Q4).
This information can also be put in categorical (or numeric) variables. The
responses to the first item are stored in the first variable:
```{r}
q4_1
```
# Tabulation of frequency counts of survey data
When analyzing survey data it is of great interest to count how often the
responses occur. We'll have a look at some recurring tasks using the examples in
our toy survey.
There are also other R packages that can help to modify labelled data like
[labelled](https://larmarange.github.io/labelled/) or
[sjlabelled](https://strengejacke.github.io/sjlabelled/articles/labelleddata.html).
## Numeric variable
*This kind of tabulation is done with the type* **`cat`** *in the type column of the
"Questions" sheet in the Excel mapping file.*
Counting the values in numeric variables is straight-forward:
```{r}
df %>% count(q1)
```
In this case it might be more interesting to summarize the data to age categories,
like for instance
* 18 - 39
* 40 - 59
* 60 +
We can **rec**ode the variable with datadaptor, using
the `#REC` command on one of the `Free` sheets.
```{r echo=FALSE}
cdbs <- tribble(
~A, ~B, ~C, ~D, ~E,
"#REC", "q1", "kq1", "Age category", "",
"", "18", "39", "1", "18 - 39",
"", "40", "59", "2", "40 - 59",
"", "60", "Inf", "3", "60+",
)
a_str <- "- The #REC keyword in the first row sets up this command block.
- The following rows are empty."
b_str <- "- The string in the first row shows the name of the original variable that's recoded.
-The following rows specify the lower boundaries of the intervals summarising a value range of the original varable."
c_str <- "- The string in the first row shows the name of the created recoded variable.
- The following rows specify the upper boundaries of the intervals summarising a value range of the original varable."
d_str <- "- The string in the first row is assigned as the variable label of the recoded variable (if empty the variable label of the original variable is copied).
- The following rows specify the labelled values of the created variable."
e_str <- "- The first row is empty.
- The following rows specify the the value labels of the created variable."
excelish(cdbs) %>%
gt::cols_label(
A = gt::html(with_tooltip("A", a_str)),
B = gt::html(with_tooltip("B", b_str)),
C = gt::html(with_tooltip("C", c_str)),
D = gt::html(with_tooltip("D", d_str)),
E = gt::html(with_tooltip("E", e_str)),
) %>%
gt::tab_header(
"#REC command block",
"recode a numerical variable into a labelled variable summarising intervals",
) %>%
gt::tab_footnote(
footnote = "hover over the column names to show the required data in these cells",
locations = gt::cells_column_labels(
columns = everything()
)
)
```
Be careful, if you want to enter a string like "1 - 20" into excel because it
might turn it into a date 🙃.
Additionally, you might want to add statistics like the
mean or the median to the table:
```{r}
mean(q1)
median(q1)
```
If you want to put this table in a report, the question text should also be added
somewhere. In our case, you could get this text from the variable label:
```{r}
attr(q1, "label")
```
But usually it is better to copy the question title from the questionnaire, as
details of surveys tend to change during projects and the labels in the data
file might not be up-to-date.
## Categorical variable
*This kind of tabulation is also done with the type* **`cat`** *in the type column of the
"Questions" sheet in the Excel mapping file.*
The counting of the different replies in a categorical variable is the same as
for a numeric variable:
```{r}
df %>% count(q2)
```
If we want to assign a text to the cases where no answer was given, we can
directly modify the dataset `df`:
```{r}
df$q2[is.na(df$q2)] <- 4
```
Now, the cases where `q2` was NA before were set to 4:
```{r}
df %>% count(q2)
```
In order to label this new value 4, we need to modify the value labels of the variable:
```{r}
attr(df$q2, "labels") <- c(
attr(df$q2, "labels"),
"No answer" = 4
)
```
Now, we added the value label
"No answer" the new value 4:
```{r}
df %>% count(q2)
```
Alternatively, we can modify the variable with datadaptor, using
the `#IF` command on one of the `Free` sheets.
To **a**dd a **val**ue **l**abel with datadaptor, we can use
the `#AVALL` command on one of the `Free` sheets:
```{r echo=FALSE}
cdbs <- tribble(
~A, ~B, ~C,
"#IF", "is.na(q2)", "q2 = 4",
"#AVALL", "q2", "",
"", "4", "No answer",
)
excelish(cdbs)
```
## Multiple choice questions
*This kind of tabulation is done with the type* **`mdg`** *in the type column of the
"Questions" sheet in the Excel mapping file.*
Tabulating the answers of multiple choice questions as categorical variables
doesn't lead to a nice output:
```{r}
df %>% count(q3_1)
df %>% count(q3_2)
# ...
```
Instead, we'll first transform `df_q3` the data relevant for [Q3](#Q3):
```{r}
df_q3 <- df %>% select(q3_1:q3_4)
df_q3
```
... into a long format:
```{r}
df_q3_long <- df_q3 %>%
pivot_longer(everything(), names_to = "var")
df_q3_long
```
... and we'll only consider those cases where those variables were = 1 (selected):
```{r}
df_q3_long <- df_q3_long %>%
filter(value == 1)
df_q3_long
```
Now we can do the counting as above:
```{r}
df_q3_counts <- df_q3_long %>% count(var)
df_q3_counts
```
However, we want to show the variable labels in this frequency table. We can
obtain the variable labels from the dataset:
```{r}
varlabs <- df_q3 %>% gen_var_table()
varlabs
```
We just want to get rid of the questionnaire text and remove everything (`".*"`)
before the hyphen surrounded by spaces (`" - "`) in these strings:
```{r}
varlabs <- varlabs %>% mutate(varlab = str_remove(varlab, ".* - "))
varlabs
```
Now, we can add this information to our frequency table:
```{r}
full_join(
varlabs,
df_q3_counts
)
```
Alternatively, we can directly modify the variable labels with datadaptor, using
the `#VARL` command on one of the `Free` sheets:
```{r echo=FALSE}
cdbs <- tribble(
~A, ~B, ~C,
"#VARL", "q3_1", "I'm curious",
"#VARL", "q3_2", "I'm cautious",
"#VARL", "q3_3", "I'm lazy",
"#VARL", "q3_4", "I don't care",
)
excelish(cdbs)
```
## Multiple answers on the same scale
*This kind of tabulation is done with the type* **`mw`** *in the type column of the
"Questions" sheet in the Excel mapping file.*
The variables in [Q4](#Q4) are also categorical variables and it's of interest to plot
them each on their own, e.g., the first one:
```{r}
df %>% count(q4_1)
```
As all the answers to [Q4](#Q4) share the same scale, it's also useful to tabulate
statistics (such as the mean) in a single table:
```{r}
df_q4 <- df %>%
select(q4_1:q4_3)
df_q4_means <- df_q4 %>%
pivot_longer(everything(), names_to = "var") %>%
group_by(var) %>%
summarise(mean = mean(value))
df_q4_means
```
We'll also add the modified variable labels to the our summary of means table, as we already did for
the multiple choice question [Q3](#Q3)
```{r}
varlabs <- df_q4 %>% gen_var_table()
varlabs <- varlabs %>% mutate(varlab = str_remove(varlab, ".* - "))
varlabs
```
```{r}
full_join(
varlabs,
df_q4_means
)
```