This document is a guide to the dta package. It describes the bundled
sample datasets and dictionaries and the available functions, with
usage examples for each. The package is designed for data management,
transformation, and exploratory data analysis.
Sample datasets and dictionaries
The package includes several sample datasets and dictionaries. Below are descriptions and examples of how to load a few of them.
Datasets
data_bmi
: Sample data for body mass index (BMI)
calculations.
data("data_bmi")
dta_gtable(data_bmi)
id | age | height | weight |
---|---|---|---|
STM/4921 | 50 | 1.64 | 59 |
STM/4396 | 34 | 1.98 | 57 |
STM/7908 | 50 | 1.95 | 84 |
STM/7243 | 39 | 1.52 | 63 |
STM/4801 | 52 | 1.69 | 65 |
STM/5134 | 50 | 1.71 | 73 |
STM/7138 | 35 | 1.73 | 46 |
STM/6802 | 72 | 1.98 | 70 |
STM/4420 | 42 | 1.62 | 103 |
STM/6351 | 40 | 1.89 | 96 |
STM/4933 | 38 | 1.91 | 67 |
STM/4303 | 37 | 1.56 | 75 |
STM/7465 | 45 | 1.62 | 44 |
STM/4587 | 67 | 1.38 | 51 |
STM/5320 | 44 | 1.37 | 63 |
data_cancer
: Sample data: Survival of cancer patients in
days.
data("data_cancer")
dta_gtable(data_cancer)
cancer_type | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | V11 | V12 | V13 | V14 | V15 | V16 | V17 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
stomach | 124 | 42 | 25 | 45 | 412 | 51 | 1112 | 46 | 103 | 876 | 146 | 340 | 396 | ||||
bronchus | 81 | 461 | 20 | 450 | 246 | 166 | 63 | 64 | 155 | 859 | 151 | 166 | 37 | 223 | 138 | 72 | 245 |
colon | 248 | 372 | 189 | 1843 | 180 | 537 | 519 | 455 | 406 | 365 | 942 | 776 | 372 | 163 | 101 | 20 | 283 |
ovary | 1234 | 89 | 201 | 356 | 2970 | 456 |
data_misspelled
: Sample dataset with general information whose categorical values
contain misspellings and inconsistent capitalization.
data("data_misspelled")
dta_gtable(data_misspelled)
id | region | age | height | weight | blood_group | marital_status | education | ses | r | python | sas | stata | spss | excel |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
STM/7539 | southern | 46 | 1.85 | 77 | ba | Other | BSC | medium | n | n | yes | tru | n | n |
STM/7993 | South | 45 | 1.64 | 53 | AB | maried | Bsc | Midle | y | N | n | no | tru | N |
STM/7387 | southern | 37 | 1.61 | 75 | B | single | BSC | Middle | N | tru | no | Yes | No | n |
STM/5598 | West | 45 | 1.80 | 69 | B | Single | MSc | low | y | N | y | fasle | no | Yes |
STM/5901 | South | 51 | 1.81 | 53 | A+ | Others | Bachelors | Medium | no | yes | Yes | n | tru | Yes |
STM/7529 | North-East | 56 | 1.35 | 56 | O | Single | Bsc | Mediam | yes | fasle | no | yes | tru | no |
STM/7238 | North East | 37 | 1.44 | 89 | o | Single | Bachelers | High | No | n | Yes | No | yes | no |
STM/5417 | North-East | 50 | 1.67 | 73 | o | single | MSc | High | No | y | Yes | tru | N | yes |
STM/5907 | Central | 38 | 1.48 | 86 | o | Other | Bachelers | Midle | N | tru | y | yes | fasle | n |
STM/7877 | Central | 48 | 1.71 | 76 | O | Others | Bachelors | low | N | N | tru | n | fasle | no |
STM/4762 | South | 41 | 1.96 | 81 | ab | Singel | Bachelors | Middle | fasle | tru | no | no | no | no |
STM/7345 | North-East | 24 | 1.98 | 77 | a | Others | MSc | medium | no | Yes | n | No | N | Yes |
STM/6968 | wset | 34 | 1.98 | 75 | A+ | Other | Bachelers | low | yes | N | fasle | y | Yes | tru |
STM/5816 | West | 43 | 1.76 | 75 | B | maried | PhD | Medium | tru | y | n | Yes | tru | fasle |
STM/7707 | southern | 63 | 1.92 | 67 | o | maried | Bsc | Medium | No | yes | tru | Yes | No | tru |
STM/7449 | North East | 59 | 1.69 | 74 | AB | Other | MSc | Midle | N | fasle | no | no | N | No |
STM/6853 | southern | 35 | 1.43 | 72 | A | Married | Doctoral | Midle | tru | fasle | no | y | tru | fasle |
STM/5634 | North East | 22 | 1.63 | 72 | o | maried | Doctorate | Medium | No | y | N | y | No | tru |
STM/4584 | North East | 62 | 1.92 | 87 | AB | Other | Masters | Midle | Yes | fasle | yes | tru | yes | yes |
STM/4946 | southern | 41 | 1.40 | 93 | O | Singel | PhD | medium | fasle | N | tru | yes | n | n |
STM/6798 | North East | 60 | 1.95 | 109 | A | Married | Bachelors | low | No | N | N | Yes | y | Yes |
STM/4377 | North East | 44 | 1.50 | 44 | A | Others | BSC | low | yes | No | n | fasle | yes | n |
STM/5435 | Central | 30 | 1.96 | 58 | ba | single | MSc | Hihg | fasle | N | fasle | No | Yes | fasle |
STM/5562 | southern | 71 | 1.95 | 74 | AB | Others | Bsc | Midle | Yes | No | tru | Yes | No | yes |
STM/7617 | North East | 53 | 1.91 | 53 | a | Others | Doctorate | Low | No | N | fasle | y | N | N |
STM/6677 | South | 60 | 1.96 | 72 | B | Single | Bachelors | medium | n | fasle | tru | N | n | N |
STM/4773 | wset | 41 | 1.39 | 74 | A | Married | Bsc | High | fasle | No | n | yes | N | N |
STM/5416 | Central | 64 | 1.98 | 59 | A | maried | Bachelors | Middle | Yes | Yes | no | fasle | fasle | no |
STM/4805 | South | 27 | 1.91 | 53 | o | Singel | Bachelors | low | fasle | No | yes | yes | n | fasle |
STM/5493 | wset | 63 | 1.38 | 66 | ba | Married | MSc | medium | Yes | Yes | no | Yes | tru | No |
STM/7650 | South | 54 | 1.88 | 56 | ba | Others | MSc | Middle | y | fasle | N | No | yes | n |
STM/5458 | wset | 61 | 1.95 | 85 | a | single | BSC | Hihg | No | Yes | N | fasle | Yes | No |
STM/6102 | South | 35 | 1.71 | 62 | o | Others | Bsc | Low | y | No | Yes | Yes | N | fasle |
STM/7508 | wset | 29 | 1.77 | 82 | ba | Married | Doctorate | Mediam | no | no | N | No | yes | fasle |
STM/4121 | southern | 37 | 1.78 | 70 | B | Others | PhD | medium | No | Yes | y | N | N | fasle |
STM/4830 | North-East | 41 | 1.69 | 65 | a | Single | Doctorate | High | No | y | fasle | tru | Yes | no |
STM/6271 | Central | 58 | 1.56 | 77 | AB | maried | MSc | Mediam | n | fasle | N | tru | n | N |
STM/7818 | North East | 61 | 1.65 | 82 | a | maried | MSc | low | n | yes | fasle | tru | fasle | no |
STM/4671 | Central | 36 | 1.56 | 58 | A+ | Singel | MSc | High | n | n | yes | fasle | N | fasle |
STM/5864 | southern | 40 | 1.95 | 76 | o | Singel | Bsc | medium | tru | fasle | n | y | no | n |
STM/4951 | North East | 52 | 1.80 | 77 | B | Others | Bachelors | medium | tru | no | n | fasle | tru | yes |
STM/6287 | wset | 25 | 1.47 | 81 | O | maried | Doctorate | Middle | yes | no | tru | tru | fasle | fasle |
STM/4340 | North East | 50 | 1.95 | 84 | ab | Other | Bachelors | Middle | Yes | n | fasle | y | yes | No |
STM/5202 | North-East | 55 | 1.49 | 52 | o | Others | Doctorate | Middle | y | Yes | n | n | n | no |
STM/7939 | wset | 49 | 1.47 | 87 | B | maried | Masters | Hihg | fasle | tru | y | tru | y | y |
STM/4201 | North-East | 35 | 1.66 | 89 | ab | single | Bachelors | Midle | fasle | no | no | N | no | fasle |
STM/6411 | Central | 32 | 1.90 | 72 | B | single | PhD | High | tru | Yes | N | No | tru | tru |
STM/4685 | South | 68 | 1.52 | 85 | a | Single | PhD | Low | N | no | tru | No | Yes | N |
STM/5343 | North East | 48 | 1.36 | 54 | ba | Other | Bachelers | High | Yes | N | n | y | yes | Yes |
STM/5129 | wset | 55 | 1.72 | 86 | ba | Others | Doctorate | low | yes | Yes | yes | Yes | fasle | No |
Dictionaries
Dictionaries are lookup tables whose values are applied to update, modify, or standardize the values in a specified dataset, so that the same transformation can be applied consistently.
dict_recode
: A dictionary with variable value mappings
with labels and ordered status.
data("dict_recode")
dta_gtable(dict_recode)
| names | values | labels | is_ordered |
|---|---|---|---|
| region | 1 | Central | 0 |
| | 2 | North East | |
| | 3 | South | |
| | 4 | West | |
| age_group | 1 | 20-29 | 1 |
| | 2 | 30-39 | |
| | 3 | 40-49 | |
| | 4 | 50-59 | |
| | 5 | 60-69 | |
| | 6 | 70+ | |
| blood_group | 1 | A | 0 |
| | 2 | B | |
| | 3 | AB | |
| | 4 | O | |
| marital_status | 1 | Single | 0 |
| | 2 | Married | |
| | 3 | Other | |
| education | 1 | Bachelors | 1 |
| | 2 | Masters | |
| | 3 | Doctorate | |
| employed | 0 | No | 0 |
| | 1 | Yes | |
| ses | 1 | Low | 1 |
| | 2 | Middle | |
| | 3 | High | |
| language | 1 | English | 0 |
| | 2 | French | |
| | 3 | Spanish | |
| | 4 | Arabic | |
| | 5 | Mandarin | |
| | 6 | Other | |
| phone | 0 | None | 0 |
| | 1 | Samsung | |
| | 2 | Apple | |
| | 3 | Xiaomi | |
| | 4 | OnePlus | |
| | 5 | | |
| | 6 | Other | |
| transport | 1 | Walking | 0 |
| | 2 | Bicycle | |
| | 3 | Car | |
| | 4 | Bus | |
| | 5 | Train | |
| r | 0 | No | 0 |
| | 1 | Yes | |
| python | 0 | No | 0 |
| | 1 | Yes | |
| sas | 0 | No | 0 |
| | 1 | Yes | |
| stata | 0 | No | 0 |
| | 1 | Yes | |
| spss | 0 | No | 0 |
| | 1 | Yes | |
| excel | 0 | No | 0 |
| | 1 | Yes | |
dict_misspelled
: A dictionary that maps misspelled values (old) to their
corrected values (new) for each variable. In the printed table, a blank
cell repeats the value from the row above.
data("dict_misspelled")
dta_gtable(dict_misspelled)
| variable | old | new |
|---|---|---|
| region | southern | South |
| | wset | West |
| | North-East | North East |
| blood_group | ba | AB |
| | ab | |
| | o | O |
| | A+ | A |
| | a | |
| marital_status | maried | Married |
| | single | Single |
| | Singel | |
| | Others | Other |
| education | BSC | Bachelors |
| | Bsc | |
| | Bachelers | |
| | MSc | Masters |
| | PhD | Doctorate |
| | Doctoral | |
| ses | medium | Middle |
| | Midle | |
| | Mediam | |
| | low | Low |
| | Hihg | High |
| .global | fasle | No |
| | N | |
| | n | |
| | no | |
| | yes | Yes |
| | y | |
| | tru | |
Functions
Data management and transformation functions
Retrieve column names from a data frame.
data(mtcars)
dta_columns(mtcars, .columns = starts_with("c"))
#> [1] "cyl" "carb"
dta_columns(mtcars, .columns = cyl:wt)
#> [1] "cyl" "disp" "hp" "drat" "wt"
dta_columns(mtcars, .columns = c(mpg, hp, vs, gear))
#> [1] "mpg" "hp" "vs" "gear"
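For comparison, the first two selections above can be approximated in base R. This is only a sketch of the idea using plain name matching on mtcars; it does not call dta_columns() and is not how the package is implemented.

```r
# Base R analogue of the selections above; not the package's implementation.
cols <- names(mtcars)

# Columns whose names start with "c"
starts_c <- cols[startsWith(cols, "c")]

# Contiguous range of columns from cyl through wt
range_cols <- cols[match("cyl", cols):match("wt", cols)]

print(starts_c)
print(range_cols)
```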
Find duplicate rows based on specific columns.
df <- data.frame(
id = c(14, 20, 12, 32, 14, 23, 15, 12, 30, 14),
name = c(
"Mary", "Mark", "Faith", "David", "Mary", "Daniel", "Christine",
"Johnson", "Elizabeth", "Mary"
),
age = c(21, 18, 25, 17, 21, 24, 21, 19, 20, 21)
)
result <- dta_duplicates(df)
dta_gtable(result)
id | name | age |
---|---|---|
14 | Mary | 21 |
14 | Mary | 21 |
14 | Mary | 21 |
result2 <- dta_duplicates(df, .columns = id)
dta_gtable(result2)
id | name | age |
---|---|---|
14 | Mary | 21 |
12 | Faith | 25 |
14 | Mary | 21 |
12 | Johnson | 19 |
14 | Mary | 21 |
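The two results above differ only in which columns define a duplicate. A minimal base R sketch of the same logic, assuming a row counts as a duplicate when its key appears more than once (the names dup_all and dup_by_id are illustrative, and this is not the package's own code):

```r
df <- data.frame(
  id = c(14, 20, 12, 32, 14, 23, 15, 12, 30, 14),
  name = c(
    "Mary", "Mark", "Faith", "David", "Mary", "Daniel", "Christine",
    "Johnson", "Elizabeth", "Mary"
  ),
  age = c(21, 18, 25, 17, 21, 24, 21, 19, 20, 21)
)

# Flag every occurrence of a repeated key, not just the second one onwards,
# by scanning both from the top and from the bottom.
is_repeated <- function(x) duplicated(x) | duplicated(x, fromLast = TRUE)

dup_all <- df[is_repeated(df), ]      # all columns must match
dup_by_id <- df[is_repeated(df$id), ] # only `id` must match

print(dup_all)
print(dup_by_id)
```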
Assign variable labels to data frame or tibble columns.
dat <- data.frame(
age = c(25, 30, 35, 40),
gender = c("Male", "Female", "Female", "Male"),
income = c(50000, 60000, 55000, 65000)
)
names <- c("age", "income")
labels <- c("Age in years", "Annual income")
result <- dta_label(
dat, dict = NULL, .names = names, .labels = labels
)
dta_gtable(result)
Age in years | gender | Annual income |
---|---|---|
25 | Male | 50000 |
30 | Female | 60000 |
35 | Female | 55000 |
40 | Male | 65000 |
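Variable labels are commonly stored as a "label" attribute on each column, a convention used by packages such as haven and labelled. The sketch below shows that idea in base R; whether dta_label() stores labels in exactly this way is an assumption, and var_names and var_labels are illustrative names.

```r
dat <- data.frame(
  age = c(25, 30, 35, 40),
  gender = c("Male", "Female", "Female", "Male"),
  income = c(50000, 60000, 55000, 65000)
)
var_names <- c("age", "income")
var_labels <- c("Age in years", "Annual income")

# Attach each label to its column as a "label" attribute (assumed convention).
for (i in seq_along(var_names)) {
  attr(dat[[var_names[i]]], "label") <- var_labels[i]
}

# Columns without an entry in var_names keep no label.
print(attr(dat$age, "label"))
print(attr(dat$gender, "label"))
```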
Recode variables in a data frame based on a dictionary.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
data("data_sample")
glimpse(data_sample) # note the data type of each column
#> Rows: 2,500
#> Columns: 21
#> $ id <chr> "STM/7539", "STM/7993", "STM/7387", "STM/5598", "STM/59…
#> $ region <chr> "Central", "Central", "South", "West", "North East", "N…
#> $ age <dbl> 56, 46, 45, 37, 45, 51, 56, 37, 50, 38, 48, 41, 24, 34,…
#> $ age_group <chr> "50-59", "40-49", "40-49", "30-39", "40-49", "50-59", "…
#> $ height <dbl> 1.70, 1.57, 1.47, 1.67, 1.69, 1.90, 1.85, 1.64, 1.61, 1…
#> $ weight <dbl> 73, 53, 85, 77, 53, 75, 69, 53, 56, 89, 73, 86, 76, 81,…
#> $ blood_group <chr> "AB", "B", "AB", "AB", "A", "A", "AB", "B", "A", "AB", …
#> $ marital_status <chr> "Married", "Married", "Married", "Single", "Single", "M…
#> $ education <chr> "Bachelors", "Bachelors", "Bachelors", "Bachelors", "Ba…
#> $ employed <chr> "Yes", "No", "No", "Yes", "Yes", "No", "Yes", "No", "Ye…
#> $ ses <chr> "Middle", "Middle", "High", "Middle", "Low", "Middle", …
#> $ language <chr> "Mandarin", "French", "Arabic", "English", "Arabic", "M…
#> $ phone <chr> "OnePlus", "OnePlus", "Samsung", "OnePlus", "OnePlus", …
#> $ transport <chr> "Bicycle", "Train", "Car", "Bus", "Bus", "Bus", "Bus", …
#> $ gadgets_owned <chr> "Smart TV, Tablet, Desktop Computer, Digital Camera, Sm…
#> $ r <chr> "No", "No", "No", "No", "No", "Yes", "Yes", "Yes", "Yes…
#> $ python <chr> "No", "Yes", "Yes", "Yes", "No", "Yes", "Yes", "No", "N…
#> $ sas <chr> "No", "No", "No", "No", "No", "No", "No", "No", "Yes", …
#> $ stata <chr> "No", "No", "Yes", "Yes", "Yes", "No", "Yes", "No", "Ye…
#> $ spss <chr> "No", "No", "Yes", "No", "Yes", "No", "No", "Yes", "No"…
#> $ excel <chr> "Yes", "No", "No", "No", "No", "No", "No", "No", "No", …
data("dict_recode")
dta_gtable(dict_recode)
| names | values | labels | is_ordered |
|---|---|---|---|
| region | 1 | Central | 0 |
| | 2 | North East | |
| | 3 | South | |
| | 4 | West | |
| age_group | 1 | 20-29 | 1 |
| | 2 | 30-39 | |
| | 3 | 40-49 | |
| | 4 | 50-59 | |
| | 5 | 60-69 | |
| | 6 | 70+ | |
| blood_group | 1 | A | 0 |
| | 2 | B | |
| | 3 | AB | |
| | 4 | O | |
| marital_status | 1 | Single | 0 |
| | 2 | Married | |
| | 3 | Other | |
| education | 1 | Bachelors | 1 |
| | 2 | Masters | |
| | 3 | Doctorate | |
| employed | 0 | No | 0 |
| | 1 | Yes | |
| ses | 1 | Low | 1 |
| | 2 | Middle | |
| | 3 | High | |
| language | 1 | English | 0 |
| | 2 | French | |
| | 3 | Spanish | |
| | 4 | Arabic | |
| | 5 | Mandarin | |
| | 6 | Other | |
| phone | 0 | None | 0 |
| | 1 | Samsung | |
| | 2 | Apple | |
| | 3 | Xiaomi | |
| | 4 | OnePlus | |
| | 5 | | |
| | 6 | Other | |
| transport | 1 | Walking | 0 |
| | 2 | Bicycle | |
| | 3 | Car | |
| | 4 | Bus | |
| | 5 | Train | |
| r | 0 | No | 0 |
| | 1 | Yes | |
| python | 0 | No | 0 |
| | 1 | Yes | |
| sas | 0 | No | 0 |
| | 1 | Yes | |
| stata | 0 | No | 0 |
| | 1 | Yes | |
| spss | 0 | No | 0 |
| | 1 | Yes | |
| excel | 0 | No | 0 |
| | 1 | Yes | |
result <- dta_recode(
dat = data_sample,
dict = dict_recode,
is_force_sequential = TRUE
)
glimpse(result)
#> Rows: 2,500
#> Columns: 21
#> $ id <chr> "STM/7539", "STM/7993", "STM/7387", "STM/5598", "STM/59…
#> $ region <fct> Central, Central, South, West, North East, North East, …
#> $ age <dbl> 56, 46, 45, 37, 45, 51, 56, 37, 50, 38, 48, 41, 24, 34,…
#> $ age_group <ord> 50-59, 40-49, 40-49, 30-39, 40-49, 50-59, 50-59, 30-39,…
#> $ height <dbl> 1.70, 1.57, 1.47, 1.67, 1.69, 1.90, 1.85, 1.64, 1.61, 1…
#> $ weight <dbl> 73, 53, 85, 77, 53, 75, 69, 53, 56, 89, 73, 86, 76, 81,…
#> $ blood_group <fct> AB, B, AB, AB, A, A, AB, B, A, AB, AB, A, B, AB, A, B, …
#> $ marital_status <fct> Married, Married, Married, Single, Single, Married, Sin…
#> $ education <ord> Bachelors, Bachelors, Bachelors, Bachelors, Bachelors, …
#> $ employed <fct> Yes, No, No, Yes, Yes, No, Yes, No, Yes, Yes, Yes, No, …
#> $ ses <ord> Middle, Middle, High, Middle, Low, Middle, Low, Low, Mi…
#> $ language <fct> Mandarin, French, Arabic, English, Arabic, Mandarin, En…
#> $ phone <fct> OnePlus, OnePlus, Samsung, OnePlus, OnePlus, Samsung, O…
#> $ transport <fct> Bicycle, Train, Car, Bus, Bus, Bus, Bus, Train, Bicycle…
#> $ gadgets_owned <chr> "Smart TV, Tablet, Desktop Computer, Digital Camera, Sm…
#> $ r <fct> No, No, No, No, No, Yes, Yes, Yes, Yes, Yes, Yes, No, N…
#> $ python <fct> No, Yes, Yes, Yes, No, Yes, Yes, No, No, No, Yes, Yes, …
#> $ sas <fct> No, No, No, No, No, No, No, No, Yes, No, No, No, No, No…
#> $ stata <fct> No, No, Yes, Yes, Yes, No, Yes, No, Yes, No, Yes, Yes, …
#> $ spss <fct> No, No, Yes, No, Yes, No, No, Yes, No, Yes, No, Yes, No…
#> $ excel <fct> Yes, No, No, No, No, No, No, No, No, No, No, No, Yes, N…
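The change from <chr> to <fct> and <ord> above is what one would get by turning each column into a factor whose levels come from the dictionary. A minimal base R sketch follows, using a small made-up dictionary in which the variable name is repeated on every row (an assumption about the underlying data); it is not the code behind dta_recode().

```r
# Toy dictionary: labels listed in level order, is_ordered flags ordinal variables.
dict <- data.frame(
  names = c("ses", "ses", "ses", "region", "region"),
  labels = c("Low", "Middle", "High", "Central", "South"),
  is_ordered = c(1, 1, 1, 0, 0),
  stringsAsFactors = FALSE
)
dat <- data.frame(
  ses = c("Middle", "High", "Low"),
  region = c("South", "Central", "South"),
  stringsAsFactors = FALSE
)

# Convert each dictionary variable to a factor with the dictionary's level order.
for (v in unique(dict$names)) {
  rows <- dict[dict$names == v, ]
  dat[[v]] <- factor(
    dat[[v]],
    levels = rows$labels,
    ordered = rows$is_ordered[1] == 1
  )
}

str(dat)
```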
Correct misspelled data using a dictionary.
data("data_misspelled")
dta_gtable(head(data_misspelled))
id | region | age | height | weight | blood_group | marital_status | education | ses | r | python | sas | stata | spss | excel |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
STM/7539 | southern | 46 | 1.85 | 77 | ba | Other | BSC | medium | n | n | yes | tru | n | n |
STM/7993 | South | 45 | 1.64 | 53 | AB | maried | Bsc | Midle | y | N | n | no | tru | N |
STM/7387 | southern | 37 | 1.61 | 75 | B | single | BSC | Middle | N | tru | no | Yes | No | n |
STM/5598 | West | 45 | 1.80 | 69 | B | Single | MSc | low | y | N | y | fasle | no | Yes |
STM/5901 | South | 51 | 1.81 | 53 | A+ | Others | Bachelors | Medium | no | yes | Yes | n | tru | Yes |
STM/7529 | North-East | 56 | 1.35 | 56 | O | Single | Bsc | Mediam | yes | fasle | no | yes | tru | no |
data("dict_misspelled")
dta_gtable(dict_misspelled)
| variable | old | new |
|---|---|---|
| region | southern | South |
| | wset | West |
| | North-East | North East |
| blood_group | ba | AB |
| | ab | |
| | o | O |
| | A+ | A |
| | a | |
| marital_status | maried | Married |
| | single | Single |
| | Singel | |
| | Others | Other |
| education | BSC | Bachelors |
| | Bsc | |
| | Bachelers | |
| | MSc | Masters |
| | PhD | Doctorate |
| | Doctoral | |
| ses | medium | Middle |
| | Midle | |
| | Mediam | |
| | low | Low |
| | Hihg | High |
| .global | fasle | No |
| | N | |
| | n | |
| | no | |
| | yes | Yes |
| | y | |
| | tru | |
# Correct the misspelled entries in `dat` using the
# `dict` dictionary
result <- dta_replace(
dat = data_misspelled,
dict = dict_misspelled,
.name = variable,
.wrong = old,
.correct = new
)
dta_gtable(head(result))
id | region | age | height | weight | blood_group | marital_status | education | ses | r | python | sas | stata | spss | excel |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
STM/7539 | South | 46 | 1.85 | 77 | AB | Other | Bachelors | Middle | No | No | Yes | Yes | No | No |
STM/7993 | South | 45 | 1.64 | 53 | AB | Married | Bachelors | Middle | Yes | No | No | No | Yes | No |
STM/7387 | South | 37 | 1.61 | 75 | B | Single | Bachelors | Middle | No | Yes | No | Yes | No | No |
STM/5598 | West | 45 | 1.80 | 69 | B | Single | Masters | Low | Yes | No | Yes | No | No | Yes |
STM/5901 | South | 51 | 1.81 | 53 | A | Other | Bachelors | Medium | No | Yes | Yes | No | Yes | Yes |
STM/7529 | North East | 56 | 1.35 | 56 | O | Single | Bachelors | Middle | Yes | No | No | Yes | Yes | No |
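The correction step is essentially a lookup: each value is matched against the old column and replaced by the corresponding new value, and values with no match are left alone (which is why "Medium" survives in the output above). The base R sketch below illustrates this with a toy dictionary. It assumes that rows marked .global apply to every column, which is what the output suggests but is not confirmed here, and it is not the package's implementation.

```r
# Toy dictionary; ".global" rules are assumed to apply to all columns.
dict <- data.frame(
  variable = c("region", "region", ".global", ".global"),
  old = c("southern", "wset", "n", "y"),
  new = c("South", "West", "No", "Yes"),
  stringsAsFactors = FALSE
)
dat <- data.frame(
  region = c("southern", "West", "wset"),
  r = c("n", "y", "Yes"),
  stringsAsFactors = FALSE
)

for (col in names(dat)) {
  # Rules for this column plus the global rules
  rules <- dict[dict$variable %in% c(col, ".global"), ]
  hit <- match(dat[[col]], rules$old)
  # Keep the original value wherever no rule matches
  dat[[col]] <- ifelse(is.na(hit), dat[[col]], rules$new[hit])
}

print(dat)
```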
Transpose a data frame, using a specified column as the variable names.
data("data_cancer")
dta_gtable(data_cancer)
cancer_type | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | V11 | V12 | V13 | V14 | V15 | V16 | V17 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
stomach | 124 | 42 | 25 | 45 | 412 | 51 | 1112 | 46 | 103 | 876 | 146 | 340 | 396 | ||||
bronchus | 81 | 461 | 20 | 450 | 246 | 166 | 63 | 64 | 155 | 859 | 151 | 166 | 37 | 223 | 138 | 72 | 245 |
colon | 248 | 372 | 189 | 1843 | 180 | 537 | 519 | 455 | 406 | 365 | 942 | 776 | 372 | 163 | 101 | 20 | 283 |
ovary | 1234 | 89 | 201 | 356 | 2970 | 456 |
df <- dta_transpose(
dat = data_cancer, .column_to_use_as_variables = cancer_type
)
dta_gtable(df)
| stomach | bronchus | colon | ovary |
|---|---|---|---|
| 124 | 81 | 248 | 1234 |
| 42 | 461 | 372 | 89 |
| 25 | 20 | 189 | 201 |
| 45 | 450 | 1843 | 356 |
| 412 | 246 | 180 | 2970 |
| 51 | 166 | 537 | 456 |
| 1112 | 63 | 519 | |
| 46 | 64 | 455 | |
| 103 | 155 | 406 | |
| 876 | 859 | 365 | |
| 146 | 151 | 942 | |
| 340 | 166 | 776 | |
| 396 | 37 | 372 | |
| | 223 | 163 | |
| | 138 | 101 | |
| | 72 | 20 | |
| | 245 | 283 | |
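The same reshaping can be sketched in base R with t(): drop the naming column, transpose the rest, and reuse the dropped column as the new names. The example uses a three-column excerpt of the data above and illustrative object names (wide, long); it is not the implementation of dta_transpose().

```r
# Excerpt of data_cancer: one row per cancer type, missing values as NA.
wide <- data.frame(
  cancer_type = c("stomach", "ovary"),
  V1 = c(124, 1234),
  V2 = c(42, 89),
  V3 = c(25, NA),
  stringsAsFactors = FALSE
)

# Transpose the value columns, then name the new columns after cancer_type.
long <- as.data.frame(t(wide[, -1]))
names(long) <- wide$cancer_type
rownames(long) <- NULL

print(long)
```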
Compute and generate functions
Calculate body mass index (BMI).
data("data_bmi")
dta_gtable(data_bmi)
id | age | height | weight |
---|---|---|---|
STM/4921 | 50 | 1.64 | 59 |
STM/4396 | 34 | 1.98 | 57 |
STM/7908 | 50 | 1.95 | 84 |
STM/7243 | 39 | 1.52 | 63 |
STM/4801 | 52 | 1.69 | 65 |
STM/5134 | 50 | 1.71 | 73 |
STM/7138 | 35 | 1.73 | 46 |
STM/6802 | 72 | 1.98 | 70 |
STM/4420 | 42 | 1.62 | 103 |
STM/6351 | 40 | 1.89 | 96 |
STM/4933 | 38 | 1.91 | 67 |
STM/4303 | 37 | 1.56 | 75 |
STM/7465 | 45 | 1.62 | 44 |
STM/4587 | 67 | 1.38 | 51 |
STM/5320 | 44 | 1.37 | 63 |
df <- dta_bmi(
dat = data_bmi,
.weight = weight,
.height = height,
name = body_mass_index,
digits = 2
)
dta_gtable(df)
id | age | height | weight | body_mass_index |
---|---|---|---|---|
STM/4921 | 50 | 1.64 | 59 | 21.94 |
STM/4396 | 34 | 1.98 | 57 | 14.54 |
STM/7908 | 50 | 1.95 | 84 | 22.09 |
STM/7243 | 39 | 1.52 | 63 | 27.27 |
STM/4801 | 52 | 1.69 | 65 | 22.76 |
STM/5134 | 50 | 1.71 | 73 | 24.96 |
STM/7138 | 35 | 1.73 | 46 | 15.37 |
STM/6802 | 72 | 1.98 | 70 | 17.86 |
STM/4420 | 42 | 1.62 | 103 | 39.25 |
STM/6351 | 40 | 1.89 | 96 | 26.87 |
STM/4933 | 38 | 1.91 | 67 | 18.37 |
STM/4303 | 37 | 1.56 | 75 | 30.82 |
STM/7465 | 45 | 1.62 | 44 | 16.77 |
STM/4587 | 67 | 1.38 | 51 | 26.78 |
STM/5320 | 44 | 1.37 | 63 | 33.57 |
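The values in body_mass_index are consistent with the usual formula, weight divided by height squared, assuming weight is in kilograms and height in metres (the units are not stated in the data). A quick base R check on the first three rows:

```r
# First three rows of data_bmi; units assumed to be metres and kilograms.
height <- c(1.64, 1.98, 1.95)
weight <- c(59, 57, 84)

# BMI = weight / height^2, rounded to two digits as in the example above.
bmi <- round(weight / height^2, 2)

print(bmi)
```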
Categorize BMI into weight categories.
data("data_bmicat")
dta_gtable(data_bmicat)
id | bmi |
---|---|
STM/4921 | 21.93635 |
STM/4396 | 14.53933 |
STM/7908 | 22.09073 |
STM/7243 | 27.26801 |
STM/4801 | 22.75831 |
STM/5134 | 24.96495 |
STM/7138 | 15.36971 |
STM/6802 | 17.85532 |
STM/4420 | 39.24707 |
STM/6351 | 26.87495 |
STM/4933 | 18.36572 |
STM/4303 | 30.81854 |
STM/7465 | 16.76574 |
STM/4587 | 26.78009 |
STM/5320 | 33.56599 |
# Categorize `bmi` into the standard BMI categories
df <- dta_bmicat(
dat = data_bmicat,
.bmi = bmi,
name = bmi_cat,
is_extended = FALSE,
as_factor = TRUE
)
dta_gtable(df)
id | bmi | bmi_cat |
---|---|---|
STM/4921 | 21.93635 | Healthy weight |
STM/4396 | 14.53933 | Underweight |
STM/7908 | 22.09073 | Healthy weight |
STM/7243 | 27.26801 | Overweight |
STM/4801 | 22.75831 | Healthy weight |
STM/5134 | 24.96495 | Healthy weight |
STM/7138 | 15.36971 | Underweight |
STM/6802 | 17.85532 | Underweight |
STM/4420 | 39.24707 | Obesity |
STM/6351 | 26.87495 | Overweight |
STM/4933 | 18.36572 | Underweight |
STM/4303 | 30.81854 | Obesity |
STM/7465 | 16.76574 | Underweight |
STM/4587 | 26.78009 | Overweight |
STM/5320 | 33.56599 | Obesity |
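The categories above match the standard adult cut-offs of 18.5, 25, and 30. Those thresholds are an assumption here, since they are not stated in this document, but every row of the output agrees with them. A base R sketch using cut():

```r
bmi <- c(21.93635, 14.53933, 27.26801, 30.81854)

# Left-closed intervals: [18.5, 25) is "Healthy weight", [30, Inf) is "Obesity".
# The break points are assumed standard cut-offs, not taken from the package.
bmi_cat <- cut(
  bmi,
  breaks = c(-Inf, 18.5, 25, 30, Inf),
  labels = c("Underweight", "Healthy weight", "Overweight", "Obesity"),
  right = FALSE
)

print(data.frame(bmi, bmi_cat))
```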
Split multiple response question column into binary columns.
data("data_gadgets")
dat <- data_gadgets
dta_gtable(dat)
gadgets_owned |
---|
Smartwatch, Tablet, Smartphone |
Tablet, Smartwatch, Smart TV, Desktop Computer |
Smartphone |
Laptop, Tablet |
Tablet, Smart TV, Digital Camera, Laptop |
Laptop, Desktop Computer, Digital Camera, Smart TV, Smartphone |
Digital Camera, Smartphone, Desktop Computer, Smartwatch |
Smartwatch, Smart TV, Laptop, Smartphone |
Desktop Computer |
Smartphone, Laptop, Smart TV, Smartwatch |
Tablet |
Digital Camera, Tablet, Desktop Computer |
Digital Camera, Desktop Computer, Smart TV, Smartwatch, Laptop |
Tablet, Desktop Computer, Smart TV |
Digital Camera, Desktop Computer, Smart TV, Smartphone, Laptop |
# Split `gadgets_owned` column into separate columns.
# The created columns will be logical (i.e. TRUE / FALSE).
df <- dta_mrq(
dat = dat,
.column = gadgets_owned,
delimeter = ", ",
is_clean_names = TRUE
)
dta_gtable(df)
gadgets_owned | Smartwatch | Tablet | Smartphone | Smart TV | Desktop Computer | Laptop | Digital Camera |
---|---|---|---|---|---|---|---|
Smartwatch, Tablet, Smartphone | TRUE | TRUE | TRUE | FALSE | FALSE | FALSE | FALSE |
Tablet, Smartwatch, Smart TV, Desktop Computer | TRUE | TRUE | FALSE | TRUE | TRUE | FALSE | FALSE |
Smartphone | FALSE | FALSE | TRUE | FALSE | FALSE | FALSE | FALSE |
Laptop, Tablet | FALSE | TRUE | FALSE | FALSE | FALSE | TRUE | FALSE |
Tablet, Smart TV, Digital Camera, Laptop | FALSE | TRUE | FALSE | TRUE | FALSE | TRUE | TRUE |
Laptop, Desktop Computer, Digital Camera, Smart TV, Smartphone | FALSE | FALSE | TRUE | TRUE | TRUE | TRUE | TRUE |
Digital Camera, Smartphone, Desktop Computer, Smartwatch | TRUE | FALSE | TRUE | FALSE | TRUE | FALSE | TRUE |
Smartwatch, Smart TV, Laptop, Smartphone | TRUE | FALSE | TRUE | TRUE | FALSE | TRUE | FALSE |
Desktop Computer | FALSE | FALSE | FALSE | FALSE | TRUE | FALSE | FALSE |
Smartphone, Laptop, Smart TV, Smartwatch | TRUE | FALSE | TRUE | TRUE | FALSE | TRUE | FALSE |
Tablet | FALSE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE |
Digital Camera, Tablet, Desktop Computer | FALSE | TRUE | FALSE | FALSE | TRUE | FALSE | TRUE |
Digital Camera, Desktop Computer, Smart TV, Smartwatch, Laptop | TRUE | FALSE | FALSE | TRUE | TRUE | TRUE | TRUE |
Tablet, Desktop Computer, Smart TV | FALSE | TRUE | FALSE | TRUE | TRUE | FALSE | FALSE |
Digital Camera, Desktop Computer, Smart TV, Smartphone, Laptop | FALSE | FALSE | TRUE | TRUE | TRUE | TRUE | TRUE |
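Conceptually, the split works by breaking each response on the delimiter and then testing, for every distinct option, whether it appears in that response. A minimal base R sketch on three of the rows above (object names are illustrative, and this is not the code behind dta_mrq()):

```r
gadgets <- c(
  "Smartwatch, Tablet, Smartphone",
  "Smartphone",
  "Laptop, Tablet"
)

# Split each response into its individual items.
items <- strsplit(gadgets, ", ", fixed = TRUE)

# Distinct options, in order of first appearance.
choices <- unique(unlist(items))

# One logical column per option: TRUE when the response mentions it.
flags <- t(vapply(
  items,
  function(x) choices %in% x,
  logical(length(choices))
))
colnames(flags) <- choices

result <- data.frame(gadgets_owned = gadgets, flags, check.names = FALSE)
print(result)
```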
Exploratory data analysis functions
Get frequency distribution of a specific column.
data("data_sample")
tab <- dta_freq(dat = data_sample, .column = region)
dta_gtable(tab)
Region of residence | Frequency | Percent |
---|---|---|
Central | 356 | 14.24% |
North East | 764 | 30.56% |
South | 851 | 34.04% |
West | 529 | 21.16% |
Total | 2500 | 100.00% |
# Sort by decreasing frequency and remove the percentage symbol
tab2 <- dta_freq(
dat = data_sample,
.column = region,
is_sorted = TRUE,
is_decreasing = TRUE,
add_percent_symbol = FALSE
)
dta_gtable(tab2)
Region of residence | Frequency | Percent |
---|---|---|
South | 851 | 34.04 |
North East | 764 | 30.56 |
West | 529 | 21.16 |
Central | 356 | 14.24 |
Total | 2500 | 100.00 |
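The Percent column is each count divided by the total. The base R sketch below recomputes it from the frequencies shown above and applies the same decreasing sort; it is a check on the arithmetic, not the implementation of dta_freq().

```r
# Frequencies taken from the table above.
counts <- c(Central = 356, "North East" = 764, South = 851, West = 529)

# Percent of the total, rounded to two digits.
percent <- round(100 * counts / sum(counts), 2)

# Sort by decreasing frequency, as in the second example.
ord <- order(counts, decreasing = TRUE)
freq <- data.frame(
  region = names(counts)[ord],
  Frequency = unname(counts)[ord],
  Percent = unname(percent)[ord]
)

print(freq)
```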
Frequency table for multiple response questions.
data("data_sample")
# An example with multiple response variables labelled
# as Yes / No
result <- dta_freq_mrq(
dat = data_sample,
.columns = r:excel,
value = "Yes",
name = "Programming proficiency"
)
dta_gtable(result)
Programming proficiency | Frequency | Responses | Cases |
---|---|---|---|
R | 1245 | 16.79% | 49.80% |
Python | 1267 | 17.09% | 50.68% |
SAS | 1266 | 17.08% | 50.64% |
Stata | 1225 | 16.53% | 49.00% |
SPSS | 1222 | 16.48% | 48.88% |
Microsoft Excel | 1188 | 16.03% | 47.52% |
Total | 7413 | 100.00% | 296.52% |
# Remove the percentage symbol
result2 <- dta_freq_mrq(
dat = data_sample,
.columns = r:excel,
value = "Yes",
name = "Programming proficiency",
add_percent_symbol = FALSE
)
dta_gtable(result2)
Programming proficiency | Frequency | Responses | Cases |
---|---|---|---|
R | 1245 | 16.79 | 49.80 |
Python | 1267 | 17.09 | 50.68 |
SAS | 1266 | 17.08 | 50.64 |
Stata | 1225 | 16.53 | 49.00 |
SPSS | 1222 | 16.48 | 48.88 |
Microsoft Excel | 1188 | 16.03 | 47.52 |
Total | 7413 | 100.00 | 296.52 |
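The two percentage columns use different denominators: Responses divides each count by the total number of "Yes" answers (7413), while Cases divides by the number of respondents (2500), which is why the Cases total exceeds 100%. A base R check using the counts above:

```r
# "Yes" counts per tool, taken from the table above.
yes_counts <- c(
  R = 1245, Python = 1267, SAS = 1266,
  Stata = 1225, SPSS = 1222, "Microsoft Excel" = 1188
)
n_respondents <- 2500 # rows in data_sample

# Share of all "Yes" answers
responses_pct <- round(100 * yes_counts / sum(yes_counts), 2)

# Share of respondents who answered "Yes"
cases_pct <- round(100 * yes_counts / n_respondents, 2)

# Total of the Cases column
total_cases_pct <- round(100 * sum(yes_counts) / n_respondents, 2)

print(data.frame(Frequency = yes_counts, responses_pct, cases_pct))
print(total_cases_pct)
```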
Generate cross tabulations with optional percentages and totals.
data("data_sample")
df <- data_sample
# Crosstabulation of frequencies (counts)
result <- dta_crosstab(
dat = df, .row = region, .column = age_group
)
dta_gtable(result)
Variable | 20-29 | 30-39 | 40-49 | 50-59 | 60-69 | 70+ | Total |
---|---|---|---|---|---|---|---|
Central | 33 | 68 | 100 | 80 | 45 | 30 | 356 |
North East | 65 | 133 | 199 | 174 | 127 | 66 | 764 |
South | 94 | 151 | 177 | 206 | 154 | 69 | 851 |
West | 50 | 93 | 117 | 128 | 96 | 45 | 529 |
Total | 242 | 445 | 593 | 588 | 422 | 210 | 2500 |
# Calculate column percentages
result2 <- dta_crosstab(
dat = df,
.row = region,
.column = age_group,
cells = "col",
add_totals = "col"
)
dta_gtable(result2)
region/age_group | 20-29 | 30-39 | 40-49 | 50-59 | 60-69 | 70+ | Total |
---|---|---|---|---|---|---|---|
Central | 33 (13.64%) | 68 (15.28%) | 100 (16.86%) | 80 (13.61%) | 45 (10.66%) | 30 (14.29%) | 356 (14.24%) |
North East | 65 (26.86%) | 133 (29.89%) | 199 (33.56%) | 174 (29.59%) | 127 (30.09%) | 66 (31.43%) | 764 (30.56%) |
South | 94 (38.84%) | 151 (33.93%) | 177 (29.85%) | 206 (35.03%) | 154 (36.49%) | 69 (32.86%) | 851 (34.04%) |
West | 50 (20.66%) | 93 (20.90%) | 117 (19.73%) | 128 (21.77%) | 96 (22.75%) | 45 (21.43%) | 529 (21.16%) |
# Calculate row percentages
result3 <- dta_crosstab(
dat = df,
.row = region,
.column = age_group,
cells = "row",
add_totals = "row"
)
dta_gtable(result3)
region/age_group | 20-29 | 30-39 | 40-49 | 50-59 | 60-69 | 70+ |
---|---|---|---|---|---|---|
Central | 33 (9.27%) | 68 (19.10%) | 100 (28.09%) | 80 (22.47%) | 45 (12.64%) | 30 (8.43%) |
North East | 65 (8.51%) | 133 (17.41%) | 199 (26.05%) | 174 (22.77%) | 127 (16.62%) | 66 (8.64%) |
South | 94 (11.05%) | 151 (17.74%) | 177 (20.80%) | 206 (24.21%) | 154 (18.10%) | 69 (8.11%) |
West | 50 (9.45%) | 93 (17.58%) | 117 (22.12%) | 128 (24.20%) | 96 (18.15%) | 45 (8.51%) |
Total | 242 (9.68%) | 445 (17.80%) | 593 (23.72%) | 588 (23.52%) | 422 (16.88%) | 210 (8.40%) |
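Column percentages divide each cell by its column total, and row percentages divide by the row total. The base R sketch below reproduces both from the counts in the first table using prop.table(); it illustrates the arithmetic only and does not call dta_crosstab().

```r
# Counts from the frequency crosstab above (rows: region, columns: age group).
counts <- matrix(
  c(
    33, 68, 100, 80, 45, 30,
    65, 133, 199, 174, 127, 66,
    94, 151, 177, 206, 154, 69,
    50, 93, 117, 128, 96, 45
  ),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    region = c("Central", "North East", "South", "West"),
    age_group = c("20-29", "30-39", "40-49", "50-59", "60-69", "70+")
  )
)

# margin = 2 normalizes within columns, margin = 1 within rows.
col_pct <- round(100 * prop.table(counts, margin = 2), 2)
row_pct <- round(100 * prop.table(counts, margin = 1), 2)

print(col_pct)
print(row_pct)
```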
Other utilities
Convert numeric strings in a data frame or tibble to numeric values.
# A data frame with numeric strings (a), non-numeric strings (b), and numbers (c)
df <- data.frame(
a = c("1", "2", "3"),
b = c("A", "B", "C"),
c = c(4, 5, 6)
)
str(df)
#> 'data.frame': 3 obs. of 3 variables:
#> $ a: chr "1" "2" "3"
#> $ b: chr "A" "B" "C"
#> $ c: num 4 5 6
df <- dta_to_numeric(df)
str(df)
#> tibble [3 × 3] (S3: tbl_df/tbl/data.frame)
#> $ a: num [1:3] 1 2 3
#> $ b: chr [1:3] "A" "B" "C"
#> $ c: num [1:3] 4 5 6
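One way to get the same effect in base R is to convert a character column only when every value parses as a number. The sketch below takes that approach; the helper is_numeric_string() is hypothetical, the result stays a plain data frame rather than a tibble, and a column containing NA would be left unconverted.

```r
df <- data.frame(
  a = c("1", "2", "3"),
  b = c("A", "B", "C"),
  c = c(4, 5, 6),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE when a character vector holds only numbers.
is_numeric_string <- function(x) {
  is.character(x) && !anyNA(suppressWarnings(as.numeric(x)))
}

# Convert qualifying columns and leave everything else untouched.
df[] <- lapply(df, function(x) if (is_numeric_string(x)) as.numeric(x) else x)

str(df)
```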