# install.packages("dta") # uncomment if `dta` is not installed
library(dta)

This document is a guide to the dta package. It showcases the bundled sample datasets and the available functions, each with a usage example. The package is designed for data management, transformation, and exploratory data analysis.

Sample datasets and dictionaries

The package includes several sample datasets and dictionaries. Below are descriptions and examples of how to load a few of them.

Datasets

data_bmi: Sample data for body mass index (BMI) calculations.

data("data_bmi")
dta_gtable(data_bmi)
id age height weight
STM/4921 50 1.64 59
STM/4396 34 1.98 57
STM/7908 50 1.95 84
STM/7243 39 1.52 63
STM/4801 52 1.69 65
STM/5134 50 1.71 73
STM/7138 35 1.73 46
STM/6802 72 1.98 70
STM/4420 42 1.62 103
STM/6351 40 1.89 96
STM/4933 38 1.91 67
STM/4303 37 1.56 75
STM/7465 45 1.62 44
STM/4587 67 1.38 51
STM/5320 44 1.37 63

data_cancer: Sample data on the survival of cancer patients, in days.

data("data_cancer")
dta_gtable(data_cancer)
cancer_type V1   V2   V3   V4   V5   V6   V7   V8   V9   V10  V11  V12  V13  V14  V15  V16  V17
stomach     124  42   25   45   412  51   1112 46   103  876  146  340  396
bronchus    81   461  20   450  246  166  63   64   155  859  151  166  37   223  138  72   245
colon       248  372  189  1843 180  537  519  455  406  365  942  776  372  163  101  20   283
ovary       1234 89   201  356  2970 456

data_misspelled: Sample dataset with general information, including misspelled and inconsistently capitalized entries.

data("data_misspelled")
dta_gtable(data_misspelled)
id region age height weight blood_group marital_status education ses r python sas stata spss excel
STM/7539 southern 46 1.85 77 ba Other BSC medium n n yes tru n n
STM/7993 South 45 1.64 53 AB maried Bsc Midle y N n no tru N
STM/7387 southern 37 1.61 75 B single BSC Middle N tru no Yes No n
STM/5598 West 45 1.80 69 B Single MSc low y N y fasle no Yes
STM/5901 South 51 1.81 53 A+ Others Bachelors Medium no yes Yes n tru Yes
STM/7529 North-East 56 1.35 56 O Single Bsc Mediam yes fasle no yes tru no
STM/7238 North East 37 1.44 89 o Single Bachelers High No n Yes No yes no
STM/5417 North-East 50 1.67 73 o single MSc High No y Yes tru N yes
STM/5907 Central 38 1.48 86 o Other Bachelers Midle N tru y yes fasle n
STM/7877 Central 48 1.71 76 O Others Bachelors low N N tru n fasle no
STM/4762 South 41 1.96 81 ab Singel Bachelors Middle fasle tru no no no no
STM/7345 North-East 24 1.98 77 a Others MSc medium no Yes n No N Yes
STM/6968 wset 34 1.98 75 A+ Other Bachelers low yes N fasle y Yes tru
STM/5816 West 43 1.76 75 B maried PhD Medium tru y n Yes tru fasle
STM/7707 southern 63 1.92 67 o maried Bsc Medium No yes tru Yes No tru
STM/7449 North East 59 1.69 74 AB Other MSc Midle N fasle no no N No
STM/6853 southern 35 1.43 72 A Married Doctoral Midle tru fasle no y tru fasle
STM/5634 North East 22 1.63 72 o maried Doctorate Medium No y N y No tru
STM/4584 North East 62 1.92 87 AB Other Masters Midle Yes fasle yes tru yes yes
STM/4946 southern 41 1.40 93 O Singel PhD medium fasle N tru yes n n
STM/6798 North East 60 1.95 109 A Married Bachelors low No N N Yes y Yes
STM/4377 North East 44 1.50 44 A Others BSC low yes No n fasle yes n
STM/5435 Central 30 1.96 58 ba single MSc Hihg fasle N fasle No Yes fasle
STM/5562 southern 71 1.95 74 AB Others Bsc Midle Yes No tru Yes No yes
STM/7617 North East 53 1.91 53 a Others Doctorate Low No N fasle y N N
STM/6677 South 60 1.96 72 B Single Bachelors medium n fasle tru N n N
STM/4773 wset 41 1.39 74 A Married Bsc High fasle No n yes N N
STM/5416 Central 64 1.98 59 A maried Bachelors Middle Yes Yes no fasle fasle no
STM/4805 South 27 1.91 53 o Singel Bachelors low fasle No yes yes n fasle
STM/5493 wset 63 1.38 66 ba Married MSc medium Yes Yes no Yes tru No
STM/7650 South 54 1.88 56 ba Others MSc Middle y fasle N No yes n
STM/5458 wset 61 1.95 85 a single BSC Hihg No Yes N fasle Yes No
STM/6102 South 35 1.71 62 o Others Bsc Low y No Yes Yes N fasle
STM/7508 wset 29 1.77 82 ba Married Doctorate Mediam no no N No yes fasle
STM/4121 southern 37 1.78 70 B Others PhD medium No Yes y N N fasle
STM/4830 North-East 41 1.69 65 a Single Doctorate High No y fasle tru Yes no
STM/6271 Central 58 1.56 77 AB maried MSc Mediam n fasle N tru n N
STM/7818 North East 61 1.65 82 a maried MSc low n yes fasle tru fasle no
STM/4671 Central 36 1.56 58 A+ Singel MSc High n n yes fasle N fasle
STM/5864 southern 40 1.95 76 o Singel Bsc medium tru fasle n y no n
STM/4951 North East 52 1.80 77 B Others Bachelors medium tru no n fasle tru yes
STM/6287 wset 25 1.47 81 O maried Doctorate Middle yes no tru tru fasle fasle
STM/4340 North East 50 1.95 84 ab Other Bachelors Middle Yes n fasle y yes No
STM/5202 North-East 55 1.49 52 o Others Doctorate Middle y Yes n n n no
STM/7939 wset 49 1.47 87 B maried Masters Hihg fasle tru y tru y y
STM/4201 North-East 35 1.66 89 ab single Bachelors Midle fasle no no N no fasle
STM/6411 Central 32 1.90 72 B single PhD High tru Yes N No tru tru
STM/4685 South 68 1.52 85 a Single PhD Low N no tru No Yes N
STM/5343 North East 48 1.36 54 ba Other Bachelers High Yes N n y yes Yes
STM/5129 wset 55 1.72 86 ba Others Doctorate low yes Yes yes Yes fasle No

Dictionaries

Dictionaries are lookup tables whose values are applied to update, modify, or standardize the values in a specified dataset, so that data transformations stay consistent.

dict_recode: A dictionary with variable value mappings with labels and ordered status.

data("dict_recode")
dta_gtable(dict_recode)
names          values labels     is_ordered
region         1      Central    0
               2      North East
               3      South
               4      West
age_group      1      20-29      1
               2      30-39
               3      40-49
               4      50-59
               5      60-69
               6      70+
blood_group    1      A          0
               2      B
               3      AB
               4      O
marital_status 1      Single     0
               2      Married
               3      Other
education      1      Bachelors  1
               2      Masters
               3      Doctorate
employed       0      No         0
               1      Yes
ses            1      Low        1
               2      Middle
               3      High
language       1      English    0
               2      French
               3      Spanish
               4      Arabic
               5      Mandarin
               6      Other
phone          0      None       0
               1      Samsung
               2      Apple
               3      Xiaomi
               4      OnePlus
               5      Google
               6      Other
transport      1      Walking    0
               2      Bicycle
               3      Car
               4      Bus
               5      Train
r              0      No         0
               1      Yes
python         0      No         0
               1      Yes
sas            0      No         0
               1      Yes
stata          0      No         0
               1      Yes
spss           0      No         0
               1      Yes
excel          0      No         0
               1      Yes

dict_misspelled: A dictionary that maps misspelled values (old) to their corrected values (new) for each variable.

data("dict_misspelled")
dta_gtable(dict_misspelled)
variable       old        new
region         southern   South
               wset       West
               North-East North East
blood_group    ba         AB
               ab
               o          O
               A+         A
               a
marital_status maried     Married
               single     Single
               Singel
               Others     Other
education      BSC        Bachelors
               Bsc
               Bachelers
               MSc        Masters
               PhD        Doctorate
               Doctoral
ses            medium     Middle
               Midle
               Mediam
               low        Low
               Hihg       High
.global        fasle      No
               N
               n
               no
               yes        Yes
               y
               tru

Functions

Data management and transformation functions

dta_columns()

Retrieve column names from a data frame.

data(mtcars)
dta_columns(mtcars, .columns = starts_with("c"))
#> [1] "cyl"  "carb"
dta_columns(mtcars, .columns = cyl:wt)
#> [1] "cyl"  "disp" "hp"   "drat" "wt"
dta_columns(mtcars, .columns = c(mpg, hp, vs, gear))
#> [1] "mpg"  "hp"   "vs"   "gear"

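For comparison, the same three selections can be sketched in base R. This only illustrates what each selection returns; it is not how dta_columns() is implemented, and the object names are hypothetical.

```r
# Base R sketch of the three selections above (not the dta implementation)
data(mtcars)
cols <- names(mtcars)

# Columns whose names start with "c"
starts_c <- cols[startsWith(cols, "c")]

# Contiguous range of columns from `cyl` to `wt`
range_cols <- cols[match("cyl", cols):match("wt", cols)]

# An explicit set of columns, kept only if they exist in the data
picked <- intersect(c("mpg", "hp", "vs", "gear"), cols)

print(starts_c)
print(range_cols)
print(picked)
```
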
dta_duplicates()

Find duplicate rows based on specific columns.

df <- data.frame(
  id = c(14, 20, 12, 32, 14, 23, 15, 12, 30, 14),
  name = c(
   "Mary", "Mark", "Faith", "David", "Mary", "Daniel", "Christine",
   "Johnson", "Elizabeth", "Mary"
  ),
  age = c(21, 18, 25, 17, 21, 24, 21, 19, 20, 21)
)

result <- dta_duplicates(df)
dta_gtable(result)
id name age
14 Mary 21
14 Mary 21
14 Mary 21

result2 <- dta_duplicates(df, .columns = id)
dta_gtable(result2)
id name age
14 Mary 21
12 Faith 25
14 Mary 21
12 Johnson 19
14 Mary 21
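
Both results above keep every copy of a duplicated record, not only the later ones. A minimal base R sketch of that behaviour follows; the helper all_copies() is hypothetical and the internals of dta_duplicates() may differ.

```r
# Base R sketch: keep every row that belongs to a duplicated group
df <- data.frame(
  id = c(14, 20, 12, 32, 14, 23, 15, 12, 30, 14),
  name = c(
    "Mary", "Mark", "Faith", "David", "Mary", "Daniel", "Christine",
    "Johnson", "Elizabeth", "Mary"
  ),
  age = c(21, 18, 25, 17, 21, 24, 21, 19, 20, 21)
)

# duplicated() flags only the second and later copies,
# so scan from both ends to catch the first copy as well
all_copies <- function(x) duplicated(x) | duplicated(x, fromLast = TRUE)

dup_rows <- df[all_copies(df), ]    # duplicates across all columns
dup_ids <- df[all_copies(df$id), ]  # duplicates on `id` only

print(dup_rows)
print(dup_ids)
```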

dta_label()

Assign variable labels to data frame or tibble columns.

dat <- data.frame(
  age = c(25, 30, 35, 40),
  gender = c("Male", "Female", "Female", "Male"),
  income = c(50000, 60000, 55000, 65000)
)

names <- c("age", "income")
labels <- c("Age in years", "Annual income")

result <- dta_label(
  dat, dict = NULL, .names = names, .labels = labels
)

dta_gtable(result)
Age in years gender Annual income
25 Male 50000
30 Female 60000
35 Female 55000
40 Male 65000
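
One common convention in R is to store a variable label in the "label" attribute of a column. The sketch below assumes that convention purely for illustration; the source does not state how dta_label() stores labels.

```r
# Base R sketch: attach and read variable labels
# Assumption: labels live in the "label" attribute of each column
dat <- data.frame(
  age = c(25, 30, 35, 40),
  income = c(50000, 60000, 55000, 65000)
)

attr(dat$age, "label") <- "Age in years"
attr(dat$income, "label") <- "Annual income"

# Collect the labels, using NA for columns that have none
labels <- vapply(dat, function(col) {
  lab <- attr(col, "label")
  if (is.null(lab)) NA_character_ else lab
}, character(1))

print(labels)
```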

dta_recode()

Recode variables in a data frame based on a dictionary.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
data("data_sample")
glimpse(data_sample) # look at the data type column
#> Rows: 2,500
#> Columns: 21
#> $ id             <chr> "STM/7539", "STM/7993", "STM/7387", "STM/5598", "STM/59…
#> $ region         <chr> "Central", "Central", "South", "West", "North East", "N…
#> $ age            <dbl> 56, 46, 45, 37, 45, 51, 56, 37, 50, 38, 48, 41, 24, 34,…
#> $ age_group      <chr> "50-59", "40-49", "40-49", "30-39", "40-49", "50-59", "…
#> $ height         <dbl> 1.70, 1.57, 1.47, 1.67, 1.69, 1.90, 1.85, 1.64, 1.61, 1…
#> $ weight         <dbl> 73, 53, 85, 77, 53, 75, 69, 53, 56, 89, 73, 86, 76, 81,…
#> $ blood_group    <chr> "AB", "B", "AB", "AB", "A", "A", "AB", "B", "A", "AB", …
#> $ marital_status <chr> "Married", "Married", "Married", "Single", "Single", "M…
#> $ education      <chr> "Bachelors", "Bachelors", "Bachelors", "Bachelors", "Ba…
#> $ employed       <chr> "Yes", "No", "No", "Yes", "Yes", "No", "Yes", "No", "Ye…
#> $ ses            <chr> "Middle", "Middle", "High", "Middle", "Low", "Middle", …
#> $ language       <chr> "Mandarin", "French", "Arabic", "English", "Arabic", "M…
#> $ phone          <chr> "OnePlus", "OnePlus", "Samsung", "OnePlus", "OnePlus", …
#> $ transport      <chr> "Bicycle", "Train", "Car", "Bus", "Bus", "Bus", "Bus", …
#> $ gadgets_owned  <chr> "Smart TV, Tablet, Desktop Computer, Digital Camera, Sm…
#> $ r              <chr> "No", "No", "No", "No", "No", "Yes", "Yes", "Yes", "Yes…
#> $ python         <chr> "No", "Yes", "Yes", "Yes", "No", "Yes", "Yes", "No", "N…
#> $ sas            <chr> "No", "No", "No", "No", "No", "No", "No", "No", "Yes", …
#> $ stata          <chr> "No", "No", "Yes", "Yes", "Yes", "No", "Yes", "No", "Ye…
#> $ spss           <chr> "No", "No", "Yes", "No", "Yes", "No", "No", "Yes", "No"…
#> $ excel          <chr> "Yes", "No", "No", "No", "No", "No", "No", "No", "No", …
data("dict_recode")
dta_gtable(dict_recode)
names          values labels     is_ordered
region         1      Central    0
               2      North East
               3      South
               4      West
age_group      1      20-29      1
               2      30-39
               3      40-49
               4      50-59
               5      60-69
               6      70+
blood_group    1      A          0
               2      B
               3      AB
               4      O
marital_status 1      Single     0
               2      Married
               3      Other
education      1      Bachelors  1
               2      Masters
               3      Doctorate
employed       0      No         0
               1      Yes
ses            1      Low        1
               2      Middle
               3      High
language       1      English    0
               2      French
               3      Spanish
               4      Arabic
               5      Mandarin
               6      Other
phone          0      None       0
               1      Samsung
               2      Apple
               3      Xiaomi
               4      OnePlus
               5      Google
               6      Other
transport      1      Walking    0
               2      Bicycle
               3      Car
               4      Bus
               5      Train
r              0      No         0
               1      Yes
python         0      No         0
               1      Yes
sas            0      No         0
               1      Yes
stata          0      No         0
               1      Yes
spss           0      No         0
               1      Yes
excel          0      No         0
               1      Yes

result <- dta_recode(
  dat = data_sample,
  dict = dict_recode,
  is_force_sequential = TRUE
)
glimpse(result)
#> Rows: 2,500
#> Columns: 21
#> $ id             <chr> "STM/7539", "STM/7993", "STM/7387", "STM/5598", "STM/59…
#> $ region         <fct> Central, Central, South, West, North East, North East, …
#> $ age            <dbl> 56, 46, 45, 37, 45, 51, 56, 37, 50, 38, 48, 41, 24, 34,…
#> $ age_group      <ord> 50-59, 40-49, 40-49, 30-39, 40-49, 50-59, 50-59, 30-39,…
#> $ height         <dbl> 1.70, 1.57, 1.47, 1.67, 1.69, 1.90, 1.85, 1.64, 1.61, 1…
#> $ weight         <dbl> 73, 53, 85, 77, 53, 75, 69, 53, 56, 89, 73, 86, 76, 81,…
#> $ blood_group    <fct> AB, B, AB, AB, A, A, AB, B, A, AB, AB, A, B, AB, A, B, …
#> $ marital_status <fct> Married, Married, Married, Single, Single, Married, Sin…
#> $ education      <ord> Bachelors, Bachelors, Bachelors, Bachelors, Bachelors, …
#> $ employed       <fct> Yes, No, No, Yes, Yes, No, Yes, No, Yes, Yes, Yes, No, …
#> $ ses            <ord> Middle, Middle, High, Middle, Low, Middle, Low, Low, Mi…
#> $ language       <fct> Mandarin, French, Arabic, English, Arabic, Mandarin, En…
#> $ phone          <fct> OnePlus, OnePlus, Samsung, OnePlus, OnePlus, Samsung, O…
#> $ transport      <fct> Bicycle, Train, Car, Bus, Bus, Bus, Bus, Train, Bicycle…
#> $ gadgets_owned  <chr> "Smart TV, Tablet, Desktop Computer, Digital Camera, Sm…
#> $ r              <fct> No, No, No, No, No, Yes, Yes, Yes, Yes, Yes, Yes, No, N…
#> $ python         <fct> No, Yes, Yes, Yes, No, Yes, Yes, No, No, No, Yes, Yes, …
#> $ sas            <fct> No, No, No, No, No, No, No, No, Yes, No, No, No, No, No…
#> $ stata          <fct> No, No, Yes, Yes, Yes, No, Yes, No, Yes, No, Yes, Yes, …
#> $ spss           <fct> No, No, Yes, No, Yes, No, No, Yes, No, Yes, No, Yes, No…
#> $ excel          <fct> Yes, No, No, No, No, No, No, No, No, No, No, No, Yes, N…

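In the output above, variables flagged with is_ordered = 1 become ordered factors (<ord>) and the rest become plain factors (<fct>). The base R sketch below shows the same distinction on small hypothetical vectors; it is not the dta_recode() implementation.

```r
# Base R sketch: unordered vs ordered factors from a fixed set of labels
ses <- c("Middle", "High", "Low", "Middle")
region <- c("Central", "South", "West")

# Level order follows the dictionary rows (Low, Middle, High)
ses_ord <- factor(ses, levels = c("Low", "Middle", "High"), ordered = TRUE)

# No meaningful order for region, so a plain factor is enough
region_fct <- factor(
  region,
  levels = c("Central", "North East", "South", "West")
)

# Ordered factors support comparisons between levels
below_high <- ses_ord < "High"
print(below_high)
```
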
dta_replace()

Correct misspelled data using a dictionary.

data("data_misspelled")
dta_gtable(head(data_misspelled))
id region age height weight blood_group marital_status education ses r python sas stata spss excel
STM/7539 southern 46 1.85 77 ba Other BSC medium n n yes tru n n
STM/7993 South 45 1.64 53 AB maried Bsc Midle y N n no tru N
STM/7387 southern 37 1.61 75 B single BSC Middle N tru no Yes No n
STM/5598 West 45 1.80 69 B Single MSc low y N y fasle no Yes
STM/5901 South 51 1.81 53 A+ Others Bachelors Medium no yes Yes n tru Yes
STM/7529 North-East 56 1.35 56 O Single Bsc Mediam yes fasle no yes tru no

data("dict_misspelled")
dta_gtable(dict_misspelled)
variable       old        new
region         southern   South
               wset       West
               North-East North East
blood_group    ba         AB
               ab
               o          O
               A+         A
               a
marital_status maried     Married
               single     Single
               Singel
               Others     Other
education      BSC        Bachelors
               Bsc
               Bachelers
               MSc        Masters
               PhD        Doctorate
               Doctoral
ses            medium     Middle
               Midle
               Mediam
               low        Low
               Hihg       High
.global        fasle      No
               N
               n
               no
               yes        Yes
               y
               tru

# Correct the misspelled entries in `dat` using the
# `dict` dictionary

result <- dta_replace(
  dat = data_misspelled, 
  dict = dict_misspelled, 
  .name = variable, 
  .wrong = old, 
  .correct = new
)
dta_gtable(head(result))
id region age height weight blood_group marital_status education ses r python sas stata spss excel
STM/7539 South 46 1.85 77 AB Other Bachelors Middle No No Yes Yes No No
STM/7993 South 45 1.64 53 AB Married Bachelors Middle Yes No No No Yes No
STM/7387 South 37 1.61 75 B Single Bachelors Middle No Yes No Yes No No
STM/5598 West 45 1.80 69 B Single Masters Low Yes No Yes No No Yes
STM/5901 South 51 1.81 53 A Other Bachelors Medium No Yes Yes No Yes Yes
STM/7529 North East 56 1.35 56 O Single Bachelors Middle Yes No No Yes Yes No
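
The idea of a dictionary lookup can be sketched in base R with match(). The small dictionary below is hypothetical, and exact, case-sensitive matching is an assumption; it is at least consistent with the output above, where "Medium" is left unchanged because only "medium" appears in the dictionary.

```r
# Base R sketch: replace values found in a dictionary, keep the rest
# (hypothetical mini-dictionary; not the dta_replace() implementation)
dict <- data.frame(
  old = c("medium", "Midle", "Mediam"),
  new = c("Middle", "Middle", "Middle"),
  stringsAsFactors = FALSE
)

ses <- c("medium", "Midle", "Medium", "Middle")

# Position of each value in the dictionary, NA when it is not listed
idx <- match(ses, dict$old)

# Use the corrected value when a match exists, otherwise the original
corrected <- ifelse(is.na(idx), ses, dict$new[idx])
print(corrected)
```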

dta_transpose()

Transpose a data frame, using a specified column as the variable names.

data("data_cancer")
dta_gtable(data_cancer)
cancer_type V1   V2   V3   V4   V5   V6   V7   V8   V9   V10  V11  V12  V13  V14  V15  V16  V17
stomach     124  42   25   45   412  51   1112 46   103  876  146  340  396
bronchus    81   461  20   450  246  166  63   64   155  859  151  166  37   223  138  72   245
colon       248  372  189  1843 180  537  519  455  406  365  942  776  372  163  101  20   283
ovary       1234 89   201  356  2970 456

df <- dta_transpose(
  dat = data_cancer, .column_to_use_as_variables = cancer_type
)
dta_gtable(df)
stomach bronchus colon ovary
124     81       248   1234
42      461      372   89
25      20       189   201
45      450      1843  356
412     246      180   2970
51      166      537   456
1112    63       519
46      64       455
103     155      406
876     859      365
146     151      942
340     166      776
396     37       372
        223      163
        138      101
        72       20
        245      283
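
A rough base R equivalent uses t() and then moves the identifier column into the column names. The data frame below is a hypothetical subset of data_cancer, and the sketch is not the dta_transpose() implementation.

```r
# Base R sketch: transpose and use one column as the new variable names
dat <- data.frame(
  cancer_type = c("stomach", "ovary"),
  V1 = c(124, 1234),
  V2 = c(42, 89),
  V3 = c(25, NA)  # shorter series are padded with NA
)

# Transpose only the value columns, then restore a data frame
transposed <- as.data.frame(t(dat[, -1]))

# The identifier column supplies the new column names
names(transposed) <- dat$cancer_type
rownames(transposed) <- NULL

print(transposed)
```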

Compute and generate functions

dta_bmi()

Calculate body mass index (BMI).

data("data_bmi")
dta_gtable(data_bmi)
id age height weight
STM/4921 50 1.64 59
STM/4396 34 1.98 57
STM/7908 50 1.95 84
STM/7243 39 1.52 63
STM/4801 52 1.69 65
STM/5134 50 1.71 73
STM/7138 35 1.73 46
STM/6802 72 1.98 70
STM/4420 42 1.62 103
STM/6351 40 1.89 96
STM/4933 38 1.91 67
STM/4303 37 1.56 75
STM/7465 45 1.62 44
STM/4587 67 1.38 51
STM/5320 44 1.37 63

df <- dta_bmi(
  dat = data_bmi,
  .weight = weight,
  .height = height,
  name = body_mass_index,
  digits = 2
)
dta_gtable(df)
id age height weight body_mass_index
STM/4921 50 1.64 59 21.94
STM/4396 34 1.98 57 14.54
STM/7908 50 1.95 84 22.09
STM/7243 39 1.52 63 27.27
STM/4801 52 1.69 65 22.76
STM/5134 50 1.71 73 24.96
STM/7138 35 1.73 46 15.37
STM/6802 72 1.98 70 17.86
STM/4420 42 1.62 103 39.25
STM/6351 40 1.89 96 26.87
STM/4933 38 1.91 67 18.37
STM/4303 37 1.56 75 30.82
STM/7465 45 1.62 44 16.77
STM/4587 67 1.38 51 26.78
STM/5320 44 1.37 63 33.57
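
The values above agree with the standard formula, BMI = weight / height^2. The sketch below assumes weight in kilograms and height in metres (the units are not stated in the source) and reproduces the first three rows.

```r
# Base R sketch of the BMI formula
# Assumption: weight in kilograms, height in metres
height <- c(1.64, 1.98, 1.95)
weight <- c(59, 57, 84)

bmi <- round(weight / height^2, 2)
print(bmi)
```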

dta_bmicat()

Categorize BMI into weight categories.

data("data_bmicat")
dta_gtable(data_bmicat)
id bmi
STM/4921 21.93635
STM/4396 14.53933
STM/7908 22.09073
STM/7243 27.26801
STM/4801 22.75831
STM/5134 24.96495
STM/7138 15.36971
STM/6802 17.85532
STM/4420 39.24707
STM/6351 26.87495
STM/4933 18.36572
STM/4303 30.81854
STM/7465 16.76574
STM/4587 26.78009
STM/5320 33.56599

# Categorize `bmi` into the standard BMI categories

df <- dta_bmicat(
  dat = data_bmicat,
  .bmi = bmi,
  name = bmi_cat,
  is_extended = FALSE,
  as_factor = TRUE
)
dta_gtable(df)
id bmi bmi_cat
STM/4921 21.93635 Healthy weight
STM/4396 14.53933 Underweight
STM/7908 22.09073 Healthy weight
STM/7243 27.26801 Overweight
STM/4801 22.75831 Healthy weight
STM/5134 24.96495 Healthy weight
STM/7138 15.36971 Underweight
STM/6802 17.85532 Underweight
STM/4420 39.24707 Obesity
STM/6351 26.87495 Overweight
STM/4933 18.36572 Underweight
STM/4303 30.81854 Obesity
STM/7465 16.76574 Underweight
STM/4587 26.78009 Overweight
STM/5320 33.56599 Obesity
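
The categories above are consistent with the commonly used adult cut-offs of 18.5, 25, and 30. The sketch below assumes those cut-offs with left-closed intervals; the exact boundaries used by dta_bmicat() are not stated in the source.

```r
# Base R sketch: bin BMI values into categories with cut()
# Assumption: cut-offs at 18.5, 25 and 30, intervals closed on the left
bmi <- c(21.93635, 14.53933, 27.26801, 30.81854)

bmi_cat <- cut(
  bmi,
  breaks = c(-Inf, 18.5, 25, 30, Inf),
  labels = c("Underweight", "Healthy weight", "Overweight", "Obesity"),
  right = FALSE
)

print(data.frame(bmi, bmi_cat))
```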

dta_mrq()

Split multiple response question column into binary columns.

data("data_gadgets")
dat <- data_gadgets
dta_gtable(dat)
gadgets_owned
Smartwatch, Tablet, Smartphone
Tablet, Smartwatch, Smart TV, Desktop Computer
Smartphone
Laptop, Tablet
Tablet, Smart TV, Digital Camera, Laptop
Laptop, Desktop Computer, Digital Camera, Smart TV, Smartphone
Digital Camera, Smartphone, Desktop Computer, Smartwatch
Smartwatch, Smart TV, Laptop, Smartphone
Desktop Computer
Smartphone, Laptop, Smart TV, Smartwatch
Tablet
Digital Camera, Tablet, Desktop Computer
Digital Camera, Desktop Computer, Smart TV, Smartwatch, Laptop
Tablet, Desktop Computer, Smart TV
Digital Camera, Desktop Computer, Smart TV, Smartphone, Laptop

# Split `gadgets_owned` column into separate columns.
# The created columns will be logical (i.e. TRUE / FALSE).

df <- dta_mrq(
  dat = dat,
  .column = gadgets_owned,
  delimeter = ", ",
  is_clean_names = TRUE
)
dta_gtable(df)
gadgets_owned Smartwatch Tablet Smartphone Smart TV Desktop Computer Laptop Digital Camera
Smartwatch, Tablet, Smartphone TRUE TRUE TRUE FALSE FALSE FALSE FALSE
Tablet, Smartwatch, Smart TV, Desktop Computer TRUE TRUE FALSE TRUE TRUE FALSE FALSE
Smartphone FALSE FALSE TRUE FALSE FALSE FALSE FALSE
Laptop, Tablet FALSE TRUE FALSE FALSE FALSE TRUE FALSE
Tablet, Smart TV, Digital Camera, Laptop FALSE TRUE FALSE TRUE FALSE TRUE TRUE
Laptop, Desktop Computer, Digital Camera, Smart TV, Smartphone FALSE FALSE TRUE TRUE TRUE TRUE TRUE
Digital Camera, Smartphone, Desktop Computer, Smartwatch TRUE FALSE TRUE FALSE TRUE FALSE TRUE
Smartwatch, Smart TV, Laptop, Smartphone TRUE FALSE TRUE TRUE FALSE TRUE FALSE
Desktop Computer FALSE FALSE FALSE FALSE TRUE FALSE FALSE
Smartphone, Laptop, Smart TV, Smartwatch TRUE FALSE TRUE TRUE FALSE TRUE FALSE
Tablet FALSE TRUE FALSE FALSE FALSE FALSE FALSE
Digital Camera, Tablet, Desktop Computer FALSE TRUE FALSE FALSE TRUE FALSE TRUE
Digital Camera, Desktop Computer, Smart TV, Smartwatch, Laptop TRUE FALSE FALSE TRUE TRUE TRUE TRUE
Tablet, Desktop Computer, Smart TV FALSE TRUE FALSE TRUE TRUE FALSE FALSE
Digital Camera, Desktop Computer, Smart TV, Smartphone, Laptop FALSE FALSE TRUE TRUE TRUE TRUE TRUE
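
The splitting step can be sketched in base R with strsplit(). The three responses below are a hypothetical subset, and the code only illustrates the idea; it is not the dta_mrq() implementation.

```r
# Base R sketch: one logical column per option in a delimited response
responses <- c(
  "Smartwatch, Tablet, Smartphone",
  "Smartphone",
  "Laptop, Tablet"
)

# Split each response on the delimiter
parts <- strsplit(responses, ", ", fixed = TRUE)

# Every distinct option, in order of first appearance
options <- unique(unlist(parts))

# Rows are respondents, columns are options
indicators <- vapply(
  options,
  function(opt) vapply(parts, function(p) opt %in% p, logical(1)),
  logical(length(parts))
)

result <- cbind(
  data.frame(gadgets_owned = responses),
  as.data.frame(indicators)
)
print(result)
```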

Exploratory data analysis functions

dta_freq()

Get frequency distribution of a specific column.

data("data_sample")
tab <- dta_freq(dat = data_sample, .column = region)
dta_gtable(tab)
Region of residence Frequency Percent
Central 356 14.24%
North East 764 30.56%
South 851 34.04%
West 529 21.16%
Total 2500 100.00%

# Sort by decreasing frequency and remove the percentage symbol

tab2 <- dta_freq(
  dat = data_sample,
  .column = region,
  is_sorted = TRUE,
  is_decreasing = TRUE,
  add_percent_symbol = FALSE
)
dta_gtable(tab2)
Region of residence Frequency Percent
South 851 34.04
North East 764 30.56
West 529 21.16
Central 356 14.24
Total 2500 100.00
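
Each percentage is the count divided by the total number of observations. The base R sketch below recomputes the table from the counts shown above; the object names are hypothetical.

```r
# Base R sketch: percentages and sorting for a frequency table
counts <- c(Central = 356, "North East" = 764, South = 851, West = 529)

percent <- round(100 * counts / sum(counts), 2)

tab <- data.frame(
  region = names(counts),
  frequency = unname(counts),
  percent = unname(percent)
)

# Sort by decreasing frequency
tab <- tab[order(tab$frequency, decreasing = TRUE), ]
print(tab)
```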

dta_freq_mrq()

Frequency table for multiple response questions.

data("data_sample")

# An example with multiple response variables labelled
# as Yes / No

result <- dta_freq_mrq(
  dat = data_sample,
  .columns = r:excel,
  value = "Yes",
  name = "Programming proficiency"
)
dta_gtable(result)
Programming proficiency Frequency Responses Cases
R 1245 16.79% 49.80%
Python 1267 17.09% 50.68%
SAS 1266 17.08% 50.64%
Stata 1225 16.53% 49.00%
SPSS 1222 16.48% 48.88%
Microsoft Excel 1188 16.03% 47.52%
Total 7413 100.00% 296.52%

# Remove the percentage symbol

result2 <- dta_freq_mrq(
  dat = data_sample,
  .columns = r:excel,
  value = "Yes",
  name = "Programming proficiency",
  add_percent_symbol = FALSE
)
dta_gtable(result2)
Programming proficiency Frequency Responses Cases
R 1245 16.79 49.80
Python 1267 17.09 50.68
SAS 1266 17.08 50.64
Stata 1225 16.53 49.00
SPSS 1222 16.48 48.88
Microsoft Excel 1188 16.03 47.52
Total 7413 100.00 296.52
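
The two percentage columns use different denominators: Responses divides each count by the total number of "Yes" responses (7413), while Cases divides by the number of respondents (2500), which is why the Cases total exceeds 100%. A short sketch using the counts above:

```r
# Base R sketch: percent of responses vs percent of cases
counts <- c(
  R = 1245, Python = 1267, SAS = 1266,
  Stata = 1225, SPSS = 1222, "Microsoft Excel" = 1188
)
n_cases <- 2500  # number of respondents in data_sample

pct_responses <- round(100 * counts / sum(counts), 2)
pct_cases <- round(100 * counts / n_cases, 2)

# Total of the Cases column: average number of "Yes" answers per 100 people
total_cases <- 100 * sum(counts) / n_cases

print(cbind(counts, pct_responses, pct_cases))
print(total_cases)
```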

dta_crosstab()

Generate cross tabulations with optional percentages and totals.

data("data_sample")
df <- data_sample

# Crosstabulation of frequencies (counts)

result <- dta_crosstab(
  dat = df, .row = region, .column = age_group
)
dta_gtable(result)
Variable 20-29 30-39 40-49 50-59 60-69 70+ Total
Central 33 68 100 80 45 30 356
North East 65 133 199 174 127 66 764
South 94 151 177 206 154 69 851
West 50 93 117 128 96 45 529
Total 242 445 593 588 422 210 2500

# Calculate column percentages

result2 <- dta_crosstab(
  dat = df, 
  .row = region,
  .column = age_group,
  cells = "col",
  add_totals = "col"
)
dta_gtable(result2)
region/age_group 20-29 30-39 40-49 50-59 60-69 70+ Total
Central 33 (13.64%) 68 (15.28%) 100 (16.86%) 80 (13.61%) 45 (10.66%) 30 (14.29%) 356 (14.24%)
North East 65 (26.86%) 133 (29.89%) 199 (33.56%) 174 (29.59%) 127 (30.09%) 66 (31.43%) 764 (30.56%)
South 94 (38.84%) 151 (33.93%) 177 (29.85%) 206 (35.03%) 154 (36.49%) 69 (32.86%) 851 (34.04%)
West 50 (20.66%) 93 (20.90%) 117 (19.73%) 128 (21.77%) 96 (22.75%) 45 (21.43%) 529 (21.16%)

# Calculate row percentages

result3 <- dta_crosstab(
  dat = df,
  .row = region,
  .column = age_group,
  cells = "row",
  add_totals = "row"
)
dta_gtable(result3)
region/age_group 20-29 30-39 40-49 50-59 60-69 70+
Central 33 (9.27%) 68 (19.10%) 100 (28.09%) 80 (22.47%) 45 (12.64%) 30 (8.43%)
North East 65 (8.51%) 133 (17.41%) 199 (26.05%) 174 (22.77%) 127 (16.62%) 66 (8.64%)
South 94 (11.05%) 151 (17.74%) 177 (20.80%) 206 (24.21%) 154 (18.10%) 69 (8.11%)
West 50 (9.45%) 93 (17.58%) 117 (22.12%) 128 (24.20%) 96 (18.15%) 45 (8.51%)
Total 242 (9.68%) 445 (17.80%) 593 (23.72%) 588 (23.52%) 422 (16.88%) 210 (8.40%)
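
Column percentages divide each cell by its column total, and row percentages divide by the row total. The base R sketch below recomputes both from the counts in the first table with prop.table(); it illustrates the arithmetic only.

```r
# Base R sketch: row and column percentages from a table of counts
counts <- matrix(
  c(
    33, 68, 100, 80, 45, 30,
    65, 133, 199, 174, 127, 66,
    94, 151, 177, 206, 154, 69,
    50, 93, 117, 128, 96, 45
  ),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    region = c("Central", "North East", "South", "West"),
    age_group = c("20-29", "30-39", "40-49", "50-59", "60-69", "70+")
  )
)

col_pct <- round(100 * prop.table(counts, margin = 2), 2)  # by column
row_pct <- round(100 * prop.table(counts, margin = 1), 2)  # by row

print(col_pct)
print(row_pct)
```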

Other utilities

dta_to_numeric()

Convert numeric strings in a data frame or tibble to numeric values.

# A data frame with numeric strings (a), characters (b), and numbers (c)

df <- data.frame(
  a = c("1", "2", "3"),
  b = c("A", "B", "C"),
  c = c(4, 5, 6)
)
str(df)
#> 'data.frame':    3 obs. of  3 variables:
#>  $ a: chr  "1" "2" "3"
#>  $ b: chr  "A" "B" "C"
#>  $ c: num  4 5 6

df <- dta_to_numeric(df)
str(df)
#> tibble [3 × 3] (S3: tbl_df/tbl/data.frame)
#>  $ a: num [1:3] 1 2 3
#>  $ b: chr [1:3] "A" "B" "C"
#>  $ c: num [1:3] 4 5 6

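A minimal base R sketch of the same idea converts a character column only when every non-missing value parses as a number. The helper is_numeric_string() is hypothetical, and dta_to_numeric() may apply different rules.

```r
# Base R sketch: convert character columns that hold only numbers
df <- data.frame(
  a = c("1", "2", "3"),
  b = c("A", "B", "C"),
  c = c(4, 5, 6),
  stringsAsFactors = FALSE
)

# TRUE when x is character and all non-missing values parse as numbers
is_numeric_string <- function(x) {
  is.character(x) &&
    !anyNA(suppressWarnings(as.numeric(x[!is.na(x)])))
}

# Replace columns in place so `df` stays a data frame
df[] <- lapply(df, function(col) {
  if (is_numeric_string(col)) as.numeric(col) else col
})
str(df)
```
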
Conclusion

This document demonstrates the functionality and usage of the dta package, helping users manage, analyze, and transform their data efficiently. For more details, consult the package documentation.