Efficient Data Management and Manipulation • dta

# install.packages("dta") # uncomment if `dta` is not installed
library(dta)

This document provides a comprehensive guide to the dta package, showcasing the sample datasets and functions available, along with their usage examples. The package is designed for efficient data management, transformation, and exploratory data analysis.

Sample datasets and dictionaries

The package includes several sample datasets and dictionaries. Below are descriptions and examples of how to load a few of them.

Datasets

data_bmi: Sample data for body mass index (BMI) calculations.

data("data_bmi")
dta_gtable(data_bmi)

id	age	height	weight
STM/4921	50	1.64	59
STM/4396	34	1.98	57
STM/7908	50	1.95	84
STM/7243	39	1.52	63
STM/4801	52	1.69	65
STM/5134	50	1.71	73
STM/7138	35	1.73	46
STM/6802	72	1.98	70
STM/4420	42	1.62	103
STM/6351	40	1.89	96
STM/4933	38	1.91	67
STM/4303	37	1.56	75
STM/7465	45	1.62	44
STM/4587	67	1.38	51
STM/5320	44	1.37	63

data_cancer: Sample data on the survival of cancer patients, in days.

data("data_cancer")
dta_gtable(data_cancer)

cancer_type	V1	V2	V3	V4	V5	V6	V7	V8	V9	V10	V11	V12	V13	V14	V15	V16	V17
stomach	124	42	25	45	412	51	1112	46	103	876	146	340	396
bronchus	81	461	20	450	246	166	63	64	155	859	151	166	37	223	138	72	245
colon	248	372	189	1843	180	537	519	455	406	365	942	776	372	163	101	20	283
ovary	1234	89	201	356	2970	456

data_misspelled: Sample dataset with general information.

data("data_misspelled")
dta_gtable(data_misspelled)

id	region	age	height	weight	blood_group	marital_status	education	ses	r	python	sas	stata	spss	excel
STM/7539	southern	46	1.85	77	ba	Other	BSC	medium	n	n	yes	tru	n	n
STM/7993	South	45	1.64	53	AB	maried	Bsc	Midle	y	N	n	no	tru	N
STM/7387	southern	37	1.61	75	B	single	BSC	Middle	N	tru	no	Yes	No	n
STM/5598	West	45	1.80	69	B	Single	MSc	low	y	N	y	fasle	no	Yes
STM/5901	South	51	1.81	53	A+	Others	Bachelors	Medium	no	yes	Yes	n	tru	Yes
STM/7529	North-East	56	1.35	56	O	Single	Bsc	Mediam	yes	fasle	no	yes	tru	no
STM/7238	North East	37	1.44	89	o	Single	Bachelers	High	No	n	Yes	No	yes	no
STM/5417	North-East	50	1.67	73	o	single	MSc	High	No	y	Yes	tru	N	yes
STM/5907	Central	38	1.48	86	o	Other	Bachelers	Midle	N	tru	y	yes	fasle	n
STM/7877	Central	48	1.71	76	O	Others	Bachelors	low	N	N	tru	n	fasle	no
STM/4762	South	41	1.96	81	ab	Singel	Bachelors	Middle	fasle	tru	no	no	no	no
STM/7345	North-East	24	1.98	77	a	Others	MSc	medium	no	Yes	n	No	N	Yes
STM/6968	wset	34	1.98	75	A+	Other	Bachelers	low	yes	N	fasle	y	Yes	tru
STM/5816	West	43	1.76	75	B	maried	PhD	Medium	tru	y	n	Yes	tru	fasle
STM/7707	southern	63	1.92	67	o	maried	Bsc	Medium	No	yes	tru	Yes	No	tru
STM/7449	North East	59	1.69	74	AB	Other	MSc	Midle	N	fasle	no	no	N	No
STM/6853	southern	35	1.43	72	A	Married	Doctoral	Midle	tru	fasle	no	y	tru	fasle
STM/5634	North East	22	1.63	72	o	maried	Doctorate	Medium	No	y	N	y	No	tru
STM/4584	North East	62	1.92	87	AB	Other	Masters	Midle	Yes	fasle	yes	tru	yes	yes
STM/4946	southern	41	1.40	93	O	Singel	PhD	medium	fasle	N	tru	yes	n	n
STM/6798	North East	60	1.95	109	A	Married	Bachelors	low	No	N	N	Yes	y	Yes
STM/4377	North East	44	1.50	44	A	Others	BSC	low	yes	No	n	fasle	yes	n
STM/5435	Central	30	1.96	58	ba	single	MSc	Hihg	fasle	N	fasle	No	Yes	fasle
STM/5562	southern	71	1.95	74	AB	Others	Bsc	Midle	Yes	No	tru	Yes	No	yes
STM/7617	North East	53	1.91	53	a	Others	Doctorate	Low	No	N	fasle	y	N	N
STM/6677	South	60	1.96	72	B	Single	Bachelors	medium	n	fasle	tru	N	n	N
STM/4773	wset	41	1.39	74	A	Married	Bsc	High	fasle	No	n	yes	N	N
STM/5416	Central	64	1.98	59	A	maried	Bachelors	Middle	Yes	Yes	no	fasle	fasle	no
STM/4805	South	27	1.91	53	o	Singel	Bachelors	low	fasle	No	yes	yes	n	fasle
STM/5493	wset	63	1.38	66	ba	Married	MSc	medium	Yes	Yes	no	Yes	tru	No
STM/7650	South	54	1.88	56	ba	Others	MSc	Middle	y	fasle	N	No	yes	n
STM/5458	wset	61	1.95	85	a	single	BSC	Hihg	No	Yes	N	fasle	Yes	No
STM/6102	South	35	1.71	62	o	Others	Bsc	Low	y	No	Yes	Yes	N	fasle
STM/7508	wset	29	1.77	82	ba	Married	Doctorate	Mediam	no	no	N	No	yes	fasle
STM/4121	southern	37	1.78	70	B	Others	PhD	medium	No	Yes	y	N	N	fasle
STM/4830	North-East	41	1.69	65	a	Single	Doctorate	High	No	y	fasle	tru	Yes	no
STM/6271	Central	58	1.56	77	AB	maried	MSc	Mediam	n	fasle	N	tru	n	N
STM/7818	North East	61	1.65	82	a	maried	MSc	low	n	yes	fasle	tru	fasle	no
STM/4671	Central	36	1.56	58	A+	Singel	MSc	High	n	n	yes	fasle	N	fasle
STM/5864	southern	40	1.95	76	o	Singel	Bsc	medium	tru	fasle	n	y	no	n
STM/4951	North East	52	1.80	77	B	Others	Bachelors	medium	tru	no	n	fasle	tru	yes
STM/6287	wset	25	1.47	81	O	maried	Doctorate	Middle	yes	no	tru	tru	fasle	fasle
STM/4340	North East	50	1.95	84	ab	Other	Bachelors	Middle	Yes	n	fasle	y	yes	No
STM/5202	North-East	55	1.49	52	o	Others	Doctorate	Middle	y	Yes	n	n	n	no
STM/7939	wset	49	1.47	87	B	maried	Masters	Hihg	fasle	tru	y	tru	y	y
STM/4201	North-East	35	1.66	89	ab	single	Bachelors	Midle	fasle	no	no	N	no	fasle
STM/6411	Central	32	1.90	72	B	single	PhD	High	tru	Yes	N	No	tru	tru
STM/4685	South	68	1.52	85	a	Single	PhD	Low	N	no	tru	No	Yes	N
STM/5343	North East	48	1.36	54	ba	Other	Bachelers	High	Yes	N	n	y	yes	Yes
STM/5129	wset	55	1.72	86	ba	Others	Doctorate	low	yes	Yes	yes	Yes	fasle	No

Dictionaries

Dictionaries are lookup tables whose values are applied to update, modify, or standardize the values in a specified dataset, enabling consistent and efficient data transformation.

dict_recode: A dictionary with variable value mappings with labels and ordered status.

data("dict_recode")
dta_gtable(dict_recode)

names	values	labels	is_ordered
region	1	Central	0
	2	North East
	3	South
	4	West
age_group	1	20-29	1
	2	30-39
	3	40-49
	4	50-59
	5	60-69
	6	70+
blood_group	1	A	0
	2	B
	3	AB
	4	O
marital_status	1	Single	0
	2	Married
	3	Other
education	1	Bachelors	1
	2	Masters
	3	Doctorate
employed	0	No	0
	1	Yes
ses	1	Low	1
	2	Middle
	3	High
language	1	English	0
	2	French
	3	Spanish
	4	Arabic
	5	Mandarin
	6	Other
phone	0	None	0
	1	Samsung
	2	Apple
	3	Xiaomi
	4	OnePlus
	5	Google
	6	Other
transport	1	Walking	0
	2	Bicycle
	3	Car
	4	Bus
	5	Train
r	0	No	0
	1	Yes
python	0	No	0
	1	Yes
sas	0	No	0
	1	Yes
stata	0	No	0
	1	Yes
spss	0	No	0
	1	Yes
excel	0	No	0
	1	Yes

dict_misspelled: A dictionary that maps misspelled values (old) to their corrected values (new) for each variable.

data("dict_misspelled")
dta_gtable(dict_misspelled)

variable	old	new
region	southern	South
	wset	West
	North-East	North East
blood_group	ba	AB
	ab
	o	O
	A+	A
	a
marital_status	maried	Married
	single	Single
	Singel
	Others	Other
education	BSC	Bachelors
	Bsc
	Bachelers
	MSc	Masters
	PhD	Doctorate
	Doctoral
ses	medium	Middle
	Midle
	Mediam
	low	Low
	Hihg	High
.global	fasle	No
	N
	n
	no
	yes	Yes
	y
	tru

Functions

Data management and transformation functions

dta_columns()

Retrieve column names from a data frame.

data(mtcars)
dta_columns(mtcars, .columns = starts_with("c"))
#> [1] "cyl"  "carb"
dta_columns(mtcars, .columns = cyl:wt)
#> [1] "cyl"  "disp" "hp"   "drat" "wt"
dta_columns(mtcars, .columns = c(mpg, hp, vs, gear))
#> [1] "mpg"  "hp"   "vs"   "gear"
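
For comparison, the same three selections can be sketched in base R. This is only an illustration of what each selection returns, not how dta_columns() is implemented.

```r
# Base R sketch of the three selections above (not the dta implementation).
cols <- names(mtcars)

# Names starting with "c"
starts_c <- cols[startsWith(cols, "c")]

# A contiguous range of columns, from `cyl` through `wt`
range_cols <- cols[match("cyl", cols):match("wt", cols)]

# An explicit set of columns, kept in the order requested
picked <- intersect(c("mpg", "hp", "vs", "gear"), cols)

print(starts_c)
print(range_cols)
print(picked)
```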

dta_duplicates()

Find duplicate rows based on specific columns.

df <- data.frame(
  id = c(14, 20, 12, 32, 14, 23, 15, 12, 30, 14),
  name = c(
   "Mary", "Mark", "Faith", "David", "Mary", "Daniel", "Christine",
   "Johnson", "Elizabeth", "Mary"
  ),
  age = c(21, 18, 25, 17, 21, 24, 21, 19, 20, 21)
)

result <- dta_duplicates(df)
dta_gtable(result)

id	name	age
14	Mary	21
14	Mary	21
14	Mary	21


result2 <- dta_duplicates(df, .columns = id)
dta_gtable(result2)

id	name	age
14	Mary	21
12	Faith	25
14	Mary	21
12	Johnson	19
14	Mary	21
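
Both results above keep every member of a duplicated group, including the first occurrence. A minimal base R sketch of that behaviour, assuming only `duplicated()` from base R (it is not the package's own code):

```r
df <- data.frame(
  id = c(14, 20, 12, 32, 14, 23, 15, 12, 30, 14),
  name = c(
    "Mary", "Mark", "Faith", "David", "Mary", "Daniel", "Christine",
    "Johnson", "Elizabeth", "Mary"
  ),
  age = c(21, 18, 25, 17, 21, 24, 21, 19, 20, 21),
  stringsAsFactors = FALSE
)

# duplicated() flags only the second and later copies, so scan from both
# ends to flag every member of a duplicated group.
is_dup_row <- duplicated(df) | duplicated(df, fromLast = TRUE)
dup_rows <- df[is_dup_row, ]

# Same idea, but comparing the `id` column only.
is_dup_id <- duplicated(df$id) | duplicated(df$id, fromLast = TRUE)
dup_ids <- df[is_dup_id, ]

print(dup_rows)
print(dup_ids)
```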

dta_label()

Assign variable labels to data frame or tibble columns.

dat <- data.frame(
  age = c(25, 30, 35, 40),
  gender = c("Male", "Female", "Female", "Male"),
  income = c(50000, 60000, 55000, 65000)
)

names <- c("age", "income")
labels <- c("Age in years", "Annual income")

result <- dta_label(
  dat, dict = NULL, .names = names, .labels = labels
)

dta_gtable(result)

Age in years	gender	Annual income
25	Male	50000
30	Female	60000
35	Female	55000
40	Male	65000
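
Variable labels are commonly stored as a "label" attribute on each column, the convention used by packages such as haven and labelled. The sketch below assumes that convention; whether dta_label() stores labels the same way is not stated in this document.

```r
dat <- data.frame(
  age = c(25, 30, 35, 40),
  gender = c("Male", "Female", "Female", "Male"),
  income = c(50000, 60000, 55000, 65000),
  stringsAsFactors = FALSE
)

# Assumption: labels live in a "label" attribute on each column.
labels <- c(age = "Age in years", income = "Annual income")
for (nm in names(labels)) {
  attr(dat[[nm]], "label") <- labels[[nm]]
}

# Columns without an entry in `labels` are left unlabelled.
print(attr(dat$age, "label"))
print(attr(dat$gender, "label"))
```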

dta_recode()

Recode variables in a data frame based on a dictionary.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
data("data_sample")
glimpse(data_sample) # look at the data type column
#> Rows: 2,500
#> Columns: 21
#> $ id             <chr> "STM/7539", "STM/7993", "STM/7387", "STM/5598", "STM/59…
#> $ region         <chr> "Central", "Central", "South", "West", "North East", "N…
#> $ age            <dbl> 56, 46, 45, 37, 45, 51, 56, 37, 50, 38, 48, 41, 24, 34,…
#> $ age_group      <chr> "50-59", "40-49", "40-49", "30-39", "40-49", "50-59", "…
#> $ height         <dbl> 1.70, 1.57, 1.47, 1.67, 1.69, 1.90, 1.85, 1.64, 1.61, 1…
#> $ weight         <dbl> 73, 53, 85, 77, 53, 75, 69, 53, 56, 89, 73, 86, 76, 81,…
#> $ blood_group    <chr> "AB", "B", "AB", "AB", "A", "A", "AB", "B", "A", "AB", …
#> $ marital_status <chr> "Married", "Married", "Married", "Single", "Single", "M…
#> $ education      <chr> "Bachelors", "Bachelors", "Bachelors", "Bachelors", "Ba…
#> $ employed       <chr> "Yes", "No", "No", "Yes", "Yes", "No", "Yes", "No", "Ye…
#> $ ses            <chr> "Middle", "Middle", "High", "Middle", "Low", "Middle", …
#> $ language       <chr> "Mandarin", "French", "Arabic", "English", "Arabic", "M…
#> $ phone          <chr> "OnePlus", "OnePlus", "Samsung", "OnePlus", "OnePlus", …
#> $ transport      <chr> "Bicycle", "Train", "Car", "Bus", "Bus", "Bus", "Bus", …
#> $ gadgets_owned  <chr> "Smart TV, Tablet, Desktop Computer, Digital Camera, Sm…
#> $ r              <chr> "No", "No", "No", "No", "No", "Yes", "Yes", "Yes", "Yes…
#> $ python         <chr> "No", "Yes", "Yes", "Yes", "No", "Yes", "Yes", "No", "N…
#> $ sas            <chr> "No", "No", "No", "No", "No", "No", "No", "No", "Yes", …
#> $ stata          <chr> "No", "No", "Yes", "Yes", "Yes", "No", "Yes", "No", "Ye…
#> $ spss           <chr> "No", "No", "Yes", "No", "Yes", "No", "No", "Yes", "No"…
#> $ excel          <chr> "Yes", "No", "No", "No", "No", "No", "No", "No", "No", …
data("dict_recode")
dta_gtable(dict_recode)

names	values	labels	is_ordered
region	1	Central	0
	2	North East
	3	South
	4	West
age_group	1	20-29	1
	2	30-39
	3	40-49
	4	50-59
	5	60-69
	6	70+
blood_group	1	A	0
	2	B
	3	AB
	4	O
marital_status	1	Single	0
	2	Married
	3	Other
education	1	Bachelors	1
	2	Masters
	3	Doctorate
employed	0	No	0
	1	Yes
ses	1	Low	1
	2	Middle
	3	High
language	1	English	0
	2	French
	3	Spanish
	4	Arabic
	5	Mandarin
	6	Other
phone	0	None	0
	1	Samsung
	2	Apple
	3	Xiaomi
	4	OnePlus
	5	Google
	6	Other
transport	1	Walking	0
	2	Bicycle
	3	Car
	4	Bus
	5	Train
r	0	No	0
	1	Yes
python	0	No	0
	1	Yes
sas	0	No	0
	1	Yes
stata	0	No	0
	1	Yes
spss	0	No	0
	1	Yes
excel	0	No	0
	1	Yes


result <- dta_recode(
  dat = data_sample,
  dict = dict_recode,
  is_force_sequential = TRUE
)
glimpse(result)
#> Rows: 2,500
#> Columns: 21
#> $ id             <chr> "STM/7539", "STM/7993", "STM/7387", "STM/5598", "STM/59…
#> $ region         <fct> Central, Central, South, West, North East, North East, …
#> $ age            <dbl> 56, 46, 45, 37, 45, 51, 56, 37, 50, 38, 48, 41, 24, 34,…
#> $ age_group      <ord> 50-59, 40-49, 40-49, 30-39, 40-49, 50-59, 50-59, 30-39,…
#> $ height         <dbl> 1.70, 1.57, 1.47, 1.67, 1.69, 1.90, 1.85, 1.64, 1.61, 1…
#> $ weight         <dbl> 73, 53, 85, 77, 53, 75, 69, 53, 56, 89, 73, 86, 76, 81,…
#> $ blood_group    <fct> AB, B, AB, AB, A, A, AB, B, A, AB, AB, A, B, AB, A, B, …
#> $ marital_status <fct> Married, Married, Married, Single, Single, Married, Sin…
#> $ education      <ord> Bachelors, Bachelors, Bachelors, Bachelors, Bachelors, …
#> $ employed       <fct> Yes, No, No, Yes, Yes, No, Yes, No, Yes, Yes, Yes, No, …
#> $ ses            <ord> Middle, Middle, High, Middle, Low, Middle, Low, Low, Mi…
#> $ language       <fct> Mandarin, French, Arabic, English, Arabic, Mandarin, En…
#> $ phone          <fct> OnePlus, OnePlus, Samsung, OnePlus, OnePlus, Samsung, O…
#> $ transport      <fct> Bicycle, Train, Car, Bus, Bus, Bus, Bus, Train, Bicycle…
#> $ gadgets_owned  <chr> "Smart TV, Tablet, Desktop Computer, Digital Camera, Sm…
#> $ r              <fct> No, No, No, No, No, Yes, Yes, Yes, Yes, Yes, Yes, No, N…
#> $ python         <fct> No, Yes, Yes, Yes, No, Yes, Yes, No, No, No, Yes, Yes, …
#> $ sas            <fct> No, No, No, No, No, No, No, No, Yes, No, No, No, No, No…
#> $ stata          <fct> No, No, Yes, Yes, Yes, No, Yes, No, Yes, No, Yes, Yes, …
#> $ spss           <fct> No, No, Yes, No, Yes, No, No, Yes, No, Yes, No, Yes, No…
#> $ excel          <fct> Yes, No, No, No, No, No, No, No, No, No, No, No, Yes, N…
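
The `<fct>` and `<ord>` column types in the output correspond to unordered and ordered factors. A small base R sketch of the same conversion for two variables, using the levels listed in dict_recode (the short input vectors are made up for illustration):

```r
# Levels taken from the dict_recode table; input values are illustrative.
age_levels <- c("20-29", "30-39", "40-49", "50-59", "60-69", "70+")
region_levels <- c("Central", "North East", "South", "West")

# is_ordered = 1 in the dictionary -> ordered factor
age_group <- factor(
  c("50-59", "40-49", "40-49", "30-39"),
  levels = age_levels, ordered = TRUE
)

# is_ordered = 0 in the dictionary -> plain (unordered) factor
region <- factor(c("Central", "South"), levels = region_levels)

print(age_group)
print(as.integer(region)) # underlying codes follow the level order
```

Ordered factors support comparisons such as `age_group[1] > age_group[2]`, which is why the ordering flag matters for variables like age group, education, and socioeconomic status.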

dta_replace()

Correct misspelled data using a dictionary.

data("data_misspelled")
dta_gtable(head(data_misspelled))

id	region	age	height	weight	blood_group	marital_status	education	ses	r	python	sas	stata	spss	excel
STM/7539	southern	46	1.85	77	ba	Other	BSC	medium	n	n	yes	tru	n	n
STM/7993	South	45	1.64	53	AB	maried	Bsc	Midle	y	N	n	no	tru	N
STM/7387	southern	37	1.61	75	B	single	BSC	Middle	N	tru	no	Yes	No	n
STM/5598	West	45	1.80	69	B	Single	MSc	low	y	N	y	fasle	no	Yes
STM/5901	South	51	1.81	53	A+	Others	Bachelors	Medium	no	yes	Yes	n	tru	Yes
STM/7529	North-East	56	1.35	56	O	Single	Bsc	Mediam	yes	fasle	no	yes	tru	no


data("dict_misspelled")
dta_gtable(dict_misspelled)

variable	old	new
region	southern	South
	wset	West
	North-East	North East
blood_group	ba	AB
	ab
	o	O
	A+	A
	a
marital_status	maried	Married
	single	Single
	Singel
	Others	Other
education	BSC	Bachelors
	Bsc
	Bachelers
	MSc	Masters
	PhD	Doctorate
	Doctoral
ses	medium	Middle
	Midle
	Mediam
	low	Low
	Hihg	High
.global	fasle	No
	N
	n
	no
	yes	Yes
	y
	tru


# Correct the misspelled entries in `dat` using the
# `dict` dictionary

result <- dta_replace(
  dat = data_misspelled, 
  dict = dict_misspelled, 
  .name = variable, 
  .wrong = old, 
  .correct = new
)
dta_gtable(head(result))

id	region	age	height	weight	blood_group	marital_status	education	ses	r	python	sas	stata	spss	excel
STM/7539	South	46	1.85	77	AB	Other	Bachelors	Middle	No	No	Yes	Yes	No	No
STM/7993	South	45	1.64	53	AB	Married	Bachelors	Middle	Yes	No	No	No	Yes	No
STM/7387	South	37	1.61	75	B	Single	Bachelors	Middle	No	Yes	No	Yes	No	No
STM/5598	West	45	1.80	69	B	Single	Masters	Low	Yes	No	Yes	No	No	Yes
STM/5901	South	51	1.81	53	A	Other	Bachelors	Medium	No	Yes	Yes	No	Yes	Yes
STM/7529	North East	56	1.35	56	O	Single	Bachelors	Middle	Yes	No	No	Yes	Yes	No
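
The replacement step is essentially a lookup from a wrong value to its correction. A minimal base R sketch for a single column, using the `region` entries from dict_misspelled (the input vector is illustrative, and this is not the package's implementation):

```r
region <- c("southern", "South", "wset", "North-East", "West")

# Named vector: names are the misspellings, values are the corrections.
fixes <- c(southern = "South", wset = "West", "North-East" = "North East")

# Replace only the values that appear in the dictionary; leave the rest as-is.
hit <- region %in% names(fixes)
region[hit] <- unname(fixes[region[hit]])

print(region)
```

Values with no dictionary entry pass through unchanged, which is consistent with "Medium" surviving in the `ses` column of the result above.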

dta_transpose()

Transpose a data frame, using a specified column as the variable names.

data("data_cancer")
dta_gtable(data_cancer)

cancer_type	V1	V2	V3	V4	V5	V6	V7	V8	V9	V10	V11	V12	V13	V14	V15	V16	V17
stomach	124	42	25	45	412	51	1112	46	103	876	146	340	396
bronchus	81	461	20	450	246	166	63	64	155	859	151	166	37	223	138	72	245
colon	248	372	189	1843	180	537	519	455	406	365	942	776	372	163	101	20	283
ovary	1234	89	201	356	2970	456


df <- dta_transpose(
  dat = data_cancer, .column_to_use_as_variables = cancer_type
)
dta_gtable(df)

stomach	bronchus	colon	ovary
124	81	248	1234
42	461	372	89
25	20	189	201
45	450	1843	356
412	246	180	2970
51	166	537	456
1112	63	519
46	64	455
103	155	406
876	859	365
146	151	942
340	166	776
396	37	372
	223	163
	138	101
	72	20
	245	283
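
A rough base R equivalent, shown on two rows and three value columns taken from data_cancer. It assumes nothing beyond `t()` and is only a sketch of the reshaping:

```r
wide <- data.frame(
  cancer_type = c("stomach", "ovary"),
  V1 = c(124, 1234),
  V2 = c(42, 89),
  V3 = c(25, 201),
  stringsAsFactors = FALSE
)

# Transpose everything except the naming column, then use that column
# for the new variable names.
long <- as.data.frame(t(wide[-1]))
names(long) <- wide$cancer_type
rownames(long) <- NULL

print(long)
```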

Compute and generate functions

dta_bmi()

Calculate body mass index (BMI).

data("data_bmi")
dta_gtable(data_bmi)

id	age	height	weight
STM/4921	50	1.64	59
STM/4396	34	1.98	57
STM/7908	50	1.95	84
STM/7243	39	1.52	63
STM/4801	52	1.69	65
STM/5134	50	1.71	73
STM/7138	35	1.73	46
STM/6802	72	1.98	70
STM/4420	42	1.62	103
STM/6351	40	1.89	96
STM/4933	38	1.91	67
STM/4303	37	1.56	75
STM/7465	45	1.62	44
STM/4587	67	1.38	51
STM/5320	44	1.37	63


df <- dta_bmi(
  dat = data_bmi,
  .weight = weight,
  .height = height,
  name = body_mass_index,
  digits = 2
)
dta_gtable(df)

id	age	height	weight	body_mass_index
STM/4921	50	1.64	59	21.94
STM/4396	34	1.98	57	14.54
STM/7908	50	1.95	84	22.09
STM/7243	39	1.52	63	27.27
STM/4801	52	1.69	65	22.76
STM/5134	50	1.71	73	24.96
STM/7138	35	1.73	46	15.37
STM/6802	72	1.98	70	17.86
STM/4420	42	1.62	103	39.25
STM/6351	40	1.89	96	26.87
STM/4933	38	1.91	67	18.37
STM/4303	37	1.56	75	30.82
STM/7465	45	1.62	44	16.77
STM/4587	67	1.38	51	26.78
STM/5320	44	1.37	63	33.57
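
The values in `body_mass_index` are consistent with the usual formula, weight in kilograms divided by the square of height in metres. A quick check in base R on the first three rows of data_bmi:

```r
# First three rows of data_bmi
height <- c(1.64, 1.98, 1.95) # metres
weight <- c(59, 57, 84)       # kilograms

# BMI = weight / height^2, rounded to two decimal places
bmi <- round(weight / height^2, digits = 2)
print(bmi)
```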

dta_bmicat()

Categorize BMI into weight categories.

data("data_bmicat")
dta_gtable(data_bmicat)

id	bmi
STM/4921	21.93635
STM/4396	14.53933
STM/7908	22.09073
STM/7243	27.26801
STM/4801	22.75831
STM/5134	24.96495
STM/7138	15.36971
STM/6802	17.85532
STM/4420	39.24707
STM/6351	26.87495
STM/4933	18.36572
STM/4303	30.81854
STM/7465	16.76574
STM/4587	26.78009
STM/5320	33.56599


# Categorize `bmi` into the standard BMI categories

df <- dta_bmicat(
  dat = data_bmicat,
  .bmi = bmi,
  name = bmi_cat,
  is_extended = FALSE,
  as_factor = TRUE
)
dta_gtable(df)

id	bmi	bmi_cat
STM/4921	21.93635	Healthy weight
STM/4396	14.53933	Underweight
STM/7908	22.09073	Healthy weight
STM/7243	27.26801	Overweight
STM/4801	22.75831	Healthy weight
STM/5134	24.96495	Healthy weight
STM/7138	15.36971	Underweight
STM/6802	17.85532	Underweight
STM/4420	39.24707	Obesity
STM/6351	26.87495	Overweight
STM/4933	18.36572	Underweight
STM/4303	30.81854	Obesity
STM/7465	16.76574	Underweight
STM/4587	26.78009	Overweight
STM/5320	33.56599	Obesity
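
The categories above match the commonly used adult cut-offs of 18.5, 25, and 30. The sketch below assumes those cut-offs and uses base R's `cut()`; the exact boundaries used inside dta_bmicat() are not stated in this document.

```r
bmi <- c(21.93635, 14.53933, 27.26801, 39.24707)

# Assumed cut-offs: < 18.5, 18.5 to < 25, 25 to < 30, >= 30.
# right = FALSE makes each interval closed on the left: [lower, upper).
bmi_cat <- cut(
  bmi,
  breaks = c(-Inf, 18.5, 25, 30, Inf),
  labels = c("Underweight", "Healthy weight", "Overweight", "Obesity"),
  right = FALSE
)

print(data.frame(bmi, bmi_cat))
```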

dta_mrq()

Split multiple response question column into binary columns.

data("data_gadgets")
dat <- data_gadgets
dta_gtable(dat)

gadgets_owned
Smartwatch, Tablet, Smartphone
Tablet, Smartwatch, Smart TV, Desktop Computer
Smartphone
Laptop, Tablet
Tablet, Smart TV, Digital Camera, Laptop
Laptop, Desktop Computer, Digital Camera, Smart TV, Smartphone
Digital Camera, Smartphone, Desktop Computer, Smartwatch
Smartwatch, Smart TV, Laptop, Smartphone
Desktop Computer
Smartphone, Laptop, Smart TV, Smartwatch
Tablet
Digital Camera, Tablet, Desktop Computer
Digital Camera, Desktop Computer, Smart TV, Smartwatch, Laptop
Tablet, Desktop Computer, Smart TV
Digital Camera, Desktop Computer, Smart TV, Smartphone, Laptop


# Split `gadgets_owned` column into separate columns.
# The created columns will be logical (i.e. TRUE / FALSE).

df <- dta_mrq(
  dat = dat,
  .column = gadgets_owned,
  delimeter = ", ",
  is_clean_names = TRUE
)
dta_gtable(df)

gadgets_owned	Smartwatch	Tablet	Smartphone	Smart TV	Desktop Computer	Laptop	Digital Camera
Smartwatch, Tablet, Smartphone	TRUE	TRUE	TRUE	FALSE	FALSE	FALSE	FALSE
Tablet, Smartwatch, Smart TV, Desktop Computer	TRUE	TRUE	FALSE	TRUE	TRUE	FALSE	FALSE
Smartphone	FALSE	FALSE	TRUE	FALSE	FALSE	FALSE	FALSE
Laptop, Tablet	FALSE	TRUE	FALSE	FALSE	FALSE	TRUE	FALSE
Tablet, Smart TV, Digital Camera, Laptop	FALSE	TRUE	FALSE	TRUE	FALSE	TRUE	TRUE
Laptop, Desktop Computer, Digital Camera, Smart TV, Smartphone	FALSE	FALSE	TRUE	TRUE	TRUE	TRUE	TRUE
Digital Camera, Smartphone, Desktop Computer, Smartwatch	TRUE	FALSE	TRUE	FALSE	TRUE	FALSE	TRUE
Smartwatch, Smart TV, Laptop, Smartphone	TRUE	FALSE	TRUE	TRUE	FALSE	TRUE	FALSE
Desktop Computer	FALSE	FALSE	FALSE	FALSE	TRUE	FALSE	FALSE
Smartphone, Laptop, Smart TV, Smartwatch	TRUE	FALSE	TRUE	TRUE	FALSE	TRUE	FALSE
Tablet	FALSE	TRUE	FALSE	FALSE	FALSE	FALSE	FALSE
Digital Camera, Tablet, Desktop Computer	FALSE	TRUE	FALSE	FALSE	TRUE	FALSE	TRUE
Digital Camera, Desktop Computer, Smart TV, Smartwatch, Laptop	TRUE	FALSE	FALSE	TRUE	TRUE	TRUE	TRUE
Tablet, Desktop Computer, Smart TV	FALSE	TRUE	FALSE	TRUE	TRUE	FALSE	FALSE
Digital Camera, Desktop Computer, Smart TV, Smartphone, Laptop	FALSE	FALSE	TRUE	TRUE	TRUE	TRUE	TRUE
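
The split can be sketched in base R with `strsplit()`: break each response into its items, collect the distinct items, and test membership row by row. This is an illustration on three rows, not the package's implementation.

```r
gadgets_owned <- c(
  "Smartwatch, Tablet, Smartphone",
  "Smartphone",
  "Laptop, Tablet"
)

# One character vector of items per respondent
items <- strsplit(gadgets_owned, ", ", fixed = TRUE)

# Distinct choices, in order of first appearance
choices <- unique(unlist(items))

# One logical column per choice: TRUE when the respondent listed it
flags <- sapply(choices, function(choice) {
  vapply(items, function(x) choice %in% x, logical(1))
})

result <- data.frame(gadgets_owned, flags, check.names = FALSE)
print(result)
```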

Exploratory data analysis functions

dta_freq()

Get frequency distribution of a specific column.

data("data_sample")
tab <- dta_freq(dat = data_sample, .column = region)
dta_gtable(tab)

Region of residence	Frequency	Percent
Central	356	14.24%
North East	764	30.56%
South	851	34.04%
West	529	21.16%
Total	2500	100.00%


# Sort by decreasing frequency and remove the percentage symbol

tab2 <- dta_freq(
  dat = data_sample,
  .column = region,
  is_sorted = TRUE,
  is_decreasing = TRUE,
  add_percent_symbol = FALSE
)
dta_gtable(tab2)

Region of residence	Frequency	Percent
South	851	34.04
North East	764	30.56
West	529	21.16
Central	356	14.24
Total	2500	100.00
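
The percentages are each count divided by the total number of rows. The base R sketch below rebuilds a `region` vector from the counts shown above (so it does not need data_sample) and reproduces both tables with `table()` and `prop.table()`:

```r
# Rebuild a region vector from the published counts
region <- rep(
  c("Central", "North East", "South", "West"),
  times = c(356, 764, 851, 529)
)

counts <- table(region)
percent <- round(100 * prop.table(counts), 2)

tab <- data.frame(
  region = names(counts),
  Frequency = as.vector(counts),
  Percent = as.vector(percent),
  stringsAsFactors = FALSE
)

# Sorted by decreasing frequency, as in the second table
tab_sorted <- tab[order(tab$Frequency, decreasing = TRUE), ]

print(tab)
print(tab_sorted)
```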

dta_freq_mrq()

Frequency table for multiple response questions.

data("data_sample")

# An example with multiple response variables labelled
# as Yes / No

result <- dta_freq_mrq(
  dat = data_sample,
  .columns = r:excel,
  value = "Yes",
  name = "Programming proficiency"
)
dta_gtable(result)

Programming proficiency	Frequency	Responses	Cases
R	1245	16.79%	49.80%
Python	1267	17.09%	50.68%
SAS	1266	17.08%	50.64%
Stata	1225	16.53%	49.00%
SPSS	1222	16.48%	48.88%
Microsoft Excel	1188	16.03%	47.52%
Total	7413	100.00%	296.52%


# Remove the percentage symbol

result2 <- dta_freq_mrq(
  dat = data_sample,
  .columns = r:excel,
  value = "Yes",
  name = "Programming proficiency",
  add_percent_symbol = FALSE
)
dta_gtable(result2)

Programming proficiency	Frequency	Responses	Cases
R	1245	16.79	49.80
Python	1267	17.09	50.68
SAS	1266	17.08	50.64
Stata	1225	16.53	49.00
SPSS	1222	16.48	48.88
Microsoft Excel	1188	16.03	47.52
Total	7413	100.00	296.52
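
The two percentage columns use different denominators. Judging from the figures above, Responses divides each frequency by the total number of "Yes" responses (7413), while Cases divides by the number of respondents (2500), which is why the Cases total exceeds 100%. The arithmetic can be checked in base R:

```r
freq <- c(
  R = 1245, Python = 1267, SAS = 1266,
  Stata = 1225, SPSS = 1222, "Microsoft Excel" = 1188
)
n_cases <- 2500 # rows in data_sample

# Percent of all "Yes" responses
responses <- round(100 * freq / sum(freq), 2)

# Percent of respondents (cases); can sum to more than 100
cases <- round(100 * freq / n_cases, 2)

print(cbind(Frequency = freq, Responses = responses, Cases = cases))
print(round(100 * sum(freq) / n_cases, 2)) # total for the Cases column
```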

dta_crosstab()

Generate cross tabulations with optional percentages and totals.

data("data_sample")
df <- data_sample

# Crosstabulation of frequencies (counts)

result <- dta_crosstab(
  dat = df, .row = region, .column = age_group
)
dta_gtable(result)

Variable	20-29	30-39	40-49	50-59	60-69	70+	Total
Central	33	68	100	80	45	30	356
North East	65	133	199	174	127	66	764
South	94	151	177	206	154	69	851
West	50	93	117	128	96	45	529
Total	242	445	593	588	422	210	2500


# Calculate column percentages

result2 <- dta_crosstab(
  dat = df, 
  .row = region,
  .column = age_group,
  cells = "col",
  add_totals = "col"
)
dta_gtable(result2)

region/age_group	20-29	30-39	40-49	50-59	60-69	70+	Total
Central	33 (13.64%)	68 (15.28%)	100 (16.86%)	80 (13.61%)	45 (10.66%)	30 (14.29%)	356 (14.24%)
North East	65 (26.86%)	133 (29.89%)	199 (33.56%)	174 (29.59%)	127 (30.09%)	66 (31.43%)	764 (30.56%)
South	94 (38.84%)	151 (33.93%)	177 (29.85%)	206 (35.03%)	154 (36.49%)	69 (32.86%)	851 (34.04%)
West	50 (20.66%)	93 (20.90%)	117 (19.73%)	128 (21.77%)	96 (22.75%)	45 (21.43%)	529 (21.16%)


# Calculate row percentages

result3 <- dta_crosstab(
  dat = df,
  .row = region,
  .column = age_group,
  cells = "row",
  add_totals = "row"
)
dta_gtable(result3)

region/age_group	20-29	30-39	40-49	50-59	60-69	70+
Central	33 (9.27%)	68 (19.10%)	100 (28.09%)	80 (22.47%)	45 (12.64%)	30 (8.43%)
North East	65 (8.51%)	133 (17.41%)	199 (26.05%)	174 (22.77%)	127 (16.62%)	66 (8.64%)
South	94 (11.05%)	151 (17.74%)	177 (20.80%)	206 (24.21%)	154 (18.10%)	69 (8.11%)
West	50 (9.45%)	93 (17.58%)	117 (22.12%)	128 (24.20%)	96 (18.15%)	45 (8.51%)
Total	242 (9.68%)	445 (17.80%)	593 (23.72%)	588 (23.52%)	422 (16.88%)	210 (8.40%)
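
Row percentages divide each cell by its row total; column percentages divide by the column total. A base R sketch using the counts from the first table and `prop.table()`, where `margin = 1` means rows and `margin = 2` means columns:

```r
counts <- matrix(
  c(33,  68, 100,  80,  45, 30,
    65, 133, 199, 174, 127, 66,
    94, 151, 177, 206, 154, 69,
    50,  93, 117, 128,  96, 45),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    region = c("Central", "North East", "South", "West"),
    age_group = c("20-29", "30-39", "40-49", "50-59", "60-69", "70+")
  )
)

row_pct <- round(100 * prop.table(counts, margin = 1), 2) # each row sums to 100
col_pct <- round(100 * prop.table(counts, margin = 2), 2) # each column sums to 100

print(row_pct)
print(col_pct)
```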

Other utilities

dta_to_numeric()

Convert numeric strings in a data frame or tibble to numeric values.

# A data frame with numeric strings (a), characters (b) and numbers (c)

df <- data.frame(
  a = c("1", "2", "3"),
  b = c("A", "B", "C"),
  c = c(4, 5, 6)
)
str(df)
#> 'data.frame':    3 obs. of  3 variables:
#>  $ a: chr  "1" "2" "3"
#>  $ b: chr  "A" "B" "C"
#>  $ c: num  4 5 6

df <- dta_to_numeric(df)
str(df)
#> tibble [3 × 3] (S3: tbl_df/tbl/data.frame)
#>  $ a: num [1:3] 1 2 3
#>  $ b: chr [1:3] "A" "B" "C"
#>  $ c: num [1:3] 4 5 6
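
One way to express the same idea in base R is to convert a character column only when every non-missing value parses as a number. The helper `is_numeric_string()` below is hypothetical and written for this sketch; it is not part of dta.

```r
df <- data.frame(
  a = c("1", "2", "3"),
  b = c("A", "B", "C"),
  c = c(4, 5, 6),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE for character columns whose non-missing
# values can all be parsed by as.numeric().
is_numeric_string <- function(x) {
  is.character(x) && !anyNA(suppressWarnings(as.numeric(x[!is.na(x)])))
}

to_convert <- vapply(df, is_numeric_string, logical(1))
df[to_convert] <- lapply(df[to_convert], as.numeric)

str(df)
```

Unlike the dta_to_numeric() output above, this sketch returns a plain data frame rather than a tibble.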

Conclusion

This document demonstrates the functionality and usage of the dta package, helping users manage, analyze, and transform their data efficiently. For more details, consult the package documentation.