Correct misspelled data using a dictionary

dta_replace() corrects misspelled entries in a data frame or tibble using a dictionary. The dictionary lists, for each column to be cleaned, the misspelled values and the correct value that should replace each of them.

Usage

dta_replace(dat, dict, .name, .wrong, .correct)

Arguments

dat: A data frame or tibble containing the data to be corrected.
dict: A data frame or tibble serving as the dictionary, with columns specifying the correct and incorrect spellings.
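.name: The column in dict that names the column of dat each dictionary row applies to (in the example below, variable, which holds names such as region or education). The example also uses the matchmaker keyword .global for corrections that are not tied to a single column.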
.wrong: The column in dict containing the misspelled values to be corrected.
.correct: The column in dict containing the correct values for the misspelled entries.

Value

A data frame or tibble with corrected entries.

Details

The function first validates that dat and dict are data frames or tibbles. It then fills missing values in the .name and .correct columns of dict downward, so that a blank cell takes the nearest non-missing value above it; a column name or a correct spelling therefore needs to be written only once for a run of consecutive misspellings. Finally, it replaces the misspelled values in dat through a dictionary lookup performed by matchmaker::match_df().
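
The three steps above can be sketched in a few lines of R. This is a minimal illustration under stated assumptions, not the package's actual source: it assumes tidyr::fill() performs the downward fill (the text does not name the function used), and dat_toy and dict_toy are hypothetical toy inputs rather than the bundled data sets.

```r
library(tidyr)
library(matchmaker)

# Hypothetical toy data (not the package's data_misspelled)
dat_toy <- data.frame(
  id     = c("a", "b", "c"),
  region = c("southern", "wset", "South"),
  smoker = c("y", "n", "tru"),
  stringsAsFactors = FALSE
)

# Hypothetical toy dictionary with blank (NA) cells, laid out like dict_misspelled
dict_toy <- data.frame(
  variable = c("region", NA, ".global", NA, NA),
  old      = c("southern", "wset", "y", "tru", "n"),
  new      = c("South", "West", "Yes", NA, "No"),
  stringsAsFactors = FALSE
)

# Step 1: both inputs must be data frames (tibbles pass this check too)
stopifnot(is.data.frame(dat_toy), is.data.frame(dict_toy))

# Step 2: fill the name and correct-value columns downward
# (assumption: tidyr::fill() stands in for the package's own fill step)
dict_filled <- tidyr::fill(dict_toy, variable, new, .direction = "down")

# Step 3: look up and replace the misspelled values
result_toy <- matchmaker::match_df(
  dat_toy,
  dictionary = dict_filled,
  from = "old",      # misspelled values
  to   = "new",      # correct values
  by   = "variable"  # column of dat each row applies to
)

print(result_toy)
```

After the fill, the row for "tru" inherits "Yes" from the row above it, which is why the correct value has to appear only once per group in the dictionary.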

Examples

# Example data with misspelled characters / strings

data("data_misspelled")
dta_gtable(head(data_misspelled))


id	region	age	height	weight	blood_group	marital_status	education	ses	r	python	sas	stata	spss	excel
STM/7539	southern	46	1.85	77	ba	Other	BSC	medium	n	n	yes	tru	n	n
STM/7993	South	45	1.64	53	AB	maried	Bsc	Midle	y	N	n	no	tru	N
STM/7387	southern	37	1.61	75	B	single	BSC	Middle	N	tru	no	Yes	No	n
STM/5598	West	45	1.80	69	B	Single	MSc	low	y	N	y	fasle	no	Yes
STM/5901	South	51	1.81	53	A+	Others	Bachelors	Medium	no	yes	Yes	n	tru	Yes
STM/7529	North-East	56	1.35	56	O	Single	Bsc	Mediam	yes	fasle	no	yes	tru	no

data("dict_misspelled")
dta_gtable(dict_misspelled)


variable	old	new
region	southern	South
	wset	West
	North-East	North East
blood_group	ba	AB
	ab
	o	O
	A+	A
	a
marital_status	maried	Married
	single	Single
	Singel
	Others	Other
education	BSC	Bachelors
	Bsc
	Bachelers
	MSc	Masters
	PhD	Doctorate
	Doctoral
ses	medium	Middle
	Midle
	Mediam
	low	Low
	Hihg	High
.global	fasle	No
	N
	n
	no
	yes	Yes
	y
	tru

# Correct the misspelled entries in `dat` using the
# `dict` dictionary

result <- dta_replace(
  dat = data_misspelled, 
  dict = dict_misspelled, 
  .name = variable, 
  .wrong = old, 
  .correct = new
)
dta_gtable(head(result))


id	region	age	height	weight	blood_group	marital_status	education	ses	r	python	sas	stata	spss	excel
STM/7539	South	46	1.85	77	AB	Other	Bachelors	Middle	No	No	Yes	Yes	No	No
STM/7993	South	45	1.64	53	AB	Married	Bachelors	Middle	Yes	No	No	No	Yes	No
STM/7387	South	37	1.61	75	B	Single	Bachelors	Middle	No	Yes	No	Yes	No	No
STM/5598	West	45	1.80	69	B	Single	Masters	Low	Yes	No	Yes	No	No	Yes
STM/5901	South	51	1.81	53	A	Other	Bachelors	Medium	No	Yes	Yes	No	Yes	Yes
STM/7529	North East	56	1.35	56	O	Single	Bachelors	Middle	Yes	No	No	Yes	Yes	No
