
dta_replace() corrects misspelled entries in a data frame or tibble based on a provided dictionary. The dictionary specifies the correct values for misspelled entries in a specified column.

Usage

dta_replace(dat, dict, .name, .wrong, .correct)

Arguments

dat

A data frame or tibble containing the data to be corrected.

dict

A data frame or tibble serving as the dictionary, with columns specifying the correct and incorrect spellings.

.name

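The column in dict that names the column of dat each correction applies to (variable in the example below). Rows labelled .global, as in the example, supply corrections that are not tied to a single column.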

.wrong

The column in dict containing the misspelled values to be corrected.

.correct

The column in dict containing the correct values for the misspelled entries.

Value

A data frame or tibble with corrected entries.

Details

The function first validates that dat and dict are data frames or tibbles. It then fills missing values in the dict for the columns specified in .name and .correct, using a downward fill strategy. Finally, it replaces misspelled values in dat using a dictionary lookup facilitated by matchmaker::match_df().
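The three steps described above can be sketched in base R. This is an illustration only, not the package's source: fill_down() and replace_sketch() are hypothetical helpers, column names are passed as strings rather than unquoted, and a plain match() lookup stands in for matchmaker::match_df(). Trying column-specific rules before .global rules is likewise an assumption.

```r
# Hypothetical helper: carry the last non-missing value downward.
fill_down <- function(x) {
  filled <- !is.na(x)
  idx <- cummax(seq_along(x) * filled)  # position of the latest non-NA value
  idx[idx == 0] <- NA                   # leading NAs have nothing to inherit
  x[idx]
}

# Hypothetical stand-in for the workflow; not the package implementation.
replace_sketch <- function(dat, dict, name, wrong, correct) {
  # Step 1: validate the inputs.
  stopifnot(is.data.frame(dat), is.data.frame(dict))

  # Step 2: fill blank dictionary cells downward.
  dict[[name]] <- fill_down(dict[[name]])
  dict[[correct]] <- fill_down(dict[[correct]])

  # Step 3: exact-match lookup, one character column at a time.
  for (col in names(dat)) {
    if (!is.character(dat[[col]])) next
    # Assumption: column-specific rules are tried before ".global" rules.
    rules <- rbind(
      dict[dict[[name]] %in% col, , drop = FALSE],
      dict[dict[[name]] %in% ".global", , drop = FALSE]
    )
    hit <- match(dat[[col]], rules[[wrong]])
    found <- !is.na(hit)
    dat[[col]][found] <- rules[[correct]][hit[found]]
  }
  dat
}

# Toy data modelled on the example tables in this page.
dat <- data.frame(
  region = c("southern", "wset", "South"),
  r = c("y", "n", "tru"),
  stringsAsFactors = FALSE
)
dict <- data.frame(
  variable = c("region", NA, ".global", NA, NA, NA),
  old = c("southern", "wset", "n", "no", "y", "tru"),
  new = c("South", "West", "No", NA, "Yes", NA),
  stringsAsFactors = FALSE
)
result <- replace_sketch(dat, dict, "variable", "old", "new")
print(result)
```

Passing column names as strings keeps the sketch free of tidy-evaluation machinery; dta_replace() itself takes unquoted names, as the call in the Examples section shows.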

Examples

# Example data with misspelled characters / strings

data("data_misspelled")
dta_gtable(head(data_misspelled))
id        region      age  height  weight  blood_group  marital_status  education  ses     r    python  sas  stata  spss  excel
STM/7539  southern    46   1.85    77      ba           Other           BSC        medium  n    n       yes  tru    n     n
STM/7993  South       45   1.64    53      AB           maried          Bsc        Midle   y    N       n    no     tru   N
STM/7387  southern    37   1.61    75      B            single          BSC        Middle  N    tru     no   Yes    No    n
STM/5598  West        45   1.80    69      B            Single          MSc        low     y    N       y    fasle  no    Yes
STM/5901  South       51   1.81    53      A+           Others          Bachelors  Medium  no   yes     Yes  n      tru   Yes
STM/7529  North-East  56   1.35    56      O            Single          Bsc        Mediam  yes  fasle   no   yes    tru   no

# Dictionary of misspellings and their corrections

data("dict_misspelled")
dta_gtable(dict_misspelled)
variable        old         new
region          southern    South
                wset        West
                North-East  North East
blood_group     ba          AB
                ab
                o           O
                A+          A
                a
marital_status  maried      Married
                single      Single
                Singel
                Others      Other
education       BSC         Bachelors
                Bsc
                Bachelers
                MSc         Masters
                PhD         Doctorate
                Doctoral
ses             medium      Middle
                Midle
                Mediam
                low         Low
                Hihg        High
.global         fasle       No
                N
                n
                no
                yes         Yes
                y
                tru

# Correct the misspelled entries in `dat` using the
# `dict` dictionary

result <- dta_replace(
  dat = data_misspelled,
  dict = dict_misspelled,
  .name = variable,
  .wrong = old,
  .correct = new
)
dta_gtable(head(result))
id        region      age  height  weight  blood_group  marital_status  education  ses     r    python  sas  stata  spss  excel
STM/7539  South       46   1.85    77      AB           Other           Bachelors  Middle  No   No      Yes  Yes    No    No
STM/7993  South       45   1.64    53      AB           Married         Bachelors  Middle  Yes  No      No   No     Yes   No
STM/7387  South       37   1.61    75      B            Single          Bachelors  Middle  No   Yes     No   Yes    No    No
STM/5598  West        45   1.80    69      B            Single          Masters    Low     Yes  No      Yes  No     No    Yes
STM/5901  South       51   1.81    53      A            Other           Bachelors  Medium  No   Yes     Yes  No     Yes   Yes
STM/7529  North East  56   1.35    56      O            Single          Bachelors  Middle  Yes  No      No   Yes    Yes   No
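
In the output above, ses for STM/5901 is still Medium. The dictionary lists medium, Midle and Mediam for that column, and the capitalised spelling matches none of them, which suggests the lookup is exact and case-sensitive. A quick base R check along the following lines can flag such leftovers. The vectors are copied by hand from the example tables, and the variable names are illustrative only.

```r
# Values of `ses` after correction, copied from the example output.
ses_after <- c("Middle", "Middle", "Middle", "Low", "Medium", "Middle")

# Target spellings the dictionary defines for `ses`.
ses_targets <- c("Middle", "Low", "High")

# Anything left over was not covered by a dictionary row.
leftover <- setdiff(ses_after, ses_targets)
print(leftover)
```

Any value reported here needs its own row in the dictionary (for example, an old entry of Medium under ses) before it will be corrected.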