Building an Internal R Package

Author

Eric Shearer / eshearer@ochca.com

Published

January 7, 2025

Why build a package?

“The concept is that anytime anyone on the team solves a problem that they think others might encounter, they can generalize their code, include it into Rbnb and now the whole team can access it.” - How R Helps Airbnb Make the Most of Its Data (PDF link)

“We build packages to develop collaborative solutions to common problems, to standardize the visual presentation of our work, and to avoid reinventing the wheel.” - Using R packages and education to scale Data Science at Airbnb (link)

Examples of “common problems” in our day-to-day work:

  • Recoding demographics (e.g. age groups, race/ethnicity)
  • Standardizing addresses prior to geocoding
  • Making unique IDs to link two datasets together
  • Suppressing or removing sensitive data before publishing
  • Making epidemic curves/maps/other data visualizations
  • Converting dates to higher levels for aggregating (e.g. MMWR week or year, week ending date)

Benefits

  • Eliminates copying and pasting code from script to script to script to script to…
  • Standardizes data visualizations and data clean-up
  • Reduces barriers/time to completing more meaningful analysis

Challenges

Building packages is rewarding but not without challenges:

  • Ideas (existing vs. new)
  • ALL. THE. TESTING. (and documenting it)
  • Deployment/distribution

Building the R package

Tools

usethis is a workflow package: it sets up GitHub, the pkgdown website, and templates.

testthat automates unit tests, which describe what you expect a function to do.

roxygen2 turns structured comments above your functions into formal documentation.

devtools builds the actual package, checks for unregistered dependencies, runs checks to ensure the package can be run on multiple systems (Windows, macOS, Ubuntu), and executes the unit tests from testthat.
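
To make the division of labor concrete, here is a minimal sketch of one function as it might sit in a package: roxygen2 comments above the function, followed by a testthat expectation describing what it should do. The function name suppress_small() and its arguments are hypothetical, chosen for illustration; they are not part of {OCepi}.

```r
#' Suppress small counts before publishing
#'
#' Hypothetical example function, not part of any real package.
#'
#' @param n Numeric vector of counts.
#' @param threshold Counts below this value are replaced.
#' @param replacement Character string shown in place of a suppressed count.
#' @return Character vector the same length as `n`.
#' @export
suppress_small <- function(n, threshold = 5, replacement = "**") {
  stopifnot(is.numeric(n), is.numeric(threshold), length(threshold) == 1)
  out <- as.character(n)
  # Mask only non-missing counts that fall below the threshold.
  out[!is.na(n) & n < threshold] <- replacement
  out
}

# In a package this would live in tests/testthat/test-suppress_small.R.
# The guard only exists so the snippet also runs where testthat is absent.
if (requireNamespace("testthat", quietly = TRUE)) {
  testthat::test_that("counts below the threshold are masked", {
    testthat::expect_equal(suppress_small(c(2, 5, 12)), c("**", "5", "12"))
    testthat::expect_equal(suppress_small(3, threshold = 3), "3")
  })
}
```

Inside a package project, devtools::document() would turn the roxygen2 comments into help files and devtools::test() would run the test file.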

Functions

How do we decide what goes into the package? Using the example below, there are some obvious candidates we can generalize into functions: age, gender, sexual orientation, and race/ethnicity.

cases <- udf %>%
  filter(RStatus %in% c("Confirmed","Probable","Suspect")) %>%
  left_join(pos_lab, by = "IncidentID") %>%
  mutate(
    across(c("DtEpisode","DtOnset","DtLabCollect"), ~ as.Date(., "%m/%d/%Y")),
    AgeGroup = case_when(
      Age %in% 0:15 ~ "0-15",
      Age %in% 16:24 ~ "16-24",
      Age %in% 25:34 ~ "25-34",
      Age %in% 35:44 ~ "35-44",
      Age %in% 45:54 ~ "45-54",
      Age %in% 55:64 ~ "55-64",
      Age > 64 ~ "65+"
    ),
    Gender = case_when(
      Gender == "M" ~ "Male",
      Gender == "F" ~ "Female",
      Gender == "TF" ~ "Transgender woman",
      Gender == "TM" ~ "Transgender man",
      Gender %in% c("U","D") ~ "Missing/Unknown",
      Gender == "I" ~ "Identity Not Listed",
      Gender == "G" ~ "Genderqueer/Non-binary",
      is.na(Gender) ~ "Missing/Unknown"
    ),
    CTCIAdtlDemOrient = case_when(
      CTCIAdtlDemOrient == "BIS" ~ "Bisexual",
      CTCIAdtlDemOrient %in% c("DNK","UNK","NOT","DEC") ~ "Unknown",
      CTCIAdtlDemOrient == "HET" ~ "Heterosexual or straight",
      CTCIAdtlDemOrient == "HOM" ~ "Gay, lesbian, or same gender-loving",
      is.na(CTCIAdtlDemOrient) ~ "Unknown",
      TRUE ~ CTCIAdtlDemOrient),
    RaceEthnicity = ifelse(Ethnicity == "Hispanic or Latino", Ethnicity, Race),
    RaceEthnicity = case_when(
      RaceEthnicity == "Hispanic or Latino" ~ "Hispanic/Latinx",
      RaceEthnicity == "Black or African American" ~ "Black/African American",
      RaceEthnicity == "Native Hawaiian or Other Pacific Islander" ~ "NHOPI",
      RaceEthnicity == "American Indian or Alaska Native" ~ "AI/AN",
      TRUE ~ RaceEthnicity),
    CliTxOrthopoxTx = case_when(
      CliTxOrthopoxTx == "YT" ~ "Tecovirimat",
      CliTxOrthopoxTx == "YO" ~ "Yes, Not Specified",
      CliTxOrthopoxTx == "N" ~ "None",
      CliTxOrthopoxTx == "DK" ~ "Unknown",
      is.na(CliTxOrthopoxTx) ~ "Unknown",
      TRUE ~ "Unknown"),
    homeless = case_when(
      EpiGrpSetLTExp_HML_1 == "HML" ~ "Y",
      EpiGrpSetLTExp_HML_2 == "HML" ~ "Y",
      EpiGrpSetLTExp_HML_3 == "HML" ~ "Y",
      City == "Homeless" ~ "Y",
      Zip == "99999" ~ "Y",
      TRUE ~ "N"),
    HOSPHOSPITALIZED = case_when(
      HOSPHOSPITALIZED == "Y" ~ "Yes",
      HOSPHOSPITALIZED == "N" ~ "No",
      is.na(HOSPHOSPITALIZED) ~ "Unknown",
      TRUE ~ "Unknown"),
    monkeypox_pcr = case_when(
      LabLRSSpecTstMeth_1_1 == 1 & LabLRSSpecRslt_1 == 2 ~ 1,
      LabLRSSpecTstMeth_1_2 == 1 & LabLRSSpecRslt_2 == 2 ~ 1,
      LabLRSSpecTstMeth_1_3 == 1 & LabLRSSpecRslt_3 == 2 ~ 1,
      TRUE ~ 0),
    orthopox_pcr = case_when(
      LabLRSSpecTstMeth_4_1 == 4 & LabLRSSpecRslt_1 == 2 ~ 1,
      LabLRSSpecTstMeth_5_1 == 5 & LabLRSSpecRslt_1 == 2 ~ 1,
      LabLRSSpecTstMeth_4_2 == 4 & LabLRSSpecRslt_2 == 2 ~ 1,
      LabLRSSpecTstMeth_5_2 == 5 & LabLRSSpecRslt_2 == 2 ~ 1,
      TRUE ~ 0),
    case_type = case_when(
      test_type == "Monkeypox PCR" ~ "Confirmed", #elr
      monkeypox_pcr == 1 ~ "Confirmed", #manually entered
      test_type %in% c("Non-variola Orthopox PCR","Orthopoxvirus PCR") ~ "Probable", #elr
      orthopox_pcr == 1 ~ "Probable", #manually entered
      TRUE ~ "Suspect")
  ) %>%
  select(Disease, IncidentID, Gender = Sex, Public_Orientation = CTCIAdtlDemOrient, Age, AgeGroup, RaceEthnicity, City, Zip, RStatus, case_type, Investigator,
         FinalDispo, DtEpisode, DtOnset, DtLabCollect, Gender_Partner, treatment = CliTxOrthopoxTx, tx_type = CliTxOrthopoxTxSpcfy, homeless, hospitalized = HOSPHOSPITALIZED,
         Outcome = OUTCOMEOUTCOME, DtDeath, NOTES)
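
As one illustration of generalizing, the Gender block above could be pulled into a small lookup-based function. This is a sketch, not the {OCepi} implementation: the name recode_gender_sketch() is hypothetical, and treating unrecognized codes as "Missing/Unknown" is an assumption (the case_when() above would leave them as NA).

```r
# Hypothetical helper: recode gender abbreviations with a named lookup vector.
recode_gender_sketch <- function(x) {
  lookup <- c(
    M  = "Male",
    F  = "Female",
    TF = "Transgender woman",
    TM = "Transgender man",
    U  = "Missing/Unknown",
    D  = "Missing/Unknown",
    I  = "Identity Not Listed",
    G  = "Genderqueer/Non-binary"
  )
  # as.character() guards against factors, which would index by integer code.
  key <- toupper(trimws(as.character(x)))
  out <- unname(lookup[key])
  # Assumption: NA and unrecognized codes are both reported as Missing/Unknown.
  out[is.na(out)] <- "Missing/Unknown"
  out
}

recode_gender_sketch(c("M", "TF", "U", NA))
```

Once the mapping lives in one function, every script that needs it calls the same code instead of carrying its own copy of the case_when().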

Package overview

Specs

  • Built for familiar data systems:
    • CalREDIE
    • CAIR2
    • Vital Records
  • 40+ functions
  • Includes an R Markdown template for documenting R projects/analysis workflows
  • Very few dependencies, mostly written in base R
  • Extensive unit testing to ensure functions do what we want them to do

Full documentation: https://ericmshearer.github.io/OCepi/

To install: devtools::install_github("ericmshearer/OCepi")

Example Use Case

The following use case aims to demonstrate the different ways {OCepi} has streamlined how we go from raw data to data visualization.

To start, we’ll load simulated outbreak data included in the package.

library(gt)
library(sf)
library(dplyr)
library(OCepi)
library(ggplot2)
library(patchwork)

dis_x <- linelist

Ethnicity               Race                              Gender  Age  SexualOrientation  SpecimenDate
Non-Hispanic or Latino  Multiple Races                    M       46   HET                2022-06-07
Unknown                 Unknown                           M       4    HET                2022-06-09
Non-Hispanic or Latino  White                             F       52   UNK                2022-06-07
Non-Hispanic or Latino  White                             F       77   UNK                2022-06-11
Unknown                 American Indian or Alaska Native  M       71   HET                2022-06-10
Non-Hispanic or Latino  Other                             M       70   HET                2022-06-09
Non-Hispanic or Latino  Black or African American         F       11   HET                2022-06-08
Non-Hispanic or Latino  Black or African American         F       8    HET                2022-06-12
Hispanic or Latino      American Indian or Alaska Native  F       41   HET                2022-06-12
Non-Hispanic or Latino  Black or African American         M       56   HET                2022-06-10

Clean up the data

First things first: recode ethnicity and race to one variable, recode age to age groups, recode gender and sexual orientation abbreviations to full names, and derive the week ending date from the specimen date.

Note

All of our recoding functions aim to be compatible with CalREDIE, VRBIS, CAIR2, and BioSense/ESSENCE.

dis_x <- dis_x |>
  mutate(
    race_ethnicity = recode_race(Ethnicity, Race, abbr_names = FALSE),
    age_groups = age_groups(Age, type = "covid"),
    Gender = recode_gender(Gender),
    SexualOrientation = recode_orientation(SexualOrientation),
    week_ending = week_ending_date(SpecimenDate)
  )

Ethnicity               Race                              Gender  Age  SexualOrientation         SpecimenDate  race_ethnicity                 age_groups  week_ending
Non-Hispanic or Latino  Multiple Races                    Male    46   Heterosexual or straight  2022-06-07    Multiple Races                 45-54       2022-06-11
Unknown                 Unknown                           Male    4    Heterosexual or straight  2022-06-09    Missing/Unknown                0-17        2022-06-11
Non-Hispanic or Latino  White                             Female  52   Missing/Unknown           2022-06-07    White                          45-54       2022-06-11
Non-Hispanic or Latino  White                             Female  77   Missing/Unknown           2022-06-11    White                          75-84       2022-06-11
Unknown                 American Indian or Alaska Native  Male    71   Heterosexual or straight  2022-06-10    American Indian/Alaska Native  65-74       2022-06-11
Non-Hispanic or Latino  Other                             Male    70   Heterosexual or straight  2022-06-09    Other                          65-74       2022-06-11
Non-Hispanic or Latino  Black or African American         Female  11   Heterosexual or straight  2022-06-08    Black/African American         0-17        2022-06-11
Non-Hispanic or Latino  Black or African American         Female  8    Heterosexual or straight  2022-06-12    Black/African American         0-17        2022-06-18
Hispanic or Latino      American Indian or Alaska Native  Female  41   Heterosexual or straight  2022-06-12    Hispanic/Latinx                35-44       2022-06-18
Non-Hispanic or Latino  Black or African American         Male    56   Heterosexual or straight  2022-06-10    Black/African American         55-64       2022-06-11

For OCepi::age_groups(x, type = "decade"), we combined our most common ways to bin age into one function via presets. Examples: Pertussis, West Nile Virus, decade, chronic Hepatitis C, flu vax. OCepi::recode_race(ethnicity, race) also takes an abbr_names argument (set to FALSE above) in case you want to shorten long names, e.g. Native Hawaiian/Other Pacific Islander -> NHOPI.
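
To show the idea of presets, here is a rough base R sketch built on switch() and cut(), together with a week-ending helper. The function names, the "adult_child" preset, and its cut points are hypothetical and do not reflect the bins {OCepi} actually ships. The Saturday week end is inferred from the example output above (2022-06-12 maps to 2022-06-18) and matches MMWR weeks, which run Sunday through Saturday.

```r
# Hypothetical preset-based age binning; presets and cut points are illustrative.
age_groups_sketch <- function(age, type = c("decade", "adult_child")) {
  type <- match.arg(type)
  breaks <- switch(type,
    decade      = c(seq(0, 80, by = 10), Inf),
    adult_child = c(0, 18, Inf)
  )
  labels <- switch(type,
    decade      = c(paste0(seq(0, 70, by = 10), "-", seq(9, 79, by = 10)), "80+"),
    adult_child = c("0-17", "18+")
  )
  # right = FALSE makes each bin include its lower bound, e.g. [10, 20).
  as.character(cut(age, breaks = breaks, labels = labels, right = FALSE))
}

# Hypothetical week-ending helper: move each date forward to its Saturday.
week_ending_sketch <- function(date) {
  date <- as.Date(date)
  # POSIXlt wday: Sunday = 0, ..., Saturday = 6.
  date + (6 - as.POSIXlt(date)$wday)
}

age_groups_sketch(c(4, 46, 80), type = "decade")
week_ending_sketch(c("2022-06-07", "2022-06-12"))
```

Adding a new preset then means adding one entry to each switch() rather than writing another case_when() in every script.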

Summarize Data

Next is to run some basic frequencies using OCepi::add_percent(x, digits = 1, multiply = TRUE) and create labels using OCepi::n_percent(n, percent, reverse = TRUE, n_suppress = x). Based on our experience, we have tried to incorporate as many ways to customize the output as possible: how many digits to round to, suppressing low values, whether to multiply a fraction by 100, and the order of labels (n then % or % then n).

dis_x |>
  count(race_ethnicity) |>
  mutate(
    percent = add_percent(n, digits = 1),
    label = n_percent(n, percent, reverse = TRUE, n_suppress = 10)
  ) |>
  gt() |>
  apollo_table(size = 14)

race_ethnicity                          n   percent  label
American Indian/Alaska Native           10  9.5      9.5% (10)
Asian                                   7   6.7      **
Black/African American                  16  15.2     15.2% (16)
Hispanic/Latinx                         19  18.1     18.1% (19)
Missing/Unknown                         8   7.6      **
Multiple Races                          6   5.7      **
Native Hawaiian/Other Pacific Islander  10  9.5      9.5% (10)
Other                                   15  14.3     14.3% (15)
White                                   14  13.3     13.3% (14)

Calling OCepi::add_percent() is much simpler than round(n / sum(n) * 100, digits = 1), and OCepi::n_percent() is simpler than sprintf("%s (%s%%)", n, percent). Not only are the {OCepi} functions simpler, you also gain tons of flexibility in how you want your summarized data to look.
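
For readers curious what such helpers wrap, here is a rough base R sketch of the same idea. The function names are hypothetical and this is not the {OCepi} source; the suppression rule (mask counts below n_suppress with "**") and the label order under reverse = TRUE are inferred from the table above.

```r
# Hypothetical stand-in for a percent helper.
add_percent_sketch <- function(n, digits = 1, multiply = TRUE) {
  share <- n / sum(n, na.rm = TRUE)
  if (multiply) share <- share * 100
  round(share, digits)
}

# Hypothetical stand-in for a label helper.
n_percent_sketch <- function(n, percent, reverse = FALSE,
                             n_suppress = NULL, digits = 1) {
  # formatC keeps trailing zeros, so 10 is shown as "10.0" when digits = 1.
  pct <- formatC(percent, format = "f", digits = digits)
  label <- if (reverse) {
    sprintf("%s%% (%s)", pct, n)   # percent first, then n
  } else {
    sprintf("%s (%s%%)", n, pct)   # n first, then percent
  }
  # Assumption: counts strictly below n_suppress are masked.
  if (!is.null(n_suppress)) label[n < n_suppress] <- "**"
  label
}

n <- c(10, 7, 16, 19, 8, 6, 10, 15, 14)
pct <- add_percent_sketch(n)
n_percent_sketch(n, pct, reverse = TRUE, n_suppress = 10)
```

The counts used here are the ones from the table above, so the labels can be compared against it directly.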

We also have the ability to calculate incidence rates using OCepi::rate_per_100k(n, pop_denom, digits = 1) or time between dates using OCepi::time_between(recent_date, older_date, unit = c("days")).
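
Both calculations are short in base R, which is part of why they are easy to get subtly inconsistent across scripts. The sketch below uses hypothetical function names and made-up inputs; it only illustrates the arithmetic, not the {OCepi} implementation.

```r
# Hypothetical rate helper: cases per 100,000 population.
rate_per_100k_sketch <- function(n, pop_denom, digits = 1) {
  round(n / pop_denom * 100000, digits)
}

# Hypothetical date-difference helper built on difftime().
# unit must be one accepted by difftime(), e.g. "days" or "weeks".
time_between_sketch <- function(recent_date, older_date, unit = "days") {
  as.numeric(difftime(as.Date(recent_date), as.Date(older_date), units = unit))
}

rate_per_100k_sketch(50, 3200000)                 # made-up count and population
time_between_sketch("2022-06-12", "2022-06-07")   # days between two dates
```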

Data Visualizations

Now that we have summarized data, we can build our data visualizations. To achieve unified aesthetics across our four surveillance branches, we developed theme_apollo(direction = x) and apollo_label(). Our theme is designed to work with vertical and horizontal orientations as well as maps.

dis_x |>
  count(race_ethnicity) |>
  mutate(
    percent = add_percent(n, digits = 1),
    label = n_percent(n, percent, reverse = TRUE)
  ) |>
  ggplot(aes(x = race_ethnicity, y = percent)) +
  geom_bar(stat = "identity") +
  scale_x_discrete() +
  scale_y_continuous(expand = c(0,0), limits = c(0,25), labels = scales::label_percent(scale = 1)) +
  theme_apollo() +
  apollo_label(aes(label = label), vjust = -0.4) +
  labs(
    title = "Disease X Cases by Race/Ethnicity, 2024",
    subtitle = "PHS/Communicable Disease Control",
    x = "Race/Ethnicity",
    y = "Proportion (%)"
  )

Overall this looks much nicer than {ggplot2} right out of the box, but we can improve it further by wrapping the long labels using OCepi::wrap_labels() and filling the bars with OCepi::cdcd_color(). Our version of label wrapping will break/wrap the text at whatever delimiter you want (e.g. tab, comma, slash, and/or space). This may not always produce the desired outcome, so we recommend trying scales::label_wrap(15) or coord_flip() as another approach.

dis_x |>
  count(race_ethnicity) |>
  mutate(
    percent = add_percent(n, digits = 1),
    label = n_percent(n, percent, reverse = TRUE)
  ) |>
  ggplot(aes(x = forcats::fct_rev(race_ethnicity), y = percent)) +
  geom_bar(stat = "identity", fill = cdcd_color("dodgers")) +
  scale_x_discrete(labels = wrap_labels(delim = "/")) +
  scale_y_continuous(expand = expansion(add = c(0,2)), limits = c(0,25), labels = scales::label_percent(scale = 1)) +
  theme_apollo(direction = "horizontal") +
  apollo_label(aes(label = label), hjust = -0.3) +
  labs(
    title = "Disease X Cases by Race/Ethnicity, 2024",
    subtitle = "PHS/Communicable Disease Control",
    x = "Race/Ethnicity",
    y = "Proportion (%)"
  ) +
  coord_flip()
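
Delimiter-based wrapping can be sketched in base R as a function factory: it returns a labelling function, which is the form the labels argument of a ggplot2 scale accepts. The name wrap_at_delim() is hypothetical, and keeping the delimiter at the end of the first line is an assumption about the desired look, not a description of OCepi::wrap_labels().

```r
# Hypothetical label wrapper: insert a line break after every delimiter.
wrap_at_delim <- function(delim = "/") {
  function(labels) {
    # fixed = TRUE treats the delimiter literally rather than as a regex.
    gsub(delim, paste0(delim, "\n"), labels, fixed = TRUE)
  }
}

wrap_slash <- wrap_at_delim("/")
wrap_slash(c("Hispanic/Latinx", "White"))
```

A function like this could be passed as scale_x_discrete(labels = wrap_at_delim("/")).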

Extending ggplot2

There are other ways {OCepi} can help elevate data visualizations, specifically around highlighting groups of interest. It may not be obvious in the plot above which group makes up the greatest proportion of cases. {OCepi} offers two solutions: OCepi::highlight_geom() and OCepi::desaturate_geom(). One highlights your important group and fades the rest to light grey; the other highlights the important group and desaturates the rest. At their core, these functions use dplyr::filter() to achieve the desired effect. Examples: percent == max(percent), n > 50, or gender %in% c("Female").

In the following example, the group(s) making up the highest proportion of cases for that variable will be highlighted.

re_tbl <- dis_x |>
  count(race_ethnicity) |>
  mutate(
    percent = add_percent(n, digits = 1),
    label = n_percent(n, percent, reverse = TRUE)
  )

l <- ggplot(data = re_tbl, aes(x = forcats::fct_rev(race_ethnicity), y = percent)) +
  geom_bar(stat = "identity") +
  highlight_geom(percent == max(percent), pal = cdcd_color("london pink")) +
  scale_x_discrete(labels = wrap_labels(delim = "/")) +
  scale_y_continuous(expand = expansion(add = c(0,2)), limits = c(0,25), labels = scales::label_percent(scale = 1)) +
  theme_apollo(direction = "horizontal") +
  apollo_label(data = re_tbl, aes(label = label), hjust = -0.2) +
  labs(
    title = "Disease X Cases by Race/Ethnicity, 2024",
    subtitle = "PHS/Communicable Disease Control",
    x = "Race/Ethnicity",
    y = "Proportion (%)"
  ) +
  coord_flip()

r <- ggplot(data = re_tbl, aes(x = forcats::fct_rev(race_ethnicity), y = percent)) +
  geom_bar(stat = "identity") +
  desaturate_geom(percent == max(percent), pal = cdcd_color("dodgers"), desaturate = 0.8) +
  scale_x_discrete(labels = wrap_labels(delim = "/")) +
  scale_y_continuous(expand = expansion(add = c(0,2)), limits = c(0,25), labels = scales::label_percent(scale = 1)) +
  theme_apollo(direction = "horizontal") +
  apollo_label(data = re_tbl, aes(label = label), hjust = -0.2) +
  labs(
    title = "Disease X Cases by Race/Ethnicity, 2024",
    subtitle = "PHS/Communicable Disease Control",
    x = "Race/Ethnicity",
    y = "Proportion (%)"
  ) +
  coord_flip()

l + r
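
The filter-based idea behind highlighting can be approximated in plain {ggplot2}: draw every bar in a muted color, then draw the filtered subset on top in an accent color. The sketch below uses a made-up summary table and arbitrary colors; it is not how highlight_geom() is implemented, only the general technique.

```r
library(dplyr)
library(ggplot2)

# Made-up summary table standing in for re_tbl.
tbl <- data.frame(
  group   = c("A", "B", "C", "D"),
  percent = c(20, 45, 25, 10)
)

# Rows meeting the highlight condition, selected with dplyr::filter().
top <- filter(tbl, percent == max(percent))

p <- ggplot(tbl, aes(x = group, y = percent)) +
  geom_col(fill = "grey85") +                # every bar, faded
  geom_col(data = top, fill = "#1f78b4") +   # highlighted subset drawn on top
  coord_flip()
```

Because the condition is an ordinary filter expression, ties are handled naturally: if two groups share the maximum, both rows land in top and both bars are highlighted.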

We also designed the functions to work with facet_grid() and facet_wrap().

hpi <- data.frame(
  stringsAsFactors = FALSE,
              Year = c(2021L,2021L,2021L,2021L,
                       2021L,2022L,2022L,2022L,2022L,2022L,2023L,2023L,
                       2023L,2023L,2023L,2024L,2024L,2024L,2024L,2024L),
      hpi_quartile = c("1","2","3","4","Unknown",
                       "1","2","3","4","Unknown","1","2","3","4",
                       "Unknown","1","2","3","4","Unknown"),
                 n = c(2L,12L,4L,7L,1L,5L,11L,
                       6L,6L,2L,1L,11L,5L,1L,1L,1L,3L,5L,5L,2L),
           Percent = c(8L,46L,15L,27L,4L,17L,37L,
                       20L,20L,7L,5L,58L,26L,5L,5L,6L,19L,31L,31L,
                       12L),
             Label = c("8% (2)","46% (12)","15% (4)",
                       "27% (7)","4% (1)","17% (5)","37% (11)","20% (6)",
                       "20% (6)","7% (2)","5% (1)","58% (11)","26% (5)",
                       "5% (1)","5% (1)","6% (1)","19% (3)","31% (5)",
                       "31% (5)","12% (2)")
)

ggplot(data = hpi, aes(x = hpi_quartile, y = Percent)) +
  geom_bar(stat = "identity") +
  facet_wrap(~Year, nrow = 2, scales = "free_x") +
  desaturate_geom(Percent == max(Percent), pal = cdcd_color("plum"), desaturate = 0.8) +
  scale_y_continuous(expand = c(0,0), labels = scales::label_percent(scale = 1), limits = c(0,70)) +
  theme_apollo() +
  apollo_label(data = hpi, aes(label = Label), vjust = -0.4) +
  labs(
    title = "Distribution of Disease Y by Healthy Places Index, 2024",
    subtitle = "PHS/Communicable Disease Control",
    x = "HPI Quartile",
    y = "Proportion (%)",
    caption = "HPI Quartiles range 1 to 4. Higher quartiles represent healthier community conditions."
  )

Other examples of highlight_geom()/desaturate_geom():

zip_map <- oc_zip_sf
zip_map$n_cases <- sample(1:99, 86)
    
#desaturate - map
ggplot(data = zip_map) +
  geom_sf() +
  desaturate_geom(n_cases > 80, pal = cdcd_color("dodgers"), desaturate = 0.8, linewidth = 0.5) +
  geom_sf_text(data = zip_map, aes(label = Zip)) +
  theme_apollo(direction = "map") +
  labs(
    title = "Disease X Cases by Zip Code",
    subtitle = "PHS/Communicable Disease Control",
    caption = "*Note: zip codes with >80 cases are highlighted."
  )

covid <- read.csv("https://data.chhs.ca.gov/dataset/f333528b-4d38-4814-bebb-12db1f10f535/resource/046cdd2b-31e5-4d34-9ed3-b48cdbc4be7a/download/covid19cases_test.csv", na.strings = "")

covid <- covid |>
  mutate(
    date = as.Date(date),
    rate = rate_per_100k(cases, population, digits = 1)
  ) |>
  arrange(date) |>
  group_by(area) |>
  mutate(rate_ma_7 = zoo::rollmean(rate, k = 7, fill = 0, align = "right")) |>
  ungroup() |>
  filter(area %in% c("Orange","Los Angeles","San Diego"))

covid_first_wave <- filter(covid, date >= "2020-10-01", date <= "2021-01-05")

ggplot(data = covid_first_wave, aes(x = date, y = rate_ma_7, color = area)) +
  geom_line(linewidth = 1.2) +
  apollo_label(data = end_points(covid_first_wave, date = date), aes(label = area, color = area), hjust = -0.1, color = NULL) +
  highlight_geom(area == "Orange", pal = cdcd_color("orange")) +
  scale_x_date(date_labels = "%m/%y", expand = expansion(add = c(0,14))) +
  scale_y_continuous(expand = c(0,0), breaks = c(0,25,50,75,100,125,150)) +
  theme_apollo(legend = "Hide") +
  labs(
    title = "COVID-19 Incidence Rates by SoCal County",
    subtitle = "PHS/Communicable Disease Control",
    x = "Date",
    y = "Rate per 100,000",
    color = "LHJ"
  )

Honorable Mentions

end_points()

Direct labeling is often very helpful, particularly for line graphs with more than one group. To achieve this, we designed OCepi::end_points(df, date = x, group_by = y). If your groups all end at the same time point, you can ignore group_by. When groups end at different time points, use group_by.

ggplot(data = covid_first_wave, aes(x = date, y = rate_ma_7, color = area, linetype = area)) +
  geom_line(linewidth = 1.2) +
  scale_x_date(date_labels = "%m/%y", expand = expansion(add = c(0,14))) +
  scale_y_continuous(expand = c(0,0), breaks = c(0,25,50,75,100,125,150)) +
  geom_text(data = end_points(covid_first_wave, date = date), aes(label = area, color = area), hjust = -0.1, size = 4.5) +
  theme_apollo(legend = "Hide") +
  labs(
    title = "COVID-19 Incidence Rates by SoCal County",
    subtitle = "PHS/Communicable Disease Control",
    x = "Date",
    y = "Rate per 100,000",
    color = "LHJ"
  ) +
  scale_color_manual(values = cdcd_color("mustard","light blue","title color"))
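
The difference between ignoring and using group_by can be seen in a small base R sketch. The function name and the made-up data are hypothetical, and unlike OCepi::end_points(), this version takes column names as strings to keep the example free of tidy evaluation.

```r
# Hypothetical helper: keep the last observation overall or per group.
end_points_sketch <- function(df, date_col, group_col = NULL) {
  dates <- as.numeric(df[[date_col]])
  if (is.null(group_col)) {
    # One shared end point: rows on the latest date in the whole dataset.
    return(df[dates == max(dates), , drop = FALSE])
  }
  # Per-group end points: compare each row to its own group's latest date.
  last_by_group <- ave(dates, df[[group_col]], FUN = max)
  df[dates == last_by_group, , drop = FALSE]
}

# Made-up data: area B stops reporting one day earlier than area A.
demo <- data.frame(
  area = c("A", "A", "A", "B", "B"),
  date = as.Date("2021-01-01") + c(0, 1, 2, 0, 1),
  rate = c(1, 2, 3, 4, 5)
)

end_points_sketch(demo, "date")           # only area A's last row
end_points_sketch(demo, "date", "area")   # last row of each area
```

Without grouping, area B gets no label at all because its series ends before the overall maximum date, which is the situation group_by is meant to handle.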

pos()/neg()

We have begun to condense all the variants of “positive” and “negative” from ELR data into OCepi::pos()/OCepi::neg(). Both functions contain SNOMED and string patterns. Please note: casing is set to upper case - adjust your dataset accordingly.

The collapse argument controls whether to keep the output as a vector or collapse it into one long string separated by “|”:

pos()
 [1] "POSITIVE"         "REACTIVE"         "DETECTED"         "10828004"        
 [5] "260373001"        "840533007"        "PCRP11"           "PDETD"           
 [9] "COVPRE"           "11214006"         "Positive for IgG" "POS"             
[13] "DECTECTED"        "REA"              "PPOSI"            "REAC"            
pos(collapse = TRUE)
[1] "POSITIVE|REACTIVE|DETECTED|10828004|260373001|840533007|PCRP11|PDETD|COVPRE|11214006|Positive for IgG|POS|DECTECTED|REA|PPOSI|REAC"

Now used within a dplyr::case_when() statement:

elr <- data.frame(Results = c("POSITIVE","DETECTED","POS"))

elr <- elr |>
  mutate(
    #use case 1
    test_results = case_when(
      Results %in% pos(collapse = FALSE) ~ "Positive"
    ),
    #use case 2
    test_results2 = case_when(
      grepl(pos(collapse = TRUE), Results, ignore.case = TRUE) ~ "Positive"
    )
  )

print(elr)
   Results test_results test_results2
1 POSITIVE     Positive      Positive
2 DETECTED     Positive      Positive
3      POS     Positive      Positive
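
The two use cases are not interchangeable: %in% requires an exact match, while a collapsed pattern passed to grepl() matches substrings. The sketch below uses a short made-up pattern list (not the vector returned by pos()) to show how a substring match can flag a negative result, and one way to anchor the pattern if exact matching is wanted.

```r
# Made-up pattern list for illustration only.
patterns  <- c("POSITIVE", "DETECTED", "POS")
collapsed <- paste(patterns, collapse = "|")

results <- c("POSITIVE", "NOT DETECTED", "POS", "negative")

# Exact matching: the whole value must equal one of the patterns.
exact <- results %in% patterns

# Substring matching: "NOT DETECTED" contains "DETECTED", so it matches.
partial <- grepl(collapsed, results, ignore.case = TRUE)

# Anchored regex: behaves like exact matching but stays case-insensitive.
anchored <- grepl(paste0("^(", collapsed, ")$"), results, ignore.case = TRUE)

data.frame(results, exact, partial, anchored)
```

When using the collapsed form on free-text results, it is worth checking a frequency table of the values that matched before trusting the recode.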

As we find more use cases, we will add new ways “positive” and “negative” show up in the data. Another idea is to add an argument to add results on the fly:

#this is just an idea, not live yet#
pos(collapse = FALSE, add_in = c("POSITIVE!!!","MOREPOSITIVE","POSITIVEPOSITIVE"))

match_id_*

When two datasets have no matching primary key, OCepi::match_id_*() creates a primary key to join on. Probabilistic matching is excellent but computationally intensive (especially at scale with a so-so tech stack). When the IDs are made, all of the string is capitalized. Depending on how messy your address and phone number variables are, you may want to clean them up using OCepi::clean_address(x, keep_extra = TRUE) or OCepi::clean_phone(). At its core, the ID consists of the first four letters of the first and last name plus the date of birth.

#variant 1 - uses first 10 characters of address
match_id_1("Mickey","Mouse","1955-07-17","1313 Disneyland Dr")
[1] "MICKMOUS1955-07-171313 Disne"
#variant 1 - standardizing address
match_id_1("Mickey","Mouse","1955-07-17",clean_address("1313 Disneyland Dr"))
[1] "MICKMOUS1955-07-171313 Disne"
#variant 2 - uses full address
match_id_2("Mickey","Mouse","1955-07-17","1313 Disneyland Dr")
[1] "MICKMOUS1955-07-171313 Disneyland Dr"
#variant 2 - standardizing address
match_id_2("Mickey","Mouse","1955-07-17",clean_address("1313 Disneyland Dr"))
[1] "MICKMOUS1955-07-171313 Disneyland Drive"
#variant 3
match_id_3("Mickey","Mouse","1955-07-17","714-781-4636")
[1] "MICKMOUS1955-07-17714-781-4636"
#variant 3 - standardizing phone number
match_id_3("Mickey","Mouse","1955-07-17",clean_phone("714-781-4636"))
[1] "MICKMOUS1955-07-177147814636"
#variant 4
match_id_4("Mickey","Mouse","1955-07-17")
[1] "MICKMOUS1955-07-17"
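
The structure of these keys can be reproduced with a few lines of base R. This is a sketch inferred from the printed examples above, with hypothetical function names; it is not the {OCepi} source and mirrors only what the output shows (names upper-cased and truncated to four letters, date of birth appended, then an optional address fragment).

```r
# Hypothetical key builder modelled on the variant 1 and variant 4 output.
match_id_sketch <- function(first, last, dob, address = NULL, n_address = 10) {
  id <- paste0(
    toupper(substr(first, 1, 4)),
    toupper(substr(last, 1, 4)),
    dob
  )
  # Optionally append the first n_address characters of the address.
  if (!is.null(address)) id <- paste0(id, substr(address, 1, n_address))
  id
}

# Hypothetical phone cleaner: keep digits only.
clean_phone_sketch <- function(x) gsub("[^0-9]", "", x)

match_id_sketch("Mickey", "Mouse", "1955-07-17", "1313 Disneyland Dr")
match_id_sketch("Mickey", "Mouse", "1955-07-17")
clean_phone_sketch("714-781-4636")
```

A deterministic key like this is cheap to compute, but any typo in the first four letters of a name or in the date of birth produces a different key, which is why the cleaning functions matter.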

After merging on the primary key, we recommend assessing the quality of the match and comparing it against the other ID methods.

Example:

udf <- udf |>
  mutate(
    match_key = match_id_1(FirstName, LastName, DOB, Address)
  )

vrbis <- vrbis |>
  mutate(
    match_key = match_id_1(FirstName, LastName, DOB, Address)
  )

out <- left_join(udf, vrbis, by = "match_key")
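
One way to assess match quality is to look for duplicate keys (which multiply rows in a join) and for records that found no partner. The sketch below uses {dplyr} on two small made-up tables; the table names, columns, and keys are hypothetical stand-ins for udf and vrbis.

```r
library(dplyr)

# Made-up stand-ins for the two datasets, already carrying match_key.
udf_demo <- data.frame(
  IncidentID = 1:4,
  match_key  = c("MICKMOUS1955-07-17", "MINNMOUS1957-03-02",
                 "DONADUCK1934-06-09", "MICKMOUS1955-07-17")
)
vrbis_demo <- data.frame(
  CertNumber = c("A1", "A2"),
  match_key  = c("MICKMOUS1955-07-17", "GOOFGOOF1932-05-25")
)

# Keys that appear more than once on the left side.
dup_keys <- udf_demo |>
  count(match_key) |>
  filter(n > 1)

# Left-side records with no matching key on the right side.
unmatched <- anti_join(udf_demo, vrbis_demo, by = "match_key")

# Share of left-side records that found a match.
match_rate <- 1 - nrow(unmatched) / nrow(udf_demo)
```

Running the same checks with a second key variant (for example, one based on phone number) gives a quick sense of how many additional records a looser or different key would link.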