Why build a package?

“The concept is that anytime anyone on the team solves a problem that they think others might encounter, they can generalize their code, include it into Rbnb and now the whole team can access it.” - How R Helps Airbnb Make the Most of Its Data (PDF link)

“We build packages to develop collaborative solutions to common problems, to standardize the visual presentation of our work, and to avoid reinventing the wheel.” - Using R packages and education to scale Data Science at Airbnb (link)

Examples of “common problems” in our day-to-day work:

Recoding demographics (e.g. age groups, race/ethnicity)
Standardizing addresses prior to geocoding
Making unique IDs to link two datasets together
Suppressing or removing sensitive data before publishing
Making epidemic curves/maps/other data visualizations
Converting dates to higher levels for aggregating (e.g. MMWR week or year, week ending date)

Benefits

Eliminate copy & pasting code from script to script to script to script to…
Standardization across data visualizations and data clean-up
Reduces barriers/time to completing more meaningful analysis

Challenges

Building packages is rewarding but not without challenges:

Ideas (existing vs. new)
ALL. THE. TESTING. (and documenting it)
Deployment/distribution

Building the R package

Tools

usethis is a workflow package: sets up GitHub, pkgdown website, templating.

testthat automates unit testing, which describes what you expect a function to do.

roxygen2 describes your functions in formal documentation.

devtools builds the actual package, checks for unregistered dependencies, runs checks to ensure the package can be run on multiple systems (Windows, macOS, Ubuntu), and executes unit testing from testthat.
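To make the division of labor between these tools concrete, here is a minimal sketch of one documented and tested function. The function recode_gender_demo() and its two-code lookup are hypothetical and only for illustration; they are not the OCepi implementation. The #' lines are roxygen2 comments, and the test is guarded so the snippet still runs if testthat is not installed.

```r
#' Recode single-letter gender codes to full labels
#'
#' Hypothetical example function; not part of OCepi.
#'
#' @param x Character vector of codes such as "M" or "F".
#' @return Character vector of labels; unrecognized or missing codes
#'   become "Missing/Unknown".
#' @export
recode_gender_demo <- function(x) {
  lookup <- c(M = "Male", F = "Female")
  # as.character() guards against factors being used as integer indices
  out <- unname(lookup[as.character(x)])
  out[is.na(out)] <- "Missing/Unknown"
  out
}

# A unit test states what we expect the function to do
if (requireNamespace("testthat", quietly = TRUE)) {
  testthat::test_that("known codes are recoded and unknowns are flagged", {
    testthat::expect_equal(recode_gender_demo(c("M", "F")), c("Male", "Female"))
    testthat::expect_equal(recode_gender_demo(c(NA, "Z")), rep("Missing/Unknown", 2))
  })
}
```

In a package, the function would typically live under R/ and the test under tests/testthat/, where devtools::document() and devtools::test() pick them up.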

Functions

How do we decide what goes into the package? Using the example below, there are some obvious candidates we can generalize into functions: age, gender, sexual orientation, and race/ethnicity.

cases <- udf %>%
  filter(RStatus %in% c("Confirmed","Probable","Suspect")) %>%
  left_join(pos_lab, by = "IncidentID") %>%
  mutate(
    across(c("DtEpisode","DtOnset","DtLabCollect"), ~ as.Date(., "%m/%d/%Y")),
    AgeGroup = case_when(
      Age %in% 0:15 ~ "0-15",
      Age %in% 16:24 ~ "16-24",
      Age %in% 25:34 ~ "25-34",
      Age %in% 35:44 ~ "35-44",
      Age %in% 45:54 ~ "45-54",
      Age %in% 55:64 ~ "55-64",
      Age > 64 ~ "65+"
    ),
    Gender = case_when(
      Gender == "M" ~ "Male",
      Gender == "F" ~ "Female",
      Gender == "TF" ~ "Transgender woman",
      Gender == "TM" ~ "Transgender man",
      Gender %in% c("U","D") ~ "Missing/Unknown",
      Gender == "I" ~ "Identity Not Listed",
      Gender == "G" ~ "Genderqueer/Non-binary",
      is.na(Gender) ~ "Missing/Unknown"
    ),
    CTCIAdtlDemOrient = case_when(
      CTCIAdtlDemOrient == "BIS" ~ "Bisexual",
      CTCIAdtlDemOrient %in% c("DNK","UNK","NOT","DEC") ~ "Unknown",
      CTCIAdtlDemOrient == "HET" ~ "Heterosexual or straight",
      CTCIAdtlDemOrient == "HOM" ~ "Gay, lesbian, or same gender-loving",
      is.na(CTCIAdtlDemOrient) ~ "Unknown",
      TRUE ~ CTCIAdtlDemOrient),
    RaceEthnicity = ifelse(Ethnicity == "Hispanic or Latino", Ethnicity, Race),
    RaceEthnicity = case_when(
      RaceEthnicity == "Hispanic or Latino" ~ "Hispanic/Latinx",
      RaceEthnicity == "Black or African American" ~ "Black/African American",
      RaceEthnicity == "Native Hawaiian or Other Pacific Islander" ~ "NHOPI",
      RaceEthnicity == "American Indian or Alaska Native" ~ "AI/AN",
      TRUE ~ RaceEthnicity),
    CliTxOrthopoxTx = case_when(
      CliTxOrthopoxTx == "YT" ~ "Tecovirimat",
      CliTxOrthopoxTx == "YO" ~ "Yes, Not Specified",
      CliTxOrthopoxTx == "N" ~ "None",
      CliTxOrthopoxTx == "DK" ~ "Unknown",
      is.na(CliTxOrthopoxTx) ~ "Unknown",
      TRUE ~ "Unknown"),
    homeless = case_when(
      EpiGrpSetLTExp_HML_1 == "HML" ~ "Y",
      EpiGrpSetLTExp_HML_2 == "HML" ~ "Y",
      EpiGrpSetLTExp_HML_3 == "HML" ~ "Y",
      City == "Homeless" ~ "Y",
      Zip == "99999" ~ "Y",
      TRUE ~ "N"),
    HOSPHOSPITALIZED = case_when(
      HOSPHOSPITALIZED == "Y" ~ "Yes",
      HOSPHOSPITALIZED == "N" ~ "No",
      is.na(HOSPHOSPITALIZED) ~ "Unknown",
      TRUE ~ "Unknown"),
    monkeypox_pcr = case_when(
      LabLRSSpecTstMeth_1_1 == 1 & LabLRSSpecRslt_1 == 2 ~ 1,
      LabLRSSpecTstMeth_1_2 == 1 & LabLRSSpecRslt_2 == 2 ~ 1,
      LabLRSSpecTstMeth_1_3 == 1 & LabLRSSpecRslt_3 == 2 ~ 1,
      TRUE ~ 0),
    orthopox_pcr = case_when(
      LabLRSSpecTstMeth_4_1 == 4 & LabLRSSpecRslt_1 == 2 ~ 1,
      LabLRSSpecTstMeth_5_1 == 5 & LabLRSSpecRslt_1 == 2 ~ 1,
      LabLRSSpecTstMeth_4_2 == 4 & LabLRSSpecRslt_2 == 2 ~ 1,
      LabLRSSpecTstMeth_5_2 == 5 & LabLRSSpecRslt_2 == 2 ~ 1,
      TRUE ~ 0),
    case_type = case_when(
      test_type == "Monkeypox PCR" ~ "Confirmed", #elr
      monkeypox_pcr == 1 ~ "Confirmed", #manually entered
      test_type %in% c("Non-variola Orthopox PCR","Orthopoxvirus PCR") ~ "Probable", #elr
      orthopox_pcr == 1 ~ "Probable", #manually entered
      TRUE ~ "Suspect")
  ) %>%
  select(Disease, IncidentID, Gender = Sex, Public_Orientation = CTCIAdtlDemOrient, Age, AgeGroup, RaceEthnicity, City, Zip, RStatus, case_type, Investigator,
         FinalDispo, DtEpisode, DtOnset, DtLabCollect, Gender_Partner, treatment = CliTxOrthopoxTx, tx_type = CliTxOrthopoxTxSpcfy, homeless, hospitalized = HOSPHOSPITALIZED,
         Outcome = OUTCOMEOUTCOME, DtDeath, NOTES)
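As one illustration of how a piece of the pipeline above might be pulled out into a reusable function, here is a sketch of the combined race/ethnicity recode. The name combine_race_eth() is hypothetical and this is not how {OCepi} implements it; it only mirrors the labels used in the case_when() above. One deliberate difference: %in% treats a missing ethnicity as not Hispanic and falls back to race, where == inside ifelse() would return NA.

```r
# Hypothetical helper, written in base R for illustration only
combine_race_eth <- function(ethnicity, race) {
  # Hispanic/Latino ethnicity takes priority over race
  out <- ifelse(ethnicity %in% "Hispanic or Latino", "Hispanic/Latinx", race)

  # Shorten the long labels, as in the original case_when()
  short <- c(
    "Black or African American" = "Black/African American",
    "Native Hawaiian or Other Pacific Islander" = "NHOPI",
    "American Indian or Alaska Native" = "AI/AN"
  )
  hit <- out %in% names(short)
  out[hit] <- unname(short[out[hit]])
  out
}

combine_race_eth(
  ethnicity = c("Hispanic or Latino", "Non-Hispanic or Latino", "Unknown"),
  race = c("White", "Black or African American", "American Indian or Alaska Native")
)
```

Once the logic lives in one function, every script that needs it calls the same code instead of carrying its own copy of the case_when().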

Resources

Package overview

Specs

Built for familiar data systems:
- CalREDIE
- CAIR2
- Vital Records
40+ functions
Includes an R Markdown template for documenting R projects/analysis workflow
Very low dependencies, mostly written in base R
Extensive unit testing to ensure functions do what we want them to do

Full documentation: https://ericmshearer.github.io/OCepi/

To install: devtools::install_github("ericmshearer/OCepi")

Example Use Case

The following use case aims to demonstrate the different ways {OCepi} has streamlined how we go from raw data to data visualization.

To start, we’ll load simulated outbreak data included in the package.

library(gt)
library(sf)
library(dplyr)
library(OCepi)
library(ggplot2)
library(patchwork)

dis_x <- linelist

Ethnicity	Race	Gender	Age	SexualOrientation	SpecimenDate
Non-Hispanic or Latino	Multiple Races	M	46	HET	2022-06-07
Unknown	Unknown	M	4	HET	2022-06-09
Non-Hispanic or Latino	White	F	52	UNK	2022-06-07
Non-Hispanic or Latino	White	F	77	UNK	2022-06-11
Unknown	American Indian or Alaska Native	M	71	HET	2022-06-10
Non-Hispanic or Latino	Other	M	70	HET	2022-06-09
Non-Hispanic or Latino	Black or African American	F	11	HET	2022-06-08
Non-Hispanic or Latino	Black or African American	F	8	HET	2022-06-12
Hispanic or Latino	American Indian or Alaska Native	F	41	HET	2022-06-12
Non-Hispanic or Latino	Black or African American	M	56	HET	2022-06-10

Clean up the data

First things first: recode ethnicity and race to one variable, recode age to age groups, and recode gender and sexual orientation abbreviations to full names.

Note

All of our recoding functions aim to be compatible with CalREDIE, VRBIS, CAIR2, and BioSense/ESSENCE.

dis_x <- dis_x |>
  mutate(
    race_ethnicity = recode_race(Ethnicity, Race, abbr_names = FALSE),
    age_groups = age_groups(Age, type = "covid"),
    Gender = recode_gender(Gender),
    SexualOrientation = recode_orientation(SexualOrientation),
    week_ending = week_ending_date(SpecimenDate)
  )

Ethnicity	Race	Gender	Age	SexualOrientation	SpecimenDate	race_ethnicity	age_groups	week_ending
Non-Hispanic or Latino	Multiple Races	Male	46	Heterosexual or straight	2022-06-07	Multiple Races	45-54	2022-06-11
Unknown	Unknown	Male	4	Heterosexual or straight	2022-06-09	Missing/Unknown	0-17	2022-06-11
Non-Hispanic or Latino	White	Female	52	Missing/Unknown	2022-06-07	White	45-54	2022-06-11
Non-Hispanic or Latino	White	Female	77	Missing/Unknown	2022-06-11	White	75-84	2022-06-11
Unknown	American Indian or Alaska Native	Male	71	Heterosexual or straight	2022-06-10	American Indian/Alaska Native	65-74	2022-06-11
Non-Hispanic or Latino	Other	Male	70	Heterosexual or straight	2022-06-09	Other	65-74	2022-06-11
Non-Hispanic or Latino	Black or African American	Female	11	Heterosexual or straight	2022-06-08	Black/African American	0-17	2022-06-11
Non-Hispanic or Latino	Black or African American	Female	8	Heterosexual or straight	2022-06-12	Black/African American	0-17	2022-06-18
Hispanic or Latino	American Indian or Alaska Native	Female	41	Heterosexual or straight	2022-06-12	Hispanic/Latinx	35-44	2022-06-18
Non-Hispanic or Latino	Black or African American	Male	56	Heterosexual or straight	2022-06-10	Black/African American	55-64	2022-06-11
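The week_ending values in the table are consistent with weeks that end on Saturday, as MMWR weeks do. A minimal base R sketch of that calculation follows; week_ending_sat() is a hypothetical name, and the actual OCepi::week_ending_date() may be implemented differently.

```r
# Hypothetical helper: roll each date forward to the Saturday ending its week
week_ending_sat <- function(d) {
  d <- as.Date(d)
  wday <- as.integer(format(d, "%w"))  # 0 = Sunday, ..., 6 = Saturday
  d + (6L - wday)
}

# Tuesday, Saturday, and Sunday from the sample data
week_ending_sat(c("2022-06-07", "2022-06-11", "2022-06-12"))
```

A Saturday maps to itself, and a Sunday rolls forward to the following Saturday, which matches the rows shown above.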

For OCepi::age_groups(x, type = "decade"), we combined our most common ways to bin age into one function via presets. Examples: Pertussis, West Nile Virus, decade, chronic Hepatitis C, flu vax. In OCepi::recode_race(ethnicity, race), we provided the argument abbr_names in case you want to shorten long names (set abbr_names = TRUE), e.g. Native Hawaiian/Other Pacific Islander -> NHOPI. The call above uses abbr_names = FALSE, which keeps the full names.
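The preset idea can be sketched in base R with cut() and a small lookup of break points. Everything here is hypothetical: the function name bin_age_demo(), the preset names, and the break points are for illustration only and are not the presets shipped in {OCepi}.

```r
# Hypothetical sketch of preset-based age binning
bin_age_demo <- function(age, preset = c("decade", "adult_child")) {
  preset <- match.arg(preset)
  spec <- switch(
    preset,
    decade = list(
      breaks = c(seq(0, 80, by = 10), Inf),
      labels = c(paste0(seq(0, 70, by = 10), "-", seq(9, 79, by = 10)), "80+")
    ),
    adult_child = list(
      breaks = c(0, 18, Inf),
      labels = c("0-17", "18+")
    )
  )
  # right = FALSE makes intervals closed on the left, e.g. [10, 20)
  as.character(cut(age, breaks = spec$breaks, labels = spec$labels, right = FALSE))
}

bin_age_demo(c(4, 46, 77, 85))
bin_age_demo(c(4, 46), preset = "adult_child")
```

Adding a new preset is then a matter of adding one entry to the switch(), without touching the scripts that call the function.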

Summarize Data

Next is to run some basic frequencies using OCepi::add_percent(x, digits = 1, multiply = TRUE) and create labels using OCepi::n_percent(n, percent, reverse = TRUE, n_suppress = x). Based on our experience, we have tried to incorporate as many ways to customize the output as we could: how many digits to round to, suppressing low values, multiplying a fraction by 100 or not, and the order of the labels (n then % or % then n).

dis_x |>
  count(race_ethnicity) |>
  mutate(
    percent = add_percent(n, digits = 1),
    label = n_percent(n, percent, reverse = TRUE, n_suppress = 10)
  ) |>
  gt() |>
  apollo_table(size = 14)

race_ethnicity	n	percent	label
American Indian/Alaska Native	10	9.5	9.5% (10)
Asian	7	6.7	**
Black/African American	16	15.2	15.2% (16)
Hispanic/Latinx	19	18.1	18.1% (19)
Missing/Unknown	8	7.6	**
Multiple Races	6	5.7	**
Native Hawaiian/Other Pacific Islander	10	9.5	9.5% (10)
Other	15	14.3	14.3% (15)
White	14	13.3	13.3% (14)

Calling OCepi::add_percent() is much simpler than writing round(n / sum(n) * 100, digits = 1), and OCepi::n_percent() is simpler than sprintf("%s (%s%%)", n, percent). Not only are the {OCepi} functions simpler, you also gain tons of flexibility in how you want your summarized data to look.
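For readers curious what such helpers might wrap, here is a base R sketch of a percent calculation and a label builder with small-cell suppression. The names percent_of_total() and label_pct_n() are hypothetical, and the suppression rule (mask counts below the threshold with "**") is inferred from the table above rather than taken from the package source.

```r
# Hypothetical helpers, not the OCepi implementation
percent_of_total <- function(n, digits = 1) {
  round(n / sum(n) * 100, digits)
}

label_pct_n <- function(n, percent, suppress_below = NULL) {
  out <- sprintf("%s%% (%s)", percent, n)
  # Mask small cells so low counts are not published
  if (!is.null(suppress_below)) {
    out[n < suppress_below] <- "**"
  }
  out
}

n <- c(10, 7, 16)
label_pct_n(n, percent_of_total(n), suppress_below = 10)
```

Keeping the rounding and suppression rules in one place means a policy change, such as a different suppression threshold, is made once rather than in every script.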

We also have the ability to calculate incidence rates using OCepi::rate_per_100k(n, pop_denom, digits = 1) or time between dates using OCepi::time_between(recent_date, older_date, unit = c("days")).
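Both calculations are short in base R, which is part of why they are easy to retype slightly differently in each script. The sketch below uses the hypothetical names rate_100k_demo() and days_between_demo() and made-up inputs; it is not the package code.

```r
# Hypothetical helpers for illustration
rate_100k_demo <- function(n, pop, digits = 1) {
  round(n / pop * 100000, digits)
}

days_between_demo <- function(recent_date, older_date) {
  # Subtracting two Date objects gives a difference in days
  as.numeric(as.Date(recent_date) - as.Date(older_date))
}

rate_100k_demo(25, 3200000)                    # 25 cases in a population of 3.2 million
days_between_demo("2022-06-12", "2022-06-07")  # onset to specimen collection, say
```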

Data Visualizations

Now that we have summarized data, we can build our data visualizations. To achieve unified aesthetics across our four surveillance branches, we developed theme_apollo(direction = x) and apollo_label(). Our theme is designed to work with vertical/horizontal orientations as well as maps.

dis_x |>
  count(race_ethnicity) |>
  mutate(
    percent = add_percent(n, digits = 1),
    label = n_percent(n, percent, reverse = TRUE)
  ) |>
  ggplot(aes(x = race_ethnicity, y = percent)) +
  geom_bar(stat = "identity") +
  scale_x_discrete() +
  scale_y_continuous(expand = c(0,0), limits = c(0,25), labels = scales::label_percent(scale = 1)) +
  theme_apollo() +
  apollo_label(aes(label = label), vjust = -0.4) +
  labs(
    title = "Disease X Cases by Race/Ethnicity, 2024",
    subtitle = "PHS/Communicable Disease Control",
    x = "Race/Ethnicity",
    y = "Proportion (%)"
  )

Overall this looks much nicer than {ggplot2} right out of the box, but we can improve it further by wrapping the long labels using OCepi::wrap_labels() and filling the bars with OCepi::cdcd_color(). Our version of label wrapping will break/wrap the text at whatever delimiter you want (e.g. tab, comma, slash, and/or space). This may not always produce the desired outcome, so we recommend trying scales::label_wrap(15) or coord_flip() as another approach.

dis_x |>
  count(race_ethnicity) |>
  mutate(
    percent = add_percent(n, digits = 1),
    label = n_percent(n, percent, reverse = TRUE)
  ) |>
  ggplot(aes(x = forcats::fct_rev(race_ethnicity), y = percent)) +
  geom_bar(stat = "identity", fill = cdcd_color("dodgers")) +
  scale_x_discrete(labels = wrap_labels(delim = "/")) +
  scale_y_continuous(expand = expansion(add = c(0,2)), limits = c(0,25), labels = scales::label_percent(scale = 1)) +
  theme_apollo(direction = "horizontal") +
  apollo_label(aes(label = label), hjust = -0.3) +
  labs(
    title = "Disease X Cases by Race/Ethnicity, 2024",
    subtitle = "PHS/Communicable Disease Control",
    x = "Race/Ethnicity",
    y = "Proportion (%)"
  ) +
  coord_flip()
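Wrapping at a delimiter amounts to inserting a line break after each occurrence of it. The base R sketch below shows the idea with a hypothetical wrap_at_demo(); whether OCepi::wrap_labels() keeps or drops the delimiter is not something this sketch claims to reproduce.

```r
# Hypothetical helper: break a label after each delimiter
wrap_at_demo <- function(x, delim = "/") {
  # fixed = TRUE treats the delimiter literally rather than as a regex
  gsub(delim, paste0(delim, "\n"), x, fixed = TRUE)
}

cat(wrap_at_demo("American Indian/Alaska Native"))
```

A function like this can be passed to the labels argument of a ggplot2 scale, since that argument accepts a function that takes the breaks and returns labels.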

Extending ggplot2

There are other ways {OCepi} can help elevate data visualizations, specifically around highlighting groups of interest. It may not be obvious in the plot above which group makes up the greatest proportion of cases. {OCepi} offers two solutions: OCepi::highlight_geom() and OCepi::desaturate_geom(). One version highlights your important group and fades the rest to light grey; the other highlights the important group and desaturates the rest. At their core, these functions use dplyr::filter() to achieve the desired effect. Examples: percent == max(percent), n > 50, or gender %in% c("Female").

In the following example, the group(s) making up the highest proportion of cases for that variable will be highlighted.

re_tbl <- dis_x |>
  count(race_ethnicity) |>
  mutate(
    percent = add_percent(n, digits = 1),
    label = n_percent(n, percent, reverse = TRUE)
  )

l <- ggplot(data = re_tbl, aes(x = forcats::fct_rev(race_ethnicity), y = percent)) +
  geom_bar(stat = "identity") +
  highlight_geom(percent == max(percent), pal = cdcd_color("london pink")) +
  scale_x_discrete(labels = wrap_labels(delim = "/")) +
  scale_y_continuous(expand = expansion(add = c(0,2)), limits = c(0,25), labels = scales::label_percent(scale = 1)) +
  theme_apollo(direction = "horizontal") +
  apollo_label(data = re_tbl, aes(label = label), hjust = -0.2) +
  labs(
    title = "Disease X Cases by Race/Ethnicity, 2024",
    subtitle = "PHS/Communicable Disease Control",
    x = "Race/Ethnicity",
    y = "Proportion (%)"
  ) +
  coord_flip()

r <- ggplot(data = re_tbl, aes(x = forcats::fct_rev(race_ethnicity), y = percent)) +
  geom_bar(stat = "identity") +
  desaturate_geom(percent == max(percent), pal = cdcd_color("dodgers"), desaturate = 0.8) +
  scale_x_discrete(labels = wrap_labels(delim = "/")) +
  scale_y_continuous(expand = expansion(add = c(0,2)), limits = c(0,25), labels = scales::label_percent(scale = 1)) +
  theme_apollo(direction = "horizontal") +
  apollo_label(data = re_tbl, aes(label = label), hjust = -0.2) +
  labs(
    title = "Disease X Cases by Race/Ethnicity, 2024",
    subtitle = "PHS/Communicable Disease Control",
    x = "Race/Ethnicity",
    y = "Proportion (%)"
  ) +
  coord_flip()

l + r
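The underlying idea, a logical condition that splits rows into a focus group and everything else, can be shown without any plotting code. The sketch below uses made-up data and hypothetical colors, and it is not how highlight_geom() is implemented; it only illustrates the condition-to-color step.

```r
# Made-up summary table for illustration
re_demo <- data.frame(
  group = c("Asian", "Hispanic/Latinx", "White"),
  percent = c(6.7, 18.1, 13.3)
)

# Same style of condition as percent == max(percent)
is_focus <- re_demo$percent == max(re_demo$percent)

# Focus rows get the highlight color, all other rows a light grey
re_demo$fill <- ifelse(is_focus, "#E03A3E", "grey85")
re_demo
```

With plain {ggplot2}, a column like this could be mapped with aes(fill = fill) and drawn as-is using scale_fill_identity(), which is the manual work the two {OCepi} functions are meant to save.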

We also designed the functions to work with facet_grid() and facet_wrap().

hpi <- data.frame(
  stringsAsFactors = FALSE,
              Year = c(2021L,2021L,2021L,2021L,
                       2021L,2022L,2022L,2022L,2022L,2022L,2023L,2023L,
                       2023L,2023L,2023L,2024L,2024L,2024L,2024L,2024L),
      hpi_quartile = c("1","2","3","4","Unknown",
                       "1","2","3","4","Unknown","1","2","3","4",
                       "Unknown","1","2","3","4","Unknown"),
                 n = c(2L,12L,4L,7L,1L,5L,11L,
                       6L,6L,2L,1L,11L,5L,1L,1L,1L,3L,5L,5L,2L),
           Percent = c(8L,46L,15L,27L,4L,17L,37L,
                       20L,20L,7L,5L,58L,26L,5L,5L,6L,19L,31L,31L,
                       12L),
             Label = c("8% (2)","46% (12)","15% (4)",
                       "27% (7)","4% (1)","17% (5)","37% (11)","20% (6)",
                       "20% (6)","7% (2)","5% (1)","58% (11)","26% (5)",
                       "5% (1)","5% (1)","6% (1)","19% (3)","31% (5)",
                       "31% (5)","12% (2)")
)

ggplot(data = hpi, aes(x = hpi_quartile, y = Percent)) +
  geom_bar(stat = "identity") +
  facet_wrap(~Year, nrow = 2, scales = "free_x") +
  desaturate_geom(Percent == max(Percent), pal = cdcd_color("plum"), desaturate = 0.8) +
  scale_y_continuous(expand = c(0,0), labels = scales::label_percent(scale = 1), limits = c(0,70)) +
  theme_apollo() +
  apollo_label(data = hpi, aes(label = Label), vjust = -0.4) +
  labs(
    title = "Distribution of Disease Y by Healthy Places Index, 2024",
    subtitle = "PHS/Communicable Disease Control",
    x = "HPI Quartile",
    y = "Proportion (%)",
    caption = "HPI Quartiles range 1 to 4. Higher quartiles represent healthier community conditions."
  )

Other examples of highlight/desaturate_geom():

zip_map <- oc_zip_sf
zip_map$n_cases <- sample(1:99, 86)
    
#desaturate - map
ggplot(data = zip_map) +
  geom_sf() +
  desaturate_geom(n_cases > 80, pal = cdcd_color("dodgers"), desaturate = 0.8, linewidth = 0.5) +
  geom_sf_text(data = zip_map, aes(label = Zip)) +
  theme_apollo(direction = "map") +
  labs(
    title = "Disease X Cases by Zip Code",
    subtitle = "PHS/Communicable Disease Control",
    caption = "*Note: zip codes with >80 cases are highlighted."
  )

covid <- read.csv("https://data.chhs.ca.gov/dataset/f333528b-4d38-4814-bebb-12db1f10f535/resource/046cdd2b-31e5-4d34-9ed3-b48cdbc4be7a/download/covid19cases_test.csv", na.strings = "")

covid <- covid |>
  mutate(
    date = as.Date(date),
    rate = rate_per_100k(cases, population, digits = 1)
  ) |>
  arrange(date) |>
  group_by(area) |>
  mutate(rate_ma_7 = zoo::rollmean(rate, k = 7, fill = 0, align = "right")) |>
  ungroup() |>
  filter(area %in% c("Orange","Los Angeles","San Diego"))

covid_first_wave <- filter(covid, date >= "2020-10-01", date <= "2021-01-05")

ggplot(data = covid_first_wave, aes(x = date, y = rate_ma_7, color = area)) +
  geom_line(linewidth = 1.2) +
  apollo_label(data = end_points(covid_first_wave, date = date), aes(label = area, color = area), hjust = -0.1, color = NULL) +
  highlight_geom(area == "Orange", pal = cdcd_color("orange")) +
  scale_x_date(date_labels = "%m/%y", expand = expansion(add = c(0,14))) +
  scale_y_continuous(expand = c(0,0), breaks = c(0,25,50,75,100,125,150)) +
  theme_apollo(legend = "Hide") +
  labs(
    title = "COVID-19 Incidence Rates by SoCal County",
    subtitle = "PHS/Communicable Disease Control",
    x = "Date",
    y = "Rate per 100,000",
    color = "LHJ"
  )

Honorable Mentions

end_points()

Direct labeling is often very helpful, particularly for line graphs with more than one group. To achieve this, we designed OCepi::end_points(df, date = x, group_by = y). If your groups all end at the same time point, you can ignore group_by. When groups end at different time points, use group_by.

ggplot(data = covid_first_wave, aes(x = date, y = rate_ma_7, color = area, linetype = area)) +
  geom_line(linewidth = 1.2) +
  scale_x_date(date_labels = "%m/%y", expand = expansion(add = c(0,14))) +
  scale_y_continuous(expand = c(0,0), breaks = c(0,25,50,75,100,125,150)) +
  geom_text(data = end_points(covid_first_wave, date = date), aes(label = area, color = area), hjust = -0.1, size = 4.5) +
  theme_apollo(legend = "Hide") +
  labs(
    title = "COVID-19 Incidence Rates by SoCal County",
    subtitle = "PHS/Communicable Disease Control",
    x = "Date",
    y = "Rate per 100,000",
    color = "LHJ"
  ) +
  scale_color_manual(values = cdcd_color("mustard","light blue","title color"))
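The difference between the grouped and ungrouped cases is easiest to see on a tiny data frame. The sketch below is a base R approximation with a hypothetical last_points_demo() and made-up data; it is not the end_points() source.

```r
# Hypothetical helper: keep the row(s) at the last date, overall or per group
last_points_demo <- function(df, date_col, group_col = NULL) {
  d <- as.numeric(df[[date_col]])
  if (is.null(group_col)) {
    keep <- d == max(d)
  } else {
    # ave() returns each group's maximum date, repeated for every row in the group
    keep <- d == ave(d, df[[group_col]], FUN = max)
  }
  df[keep, , drop = FALSE]
}

demo <- data.frame(
  area = c("A", "A", "B", "B"),
  date = as.Date(c("2021-01-04", "2021-01-05", "2021-01-02", "2021-01-03")),
  rate = c(10, 12, 7, 9)
)

last_points_demo(demo, "date")          # only area A reaches the overall last date
last_points_demo(demo, "date", "area")  # one row per area
```

Without grouping, area B gets no label at all because its series stops two days earlier, which is the situation the group_by argument addresses.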

pos()/neg()

We have begun to condense all the variants of “positive” and “negative” from ELR data into OCepi::pos()/OCepi::neg(). Both functions contain SNOMED and string patterns. Please note: casing is set to upper case - adjust your dataset accordingly.

collapse controls whether to keep output as vector or collapse into one long string separated by “|”:

pos()

 [1] "POSITIVE"         "REACTIVE"         "DETECTED"         "10828004"        
 [5] "260373001"        "840533007"        "PCRP11"           "PDETD"           
 [9] "COVPRE"           "11214006"         "Positive for IgG" "POS"             
[13] "DECTECTED"        "REA"              "PPOSI"            "REAC"

pos(collapse = TRUE)

[1] "POSITIVE|REACTIVE|DETECTED|10828004|260373001|840533007|PCRP11|PDETD|COVPRE|11214006|Positive for IgG|POS|DECTECTED|REA|PPOSI|REAC"

Now used within a dplyr::case_when() statement:

elr <- data.frame(Results = c("POSITIVE","DETECTED","POS"))

elr <- elr |>
  mutate(
    #use case 1
    test_results = case_when(
      Results %in% pos(collapse = FALSE) ~ "Positive"
    ),
    #use case 2
    test_results2 = case_when(
      grepl(pos(collapse = TRUE), Results, ignore.case = TRUE) ~ "Positive"
    )
  )

print(elr)

   Results test_results test_results2
1 POSITIVE     Positive      Positive
2 DETECTED     Positive      Positive
3      POS     Positive      Positive

As we find more use cases, we will add new ways “positive” and “negative” show up in the data. Another idea is to add an argument to add results on the fly:

#this is just an idea, not live yet#
pos(collapse = FALSE, add_in = c("POSITIVE!!!","MOREPOSITIVE","POSITIVEPOSITIVE"))
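A rough sketch of how such an argument could work is below. The function pos_demo() and its shortened pattern list are hypothetical; the add_in behavior shown (upper-casing and de-duplicating additions) is one possible design, not a description of {OCepi}.

```r
# Hypothetical sketch of a pattern helper with an add_in argument
pos_demo <- function(collapse = FALSE, add_in = NULL) {
  patterns <- c("POSITIVE", "REACTIVE", "DETECTED", "POS")
  # Additions are upper-cased to match the built-in patterns
  patterns <- unique(c(patterns, toupper(add_in)))
  if (collapse) paste(patterns, collapse = "|") else patterns
}

pos_demo(add_in = "positive!!!")

# Exact matching and pattern matching can disagree:
results <- c("POSITIVE", "NOT DETECTED")
results %in% pos_demo()                     # TRUE FALSE
grepl(pos_demo(collapse = TRUE), results)   # TRUE TRUE
```

The last two lines show a trade-off in this sketch: the collapsed pattern catches messy variants, but as a substring match it also flags "NOT DETECTED", so negatives are worth handling before positives when using grepl().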

match_id_*

When two datasets have no matching primary key, OCepi::match_id_*() creates a primary key to join on. Probabilistic matching is excellent but computationally intensive (especially at scale with a so-so tech stack). When the IDs are made, the name portion of the string is capitalized, as the outputs below show. Depending on how messy your address and phone number variables are, you may want to clean them up using OCepi::clean_address(x, keep_extra = TRUE) or OCepi::clean_phone(). At its core, the ID consists of the first four letters of the first and last name plus the date of birth.

#variant 1 - uses first 10 characters of address
match_id_1("Mickey","Mouse","1955-07-17","1313 Disneyland Dr")

[1] "MICKMOUS1955-07-171313 Disne"

#variant 1 - standardizing address
match_id_1("Mickey","Mouse","1955-07-17",clean_address("1313 Disneyland Dr"))

[1] "MICKMOUS1955-07-171313 Disne"

#variant 2 - uses full address
match_id_2("Mickey","Mouse","1955-07-17","1313 Disneyland Dr")

[1] "MICKMOUS1955-07-171313 Disneyland Dr"

#variant 2 - standardizing address
match_id_2("Mickey","Mouse","1955-07-17",clean_address("1313 Disneyland Dr"))

[1] "MICKMOUS1955-07-171313 Disneyland Drive"

#variant 3
match_id_3("Mickey","Mouse","1955-07-17","714-781-4636")

[1] "MICKMOUS1955-07-17714-781-4636"

#variant 3 - standardizing phone number
match_id_3("Mickey","Mouse","1955-07-17",clean_phone("714-781-4636"))

[1] "MICKMOUS1955-07-177147814636"

#variant 4
match_id_4("Mickey","Mouse","1955-07-17")

[1] "MICKMOUS1955-07-17"
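The construction of these keys can be approximated in a few lines of base R. The function match_id_demo() below is a hypothetical stand-in that reproduces the outputs shown above for these inputs; the real functions may handle edge cases differently.

```r
# Hypothetical helper: first four letters of each name, date of birth, optional extra
match_id_demo <- function(first, last, dob, extra = "") {
  paste0(
    toupper(substr(first, 1, 4)),
    toupper(substr(last, 1, 4)),
    dob,
    extra
  )
}

match_id_demo("Mickey", "Mouse", "1955-07-17")

# Appending the first 10 characters of the address, as in variant 1
match_id_demo("Mickey", "Mouse", "1955-07-17", substr("1313 Disneyland Dr", 1, 10))
```

One caveat of this sketch: paste0() turns missing values into the literal text "NA", so records with a missing name or date of birth would need to be set aside before joining.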

After merging on primary key, we recommend assessing the quality of the match and comparing using the other id methods.

Example:

udf <- udf |>
  mutate(
    match_key = match_id_1(FirstName, LastName, DOB, Address)
  )

vrbis <- vrbis |>
  mutate(
    match_key = match_id_1(FirstName, LastName, DOB, Address)
  )

out <- left_join(udf, vrbis, by = "match_key")
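A quick way to assess a key-based match is to count unmatched rows and check for duplicate keys. The base R sketch below uses made-up data frames and merge() in place of left_join(); the column names are hypothetical.

```r
# Made-up data for illustration
cases <- data.frame(match_key = c("K1", "K2", "K3"), case_id = 1:3)
deaths <- data.frame(match_key = c("K1", "K3", "K3"), cert_no = c("a", "b", "c"))

# all.x = TRUE keeps every case, like a left join
out <- merge(cases, deaths, by = "match_key", all.x = TRUE)

sum(is.na(out$cert_no))            # cases with no match
any(duplicated(deaths$match_key))  # duplicate keys on the right side
nrow(out) - nrow(cases)            # extra rows created by those duplicates
```

Unmatched rows are candidates for a second pass with another ID variant, and duplicate keys are worth reviewing by hand because they multiply rows in the joined output.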