Data Privacy and Documentation Workflows • devkit

Introduction

When sharing datasets or publishing packages that contain data, developers must ensure that:

1. Sensitive Personally Identifiable Information (PII) is anonymized.
2. Datasets are thoroughly documented with standard data dictionaries.
3. Package functions are covered by reliable test suites.

devkit provides modules to streamline data masking, roxygen2 documentation generation, and unit-test scaffolding.

🔐 Anonymizing Personally Identifiable Information (PII)

Before sharing research data or package datasets, PII such as names, email addresses, phone numbers, and exact locations must be masked or removed.

mask_identity() runs an interactive console wizard that reads a dataframe, prompts you to select columns containing sensitive data, and applies appropriate masking algorithms (e.g., scrambling strings, grouping ages, or replacing values with random identifiers).

Example: Masking a Patient Dataset

Imagine we have a dummy clinical dataset containing sensitive columns:

# Create a dummy patient dataset
patient_data <- data.frame(
  patient_id = 1:5,
  name = c("Alice Smith", "Bob Jones", "Charlie Brown", "Diana Prince", "Evan Wright"),
  age = c(34, 45, 23, 56, 41),
  email = c("alice@mail.com", "bob@mail.com", "charlie@mail.com", "diana@mail.com", "evan@mail.com"),
  diagnosis = c("Flu", "Cold", "Flu", "Allergy", "Healthy"),
  stringsAsFactors = FALSE
)

# Run the interactive masking wizard
masked_data <- mask_identity(patient_data)

# The wizard will prompt you:
# 1. Scramble/Anonymize the 'name' column? Yes -> replaces names with scrambled strings (e.g., 'Ujdfn Hsoiu')
# 2. Scramble/Anonymize the 'email' column? Yes -> replaces emails with masked placeholder addresses (e.g., 'mask_1@example.com')
# 3. Apply category grouping to 'age'? Yes -> groups exact ages into ranges (e.g., '30-39', '40-49')

# Verify the masked dataset
head(masked_data)
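
Because mask_identity() is interactive, it can help to see the kinds of transformations described above written out as plain code. The following is a minimal base-R sketch under stated assumptions, not devkit's implementation: mask_sketch() is a hypothetical helper, and the pseudonym and placeholder formats are chosen only for illustration.

```r
# Hypothetical helper (NOT part of devkit): a non-interactive masking sketch.
mask_sketch <- function(df) {
  out <- df
  n <- nrow(out)

  # Replace names with pseudonyms assigned in random order
  out$name <- sprintf("person_%03d", sample.int(n))

  # Replace emails with placeholders on the reserved example.com domain
  out$email <- sprintf("mask_%d@example.com", seq_len(n))

  # Group exact ages into 10-year bands, e.g. 34 becomes "30-39"
  lower <- (out$age %/% 10) * 10
  out$age <- paste0(lower, "-", lower + 9)

  out
}

# Same dummy data as in the example above
patient_data <- data.frame(
  patient_id = 1:5,
  name = c("Alice Smith", "Bob Jones", "Charlie Brown", "Diana Prince", "Evan Wright"),
  age = c(34, 45, 23, 56, 41),
  email = c("alice@mail.com", "bob@mail.com", "charlie@mail.com", "diana@mail.com", "evan@mail.com"),
  diagnosis = c("Flu", "Cold", "Flu", "Allergy", "Healthy"),
  stringsAsFactors = FALSE
)

masked <- mask_sketch(patient_data)
masked$age
```

Non-sensitive columns such as diagnosis pass through unchanged. Note that patient_id is also left untouched in this sketch; if an identifier can be linked back to a person, it needs the same treatment.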

📝 Dictating Data Dictionaries

R CMD check warns about undocumented package datasets, so every dataset shipped in a CRAN package needs a documentation entry. With roxygen2, the convention is a @format block listing the column names and their descriptions. Writing this by hand is tedious.

dictate_dictionary() runs an interactive wizard that inspects your dataframe’s column names and classes, prompts you for a description of each column, and generates a pre-formatted roxygen2 documentation block ready to be pasted into your package source files.

# Create a dummy sales dataframe
sales_df <- data.frame(
  transaction_id = 1001:1003,
  amount_usd = c(12.50, 45.00, 120.99),
  category = c("Book", "Electronics", "Clothing"),
  stringsAsFactors = FALSE
)

# Generate a roxygen2 data dictionary interactively
dict_res <- dictate_dictionary(sales_df)

# The console wizard will prompt you for descriptions:
# - 'transaction_id': Unique transaction identifier
# - 'amount_usd': Transaction amount in US Dollars
# - 'category': Category of item purchased

# Print the generated roxygen2 lines
cat(dict_res$roxygen_block, sep = "\n")

The output will be formatted like:

#' @format A data frame with 3 rows and 3 variables:
#' \describe{
#'   \item{transaction_id}{Unique transaction identifier}
#'   \item{amount_usd}{Transaction amount in US Dollars}
#'   \item{category}{Category of item purchased}
#' }
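
To make the shape of that output concrete, here is a minimal base-R sketch that assembles the same kind of block from a named vector of descriptions. It is an illustration only: build_format_block() is a hypothetical helper, not a devkit function, and it takes the descriptions as an argument instead of prompting for them.

```r
# Hypothetical helper (NOT part of devkit): build a roxygen2 @format block
# from a data frame and a named character vector of column descriptions.
build_format_block <- function(df, descriptions) {
  # Every column needs exactly one description
  stopifnot(setequal(names(descriptions), names(df)))

  # One \item{column}{description} line per column, in column order
  items <- sprintf(
    "#'   \\item{%s}{%s}",
    names(df),
    unname(descriptions[names(df)])
  )

  c(
    sprintf("#' @format A data frame with %d rows and %d variables:",
            nrow(df), ncol(df)),
    "#' \\describe{",
    items,
    "#' }"
  )
}

sales_df <- data.frame(
  transaction_id = 1001:1003,
  amount_usd = c(12.50, 45.00, 120.99),
  category = c("Book", "Electronics", "Clothing"),
  stringsAsFactors = FALSE
)

descs <- c(
  transaction_id = "Unique transaction identifier",
  amount_usd = "Transaction amount in US Dollars",
  category = "Category of item purchased"
)

block <- build_format_block(sales_df, descs)
cat(block, sep = "\n")
```

Looking descriptions up by column name, and not by position, keeps the block correct even if the descriptions are supplied in a different order from the columns.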

🧪 Scaffolding Unit Tests

Writing test suites for your functions ensures code reliability. scaffold_tests() creates test files under tests/testthat/ with structural boilerplate matching your function’s signature and return type.

# Scaffold a test file for the function 'calculate_mean'
scaffold_tests(target_func = "calculate_mean")

This generates tests/testthat/test-calculate_mean.R with a placeholder test block for you to fill in:

test_that("calculate_mean works as expected", {
  # Add your assertions here
  # expect_equal(calculate_mean(x), expected_value)
})
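
For readers curious about what such scaffolding involves, the following base-R sketch writes a skeleton of the same shape to disk. It rests on assumptions: write_test_skeleton() and its test_dir argument are hypothetical and are not devkit's API, and unlike the description of scaffold_tests() above, this sketch does not inspect the function's signature or return type.

```r
# Hypothetical helper (NOT part of devkit): write a testthat skeleton file.
write_test_skeleton <- function(target_func,
                                test_dir = file.path("tests", "testthat")) {
  dir.create(test_dir, recursive = TRUE, showWarnings = FALSE)
  path <- file.path(test_dir, paste0("test-", target_func, ".R"))

  # Never clobber tests that someone has already written
  if (file.exists(path)) {
    stop("Refusing to overwrite existing file: ", path)
  }

  lines <- c(
    sprintf("test_that(\"%s works as expected\", {", target_func),
    "  # Add your assertions here",
    sprintf("  # expect_equal(%s(x), expected_value)", target_func),
    "})"
  )
  writeLines(lines, path)
  invisible(path)
}

# Write into a temporary directory so the example leaves no files behind
tmp <- file.path(tempdir(), "testthat-sketch")
path <- write_test_skeleton("calculate_mean", test_dir = tmp)
readLines(path)
```

The overwrite guard is a deliberate design choice in this sketch: a scaffolding step that silently replaces an existing test file could destroy hand-written assertions.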