Preprocesses data for unsupervised anomaly detection by handling identifiers, scaling numerical features, and encoding categorical variables.
Arguments
- data
A data frame containing the data to be preprocessed.
- id_cols
Character vector of column names to exclude from scoring (e.g., patient IDs, encounter IDs). If NULL, the function attempts to auto-detect ID columns from common naming patterns.
- exclude_cols
Character vector of additional columns to exclude from scoring. Default is NULL.
- scale_method
Character string indicating the scaling method for numerical variables. Options: "mad" (Median Absolute Deviation, default), "minmax" (min-max normalization), or "none" (no scaling).
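As a rough illustration of what the three scale_method options could compute, here is a minimal base-R sketch. The helper scale_sketch() is hypothetical and is not part of the documented API; the exact formulas used by prep_for_anomaly() (for example, whether the MAD includes the 1.4826 consistency constant) are assumptions.

```r
# Hypothetical helper, not part of the documented API: scales one numeric
# vector using the option names accepted by `scale_method`.
scale_sketch <- function(x, method = c("mad", "minmax", "none")) {
  method <- match.arg(method)
  if (method == "none") {
    return(x)
  }
  if (method == "mad") {
    # stats::mad() applies a consistency constant of 1.4826 by default;
    # whether the package does the same is an assumption.
    centre <- stats::median(x, na.rm = TRUE)
    spread <- stats::mad(x, na.rm = TRUE)
  } else {
    # Min-max normalization maps the observed range onto [0, 1].
    rng <- range(x, na.rm = TRUE)
    centre <- rng[1]
    spread <- rng[2] - rng[1]
  }
  # Guard against constant columns, where the spread is zero.
  if (is.na(spread) || spread == 0) {
    return(rep(0, length(x)))
  }
  (x - centre) / spread
}

cost <- c(9000, 10000, 11000, 12000, 95000)
scaled_mad <- scale_sketch(cost, "mad")
scaled_minmax <- scale_sketch(cost, "minmax")
```

In this toy vector the extreme value 95000 stays far from the rest under "mad", because the median and MAD are barely affected by it, whereas "minmax" squeezes the four ordinary values into a narrow band near 0.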
Value
A list containing:
- prepared_data
A numeric matrix ready for anomaly detection, with numeric columns scaled according to scale_method and categorical columns encoded.
- metadata
A list with mapping information:
  - original_data: The original data frame
  - id_cols: Column names used as identifiers
  - numeric_cols: Column names of numeric variables
  - categorical_cols: Column names of categorical variables
  - excluded_cols: Column names excluded from scoring
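To make the shape of the return value concrete, the sketch below builds a numeric matrix and a metadata list with the fields described above, using base R only. It is not the package's implementation: the one-indicator-per-level encoding via model.matrix() is an assumed scheme, and scaling is omitted for brevity.

```r
# Toy input mirroring the column roles described above.
df <- data.frame(
  patient_id = 1:4,
  age = c(34, 51, 47, 62),
  gender = factor(c("M", "F", "F", "M"))
)

# Split the non-ID columns into numeric and categorical features.
id_cols <- "patient_id"
feature_cols <- setdiff(names(df), id_cols)
numeric_cols <- feature_cols[vapply(df[feature_cols], is.numeric, logical(1))]
categorical_cols <- setdiff(feature_cols, numeric_cols)

# One indicator column per factor level (assumed encoding; the package may
# use a different scheme).
encoded <- lapply(categorical_cols, function(col) {
  f <- factor(df[[col]])
  m <- stats::model.matrix(~ f - 1)
  colnames(m) <- paste0(col, "_", levels(f))
  m
})

# Combine numeric columns and indicator columns into one numeric matrix.
prepared <- do.call(cbind, c(list(as.matrix(df[numeric_cols])), encoded))

# Mapping information with the same field names as the metadata element.
metadata <- list(
  original_data = df,
  id_cols = id_cols,
  numeric_cols = numeric_cols,
  categorical_cols = categorical_cols,
  excluded_cols = character(0)
)
```

Here prepared has one row per patient and the columns age, gender_F, and gender_M, while metadata keeps enough information to map matrix rows back to the original records.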
Examples
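set.seed(42)  # make the simulated data reproducible
data <- data.frame(
  patient_id = 1:20,
  age = rnorm(20, 50, 15),
  cost = rnorm(20, 10000, 5000),
  gender = sample(c("M", "F"), 20, replace = TRUE)
)
# Treat patient_id as an identifier so it is excluded from scoring
prep_result <- prep_for_anomaly(data, id_cols = "patient_id")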
