Preprocesses data for unsupervised anomaly detection by handling identifiers, scaling numerical features, and encoding categorical variables.

Usage

prep_for_anomaly(
  data,
  id_cols = NULL,
  exclude_cols = NULL,
  scale_method = "mad"
)

Arguments

data

A data frame containing the data to be preprocessed.

id_cols

Character vector of identifier column names to exclude from scoring (e.g., patient IDs, encounter IDs). If NULL, the function attempts to detect ID columns automatically by matching common ID naming patterns.
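The exact patterns used for auto-detection are not listed here. As a rough illustration only, name-based detection could look like the sketch below. The helper guess_id_cols() and its regular expression are hypothetical and are not part of the package.

```r
# Hypothetical helper (an assumption for illustration, not the package's code):
# returns column names in which "id" appears as a whole word delimited by
# underscores or by the start/end of the name.
guess_id_cols <- function(data, pattern = "(^|_)id($|_)") {
  grep(pattern, names(data), ignore.case = TRUE, value = TRUE)
}

df <- data.frame(
  patient_id   = 1:3,
  encounter_id = 4:6,
  age          = c(30, 41, 52)
)

# Candidate identifier columns found by name alone
detected <- guess_id_cols(df)
print(detected)
```

Name matching of this kind can miss unusually named identifiers, so passing id_cols explicitly is the safer choice when the ID columns are known.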

exclude_cols

Character vector of additional columns to exclude from scoring. Default is NULL.

scale_method

Character string indicating the scaling method for numerical variables. Options: "mad" (Median Absolute Deviation, default), "minmax" (min-max normalization), or "none" (no scaling).
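To make the three options concrete, here is a minimal base R sketch of what each one conventionally computes. The helper scale_numeric() is hypothetical. In particular, whether the package applies the 1.4826 consistency constant that stats::mad() uses by default, and how it treats constant columns, are assumptions.

```r
# Hypothetical helper (not the package's implementation): scales one numeric vector.
scale_numeric <- function(x, method = c("mad", "minmax", "none")) {
  method <- match.arg(method)
  if (method == "mad") {
    center <- stats::median(x, na.rm = TRUE)
    # stats::mad() multiplies by 1.4826 by default (assumed here)
    spread <- stats::mad(x, center = center, na.rm = TRUE)
    # Guard against division by zero for (near-)constant columns
    if (is.na(spread) || spread == 0) return(x - center)
    (x - center) / spread
  } else if (method == "minmax") {
    rng <- range(x, na.rm = TRUE)
    # A constant column maps to 0 rather than NaN
    if (rng[1] == rng[2]) return(x - rng[1])
    (x - rng[1]) / (rng[2] - rng[1])
  } else {
    x
  }
}

x <- c(10, 12, 11, 13, 12, 200)  # one extreme value
mad_scaled    <- scale_numeric(x, "mad")
minmax_scaled <- scale_numeric(x, "minmax")
unscaled      <- scale_numeric(x, "none")
```

Because the median and MAD are barely affected by the extreme value, "mad" keeps the typical observations near zero and leaves the outlier far from them. Under "minmax" the same outlier defines the upper bound and compresses every other value toward 0.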

Value

A list containing:

prepared_data

A numeric matrix ready for anomaly detection

metadata

A list with mapping information:

  • original_data: The original data frame

  • id_cols: Column names used as identifiers

  • numeric_cols: Column names of numeric variables

  • categorical_cols: Column names of categorical variables

  • excluded_cols: Column names excluded from scoring
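The encoding applied to categorical variables is not specified on this page. As one possible illustration of how mixed columns can end up in a single numeric matrix, the sketch below uses indicator (one-hot) encoding via stats::model.matrix(). This is an assumption about the general technique, not a description of the function's internals.

```r
# Illustrative only: one-hot encode a factor alongside a numeric column.
df <- data.frame(
  age    = c(34, 58, 46),
  gender = factor(c("M", "F", "M"))
)

# "- 1" drops the intercept, so the first factor keeps one indicator
# column per level instead of losing a reference level.
encoded <- stats::model.matrix(~ . - 1, data = df)

print(colnames(encoded))
print(dim(encoded))
```

With several factors in the formula, only the first keeps all of its levels under this approach, so a real implementation would likely encode each categorical column separately.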

Examples

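# Fix the random seed so the simulated data are reproducible
set.seed(42)

data <- data.frame(
  patient_id = 1:20,
  age = rnorm(20, 50, 15),
  cost = rnorm(20, 10000, 5000),
  gender = sample(c("M", "F"), 20, replace = TRUE)
)

# patient_id is kept out of scoring; numeric columns use the default "mad" scaling
prep_result <- prep_for_anomaly(data, id_cols = "patient_id")

# Inspect the numeric matrix and the recorded column roles
str(prep_result$prepared_data)
prep_result$metadata$numeric_cols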