Score Anomalies Using Unsupervised Machine Learning

Usage

score_anomaly(
  data,
  method = "iforest",
  contamination = 0.05,
  ground_truth_col = NULL,
  id_cols = NULL,
  exclude_cols = NULL,
  ...
)

Arguments

data

A data frame containing the data to be scored.

method

Character string indicating the anomaly detection method. Options: "iforest" (Isolation Forest, default) or "lof" (Local Outlier Factor).

contamination

Numeric value between 0 and 1 indicating the expected proportion of anomalies in the data. Default is 0.05 (5%).

ground_truth_col

Character string naming a column in data that contains binary ground truth labels (0/1 or FALSE/TRUE) for known anomalies. If provided, benchmarking metrics will be calculated. Default is NULL.

id_cols

Character vector of column names to exclude from scoring. Passed to prep_for_anomaly().

exclude_cols

Character vector of additional columns to exclude. Passed to prep_for_anomaly().

...

Additional arguments passed to the underlying algorithm. For Isolation Forest: ntrees, sample_size, max_depth. For LOF: minPts (number of neighbors; deprecated k is converted to minPts).
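
A call that combines these arguments might look like the sketch below. The toy data frame and its column names (record_id, amount, duration, notes, is_anomaly) are hypothetical and not part of the package. The call is guarded with exists() so the snippet still runs when the package providing score_anomaly() is not attached.

```r
# Hypothetical toy data: 200 records, the last 10 shifted to act as anomalies.
set.seed(42)
n <- 200
toy <- data.frame(
  record_id  = seq_len(n),                              # identifier, not a feature
  amount     = c(rnorm(190, 100, 10), rnorm(10, 200, 10)),
  duration   = c(rnorm(190, 30, 5), rnorm(10, 5, 1)),
  notes      = "n/a",                                   # free text, excluded below
  is_anomaly = c(rep(0, 190), rep(1, 10))               # binary ground truth labels
)

# Only run the call if score_anomaly() is available in the session.
if (exists("score_anomaly", mode = "function")) {
  scored <- score_anomaly(
    toy,
    method           = "lof",
    contamination    = 0.05,
    ground_truth_col = "is_anomaly",
    id_cols          = "record_id",
    exclude_cols     = "notes",
    minPts           = 10   # forwarded through ... to the LOF algorithm
  )
  print(head(scored))
  print(attr(scored, "benchmark_metrics")$auc_roc)
}
```

Here contamination matches the share of labelled anomalies in the toy data (10 of 200), which is a choice made for illustration only; in practice the true proportion is usually unknown.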

Value

A data frame with the original data plus an anomaly_score column. If ground_truth_col is provided, the result includes an attribute benchmark_metrics containing: auc_roc (area under the ROC curve), auc_pr (area under the precision-recall curve), top_k_recall (list of recall values for the top K records: K = 10, 50, 100, 500), and contamination_rate (actual proportion flagged as anomalous).

Description

Calculates anomaly scores for each record using Isolation Forest or Local Outlier Factor algorithms. Optionally evaluates performance against ground truth labels for benchmarking.
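
To make the benchmarking metrics concrete, the following base R sketch computes top-K recall, AUC-ROC, and a realized contamination rate on a small set of made-up scores. It is not the package's implementation. The helper names top_k_recall_at() and auc_roc_rank() are hypothetical. The sketch also assumes that higher scores mean more anomalous records, and that records are flagged by a quantile cut-off, neither of which this page confirms.

```r
# Hypothetical scores and labels (assumption: higher score = more anomalous).
scores <- c(0.91, 0.15, 0.78, 0.30, 0.62, 0.05, 0.44, 0.85, 0.20, 0.10)
labels <- c(1,    0,    0,    0,    1,    0,    0,    1,    0,    0)

# Top-K recall: share of all true anomalies found among the K highest scores.
top_k_recall_at <- function(scores, labels, k) {
  top <- order(scores, decreasing = TRUE)[seq_len(min(k, length(scores)))]
  sum(labels[top]) / sum(labels)
}

# AUC-ROC via the rank-sum (Mann-Whitney) identity.
auc_roc_rank <- function(scores, labels) {
  r     <- rank(scores)            # average ranks, so ties are handled
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Flag records at or above the (1 - contamination) quantile of the scores
# (an assumed thresholding rule), then report the share actually flagged.
contamination      <- 0.3
flagged            <- scores >= quantile(scores, 1 - contamination)
contamination_rate <- mean(flagged)

recall_at_3 <- top_k_recall_at(scores, labels, 3)
recall_at_4 <- top_k_recall_at(scores, labels, 4)
auc         <- auc_roc_rank(scores, labels)
```

With ties or small samples, the realized contamination_rate can differ from the requested contamination, which is one reason to report both.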