autoFlagR is an R package for automated data quality auditing using unsupervised machine learning. It provides AI-driven anomaly detection, primarily designed for Electronic Health Records (EHR) data, with benchmarking capabilities to support validation and publication.
Features
- Automated Preprocessing: Handles identifiers, scales numerical features, and encodes categorical variables
- Multiple AI Algorithms: Supports Isolation Forest and Local Outlier Factor (LOF) methods
- Benchmarking Metrics: Calculates AUC-ROC, AUC-PR, and Top-K Recall when ground truth labels are available
- Professional Reports: Generates PDF, HTML, and DOCX reports with visualizations and prioritized audit listings
- Tidy Interface: Designed to work seamlessly with the tidyverse
Installation
install.packages("autoFlagR")Quick Start
library(autoFlagR)
library(dplyr)
# Example data
data <- data.frame(
  patient_id = 1:1000,
  age = rnorm(1000, 50, 15),
  cost = rnorm(1000, 10000, 5000),
  length_of_stay = rpois(1000, 5),
  gender = sample(c("M", "F"), 1000, replace = TRUE)
)
# Score anomalies
scored_data <- score_anomaly(data, method = "iforest", contamination = 0.05)
# Flag top anomalies
flagged_data <- flag_top_anomalies(scored_data, contamination = 0.05)
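# Alternative backend: Local Outlier Factor, the second method listed under
# Features. (The exact method string "lof" is an assumption; check ?score_anomaly.)
scored_data_lof <- score_anomaly(data, method = "lof", contamination = 0.05)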
# Generate comprehensive PDF report
generate_audit_report(data, filename = "my_audit_report", output_format = "pdf")
Core Functions
prep_for_anomaly()
Prepares data for anomaly detection by handling identifiers, scaling numerical features, and encoding categorical variables.
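A minimal sketch of running this step on its own, assuming it takes a data frame as its first argument and returns a model-ready data frame (the argument names and exact return format are assumptions; see the function documentation):
# Prepare the Quick Start data: identifiers handled, numeric columns scaled,
# categorical columns encoded (assumed behavior per the description above)
prepped <- prep_for_anomaly(data)
str(prepped)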
Benchmarking with Ground Truth
When you have labeled data (e.g., from a synthetic dataset with known errors), you can evaluate the performance of the anomaly detection:
# Create data with known errors
data <- data.frame(
  patient_id = 1:1000,
  age = rnorm(1000, 50, 15),
  cost = rnorm(1000, 10000, 5000),
  is_error = sample(c(0, 1), 1000, replace = TRUE, prob = c(0.95, 0.05))
)
# Score with ground truth
scored_data <- score_anomaly(data, ground_truth_col = "is_error")
# Extract benchmarking metrics
metrics <- extract_benchmark_metrics(scored_data)
print(metrics$auc_roc) # Area Under ROC Curve
print(metrics$auc_pr) # Area Under Precision-Recall Curve
print(metrics$top_k_recall) # Top-K Recall values
Report Contents
The generated PDF, HTML, or DOCX report includes:
- Executive Summary: Key metrics and overall anomaly rate
- Benchmarking Results: AUC-ROC, AUC-PR, and Top-K Recall (if ground truth provided)
- Anomaly Score Distribution: Histogram showing the distribution of scores
- Prioritized Audit Listing: Table of top N most anomalous records
- Bivariate Visualizations: Scatter plots highlighting anomalies
- Variable Distribution Comparisons: Histograms comparing normal vs. anomalous records
- Technical Appendix: Algorithm details and column information
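For example, assuming output_format accepts "html" and "docx" for the other two formats listed above, the Quick Start report could also be rendered as:
# Render the same audit as an HTML report (format string assumed to be "html")
generate_audit_report(data, filename = "my_audit_report", output_format = "html")
# Render as a Word document (format string assumed to be "docx")
generate_audit_report(data, filename = "my_audit_report", output_format = "docx")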
Contributing
Contributions are welcome! Please feel free to:
- Submit issues and bug reports
- Propose new features
- Submit pull requests
- Improve documentation
Contact
- Package Maintainer: Vikrant Dev Rathore
- GitHub: vikrant31/autoFlagR
- Issues: GitHub Issues
