

autoFlagR is an R package for automated data quality auditing using unsupervised machine learning. It applies anomaly detection to surface suspect records, is designed primarily for Electronic Health Records (EHR) data, and includes benchmarking capabilities to support validation and publication.

Features

  • Automated Preprocessing: Handles identifiers, scales numerical features, and encodes categorical variables
  • Multiple AI Algorithms: Supports Isolation Forest and Local Outlier Factor (LOF) methods
  • Benchmarking Metrics: Calculates AUC-ROC, AUC-PR, and Top-K Recall when ground truth labels are available
  • Professional Reports: Generates PDF, HTML, and DOCX reports with visualizations and prioritized audit listings
  • Tidy Interface: Designed to work seamlessly with the tidyverse (see the pipeline sketch after this list)
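
Assuming each step returns a data frame, as the Quick Start usage below suggests, the core functions compose naturally in a pipe. A minimal sketch, reusing the example data and arguments from Quick Start:

library(autoFlagR)
library(dplyr)

# Score, then flag, in a single pipeline; `data` is the
# example data frame constructed in Quick Start below.
flagged <- data %>%
  score_anomaly(method = "iforest", contamination = 0.05) %>%
  flag_top_anomalies(contamination = 0.05)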

Installation

install.packages("autoFlagR")

Quick Start

library(autoFlagR)
library(dplyr)

# Example data (seeded so the example is reproducible)
set.seed(123)
data <- data.frame(
  patient_id = 1:1000,
  age = rnorm(1000, 50, 15),
  cost = rnorm(1000, 10000, 5000),
  length_of_stay = rpois(1000, 5),
  gender = sample(c("M", "F"), 1000, replace = TRUE)
)

# Score anomalies
scored_data <- score_anomaly(data, method = "iforest", contamination = 0.05)

# Flag top anomalies
flagged_data <- flag_top_anomalies(scored_data, contamination = 0.05)

# Generate comprehensive PDF report
generate_audit_report(data, filename = "my_audit_report", output_format = "pdf")

Core Functions

prep_for_anomaly()

Prepares data for anomaly detection by handling identifiers, scaling numerical features, and encoding categorical variables.
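
A minimal sketch of calling it directly; the exact arguments are not shown here, so see ?prep_for_anomaly for the full signature:

# Set aside identifiers (e.g., patient_id), scale numeric
# columns, and encode categoricals ahead of scoring.
prepped <- prep_for_anomaly(data)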

score_anomaly()

Calculates anomaly scores using Isolation Forest or Local Outlier Factor algorithms. Optionally evaluates performance against ground truth labels.
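
For example, scoring the same data with both methods (the method = "lof" value is an assumption mirroring "iforest"; see ?score_anomaly for the accepted values):

# Isolation Forest scores
scored_iforest <- score_anomaly(data, method = "iforest", contamination = 0.05)

# Local Outlier Factor scores on the same data, for comparison
scored_lof <- score_anomaly(data, method = "lof", contamination = 0.05)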

flag_top_anomalies()

Categorizes records as anomalous or normal based on their anomaly scores.
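
Continuing the Quick Start example; the name of the flag column that gets appended is not documented here, so inspect the result:

flagged_data <- flag_top_anomalies(scored_data, contamination = 0.05)

# Confirm which score/flag columns were added
names(flagged_data)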

generate_audit_report()

Executes the complete pipeline and generates a professional PDF, HTML, or DOCX report with visualizations and prioritized audit listings.
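
The Quick Start call can target the other formats; the "html" and "docx" values below are assumed to mirror the "pdf" value shown earlier:

# Write an HTML report instead of a PDF
generate_audit_report(data, filename = "my_audit_report",
                      output_format = "html")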

Benchmarking with Ground Truth

When you have labeled data (e.g., a synthetic dataset with known errors), you can evaluate how well the anomaly detection recovers those errors:

# Create data with known errors (seeded for reproducibility)
set.seed(123)
data <- data.frame(
  patient_id = 1:1000,
  age = rnorm(1000, 50, 15),
  cost = rnorm(1000, 10000, 5000),
  is_error = sample(c(0, 1), 1000, replace = TRUE, prob = c(0.95, 0.05))
)

# Score with ground truth
scored_data <- score_anomaly(data, ground_truth_col = "is_error")

# Extract benchmarking metrics
metrics <- extract_benchmark_metrics(scored_data)
print(metrics$auc_roc)  # Area Under ROC Curve
print(metrics$auc_pr)   # Area Under Precision-Recall Curve
print(metrics$top_k_recall)  # Top-K Recall values
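
With rare errors like the 5% simulated above, AUC-PR and Top-K Recall are generally more informative than AUC-ROC, which can look optimistic under heavy class imbalance.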

Report Contents

The generated PDF, HTML, or DOCX report includes:

  1. Executive Summary: Key metrics and overall anomaly rate
  2. Benchmarking Results: AUC-ROC, AUC-PR, and Top-K Recall (if ground truth provided)
  3. Anomaly Score Distribution: Histogram showing the distribution of scores
  4. Prioritized Audit Listing: Table of top N most anomalous records
  5. Bivariate Visualizations: Scatter plots highlighting anomalies
  6. Variable Distribution Comparisons: Histograms comparing normal vs. anomalous records
  7. Technical Appendix: Algorithm details and column information

Citation

To cite autoFlagR in publications, use:

citation("autoFlagR")

License

MIT License

Contributing

Contributions are welcome! Please feel free to:

  • Submit issues and bug reports
  • Propose new features
  • Submit pull requests
  • Improve documentation

Contact