autoFlagR is an R package for automated data quality auditing using unsupervised machine learning. It provides AI-driven anomaly detection, primarily designed for Electronic Health Records (EHR) data, with benchmarking capabilities to support validation and publication.
Features
- Automated Preprocessing: Handles identifiers, scales numerical features, and encodes categorical variables
- Multiple AI Algorithms: Supports Isolation Forest and Local Outlier Factor (LOF) methods
- Benchmarking Metrics: Calculates AUC-ROC, AUC-PR, and Top-K Recall when ground truth labels are available
- Professional Reports: Generates PDF, HTML, and DOCX reports with visualizations and prioritized audit listings
- Tidy Interface: Designed to work seamlessly with the tidyverse
Installation
install.packages("autoFlagR")Quick Start
library(autoFlagR)
library(dplyr)
# Example data
data <- data.frame(
  patient_id = 1:1000,
  age = rnorm(1000, 50, 15),
  cost = rnorm(1000, 10000, 5000),
  length_of_stay = rpois(1000, 5),
  gender = sample(c("M", "F"), 1000, replace = TRUE)
)
# Score anomalies
scored_data <- score_anomaly(data, method = "iforest", contamination = 0.05)
# Flag top anomalies
flagged_data <- flag_top_anomalies(scored_data, contamination = 0.05)
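# Alternative backend: Local Outlier Factor, the second method listed under
# Features. (The exact method string "lof" is an assumption; check ?score_anomaly.)
scored_data_lof <- score_anomaly(data, method = "lof", contamination = 0.05)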
# Generate comprehensive PDF report
generate_audit_report(data, filename = "my_audit_report", output_format = "pdf")
Core Functions
prep_for_anomaly()
Prepares data for anomaly detection by handling identifiers, scaling numerical features, and encoding categorical variables.
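A minimal sketch of running this step on its own, assuming it takes a data frame as its first argument and returns a model-ready data frame (the argument names and exact return format are assumptions; see the function documentation):
# Prepare the Quick Start data: identifiers handled, numeric columns scaled,
# categorical columns encoded (assumed behavior per the description above)
prepped <- prep_for_anomaly(data)
str(prepped)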
Benchmarking with Ground Truth
When you have labeled data (e.g., from a synthetic dataset with known errors), you can evaluate the performance of the anomaly detection:
# Create data with known errors
data <- data.frame(
  patient_id = 1:1000,
  age = rnorm(1000, 50, 15),
  cost = rnorm(1000, 10000, 5000),
  is_error = sample(c(0, 1), 1000, replace = TRUE, prob = c(0.95, 0.05))
)
# Score with ground truth
scored_data <- score_anomaly(data, ground_truth_col = "is_error")
# Extract benchmarking metrics
metrics <- extract_benchmark_metrics(scored_data)
print(metrics$auc_roc) # Area Under ROC Curve
print(metrics$auc_pr) # Area Under Precision-Recall Curve
print(metrics$top_k_recall) # Top-K Recall values
Report Contents
The generated PDF, HTML, or DOCX report includes:
- Executive Summary: Key metrics and overall anomaly rate
- Benchmarking Results: AUC-ROC, AUC-PR, and Top-K Recall (if ground truth provided)
- Anomaly Score Distribution: Histogram showing the distribution of scores
- Prioritized Audit Listing: Table of top N most anomalous records
- Bivariate Visualizations: Scatter plots highlighting anomalies
- Variable Distribution Comparisons: Histograms comparing normal vs. anomalous records
- Technical Appendix: Algorithm details and column information
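For example, assuming output_format accepts "html" and "docx" for the other two formats listed above, the Quick Start report could also be rendered as:
# Render the same audit as an HTML report (format string assumed to be "html")
generate_audit_report(data, filename = "my_audit_report", output_format = "html")
# Render as a Word document (format string assumed to be "docx")
generate_audit_report(data, filename = "my_audit_report", output_format = "docx")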
Contributing
Contributions are welcome! Please feel free to:
- Submit issues and bug reports
- Propose new features
- Submit pull requests
- Improve documentation
Contact
- Package Maintainer: Vikrant Dev Rathore
- GitHub: vikrant31/autoFlagR
- Issues: GitHub Issues
