Apache Airflow is widely used to orchestrate ETL pipelines, but failure handling in large-scale environments remains largely reactive. While Airflow provides strong scheduling and execution primitives, identifying root causes and detecting silent data issues still requires significant manual effort.
This article describes an approach, implemented in a production data platform, that improves failure detection and diagnosis by combining large language models (LLMs), statistical methods, and traditional machine learning. The system focuses on three areas: log-based failure classification, data integrity anomaly detection, and predictive failure modeling.