Adaptive Data Enrichment Pre-processing System (ADEPS) for Duplicate Detection, Outlier Handling, Imputation, and Encoding
Keywords:
Machine learning, duplicate detection, outlier handling, imputation, categorical encoding
Abstract
The Adaptive Data Enrichment Pre-processing System (ADEPS) is a modular framework for improving data quality ahead of analytical and machine learning tasks. ADEPS integrates four preprocessing functions: duplicate detection, outlier handling, imputation, and categorical encoding. Each component targets a common data quality problem that can degrade model accuracy and reliability. Duplicate detection uses similarity algorithms to identify redundant entries and preserve dataset integrity. Outlier handling combines clustering and normalization techniques to identify and process anomalies. Missing values are filled by an enhanced imputation method based on MICE (Multiple Imputation by Chained Equations), which uses adaptive modeling with error terms. Categorical encoding techniques such as Target Encoding convert high-cardinality categorical data into numeric form suitable for learning algorithms. By delivering a cleaned and enriched dataset, ADEPS is intended to improve the performance of downstream analysis and predictive modeling. Because the design is modular, each component can be adjusted to the data type, the available resources, and the needs of the analysis, which makes the framework applicable across a wide range of use cases.
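To make two of the four stages concrete, the following is a minimal sketch and not the ADEPS implementation. It assumes string similarity from Python's standard `difflib.SequenceMatcher` for duplicate detection and a smoothed-mean form of target encoding. The function names, the 0.9 similarity threshold, the smoothing weight, and the toy records are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def drop_near_duplicates(rows, key, threshold=0.9):
    """Keep the first row of any group whose `key` strings are
    at least `threshold` similar (hypothetical helper)."""
    kept = []
    for row in rows:
        text = row[key].strip().lower()
        is_dup = any(
            SequenceMatcher(None, text, k[key].strip().lower()).ratio() >= threshold
            for k in kept
        )
        if not is_dup:
            kept.append(row)
    return kept


def target_encode(rows, cat_col, target_col, smoothing=1.0):
    """Map each category to a mean of the target that is shrunk toward
    the global mean (assumed smoothing form, not taken from ADEPS)."""
    global_mean = sum(r[target_col] for r in rows) / len(rows)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for r in rows:
        sums[r[cat_col]] += r[target_col]
        counts[r[cat_col]] += 1
    return {
        c: (sums[c] + smoothing * global_mean) / (counts[c] + smoothing)
        for c in counts
    }


# Toy records, invented for this illustration only.
rows = [
    {"name": "Acme Corp", "city": "Oslo", "churned": 1},
    {"name": "acme corp ", "city": "Oslo", "churned": 1},  # near-duplicate of the first row
    {"name": "Globex", "city": "Oslo", "churned": 0},
    {"name": "Initech", "city": "Bergen", "churned": 1},
]

deduped = drop_near_duplicates(rows, key="name")
encoding = target_encode(deduped, "city", "churned", smoothing=1.0)
print(len(deduped), encoding)
```

Running deduplication before encoding matters in this sketch: a repeated record would otherwise count twice toward its category's target mean. In practice the encoding should be fitted on training data only, since computing it on the full dataset leaks the target into the features.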