ABSTRACT
Data quality challenges are the primary barrier to successful AI/ML implementation in industrial settings. Through analysis of high-profile failures, including IBM Watson for Oncology's $4B loss due to training on synthetic rather than real patient data and a global retailer's $20M forecasting project that collapsed because of fragmented data sources, this session examines the systemic issues that cause 70% of AI projects to fail specifically due to data quality problems.
Real-World Impact Analysis: Using housing price prediction as a practical example, we demonstrate the quantitative impact of data preprocessing on model performance. Our comparative analysis shows a 27.3% reduction in prediction error (RMSE, measured against a baseline model trained on raw data) when proper data cleaning, feature engineering, and preprocessing pipelines are implemented.
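As a rough illustration of how a relative RMSE reduction of this kind is computed, the sketch below compares a baseline and an improved set of predictions. The prices and predictions are made-up placeholder values, not the session's data, and the function names `rmse` and `rmse_reduction_pct` are hypothetical helpers introduced only for this example.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def rmse_reduction_pct(baseline_rmse, improved_rmse):
    """Percentage by which RMSE dropped relative to the baseline."""
    return 100.0 * (baseline_rmse - improved_rmse) / baseline_rmse


# Placeholder house prices (in thousands); assumed values for illustration only.
actual = [200.0, 310.0, 150.0, 420.0]
raw_preds = [240.0, 270.0, 190.0, 380.0]    # model trained on raw data
clean_preds = [230.0, 280.0, 180.0, 390.0]  # model trained on preprocessed data

baseline = rmse(actual, raw_preds)
improved = rmse(actual, clean_preds)
reduction = rmse_reduction_pct(baseline, improved)
print(f"baseline={baseline:.1f} improved={improved:.1f} reduction={reduction:.1f}%")
```

Because the reduction is expressed relative to the baseline, the same percentage can correspond to very different absolute error changes depending on the price scale of the dataset.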
Technical Deep Dive: The session includes a live demonstration of data preprocessing techniques in interactive notebooks, showing how:
- Missing value imputation strategies affect model reliability
- Feature engineering creates predictive signal from raw operational data
- Proper categorical encoding prevents information leakage
- Standardization improves convergence and model stability
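Three of the steps listed above (imputation, categorical encoding, and standardization) can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions, not the notebooks used in the session: the column names `sqft` and `zone`, the sample rows, and the helpers `fit_preprocessor` and `transform` are all hypothetical. The point it illustrates is that every statistic is learned from the training rows only and then reused unchanged on new rows, which is one common way to avoid leaking information from held-out data.

```python
import math
from statistics import median


def fit_preprocessor(train_rows, numeric_col, categorical_col):
    """Learn imputation, scaling, and encoding parameters from training rows only."""
    observed = [r[numeric_col] for r in train_rows if r[numeric_col] is not None]
    fill = median(observed)  # median imputation value
    filled = [r[numeric_col] if r[numeric_col] is not None else fill
              for r in train_rows]
    mean = sum(filled) / len(filled)
    variance = sum((v - mean) ** 2 for v in filled) / len(filled)
    std = math.sqrt(variance) or 1.0  # guard against a constant column
    categories = sorted({r[categorical_col] for r in train_rows})
    return {"numeric_col": numeric_col, "categorical_col": categorical_col,
            "fill": fill, "mean": mean, "std": std, "categories": categories}


def transform(rows, params):
    """Apply the learned parameters; nothing is re-estimated from `rows`."""
    features = []
    for r in rows:
        value = r[params["numeric_col"]]
        if value is None:
            value = params["fill"]
        scaled = (value - params["mean"]) / params["std"]
        # One-hot encode; categories unseen during fitting map to all zeros.
        one_hot = [1.0 if r[params["categorical_col"]] == c else 0.0
                   for c in params["categories"]]
        features.append([scaled] + one_hot)
    return features


# Hypothetical training and held-out rows for illustration only.
train = [{"sqft": 1000, "zone": "A"},
         {"sqft": None, "zone": "B"},
         {"sqft": 3000, "zone": "A"}]
held_out = [{"sqft": None, "zone": "C"}]

params = fit_preprocessor(train, "sqft", "zone")
train_features = transform(train, params)
held_out_features = transform(held_out, params)
```

In practice a library pipeline would usually replace these hand-written helpers; the fit-on-train, apply-everywhere split is the design choice that carries over.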
Solution Framework: The session concludes with a demonstration of Norma, an intelligent data preparation assistant that automates the preprocessing workflows shown to be critical for AI success, reducing data preparation time by 90% while maintaining the quality standards demonstrated in our comparative analysis.
Key Outcomes: Attendees will understand the quantifiable business impact of data quality, see evidence-based preprocessing techniques, and learn practical implementation strategies for improving AI project success rates in industrial environments.
PRESENTERS
Eugene Paulia, Noel Thomas, Sam Moses
Eugene Paulia is a data scientist and co-founder at GroupLabs, where he helps organizations prepare their data for advanced analytics and intelligent systems. With experience in energy, enterprise software, and large-scale data infrastructure, Eugene focuses on making data cleaner, smarter, and more impactful.
Noel Thomas is a machine learning researcher trained in software and biomedical engineering, with a passion for interdisciplinary innovation. He is a founder of GroupLabs, where he focuses on cutting-edge applications of machine learning.
Sam Moses is a software engineer at GroupLabs, passionate about building creative, AI-driven educational tools and interactive web experiences. With a background in data science and a flair for imaginative design, Sam blends technical expertise with innovation to make learning more engaging and impactful.