- Pipeline: Handle NaNs, missing data, and duplicates.
- Fix at Source: Prefer physical improvements over digital cleaning.
- Normalization: Fit scaling parameters on the training set only and reuse them unchanged at inference.
- Label Uncertainty: Ground truth is often subjective.
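- Inter-annotator Variance: Disagreement between human annotators sets a practical ceiling on measurable model performance.
- Spatial Leakage: Split by physical sample, not by random row, so correlated regions of one sample never land in both training and test sets.
- Temporal Leakage: Split time-series data chronologically (train on the past, validate on the future), e.g. with rolling windows, never by random shuffle.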
- Information Leakage: Exclude features derived from targets.
- "Garbage In, Garbage Out".
- Cost of data acquisition vs. cost of bad labels.
- Systematic cleaning.
- Imputation strategies.
- Scalers (Min-Max, Z-score).
- Manual annotation pitfalls.
- Quantifying label uncertainty.
- Crowdsourcing vs. Experts.
- Test data influencing training.
- Spatial and Feature leakage.
- "Too good to be true" heuristic.
- Grouped K-Fold validation.
- Nested Cross-Validation.
- ML pipeline checklist.
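
As a minimal sketch of the grouped K-fold idea listed above, the following pure-Python helper keeps every physical sample entirely inside either the training or the test fold. The function name `grouped_k_fold` and the `sample_ids` data are hypothetical and chosen for illustration only; in practice a library routine such as scikit-learn's `GroupKFold` would typically be used instead.

```python
from collections import defaultdict


def grouped_k_fold(groups, n_splits=3):
    """Yield (train_idx, test_idx) pairs; each group is tested in exactly one fold."""
    # Collect the row indices that belong to each group (e.g. one physical sample).
    by_group = defaultdict(list)
    for idx, g in enumerate(groups):
        by_group[g].append(idx)

    unique = sorted(by_group)
    if n_splits > len(unique):
        raise ValueError("n_splits cannot exceed the number of groups")

    # Greedy balancing: place the largest groups first, each into the
    # fold that currently holds the fewest rows.
    folds = [[] for _ in range(n_splits)]
    for g in sorted(unique, key=lambda g: (-len(by_group[g]), g)):
        smallest = min(range(n_splits), key=lambda k: len(folds[k]))
        folds[smallest].extend(by_group[g])

    for k in range(n_splits):
        test = sorted(folds[k])
        train = sorted(i for j, f in enumerate(folds) if j != k for i in f)
        yield train, test


# Hypothetical data: six image patches cut from three physical samples.
sample_ids = ["A", "A", "B", "B", "C", "C"]
splits = list(grouped_k_fold(sample_ids, n_splits=3))

for train, test in splits:
    train_groups = {sample_ids[i] for i in train}
    test_groups = {sample_ids[i] for i in test}
    print(sorted(train_groups), "|", sorted(test_groups))
```

A plain random split of the six patches would very likely put patches from the same sample on both sides, which is the spatial leakage the outline warns about.
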
Summary for ML-PC Week 3:
- Emphasizes data integrity, cleaning, and normalization.
- Highlights materials annotation challenges like inter-annotator variance.
- Details data leakage risks from physical/spatial correlations.
- Introduces robust validation to estimate true generalization.
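
The normalization point can be illustrated with a short sketch, assuming a single numeric feature and Z-score scaling. The helper names `fit_zscore` and `apply_zscore` and the numbers are hypothetical; the idea shown is only that the mean and standard deviation come from the training data and are then reused, unchanged, on held-out data.

```python
from statistics import mean, pstdev


def fit_zscore(train_values):
    """Estimate scaling parameters from the training data only."""
    mu = mean(train_values)
    sigma = pstdev(train_values)
    if sigma == 0:
        sigma = 1.0  # constant feature: avoid division by zero
    return mu, sigma


def apply_zscore(values, mu, sigma):
    """Apply previously fitted parameters; nothing is re-estimated here."""
    return [(v - mu) / sigma for v in values]


# Hypothetical feature values.
train = [2.0, 4.0, 6.0, 8.0]
test = [5.0, 10.0]

mu, sigma = fit_zscore(train)                 # statistics from train only
train_scaled = apply_zscore(train, mu, sigma)
test_scaled = apply_zscore(test, mu, sigma)   # same mu and sigma reused
```

Fitting the scaler on the combined train and test values would let test statistics influence training, a mild form of the "test data influencing training" leakage listed in the outline.
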