Commit 1f6a3eb

Quarto GHA Workflow Runner committed: Built site for gh-pages
Parent: a3f69bf

29 files changed: 171 additions & 308 deletions

.nojekyll

Lines changed: 1 addition & 1 deletion

@@ -1 +1 @@
-01bdab7e
+b00719d5

_tex/index.tex

Lines changed: 22 additions & 8 deletions

@@ -350,6 +350,9 @@ \subsubsection{Unit I --- Experimental Data as a Learning Problem (Weeks
 \emph{Lecture: Tuesday, 28.04.2026, 14:15-15:45 \textbar{} Exercise:
 Thursday, 30.04.2026, 16:15-17:45}
 
+\textbf{Slides:}
+\href{https://pelzlab.science/public_presentations/ml_for_characterization_and_processing/unit03_data_quality/01_intro.html}{Open}
+
 \begin{itemize}
 \tightlist
 \item
@@ -360,14 +363,25 @@ \subsubsection{Unit I --- Experimental Data as a Learning Problem (Weeks
 Why ``good accuracy'' often means a broken pipeline.
 \end{itemize}
 
-\textbf{Summary:} This unit focuses on the most critical and often
-overlooked part of the ML pipeline: data integrity. We discuss
-systematic data cleaning and normalization techniques while highlighting
-the unique challenges of labeling experimental materials data, such as
-inter-annotator variance. A major focus is on \textbf{Data Leakage},
-specifically how spatial and physical correlations in materials samples
-can lead to deceptively high model performance. We introduce robust
-validation strategies to ensure models generalize to truly unseen data.
+\textbf{Summary:} This unit covers the often-overlooked half of an ML
+pipeline: data integrity, validation, and how performance is measured.
+We start with the measurement chain and systematic \textbf{data
+cleaning} --- handling missing values, outliers, and duplicates with a
+``fix at source'' mindset. We then build the \textbf{transformation
+toolbox}: centering, min--max and z-score scaling, physics-aware
+non-dimensionalisation, log transforms, differentiation, and
+frequency-domain views (FFT, triggering for time series). On the
+supervision side we examine \textbf{labels and uncertainty} ---
+inter-annotator variance, probabilistic labels, and a Bayesian view of
+priors, likelihoods, and posteriors --- and then formalize the
+\textbf{bias--variance} tradeoff with parsimony and regularization. A
+major focus is \textbf{Data Leakage} in materials workflows
+(pre-processing, temporal, and group/spatial), tackled with proper
+holdout, K-fold, LOOCV, and stratified validation. We close with the
+\textbf{error measures} that decide what ``good'' actually means:
+MAE/MSE/RMSE and \(R^2\) for regression, and confusion matrices,
+precision/recall, F1/Dice, IoU, and categorical cross-entropy for
+classification and segmentation.
 
 \textbf{Exercise:}\\
 Construct a deliberately flawed ML pipeline and diagnose its failure.
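The new summary above highlights pre-processing leakage alongside z-score scaling: test-set statistics must never inform the transform. As a minimal, hedged sketch of that idea (the function names below are invented for illustration and are not from the course repository):

```python
# Leakage-safe z-score scaling: statistics come from the TRAINING split
# only, then get applied unchanged to every other split.
from statistics import mean, pstdev

def fit_scaler(train):
    """Estimate mean and std on the training split only."""
    mu = mean(train)
    sigma = pstdev(train) or 1.0  # guard against constant features
    return mu, sigma

def transform(values, mu, sigma):
    """Apply the training-set statistics to any split."""
    return [(v - mu) / sigma for v in values]

train = [2.0, 4.0, 6.0, 8.0]
test = [5.0]

mu, sigma = fit_scaler(train)          # fit on train only ...
train_z = transform(train, mu, sigma)  # ... then transform both splits
test_z = transform(test, mu, sigma)    # no test statistics leak in
```

Fitting the scaler on the full dataset instead would be exactly the kind of deceptively harmless step the exercise's "deliberately flawed pipeline" is meant to expose.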

index-meca.zip

-233 KB
Binary file not shown.

index-preview.html

Lines changed: 3 additions & 2 deletions

@@ -153,7 +153,7 @@
 window.document.addEventListener("DOMContentLoaded", function (_event) {
 document.body.classList.add('hypothesis-enabled');
 });
-</script>
+</script> <script defer="" src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-chtml-full.js" type="text/javascript"></script>
 <link rel="stylesheet" href="styles.css">
 <meta name="citation_title" content="Machine Learning in Materials Processing &amp; Characterization">
 <meta name="citation_abstract" content="This course teaches how machine learning can be applied to experimental data
@@ -401,12 +401,13 @@ <h4 data-number="1.3.1.2" class="anchored" data-anchor-id="week-2-physics-of-dat
 <section id="week-3-data-quality-labels-and-leakage" class="level4" data-number="1.3.1.3">
 <h4 data-number="1.3.1.3" class="anchored" data-anchor-id="week-3-data-quality-labels-and-leakage"><span class="header-section-number">1.3.1.3</span> Week 3 – Data quality, labels, and leakage</h4>
 <p><em>Lecture: Tuesday, 28.04.2026, 14:15-15:45 | Exercise: Thursday, 30.04.2026, 16:15-17:45</em></p>
+<p><strong>Slides:</strong> <a href="https://pelzlab.science/public_presentations/ml_for_characterization_and_processing/unit03_data_quality/01_intro.html">Open</a></p>
 <ul>
 <li>Annotation uncertainty and inter-annotator variance.</li>
 <li>Train/test leakage in materials workflows.</li>
 <li>Why “good accuracy” often means a broken pipeline.</li>
 </ul>
-<p><strong>Summary:</strong> This unit focuses on the most critical and often overlooked part of the ML pipeline: data integrity. We discuss systematic data cleaning and normalization techniques while highlighting the unique challenges of labeling experimental materials data, such as inter-annotator variance. A major focus is on <strong>Data Leakage</strong>, specifically how spatial and physical correlations in materials samples can lead to deceptively high model performance. We introduce robust validation strategies to ensure models generalize to truly unseen data.</p>
+<p><strong>Summary:</strong> This unit covers the often-overlooked half of an ML pipeline: data integrity, validation, and how performance is measured. We start with the measurement chain and systematic <strong>data cleaning</strong> — handling missing values, outliers, and duplicates with a “fix at source” mindset. We then build the <strong>transformation toolbox</strong>: centering, min–max and z-score scaling, physics-aware non-dimensionalisation, log transforms, differentiation, and frequency-domain views (FFT, triggering for time series). On the supervision side we examine <strong>labels and uncertainty</strong> — inter-annotator variance, probabilistic labels, and a Bayesian view of priors, likelihoods, and posteriors — and then formalize the <strong>bias–variance</strong> tradeoff with parsimony and regularization. A major focus is <strong>Data Leakage</strong> in materials workflows (pre-processing, temporal, and group/spatial), tackled with proper holdout, K-fold, LOOCV, and stratified validation. We close with the <strong>error measures</strong> that decide what “good” actually means: MAE/MSE/RMSE and <span class="math inline">\(R^2\)</span> for regression, and confusion matrices, precision/recall, F1/Dice, IoU, and categorical cross-entropy for classification and segmentation.</p>
 <p><strong>Exercise:</strong><br>
 Construct a deliberately flawed ML pipeline and diagnose its failure.</p>
 <hr>

index.docx

489 Bytes
Binary file not shown.

index.embed.ipynb

Lines changed: 7 additions & 5 deletions

@@ -13,7 +13,7 @@
 "\n",
 "This course teaches how machine learning can be applied to experimental data from materials processing and characterization. The focus lies on images, spectra, time-series, and processing parameters, and on understanding how physical data formation interacts with learning algorithms. Students learn to build robust, uncertainty-aware ML pipelines for real experimental workflows, avoiding common pitfalls such as data leakage, overfitting, and spurious correlations."
 ],
-"id": "f51de559-b328-412e-a061-d83943d738fd"
+"id": "ec8e522f-188b-40d4-8405-b13ac06c92bf"
 },
 {
 "cell_type": "raw",
@@ -76,7 +76,7 @@
 "}\n",
 "</style>"
 ],
-"id": "4d0b7fd1-a823-4047-905a-375f327a501c"
+"id": "91fb63c7-5649-4983-b27c-f804838b71f7"
 },
 {
 "cell_type": "raw",
@@ -106,7 +106,7 @@
 " <strong>How to use this course site.</strong> Use this page as the central hub for syllabus, lecture structure, reading, notebooks, and course materials. Formal announcements and enrollment remain on StudOn; code and openly shared resources live in the linked GitHub repository.\n",
 "</div>"
 ],
-"id": "df4fb92a-b630-4c8a-9316-7f88737ed552"
+"id": "927882ed-8b4d-4699-aede-5b041368f773"
 },
 {
 "cell_type": "markdown",
@@ -176,11 +176,13 @@
 "\n",
 "*Lecture: Tuesday, 28.04.2026, 14:15-15:45 \\| Exercise: Thursday, 30.04.2026, 16:15-17:45*\n",
 "\n",
+"**Slides:** [Open](https://pelzlab.science/public_presentations/ml_for_characterization_and_processing/unit03_data_quality/01_intro.html)\n",
+"\n",
 "- Annotation uncertainty and inter-annotator variance.\n",
 "- Train/test leakage in materials workflows.\n",
 "- Why “good accuracy” often means a broken pipeline.\n",
 "\n",
-"**Summary:** This unit focuses on the most critical and often overlooked part of the ML pipeline: data integrity. We discuss systematic data cleaning and normalization techniques while highlighting the unique challenges of labeling experimental materials data, such as inter-annotator variance. A major focus is on **Data Leakage**, specifically how spatial and physical correlations in materials samples can lead to deceptively high model performance. We introduce robust validation strategies to ensure models generalize to truly unseen data.\n",
+"**Summary:** This unit covers the often-overlooked half of an ML pipeline: data integrity, validation, and how performance is measured. We start with the measurement chain and systematic **data cleaning** — handling missing values, outliers, and duplicates with a “fix at source” mindset. We then build the **transformation toolbox**: centering, min–max and z-score scaling, physics-aware non-dimensionalisation, log transforms, differentiation, and frequency-domain views (FFT, triggering for time series). On the supervision side we examine **labels and uncertainty** — inter-annotator variance, probabilistic labels, and a Bayesian view of priors, likelihoods, and posteriors — and then formalize the **bias–variance** tradeoff with parsimony and regularization. A major focus is **Data Leakage** in materials workflows (pre-processing, temporal, and group/spatial), tackled with proper holdout, K-fold, LOOCV, and stratified validation. We close with the **error measures** that decide what “good” actually means: MAE/MSE/RMSE and $R^2$ for regression, and confusion matrices, precision/recall, F1/Dice, IoU, and categorical cross-entropy for classification and segmentation.\n",
 "\n",
 "**Exercise:** \n",
 "Construct a deliberately flawed ML pipeline and diagnose its failure.\n",
@@ -383,7 +385,7 @@
 "\n",
 "**Summary:** This unit explores the cutting edge of **Autonomous Characterization**, where machine learning moves from passive data analysis to active instrument control. We introduce **Multi-Modal Data Fusion** techniques to combine information from diverse sensors like SEM images, EDS spectra, and process logs using Bayesian frameworks. We then discuss **Reinforcement Learning (RL)** as a tool for automating complex laboratory tasks, such as instrument tuning and process optimization. Through case studies in microscopy and industrial processing, students learn how to build integrated pipelines that can autonomously find, characterize, and decide the next steps of an experiment."
 ],
-"id": "56ea3dd0-e7f8-4777-b8b1-7b73ce006df2"
+"id": "80a417a5-c2df-4d28-9de8-2be3b782a69e"
 }
 ],
 "nbformat": 4,

index.html

Lines changed: 31 additions & 1 deletion

@@ -66,6 +66,35 @@
 });
 </script>
 
+<script src="https://cdnjs.cloudflare.com/polyfill/v3/polyfill.min.js?features=es6"></script>
+<script defer="" src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-chtml-full.js" type="text/javascript"></script>
+
+<script type="text/javascript">
+const typesetMath = (el) => {
+  if (window.MathJax) {
+    // MathJax Typeset
+    window.MathJax.typeset([el]);
+  } else if (window.katex) {
+    // KaTeX Render
+    var mathElements = el.getElementsByClassName("math");
+    var macros = [];
+    for (var i = 0; i < mathElements.length; i++) {
+      var texText = mathElements[i].firstChild;
+      if (mathElements[i].tagName == "SPAN" && texText && texText.data) {
+        window.katex.render(texText.data, mathElements[i], {
+          displayMode: mathElements[i].classList.contains('display'),
+          throwOnError: false,
+          macros: macros,
+          fleqn: false
+        });
+      }
+    }
+  }
+}
+window.Quarto = {
+  typesetMath
+};
+</script>
 
 <link rel="stylesheet" href="styles.css">
 <meta name="citation_title" content="Machine Learning in Materials Processing &amp; Characterization">
@@ -312,12 +341,13 @@ <h4 data-number="1.3.1.2" class="anchored" data-anchor-id="week-2-physics-of-dat
 <section id="week-3-data-quality-labels-and-leakage" class="level4" data-number="1.3.1.3">
 <h4 data-number="1.3.1.3" class="anchored" data-anchor-id="week-3-data-quality-labels-and-leakage"><span class="header-section-number">1.3.1.3</span> Week 3 – Data quality, labels, and leakage</h4>
 <p><em>Lecture: Tuesday, 28.04.2026, 14:15-15:45 | Exercise: Thursday, 30.04.2026, 16:15-17:45</em></p>
+<p><strong>Slides:</strong> <a href="https://pelzlab.science/public_presentations/ml_for_characterization_and_processing/unit03_data_quality/01_intro.html">Open</a></p>
 <ul>
 <li>Annotation uncertainty and inter-annotator variance.</li>
 <li>Train/test leakage in materials workflows.</li>
 <li>Why “good accuracy” often means a broken pipeline.</li>
 </ul>
-<p><strong>Summary:</strong> This unit focuses on the most critical and often overlooked part of the ML pipeline: data integrity. We discuss systematic data cleaning and normalization techniques while highlighting the unique challenges of labeling experimental materials data, such as inter-annotator variance. A major focus is on <strong>Data Leakage</strong>, specifically how spatial and physical correlations in materials samples can lead to deceptively high model performance. We introduce robust validation strategies to ensure models generalize to truly unseen data.</p>
+<p><strong>Summary:</strong> This unit covers the often-overlooked half of an ML pipeline: data integrity, validation, and how performance is measured. We start with the measurement chain and systematic <strong>data cleaning</strong> — handling missing values, outliers, and duplicates with a “fix at source” mindset. We then build the <strong>transformation toolbox</strong>: centering, min–max and z-score scaling, physics-aware non-dimensionalisation, log transforms, differentiation, and frequency-domain views (FFT, triggering for time series). On the supervision side we examine <strong>labels and uncertainty</strong> — inter-annotator variance, probabilistic labels, and a Bayesian view of priors, likelihoods, and posteriors — and then formalize the <strong>bias–variance</strong> tradeoff with parsimony and regularization. A major focus is <strong>Data Leakage</strong> in materials workflows (pre-processing, temporal, and group/spatial), tackled with proper holdout, K-fold, LOOCV, and stratified validation. We close with the <strong>error measures</strong> that decide what “good” actually means: MAE/MSE/RMSE and <span class="math inline">\(R^2\)</span> for regression, and confusion matrices, precision/recall, F1/Dice, IoU, and categorical cross-entropy for classification and segmentation.</p>
 <p><strong>Exercise:</strong><br>
 Construct a deliberately flawed ML pipeline and diagnose its failure.</p>
 <hr>

index.out.ipynb

Lines changed: 7 additions & 5 deletions

@@ -13,7 +13,7 @@
 "\n",
 "This course teaches how machine learning can be applied to experimental data from materials processing and characterization. The focus lies on images, spectra, time-series, and processing parameters, and on understanding how physical data formation interacts with learning algorithms. Students learn to build robust, uncertainty-aware ML pipelines for real experimental workflows, avoiding common pitfalls such as data leakage, overfitting, and spurious correlations."
 ],
-"id": "b73b41b1-31ba-4784-92f9-8084868a8922"
+"id": "f23b30ad-8a34-41cd-bf2e-e8074d2ca14b"
 },
 {
 "cell_type": "raw",
@@ -76,7 +76,7 @@
 "}\n",
 "</style>"
 ],
-"id": "fe078047-86ca-4ab4-8bdf-248b53f0e63e"
+"id": "99dc06b5-04b5-4a15-b07b-c8f865219c08"
 },
 {
 "cell_type": "raw",
@@ -106,7 +106,7 @@
 " <strong>How to use this course site.</strong> Use this page as the central hub for syllabus, lecture structure, reading, notebooks, and course materials. Formal announcements and enrollment remain on StudOn; code and openly shared resources live in the linked GitHub repository.\n",
 "</div>"
 ],
-"id": "6ffb6083-0168-465b-b006-157d7832997e"
+"id": "8cd1a24b-7f9c-4901-82b2-a1f956ed0dda"
 },
 {
 "cell_type": "markdown",
@@ -176,11 +176,13 @@
 "\n",
 "*Lecture: Tuesday, 28.04.2026, 14:15-15:45 \\| Exercise: Thursday, 30.04.2026, 16:15-17:45*\n",
 "\n",
+"**Slides:** [Open](https://pelzlab.science/public_presentations/ml_for_characterization_and_processing/unit03_data_quality/01_intro.html)\n",
+"\n",
 "- Annotation uncertainty and inter-annotator variance.\n",
 "- Train/test leakage in materials workflows.\n",
 "- Why “good accuracy” often means a broken pipeline.\n",
 "\n",
-"**Summary:** This unit focuses on the most critical and often overlooked part of the ML pipeline: data integrity. We discuss systematic data cleaning and normalization techniques while highlighting the unique challenges of labeling experimental materials data, such as inter-annotator variance. A major focus is on **Data Leakage**, specifically how spatial and physical correlations in materials samples can lead to deceptively high model performance. We introduce robust validation strategies to ensure models generalize to truly unseen data.\n",
+"**Summary:** This unit covers the often-overlooked half of an ML pipeline: data integrity, validation, and how performance is measured. We start with the measurement chain and systematic **data cleaning** — handling missing values, outliers, and duplicates with a “fix at source” mindset. We then build the **transformation toolbox**: centering, min–max and z-score scaling, physics-aware non-dimensionalisation, log transforms, differentiation, and frequency-domain views (FFT, triggering for time series). On the supervision side we examine **labels and uncertainty** — inter-annotator variance, probabilistic labels, and a Bayesian view of priors, likelihoods, and posteriors — and then formalize the **bias–variance** tradeoff with parsimony and regularization. A major focus is **Data Leakage** in materials workflows (pre-processing, temporal, and group/spatial), tackled with proper holdout, K-fold, LOOCV, and stratified validation. We close with the **error measures** that decide what “good” actually means: MAE/MSE/RMSE and $R^2$ for regression, and confusion matrices, precision/recall, F1/Dice, IoU, and categorical cross-entropy for classification and segmentation.\n",
 "\n",
 "**Exercise:** \n",
 "Construct a deliberately flawed ML pipeline and diagnose its failure.\n",
@@ -385,7 +387,7 @@
 "\n",
 "Sandfeld, Stefan. 2024. *Materials Data Science: Introduction to Data Mining, Machine Learning, and Data-Driven Predictions for Materials Science and Engineering*. Springer Nature."
 ],
-"id": "921109e0-2219-40d8-b8a9-9398719dd610"
+"id": "ab3eae36-98a4-4c35-8a83-b928deea2b5d"
 }
 ],
 "nbformat": 4,

index.pdf

2.53 KB
Binary file not shown.
