Hyperparameter-Tuning-Cookbook/001_sampling.qmd at main · sequential-parameter-optimization/Hyperparameter-Tuning-Cookbook · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
execute:
  cache: false
  eval: true
  echo: true
  warning: false
---


# Sampling Plans

::: {.callout-note}
### Note
* This section is based on chapter 1 in @Forr08a.
* The following Python packages are imported:
```{python}
import numpy as np
import pandas as pd
import numpy as np
from typing import Tuple, Optional
import matplotlib.pyplot as plt
from spotpython.utils.sampling import rlh
from spotpython.utils.effects import screening_plot, screeningplan
from spotpython.fun.objectivefunctions import Analytical
from spotpython.utils.sampling import (fullfactorial, bestlh,
     jd, mm, mmphi, mmsort, perturb, mmlhs, phisort, mmphi_intensive)
from spotpython.design.poor import Poor
from spotpython.design.clustered import Clustered
from spotpython.design.sobol import Sobol
from spotpython.design.spacefilling import SpaceFilling
from spotpython.design.random import Random
from spotpython.design.grid import Grid
```
:::

## Ideas and Concepts


::: {#def-sampling-plan}

#### Sampling Plan
In the context of computer experiments, the term sampling plan refers to the set of input values, say $X$,at which the computer code is evaluated.

:::

The goal of a sampling plan is to efficiently explore the input space to understand the behavior of the computer code and build a surrogate model that accurately represents the code's behavior.
Traditionally, Response Surface Methodology (RSM) has been used to design sampling plans for computer experiments.
These sampling plans are based on procedures that generate points by means of a rectangular grid or a factorial design.

However, more recently, Design and Analysis of Computer Experiments (DACE) has emerged as a more flexible and powerful approach for designing sampling plans.

Engineering design often requires the construction of a surrogate model $\hat{f}$ to approximate the expensive response of a black-box function $f$. The function $f(x)$ represents a continuous metric (e.g., quality, cost, or performance) defined over a design space $D \subset \mathbb{R}^k$, where $x$ is a $k$-dimensional vector of design variables. Since evaluating $f$ is costly, only a sparse set of samples is used to construct $\hat{f}$, which can then provide inexpensive predictions for any $x \in D$.

The process involves:

* Sampling discrete observations:
* Using these samples to construct an approximation $\hat{f}$.
* Ensuring the surrogate model is well-posed, meaning it is mathematically valid and can generalize predictions effectively.

A sampling plan

$$
X =
\left\{
  x^{(i)} \in D | i = 1, \ldots, n
\right\}
$$

determines the spatial arrangement of observations. While some models require a minimum number of data points $n$, once this threshold is met, a surrogate model can be constructed to approximate $f$ efficiently.

A well-posed model does not always perform well because its ability to generalize depends heavily on the sampling plan used to collect data. If the sampling plan is poorly designed, the model may fail to capture critical behaviors in the design space. For example:

* Extreme Sampling: Measuring performance only at the extreme values of parameters may miss important behaviors in the center of the design space, leading to incomplete understanding.
* Uneven Sampling: Concentrating samples in certain regions while neglecting others forces the model to extrapolate over unsampled areas, potentially resulting in inaccurate or misleading predictions.
Additionally, in some cases, the data may come from external sources or be limited in scope, leaving little control over the sampling plan. This can further restrict the model's ability to generalize effectively.


### The 'Curse of Dimensionality' and How to Avoid It


The "curse of dimensionality" refers to the exponential increase in computational complexity and data requirements as the number of dimensions (variables) in a problem grows.
For a one-dimensional space, sampling $n$ locations may suffice for accurate predictions.
In high-dimensional spaces, the amount of data needed to maintain the same level of accuracy or coverage increases dramatically. For example, if a one-dimensional space requires $n$ samples for a certain accuracy, a $k$-dimensional space would require $n^k$ samples. This makes tasks like optimization, sampling, and modeling computationally expensive and often impractical in high-dimensional settings.


::: {#exm-curse-of-dim}
#### Example: Curse of Dimensionality
Consider a simple example where we want to model the cost of a car tire based on its wheel diameter. If we have one variable (wheel diameter), we might need 10 simulations to get a good estimate of the cost.
Now, if we add 8 more variables (e.g., tread pattern, rubber type, etc.), the number of simulations required increases to $10^8$ (10 million). This is because the number of combinations of design variables grows exponentially with the number of dimensions.
This means that the computational budget required to evaluate all combinations of design variables becomes infeasible. In this case, it would take 11,416 years to complete the simulations, making it impractical to explore the design space fully.

:::

### Physical versus Computational Experiments

Physical experiments are prone to experimental errors from three main sources:

* Human error: Mistakes made by the experimenter.
* Random error: Measurement inaccuracies that vary unpredictably.
* Systematic error: Consistent bias due to flaws in the experimental setup.

The key distinction is repeatability: systematic errors remain constant across repetitions, while random errors vary.

Computational experiments, on the other hand, are deterministic and free from random errors. However, they are still affected by:

* Human error: Bugs in code or incorrect boundary conditions.
* Systematic error: Biases from model simplifications (e.g., inviscid flow approximations) or finite resolution (e.g., insufficient mesh resolution).

The term "noise" is used differently in physical and computational contexts. In physical experiments, it refers to random errors, while in computational experiments, it often refers to systematic errors.

Understanding these differences is crucial for designing experiments and applying techniques like Gaussian process-based approximations. For physical experiments, replication mitigates random errors, but this is unnecessary for deterministic computational experiments.

### Designing Preliminary Experiments (Screening)

Minimizing the number of design variables $x_1, x_2, \dots, x_k$ is crucial before modeling the objective function $f$. This process, called screening, aims to reduce dimensionality without compromising the analysis. If $f$ is at least once differentiable over the design domain $D$, the partial derivative $\frac{\partial f}{\partial x_i}$ can be used to classify variables:

* Negligible Variables:
If $\frac{\partial f}{\partial x_i} = 0, \, \forall x \in D$, the variable $x_i$ can be safely neglected.
* Linear Additive Variables:
If $\frac{\partial f}{\partial x_i} = \text{constant} \neq 0, \, \forall x \in D$, the effect of $x_i$ is linear and additive.
* Nonlinear Variables:
If $\frac{\partial f}{\partial x_i} = g(x_i), \, \forall x \in D$, where $g(x_i)$ is a non-constant function, $f$ is nonlinear in $x_i$.
* Interactive Nonlinear Variables:
If $\frac{\partial f}{\partial x_i} = g(x_i, x_j, \dots), /, \forall x \in D$, where $g(x_i, x_j, \dots)$ is a function involving interactions with other variables, $f$ is nonlinear in $x_i$ and interacts with $x_j$.


Measuring $\frac{\partial f}{\partial x_i}$ across the entire design space is often infeasible due to limited budgets.
The percentage of time allocated to screening depends on the problem:
If many variables are expected to be inactive, thorough screening can significantly improve model accuracy by reducing dimensionality.
If most variables are believed to impact the objective, focus should shift to modeling instead.
Screening is a trade-off between computational cost and model accuracy, and its effectiveness depends on the specific problem context.

#### Estimating the Distribution of Elementary Effects

In order to simplify the presentation of what follows, we make, without loss of generality, the assumption that the design space $D = [0, 1]^k$; that is, we normalize all variables into the unit cube. We shall adhere to this convention for the rest of the book and strongly urge the reader to do likewise when implementing any algorithms described here, as this step not only yields clearer mathematics in some cases but also safeguards against scaling issues.

Before proceeding with the description of the Morris algorithm, we need to define an important statistical concept. Let us restrict our design space $D$ to a $k$-dimensional, $p$-level full factorial grid, that is,

$$
x_i \in \{0, \frac{1}{p-1}, \frac{2}{p-1}, \dots, 1\}, \quad \text{ for } i = 1, \dots, k.
$$


::: {#def-elementary-effect}
#### Elementary Effect

For a given baseline value $x \in D$, let $d_i(x)$ denote the elementary effect of $x_i$, where:

$$
d_i(x) = \frac{f(x_1, \dots, x_i + \Delta, \dots, x_k) - f(x_1, \dots, x_i - \Delta, \dots, x_k)}{2\Delta}, \quad i = 1, \dots, k,
$$ {#eq-eleffect}
where $\Delta$ is the step size, which is defined as the distance between two adjacent levels in the grid. In other words, we have:

with
$$\Delta = \frac{\xi}{p-1}, \quad \xi \in \mathbb{N}^*, \quad \text{and} \quad x \in D , \text{ such that its components } x_i \leq 1 - \Delta.
$$

$\Delta$ is the step size. The elementary effect $d_i(x)$ measures the sensitivity of the function $f$ to changes in the variable $x_i$ at the point $x$.
:::

Morris's method aims to estimate the parameters of the distribution of elementary effects associated with each variable. A large measure of central tendency indicates that a variable has a significant influence on the objective function across the design space, while a large measure of spread suggests that the variable is involved in interactions or contributes to the nonlinearity of $f$. In practice, the sample mean and standard deviation of a set of $d_i(x)$ values, calculated in different parts of the design space, are used for this estimation.

To ensure efficiency, the preliminary sampling plan $X$ should be designed so that each evaluation of the objective function $f$ contributes to the calculation of two elementary effects, rather than just one (as would occur with a naive random spread of baseline $x$ values and adding $\Delta$ to one variable). Additionally, the sampling plan should provide a specified number (e.g., $r$) of elementary effects for each variable, independently drawn with replacement. For a detailed discussion on constructing such a sampling plan, readers are encouraged to consult Morris's original paper (Morris, 1991). Here, we focus on describing the process itself.

The random orientation of the sampling plan $B$ can be constructed as follows:

* Let $B$ be a $(k+1) \times k$ matrix of 0s and 1s, where for each column $i$, two rows differ only in their $i$-th entries.
* Compute a random orientation of $B$, denoted $B^*$:

$$
B^* =
\left(
1_{k+1,k} x^* + (\Delta/2)
\left[
(2B-1_{k+1,k})
D^* +
1_{k+1,k}
\right]
\right)
P^*,
$$

where:

* $D^*$ is a $k$-dimensional diagonal matrix with diagonal elements $\pm 1$ (equal probability),
* $\mathbf{1}$ is a matrix of 1s,
* $x^*$ is a randomly chosen point in the $p$-level design space (limited by $\Delta$),
* $P^*$ is a $k \times k$ random permutation matrix with one 1 per column and row.

`spotpython` provides a `Python` implementation to compute $B^*$, see [https://github.com/sequential-parameter-optimization/spotPython/blob/main/src/spotpython/utils/effects.py](https://github.com/sequential-parameter-optimization/spotPython/blob/main/src/spotpython/utils/effects.py).

Here is the corresponding code:

```{python}
def randorient(k, p, xi, seed=None):
    # Initialize random number generator with the provided seed
    if seed is not None:
        rng = np.random.default_rng(seed)
    else:
        rng = np.random.default_rng()

    # Step length
    Delta = xi / (p - 1)

    m = k + 1

    # A truncated p-level grid in one dimension
    xs = np.arange(0, 1 - Delta, 1 / (p - 1))
    xsl = len(xs)
    if xsl < 1:
        print(f"xi = {xi}.")
        print(f"p = {p}.")
        print(f"Delta = {Delta}.")
        print(f"p - 1 = {p - 1}.")
        raise ValueError(f"The number of levels xsl is {xsl}, but it must be greater than 0.")

    # Basic sampling matrix
    B = np.vstack((np.zeros((1, k)), np.tril(np.ones((k, k)))))

    # Randomization

    # Matrix with +1s and -1s on the diagonal with equal probability
    Dstar = np.diag(2 * rng.integers(0, 2, size=k) - 1)

    # Random base value
    xstar = xs[rng.integers(0, xsl, size=k)]

    # Permutation matrix
    Pstar = np.zeros((k, k))
    rp = rng.permutation(k)
    for i in range(k):
        Pstar[i, rp[i]] = 1

    # A random orientation of the sampling matrix
    Bstar = (np.ones((m, 1)) @ xstar.reshape(1, -1) +
        (Delta / 2) * ((2 * B - np.ones((m, k))) @ Dstar +
        np.ones((m, k)))) @ Pstar

    return Bstar
```

The code following snippet generates a random orientation of a sampling matrix `Bstar` using the `randorient()` function. The input parameters are:

* k = 3: The number of design variables (dimensions).
* p = 3: The number of levels in the grid for each variable.
* xi = 1: A parameter used to calculate the step size Delta.

Step-size calculation is performed as follows:
`Delta = xi / (p - 1) = 1 / (3 - 1) = 0.5`, which determines the spacing between levels in the grid.

Next, random sampling matrix construction is computed:

* A truncated grid is created with levels `[0, 0.5]` (based on Delta).
* A basic sampling matrix B is constructed, which is a lower triangular matrix with 0s and 1s.

Then, randomization is applied:

* `Dstar`: A diagonal matrix with random entries of +1 or -1.
* `xstar`: A random starting point from the grid.
* `Pstar`: A random permutation matrix.

Random orientation is applied to the basic sampling matrix `B` to create `Bstar`. This involves scaling, shifting, and permuting the rows and columns of B.

The final output is the matrix `Bstar`, which represents a random orientation of the sampling plan. Each row corresponds to a sampled point in the design space, and each column corresponds to a design variable.


::: {#exm-randorient-2}
##### Random Orientation of the Sampling Matrix in 2-D

```{python}
k = 2
p = 3
xi = 1
Bstar = randorient(k, p, xi, seed=123)
print(f"Random orientation of the sampling matrix:\n{Bstar}")
```

We can visualize the random orientation of the sampling matrix in 2-D as shown in @fig-randorient-2d.

```{python}
#| label: fig-randorient-2d
#| fig-cap: "Random orientation of the sampling matrix in 2-D. The labels indicate the row index of the points."
plt.figure(figsize=(6, 6))
plt.scatter(Bstar[:, 0], Bstar[:, 1], color='blue', s=50, label='Hypercube Points')
for i in range(Bstar.shape[0]):
    plt.text(Bstar[i, 0] + 0.01, Bstar[i, 1] + 0.01, str(i), fontsize=9)
plt.xlim(-0.1, 1.1)
plt.ylim(-0.1, 1.1)
plt.xlabel('x1')
plt.ylabel('x2')
plt.grid()
```

:::


::: {#exm-randorient-3}
##### Random Orientation of the Sampling Matrix

```{python}
k = 3
p = 3
xi = 1
Bstar = randorient(k, p, xi)
print(f"Random orientation of the sampling matrix:\n{Bstar}")
```
:::


To obtain $r$ elementary effects for each variable, the screening plan is built from $r$ random orientations:

$$
X =
\begin{pmatrix}
B^*_1 \\
B^*_2 \\
\vdots \\
B^*_r
\end{pmatrix}
$$

The function `screeningplan()` generates a screening plan by calling the `randorient()` function `r` times. It creates a list of random orientations and then concatenates them into a single array, which represents the screening plan.
The screening plan implementation in `Python` is as follows (see [https://github.com/sequential-parameter-optimization/spotPython/blob/main/src/spotpython/utils/effects.py](https://github.com/sequential-parameter-optimization/spotPython/blob/main/src/spotpython/utils/effects.py)):
```{python}
def screeningplan(k, p, xi, r):
    # Empty list to accumulate screening plan rows
    X = []
    for i in range(r):
        X.append(randorient(k, p, xi))
    # Concatenate list of arrays into a single array
    X = np.vstack(X)
    return X
```


It works like follows:

* The value of the objective function $f$ is computed for each row of the screening plan matrix $X$.
These values are stored in a column vector $t$ of size $(r * (k + 1)) \times 1$, where:
    * `r` is the number of random orientations.
    * `k` is the number of design variables.


The elementary effects are calculated using the following formula:

* For each random orientation, adjacent rows of the screening plan matrix X and their corresponding function values from t are used.
* These values are inserted into @eq-eleffect to compute elementary effects for each variable. An elementary effect measures the sensitivity of the objective function to changes in a specific variable.

Results can be used for a statistical analysis.  After collecting a sample of $r$ elementary effects for each variable:

* The sample mean (central tendency) is computed to indicate the overall influence of the variable.
* The sample standard deviation (spread) is computed to capture variability, which may indicate interactions or nonlinearity.

The results (sample means and standard deviations) are plotted on a chart for comparison.
This helps identify which variables have the most significant impact on the objective function and whether their effects are linear or involve interactions.
This is implemented in the function `screening_plot()` in `Python`, which uses the helper function `_screening()` to calculate the elementary effects and their statistics.

```{python}
def _screening(X, fun, xi, p, labels, bounds=None) -> tuple:
    """Helper function to calculate elementary effects for a screening design.

    Args:
        X (np.ndarray): The screening plan matrix, typically structured
            within a [0,1]^k box.
        fun (object): The objective function to evaluate at each
            design point in the screening plan.
        xi (float): The elementary effect step length factor.
        p (int): Number of discrete levels along each dimension.
        labels (list of str): A list of variable names corresponding to
            the design variables.
        bounds (np.ndarray): A 2xk matrix where the first row contains
            lower bounds and the second row contains upper bounds for
            each variable.

    Returns:
        tuple: A tuple containing two arrays:
            - sm: The mean of the elementary effects for each variable.
            - ssd: The standard deviation of the elementary effects for
            each variable.
    """
    k = X.shape[1]
    r = X.shape[0] // (k + 1)

    # Scale each design point
    t = np.zeros(X.shape[0])
    for i in range(X.shape[0]):
        if bounds is not None:
            X[i, :] = bounds[0, :] + X[i, :] * (bounds[1, :] - bounds[0, :])
        t[i] = fun(X[i, :])

    # Elementary effects
    F = np.zeros((k, r))
    for i in range(r):
        for j in range(i * (k + 1), i * (k + 1) + k):
            idx = np.where(X[j, :] - X[j + 1, :] != 0)[0][0]
            F[idx, i] = (t[j + 1] - t[j]) / (xi / (p - 1))

    # Statistical measures (divide by n)
    ssd = np.std(F, axis=1, ddof=0)
    sm = np.mean(F, axis=1)
    return sm, ssd


def screening_plot(X, fun, xi, p, labels, bounds=None, show=True) -> None:
    """Generates a plot with elementary effect screening metrics.

    This function calculates the mean and standard deviation of the
    elementary effects for a given set of design variables and plots
    the results.

    Args:
        X (np.ndarray):
            The screening plan matrix, typically structured within a [0,1]^k box.
        fun (object):
            The objective function to evaluate at each design point in the screening plan.
        xi (float):
            The elementary effect step length factor.
        p (int):
            Number of discrete levels along each dimension.
        labels (list of str):
            A list of variable names corresponding to the design variables.
        bounds (np.ndarray):
            A 2xk matrix where the first row contains lower bounds and
            the second row contains upper bounds for each variable.
        show (bool):
            If True, the plot is displayed. Defaults to True.

    Returns:
        None: The function generates a plot of the results.
    """
    k = X.shape[1]
    sm, ssd = _screening(X=X, fun=fun, xi=xi, p=p, labels=labels, bounds=bounds)
    plt.figure()
    for i in range(k):
        plt.text(sm[i], ssd[i], labels[i], fontsize=10)
    plt.axis([min(sm), 1.1 * max(sm), min(ssd), 1.1 * max(ssd)])
    plt.xlabel("Sample means")
    plt.ylabel("Sample standard deviations")
    plt.gca().tick_params(labelsize=10)
    plt.grid(True)
    if show:
        plt.show()
```


### Special Considerations When Deploying Screening Algorithms
When implementing the screening algorithm described above, two specific scenarios require special attention:

* Duplicate Design Points: If the dimensionality $k$ of the space is relatively low and you can afford a large number of elementary effects $r$, we should be be aware of the increased probability of duplicate design points appearing in the sampling plan $X$.
*Since the responses at sample points are deterministic, there's no value in evaluating the same point multiple times. Fortunately, this issue is relatively uncommon in practice, as screening high-dimensional spaces typically requires large numbers of elementary effects, which naturally reduces the likelihood of duplicates.
* Failed Simulations: Numerical simulation codes occasionally fail to return valid results due to meshing errors, non-convergence of partial differential equation solvers, numerical instabilities, or parameter combinations outside the stable operating range.

From a screening perspective, this is particularly problematic because an entire random orientation $B^*$ becomes compromised if even a single point within it fails to evaluate properly. Implementing error handling strategies or fallback methods to manage such cases should be considered.

For robust screening studies, monitoring simulation success rates and having contingency plans for failed evaluations are important aspects of the experimental design process.


## Analyzing Variable Importance in Aircraft Wing Weight

Let us consider the following analytical expression  used as a conceptual level estimate of the weight of a light aircraft wing as discussed in @sec-awwe.


```{python}
#| label: fig-forre08a-1-2
#| fig-cap: "Estimated means and standard deviations of the elementary effects for the 10 design variables of the wing weight function. Example based on @Forr08a."
fun = Analytical()
k = 10
p = 10
xi = 1
r = 25
X = screeningplan(k=k, p=p, xi=xi, r=r)  # shape (r x (k+1), k)
value_range = np.array([
    [150, 220,   6, -10, 16, 0.5, 0.08, 2.5, 1700, 0.025],
    [200, 300,  10,  10, 45, 1.0, 0.18, 6.0, 2500, 0.08 ],
])
labels = [
    "S_W", "W_fw", "A", "Lambda",
    "q",   "lambda", "tc", "N_z",
    "W_dg", "W_p"
]
screening_plot(
    X=X,
    fun=fun.fun_wingwt,
    bounds=value_range,
    xi=xi,
    p=p,
    labels=labels,
)
```

::: {.callout-note}
### Nondeterministic Results
The code will generate a slightly different screening plan each time, as it uses random orientations of the sampling matrix $B$.
:::


@fig-forre08a-1-2 provides valuable insights into variable activity without requiring domain expertise. The screening study with $r = 25$ elementary effects reveals distinct patterns in how variables affect wing weight:

* Variables with Minimal Impact: A clearly defined group of variables clusters around the origin - indicating their minimal impact on wing weight:
    * Paint weight ($W_p$) - as expected, contributes little to overall wing weight
    * Dynamic pressure ($q$) - within our chosen range, this has limited effect (essentially representing different cruise altitudes at the same speed)
    * Taper ratio ($\lambda$) and quarter-chord sweep ($\Lambda$) - these geometric parameters have minor influence within the narrow range (-10° to 10°) typical of light aircraft

* Variables with Linear Effects:
    * While still close to the origin, fuel weight ($W_{fw}$) shows a slightly larger central tendency with very low standard deviation. This indicates moderate importance but minimal involvement in interactions with other variables.

* Variables with Nonlinear/Interactive Effects:
    * Aspect ratio ($A$) and airfoil thickness ratio ($R_{tc}$) show similar importance levels, but their high standard deviations suggest significant nonlinear behavior and interactions with other variables.

* Dominant Variables: The most significant impacts come from:
    * Flight design gross weight ($W_{dg}$)
    * Wing area ($S_W$)
    * Ultimate load factor ($N_z$)

These variables show both large central tendency values and high standard deviations, indicating strong direct effects and complex interactions. The interaction between aspect ratio and load factor is particularly important - high values of both create extremely heavy wings, explaining why highly maneuverable fighter jets cannot use glider-like wing designs.

What makes this screening approach valuable is its ability to identify critical variables without requiring engineering knowledge or expensive modeling. In real-world applications, we rarely have the luxury of creating comprehensive parameter space visualizations, which is precisely why surrogate modeling is needed.
After identifying the active variables through screening, we can design a focused sampling plan for these key variables. This forms the foundation for building an accurate surrogate model of the objective function.

When the objective function is particularly expensive to evaluate, we might recycle the runs performed during screening for the actual model fitting step. This is most effective when some variables prove to have no impact at all. However, since completely inactive variables are rare in practice, engineers must carefully balance the trade-off between reusing expensive simulation runs and introducing potential noise into the model.


## Designing a Sampling Plan

### Stratification

A feature shared by all of the approximation models discussed in @Forr08a is that they are more accurate in the vicinity of the points where we have evaluated the objective function.
In later chapters we will delve into the laws that quantify our decaying trust in the model as we move away from a known, sampled point, but for the purposes of the present discussion we shall merely draw the intuitive conclusion that a uniform level of model accuracy throughout the design space requires a uniform spread of points. A sampling plan possessing this feature is said to be space-filling.

The most straightforward way of sampling a design space in a uniform fashion is by means of a rectangular grid of points. This is the full factorial sampling technique.

Here is the simplified version of a `Python` function that will sample the unit hypercube at all levels in all dimensions, with the $k$-vector $q$ containing the number of points required along each dimension, see
[https://github.com/sequential-parameter-optimization/spotPython/blob/main/src/spotpython/utils/sampling.py](https://github.com/sequential-parameter-optimization/spotPython/blob/main/src/spotpython/utils/sampling.py).

The variable `Edges` specifies whether we want the points to be equally spaced from edge to edge (`Edges=1`)  or we want them to be in the centres of $n = q_1 \times q_2 \times \ldots \times q_k$ bins filling the unit hypercube (for any other value of `Edges`).

```{python}
def fullfactorial(q_param, Edges=1) -> np.ndarray:
    """Generates a full factorial sampling plan in the unit cube.

    Args:
        q (list or np.ndarray):
            A list or array containing the number of points along each dimension (k-vector).
        Edges (int, optional):
            Determines spacing of points. If `Edges=1`, points are equally spaced from edge to edge (default).
            Otherwise, points will be in the centers of n = q[0]*q[1]*...*q[k-1] bins filling the unit cube.

    Returns:
        (np.ndarray): Full factorial sampling plan as an array of shape (n, k), where n is the total number of points and k is the number of dimensions.

    Raises:
        ValueError: If any dimension in `q` is less than 2.
    """
    q_levels = np.array(q_param) # Use a distinct variable for original levels
    if np.min(q_levels) < 2:
        raise ValueError("You must have at least two points per dimension.")

    n = np.prod(q_levels)
    k = len(q_levels)
    X = np.zeros((n, k))

    # q_for_prod_calc is used for calculating repetitions, includes the phantom element.
    # This matches the logic of the user-provided snippet where 'q' was modified.
    q_for_prod_calc = np.append(q_levels, 1)

    for j in range(k): # k is the original number of dimensions
        # current_dim_levels is the number of levels for the current dimension j
        # In the user's snippet, q[j] correctly refers to the original level count
        # as j ranges from 0 to k-1, and q_for_prod_calc[j] = q_levels[j] for this range.
        current_dim_levels = q_for_prod_calc[j]

        if Edges == 1:
            one_d_slice = np.linspace(0, 1, int(current_dim_levels))
        else:
            # Corrected calculation for bin centers
            if current_dim_levels == 1: # Should not be hit if np.min(q_levels) >= 2
                one_d_slice = np.array([0.5])
            else:
                one_d_slice = np.linspace(1 / (2 * current_dim_levels),
                                          1 - 1 / (2 * current_dim_levels),
                                          int(current_dim_levels))

        column = np.array([])
        # The product q_for_prod_calc[j + 1 : k] correctly calculates
        # the product of remaining original dimensions' levels.
        num_consecutive_repeats = np.prod(q_for_prod_calc[j + 1 : k])

        # This loop structure replicates the logic from the user's snippet
        while len(column) < n:
            for ll_idx in range(int(current_dim_levels)): # Iterate through levels of current dimension
                val_to_repeat = one_d_slice[ll_idx]
                column = np.append(column, np.ones(int(num_consecutive_repeats)) * val_to_repeat)
        X[:, j] = column
    return X
```

```{python}
q = [3, 2]
X = fullfactorial(q, Edges=0)
print(X)
```

@fig-fullfactorial-2d-edges0 shows the points in the unit hypercube for the case of 3x2 points.

```{python}
#| label: fig-fullfactorial-2d-edges0
#| fig-cap: "2D Full Factorial Sampling (3x2 Points). Edges = 0"
#| echo: false
plt.figure(figsize=(6, 6))
plt.scatter(X[:, 0], X[:, 1], color='blue', s=50, label='Hypercube Points')
plt.grid(True, linestyle='--', alpha=0.7)
plt.title('2D Full Factorial Sampling (3x2 Points). Edges = 0')
plt.xlabel('Dimension 1')
plt.ylabel('Dimension 2')
plt.xlim(0, 1)
plt.ylim(0, 1)
```


```{python}
X = fullfactorial(q, Edges=1)
print(X)
```

@fig-fullfactorial-2d-edges1 shows the points in the unit hypercube for the case of 3x2 points with edges.

```{python}
#| label: fig-fullfactorial-2d-edges1
#| fig-cap: "2D Full Factorial Sampling (3x2 Points). Edges = 1"
#| echo: false
plt.figure(figsize=(6, 6))
plt.scatter(X[:, 0], X[:, 1], color='blue', s=50, label='Hypercube Points')
plt.grid(True, linestyle='--', alpha=0.7)
plt.title('2D Full Factorial Sampling (3x2 Points). Edges = 1')
plt.xlabel('Dimension 1')
plt.ylabel('Dimension 2')
plt.xlim(-0.1, 1.1)
plt.ylim(-0.1, 1.1)
```


The full factorial sampling plan method generates a uniform sampling design by creating a grid of points across all dimensions. For example, calling fullfactorial([3, 4, 5], 1) produces a three-dimensional sampling plan with 3, 4, and 5 levels along each dimension, respectively. While this approach satisfies the uniformity criterion, it has two significant limitations:

* Restricted Design Sizes: The method only works for designs where the total number of points $n$ can be expressed as the product of the number of levels in each dimension, i.e., $n = q_1 \times q_2 \times \cdots \times q_k$.

* Overlapping Projections: When the sampling points are projected onto individual axes, sets of points may overlap, reducing the effectiveness of the sampling plan. This can lead to non-uniform coverage in the projections, which may not fully represent the design space.

###  Latin Squares and Random Latin Hypercubes

To improve the uniformity of projections for any individual variable, the range of that variable can be divided into a large number of equal-sized bins, and random subsamples of equal size can be generated within these bins. This method is called stratified random sampling. Extending this idea to all dimensions results in a stratified sampling plan, commonly implemented using Latin hypercube sampling.

::: {#def-latin-hypercube}
#### Latin Squares and Hypercubes
In the context of statistical sampling, a square grid containing sample positions is a Latin square if (and only if) there is only one sample in each row and each column. A Latin hypercube is the generalisation of this concept to an arbitrary number of dimensions, whereby each sample is the only one in each axis-aligned hyperplane containing it
:::

For two-dimensional discrete variables, a Latin square ensures uniform projections. An $(n \times n)$ Latin square is constructed by filling each row and column with a permutation of $\{1, 2, \dots, n\}$, ensuring each number appears only once per row and column.

::: {#exm-latin-square}
### Latin Square
For $n = 4$, a Latin square might look like this:


```{raw}
2	1	3	4
3	2	4	1
1	4	2	3
4	3	1	2
```

:::


Latin Hypercubes are the multidimensional extension of Latin squares.
The design space is divided into equal-sized hypercubes (bins), and one point is placed in each bin.
The placement ensures that moving along any axis from an occupied bin does not encounter another occupied bin. This guarantees uniform projections across all dimensions.
To construct a Latin hypercube, the following steps are taken:

* Represent the sampling plan as an $n \times k$ matrix $X$, where $n$ is the number of points and $k$ is the number of dimensions.
* Fill each column of $X$ with random permutations of $\{1, 2, \dots, n\}$.
* Normalize the plan into the unit hypercube $[0, 1]^k$.

This approach ensures multidimensional stratification and uniformity in projections.
Here is the code:

```{python}
def rlh(n: int, k: int, edges: int = 0) -> np.ndarray:
    # Initialize array
    X = np.zeros((n, k), dtype=float)

    # Fill with random permutations
    for i in range(k):
        X[:, i] = np.random.permutation(n)

    # Adjust normalization based on the edges flag
    if edges == 1:
        # [X=0..n-1] -> [0..1]
        X = X / (n - 1)
    else:
        # Points at true midpoints
        # [X=0..n-1] -> [0.5/n..(n-0.5)/n]
        X = (X + 0.5) / n

    return X
```


::: {#exm-rlh}
### Random Latin Hypercube

The following code can be used to generate a 2D Latin hypercube with 5 points and edges=0:
```{python}
X = rlh(n=5, k=2, edges=0)
print(X)
```

@fig-rlh-edges0 shows the points in the unit hypercube for the case of 5 points with `edges=0`.

```{python}
#| label: fig-rlh-edges0
#| fig-cap: "2D Latin Hypercube Sampling (5 Points, Edges=0)"
#| echo: false
plt.figure(figsize=(6, 6))
plt.scatter(X[:, 0], X[:, 1], color='blue', s=50, label='Hypercube Points')
plt.grid(True, linestyle='--', alpha=0.7)
plt.title('2D Latin Hypercube Sampling (5 Points, Edges=0)')
plt.xlabel('Dimension 1')
plt.ylabel('Dimension 2')
plt.xlim(0, 1)
plt.ylim(0, 1)
plt.legend()
plt.show()
```

:::


::: {#exm-rlh-edges}

### Random Latin Hypercube with Edges

The following code can be used to generate a 2D Latin hypercube with 5 points and edges=1:
```{python}
X = rlh(n=5, k=2, edges=1)
print(X)
```

@fig-rlh-edges1 shows the points in the unit hypercube for the case of 5 points with `edges=1`.

```{python}
#| label: fig-rlh-edges1
#| fig-cap: "2D Latin Hypercube Sampling (5 Points, Edges=1)"
#| echo: false
plt.figure(figsize=(6, 6))
plt.scatter(X[:, 0], X[:, 1], color='blue', s=50, label='Hypercube Points')
plt.grid(True, linestyle='--', alpha=0.7)
plt.title('2D Latin Hypercube Sampling (5 Points, Edges=1)')
plt.xlabel('Dimension 1')
plt.ylabel('Dimension 2')
plt.xlim(0, 1)
plt.ylim(0, 1)
plt.legend()
plt.show()
```
:::


### Space-filling Designs: Maximin Plans

A widely adopted measure for assessing the uniformity, or 'space-fillingness', of a sampling plan is the maximin metric, initially proposed by @john90a. This criterion can be formally defined as follows.

Consider a sampling plan $X$. Let $d_1, d_2, \ldots, d_m$ represent the unique distances between all possible pairs of points within $X$, arranged in ascending order. Furthermore, let $J_1, J_2, \ldots, J_m$ be defined such that $J_j$ denotes the count of point pairs in $X$ separated by the distance $d_j$.

::: {#def-maximin}
#### Maximin plan
A sampling plan $X$ is considered a maximin plan if, among all candidate plans, it maximizes the smallest inter-point distance $d_1$. Among plans that satisfy this condition, it further minimizes $J_1$, the number of pairs separated by this minimum distance.
:::

While this definition is broadly applicable to any collection of sampling plans, our focus is narrowed to Latin hypercube designs to preserve their desirable stratification properties. However, even within this restricted class, @def-maximin may identify multiple equivalent maximin designs. To address this, a more comprehensive 'tie-breaker' definition, as proposed by @morr95a, is employed:

::: {#def-maximin2}
#### Maximin plan with tie-breaker
A sampling plan $X$ is designated as the maximin plan if it sequentially optimizes the following conditions: it maximizes $d_1$; among those, it minimizes $J_1$; among those, it maximizes $d_2$; among those, it minimizes $J_2$; and so forth, concluding with minimizing $J_m$.
:::


@john90a established that the maximin criterion (@def-maximin) is equivalent to the D-optimality criterion used in linear regression. However, the extended maximin criterion incorporating a tie-breaker (@def-maximin2) is often preferred due to its intuitive nature and practical utility. Given that the sampling plans under consideration make no assumptions about model structure, the latter criterion (@def-maximin2) will be employed.

To proceed, a precise definition of 'distance' within these contexts is necessary. The p-norm is the most widely adopted metric for this purpose:


::: {#def-p-norm}
#### p-norm
The p-norm of a vector $\vec{x} = (x_1, x_2, \ldots, x_k)$ is defined as:

$$
d_p(\vec{x}^{(i_1)}, \vec{x}^{(i_2)}) = \left( \sum_{j=1}^k |x_j^{(i_1)} - x_j^{(i_2)}|^p \right)^{1/p}.
$$ {#eq-p-norm}

:::


When $p = 1$, @eq-p-norm defines the rectangular distance, occasionally referred to as the Manhattan norm (an allusion to a grid-like city layout). Setting $p = 2$ yields the Euclidean norm. The existing literature offers limited evidence to suggest the superiority of one norm over the other for evaluating sampling plans when no model structure assumptions are made. It is important to note, however, that the rectangular distance is considerably less computationally demanding. This advantage can be quite significant, particularly when evaluating large sampling plans.

For the computational implementation of @def-maximin2, the initial step involves constructing the vectors $d_1, d_2, \ldots, d_m$ and $J_1, J_2, \ldots, J_m$. The `jd` function facilitates this task.


#### The Function `jd`

The function `jd` computes the distinct p-norm distances between all pairs of points in a given set and counts their occurrences. It returns two arrays: one for the distinct distances and another for their multiplicities.

```{python}
def jd(X: np.ndarray, p: float = 1.0) -> Tuple[np.ndarray, np.ndarray]:
    """
    Args:
        X (np.ndarray):
            A 2D array of shape (n, d) representing n points
            in d-dimensional space.
        p (float, optional):
            The distance norm to use.
            p=1 uses the Manhattan (L1) norm, while p=2 uses the
            Euclidean (L2) norm. Defaults to 1.0 (Manhattan norm).

    Returns:
        (np.ndarray, np.ndarray):
            A tuple (J, distinct_d), where:
            - distinct_d is a 1D float array of unique,
            sorted distances between points.
            - J is a 1D integer array that provides
            the multiplicity (occurrence count)
            of each distance in distinct_d.
    """
    n = X.shape[0]

    # Allocate enough space for all pairwise distances
    # (n*(n-1))/2 pairs for an n-point set
    pair_count = n * (n - 1) // 2
    d = np.zeros(pair_count, dtype=float)

    # Fill the distance array
    idx = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            # Compute the p-norm distance
            d[idx] = np.linalg.norm(X[i] - X[j], ord=p)
            idx += 1

    # Find unique distances and their multiplicities
    distinct_d = np.unique(d)
    J = np.zeros_like(distinct_d, dtype=int)
    for i, val in enumerate(distinct_d):
        J[i] = np.sum(d == val)
    return J, distinct_d
```

::: {#exm-jd}
### The Function `jd`

Consider a small 3-point set in 2D space, with points located at (0,0), (1,1), and (2,2) as shown in @fig-jd-3points. The distinct distances and their occurrences can be computed using the `jd` function, as shown in the following code:

```{python}
#| label: fig-jd-3points
#| fig-cap: "3-Point Set in 2D Space"
#| echo: false
X = np.array([[0.0, 0.0],[1.0, 1.0],[2.0, 2.0]])
plt.figure(figsize=(6, 6))
plt.scatter(X[:, 0], X[:, 1], color='blue', label='Points')
plt.title('3-Point Set in 2D Space')
plt.grid()
```
```{python}
J, distinct_d = jd(X, p=2.0)
print("Distinct distances (d_i):", distinct_d)
print("Occurrences (J_i):", J)
```

:::


### Memory Management

A computationally intensive part of the calculation performed with the `jd`-function is the creation of the vector $\vec{d}$ containing all pairwise distances. This is particularly true for large sampling plans; for instance, a 1000-point plan requires nearly half a million distance calculations.


::: {#def-prealloc}
#### Pre-allocation of Memory

Pre-allocation of memory is a programming technique where a fixed amount of memory is reserved for a data structure (like an array or vector) before it is actually filled with data. This is done to avoid the computational overhead associated with dynamic memory allocation, which involves repeatedly requesting and resizing memory as new elements are added.

:::

Consequently, pre-allocating memory for the distance vector $\vec{d}$ is essential. This necessitates a slightly less direct method for computing the indices of $\vec{d}$, rather than appending each new element, which would involve costly dynamic memory allocation.

The implementation of @def-maximin2 is now required. Finding the most space-filling design involves pairwise comparisons. This problem can be approached using a 'divide and conquer' strategy, simplifying it to the task of selecting the better of two sampling plans. The function `mm(X1,X2,p)` is designed for this purpose. It returns an index indicating which of the two designs is more space-filling, or `0` if they are equally space-filling, based on the $p$-norm for distance computation.

#### The Function `mm`

The function `mm` compares two sampling plans based on the Morris-Mitchell criterion. It uses the `jd` function to compute the distances and multiplicities, constructs vectors for comparison, and determines which plan is more space-filling.

```{python}
def mm(X1: np.ndarray, X2: np.ndarray, p: Optional[float] = 1.0) -> int:
    """
    Args:
        X1 (np.ndarray): A 2D array representing the first sampling plan.
        X2 (np.ndarray): A 2D array representing the second sampling plan.
        p (float, optional): The distance metric. p=1 uses Manhattan (L1) distance,
            while p=2 uses Euclidean (L2). Defaults to 1.0.

    Returns:
        int:
            - 0 if both plans are identical or equally space-filling
            - 1 if X1 is more space-filling
            - 2 if X2 is more space-filling
    """
    X1_sorted = X1[np.lexsort(np.rot90(X1))]
    X2_sorted = X2[np.lexsort(np.rot90(X2))]
    if np.array_equal(X1_sorted, X2_sorted):
        return 0  # Identical sampling plans

    # Compute distance multiplicities for each plan
    J1, d1 = jd(X1, p)
    J2, d2 = jd(X2, p)
    m1, m2 = len(d1), len(d2)

    # Construct V1 and V2: alternate distance and negative multiplicity
    V1 = np.zeros(2 * m1)
    V1[0::2] = d1
    V1[1::2] = -J1

    V2 = np.zeros(2 * m2)
    V2[0::2] = d2
    V2[1::2] = -J2

    # Trim the longer vector to match the size of the shorter
    m = min(m1, m2)
    V1 = V1[:m]
    V2 = V2[:m]

    # Compare element-by-element:
    # c[i] = 1 if V1[i] > V2[i], 2 if V1[i] < V2[i], 0 otherwise.
    c = (V1 > V2).astype(int) + 2 * (V1 < V2).astype(int)

    if np.sum(c) == 0:
        # Equally space-filling
        return 0
    else:
        # The first non-zero entry indicates which plan is better
        idx = np.argmax(c != 0)
        return c[idx]
```


:::{#exm-mm}
### The Function `mm`
We can use the `mm` function to compare two sampling plans. The following code creates two 3-point sampling plans in 2D (shown in @fig-mm-3points) and compares them using the Morris-Mitchell criterion:

```{python}
X1 = np.array([[0.0, 0.0],[0.5, 0.5],[0.0, 1.0], [1.0, 1.0]])
X2 = np.array([[0.1, 0.1],[0.4, 0.6],[0.1, 0.9], [0.9, 0.9]])
```
```{python}
#| label: fig-mm-3points
#| fig-cap: "Comparison of Two Sampling Plans"
#| echo: false
plt.figure(figsize=(6, 6))
plt.scatter(X1[:, 0], X1[:, 1], color='blue', label='Plan 1')
plt.scatter(X2[:, 0], X2[:, 1], color='red', label='Plan 2')
plt.title('Comparison of Two Sampling Plans')
plt.grid()
plt.xlabel('Dimension 1')
plt.ylabel('Dimension 2')
plt.legend()
plt.show()
```

We can compare which plan has better space-filling (Morris-Mitchell).
The output is either `0`, `1`, or `2` depending on which plan is more space-filling.
```{python}
better = mm(X1, X2, p=2.0)
print(f"Plan {better} is more space-filling.")
```
:::


#### The Function `mmphi`

Searching across a space of potential sampling plans can be accomplished by pairwise comparisons. An optimization algorithm could, in theory, be written with mm as the comparative objective. However, experimental evidence [@morr95a] suggests that the resulting optimization landscape can be quite deceptive, making it difficult to search reliably. This difficulty arises because the comparison process terminates upon finding the first non-zero element in the comparison array c. Consequently, the remaining values in the distance ($d_1, d_2, ..., d_m$) and multiplicity ($J_1, J_2, ..., J_m$) arrays are disregarded. These disregarded values, however, might contain potentially useful 'slope' information about the global landscape for the optimization process.

To address this, @morr95a defined the following scalar-valued criterion function, which is used to rank competing sampling plans. This function, while based on the logic of @def-maximin2, incorporates the complete vectors $d_1, d_2, ..., d_m$ and $J_1, J_2, ..., J_m$.

::: {#def-morris-mitchell}
#### Morris-Mitchell Criterion
The Morris-Mitchell criterion is defined as:

$$
\Phi_q (X) = \left(\sum_{j=1}^m J_j d_j^{-q}\right)^{1/q},