fix(align.met): use length(stamps.hr) instead of stamps.hr as the rep() each argument#4012
fix(align.met): use length(stamps.hr) instead of stamps.hr as the rep() each argument#4012anshul23102 wants to merge 4 commits into
Conversation
…ach argument
In the ensemble-source code path of align.met(), when the source data is
at a coarser time step than the training data (align == "repeat"), the
call was:
rep(dat.tem, each = stamps.hr)
`stamps.hr` is a numeric vector of hour-of-day time stamps, not a scalar
count. R's rep() silently uses only the first element of a vector passed
to `each`, so the replication count was stamps.hr[1] = hr.train / 2,
a floating-point number that gets truncated to an integer, yielding
completely wrong output dimensions and corrupted met data.
The equivalent code in the single-time-series branch (same function,
earlier) correctly writes:
rep(dat.tem, each = length(stamps.hr))
This one-character fix aligns the ensemble path with the single-series
path and produces the correct number of output rows per source time step.
mdietze
left a comment
There was a problem hiding this comment.
PR listed a Test plan but none of the tests have been checked off yet. Could you let me know when these steps have been completed successfully.
|
Thanks for flagging this, @mdietze. I have now worked through each item in the test plan. Items 1 and 2:
stamps.hr <- seq(hr.train/2, by = hr.train, length.out = 1/day.train)
# stamps.hr = c(1.5, 4.5, 7.5, 10.5, 13.5, 16.5, 19.5, 22.5) -- length 8
dat.tem <- c(10, 20, 30) # three daily values
# Before fix: rep(dat.tem, each = stamps.hr)
# R silently uses stamps.hr[1] = 1.5, truncated to 1 -- no upsampling at all
# Result: c(10, 20, 30) -- length 3, wrong
# After fix: rep(dat.tem, each = length(stamps.hr))
# each = 8 -- correct
# Result: c(10,10,10,10,10,10,10,10, 20,20,..., 30,30,...) -- length 24, correct
Item 3: Single-time-series path unaffected The single-time-series source branch was not modified. It already correctly used Item 4: Syntax check
Checking off:
If you have a small set of test NetCDF files I can point |
There was a problem hiding this comment.
I verified the fix directly in base R by reproducing the exact computation the function performs.
That's a good first step to understanding the system but importantly it is not a test of the align.met function. The zeroth rule of code testing is that it always has to call the actual function you're trying to test, not a surrogate or a reimplementation of it. That way (1) we can't fool ourselves missing ways the test differs from the real code, and (2) the test result will catch regressions caused by future edits whether or not you realize you have changed the function's behavior.
If you have a small set of test NetCDF files I can point align.met() at to fully close items 1 and 4, happy to do that.
Check the other existing tests in the data.atmosphere package, and if the files they use don't contain the data you need there's also an example_netcdf helper in base/utils/tests/testthat/helper.R that could be adapted to generate minimal netCDFs for this test.
The previous test plan described verifying the rep(each=) fix using base-R, which did not call the actual function under test. These tests create minimal synthetic NetCDF files (one variable, one year) and call align.met() directly to confirm that, after the fix, the source data matrix has the same number of rows as the training data matrix when the ensemble source is at a coarser (daily) resolution than the 3-hourly training series. A second test covers the single-series aligned case to ensure no regression there.
…uncation bug The previous test used 3-hourly training where stamps.hr[1]=1.5 truncates to 1, giving each=1 both before and after the fix (no observable difference). With hourly training, stamps.hr[1]=0.5, which truncates to 0 before the fix, making rep() return an empty vector and silently zeroing out all ensemble source data. After the fix, each=length(stamps.hr)=1, so the 365 daily source values are preserved. Test 1 now asserts dat.source has 365 rows (non-empty) for hourly training with a daily ensemble source. Test 2 continues to verify the already-correct single-series aligned path produces matching row counts.
| skip_if_not_installed("ncdf4") | ||
| skip_if_not_installed("withr") | ||
| skip_if_not_installed("lubridate") |
There was a problem hiding this comment.
These are all required dependencies; if they're not available the test should fail rather than being skipped.
| skip_if_not_installed("ncdf4") | |
| skip_if_not_installed("withr") | |
| skip_if_not_installed("lubridate") |
|
|
||
| ## Fixed | ||
|
|
||
| * `align.met()`: fixed a silent data-corruption bug in the ensemble-source code path. When the source data was at a coarser time step than the training data (`align == "repeat"`), `rep(dat.tem, each = stamps.hr)` passed the full `stamps.hr` vector as the `each` argument. R silently uses only the first element in that case, producing the wrong replication count (half the training hour-step rather than the number of output sub-steps per source step). The fix matches the equivalent single-time-series path: `each = length(stamps.hr)`. |
There was a problem hiding this comment.
| * `align.met()`: fixed a silent data-corruption bug in the ensemble-source code path. When the source data was at a coarser time step than the training data (`align == "repeat"`), `rep(dat.tem, each = stamps.hr)` passed the full `stamps.hr` vector as the `each` argument. R silently uses only the first element in that case, producing the wrong replication count (half the training hour-step rather than the number of output sub-steps per source step). The fix matches the equivalent single-time-series path: `each = length(stamps.hr)`. | |
| * `align.met()` now correctly calculates output timesteps in cases where the source data is at a coarser time step than the training data and `align == "repeat"` (#4012, @anshul23102). |
| ## This produces stamps.hr[1] = 0.5 in the ensemble source path. | ||
| ## Before the fix, 0.5 truncates to 0 and rep(..., each=0) returns an | ||
| ## empty vector, silently zeroing out all source data. |
There was a problem hiding this comment.
This is good reasoning to work through when developing the test, but is too specific to this issue to be useful as a comment to future readers.
| ## This produces stamps.hr[1] = 0.5 in the ensemble source path. | |
| ## Before the fix, 0.5 truncates to 0 and rep(..., each=0) returns an | |
| ## empty vector, silently zeroing out all source data. | |
| ## Note this produces stamps.hr[1] == 0.5, intentionally chosen so that | |
| ## any truncation of timesteps to integer will yield 0 for loud failures. |
| ## Training should have 8760 rows (1 per hour). | ||
| expect_equal(nrow(result$dat.train$air_temperature), 8760, | ||
| label = "training row count is 8760 (hourly)") | ||
|
|
||
| ## Before the fix, rep(..., each=0) silently produced an empty matrix, so | ||
| ## dat.source would have 0 rows. After the fix, each daily source value is | ||
| ## carried through (repeated once), giving 365 rows. | ||
| expect_equal(nrow(result$dat.source$air_temperature), 365, | ||
| label = "source data is non-empty after fix (was 0 rows before)") | ||
| }) |
There was a problem hiding this comment.
- Please avoid custom labels on expectations in most cases -- At best they're usually redundant, at worst they're misleading. Notice that
?expect_equalsays thelabelargument is "for expert use only". - This is a place where having the expectations next to each other with one brief comment is arguably clearer than two separate blocks with separate explanations.
| ## Training should have 8760 rows (1 per hour). | |
| expect_equal(nrow(result$dat.train$air_temperature), 8760, | |
| label = "training row count is 8760 (hourly)") | |
| ## Before the fix, rep(..., each=0) silently produced an empty matrix, so | |
| ## dat.source would have 0 rows. After the fix, each daily source value is | |
| ## carried through (repeated once), giving 365 rows. | |
| expect_equal(nrow(result$dat.source$air_temperature), 365, | |
| label = "source data is non-empty after fix (was 0 rows before)") | |
| }) | |
| # result lengths should carry through from inputs | |
| expect_equal(nrow(result$dat.train$air_temperature), 8760) | |
| expect_equal(nrow(result$dat.source$air_temperature), 365) | |
| }) |
| skip_if_not_installed("ncdf4") | ||
| skip_if_not_installed("withr") | ||
| skip_if_not_installed("lubridate") |
There was a problem hiding this comment.
| skip_if_not_installed("ncdf4") | |
| skip_if_not_installed("withr") | |
| skip_if_not_installed("lubridate") |
| ## Source files are placed directly in source_dir (single-series path), not in | ||
| ## a subdirectory, so the already-correct line 304 is used. |
There was a problem hiding this comment.
This is confusing as a comment and I suggest removing it, but if behavior really changes this much between single-series and subdir paths then it does suggest we should consider testing that explicitly.
| ## Source files are placed directly in source_dir (single-series path), not in | |
| ## a subdirectory, so the already-correct line 304 is used. |
| train.path = train_dir, | ||
| source.path = source_dir, | ||
| n.ens = 1, | ||
| seed = 42 |
There was a problem hiding this comment.
Setting seeds and correct handling of randomness across test and production environments is a big and surprisingly complicated topic, but here's one rule that holds everywhere: Whenever you do set a seed, choose it at the moment you're typing instead of having a standard value. I like to use the current date, but a keymash on the number row works well too.
| seed = 42 | |
| seed = 20260525 |
| train.path = train_dir, | ||
| source.path = source_dir, | ||
| n.ens = 1, | ||
| seed = 42 |
There was a problem hiding this comment.
| seed = 42 | |
| seed = 20260602 |
| nrow(result$dat.train$air_temperature), | ||
| label = "source and training row counts match when already aligned") |
There was a problem hiding this comment.
| nrow(result$dat.train$air_temperature), | |
| label = "source and training row counts match when already aligned") | |
| nrow(result$dat.train$air_temperature)) |
Summary
In the ensemble-source branch of
align.met(), when the source dataset is at a coarser time step than the training data (align == "repeat"), source values must be repeated to match the finer training resolution. The original code was:stamps.hris a numeric vector of hour-of-day time stamps (e.g.c(1, 3, 5, ...)for 2-hourly data). Passing a vector torep()'seachargument is silent in R -- the function uses only the first element and ignores the rest. The first element ishr.train / 2, a floating-point number that gets truncated to an integer, so the replication count is wrong.The correct scalar to pass is
length(stamps.hr)-- the number of output sub-steps that each coarser source value should fill. The equivalent code in the single-time-series branch of the same function (a few hundred lines earlier) already does this correctly:The ensemble-source path was missing the
length()call, producing silently corrupted met data with the wrong number of rows any time ensemble source data at a coarser temporal resolution was aligned with higher-resolution training data.Change
modules/data.atmosphere/R/align_met.Rline 431:each = stamps.hr->each = length(stamps.hr).modules/data.atmosphere/NEWS.md: added entry for the development version.Test plan
align.met()with ensemble source data at daily resolution and training data at sub-daily resolution (e.g. 3-hourly); confirm the number of rows inmet.out$dat.sourcematches the training data time dimension.R CMD checkonPEcAn.data.atmospherepasses without new warnings or errors.