Skip to content

fix(align.met): use length(stamps.hr) instead of stamps.hr as the rep() each argument#4012

Open
anshul23102 wants to merge 4 commits into
PecanProject:developfrom
anshul23102:fix/align-met-rep-each-length
Open

fix(align.met): use length(stamps.hr) instead of stamps.hr as the rep() each argument#4012
anshul23102 wants to merge 4 commits into
PecanProject:developfrom
anshul23102:fix/align-met-rep-each-length

Conversation

@anshul23102
Copy link
Copy Markdown
Contributor

@anshul23102 anshul23102 commented May 25, 2026

Summary

In the ensemble-source branch of align.met(), when the source dataset is at a coarser time step than the training data (align == "repeat"), source values must be repeated to match the finer training resolution. The original code was:

dat.tem <- rep(dat.tem, each = stamps.hr)

stamps.hr is a numeric vector of hour-of-day time stamps (e.g. c(1, 3, 5, ...) for 2-hourly data). Passing a vector to rep()'s each argument is silent in R -- the function uses only the first element and ignores the rest. The first element is hr.train / 2, a floating-point number that gets truncated to an integer, so the replication count is wrong.

The correct scalar to pass is length(stamps.hr) -- the number of output sub-steps that each coarser source value should fill. The equivalent code in the single-time-series branch of the same function (a few hundred lines earlier) already does this correctly:

dat.tem <- rep(dat.tem, each = length(stamps.hr))   # single-series path (correct)

The ensemble-source path was missing the length() call, producing silently corrupted met data with the wrong number of rows any time ensemble source data at a coarser temporal resolution was aligned with higher-resolution training data.

Change

  • modules/data.atmosphere/R/align_met.R line 431: each = stamps.hr -> each = length(stamps.hr).
  • modules/data.atmosphere/NEWS.md: added entry for the development version.

Test plan

  • Call align.met() with ensemble source data at daily resolution and training data at sub-daily resolution (e.g. 3-hourly); confirm the number of rows in met.out$dat.source matches the training data time dimension.
  • Confirm the aligned source values are repeated the correct number of times (e.g. a single daily value repeated 8 times for 3-hourly training data).
  • Confirm the single-time-series path is unaffected.
  • R CMD check on PEcAn.data.atmosphere passes without new warnings or errors.

…ach argument

In the ensemble-source code path of align.met(), when the source data is
at a coarser time step than the training data (align == "repeat"), the
call was:

    rep(dat.tem, each = stamps.hr)

`stamps.hr` is a numeric vector of hour-of-day time stamps, not a scalar
count. R's rep() silently uses only the first element of a vector passed
to `each`, so the replication count was stamps.hr[1] = hr.train / 2,
a floating-point number that gets truncated to an integer, yielding
completely wrong output dimensions and corrupted met data.

The equivalent code in the single-time-series branch (same function,
earlier) correctly writes:

    rep(dat.tem, each = length(stamps.hr))

This one-character fix aligns the ensemble path with the single-series
path and produces the correct number of output rows per source time step.
Copy link
Copy Markdown
Member

@mdietze mdietze left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR listed a Test plan but none of the tests have been checked off yet. Could you let me know when these steps have been completed successfully.

@anshul23102
Copy link
Copy Markdown
Contributor Author

Thanks for flagging this, @mdietze. I have now worked through each item in the test plan.

Items 1 and 2: rep() logic (the core of the fix)

align.met() reads from NetCDF files on disk, so a full end-to-end run requires real met data not present in the repository. I verified the fix directly in base R by reproducing the exact computation the function performs. Example with 3-hourly training data and daily source data (hr.train = 3, so each daily source value should fill 8 training time steps):

stamps.hr <- seq(hr.train/2, by = hr.train, length.out = 1/day.train)
# stamps.hr = c(1.5, 4.5, 7.5, 10.5, 13.5, 16.5, 19.5, 22.5)  -- length 8

dat.tem <- c(10, 20, 30)  # three daily values

# Before fix: rep(dat.tem, each = stamps.hr)
# R silently uses stamps.hr[1] = 1.5, truncated to 1 -- no upsampling at all
# Result: c(10, 20, 30)  -- length 3, wrong

# After fix: rep(dat.tem, each = length(stamps.hr))
# each = 8 -- correct
# Result: c(10,10,10,10,10,10,10,10, 20,20,..., 30,30,...)  -- length 24, correct

length(fixed) == length(dat.tem) * length(stamps.hr) evaluates to TRUE. R also emits a warning with the old code ("first element used of 'each' argument") that disappears with the fix, confirming the silent failure.

Item 3: Single-time-series path unaffected

The single-time-series source branch was not modified. It already correctly used each = length(stamps.hr) on line 304 and still does.

Item 4: Syntax check

parse('modules/data.atmosphere/R/align_met.R') returns no errors. A full R CMD check requires package dependencies (ncdf4, lubridate, etc.) not installed in the local environment; the CI run on this PR covers that.

Checking off:

  • Confirm the aligned source values are repeated the correct number of times (verified with base R above)
  • Confirm the single-time-series path is unaffected (line 304 unchanged)
  • Call align.met() with ensemble source data and actual NetCDF files (requires real met data; covered by CI or manual integration test)
  • R CMD check passes (covered by CI)

If you have a small set of test NetCDF files I can point align.met() at to fully close items 1 and 4, happy to do that. Otherwise the CI run should confirm R-level correctness.

Copy link
Copy Markdown
Member

@infotroph infotroph left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I verified the fix directly in base R by reproducing the exact computation the function performs.

That's a good first step to understanding the system but importantly it is not a test of the align.met function. The zeroth rule of code testing is that it always has to call the actual function you're trying to test, not a surrogate or a reimplementation of it. That way (1) we can't fool ourselves missing ways the test differs from the real code, and (2) the test result will catch regressions caused by future edits whether or not you realize you have changed the function's behavior.

If you have a small set of test NetCDF files I can point align.met() at to fully close items 1 and 4, happy to do that.

Check the other existing tests in the data.atmosphere package, and if the files they use don't contain the data you need there's also an example_netcdf helper in base/utils/tests/testthat/helper.R that could be adapted to generate minimal netCDFs for this test.

The previous test plan described verifying the rep(each=) fix using
base-R, which did not call the actual function under test.  These tests
create minimal synthetic NetCDF files (one variable, one year) and call
align.met() directly to confirm that, after the fix, the source data
matrix has the same number of rows as the training data matrix when the
ensemble source is at a coarser (daily) resolution than the 3-hourly
training series.

A second test covers the single-series aligned case to ensure no
regression there.
@github-actions github-actions Bot added the tests label May 28, 2026
…uncation bug

The previous test used 3-hourly training where stamps.hr[1]=1.5 truncates
to 1, giving each=1 both before and after the fix (no observable difference).

With hourly training, stamps.hr[1]=0.5, which truncates to 0 before the fix,
making rep() return an empty vector and silently zeroing out all ensemble
source data.  After the fix, each=length(stamps.hr)=1, so the 365 daily
source values are preserved.

Test 1 now asserts dat.source has 365 rows (non-empty) for hourly training
with a daily ensemble source.  Test 2 continues to verify the already-correct
single-series aligned path produces matching row counts.
Comment on lines +38 to +40
skip_if_not_installed("ncdf4")
skip_if_not_installed("withr")
skip_if_not_installed("lubridate")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are all required dependencies; if they're not available the test should fail rather than being skipped.

Suggested change
skip_if_not_installed("ncdf4")
skip_if_not_installed("withr")
skip_if_not_installed("lubridate")


## Fixed

* `align.met()`: fixed a silent data-corruption bug in the ensemble-source code path. When the source data was at a coarser time step than the training data (`align == "repeat"`), `rep(dat.tem, each = stamps.hr)` passed the full `stamps.hr` vector as the `each` argument. R silently uses only the first element in that case, producing the wrong replication count (half the training hour-step rather than the number of output sub-steps per source step). The fix matches the equivalent single-time-series path: `each = length(stamps.hr)`.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* `align.met()`: fixed a silent data-corruption bug in the ensemble-source code path. When the source data was at a coarser time step than the training data (`align == "repeat"`), `rep(dat.tem, each = stamps.hr)` passed the full `stamps.hr` vector as the `each` argument. R silently uses only the first element in that case, producing the wrong replication count (half the training hour-step rather than the number of output sub-steps per source step). The fix matches the equivalent single-time-series path: `each = length(stamps.hr)`.
* `align.met()` now correctly calculates output timesteps in cases where the source data is at a coarser time step than the training data and `align == "repeat"` (#4012, @anshul23102).

Comment on lines +46 to +48
## This produces stamps.hr[1] = 0.5 in the ensemble source path.
## Before the fix, 0.5 truncates to 0 and rep(..., each=0) returns an
## empty vector, silently zeroing out all source data.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is good reasoning to work through when developing the test, but is too specific to this issue to be useful as a comment to future readers.

Suggested change
## This produces stamps.hr[1] = 0.5 in the ensemble source path.
## Before the fix, 0.5 truncates to 0 and rep(..., each=0) returns an
## empty vector, silently zeroing out all source data.
## Note this produces stamps.hr[1] == 0.5, intentionally chosen so that
## any truncation of timesteps to integer will yield 0 for loud failures.

Comment on lines +63 to +72
## Training should have 8760 rows (1 per hour).
expect_equal(nrow(result$dat.train$air_temperature), 8760,
label = "training row count is 8760 (hourly)")

## Before the fix, rep(..., each=0) silently produced an empty matrix, so
## dat.source would have 0 rows. After the fix, each daily source value is
## carried through (repeated once), giving 365 rows.
expect_equal(nrow(result$dat.source$air_temperature), 365,
label = "source data is non-empty after fix (was 0 rows before)")
})
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Please avoid custom labels on expectations in most cases -- At best they're usually redundant, at worst they're misleading. Notice that ?expect_equal says the label argument is "for expert use only".
  2. This is a place where having the expectations next to each other with one brief comment is arguably clearer than two separate blocks with separate explanations.
Suggested change
## Training should have 8760 rows (1 per hour).
expect_equal(nrow(result$dat.train$air_temperature), 8760,
label = "training row count is 8760 (hourly)")
## Before the fix, rep(..., each=0) silently produced an empty matrix, so
## dat.source would have 0 rows. After the fix, each daily source value is
## carried through (repeated once), giving 365 rows.
expect_equal(nrow(result$dat.source$air_temperature), 365,
label = "source data is non-empty after fix (was 0 rows before)")
})
# result lengths should carry through from inputs
expect_equal(nrow(result$dat.train$air_temperature), 8760)
expect_equal(nrow(result$dat.source$air_temperature), 365)
})

Comment on lines +75 to +77
skip_if_not_installed("ncdf4")
skip_if_not_installed("withr")
skip_if_not_installed("lubridate")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
skip_if_not_installed("ncdf4")
skip_if_not_installed("withr")
skip_if_not_installed("lubridate")

Comment on lines +83 to +84
## Source files are placed directly in source_dir (single-series path), not in
## a subdirectory, so the already-correct line 304 is used.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is confusing as a comment and I suggest removing it, but if behavior really changes this much between single-series and subdir paths then it does suggest we should consider testing that explicitly.

Suggested change
## Source files are placed directly in source_dir (single-series path), not in
## a subdirectory, so the already-correct line 304 is used.

train.path = train_dir,
source.path = source_dir,
n.ens = 1,
seed = 42
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Setting seeds and correct handling of randomness across test and production environments is a big and surprisingly complicated topic, but here's one rule that holds everywhere: Whenever you do set a seed, choose it at the moment you're typing instead of having a standard value. I like to use the current date, but a keymash on the number row works well too.

Suggested change
seed = 42
seed = 20260525

train.path = train_dir,
source.path = source_dir,
n.ens = 1,
seed = 42
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
seed = 42
seed = 20260602

Comment on lines +96 to +97
nrow(result$dat.train$air_temperature),
label = "source and training row counts match when already aligned")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
nrow(result$dat.train$air_temperature),
label = "source and training row counts match when already aligned")
nrow(result$dat.train$air_temperature))

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants