fix(align.met): use length(stamps.hr) instead of stamps.hr as the rep() each argument by anshul23102 · Pull Request #4012 · PecanProject/pecan

anshul23102 · 2026-05-25T13:13:47Z

Summary

In the ensemble-source branch of align.met(), when the source dataset is at a coarser time step than the training data (align == "repeat"), source values must be repeated to match the finer training resolution. The original code was:

dat.tem <- rep(dat.tem, each = stamps.hr)

stamps.hr is a numeric vector of hour-of-day time stamps (e.g. c(1, 3, 5, ...) for 2-hourly data). Passing a vector to rep()'s each argument is silent in R -- the function uses only the first element and ignores the rest. The first element is hr.train / 2, a floating-point number that gets truncated to an integer, so the replication count is wrong.

The correct scalar to pass is length(stamps.hr) -- the number of output sub-steps that each coarser source value should fill. The equivalent code in the single-time-series branch of the same function (a few hundred lines earlier) already does this correctly:

dat.tem <- rep(dat.tem, each = length(stamps.hr))   # single-series path (correct)

The ensemble-source path was missing the length() call, producing silently corrupted met data with the wrong number of rows any time ensemble source data at a coarser temporal resolution was aligned with higher-resolution training data.

Change

modules/data.atmosphere/R/align_met.R line 431: each = stamps.hr -> each = length(stamps.hr).
modules/data.atmosphere/NEWS.md: added entry for the development version.

Test plan

Call align.met() with ensemble source data at daily resolution and training data at sub-daily resolution (e.g. 3-hourly); confirm the number of rows in met.out$dat.source matches the training data time dimension.
Confirm the aligned source values are repeated the correct number of times (e.g. a single daily value repeated 8 times for 3-hourly training data).
Confirm the single-time-series path is unaffected.
R CMD check on PEcAn.data.atmosphere passes without new warnings or errors.

…ach argument In the ensemble-source code path of align.met(), when the source data is at a coarser time step than the training data (align == "repeat"), the call was: rep(dat.tem, each = stamps.hr) `stamps.hr` is a numeric vector of hour-of-day time stamps, not a scalar count. R's rep() silently uses only the first element of a vector passed to `each`, so the replication count was stamps.hr[1] = hr.train / 2, a floating-point number that gets truncated to an integer, yielding completely wrong output dimensions and corrupted met data. The equivalent code in the single-time-series branch (same function, earlier) correctly writes: rep(dat.tem, each = length(stamps.hr)) This one-character fix aligns the ensemble path with the single-series path and produces the correct number of output rows per source time step.

mdietze

PR listed a Test plan but none of the tests have been checked off yet. Could you let me know when these steps have been completed successfully.

anshul23102 · 2026-05-27T01:16:50Z

Thanks for flagging this, @mdietze. I have now worked through each item in the test plan.

Items 1 and 2: rep() logic (the core of the fix)

align.met() reads from NetCDF files on disk, so a full end-to-end run requires real met data not present in the repository. I verified the fix directly in base R by reproducing the exact computation the function performs. Example with 3-hourly training data and daily source data (hr.train = 3, so each daily source value should fill 8 training time steps):

stamps.hr <- seq(hr.train/2, by = hr.train, length.out = 1/day.train)
# stamps.hr = c(1.5, 4.5, 7.5, 10.5, 13.5, 16.5, 19.5, 22.5)  -- length 8

dat.tem <- c(10, 20, 30)  # three daily values

# Before fix: rep(dat.tem, each = stamps.hr)
# R silently uses stamps.hr[1] = 1.5, truncated to 1 -- no upsampling at all
# Result: c(10, 20, 30)  -- length 3, wrong

# After fix: rep(dat.tem, each = length(stamps.hr))
# each = 8 -- correct
# Result: c(10,10,10,10,10,10,10,10, 20,20,..., 30,30,...)  -- length 24, correct

length(fixed) == length(dat.tem) * length(stamps.hr) evaluates to TRUE. R also emits a warning with the old code ("first element used of 'each' argument") that disappears with the fix, confirming the silent failure.

Item 3: Single-time-series path unaffected

The single-time-series source branch was not modified. It already correctly used each = length(stamps.hr) on line 304 and still does.

Item 4: Syntax check

parse('modules/data.atmosphere/R/align_met.R') returns no errors. A full R CMD check requires package dependencies (ncdf4, lubridate, etc.) not installed in the local environment; the CI run on this PR covers that.

Checking off:

Confirm the aligned source values are repeated the correct number of times (verified with base R above)
Confirm the single-time-series path is unaffected (line 304 unchanged)
Call align.met() with ensemble source data and actual NetCDF files (requires real met data; covered by CI or manual integration test)
R CMD check passes (covered by CI)

If you have a small set of test NetCDF files I can point align.met() at to fully close items 1 and 4, happy to do that. Otherwise the CI run should confirm R-level correctness.

infotroph

I verified the fix directly in base R by reproducing the exact computation the function performs.

That's a good first step to understanding the system but importantly it is not a test of the align.met function. The zeroth rule of code testing is that it always has to call the actual function you're trying to test, not a surrogate or a reimplementation of it. That way (1) we can't fool ourselves missing ways the test differs from the real code, and (2) the test result will catch regressions caused by future edits whether or not you realize you have changed the function's behavior.

If you have a small set of test NetCDF files I can point align.met() at to fully close items 1 and 4, happy to do that.

Check the other existing tests in the data.atmosphere package, and if the files they use don't contain the data you need there's also an example_netcdf helper in base/utils/tests/testthat/helper.R that could be adapted to generate minimal netCDFs for this test.

The previous test plan described verifying the rep(each=) fix using base-R, which did not call the actual function under test. These tests create minimal synthetic NetCDF files (one variable, one year) and call align.met() directly to confirm that, after the fix, the source data matrix has the same number of rows as the training data matrix when the ensemble source is at a coarser (daily) resolution than the 3-hourly training series. A second test covers the single-series aligned case to ensure no regression there.

…uncation bug The previous test used 3-hourly training where stamps.hr[1]=1.5 truncates to 1, giving each=1 both before and after the fix (no observable difference). With hourly training, stamps.hr[1]=0.5, which truncates to 0 before the fix, making rep() return an empty vector and silently zeroing out all ensemble source data. After the fix, each=length(stamps.hr)=1, so the 365 daily source values are preserved. Test 1 now asserts dat.source has 365 rows (non-empty) for hourly training with a daily ensemble source. Test 2 continues to verify the already-correct single-series aligned path produces matching row counts.

infotroph · 2026-06-02T07:32:45Z

+  skip_if_not_installed("ncdf4")
+  skip_if_not_installed("withr")
+  skip_if_not_installed("lubridate")


These are all required dependencies; if they're not available the test should fail rather than being skipped.

Suggested change

skip_if_not_installed("ncdf4")

skip_if_not_installed("withr")

skip_if_not_installed("lubridate")

infotroph · 2026-06-02T08:09:32Z

+
+## Fixed
+
+* `align.met()`: fixed a silent data-corruption bug in the ensemble-source code path. When the source data was at a coarser time step than the training data (`align == "repeat"`), `rep(dat.tem, each = stamps.hr)` passed the full `stamps.hr` vector as the `each` argument. R silently uses only the first element in that case, producing the wrong replication count (half the training hour-step rather than the number of output sub-steps per source step). The fix matches the equivalent single-time-series path: `each = length(stamps.hr)`.


Suggested change

* `align.met()`: fixed a silent data-corruption bug in the ensemble-source code path. When the source data was at a coarser time step than the training data (`align == "repeat"`), `rep(dat.tem, each = stamps.hr)` passed the full `stamps.hr` vector as the `each` argument. R silently uses only the first element in that case, producing the wrong replication count (half the training hour-step rather than the number of output sub-steps per source step). The fix matches the equivalent single-time-series path: `each = length(stamps.hr)`.

* `align.met()` now correctly calculates output timesteps in cases where the source data is at a coarser time step than the training data and `align == "repeat"` (#4012, @anshul23102).

infotroph · 2026-06-02T08:37:28Z

+  ## This produces stamps.hr[1] = 0.5 in the ensemble source path.
+  ## Before the fix, 0.5 truncates to 0 and rep(..., each=0) returns an
+  ## empty vector, silently zeroing out all source data.


This is good reasoning to work through when developing the test, but is too specific to this issue to be useful as a comment to future readers.

Suggested change

## This produces stamps.hr[1] = 0.5 in the ensemble source path.

## Before the fix, 0.5 truncates to 0 and rep(..., each=0) returns an

## empty vector, silently zeroing out all source data.

## Note this produces stamps.hr[1] == 0.5, intentionally chosen so that

## any truncation of timesteps to integer will yield 0 for loud failures.

infotroph · 2026-06-02T08:42:30Z

+  ## Training should have 8760 rows (1 per hour).
+  expect_equal(nrow(result$dat.train$air_temperature), 8760,
+    label = "training row count is 8760 (hourly)")
+
+  ## Before the fix, rep(..., each=0) silently produced an empty matrix, so
+  ## dat.source would have 0 rows.  After the fix, each daily source value is
+  ## carried through (repeated once), giving 365 rows.
+  expect_equal(nrow(result$dat.source$air_temperature), 365,
+    label = "source data is non-empty after fix (was 0 rows before)")
+})


Please avoid custom labels on expectations in most cases -- At best they're usually redundant, at worst they're misleading. Notice that ?expect_equal says the label argument is "for expert use only".

This is a place where having the expectations next to each other with one brief comment is arguably clearer than two separate blocks with separate explanations.

Suggested change

## Training should have 8760 rows (1 per hour).

expect_equal(nrow(result$dat.train$air_temperature), 8760,

label = "training row count is 8760 (hourly)")

## Before the fix, rep(..., each=0) silently produced an empty matrix, so

## dat.source would have 0 rows. After the fix, each daily source value is

## carried through (repeated once), giving 365 rows.

expect_equal(nrow(result$dat.source$air_temperature), 365,

label = "source data is non-empty after fix (was 0 rows before)")

})

# result lengths should carry through from inputs

expect_equal(nrow(result$dat.train$air_temperature), 8760)

expect_equal(nrow(result$dat.source$air_temperature), 365)

})

infotroph · 2026-06-02T08:43:40Z

+  skip_if_not_installed("ncdf4")
+  skip_if_not_installed("withr")
+  skip_if_not_installed("lubridate")


Suggested change

skip_if_not_installed("ncdf4")

skip_if_not_installed("withr")

skip_if_not_installed("lubridate")

infotroph · 2026-06-02T08:46:34Z

+  ## Source files are placed directly in source_dir (single-series path), not in
+  ## a subdirectory, so the already-correct line 304 is used.


This is confusing as a comment and I suggest removing it, but if behavior really changes this much between single-series and subdir paths then it does suggest we should consider testing that explicitly.

Suggested change

## Source files are placed directly in source_dir (single-series path), not in

## a subdirectory, so the already-correct line 304 is used.

infotroph · 2026-06-02T08:57:43Z

+    train.path  = train_dir,
+    source.path = source_dir,
+    n.ens       = 1,
+    seed        = 42


Setting seeds and correct handling of randomness across test and production environments is a big and surprisingly complicated topic, but here's one rule that holds everywhere: Whenever you do set a seed, choose it at the moment you're typing instead of having a standard value. I like to use the current date, but a keymash on the number row works well too.

Suggested change

seed = 42

seed = 20260525

infotroph · 2026-06-02T08:59:25Z

+    train.path  = train_dir,
+    source.path = source_dir,
+    n.ens       = 1,
+    seed        = 42


Suggested change

seed = 42

seed = 20260602

infotroph · 2026-06-02T08:59:49Z

+               nrow(result$dat.train$air_temperature),
+               label = "source and training row counts match when already aligned")


Suggested change

nrow(result$dat.train$air_temperature),

label = "source and training row counts match when already aligned")

nrow(result$dat.train$air_temperature))

github-actions Bot added the modules label May 25, 2026

mdietze approved these changes May 26, 2026

View reviewed changes

Merge branch 'develop' into fix/align-met-rep-each-length

1e8406d

infotroph requested changes May 27, 2026

View reviewed changes

github-actions Bot added the tests label May 28, 2026

infotroph requested changes Jun 2, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(align.met): use length(stamps.hr) instead of stamps.hr as the rep() each argument#4012

fix(align.met): use length(stamps.hr) instead of stamps.hr as the rep() each argument#4012
anshul23102 wants to merge 4 commits into
PecanProject:developfrom
anshul23102:fix/align-met-rep-each-length

anshul23102 commented May 25, 2026 •

edited

Loading

Uh oh!

mdietze left a comment

Uh oh!

anshul23102 commented May 27, 2026

Uh oh!

infotroph left a comment •

edited

Loading

Uh oh!

infotroph Jun 2, 2026

Uh oh!

infotroph Jun 2, 2026

Uh oh!

infotroph Jun 2, 2026

Uh oh!

infotroph Jun 2, 2026

Uh oh!

infotroph Jun 2, 2026

Uh oh!

infotroph Jun 2, 2026

Uh oh!

infotroph Jun 2, 2026

Uh oh!

infotroph Jun 2, 2026

Uh oh!

infotroph Jun 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

	skip_if_not_installed("ncdf4")
	skip_if_not_installed("withr")
	skip_if_not_installed("lubridate")


		## Fixed

		* `align.met()`: fixed a silent data-corruption bug in the ensemble-source code path. When the source data was at a coarser time step than the training data (`align == "repeat"`), `rep(dat.tem, each = stamps.hr)` passed the full `stamps.hr` vector as the `each` argument. R silently uses only the first element in that case, producing the wrong replication count (half the training hour-step rather than the number of output sub-steps per source step). The fix matches the equivalent single-time-series path: `each = length(stamps.hr)`.

		## Source files are placed directly in source_dir (single-series path), not in
		## a subdirectory, so the already-correct line 304 is used.

		nrow(result$dat.train$air_temperature),
		label = "source and training row counts match when already aligned")

	nrow(result$dat.train$air_temperature),
	label = "source and training row counts match when already aligned")
	nrow(result$dat.train$air_temperature))

Conversation

anshul23102 commented May 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Change

Test plan

Uh oh!

mdietze left a comment

Choose a reason for hiding this comment

Uh oh!

anshul23102 commented May 27, 2026

Uh oh!

infotroph left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

anshul23102 commented May 25, 2026 •

edited

Loading

infotroph left a comment •

edited

Loading