Agenda
- Discuss draft testing plan
- Is this a good plan? What can make it better? What am I missing?
- The main "thing" to do to move this along is to write the testing helpers. Do you think such helpers would make this process easier, or just different?
- What should be the "correct" place to put a new test?
@ActivitySim/engineering
Notes
Jeff presented a GitHub issue outlining a proposed plan to improve ActivitySim's testing structure. The group reviewed it together and offered initial reactions, with the intent to continue refining the approach asynchronously.
Current Problems Identified
1. Over-reliance on integration tests. Most existing tests run the full model end-to-end (load data → run 20–30 components → compare final trip tables). These are slow (can take an hour+), give poor diagnostic signal when something breaks, and produce cascading failures from trivial changes (e.g., a capitalization fix).
2. Disorganized test structure. Tests are scattered across the repository with no clear guidance on where a new test should live. Contributors adding features or fixing bugs have no obvious location or format to follow.
3. High setup burden for component tests. Writing a test for even a single component (e.g., trip destination choice) currently requires assembling a large set of boilerplate config files. In one recent example, writing the test took significantly more effort than the underlying bug fix itself.
Proposed Approach
Two-tier testing structure:
- Fast tier (unit/component tests): Small, targeted tests covering individual features or bug fixes. Should run in 1–4 minutes on every commit. This is the primary feedback loop for developers.
- Slow tier (integration tests): Full model runs retained as a final safety net before pull requests are merged — not triggered on every commit.
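As one way to make the tier split concrete (not yet an agreed convention), pytest markers could keep both tiers in the same suite while letting CI select between them. The marker name `slow` and the test names below are placeholders, and the marker would still need to be registered in `pytest.ini` or `pyproject.toml`:

```python
# Illustrative sketch of the two-tier split using a pytest marker.
# Marker and test names are placeholders, not an agreed standard.
import pytest


def test_component_fast_example():
    """Fast tier: small, targeted check that runs on every commit."""
    assert sorted([3, 1, 2]) == [1, 2, 3]  # stand-in for a real component assertion


@pytest.mark.slow
def test_full_model_integration_example():
    """Slow tier: full end-to-end model run, executed only before a merge."""
    ...
```

CI would then run `pytest -m "not slow"` on every commit and `pytest -m slow` (or the full suite) before merging a pull request.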
"Boy Scout" rule: Every new feature, bug fix, or non-trivial code change should include an accompanying test. Exceptions (e.g., documentation typos) should be rare and deliberate.
Testing submodule: Jeff proposed creating a dedicated testing submodule with reusable helper functions and setup utilities, reducing the boilerplate burden when writing new component tests.
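As a rough illustration of what such a helper might look like (the module location, helper name, and default settings below are all hypothetical; the actual interface is exactly what still needs to be designed):

```python
# Hypothetical contents of a testing submodule (e.g. activitysim/testing/helpers.py).
# Names and defaults are illustrative. The point is that boilerplate config assembly
# lives in one place instead of being copied into every component test.
from pathlib import Path

import yaml


def write_minimal_configs(workspace: Path, overrides: dict | None = None) -> Path:
    """Create a bare-bones configs directory for a single-component test.

    `overrides` lets a test state only the settings it actually cares about,
    keeping the test body itself small and readable.
    """
    configs = workspace / "configs"
    configs.mkdir(parents=True, exist_ok=True)
    settings = {"households_sample_size": 100, "chunk_size": 0}
    settings.update(overrides or {})
    (configs / "settings.yaml").write_text(yaml.safe_dump(settings))
    return configs
```

A component test would then call the helper with only the settings it cares about, rather than carrying its own copy of the full config set.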
Clear contributor expectations: Document what a well-formed pull request should include with respect to testing.
Key Discussion Points
Self-contained vs. config-file-based tests. David highlighted a tension between tests that embed all settings inline (transparent, self-contained, but not reusable) vs. tests that load shared config files (reusable, but harder to debug when something breaks). No consensus reached — considered an open design question.
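A hypothetical side-by-side sketch of the two styles (settings keys and fixture name are placeholders, and the "component" is reduced to reading the settings back):

```python
from pathlib import Path

import pytest
import yaml


# Style A: settings embedded inline. Transparent and self-contained, but every
# test carries its own copy of the configuration.
def test_sample_size_inline(tmp_path: Path):
    settings = {"households_sample_size": 100, "trace_hh_id": None}
    (tmp_path / "settings.yaml").write_text(yaml.safe_dump(settings))
    # Stand-in for running the component: just read the settings back.
    loaded = yaml.safe_load((tmp_path / "settings.yaml").read_text())
    assert loaded["households_sample_size"] == 100


# Style B: settings loaded from a shared config directory. Reusable across tests,
# but a failure means opening the shared files to see what the test ran with.
@pytest.fixture
def shared_configs_dir(tmp_path: Path) -> Path:
    (tmp_path / "settings.yaml").write_text(
        yaml.safe_dump({"households_sample_size": 100, "trace_hh_id": None})
    )
    return tmp_path


def test_sample_size_from_configs(shared_configs_dir: Path):
    loaded = yaml.safe_load((shared_configs_dir / "settings.yaml").read_text())
    assert loaded["households_sample_size"] == 100
```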
Chain-effect testing. Sijia noted that some features (e.g., global household skipping on failure) require testing cross-component behavior, making pure unit tests insufficient. Some tests will inevitably need a broader model state.
ActivitySim's structural challenge. Because components are tightly coupled and depend on file-based configuration, breaking dependency chains for isolated testing is non-trivial. Jeff acknowledged there may not be a clean solution.
Test maintenance burden. David raised concern that a large, varied test suite can itself become expensive to maintain — especially during library upgrades (e.g., Pandas 3 is already causing cascading failures). Sijia echoed this from experience with Network Wrangler, noting tests are often the first thing to break during refactoring.
Scope of coverage. Whether the goal is full unit-test coverage of the existing codebase vs. incremental improvement going forward is a consortium-level policy question, not an engineering one. Jeff estimated achieving full coverage could require roughly a year of dedicated funding — valuable but difficult to justify to individual agencies in the near term.
Performance testing. David raised the lack of any systematic performance benchmarking. Jeff noted prior discussions about dedicated cloud-based reference machines for this purpose, which were never acted on. The group agreed this is worth revisiting, potentially as a third testing tier or as a regular reporting mechanism.
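Purely as an illustration of the "regular reporting mechanism" idea, and assuming a dedicated reference machine exists, the hook could be as simple as timing a standard run and appending the result to a history file; `run_fn` stands in for whatever reference invocation the group would agree to time.

```python
# Hypothetical benchmark hook for a dedicated reference machine.
import csv
import time
from datetime import datetime, timezone
from pathlib import Path


def record_benchmark(log_path: Path, label: str, run_fn) -> float:
    """Time one reference run and append the result to a CSV history."""
    start = time.perf_counter()
    run_fn()
    elapsed = time.perf_counter() - start
    is_new = not log_path.exists()
    with log_path.open("a", newline="") as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(["timestamp_utc", "label", "seconds"])
        writer.writerow([datetime.now(timezone.utc).isoformat(), label, f"{elapsed:.1f}"])
    return elapsed
```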
AI-assisted test writing. David and Jeff discussed using AI tools to accelerate test generation — but only after a consistent test format and a set of 6–12 canonical examples are established. Pointing AI at the current repository without that scaffolding would likely make things worse.
External software developers. David advised against bringing in outside developers unfamiliar with the domain, citing past experience where a 2-year ramp-up still yielded contributions that missed important context.
Action Items
| Owner | Action |
| --- | --- |
| Jeff Newman | Convert the GitHub issue into a shared Google Doc for async commenting and collaboration. |
| Jeff Newman | Propose named patterns/standards for the two primary test approaches discussed. |
| Sijia Wang | Share Network Wrangler repository and test structure as a reference example. |
| David Hensle | Share the Pandas contributing guide as another reference for testing philosophy. |
| All | Review the draft testing strategy doc and add comments before the next meeting. |
Next Meeting: Continue testing strategy discussion. Jeff will circulate the Google Doc in advance.