update transformers version #15365

Merged
nithinraok merged 10 commits into main from update_transformers on Mar 11, 2026

@nithinraok nithinraok commented Feb 6, 2026

What does this PR do?

Update transformers version

Collection: All

Changelog

Changes Summary

  • transformers: Unpinned from ~=4.57.0 — now allows any version (no upper bound)
  • protobuf: Upgraded from ~=5.29.5 to >=6.33
  • datasets: Added minimum version >=3.2.0
  • fsspec: Relaxed from ==2024.12.0 to >=2024.12.0

Core / Common

  • HuggingFace Hub model filter (nemo/core/classes/mixins/hf_io_mixin.py): Updated get_hf_model_filter() to use the new filter list parameter instead of deprecated library, language, task, tags kwargs (aligns with huggingface_hub API changes)
  • AutoTokenizer (nemo/collections/common/tokenizers/huggingface/auto_tokenizer.py): Added fallback logic for vocab_file — in transformers >= 5.0, from_pretrained may ignore the vocab_file kwarg, so the tokenizer now detects vocab size mismatch and re-loads from the vocab file directly
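The vocab_file fallback described above can be sketched roughly as follows. This is an illustration, not the code in auto_tokenizer.py: StubTokenizer, count_vocab_entries, and load_with_vocab_fallback are hypothetical names, the two loaders are passed in as plain callables instead of calling transformers, and the vocab file is assumed to hold one token per line.

```python
import os
import tempfile
from dataclasses import dataclass
from typing import Callable


@dataclass
class StubTokenizer:
    """Hypothetical stand-in for a tokenizer; only vocab_size matters here."""

    vocab_size: int


def count_vocab_entries(vocab_file: str) -> int:
    # Assumption: plain-text vocab with one token per non-empty line.
    with open(vocab_file, encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())


def load_with_vocab_fallback(
    load_pretrained: Callable[[str], StubTokenizer],
    load_from_vocab: Callable[[str], StubTokenizer],
    vocab_file: str,
) -> StubTokenizer:
    """Try the normal loader first; re-load from the vocab file on a size mismatch."""
    tokenizer = load_pretrained(vocab_file)
    expected = count_vocab_entries(vocab_file)
    if tokenizer.vocab_size != expected:
        # The first loader apparently ignored vocab_file, so build from the file directly.
        tokenizer = load_from_vocab(vocab_file)
    return tokenizer


def demo_vocab_size() -> int:
    """Run the fallback against a four-token vocab file and return the final size."""

    def ignores_vocab_file(path: str) -> StubTokenizer:
        # Simulates a loader that returns the pretrained vocab regardless of path.
        return StubTokenizer(vocab_size=30522)

    def honors_vocab_file(path: str) -> StubTokenizer:
        return StubTokenizer(vocab_size=count_vocab_entries(path))

    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "vocab.txt")
        with open(path, "w", encoding="utf-8") as f:
            f.write("[PAD]\n[UNK]\nhello\nworld\n")
        return load_with_vocab_fallback(ignores_vocab_file, honors_vocab_file, path).vocab_size
```

Comparing sizes is only a heuristic: two different vocabularies of the same size would pass the check, so a stricter variant could compare the tokens themselves.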

ASR

  • Aggregate tokenizer vocab size tests: Updated expected vocab size from 254 to 264 across four test files (test_asr_ctc_encoder_model_bpe.py, test_asr_hybrid_rnnt_ctc_model_bpe.py, test_asr_hybrid_rnnt_ctc_model_bpe_prompt.py, test_asr_rnnt_encoder_model_bpe.py) — reflects new tokenizer behavior with updated transformers
  • Parallel chunking test (test_asr_multitask_model_bpe.py::test_aed_parallel_chunking): Relaxed exact text match to a >95% word similarity check (timestamps=True/False use different merge algorithms that may produce slight differences at chunk boundaries). Removed hardcoded expected values for final word/offset assertions
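A word-level similarity check of the kind described for the parallel chunking test might look like the sketch below. The PR text does not say how similarity is computed, so the use of difflib.SequenceMatcher and the helper name word_similarity are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def word_similarity(reference: str, hypothesis: str) -> float:
    """Return a 0..1 similarity ratio computed over whitespace-separated words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    # autojunk=False avoids the popularity heuristic that kicks in on long sequences.
    return SequenceMatcher(None, ref_words, hyp_words, autojunk=False).ratio()


def assert_transcripts_close(with_ts: str, without_ts: str, threshold: float = 0.95) -> None:
    """Fail if the two transcripts differ by more than the allowed margin."""
    score = word_similarity(with_ts, without_ts)
    assert score > threshold, f"word similarity {score:.3f} is not above {threshold}"
```

With this metric, a single differing word in a 40-word transcript still passes, while the same difference in a 9-word transcript does not, so the threshold is only meaningful for reasonably long outputs.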

TTS

  • T5 tokenizer vocab_size fix (magpietts.py): In transformers v5+, T5Tokenizer is a fast tokenizer whose vocab_size now includes extra_id sentinel tokens (e.g. 32100 = 32000 + 100). Added logic to subtract _extra_ids so the embedding size matches legacy checkpoints
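The arithmetic behind the T5 fix can be sketched as below. StubT5Tokenizer and embedding_vocab_size are hypothetical names, and the sketch assumes the reported vocab_size always includes the sentinel tokens; the actual logic in magpietts.py may decide differently when the subtraction applies.

```python
from dataclasses import dataclass


@dataclass
class StubT5Tokenizer:
    """Hypothetical stand-in exposing the two attributes named in the PR description."""

    vocab_size: int
    _extra_ids: int = 0


def embedding_vocab_size(tokenizer) -> int:
    """Size of the embedding table once extra_id sentinel tokens are excluded."""
    # Assumption: vocab_size already counts the sentinels, as described for transformers v5+.
    extra_ids = getattr(tokenizer, "_extra_ids", 0) or 0
    return tokenizer.vocab_size - extra_ids


# Example from the description: 32100 reported tokens, 100 of them sentinels.
legacy_size = embedding_vocab_size(StubT5Tokenizer(vocab_size=32100, _extra_ids=100))
```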

SpeechLM2

  • test_duplex_eartts.py: Fixed CI cached path check — now checks for the specific model subdirectory (/home/TestData/nvidia--NVIDIA-Nemotron-Nano-9B-v2/) instead of the broad /home/TestData/ directory
  • test_salm.py: Fixed expected tokenized output — removed extra trailing space before [/INST] token (whitespace handling change in updated tokenizers)
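The narrower cache check in test_duplex_eartts.py can be illustrated with the sketch below. The helper name resolve_model_source and the fallback to a Hub identifier are assumptions; the PR description only states that the test now checks for the model-specific subdirectory rather than the parent directory.

```python
import os


def resolve_model_source(cache_root: str, model_subdir: str, hub_id: str) -> str:
    """Prefer the cached model directory; otherwise fall back to the given identifier."""
    local_path = os.path.join(cache_root, model_subdir)
    # Checking cache_root alone would pass on any CI machine that has the data
    # directory, even when this particular model was never cached there.
    if os.path.isdir(local_path):
        return local_path
    return hub_id
```

A call such as resolve_model_source("/home/TestData", "nvidia--NVIDIA-Nemotron-Nano-9B-v2", hub_id) would then return the local path only when that exact subdirectory exists.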

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

@nithinraok nithinraok added the "Run CICD" and "r2.7.0" (Cherry-pick to r2.7.0 release branch) labels Feb 6, 2026
@github-actions github-actions Bot removed the Run CICD label Feb 6, 2026
@github-actions github-actions Bot removed the Run CICD label Feb 10, 2026
@github-actions github-actions Bot removed the Run CICD label Feb 10, 2026
@nithinraok nithinraok added the "Run CICD" label and removed the "r2.7.0" (Cherry-pick to r2.7.0 release branch) label Feb 18, 2026
@github-actions github-actions Bot added the "core" (Changes to NeMo Core) and "common" labels and removed the "Run CICD" label Feb 18, 2026
@github-actions github-actions Bot removed the Run CICD label Feb 19, 2026
Signed-off-by: nithinraok <nithinrao.koluguri@gmail.com>
Signed-off-by: nithinraok <nithinrao.koluguri@gmail.com>
author=None,
library='nemo',
language=None,
filter=['nemo'],
Collaborator

What is this change doing? Why is it needed? Where are we still using this mixin?

Member Author

This change updates the arguments to match the latest version.

This mixin provides an API for fetching NeMo models, pushing them to the HF Hub, and getting the hf_model_card. It was previously used for pushing NeMo models, but we now do that manually. As far as I can see, this file is now used only in tutorials and not in nemo/collections code. IMO we can remove this file during refactoring.
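For readers following this thread, the argument change under discussion can be sketched as folding the separate legacy keywords into a single filter list. build_model_filter_kwargs is a hypothetical helper written for illustration; it only builds a dictionary and does not call huggingface_hub, and it is not the implementation of get_hf_model_filter().

```python
from typing import Dict, Iterable, List, Optional, Union

TagLike = Optional[Union[str, Iterable[str]]]


def _as_list(value: TagLike) -> List[str]:
    """Normalize None, a single string, or an iterable of strings to a list."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return list(value)


def build_model_filter_kwargs(
    library: TagLike = None,
    language: TagLike = None,
    task: TagLike = None,
    tags: TagLike = None,
    author: Optional[str] = None,
) -> Dict[str, object]:
    """Merge the legacy keyword values into one de-duplicated filter list."""
    merged: List[str] = []
    for value in (library, language, task, tags):
        for tag in _as_list(value):
            if tag not in merged:
                merged.append(tag)
    # An empty list is returned as None so that no filter is applied.
    return {"author": author, "filter": merged or None}
```

Calling build_model_filter_kwargs(library="nemo") yields the same two values visible in the diff context above: author=None and filter=['nemo'].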

pzelasko previously approved these changes Mar 9, 2026
@github-actions github-actions Bot removed the Run CICD label Mar 9, 2026
Signed-off-by: nithinraok <nithinrao.koluguri@gmail.com>
Signed-off-by: nithinraok <nithinrao.koluguri@gmail.com>
Signed-off-by: nithinraok <nithinrao.koluguri@gmail.com>
Collaborator

@blisc blisc left a comment

Looks fine from TTS

@github-actions github-actions Bot removed the Run CICD label Mar 11, 2026
@github-actions
Contributor

[🤖]: Hi @nithinraok 👋,

We wanted to let you know that a CICD pipeline for this PR just finished successfully.

So it might be time to merge this PR or get some approvals.

@nithinraok nithinraok requested a review from pzelasko March 11, 2026 06:22
@nithinraok nithinraok merged commit 037573f into main Mar 11, 2026
127 checks passed
@nithinraok nithinraok deleted the update_transformers branch March 11, 2026 13:37

conversations = (
    guess_parse_cutset(cfg.inputs)
    .map(
Collaborator

Why was this change needed? 👀 @nithinraok

Member Author

It was failing because the cut was not present, so I had to change the order.

nune-tadevosyan pushed a commit to nune-tadevosyan/NeMo that referenced this pull request Mar 13, 2026

4 participants