
feat(huggingface): add qa and ranking tasks#5574

Open
anishshiva7 wants to merge 26 commits into
apache:mainfrom
ELin2025:hf/05-qa-ranking
@anishshiva7 commented Jun 8, 2026


⚠️ This PR is stacked on hf/04-audio-mediagen. Until that lands, the diff below may also include earlier HuggingFace task-family changes depending on which base GitHub is showing. The new code in this PR is codegen/QaRankingCodegen.scala, the QA/ranking-related additions to codegen/PythonCodegenBase.scala, the new QA/ranking fields on HuggingFaceInferenceOpDesc.scala, and the QA/ranking task tests in HuggingFaceInferenceOpDescSpec.scala. Once PR 4 merges and this PR is retargeted to main, the diff should auto-clean to the PR 5 QA/ranking changes only.

What changes were proposed in this PR?

Adds the QA/ranking/classification task family — 5 HF pipeline tasks — as a new TaskCodegen plugged into the dispatcher established by the text-generation PR:

QA tasks: question-answering, table-question-answering

classification/ranking tasks: zero-shot-classification, sentence-similarity, text-ranking

codegen/QaRankingCodegen.scala supplies the per-task payload + parse Python branches for all 5 tasks.

CodegenContext is extended with contextColumn, candidateLabels, and sentencesColumn (EncodableString).

HuggingFaceInferenceOpDesc.scala gains 3 new @JsonProperty fields and registers QaRankingCodegen in the dispatcher.
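As a rough illustration of the registration step, here is a minimal Python analogue of a dispatcher that maps each task string to the codegen that handles it. The real code is Scala and is not shown on this page, so the class shape, the `build_dispatch` helper, and the duplicate-registration check are all assumptions made for the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskCodegen:
    """Hypothetical stand-in for the Scala TaskCodegen trait."""
    name: str
    tasks: frozenset  # every HF pipeline task string this codegen handles


def build_dispatch(registered):
    """Flatten each codegen's task set into a single task -> codegen map."""
    dispatch = {}
    for codegen in registered:
        for task in codegen.tasks:
            # Assumption: a task string should belong to exactly one codegen.
            if task in dispatch:
                raise ValueError(f"task {task!r} registered twice")
            dispatch[task] = codegen
    return dispatch


TEXT_GEN = TaskCodegen("TextGenCodegen", frozenset({"text-generation"}))
QA_RANKING = TaskCodegen(
    "QaRankingCodegen",
    frozenset({
        "question-answering",
        "table-question-answering",
        "zero-shot-classification",
        "sentence-similarity",
        "text-ranking",
    }),
)

DISPATCH = build_dispatch([TEXT_GEN, QA_RANKING])
```

Under this shape, adding a task family means appending one codegen object to the registered list; the lookup by task string does not change.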

PythonCodegenBase.scala grows to host the shared QA/ranking infrastructure:

  • Per-row validation for the new column-named fields.
  • question-answering payload handling with prompt + context.
  • table-question-answering payload handling with table data.
  • zero-shot-classification payload handling with candidate labels.
  • sentence-similarity and text-ranking payload handling with sentence inputs.
  • Response parsing for QA/ranking outputs.
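The per-task payload branches could look roughly like the sketch below. The request shapes follow the Hugging Face Inference API task formats as I understand them; the payloads this PR actually generates may differ. The `build_payload` helper, its parameters, the use of the context column for table data, and the comma-separated label string are assumptions for illustration. text-ranking is left out because its request shape is not described here.

```python
def build_payload(task, row, prompt_col, context_col=None,
                  candidate_labels=None, sentences_col=None):
    """Return a JSON-serializable request body for one input row (sketch)."""

    def require(col):
        # Per-row validation: the configured column must exist and be non-empty.
        if col is None or row.get(col) in (None, ""):
            raise ValueError(f"{task}: missing value for column {col!r}")
        return row[col]

    if task == "question-answering":
        return {"inputs": {"question": require(prompt_col),
                           "context": require(context_col)}}

    if task == "table-question-answering":
        # Assumption: the table is a dict of column name -> list of strings.
        return {"inputs": {"query": require(prompt_col),
                           "table": require(context_col)}}

    if task == "zero-shot-classification":
        # Assumption: labels arrive as one comma-separated string.
        labels = [s.strip() for s in (candidate_labels or "").split(",") if s.strip()]
        if not labels:
            raise ValueError("zero-shot-classification needs at least one label")
        return {"inputs": require(prompt_col),
                "parameters": {"candidate_labels": labels}}

    if task == "sentence-similarity":
        sentences = require(sentences_col)
        if isinstance(sentences, str):
            sentences = [sentences]  # treat a lone string as a one-item list
        return {"inputs": {"source_sentence": require(prompt_col),
                           "sentences": list(sentences)}}

    raise ValueError(f"unsupported task: {task}")
```

Raising on a missing column value before any request is built keeps a bad row from turning into an opaque provider error.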

User-input strings continue to flow through pyb"..." + EncodableString, so they reach Python as self.decode_python_template('<base64>') rather than as raw literals. PythonCodeRawInvalidTextSpec still passes: 117/117 descriptors compile cleanly under py_compile.
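To show why the base64 detour matters, here is a small round-trip sketch. `encode_for_template` is a hypothetical stand-in for what the pyb"..." macro emits, and the body of `decode_python_template` is assumed to be a plain base64 plus UTF-8 decode; neither implementation appears on this page.

```python
import base64


def encode_for_template(user_text: str) -> str:
    """Codegen side (hypothetical): emit a Python expression for user text.

    The standard base64 alphabet has no quotes, backslashes, or newlines,
    so the value cannot terminate the string literal it is placed in.
    """
    b64 = base64.b64encode(user_text.encode("utf-8")).decode("ascii")
    return f"self.decode_python_template('{b64}')"


class Operator:
    """Minimal stand-in for the generated operator class."""

    def decode_python_template(self, b64: str) -> str:
        # Assumed implementation: reverse the encoding done at codegen time.
        return base64.b64decode(b64).decode("utf-8")


def round_trip(user_text: str) -> str:
    """Evaluate the generated expression the way the operator would."""
    expr = encode_for_template(user_text)
    return eval(expr, {"self": Operator()})


hostile = "x'); import os; os.system('echo pwned') #"
generated = encode_for_template(hostile)
```

In this sketch `generated` contains exactly two single quotes, the ones that delimit the literal, and `round_trip(hostile)` returns the original text unchanged.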

Any related issues, documentation, or discussions?

Tracking issue: Add HuggingFace question answering and ranking tasks #5292

Closes #5292

Stacked on: PR 4 audio/media generation tasks / hf/04-audio-mediagen

Parent issue: Add Hugging Face inference operator #5041

Closed sibling issue: Add HuggingFaceModelResource REST endpoints for HF operator UI #5134

How was this PR tested?

sbt "WorkflowOperator/compile; WorkflowOperator/Test/compile" clean.

sbt "WorkflowOperator/testOnly org.apache.texera.amber.operator.huggingFace.HuggingFaceInferenceOpDescSpec org.apache.texera.amber.util.PythonCodeRawInvalidTextSpec" — 31 focused tests pass, including HuggingFace QA/ranking task coverage and the raw Python descriptor scan.

sbt "WorkflowOperator / scalafmtCheck" clean.

sbt "WorkflowOperator / Test / scalafmtCheck" clean.

PythonCodeRawInvalidTextSpec — 117/117 descriptors py_compile cleanly with the new operator code paths, no marker leaks.

Was this PR authored or co-authored using generative AI tooling?

Yes, co-authored with generative AI tooling (Codex).

PG1204 and others added 26 commits May 17, 2026 13:02
…d media proxy

Introduces a new Jersey REST resource exposing endpoints used by the
upcoming HuggingFace operator UI:

- GET  /api/huggingface/models       — browse / search models per task
- GET  /api/huggingface/tasks        — list HF pipeline tags with hosted inference
- POST /api/huggingface/upload-audio — upload audio for HF audio tasks
- GET  /api/huggingface/audio-preview — stream uploaded audio (path-validated)
- GET  /api/huggingface/media-proxy   — proxy remote media URLs to bypass CORS

This is the first PR in a stacked series landing the HF operator end-to-end.
No operator code yet; this resource is independently useful and lets the
frontend integrate with HF before the operator class lands.
Addresses xuang7's review on PR apache#5124 — both endpoints previously
buffered the full payload into a heap-resident byte[] with no upper
bound, leaving the JVM open to OOM on a hostile or buggy upstream
response (/media-proxy) or out-of-band write into the audio temp dir
(/audio-preview).

- /media-proxy: switch from Unirest.asBytes() to
  asObject(Function<RawResponse, T>), streaming the upstream body in
  8 KiB chunks with a running byte counter. Aborts with 413 if the
  declared Content-Length exceeds the cap (pre-check) or if the body
  crosses the cap mid-read (defends against missing/lying
  Content-Length). New MAX_MEDIA_PROXY_BYTES = 50 MiB, sized for HF
  inference media (text-to-image ~5 MiB, text-to-video ~30 MiB) with
  headroom.
- /audio-preview: add Files.size() defense-in-depth check before
  readAllBytes. /upload-audio already enforces MAX_AUDIO_BYTES on
  ingest; this catches the case where a bug or out-of-band write puts
  an oversized file in the temp dir.
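The capped read described for /media-proxy can be sketched as follows. This is a Python analogue of the technique, not the Unirest-based code from the commit; `read_with_cap` and `PayloadTooLarge` are hypothetical names, while the 8 KiB chunk size and 50 MiB cap come from the message above.

```python
import io

MAX_MEDIA_PROXY_BYTES = 50 * 1024 * 1024  # 50 MiB cap from the commit message
CHUNK_SIZE = 8 * 1024                     # 8 KiB read size


class PayloadTooLarge(Exception):
    """The caller would translate this into an HTTP 413 response."""


def read_with_cap(stream, declared_length=None, cap=MAX_MEDIA_PROXY_BYTES):
    """Read a binary stream fully, refusing to buffer more than `cap` bytes."""
    # Pre-check: trust a declared Content-Length only enough to reject early.
    if declared_length is not None and declared_length > cap:
        raise PayloadTooLarge(f"declared {declared_length} bytes exceeds {cap}")

    total = 0
    chunks = []
    while True:
        chunk = stream.read(CHUNK_SIZE)
        if not chunk:
            break
        total += len(chunk)
        # Mid-read check: covers a missing or understated Content-Length.
        if total > cap:
            raise PayloadTooLarge(f"body exceeded {cap} bytes")
        chunks.append(chunk)
    return b"".join(chunks)


small = read_with_cap(io.BytesIO(b"a" * 100), declared_length=100, cap=200)
```

The running counter is what bounds memory: at most `cap` bytes plus one chunk are ever held, regardless of what the upstream claims.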

Adds a spec covering the audio-preview cap using a sparse-file fixture
so the test stays fast (87/87 spec passes). The media-proxy cap path
is exercised via the existing input-validation suite plus the new
streamMediaWithCap helper - a follow-up can add a fake-RawResponse
unit test if reviewers want explicit coverage of the chunked-read cap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>


Per review on apache#5124 (xuang7, Ma77Ball): mark the resource with
@RolesAllowed(Array("REGULAR", "ADMIN")) to document that all five
endpoints require an authenticated user. The annotation isn't enforced
yet — that's coming with the auth-enforcement PR @Yicong-Huang and
@Ma77Ball are working on — but adding it now means no follow-up
change is needed when enforcement lands, and it matches the convention
used by UserConfigResource / AdminSettingsResource.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…eration

Splits the monolithic 1,278-line HuggingFaceInferenceOpDesc from the
team's feature branch into a dispatcher + per-task codegen architecture
and ships the first task family (text-generation) end-to-end.

- TaskCodegen trait + CodegenContext model the per-task variation
- PythonCodegenBase emits the shared provider-fallback / process_table /
  _parse_response infrastructure with two holes for the per-task payload
  and parse snippets
- TextGenCodegen supplies text-generation's chat-completions payload and
  the body["choices"][0]["message"]["content"] parse branch
- HuggingFaceInferenceOpDesc becomes a thin dispatcher (~180 lines)
  holding @JsonProperty fields and the registeredCodegens map

User-input string fields are typed as EncodableString and emitted via
the pyb"..." macro so values reach Python as
self.decode_python_template('<base64>') rather than raw literals; class
constants are assigned in open(self) so self is in scope for the decode
call. Generated process_table runs a defensive _HF_MODEL_ID_PATTERN
check at runtime before any HF URL is composed.

PR 2 of a stacked 9-PR series. PR 1 (apache#5124) ships the supporting REST
resource; PRs 3-5 will add image, audio + media-gen, and QA/ranking
task families by registering new *Codegen objects in the dispatcher.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…degen specs

Addresses Codecov's 66.85% patch coverage warning by exercising the
defensive null-handling branches in HuggingFaceInferenceOpDesc.scala and
the TextGenCodegen contract that previously had no spec hits.

- null-tolerance: feed null into every @JsonProperty (token, model, prompt
  col, system prompt, result col, task, maxNewTokens, temperature) and
  assert generatePythonCode still emits a parseable ProcessTableOperator
  with sane defaults (TASK falls back to text-generation, MAX_NEW_TOKENS
  clamps to 256, TEMPERATURE to 0.7). Covers the `if (x == null) ... else
  x` branches that previously had no test that took the null side.
- TextGenCodegen.task: trivial canonical-value check.
- TextGenCodegen ctx-independence: pass an "irrelevant"-filled ctx and
  assert payloadPython / parsePython still reference self.MODEL_ID and
  body["choices"]…. Catches a future refactor that accidentally splices
  ctx fields into the static snippets.

13/13 in HuggingFaceInferenceOpDescSpec, 2/2 in PythonCodeRawInvalidTextSpec
(117/117 descriptors still py_compile cleanly).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Plugs the 9-task image family into the dispatcher pattern established
in PR 2:

  image-only      image-classification, object-detection,
                  image-segmentation, image-to-text
  image + prompt  visual-question-answering, document-question-answering,
                  zero-shot-image-classification, image-text-to-text,
                  image-to-image

- ImageTaskCodegen supplies payload + parse Python for all 9 tasks
- TaskCodegen trait gains a `tasks: Set[String]` default method so a
  single codegen can register under multiple task strings; the
  dispatcher map in HuggingFaceInferenceOpDesc is built from
  registeredCodegens.tasks.flatMap(...)
- CodegenContext extended with imageInput + inputImageColumn
  (EncodableString)
- HuggingFaceInferenceOpDesc gains 2 new @JsonProperty fields and
  registers ImageTaskCodegen

PythonCodegenBase grows to host the shared image infrastructure:
- image_only_tasks / image_prompt_tasks / image_tasks tuples and
  image_headers in process_table
- per-row image bytes resolution from upload (self._read_image_input)
  or input column (self._read_binary_value + self._compress_image_bytes)
- use_raw_binary_body / raw_binary_headers state threaded through
  _post_with_fallback (signature extended)
- _post_with_fallback adds the image-text-to-text chat-completions
  branch and the model-author vision branch
- _call_provider adds branches for zai-org's custom API, Replicate
  predictions + polling, Fal-ai, Wavespeed submit+poll, and image
  embedding in OpenAI-compatible / unknown-provider fallbacks
- image-content-type response handling returns data:image URLs
- image helpers added: _read_image_input, _compress_image_bytes,
  _image_input_as_base64, _read_binary_value, _looks_like_html,
  _html_to_image_bytes, _extract_json_arg, _url_to_data_url

User-input strings continue to flow through pyb"..." + EncodableString
so they reach Python as self.decode_python_template('<base64>') rather
than raw literals. PythonCodeRawInvalidTextSpec still passes
(117/117 descriptors py_compile cleanly).

Frontend integration adds only the HF lines (no agent / dataset
noise from the source branch):
- HuggingFaceImageUploadComponent declared in app.module.ts
- huggingface-image-upload formly type registered in formly-config.ts
- Image upload component .ts/.html/.scss cherry-picked from huggingFace
- HuggingFace.png + sample-image.png assets

PR 3 of a stacked 9-PR series. Stacks on hf/02-operator-textgen.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
github-actions bot added the frontend (Changes related to the frontend GUI) and common labels Jun 8, 2026
codecov-commenter commented Jun 8, 2026

Codecov Report

❌ Patch coverage is 78.35821% with 58 lines in your changes missing coverage. Please review.
✅ Project coverage is 52.33%. Comparing base (e987f13) to head (8507ca5).
⚠️ Report is 9 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...mage-upload/hugging-face-image-upload.component.ts | 50.00% | 41 Missing and 1 partial ⚠️ |
| ...ge-upload/hugging-face-image-upload.component.html | 38.88% | 11 Missing ⚠️ |
| ...rator/huggingFace/HuggingFaceInferenceOpDesc.scala | 95.58% | 0 Missing and 3 partials ⚠️ |
| ...ber/operator/huggingFace/codegen/TaskCodegen.scala | 88.88% | 2 Missing ⚠️ |
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #5574      +/-   ##
============================================
+ Coverage     51.88%   52.33%   +0.44%     
- Complexity     2472     2520      +48     
============================================
  Files          1067     1078      +11     
  Lines         41258    41566     +308     
  Branches       4437     4467      +30     
============================================
+ Hits          21408    21752     +344     
+ Misses        18591    18544      -47     
- Partials       1259     1270      +11     
| Flag | Coverage Δ | *Carryforward flag |
|---|---|---|
| access-control-service | 64.61% <ø> (+22.39%) | ⬆️ |
| agent-service | 33.76% <ø> (ø) | Carriedforward from 8a83dc2 |
| amber | 53.73% <96.96%> (+0.85%) | ⬆️ |
| computing-unit-managing-service | 1.65% <ø> (ø) | |
| config-service | 56.06% <ø> (ø) | |
| file-service | 38.32% <ø> (ø) | |
| frontend | 46.42% <48.54%> (+0.05%) | ⬆️ |
| pyamber | 90.69% <ø> (ø) | Carriedforward from 8a83dc2 |
| python | 90.83% <ø> (ø) | Carriedforward from 8a83dc2 |
| workflow-compiling-service | 58.69% <ø> (ø) | |

*This pull request uses carry forward flags.

@anishshiva7


/request-review @Ma77Ball


Labels

common frontend Changes related to the frontend GUI


Development

Successfully merging this pull request may close these issues.

Add HuggingFace question answering and ranking tasks

5 participants