Compute Worker - Fix submission files duplication#2285

Closed
ihsaan-ullah wants to merge 10 commits into develop from compute_worker_duplicate_submissions
Conversation

@ihsaan-ullah
Collaborator

@ihsaan-ullah ihsaan-ullah commented Mar 21, 2026

Description

This PR has the following updates:

.env_sample

  • updated for minor changes to WORKER_BUNDLE_URL_REWRITE
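
The worker logs further down this thread show the effect of this setting: a bundle URL pointing at `docker.for.mac.localhost:9000` is rewritten to `minio:9000` before download. A minimal sketch of that kind of host rewrite is below. The format of `WORKER_BUNDLE_URL_REWRITE` itself is not shown in this thread, so the hypothetical helper `rewrite_bundle_host` takes the old and new host as explicit arguments instead of parsing the variable; it is not the worker's `rewrite_bundle_url_if_needed`.

```python
from urllib.parse import urlsplit, urlunsplit


def rewrite_bundle_host(url, old_netloc, new_netloc):
    """Swap the host:port of `url` when it matches `old_netloc` (hypothetical helper)."""
    parts = urlsplit(url)
    if parts.netloc != old_netloc:
        # Leave URLs for other hosts untouched.
        return url
    # Path and query string (including the signed parameters) are kept verbatim.
    return urlunsplit(parts._replace(netloc=new_netloc))


# Illustrative URL, shortened from the shape seen in the logs.
example = (
    "http://docker.for.mac.localhost:9000/private/dataset/scoring_program.zip"
    "?AWSAccessKeyId=testkey&Expires=1774964722"
)
rewritten = rewrite_bundle_host(example, "docker.for.mac.localhost:9000", "minio:9000")
```

Only the network location changes, so any difference in behaviour between the original and rewritten URL would come from the storage server, not from the path or query.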

src/apps/competitions/tasks.py

  • updated to send submission during both ingestion and scoring, and some other minor updates for clarity

compute_worker.py

  • separate class SubmissionStatus for submission statuses and related updates
  • separate class ProgramKind for program kind and related updates
  • introduced a new class Settings that gathers all env variables. This class is now used throughout the code for accessing settings. NOTE: This can be further improved by converting strings to booleans so that we no longer compare against `== "true"` or `== "false"`.
  • organized imports
  • renamed program to scoring_program
  • updated the start function to clarify which functions run during ingestion and which during scoring, and clearly separated submission and scoring program
  • made submission available regardless of ingestion or scoring in app/ingested_program
  • rewrote _run_program_directory function to simplify code and logs
  • moved some code from _run_program_directory to a new function _create_container for reusability and clarity
  • minor changes to watch_detailed_results function to avoid looping forever
  • replaced container_id (undeclared) by container.get("Id") in `_run_container_engine_cmd`
  • [TO DISCUSS] commented out some code at the end of the start function (introduced in #2030, "Worker status to FAILED instead of SCORING or FINISHED in case of failure") that was failing the submission even when everything went well. We need to discuss this and update the code
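
The NOTE on the Settings class above suggests converting string env variables to booleans. A minimal sketch of what that could look like follows, assuming a frozen dataclass; the field names, the accepted spellings of true/false, and the helper `env_bool` are hypothetical and not taken from the PR. The two variable names used (`CODALAB_IGNORE_CLEANUP_STEP`, `WORKER_BUNDLE_URL_REWRITE`) are the ones that appear in this thread.

```python
import os
from dataclasses import dataclass

_TRUE = {"1", "true", "yes", "on"}
_FALSE = {"0", "false", "no", "off", ""}


def env_bool(name, default=False, environ=None):
    """Parse an env variable as a boolean, failing loudly on unknown values."""
    environ = os.environ if environ is None else environ
    raw = environ.get(name)
    if raw is None:
        return default
    value = raw.strip().lower()
    if value in _TRUE:
        return True
    if value in _FALSE:
        return False
    raise ValueError(f"{name}={raw!r} is not a recognised boolean")


@dataclass(frozen=True)
class Settings:
    # Hypothetical field names; the real class may expose different attributes.
    ignore_cleanup_step: bool
    bundle_url_rewrite: str

    @classmethod
    def from_env(cls, environ=None):
        environ = os.environ if environ is None else environ
        return cls(
            ignore_cleanup_step=env_bool("CODALAB_IGNORE_CLEANUP_STEP", environ=environ),
            bundle_url_rewrite=environ.get("WORKER_BUNDLE_URL_REWRITE", ""),
        )


settings = Settings.from_env({"CODALAB_IGNORE_CLEANUP_STEP": "True"})
```

With this shape, call sites can write `if settings.ignore_cleanup_step:` instead of comparing strings, and a misspelled value fails at startup instead of silently evaluating as false.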

Issues this PR resolves

A checklist for hand testing

  • Upload this modified mini autoML bundle and check in the ingestion logs and scoring logs that submission files are accessible: Modified_Bundle.zip
  • Use this submission to test the bundle: submittion.zip
  • Carefully check the code to make sure there are no bugs

Checklist

  • Code review by me
  • Hand tested by me
  • I'm proud of my work
  • Code review by reviewer
  • Hand tested by reviewer
  • CircleCi tests are passing
  • Ready to merge

@ihsaan-ullah ihsaan-ullah marked this pull request as draft March 21, 2026 23:50
@ihsaan-ullah ihsaan-ullah marked this pull request as ready for review March 22, 2026 11:24
@ihsaan-ullah ihsaan-ullah changed the title from "Compute Worer - Fix submission files duplication" to "Compute Worker - Fix submission files duplication" Mar 22, 2026
@Didayolo Didayolo mentioned this pull request Mar 23, 2026
20 tasks
@Didayolo Didayolo marked this pull request as draft March 23, 2026 13:30
@Didayolo Didayolo requested review from Didayolo and ObadaS March 24, 2026 14:41
@Didayolo
Member

  • Put the duplication on scoring (ingested program --> results) to avoid breaking current competition design
  • In the future we can optimize this more

@ihsaan-ullah ihsaan-ullah marked this pull request as ready for review March 24, 2026 18:54
@Didayolo Didayolo force-pushed the compute_worker_duplicate_submissions branch from 3fa7c4d to e18c68f Compare March 25, 2026 11:53
@Didayolo
Member

Upload this modified mini autoML bundle to check in ingestion logs and scoring logs that submission files are accessible Modified_Bundle.zip

@ihsaan-ullah Is it necessary to change the scoring program / ingestion program for this to work?

When trying locally I get this failure (haven't put back the commented function):

[Screenshot, 2026-03-26 at 12:54:21]

@ihsaan-ullah
Collaborator Author

ihsaan-ullah commented Mar 26, 2026

I think the failure will go away if you comment out the following code:

  # Check if scoring program failed
  try:
      program_results, _, _ = task_results
  except:
      program_results, _ = task_results
  # Gather returns either normal values or exception instances when return_exceptions=True
  had_async_exc = isinstance(
      program_results, BaseException
  ) and not isinstance(program_results, asyncio.CancelledError)
  program_rc = getattr(self, "program_exit_code", None)
  failed_rc = (program_rc is None) or (program_rc != 0)
  if had_async_exc or failed_rc:
      self._update_status(
          SubmissionStatus.FAILED,
          extra_information=f"program_rc={program_rc}, async={task_results}",
      )
      # Raise so upstream marks failed immediately
      raise SubmissionException("Child task failed or non-zero return code")

The bundle and submission I have provided should work. There is nothing special in the ingestion/scoring programs: they just print the content of different directories. The important thing to check is that the scoring input directory contains both the predictions from ingestion and the submission files.
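
The failure check quoted above depends on how `asyncio.gather` behaves with `return_exceptions=True`: exceptions are returned as values in the result list instead of being raised. A small self-contained sketch of that behaviour is below. The coroutines `ok` and `boom` and the helper `first_failure` are hypothetical stand-ins, not worker code, and unlike the quoted snippet (which inspects only the first result) this helper scans every result.

```python
import asyncio


async def ok():
    # Stand-in for a task that finishes normally.
    return 0


async def boom():
    # Stand-in for a task that crashes.
    raise RuntimeError("scoring program crashed")


def first_failure(results):
    """Return the first real exception in gather() results, or None.

    Cancellation is ignored, mirroring the isinstance checks in the quoted code.
    """
    for item in results:
        if isinstance(item, BaseException) and not isinstance(item, asyncio.CancelledError):
            return item
    return None


async def main():
    # With return_exceptions=True the RuntimeError is returned, not raised.
    return await asyncio.gather(ok(), boom(), return_exceptions=True)


results = asyncio.run(main())
failure = first_failure(results)
```

Here `results` holds the normal return value of `ok()` followed by the `RuntimeError` instance, which is why the quoted check uses `isinstance(..., BaseException)` and not a `try`/`except` around the gather call.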

@Didayolo
Member

I get the same error when it is commented out.

Compute worker logs:

compute_worker  | 2026-03-26 13:45:23.289 | INFO     | compute_worker:_get_bundle:715 - Getting bundle http://docker.for.mac.localhost:9000/private/dataset/2026-03-26-1774525773/3efef493ce29/scoring_program.zip?AWSAccessKeyId=testkey&Signature=Djg5C%2FACl%2F%2BSE%2BtTfuiHEfqTsvc%3D&Expires=1774964722 to unpack @ scoring_program
compute_worker  | 2026-03-26 13:45:23.303 | INFO     | compute_worker:rewrite_bundle_url_if_needed:278 - Rewriting bundle URL for worker: http://docker.for.mac.localhost:9000/private/dataset/2026-03-26-1774525773/3efef493ce29/scoring_program.zip?AWSAccessKeyId=testkey&Signature=Djg5C%2FACl%2F%2BSE%2BtTfuiHEfqTsvc%3D&Expires=1774964722 -> http://minio:9000/private/dataset/2026-03-26-1774525773/3efef493ce29/scoring_program.zip?AWSAccessKeyId=testkey&Signature=Djg5C%2FACl%2F%2BSE%2BtTfuiHEfqTsvc%3D&Expires=1774964722
compute_worker  | 2026-03-26 13:45:23.334 | INFO     | compute_worker:_get_bundle:715 - Getting bundle http://docker.for.mac.localhost:9000/private/dataset/2026-03-26-1774532706/9b6d0e636deb/dev-submission.zip?AWSAccessKeyId=testkey&Signature=nWAJSIXuyqc7o0n2o6V0Pakv5O8%3D&Expires=1774964722 to unpack @ submission
compute_worker  | 2026-03-26 13:45:23.337 | INFO     | compute_worker:rewrite_bundle_url_if_needed:278 - Rewriting bundle URL for worker: http://docker.for.mac.localhost:9000/private/dataset/2026-03-26-1774532706/9b6d0e636deb/dev-submission.zip?AWSAccessKeyId=testkey&Signature=nWAJSIXuyqc7o0n2o6V0Pakv5O8%3D&Expires=1774964722 -> http://minio:9000/private/dataset/2026-03-26-1774532706/9b6d0e636deb/dev-submission.zip?AWSAccessKeyId=testkey&Signature=nWAJSIXuyqc7o0n2o6V0Pakv5O8%3D&Expires=1774964722
compute_worker  | 2026-03-26 13:45:23.367 | INFO     | compute_worker:_get_bundle:715 - Getting bundle http://docker.for.mac.localhost:9000/private/dataset/2026-03-26-1774525774/101b225ebcef/reference_data.zip?AWSAccessKeyId=testkey&Signature=W1iCOUyCG7Q09Ae2OP6svrKTkEs%3D&Expires=1774964722 to unpack @ input/ref
compute_worker  | 2026-03-26 13:45:23.399 | INFO     | compute_worker:_get_bundle:715 - Getting bundle http://docker.for.mac.localhost:9000/private/prediction_result/2026-03-26-1774532716/0d8e3f2c618b/prediction_result.zip?AWSAccessKeyId=testkey&Signature=7fMro2rHx3QOveS8tgcYsIwuqSM%3D&Expires=1774964722 to unpack @ input/res
compute_worker  | 2026-03-26 13:45:23.404 | INFO     | compute_worker:rewrite_bundle_url_if_needed:278 - Rewriting bundle URL for worker: http://docker.for.mac.localhost:9000/private/prediction_result/2026-03-26-1774532716/0d8e3f2c618b/prediction_result.zip?AWSAccessKeyId=testkey&Signature=7fMro2rHx3QOveS8tgcYsIwuqSM%3D&Expires=1774964722 -> http://minio:9000/private/prediction_result/2026-03-26-1774532716/0d8e3f2c618b/prediction_result.zip?AWSAccessKeyId=testkey&Signature=7fMro2rHx3QOveS8tgcYsIwuqSM%3D&Expires=1774964722
compute_worker  | 2026-03-26 13:45:23.413 | WARNING  | compute_worker:_get_bundle:754 - Failed. Retrying in 20 seconds...
compute_worker  | 2026-03-26 13:45:43.547 | WARNING  | compute_worker:_get_bundle:754 - Failed. Retrying in 20 seconds...
compute_worker  | 2026-03-26 13:46:03.722 | WARNING  | compute_worker:_get_bundle:754 - Failed. Retrying in 20 seconds...
compute_worker  | 2026-03-26 13:46:23.911 | WARNING  | compute_worker:_get_bundle:754 - Failed. Retrying in 20 seconds...
compute_worker  | 2026-03-26 13:46:43.987 | WARNING  | compute_worker:_get_bundle:754 - Failed. Retrying in 20 seconds...
compute_worker  | 2026-03-26 13:47:04.051 | WARNING  | compute_worker:_get_bundle:754 - Failed. Retrying in 20 seconds...
compute_worker  | 2026-03-26 13:47:24.147 | WARNING  | compute_worker:_get_bundle:754 - Failed. Retrying in 20 seconds...
compute_worker  | 2026-03-26 13:47:44.192 | WARNING  | compute_worker:_get_bundle:754 - Failed. Retrying in 20 seconds...
compute_worker  | 2026-03-26 13:48:04.299 | WARNING  | compute_worker:_get_bundle:754 - Failed. Retrying in 20 seconds...
compute_worker  | 2026-03-26 13:48:24.390 | INFO     | compute_worker:_update_submission:602 - Updating submission @ http://django:8000/api/submissions/93/ with data = {'status': 'Failed', 'status_details': 'Submission failed: Bad or empty zip file. See logs for more details.', 'secret': 'd508ee2d-3495-45fd-9565-92a95f45a797'}
compute_worker  | 2026-03-26 13:48:25.903 | INFO     | compute_worker:_update_submission:606 - Submission updated successfully!
compute_worker  | 2026-03-26 13:48:25.906 | WARNING  | compute_worker:clean_up:1529 - CODALAB_IGNORE_CLEANUP_STEP mode enabled, ignoring clean up of: /codabench/uPK-1_sID-93__br4xlja2
compute_worker  | 2026-03-26 13:48:26.060 | ERROR    | celery.app.trace:_log_error:285 - Task compute_worker_run[a835a084-8264-4c0d-8da0-d65a37b06bc7] raised unexpected: SubmissionException('Bad or empty zip file')
compute_worker  | Traceback (most recent call last):
compute_worker  | 
compute_worker  |   File "/app/compute_worker.py", line 746, in _get_bundle
compute_worker  |     with ZipFile(bundle_file, "r") as z:
compute_worker  |          │       └ '/codabench/uPK-1_sID-93__br4xlja2/bundles/tmp4lfxu7ek'
compute_worker  |          └ <class 'zipfile.ZipFile'>
compute_worker  | 
compute_worker  |   File "/root/.local/share/uv/python/cpython-3.13.11-linux-aarch64-gnu/lib/python3.13/zipfile/__init__.py", line 1401, in __init__
compute_worker  |     self._RealGetContents()
compute_worker  |     │    └ <function ZipFile._RealGetContents at 0xffff7f042980>
compute_worker  |     └ <zipfile.ZipFile [closed]>
compute_worker  |   File "/root/.local/share/uv/python/cpython-3.13.11-linux-aarch64-gnu/lib/python3.13/zipfile/__init__.py", line 1468, in _RealGetContents
compute_worker  |     raise BadZipFile("File is not a zip file")
compute_worker  |           └ <class 'zipfile.BadZipFile'>
compute_worker  | 
compute_worker  | zipfile.BadZipFile: File is not a zip file
compute_worker  | 
compute_worker  | 
compute_worker  | During handling of the above exception, another exception occurred:
compute_worker  | 
compute_worker  | 
compute_worker  | Traceback (most recent call last):
compute_worker  | 
compute_worker  | > File "/.venv/lib/python3.13/site-packages/celery/app/trace.py", line 479, in trace_task
compute_worker  |     R = retval = fun(*args, **kwargs)
compute_worker  |   File "/.venv/lib/python3.13/site-packages/celery/app/trace.py", line 779, in __protected_call__
compute_worker  |     return self.run(*args, **kwargs)
compute_worker  | 
compute_worker  |   File "/app/compute_worker.py", line 294, in run_wrapper
compute_worker  |     run.prepare()
compute_worker  | 
compute_worker  |   File "/app/compute_worker.py", line 1260, in prepare
compute_worker  |     zip_file = self._get_bundle(url, path, cache=cache_this_bundle)
compute_worker  | 
compute_worker  |   File "/app/compute_worker.py", line 752, in _get_bundle
compute_worker  |     raise SubmissionException("Bad or empty zip file")
compute_worker  | 
compute_worker  | compute_worker.SubmissionException: Bad or empty zip file
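
The traceback shows `ZipFile` rejecting the downloaded file, but not what the file actually contained. One way to narrow this down, sketched under the assumption that the downloaded file is still on disk, is to check it with `zipfile.is_zipfile` and print its first bytes, which would reveal for example an error body returned by the storage server. `describe_download` is a hypothetical helper, not part of compute_worker.py.

```python
import zipfile
from pathlib import Path


def describe_download(path, preview_bytes=200):
    """Summarise a downloaded bundle: valid zip, or size plus leading bytes."""
    path = Path(path)
    size = path.stat().st_size
    if size and zipfile.is_zipfile(path):
        return f"{path.name}: valid zip, {size} bytes"
    # Not a zip: show the beginning of the file to help identify what was served.
    with path.open("rb") as fh:
        head = fh.read(preview_bytes)
    return f"{path.name}: not a zip ({size} bytes), starts with {head!r}"
```

Calling it on the temporary bundle path from the traceback would distinguish an empty download from a non-zip response body.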

@ihsaan-ullah
Collaborator Author

This seems like a different issue. Maybe rerun the submission. If it keeps happening, I will try to reproduce it on my side.

@Didayolo
Member

I am pretty sure the problem is related to the PR.

  • Back on develop, the submission works fine.
  • The failure happens when trying to get the prediction result:
compute_worker  | 2026-03-26 13:45:23.399 | INFO     | compute_worker:_get_bundle:715 - Getting bundle http://docker.for.mac.localhost:9000/private/prediction_result/2026-03-26-1774532716/0d8e3f2c618b/prediction_result.zip?AWSAccessKeyId=testkey&Signature=7fMro2rHx3QOveS8tgcYsIwuqSM%3D&Expires=1774964722 to unpack @ input/res

To reproduce it, I am using:

@ihsaan-ullah
Collaborator Author

I will check this. I am also going to split the main change and the other cleanup changes into separate, simpler PRs.

@Didayolo
Member

I think it will be clearer this way. Can we delete this branch?

@ihsaan-ullah ihsaan-ullah deleted the compute_worker_duplicate_submissions branch March 27, 2026 12:23