When running benchmarks (e.g., swebenchmultimodal), scripts often expect pure JSON output on stdout to be piped to tools like jq.
However, benchmarks/utils/console_logging.py currently configures the console handler to write to stdout when rich logging is enabled or left at its default.
This breaks the pipeline whenever a library (such as OpenTelemetry) logs an error or warning. For example, a "Failed to detach context" error from OpenTelemetry is printed to stdout, which then causes jq to fail with:
jq: parse error: Invalid numeric literal
This happens because the log message is mixed with the JSON output.
Fix: Ensure all console logs are written to stderr, leaving stdout exclusively for the script's intended output (JSON).
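
The intended behavior can be sketched with the standard library alone. This is a minimal illustration, not the actual contents of benchmarks/utils/console_logging.py: the function names `configure_console_logging` and `emit_result` and the logger name `benchmarks` are assumptions made for the example.

```python
import json
import logging
import sys


def configure_console_logging(level: int = logging.INFO) -> logging.Logger:
    """Attach a single stderr handler to a (hypothetical) 'benchmarks' logger."""
    logger = logging.getLogger("benchmarks")  # logger name is an assumption
    logger.setLevel(level)
    logger.handlers.clear()  # avoid stacking handlers on repeated calls

    # Console logs go to stderr so that stdout carries only the JSON result.
    handler = logging.StreamHandler(sys.stderr)
    handler.setFormatter(logging.Formatter("%(levelname)s %(name)s: %(message)s"))
    logger.addHandler(handler)
    logger.propagate = False  # keep records away from any stdout root handler
    return logger


def emit_result(result: dict) -> None:
    """Write the machine-readable result to stdout, one JSON document per line."""
    json.dump(result, sys.stdout)
    sys.stdout.write("\n")


if __name__ == "__main__":
    log = configure_console_logging()
    log.warning("Failed to detach context")  # lands on stderr
    emit_result({"status": "ok"})  # the only thing written to stdout
```

With this split, piping the script's stdout into jq sees only the JSON line, while warnings remain visible on the terminal. If the console handler is Rich-based, passing a `rich.console.Console(stderr=True)` to the handler should achieve the same separation, though how that fits the existing module is not shown here.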