OMNIML-5128 Capture Docker experiment id by ChenhanYu · Pull Request #1840 · NVIDIA/Model-Optimizer

ChenhanYu · 2026-06-27T02:57:46Z

Summary

Capture nemo-run experiment_id for Docker submit_job by redirecting detached launcher output to a side-channel log and tailing it briefly.
Reuse the launcher output parser for both Docker and Slurm submit paths.
Document Docker PID + experiment_id behavior and ignore the MCP local uv.lock.

Jira

OMNIML-5128

Validation

uv run --project tools/mcp pytest tools/mcp/tests -q

Summary by CodeRabbit

Bug Fixes
- Improved Docker job submission to capture experiment_id from launcher output during a brief tail window, while always returning the detached pid (and stdout_log diagnostics if capture times out).
- Made Slurm identifier extraction more robust across varying launcher output formats.
Documentation
- Updated submit_job docs/module descriptions to clarify Docker PID/experiment_id and stdout_log behavior on timeout.
Tests
- Added/expanded tests for Docker success, timeout/no-id scenarios, parsing edge cases, and log-side-channel failure handling.
Chores
- Updated local tooling ignore rules for .venv/ and uv.lock.

copy-pr-bot · 2026-06-27T02:57:49Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

coderabbitai · 2026-06-27T02:57:53Z

📝 Walkthrough

Walkthrough

Docker submit_job now tails launcher output from a temporary log to capture experiment_id while still returning the detached PID. Slurm submission now uses a shared stdout parser for experiment_id, experiment_dir, and slurm_job_id. Related docs and tool descriptions were updated.

Changes

submit_job output capture

Layer / File(s)	Summary
Launcher parsing helpers `tools/mcp/modelopt_mcp/bridge.py`, `tools/mcp/tests/test_bridge.py`	Adds shared parsing for launcher stdout and applies it to Slurm submission to derive `experiment_id`, `experiment_dir`, and `slurm_job_id`.
Docker detached capture `tools/mcp/modelopt_mcp/bridge.py`, `tools/mcp/tests/test_bridge.py`	Creates a temporary Docker submit log, tails it for `experiment_id`, and adds tests for capture, timeout, and structured failure behavior.
Public surface text `tools/mcp/README.md`, `tools/mcp/modelopt_mcp/__init__.py`, `tools/mcp/modelopt_mcp/server.py`	Updates the MCP tool documentation and descriptions to match the Docker and Slurm identifier return behavior.

Sequence Diagram(s)

sequenceDiagram
  participant submit_job_impl
  participant subprocessPopen as subprocess.Popen
  participant launch_py as launch.py
  participant stdout_log
  participant tail_helper as _tail_docker_launch_log

  submit_job_impl->>subprocessPopen: start detached launch.py with stdout and stderr redirected
  subprocessPopen->>launch_py: run launcher
  launch_py-->>stdout_log: write launcher output
  submit_job_impl->>tail_helper: poll stdout_log for experiment_id
  tail_helper-->>submit_job_impl: experiment_id or timeout diagnostic

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

kevalmorabia97

Caution

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

Ignore

❌ Failed checks (1 error, 1 warning)

Check name	Status	Explanation	Resolution
Security Anti-Patterns	❌ Error	The PR adds `# nosec` suppressions in `tools/mcp/modelopt_mcp/bridge.py`, and the description has no explicit security justification/approval.	Remove the `# nosec` markers, or add explicit SECURITY.md justification and `@NVIDIA/modelopt-setup-codeowners` approval in the PR description.
Docstring Coverage	⚠️ Warning	Docstring coverage is 72.22% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly matches the main change: capturing the Docker experiment id.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch chenhany/omniml-5128

_{Comment @coderabbitai help to get the list of available commands.}

codecov · 2026-06-27T03:07:07Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 74.36%. Comparing base (4b04e73) to head (351820c).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #1840   +/-   ##
=======================================
  Coverage   74.36%   74.36%           
=======================================
  Files         513      513           
  Lines       56922    56922           
=======================================
  Hits        42332    42332           
  Misses      14590    14590

Flag	Coverage Δ
unit	`54.64% <ø> (ø)`

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

ChenhanYu · 2026-06-27T03:41:07Z

/claude review

claude · 2026-06-27T03:44:25Z

+    deadline = time.monotonic() + _docker_experiment_id_capture_timeout()
+    text = ""
+    while True:
+        try:
+            text = log_path.read_text(errors="replace")
+        except OSError:
+            text = ""
+        experiment_id, _, _ = _parse_launcher_submission(text)
+        if experiment_id:
+            return experiment_id, _tail(text, 2000)
+        if time.monotonic() >= deadline or proc.poll() is not None:
+            return None, _tail(text, 2000)
+        time.sleep(0.2)


[SUGGESTION] The tail loop re-reads the whole growing log every 0.2s and parses it for an id. Because it matches on the entire file while it's being written, an intermediate read can land mid-line. The primary format Experiment.from_id("<id>") is safe (the regex requires the closing ")), but the secondary Experiment Status for\s+(\S+) branch matches \S+ greedily — if a read catches that line before the trailing newline is flushed, \S+ could capture a truncated id, and the loop then returns early with the wrong value and stops tailing.

In practice nemo_run's line writes are usually flushed atomically so the risk is low, but if you want to harden it you could only accept an id once a newline follows it in the buffer (e.g. parse on text.rsplit("\n", 1)[0]), so a partially-written final line is never matched.

claude · 2026-06-27T03:44:27Z

+        log_dir = Path(child_env["NEMORUN_HOME"]) / ".modelopt-mcp" / "docker-submit-logs"
+        log_dir.mkdir(parents=True, exist_ok=True)
+        log_file = tempfile.NamedTemporaryFile(
+            prefix="submit-",
+            suffix=".log",
+            dir=log_dir,
+            delete=False,
+            mode="w+b",
+        )


[SUGGESTION] Each Docker submit_job now leaves a persistent submit-*.log under $NEMORUN_HOME/.modelopt-mcp/docker-submit-logs/ (delete=False), and nothing ever prunes them. Since the detached launcher keeps writing to its inherited fd, each file also grows for the lifetime of the job. Keeping them as a diagnostic referenced by stdout_log is reasonable, but over many submissions this directory accumulates unboundedly. Consider a lightweight retention policy (e.g. delete logs older than N days on submit) or at least documenting that callers are responsible for cleanup. Note also the error path (FileNotFoundError → _launcher_not_installed) leaves an empty log file behind.

claude

Claude review passed — no blocking issues found. LGTM. Focused PR (6 files, all under tools/mcp/, +220/-75); reviewed all source files. The change extracts the nemo_run submission-output parser into a shared _parse_launcher_submission() (reused by Slurm + Docker) and adds Docker-path experiment_id capture via a side-channel log file tailed briefly, replacing the prior /dev/null redirect. Verified: extracted parser is logic-identical to the removed inline Slurm block (no Slurm regression); fd handling is correct (parent closes its handle in finally, detached child keeps its dup'd fd, avoiding the original PIPE-buffer deadlock); docstring/README/init.py claims now match the implementation; new tests cover both id-captured and timeout branches. Findings — CRITICAL: 0, IMPORTANT: 0, SUGGESTION: 2 (non-blocking): (1) tail-loop parsing could match a truncated id mid-line in the 'Experiment Status for' branch; (2) submit-*.log files use delete=False and are never pruned. Risk level: Low — tooling-only change confined to the MCP bridge, no modelopt core / mode-state / export paths touched.

claude · 2026-06-27T03:45:38Z

+    deadline = time.monotonic() + _docker_experiment_id_capture_timeout()
+    text = ""
+    while True:
+        try:
+            text = log_path.read_text(errors="replace")
+        except OSError:
+            text = ""
+        experiment_id, _, _ = _parse_launcher_submission(text)
+        if experiment_id:
+            return experiment_id, _tail(text, 2000)
+        if time.monotonic() >= deadline or proc.poll() is not None:
+            return None, _tail(text, 2000)
+        time.sleep(0.2)


[SUGGESTION] The tail loop re-reads the whole growing log every 0.2s and parses it for an id. Because it matches on the entire file as it's being written, an intermediate read can land mid-line. The primary format Experiment.from_id("<id>") is safe (the regex requires the closing ")), but the secondary Experiment Status for\s+(\S+) branch matches \S+ greedily — if a read catches that line before the trailing newline is flushed, \S+ could capture a truncated id, and the loop then returns early with the wrong value and stops tailing.

In practice nemo_run's line writes are usually flushed atomically so the risk is low, but if you want to harden it you could only accept an id once a newline follows it in the buffer (e.g. parse on text.rsplit("\n", 1)[0]), so a partially-written final line is never matched.

claude · 2026-06-27T03:45:38Z

+        log_dir = Path(child_env["NEMORUN_HOME"]) / ".modelopt-mcp" / "docker-submit-logs"
+        log_dir.mkdir(parents=True, exist_ok=True)
+        log_file = tempfile.NamedTemporaryFile(
+            prefix="submit-",
+            suffix=".log",
+            dir=log_dir,
+            delete=False,
+            mode="w+b",
+        )


[SUGGESTION] Each Docker submit_job now leaves a persistent submit-*.log under $NEMORUN_HOME/.modelopt-mcp/docker-submit-logs/ (delete=False), and nothing ever prunes them. Since the detached launcher keeps writing to its inherited fd, each file also grows for the lifetime of the job. Keeping them as a diagnostic referenced by stdout_log is reasonable, but over many submissions this directory accumulates unboundedly. Consider a lightweight retention policy (e.g. delete logs older than N days on submit) or at least documenting that callers are responsible for cleanup. Note also the error path (FileNotFoundError → _launcher_not_installed) leaves an empty log file behind.

coderabbitai

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tools/mcp/modelopt_mcp/bridge.py`:
- Around line 1055-1068: The Docker spawn path in bridge.py should not use a
Bandit suppression comment; remove the `# nosec` annotation from the
`subprocess.Popen` call in the submit/log handling flow and keep the
implementation Bandit-clean using the existing controlled `argv` construction in
the surrounding logic. If a security exception is still needed, route it through
the approved exception process instead of suppressing it inline, and ensure the
change is localized near the `subprocess.Popen` invocation and related log
setup.
- Around line 125-135: The fallback parser in bridge.py is too permissive and
can treat experiment_dir as an experiment id, so tighten the regex and
validation in the experiment_id extraction block to only accept a real
experiment token from stdout. Update the re.search logic in the experiment_id
fallback path to reject directory-like suffixes such as dir and any matches from
experiment_dir, and keep submit_job_impl returning only a usable id for later
job_status/job_logs lookups.
- Around line 1056-1064: The log-file setup in submit_job_impl currently creates
the side-channel log before the Popen error handling, so an OSError escapes
instead of returning the normal structured failure. Wrap the log_dir.mkdir and
tempfile.NamedTemporaryFile creation path in the same structured error handling
used by submit_job_impl, and on failure return the existing error shape rather
than raising. Use the submit_job_impl flow and the log_dir/log_file setup to
locate the fix.

In `@tools/mcp/modelopt_mcp/server.py`:
- Around line 133-136: The docstring for the launch response in server.py is
misleading about Docker return values. Update the wording around the response in
the bridge launch path so it clearly states that Docker always returns pid,
including when experiment_id is missing and set to None after the short-tail
timeout, while Slurm still returns experiment_id only. Use the launch/response
description near the bridge output handling so tool consumers understand pid is
always present for Docker.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 8a55ac87-6bd6-4086-91f6-9ace671c667c

📥 Commits

Reviewing files that changed from the base of the PR and between 4b04e73 and c3dcc90.

📒 Files selected for processing (6)

tools/mcp/.gitignore
tools/mcp/README.md
tools/mcp/modelopt_mcp/__init__.py
tools/mcp/modelopt_mcp/bridge.py
tools/mcp/modelopt_mcp/server.py
tools/mcp/tests/test_bridge.py

coderabbitai · 2026-06-27T03:47:22Z

        # B603 false positive — argv is a controlled list built above.
+        log_dir = Path(child_env["NEMORUN_HOME"]) / ".modelopt-mcp" / "docker-submit-logs"
+        log_dir.mkdir(parents=True, exist_ok=True)
+        log_file = tempfile.NamedTemporaryFile(
+            prefix="submit-",
+            suffix=".log",
+            dir=log_dir,
+            delete=False,
+            mode="w+b",
+        )
+        log_path = Path(log_file.name)
        try:
            proc = subprocess.Popen(  # nosec B603 - fixed launcher argv list; no shell.
                argv,


🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

Remove the new # nosec suppression on the Docker spawn path.

Even if this argv is intended to be safe, the repo policy explicitly forbids # nosec as the way to bypass Bandit on security-sensitive code. Please use a Bandit-clean implementation or route the documented security exception through the required approval path. As per coding guidelines, # nosec comments are not allowed as a bypass for security checks.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/mcp/modelopt_mcp/bridge.py` around lines 1055 - 1068, The Docker spawn path in bridge.py should not use a Bandit suppression comment; remove the `# nosec` annotation from the `subprocess.Popen` call in the submit/log handling flow and keep the implementation Bandit-clean using the existing controlled `argv` construction in the surrounding logic. If a security exception is still needed, route it through the approved exception process instead of suppressing it inline, and ensure the change is localized near the `subprocess.Popen` invocation and related log setup.

Source: Coding guidelines

Signed-off-by: Chenhan Yu <chenhany@nvidia.com>

ChenhanYu · 2026-06-27T04:14:17Z

Addressed review feedback in 351820c:

Tightened launcher parsing so experiment_dir fields cannot be reported as experiment_id, while still returning the actual experiment_dir path.
Hardened Docker log tailing to avoid accepting a partially-written final line before newline/exit/timeout.
Return a structured docker_submit_log_unavailable failure if the side-channel log cannot be created.
Remove the empty log file if launcher startup fails with FileNotFoundError.
Clarified Docker response docs: Docker always returns PID; experiment_id may be None with stdout_log for diagnostics.
Added regression tests for parser false-positive and log creation failure.

Validation:

uv run --project tools/mcp pytest tools/mcp/tests -q -> 55 passed
pre-commit run ruff-check --files changed Python files -> passed
pre-commit run ruff-format --files changed Python files -> passed
pre-commit run bandit --files changed Python files -> passed
pre-commit run markdownlint-cli2 --files tools/mcp/README.md -> passed

Note: I did not remove the specific nosec B603 on the Docker Popen call because the repo's bandit hook fails on that call without the targeted suppression. The surrounding code still uses a controlled argv list and shell=False, matching the existing subprocess pattern in this file.

coderabbitai

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tools/mcp/modelopt_mcp/bridge.py`:
- Around line 172-175: The bridge logic in _parse_launcher_submission handling
should not return a partial Docker experiment id when the last log line is
incomplete. Update the complete_text preparation in bridge.py so it only parses
fully terminated lines before calling _parse_launcher_submission, and make the
experiment_id check in the same block reject any truncated match from the
current tail of text.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 6fea939f-04c7-4c87-8028-5ad302946213

📥 Commits

Reviewing files that changed from the base of the PR and between c3dcc90 and 351820c.

📒 Files selected for processing (6)

tools/mcp/.gitignore
tools/mcp/README.md
tools/mcp/modelopt_mcp/__init__.py
tools/mcp/modelopt_mcp/bridge.py
tools/mcp/modelopt_mcp/server.py
tools/mcp/tests/test_bridge.py

✅ Files skipped from review due to trivial changes (4)

tools/mcp/.gitignore
tools/mcp/modelopt_mcp/server.py
tools/mcp/modelopt_mcp/init.py
tools/mcp/README.md

coderabbitai · 2026-06-27T04:17:06Z

+        complete_text = text if text.endswith(("\n", "\r")) else text.rsplit("\n", 1)[0]
+        experiment_id, _, _ = _parse_launcher_submission(complete_text)
+        if experiment_id:
+            return experiment_id, _tail(text, 2000)


🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Avoid returning a truncated Docker experiment id.

Line 172 does not drop a single incomplete line: if the log currently contains Experiment Status for docker_ with no newline yet, rsplit("\n", 1)[0] still returns that partial line, so Line 174 can return a truncated id before the launcher finishes writing it.

Proposed fix

- complete_text = text if text.endswith(("\n", "\r")) else text.rsplit("\n", 1)[0] + if text.endswith(("\n", "\r")): + complete_text = text + elif "\n" in text: + complete_text = text.rsplit("\n", 1)[0] + else: + complete_text = ""

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

complete_text = text if text.endswith(("\n", "\r")) else text.rsplit("\n", 1)[0]

experiment_id, _, _ = _parse_launcher_submission(complete_text)

if experiment_id:

return experiment_id, _tail(text, 2000)

if text.endswith(("\n", "\r")):

complete_text = text

elif "\n" in text:

complete_text = text.rsplit("\n", 1)[0]

else:

complete_text = ""

experiment_id, _, _ = _parse_launcher_submission(complete_text)

if experiment_id:

return experiment_id, _tail(text, 2000)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/mcp/modelopt_mcp/bridge.py` around lines 172 - 175, The bridge logic in _parse_launcher_submission handling should not return a partial Docker experiment id when the last log line is incomplete. Update the complete_text preparation in bridge.py so it only parses fully terminated lines before calling _parse_launcher_submission, and make the experiment_id check in the same block reject any truncated match from the current tail of text.

ChenhanYu force-pushed the chenhany/omniml-5128 branch from 31eb637 to c3dcc90 Compare June 27, 2026 03:17

ChenhanYu marked this pull request as ready for review June 27, 2026 03:40

ChenhanYu requested a review from a team as a code owner June 27, 2026 03:40

claude Bot reviewed Jun 27, 2026

View reviewed changes

claude Bot approved these changes Jun 27, 2026

View reviewed changes

claude Bot reviewed Jun 27, 2026

View reviewed changes

coderabbitai Bot reviewed Jun 27, 2026

View reviewed changes

OMNIML-5128 Capture Docker experiment id

351820c

Signed-off-by: Chenhan Yu <chenhany@nvidia.com>

ChenhanYu force-pushed the chenhany/omniml-5128 branch from c3dcc90 to 351820c Compare June 27, 2026 04:13

coderabbitai Bot reviewed Jun 27, 2026

View reviewed changes

Uh oh!

Conversation

ChenhanYu commented Jun 27, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Jira

Validation

Summary by CodeRabbit

Uh oh!

copy-pr-bot Bot commented Jun 27, 2026

Uh oh!

coderabbitai Bot commented Jun 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Suggested reviewers

Pre-merge checks failed

❌ Failed checks (1 error, 1 warning)

Uh oh!

codecov Bot commented Jun 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

ChenhanYu commented Jun 27, 2026

Uh oh!

claude Bot Jun 27, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Jun 27, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot left a comment

Choose a reason for hiding this comment

Uh oh!

claude Bot Jun 27, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Jun 27, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot Jun 27, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

ChenhanYu commented Jun 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 27, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

ChenhanYu commented Jun 27, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 27, 2026 •

edited

Loading

codecov Bot commented Jun 27, 2026 •

edited

Loading

ChenhanYu commented Jun 27, 2026 •

edited

Loading