Skip to content

OMNIML-5128 Capture Docker experiment id#1840

Open
ChenhanYu wants to merge 1 commit into
mainfrom
chenhany/omniml-5128
Open

OMNIML-5128 Capture Docker experiment id#1840
ChenhanYu wants to merge 1 commit into
mainfrom
chenhany/omniml-5128

Conversation

@ChenhanYu

@ChenhanYu ChenhanYu commented Jun 27, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • Capture nemo-run experiment_id for Docker submit_job by redirecting detached launcher output to a side-channel log and tailing it briefly.
  • Reuse the launcher output parser for both Docker and Slurm submit paths.
  • Document Docker PID + experiment_id behavior and ignore the MCP local uv.lock.

Jira

OMNIML-5128

Validation

  • uv run --project tools/mcp pytest tools/mcp/tests -q

Summary by CodeRabbit

  • Bug Fixes
    • Improved Docker job submission to capture experiment_id from launcher output during a brief tail window, while always returning the detached pid (and stdout_log diagnostics if capture times out).
    • Made Slurm identifier extraction more robust across varying launcher output formats.
  • Documentation
    • Updated submit_job docs/module descriptions to clarify Docker PID/experiment_id and stdout_log behavior on timeout.
  • Tests
    • Added/expanded tests for Docker success, timeout/no-id scenarios, parsing edge cases, and log-side-channel failure handling.
  • Chores
    • Updated local tooling ignore rules for .venv/ and uv.lock.

@copy-pr-bot

copy-pr-bot Bot commented Jun 27, 2026

Copy link
Copy Markdown

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Jun 27, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

Docker submit_job now tails launcher output from a temporary log to capture experiment_id while still returning the detached PID. Slurm submission now uses a shared stdout parser for experiment_id, experiment_dir, and slurm_job_id. Related docs and tool descriptions were updated.

Changes

submit_job output capture

Layer / File(s) Summary
Launcher parsing helpers
tools/mcp/modelopt_mcp/bridge.py, tools/mcp/tests/test_bridge.py
Adds shared parsing for launcher stdout and applies it to Slurm submission to derive experiment_id, experiment_dir, and slurm_job_id.
Docker detached capture
tools/mcp/modelopt_mcp/bridge.py, tools/mcp/tests/test_bridge.py
Creates a temporary Docker submit log, tails it for experiment_id, and adds tests for capture, timeout, and structured failure behavior.
Public surface text
tools/mcp/README.md, tools/mcp/modelopt_mcp/__init__.py, tools/mcp/modelopt_mcp/server.py
Updates the MCP tool documentation and descriptions to match the Docker and Slurm identifier return behavior.

Sequence Diagram(s)

sequenceDiagram
  participant submit_job_impl
  participant subprocessPopen as subprocess.Popen
  participant launch_py as launch.py
  participant stdout_log
  participant tail_helper as _tail_docker_launch_log

  submit_job_impl->>subprocessPopen: start detached launch.py with stdout and stderr redirected
  subprocessPopen->>launch_py: run launcher
  launch_py-->>stdout_log: write launcher output
  submit_job_impl->>tail_helper: poll stdout_log for experiment_id
  tail_helper-->>submit_job_impl: experiment_id or timeout diagnostic
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • kevalmorabia97

Caution

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

  • Ignore

❌ Failed checks (1 error, 1 warning)

Check name Status Explanation Resolution
Security Anti-Patterns ❌ Error The PR adds # nosec suppressions in tools/mcp/modelopt_mcp/bridge.py, and the description has no explicit security justification/approval. Remove the # nosec markers, or add explicit SECURITY.md justification and @NVIDIA/modelopt-setup-codeowners approval in the PR description.
Docstring Coverage ⚠️ Warning Docstring coverage is 72.22% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly matches the main change: capturing the Docker experiment id.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch chenhany/omniml-5128

Comment @coderabbitai help to get the list of available commands.

@codecov

codecov Bot commented Jun 27, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 74.36%. Comparing base (4b04e73) to head (351820c).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1840   +/-   ##
=======================================
  Coverage   74.36%   74.36%           
=======================================
  Files         513      513           
  Lines       56922    56922           
=======================================
  Hits        42332    42332           
  Misses      14590    14590           
Flag Coverage Δ
unit 54.64% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@ChenhanYu ChenhanYu force-pushed the chenhany/omniml-5128 branch from 31eb637 to c3dcc90 Compare June 27, 2026 03:17
@ChenhanYu ChenhanYu marked this pull request as ready for review June 27, 2026 03:40
@ChenhanYu ChenhanYu requested a review from a team as a code owner June 27, 2026 03:40
@ChenhanYu

Copy link
Copy Markdown
Collaborator Author

/claude review

Comment on lines +165 to +177
deadline = time.monotonic() + _docker_experiment_id_capture_timeout()
text = ""
while True:
try:
text = log_path.read_text(errors="replace")
except OSError:
text = ""
experiment_id, _, _ = _parse_launcher_submission(text)
if experiment_id:
return experiment_id, _tail(text, 2000)
if time.monotonic() >= deadline or proc.poll() is not None:
return None, _tail(text, 2000)
time.sleep(0.2)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[SUGGESTION] The tail loop re-reads the whole growing log every 0.2s and parses it for an id. Because it matches on the entire file while it's being written, an intermediate read can land mid-line. The primary format Experiment.from_id("<id>") is safe (the regex requires the closing ")), but the secondary Experiment Status for\s+(\S+) branch matches \S+ greedily — if a read catches that line before the trailing newline is flushed, \S+ could capture a truncated id, and the loop then returns early with the wrong value and stops tailing.

In practice nemo_run's line writes are usually flushed atomically so the risk is low, but if you want to harden it you could only accept an id once a newline follows it in the buffer (e.g. parse on text.rsplit("\n", 1)[0]), so a partially-written final line is never matched.

Comment thread tools/mcp/modelopt_mcp/bridge.py Outdated
Comment on lines +1056 to +1064
log_dir = Path(child_env["NEMORUN_HOME"]) / ".modelopt-mcp" / "docker-submit-logs"
log_dir.mkdir(parents=True, exist_ok=True)
log_file = tempfile.NamedTemporaryFile(
prefix="submit-",
suffix=".log",
dir=log_dir,
delete=False,
mode="w+b",
)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[SUGGESTION] Each Docker submit_job now leaves a persistent submit-*.log under $NEMORUN_HOME/.modelopt-mcp/docker-submit-logs/ (delete=False), and nothing ever prunes them. Since the detached launcher keeps writing to its inherited fd, each file also grows for the lifetime of the job. Keeping them as a diagnostic referenced by stdout_log is reasonable, but over many submissions this directory accumulates unboundedly. Consider a lightweight retention policy (e.g. delete logs older than N days on submit) or at least documenting that callers are responsible for cleanup. Note also the error path (FileNotFoundError_launcher_not_installed) leaves an empty log file behind.

@claude claude Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude review passed — no blocking issues found. LGTM. Focused PR (6 files, all under tools/mcp/, +220/-75); reviewed all source files. The change extracts the nemo_run submission-output parser into a shared _parse_launcher_submission() (reused by Slurm + Docker) and adds Docker-path experiment_id capture via a side-channel log file tailed briefly, replacing the prior /dev/null redirect. Verified: extracted parser is logic-identical to the removed inline Slurm block (no Slurm regression); fd handling is correct (parent closes its handle in finally, detached child keeps its dup'd fd, avoiding the original PIPE-buffer deadlock); docstring/README/init.py claims now match the implementation; new tests cover both id-captured and timeout branches. Findings — CRITICAL: 0, IMPORTANT: 0, SUGGESTION: 2 (non-blocking): (1) tail-loop parsing could match a truncated id mid-line in the 'Experiment Status for' branch; (2) submit-*.log files use delete=False and are never pruned. Risk level: Low — tooling-only change confined to the MCP bridge, no modelopt core / mode-state / export paths touched.

Comment on lines +165 to +177
deadline = time.monotonic() + _docker_experiment_id_capture_timeout()
text = ""
while True:
try:
text = log_path.read_text(errors="replace")
except OSError:
text = ""
experiment_id, _, _ = _parse_launcher_submission(text)
if experiment_id:
return experiment_id, _tail(text, 2000)
if time.monotonic() >= deadline or proc.poll() is not None:
return None, _tail(text, 2000)
time.sleep(0.2)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[SUGGESTION] The tail loop re-reads the whole growing log every 0.2s and parses it for an id. Because it matches on the entire file as it's being written, an intermediate read can land mid-line. The primary format Experiment.from_id("<id>") is safe (the regex requires the closing ")), but the secondary Experiment Status for\s+(\S+) branch matches \S+ greedily — if a read catches that line before the trailing newline is flushed, \S+ could capture a truncated id, and the loop then returns early with the wrong value and stops tailing.

In practice nemo_run's line writes are usually flushed atomically so the risk is low, but if you want to harden it you could only accept an id once a newline follows it in the buffer (e.g. parse on text.rsplit("\n", 1)[0]), so a partially-written final line is never matched.

Comment thread tools/mcp/modelopt_mcp/bridge.py Outdated
Comment on lines +1056 to +1064
log_dir = Path(child_env["NEMORUN_HOME"]) / ".modelopt-mcp" / "docker-submit-logs"
log_dir.mkdir(parents=True, exist_ok=True)
log_file = tempfile.NamedTemporaryFile(
prefix="submit-",
suffix=".log",
dir=log_dir,
delete=False,
mode="w+b",
)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[SUGGESTION] Each Docker submit_job now leaves a persistent submit-*.log under $NEMORUN_HOME/.modelopt-mcp/docker-submit-logs/ (delete=False), and nothing ever prunes them. Since the detached launcher keeps writing to its inherited fd, each file also grows for the lifetime of the job. Keeping them as a diagnostic referenced by stdout_log is reasonable, but over many submissions this directory accumulates unboundedly. Consider a lightweight retention policy (e.g. delete logs older than N days on submit) or at least documenting that callers are responsible for cleanup. Note also the error path (FileNotFoundError_launcher_not_installed) leaves an empty log file behind.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tools/mcp/modelopt_mcp/bridge.py`:
- Around line 1055-1068: The Docker spawn path in bridge.py should not use a
Bandit suppression comment; remove the `# nosec` annotation from the
`subprocess.Popen` call in the submit/log handling flow and keep the
implementation Bandit-clean using the existing controlled `argv` construction in
the surrounding logic. If a security exception is still needed, route it through
the approved exception process instead of suppressing it inline, and ensure the
change is localized near the `subprocess.Popen` invocation and related log
setup.
- Around line 125-135: The fallback parser in bridge.py is too permissive and
can treat experiment_dir as an experiment id, so tighten the regex and
validation in the experiment_id extraction block to only accept a real
experiment token from stdout. Update the re.search logic in the experiment_id
fallback path to reject directory-like suffixes such as dir and any matches from
experiment_dir, and keep submit_job_impl returning only a usable id for later
job_status/job_logs lookups.
- Around line 1056-1064: The log-file setup in submit_job_impl currently creates
the side-channel log before the Popen error handling, so an OSError escapes
instead of returning the normal structured failure. Wrap the log_dir.mkdir and
tempfile.NamedTemporaryFile creation path in the same structured error handling
used by submit_job_impl, and on failure return the existing error shape rather
than raising. Use the submit_job_impl flow and the log_dir/log_file setup to
locate the fix.

In `@tools/mcp/modelopt_mcp/server.py`:
- Around line 133-136: The docstring for the launch response in server.py is
misleading about Docker return values. Update the wording around the response in
the bridge launch path so it clearly states that Docker always returns pid,
including when experiment_id is missing and set to None after the short-tail
timeout, while Slurm still returns experiment_id only. Use the launch/response
description near the bridge output handling so tool consumers understand pid is
always present for Docker.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 8a55ac87-6bd6-4086-91f6-9ace671c667c

📥 Commits

Reviewing files that changed from the base of the PR and between 4b04e73 and c3dcc90.

📒 Files selected for processing (6)
  • tools/mcp/.gitignore
  • tools/mcp/README.md
  • tools/mcp/modelopt_mcp/__init__.py
  • tools/mcp/modelopt_mcp/bridge.py
  • tools/mcp/modelopt_mcp/server.py
  • tools/mcp/tests/test_bridge.py

Comment thread tools/mcp/modelopt_mcp/bridge.py
Comment on lines 1055 to 1068
# B603 false positive — argv is a controlled list built above.
log_dir = Path(child_env["NEMORUN_HOME"]) / ".modelopt-mcp" / "docker-submit-logs"
log_dir.mkdir(parents=True, exist_ok=True)
log_file = tempfile.NamedTemporaryFile(
prefix="submit-",
suffix=".log",
dir=log_dir,
delete=False,
mode="w+b",
)
log_path = Path(log_file.name)
try:
proc = subprocess.Popen( # nosec B603 - fixed launcher argv list; no shell.
argv,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

Remove the new # nosec suppression on the Docker spawn path.

Even if this argv is intended to be safe, the repo policy explicitly forbids # nosec as the way to bypass Bandit on security-sensitive code. Please use a Bandit-clean implementation or route the documented security exception through the required approval path. As per coding guidelines, # nosec comments are not allowed as a bypass for security checks.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/mcp/modelopt_mcp/bridge.py` around lines 1055 - 1068, The Docker spawn
path in bridge.py should not use a Bandit suppression comment; remove the `#
nosec` annotation from the `subprocess.Popen` call in the submit/log handling
flow and keep the implementation Bandit-clean using the existing controlled
`argv` construction in the surrounding logic. If a security exception is still
needed, route it through the approved exception process instead of suppressing
it inline, and ensure the change is localized near the `subprocess.Popen`
invocation and related log setup.

Source: Coding guidelines

Comment thread tools/mcp/modelopt_mcp/bridge.py Outdated
Comment thread tools/mcp/modelopt_mcp/server.py Outdated
Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
@ChenhanYu ChenhanYu force-pushed the chenhany/omniml-5128 branch from c3dcc90 to 351820c Compare June 27, 2026 04:13
@ChenhanYu

ChenhanYu commented Jun 27, 2026

Copy link
Copy Markdown
Collaborator Author

Addressed review feedback in 351820c:

  • Tightened launcher parsing so experiment_dir fields cannot be reported as experiment_id, while still returning the actual experiment_dir path.
  • Hardened Docker log tailing to avoid accepting a partially-written final line before newline/exit/timeout.
  • Return a structured docker_submit_log_unavailable failure if the side-channel log cannot be created.
  • Remove the empty log file if launcher startup fails with FileNotFoundError.
  • Clarified Docker response docs: Docker always returns PID; experiment_id may be None with stdout_log for diagnostics.
  • Added regression tests for parser false-positive and log creation failure.

Validation:

  • uv run --project tools/mcp pytest tools/mcp/tests -q -> 55 passed
  • pre-commit run ruff-check --files changed Python files -> passed
  • pre-commit run ruff-format --files changed Python files -> passed
  • pre-commit run bandit --files changed Python files -> passed
  • pre-commit run markdownlint-cli2 --files tools/mcp/README.md -> passed

Note: I did not remove the specific nosec B603 on the Docker Popen call because the repo's bandit hook fails on that call without the targeted suppression. The surrounding code still uses a controlled argv list and shell=False, matching the existing subprocess pattern in this file.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tools/mcp/modelopt_mcp/bridge.py`:
- Around line 172-175: The bridge logic in _parse_launcher_submission handling
should not return a partial Docker experiment id when the last log line is
incomplete. Update the complete_text preparation in bridge.py so it only parses
fully terminated lines before calling _parse_launcher_submission, and make the
experiment_id check in the same block reject any truncated match from the
current tail of text.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 6fea939f-04c7-4c87-8028-5ad302946213

📥 Commits

Reviewing files that changed from the base of the PR and between c3dcc90 and 351820c.

📒 Files selected for processing (6)
  • tools/mcp/.gitignore
  • tools/mcp/README.md
  • tools/mcp/modelopt_mcp/__init__.py
  • tools/mcp/modelopt_mcp/bridge.py
  • tools/mcp/modelopt_mcp/server.py
  • tools/mcp/tests/test_bridge.py
✅ Files skipped from review due to trivial changes (4)
  • tools/mcp/.gitignore
  • tools/mcp/modelopt_mcp/server.py
  • tools/mcp/modelopt_mcp/init.py
  • tools/mcp/README.md

Comment on lines +172 to +175
complete_text = text if text.endswith(("\n", "\r")) else text.rsplit("\n", 1)[0]
experiment_id, _, _ = _parse_launcher_submission(complete_text)
if experiment_id:
return experiment_id, _tail(text, 2000)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Avoid returning a truncated Docker experiment id.

Line 172 does not drop a single incomplete line: if the log currently contains Experiment Status for docker_ with no newline yet, rsplit("\n", 1)[0] still returns that partial line, so Line 174 can return a truncated id before the launcher finishes writing it.

Proposed fix
-        complete_text = text if text.endswith(("\n", "\r")) else text.rsplit("\n", 1)[0]
+        if text.endswith(("\n", "\r")):
+            complete_text = text
+        elif "\n" in text:
+            complete_text = text.rsplit("\n", 1)[0]
+        else:
+            complete_text = ""
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
complete_text = text if text.endswith(("\n", "\r")) else text.rsplit("\n", 1)[0]
experiment_id, _, _ = _parse_launcher_submission(complete_text)
if experiment_id:
return experiment_id, _tail(text, 2000)
if text.endswith(("\n", "\r")):
complete_text = text
elif "\n" in text:
complete_text = text.rsplit("\n", 1)[0]
else:
complete_text = ""
experiment_id, _, _ = _parse_launcher_submission(complete_text)
if experiment_id:
return experiment_id, _tail(text, 2000)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/mcp/modelopt_mcp/bridge.py` around lines 172 - 175, The bridge logic in
_parse_launcher_submission handling should not return a partial Docker
experiment id when the last log line is incomplete. Update the complete_text
preparation in bridge.py so it only parses fully terminated lines before calling
_parse_launcher_submission, and make the experiment_id check in the same block
reject any truncated match from the current tail of text.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant