OMNIML-5128 Capture Docker experiment id#1840
Conversation
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
📝 WalkthroughWalkthroughDocker Changessubmit_job output capture
Sequence Diagram(s)sequenceDiagram
participant submit_job_impl
participant subprocessPopen as subprocess.Popen
participant launch_py as launch.py
participant stdout_log
participant tail_helper as _tail_docker_launch_log
submit_job_impl->>subprocessPopen: start detached launch.py with stdout and stderr redirected
subprocessPopen->>launch_py: run launcher
launch_py-->>stdout_log: write launcher output
submit_job_impl->>tail_helper: poll stdout_log for experiment_id
tail_helper-->>submit_job_impl: experiment_id or timeout diagnostic
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Suggested reviewers
Caution Pre-merge checks failedPlease resolve all errors before merging. Addressing warnings is optional.
❌ Failed checks (1 error, 1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #1840 +/- ##
=======================================
Coverage 74.36% 74.36%
=======================================
Files 513 513
Lines 56922 56922
=======================================
Hits 42332 42332
Misses 14590 14590
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
31eb637 to
c3dcc90
Compare
|
/claude review |
| deadline = time.monotonic() + _docker_experiment_id_capture_timeout() | ||
| text = "" | ||
| while True: | ||
| try: | ||
| text = log_path.read_text(errors="replace") | ||
| except OSError: | ||
| text = "" | ||
| experiment_id, _, _ = _parse_launcher_submission(text) | ||
| if experiment_id: | ||
| return experiment_id, _tail(text, 2000) | ||
| if time.monotonic() >= deadline or proc.poll() is not None: | ||
| return None, _tail(text, 2000) | ||
| time.sleep(0.2) |
There was a problem hiding this comment.
[SUGGESTION] The tail loop re-reads the whole growing log every 0.2s and parses it for an id. Because it matches on the entire file while it's being written, an intermediate read can land mid-line. The primary format Experiment.from_id("<id>") is safe (the regex requires the closing ")), but the secondary Experiment Status for\s+(\S+) branch matches \S+ greedily — if a read catches that line before the trailing newline is flushed, \S+ could capture a truncated id, and the loop then returns early with the wrong value and stops tailing.
In practice nemo_run's line writes are usually flushed atomically so the risk is low, but if you want to harden it you could only accept an id once a newline follows it in the buffer (e.g. parse on text.rsplit("\n", 1)[0]), so a partially-written final line is never matched.
| log_dir = Path(child_env["NEMORUN_HOME"]) / ".modelopt-mcp" / "docker-submit-logs" | ||
| log_dir.mkdir(parents=True, exist_ok=True) | ||
| log_file = tempfile.NamedTemporaryFile( | ||
| prefix="submit-", | ||
| suffix=".log", | ||
| dir=log_dir, | ||
| delete=False, | ||
| mode="w+b", | ||
| ) |
There was a problem hiding this comment.
[SUGGESTION] Each Docker submit_job now leaves a persistent submit-*.log under $NEMORUN_HOME/.modelopt-mcp/docker-submit-logs/ (delete=False), and nothing ever prunes them. Since the detached launcher keeps writing to its inherited fd, each file also grows for the lifetime of the job. Keeping them as a diagnostic referenced by stdout_log is reasonable, but over many submissions this directory accumulates unboundedly. Consider a lightweight retention policy (e.g. delete logs older than N days on submit) or at least documenting that callers are responsible for cleanup. Note also the error path (FileNotFoundError → _launcher_not_installed) leaves an empty log file behind.
There was a problem hiding this comment.
Claude review passed — no blocking issues found. LGTM. Focused PR (6 files, all under tools/mcp/, +220/-75); reviewed all source files. The change extracts the nemo_run submission-output parser into a shared _parse_launcher_submission() (reused by Slurm + Docker) and adds Docker-path experiment_id capture via a side-channel log file tailed briefly, replacing the prior /dev/null redirect. Verified: extracted parser is logic-identical to the removed inline Slurm block (no Slurm regression); fd handling is correct (parent closes its handle in finally, detached child keeps its dup'd fd, avoiding the original PIPE-buffer deadlock); docstring/README/init.py claims now match the implementation; new tests cover both id-captured and timeout branches. Findings — CRITICAL: 0, IMPORTANT: 0, SUGGESTION: 2 (non-blocking): (1) tail-loop parsing could match a truncated id mid-line in the 'Experiment Status for' branch; (2) submit-*.log files use delete=False and are never pruned. Risk level: Low — tooling-only change confined to the MCP bridge, no modelopt core / mode-state / export paths touched.
| deadline = time.monotonic() + _docker_experiment_id_capture_timeout() | ||
| text = "" | ||
| while True: | ||
| try: | ||
| text = log_path.read_text(errors="replace") | ||
| except OSError: | ||
| text = "" | ||
| experiment_id, _, _ = _parse_launcher_submission(text) | ||
| if experiment_id: | ||
| return experiment_id, _tail(text, 2000) | ||
| if time.monotonic() >= deadline or proc.poll() is not None: | ||
| return None, _tail(text, 2000) | ||
| time.sleep(0.2) |
There was a problem hiding this comment.
[SUGGESTION] The tail loop re-reads the whole growing log every 0.2s and parses it for an id. Because it matches on the entire file as it's being written, an intermediate read can land mid-line. The primary format Experiment.from_id("<id>") is safe (the regex requires the closing ")), but the secondary Experiment Status for\s+(\S+) branch matches \S+ greedily — if a read catches that line before the trailing newline is flushed, \S+ could capture a truncated id, and the loop then returns early with the wrong value and stops tailing.
In practice nemo_run's line writes are usually flushed atomically so the risk is low, but if you want to harden it you could only accept an id once a newline follows it in the buffer (e.g. parse on text.rsplit("\n", 1)[0]), so a partially-written final line is never matched.
| log_dir = Path(child_env["NEMORUN_HOME"]) / ".modelopt-mcp" / "docker-submit-logs" | ||
| log_dir.mkdir(parents=True, exist_ok=True) | ||
| log_file = tempfile.NamedTemporaryFile( | ||
| prefix="submit-", | ||
| suffix=".log", | ||
| dir=log_dir, | ||
| delete=False, | ||
| mode="w+b", | ||
| ) |
There was a problem hiding this comment.
[SUGGESTION] Each Docker submit_job now leaves a persistent submit-*.log under $NEMORUN_HOME/.modelopt-mcp/docker-submit-logs/ (delete=False), and nothing ever prunes them. Since the detached launcher keeps writing to its inherited fd, each file also grows for the lifetime of the job. Keeping them as a diagnostic referenced by stdout_log is reasonable, but over many submissions this directory accumulates unboundedly. Consider a lightweight retention policy (e.g. delete logs older than N days on submit) or at least documenting that callers are responsible for cleanup. Note also the error path (FileNotFoundError → _launcher_not_installed) leaves an empty log file behind.
There was a problem hiding this comment.
Warning
CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.
Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tools/mcp/modelopt_mcp/bridge.py`:
- Around line 1055-1068: The Docker spawn path in bridge.py should not use a
Bandit suppression comment; remove the `# nosec` annotation from the
`subprocess.Popen` call in the submit/log handling flow and keep the
implementation Bandit-clean using the existing controlled `argv` construction in
the surrounding logic. If a security exception is still needed, route it through
the approved exception process instead of suppressing it inline, and ensure the
change is localized near the `subprocess.Popen` invocation and related log
setup.
- Around line 125-135: The fallback parser in bridge.py is too permissive and
can treat experiment_dir as an experiment id, so tighten the regex and
validation in the experiment_id extraction block to only accept a real
experiment token from stdout. Update the re.search logic in the experiment_id
fallback path to reject directory-like suffixes such as dir and any matches from
experiment_dir, and keep submit_job_impl returning only a usable id for later
job_status/job_logs lookups.
- Around line 1056-1064: The log-file setup in submit_job_impl currently creates
the side-channel log before the Popen error handling, so an OSError escapes
instead of returning the normal structured failure. Wrap the log_dir.mkdir and
tempfile.NamedTemporaryFile creation path in the same structured error handling
used by submit_job_impl, and on failure return the existing error shape rather
than raising. Use the submit_job_impl flow and the log_dir/log_file setup to
locate the fix.
In `@tools/mcp/modelopt_mcp/server.py`:
- Around line 133-136: The docstring for the launch response in server.py is
misleading about Docker return values. Update the wording around the response in
the bridge launch path so it clearly states that Docker always returns pid,
including when experiment_id is missing and set to None after the short-tail
timeout, while Slurm still returns experiment_id only. Use the launch/response
description near the bridge output handling so tool consumers understand pid is
always present for Docker.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 8a55ac87-6bd6-4086-91f6-9ace671c667c
📒 Files selected for processing (6)
tools/mcp/.gitignoretools/mcp/README.mdtools/mcp/modelopt_mcp/__init__.pytools/mcp/modelopt_mcp/bridge.pytools/mcp/modelopt_mcp/server.pytools/mcp/tests/test_bridge.py
| # B603 false positive — argv is a controlled list built above. | ||
| log_dir = Path(child_env["NEMORUN_HOME"]) / ".modelopt-mcp" / "docker-submit-logs" | ||
| log_dir.mkdir(parents=True, exist_ok=True) | ||
| log_file = tempfile.NamedTemporaryFile( | ||
| prefix="submit-", | ||
| suffix=".log", | ||
| dir=log_dir, | ||
| delete=False, | ||
| mode="w+b", | ||
| ) | ||
| log_path = Path(log_file.name) | ||
| try: | ||
| proc = subprocess.Popen( # nosec B603 - fixed launcher argv list; no shell. | ||
| argv, |
There was a problem hiding this comment.
🔒 Security & Privacy | 🟠 Major | ⚡ Quick win
Remove the new # nosec suppression on the Docker spawn path.
Even if this argv is intended to be safe, the repo policy explicitly forbids # nosec as the way to bypass Bandit on security-sensitive code. Please use a Bandit-clean implementation or route the documented security exception through the required approval path. As per coding guidelines, # nosec comments are not allowed as a bypass for security checks.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/mcp/modelopt_mcp/bridge.py` around lines 1055 - 1068, The Docker spawn
path in bridge.py should not use a Bandit suppression comment; remove the `#
nosec` annotation from the `subprocess.Popen` call in the submit/log handling
flow and keep the implementation Bandit-clean using the existing controlled
`argv` construction in the surrounding logic. If a security exception is still
needed, route it through the approved exception process instead of suppressing
it inline, and ensure the change is localized near the `subprocess.Popen`
invocation and related log setup.
Source: Coding guidelines
Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
c3dcc90 to
351820c
Compare
|
Addressed review feedback in 351820c:
Validation:
Note: I did not remove the specific nosec B603 on the Docker Popen call because the repo's bandit hook fails on that call without the targeted suppression. The surrounding code still uses a controlled argv list and shell=False, matching the existing subprocess pattern in this file. |
There was a problem hiding this comment.
Warning
CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.
Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tools/mcp/modelopt_mcp/bridge.py`:
- Around line 172-175: The bridge logic in _parse_launcher_submission handling
should not return a partial Docker experiment id when the last log line is
incomplete. Update the complete_text preparation in bridge.py so it only parses
fully terminated lines before calling _parse_launcher_submission, and make the
experiment_id check in the same block reject any truncated match from the
current tail of text.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 6fea939f-04c7-4c87-8028-5ad302946213
📒 Files selected for processing (6)
tools/mcp/.gitignoretools/mcp/README.mdtools/mcp/modelopt_mcp/__init__.pytools/mcp/modelopt_mcp/bridge.pytools/mcp/modelopt_mcp/server.pytools/mcp/tests/test_bridge.py
✅ Files skipped from review due to trivial changes (4)
- tools/mcp/.gitignore
- tools/mcp/modelopt_mcp/server.py
- tools/mcp/modelopt_mcp/init.py
- tools/mcp/README.md
| complete_text = text if text.endswith(("\n", "\r")) else text.rsplit("\n", 1)[0] | ||
| experiment_id, _, _ = _parse_launcher_submission(complete_text) | ||
| if experiment_id: | ||
| return experiment_id, _tail(text, 2000) |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
Avoid returning a truncated Docker experiment id.
Line 172 does not drop a single incomplete line: if the log currently contains Experiment Status for docker_ with no newline yet, rsplit("\n", 1)[0] still returns that partial line, so Line 174 can return a truncated id before the launcher finishes writing it.
Proposed fix
- complete_text = text if text.endswith(("\n", "\r")) else text.rsplit("\n", 1)[0]
+ if text.endswith(("\n", "\r")):
+ complete_text = text
+ elif "\n" in text:
+ complete_text = text.rsplit("\n", 1)[0]
+ else:
+ complete_text = ""📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| complete_text = text if text.endswith(("\n", "\r")) else text.rsplit("\n", 1)[0] | |
| experiment_id, _, _ = _parse_launcher_submission(complete_text) | |
| if experiment_id: | |
| return experiment_id, _tail(text, 2000) | |
| if text.endswith(("\n", "\r")): | |
| complete_text = text | |
| elif "\n" in text: | |
| complete_text = text.rsplit("\n", 1)[0] | |
| else: | |
| complete_text = "" | |
| experiment_id, _, _ = _parse_launcher_submission(complete_text) | |
| if experiment_id: | |
| return experiment_id, _tail(text, 2000) |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/mcp/modelopt_mcp/bridge.py` around lines 172 - 175, The bridge logic in
_parse_launcher_submission handling should not return a partial Docker
experiment id when the last log line is incomplete. Update the complete_text
preparation in bridge.py so it only parses fully terminated lines before calling
_parse_launcher_submission, and make the experiment_id check in the same block
reject any truncated match from the current tail of text.
Summary
Jira
OMNIML-5128
Validation
Summary by CodeRabbit
experiment_idfrom launcher output during a brief tail window, while always returning the detachedpid(andstdout_logdiagnostics if capture times out).submit_jobdocs/module descriptions to clarify Docker PID/experiment_idandstdout_logbehavior on timeout..venv/anduv.lock.