feat: add pypiserver image, docker-compose setup, and CI pipeline #5048
qinrui777 wants to merge 29 commits into
…nnxruntime==1.16.0
…er}/external) and remove cu124
Code Review
This pull request introduces a Docker Compose setup for both online and offline modes, including a local PyPI server (pypiserver) to host offline wheels. Key feedback includes fixing the pypiserver command in docker-compose.yaml to ensure wheels are served, using --chown in the Dockerfile to avoid permission errors, correcting invalid CUDA index URLs, and resolving script robustness issues in list-packages.sh (such as potential hangs, pipeline failures, and incorrect relative paths).
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
qinxuye left a comment
One issue remains around the offline package mirror build.
    pip download --dest /data/packages/ --no-cache-dir \
    "$pkg" 2>/dev/null \
    && echo "OK $pkg" || echo "SKIP $pkg" ;; \
    *) \
    pip download --dest /data/packages/ --no-cache-dir \
    --extra-index-url https://download.pytorch.org/whl/cu130 \
    --extra-index-url https://wheels.vllm.ai/0.19.0/cu130 \
    --extra-index-url https://xorbitsai.github.io/xllamacpp/whl/cu128 \
    "$pkg" 2>/dev/null \
    && echo "OK $pkg" || echo "SKIP $pkg" ;; \
This still makes the pypiserver image a best-effort mirror: every pip download failure is hidden by 2>/dev/null and converted into SKIP, so the workflow can publish an image that is missing required wheels and only fails later in offline installs. Since this image is meant to support offline use, please collect failed packages and fail the build for required downloads, or otherwise make the best-effort behavior explicit and avoid publishing it as a complete offline mirror.
Fixed in latest commit. Build Result -> https://github.com/qinrui777/inference/actions/runs/28079279864
- Stopped swallowing stderr: removed 2>/dev/null from both pip invocations, so failures are now visible in the build log.
- Recorded failures: each SKIP now appends the package name to /tmp/failed.txt, and the file is then printed to the console with cat.
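The review request above also asks that the build fail when a required download fails, which recording the names alone does not guarantee. A minimal sketch of that pattern follows, assuming a Python helper replaces the shell loop; `download_all`, `exit_code_for`, and the injectable `run` parameter are hypothetical names, not part of this PR.

```python
import subprocess
import sys
from typing import Callable, Iterable, List


def download_all(
    packages: Iterable[str],
    dest: str,
    run: Callable = subprocess.run,  # injectable so the logic can be tested offline
) -> List[str]:
    """Run `pip download` per package and return the names that failed."""
    failed = []
    for pkg in packages:
        cmd = [sys.executable, "-m", "pip", "download",
               "--dest", dest, "--no-cache-dir", pkg]
        # stderr is deliberately not redirected, so pip errors stay in the build log
        result = run(cmd)
        if result.returncode != 0:
            failed.append(pkg)
    return failed


def exit_code_for(failed: List[str]) -> int:
    """Return a non-zero exit code when any required package is missing."""
    if failed:
        print("Failed downloads: " + ", ".join(failed), file=sys.stderr)
        return 1
    return 0
```

Calling `sys.exit(exit_code_for(download_all(pkgs, "/data/packages/")))` in the downloader stage would stop the image build instead of publishing an incomplete mirror.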
rogercloud left a comment
Requesting changes for the release gating and offline package integrity issues below.
rogercloud left a comment
Three design-level concerns about whether the offline mirror is actually usable in a true air-gapped deployment. These are independent of the silent-failure / release-gating points already raised.
    # ============================================================================
    # Stage 1 — download every wheel needed by xinference model engines
    # ============================================================================
    FROM python:3.12-slim AS downloader
[Updated — downgraded to a minor note.] Downloader Python should track the runtime image.
I first read the in-repo Dockerfile (base vllm/vllm-openai:v0.17.1, Python 3.10) and worried this python:3.12 downloader would produce wheels that can't be installed at runtime. Inspecting the actually-published xprobe/xinference:latest config, however, it is built on a newer vLLM base with PYTHON_VERSION=3.12 and CUDA_VERSION=13.0.2. So today the downloader (3.12) matches the runtime interpreter and there is no ABI mismatch — and the cu130 extra indexes are correct for this image.
The only residual concern is forward-looking: this python:3.12-slim is hardcoded and decoupled from whatever Python the xinference base image actually ships. Because model venvs are created with python_path=sys.executable (xinference/core/worker.py), a future base-image Python bump would silently make the mirrored wheels uninstallable, and the failure would only surface in an offline install. Consider deriving/pinning the downloader Python (and CUDA tag) from the target image, or documenting the coupling, so they can't drift. Not a blocker.
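One way to catch such drift early is to check the mirrored wheel filenames against the runtime interpreter version at build time. The sketch below is a simplified reading of the wheel filename convention (python tag and ABI tag only, ignoring the platform tag); `wheel_matches_python` is a hypothetical helper, not something in this PR or in xinference.

```python
import sys
from typing import Tuple


def wheel_matches_python(filename: str,
                         py: Tuple[int, int] = sys.version_info[:2]) -> bool:
    """Rough check that a wheel's python/abi tags allow interpreter `py`.

    Simplified on purpose: platform tags are not inspected.
    """
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel filename: " + filename)
    parts = filename[:-4].split("-")
    if len(parts) < 5:
        raise ValueError("unexpected wheel filename: " + filename)
    python_tag, abi_tag = parts[-3], parts[-2]
    abis = abi_tag.split(".")
    for tag in python_tag.split("."):  # compressed tags such as py2.py3
        kind, digits = tag[:2], tag[2:]
        if kind not in ("py", "cp") or not digits.isdigit():
            continue
        major = int(digits[0])
        minor = int(digits[1:]) if len(digits) > 1 else None
        if major != py[0]:
            continue
        if minor is None:  # e.g. py3: any 3.x interpreter
            return True
        if kind == "py" and minor <= py[1]:
            return True
        if kind == "cp":
            if minor == py[1]:
                return True
            # abi3 wheels work on the tagged minor version and later
            if "abi3" in abis and minor <= py[1]:
                return True
    return False
```

Running this over the downloaded files with `py` set to the target image's Python version would turn a future base-image bump into a build-time failure rather than an offline install failure.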
    environment:
      # - XINFERENCE_MODEL_SRC=modelscope
      - XINFERENCE_HOME=/data
      - PIP_INDEX_URL=${PIP_INDEX_URL:-https://pypi.org/simple}
Offline mode is incomplete for vllm/sglang. Pointing PIP_INDEX_URL at the local pypiserver only redirects pip's primary index. When xinference bootstraps a vllm/sglang model venv it injects engine-specific extra indexes from ENGINE_VIRTUALENV_EXTRA_INDEX_URLS (apply_engine_virtualenv_settings in xinference/core/utils.py) — https://wheels.vllm.ai/0.19.0/cu130 and https://download.pytorch.org/whl/cu130, with index_strategy=unsafe-best-match. These are passed to uv as explicit --extra-index-url args, so in a true air-gapped deployment those hosts are unreachable and venv creation for the heavy GPU engines fails despite the local mirror.
Note: setting a PIP_EXTRA_INDEX_URL environment variable does not help — get_pip_config_args() (xinference/utils.py) reads pip config list (config files only), not env vars, so the env var never reaches settings and the engine injection still happens.
Suggested fix (no xinference code change). inherit_pip_config (xinference/core/worker.py, ~L1707) runs before apply_engine_virtualenv_settings (~L1714), and the engine injection only fires when settings.extra_index_url is None. So provide a pip config file in the container (not an env var) that points both indexes at the local mirror:
    # /etc/pip.conf
    [global]
    index-url = http://pypiserver:8080/simple/
    extra-index-url = http://pypiserver:8080/simple/
    trusted-host = pypiserver

Then pip config list -> get_pip_config_args() -> inherit_pip_config sets settings.extra_index_url to the mirror (non-None), so apply_engine_virtualenv_settings skips injecting the wheels.vllm.ai / download.pytorch.org internet indexes, and venv creation resolves entirely from pypiserver (the mirror already holds those cu130 wheels by name). Mount this file (e.g. to /etc/pip.conf, or via PIP_CONFIG_FILE) in the offline profile.
Two caveats:
- This does not fix sglang's sgl_kernel: it is referenced as a direct GitHub URL at venv-creation time (not an index lookup), so a name-indexed pypiserver cannot serve it and pip will still try to reach github.com. That needs xinference core to express such deps as index-resolvable names (or mirror + rewrite). So the config-file fix unblocks vllm, not sglang.
- The mechanism is implicit (relies on the inherit-before-apply ordering and the "only inject when None" guard). A cleaner long-term fix is an explicit offline switch in xinference that disables engine extra-index injection, which is out of scope for this PR.
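The inherit-before-apply ordering described in this comment can be modelled in a few lines. This is a toy model only: `SettingsSketch`, `inherit_pip_config_sketch`, and `apply_engine_settings_sketch` are hypothetical stand-ins that mimic the behaviour described above, not the real xinference functions, and the real code may differ in detail.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

MIRROR = "http://pypiserver:8080/simple/"

# Engine indexes as quoted in the review comment (vllm only, for brevity).
ENGINE_EXTRA_INDEXES: Dict[str, List[str]] = {
    "vllm": [
        "https://wheels.vllm.ai/0.19.0/cu130",
        "https://download.pytorch.org/whl/cu130",
    ],
}


@dataclass
class SettingsSketch:
    index_url: Optional[str] = None
    extra_index_url: Optional[List[str]] = None


def inherit_pip_config_sketch(settings: SettingsSketch, pip_config: Dict[str, str]) -> None:
    """Step 1: copy values found in pip config files into unset fields."""
    if settings.index_url is None and "index-url" in pip_config:
        settings.index_url = pip_config["index-url"]
    if settings.extra_index_url is None and "extra-index-url" in pip_config:
        settings.extra_index_url = [pip_config["extra-index-url"]]


def apply_engine_settings_sketch(settings: SettingsSketch, engine: str) -> None:
    """Step 2: inject engine indexes only when nothing was inherited."""
    if settings.extra_index_url is None:
        settings.extra_index_url = list(ENGINE_EXTRA_INDEXES.get(engine, []))


def resolve(engine: str, pip_config: Dict[str, str]) -> SettingsSketch:
    settings = SettingsSketch()
    inherit_pip_config_sketch(settings, pip_config)  # runs first
    apply_engine_settings_sketch(settings, engine)   # runs second
    return settings
```

With an empty config, `resolve("vllm", {})` ends up with the two internet indexes; with `{"index-url": MIRROR, "extra-index-url": MIRROR}` it keeps only the mirror, which is the effect the config file relies on.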
    '#vllm_dependencies#'|vllm_dependencies)
    printf 'vllm>=0.11.2\n' ;;
    '#sglang_dependencies#'|sglang_dependencies)
    printf 'pybase64\nzmq\npartial_json_parser\nsentencepiece\ndill\nninja\nnumpy>=2.4.1\nsglang>=0.5.6\nsgl_kernel\n' ;;
Duplicated source of truth (already drifting). This script — and the --extra-index-url list in Dockerfile.pypiserver — is a hand-maintained copy of ENGINE_VIRTUALENV_PACKAGES / ENGINE_VIRTUALENV_EXTRA_INDEX_URLS from xinference/core/virtual_env_manager.py. Two sources of truth will drift, and the air-gapped install is the worst place to discover it.
Concrete drift today:
- sglang sgl_kernel (this line). The source defines three entries: two pinned CUDA-13.0 direct-URL wheels (sgl_kernel-0.3.21+cu130-cp310-abi3-...x86_64.whl / ...aarch64.whl, gated by cuda_version == "13.0" and platform_machine == ...) plus a sgl_kernel ; cuda_version < "13.0" fallback. This line collapses all of that to a bare sgl_kernel. Since the published runtime image is CUDA 13.0.2, the runtime resolves to the pinned GitHub URL wheel, not PyPI sgl_kernel, so the mirror downloads the wrong artifact. (And see the note below: even mirroring the right wheel doesn't make it servable.)
- Extra indexes: global vs per-engine, plus an out-of-band source. The source applies extra indexes per engine (vllm → [wheels.vllm.ai/0.19.0/cu130, download.pytorch.org/whl/cu130], sglang → [download.pytorch.org/whl/cu130]) with index_strategy = unsafe-best-match. The Dockerfile instead applies download.pytorch.org/whl/cu130 + wheels.vllm.ai/0.19.0/cu130 + xorbitsai.github.io/xllamacpp/whl/cu128 to every package and passes no index strategy. Note (a) xorbitsai.github.io/xllamacpp/whl/cu128 is not in the source at all and its cu128 disagrees with the cu130 used everywhere else, and (b) a transformers-/diffusers-only model can now pull a cu130-tagged torch into the mirror.
- Version specs are a manual copy. vllm>=0.11.2, transformers>=4.53.3, etc. happen to match the source right now, but they are hand-transcribed; the next bump in ENGINE_VIRTUALENV_PACKAGES silently leaves the mirror behind, surfacing only during an offline install.
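Drift of this kind can be detected mechanically by diffing the hand-maintained list against the source list per engine. The sketch below uses small illustrative samples rather than the full definitions; `find_drift` and both sample dictionaries are hypothetical.

```python
from typing import Dict, List

# Illustrative samples only; the real lists are longer.
SOURCE_SAMPLE: Dict[str, List[str]] = {
    "vllm": ["vllm>=0.11.2"],
    "sglang": ["sglang>=0.5.6", 'sgl_kernel ; cuda_version < "13.0"'],
}
SCRIPT_COPY_SAMPLE: Dict[str, List[str]] = {
    "vllm": ["vllm>=0.11.2"],
    "sglang": ["sglang>=0.5.6", "sgl_kernel"],
}


def find_drift(source: Dict[str, List[str]],
               copy: Dict[str, List[str]]) -> Dict[str, Dict[str, List[str]]]:
    """Return, per engine, the specs that differ between source and copy."""
    drift = {}
    for engine in sorted(set(source) | set(copy)):
        src = set(source.get(engine, []))
        cpy = set(copy.get(engine, []))
        if src != cpy:
            drift[engine] = {
                "missing_from_copy": sorted(src - cpy),
                "only_in_copy": sorted(cpy - src),
            }
    return drift
```

On these samples only sglang is reported, because the bare sgl_kernel entry in the copy does not match the marker-qualified entry in the source sample.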
Suggested fix — generate from the source instead of re-encoding in bash. Replace the hardcoded placeholder/index logic with a small Python helper that imports and reuses the real definitions, so the list is exactly what the runtime computes:
    from xinference.core.virtual_env_manager import (
        ENGINE_VIRTUALENV_PACKAGES,
        ENGINE_VIRTUALENV_EXTRA_INDEX_URLS,
        expand_engine_dependency_placeholders,
        filter_virtualenv_packages_by_markers,
    )
    # for each engine: expand placeholders, then
    #   filter_virtualenv_packages_by_markers(expanded, engine, cuda_version="13.0")

This makes the package list (incl. the sgl_kernel URLs) and the per-engine extra indexes track the source automatically, killing all three drifts above. The downloader stage would need the module importable, e.g. COPY . /src + pip install --no-deps -e /src (please confirm the import surface stays light enough to avoid pulling heavy deps).
Caveat (related to the offline concern on docker-compose.yaml). Fixing the drift still won't make sglang work offline: the sgl_kernel cu130 dependency is referenced by a direct GitHub URL at venv-creation time, and a name-indexed pypiserver cannot serve a direct-URL wheel — the runtime will still try to reach github.com. Truly supporting these engines offline needs xinference core to express such deps as index-resolvable names (or to mirror + rewrite them), which is beyond this script.
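To illustrate the per-engine index point, here is a sketch that builds one pip download command per engine with only that engine's extra indexes. The index mapping is copied from the values quoted in this comment; in the proposed fix it would be imported from the source instead. `build_download_cmd` is a hypothetical helper.

```python
from typing import Dict, Iterable, List

# Per-engine extra indexes as quoted in the review; engines not listed get none.
ENGINE_EXTRA_INDEXES: Dict[str, List[str]] = {
    "vllm": [
        "https://wheels.vllm.ai/0.19.0/cu130",
        "https://download.pytorch.org/whl/cu130",
    ],
    "sglang": ["https://download.pytorch.org/whl/cu130"],
}


def build_download_cmd(engine: str,
                       packages: Iterable[str],
                       dest: str = "/data/packages/") -> List[str]:
    """Build a `pip download` argv that uses only this engine's extra indexes."""
    cmd = ["pip", "download", "--dest", dest, "--no-cache-dir"]
    for url in ENGINE_EXTRA_INDEXES.get(engine, []):
        cmd += ["--extra-index-url", url]
    cmd += list(packages)
    return cmd
```

A transformers-only engine then gets a plain command with no extra index, so it cannot pull a cu130-tagged torch into the mirror by accident. This sketch does not address the index strategy, which the review describes as a setting applied at venv creation.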
Change list
- README.md with Docker Compose setup instructions

Simulate Offline Locally
Pull images first (requires internet this one time), then start and verify:
Block Internet from xinference Container
To simulate the air-gap at the container level while keeping host access:
Manually download packages which are needed when deploying qwen3.5 model with vllm engine
Clean up rules when done