feat: add pypiserver image, docker-compose setup, and CI pipeline#5048

Open
qinrui777 wants to merge 29 commits into
xorbitsai:mainfrom
qinrui777:feat/add-dc-and-pypiserver

Conversation

@qinrui777

Collaborator

Change list

  • Add Docker Compose with online/offline modes via pypiserver profile
  • Add multi-stage Dockerfile to build pypiserver with all model wheels
  • Add list-packages.sh to extract packages from model JSONs by platform
  • Add GitHub Actions workflow to build/push multi-arch pypiserver images
  • Update README.md with Docker Compose setup instructions

Simulate Offline Locally

Pull images first (requires internet this one time), then start and verify:

# 1. Create .env and start
cat > .env << EOF
PIP_INDEX_URL=http://pypiserver:8080/simple/
PIP_TRUSTED_HOST=pypiserver
EOF
docker compose --profile pypiserver up -d

# 2. Verify xinference can reach pypiserver
docker exec xinference curl -s http://pypiserver:8080/simple/

Block Internet from xinference Container

To simulate the air-gap at the container level while keeping host access:

# 3. Get container IPs
XINF_IP=$(docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' xinference)
PYP_IP=$(docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' pypiserver)

# 4. Allow established connections (required for host ↔ xinference to work)
sudo iptables -I DOCKER-USER 1 -s $XINF_IP -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT

# 5. Allow xinference ↔ pypiserver
sudo iptables -I DOCKER-USER 2 -s $XINF_IP -d $PYP_IP -j ACCEPT

# 6. Drop all other outbound from xinference (blocks internet)
sudo iptables -I DOCKER-USER 3 -s $XINF_IP -j DROP

# 7. Verify internet is blocked — should say "OFFLINE"
docker exec xinference curl -s --connect-timeout 3 https://pypi.org \
  && echo "STILL ONLINE" \
  || echo "OFFLINE — good"

# 8. Verify pip is configured to use pypiserver
docker exec xinference python -c "import os; print(os.environ.get('PIP_INDEX_URL'))"
# → http://pypiserver:8080/simple/

# 9. Install a package from pypiserver — should succeed
docker exec xinference pip install --force-reinstall --no-deps --timeout 10 requests

# 10. Try reaching PyPI via pip — should fail (no internet)
docker exec xinference pip install --timeout 5 --index-url https://pypi.org/simple/ pip \
  && echo "STILL ONLINE — something is wrong" \
  || echo "OFFLINE — pip correctly blocked"

Manually install the packages needed to deploy the qwen3.5 model with the vllm engine

#------ in offline xinference container ----

$ cat > qwen35-requirements.txt << EOF
av==17.1.0
certifi==2026.5.20
filelock==3.29.4
fsspec==2026.6.0
hf-xet==1.5.1
huggingface-hub==0.36.2
idna==3.18
msgpack==1.2.0
narwhals==2.22.1
numba==0.65.1
numpy==2.4.6
packaging==26.2
pillow==12.2.0
platformdirs==4.10.0
safetensors==0.8.0
scikit-learn==1.9.0
scipy==1.17.1
soundfile==0.14.0
tqdm==4.68.2
transformers==4.57.6
typing-extensions==4.15.0
urllib3==2.7.0
EOF

$ pip install -r qwen35-requirements.txt
Looking in indexes: http://pypiserver:8080/simple/
Collecting av==17.1.0 (from -r qwen35-requirements.txt (line 1))
  Downloading av-17.1.0-cp311-abi3-manylinux_2_28_x86_64.whl (35.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 35.4/35.4 MB 920.6 MB/s  0:00:00
Collecting certifi==2026.5.20 (from -r qwen35-requirements.txt (line 2))
  Downloading certifi-2026.5.20-py3-none-any.whl (134 kB)
Collecting filelock==3.29.4 (from -r qwen35-requirements.txt (line 3))
...
mistral-common 1.11.2 requires numpy<2.4,>=1.25; python_version <= "3.12", but you have numpy 2.4.6 which is incompatible.
Successfully installed av-17.1.0 certifi-2026.5.20 filelock-3.29.4 fsspec-2026.6.0 hf-xet-1.5.1 huggingface-hub-0.36.2 idna-3.18 msgpack-1.2.0 narwhals-2.22.1 numba-0.65.1 numpy-2.4.6 pillow-12.2.0 platformdirs-4.10.0 safetensors-0.8.0 scikit-learn-1.9.0 scipy-1.17.1 soundfile-0.14.0 tqdm-4.68.2 transformers-4.57.6 urllib3-2.7.0
...
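
The log above ends with a resolver warning: mistral-common 1.11.2 requires numpy<2.4,>=1.25, while the requirements file pins numpy==2.4.6. A pre-flight check could catch this kind of conflict before the install runs. The following is a minimal sketch, assuming plain numeric release versions only (no pre-releases, epochs, or wildcards); `release` and `satisfies` are hypothetical helper names, not part of pip.

```python
import operator
import re
from typing import Tuple

# Comparison operators supported by this simplified checker.
OPS = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "==": operator.eq,
    "!=": operator.ne,
}


def release(version: str) -> Tuple[int, ...]:
    """Return the leading numeric release segment, e.g. '2.4.6' -> (2, 4, 6)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unsupported version: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def satisfies(version: str, spec: str) -> bool:
    """Check a version against a comma-separated spec such as '<2.4,>=1.25'."""
    have = release(version)
    for clause in spec.split(","):
        match = re.fullmatch(r"\s*(<=|>=|==|!=|<|>)\s*([\d.]+)\s*", clause)
        if match is None:
            raise ValueError(f"unsupported clause: {clause!r}")
        op, bound_text = match.groups()
        bound = release(bound_text)
        # Pad both tuples with zeros so (2, 4) compares as (2, 4, 0).
        width = max(len(have), len(bound))
        left = have + (0,) * (width - len(have))
        right = bound + (0,) * (width - len(bound))
        if not OPS[op](left, right):
            return False
    return True


# The pin from qwen35-requirements.txt against the bound in the warning.
print(satisfies("2.4.6", "<2.4,>=1.25"))
```

A real check would more likely rely on pip's own resolver or the `packaging` library; this version only illustrates the comparison that produced the warning.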

Clean up rules when done

sudo iptables -D DOCKER-USER -s $XINF_IP -j DROP
sudo iptables -D DOCKER-USER -s $XINF_IP -d $PYP_IP -j ACCEPT
sudo iptables -D DOCKER-USER -s $XINF_IP -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
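
The cleanup only works if each delete matches its insert exactly, and it depends on $XINF_IP and $PYP_IP still holding the values used earlier. As a minimal sketch, the insert and delete commands can be derived from one rule list so they cannot diverge. The function names are hypothetical, and the sketch only builds argument lists; it does not run iptables.

```python
from typing import List

CHAIN = "DOCKER-USER"


def rule_specs(xinf_ip: str, pyp_ip: str) -> List[List[str]]:
    """Rule bodies in evaluation order; iptables stops at the first match."""
    return [
        # Replies on established connections (host <-> xinference).
        ["-s", xinf_ip, "-m", "conntrack", "--ctstate", "ESTABLISHED,RELATED", "-j", "ACCEPT"],
        # xinference -> pypiserver.
        ["-s", xinf_ip, "-d", pyp_ip, "-j", "ACCEPT"],
        # Everything else from xinference is dropped.
        ["-s", xinf_ip, "-j", "DROP"],
    ]


def insert_commands(xinf_ip: str, pyp_ip: str) -> List[List[str]]:
    """Build 'iptables -I DOCKER-USER <position> ...' for positions 1..3."""
    specs = rule_specs(xinf_ip, pyp_ip)
    return [["iptables", "-I", CHAIN, str(pos)] + spec for pos, spec in enumerate(specs, start=1)]


def delete_commands(xinf_ip: str, pyp_ip: str) -> List[List[str]]:
    """Build matching 'iptables -D' commands, removing the DROP rule first."""
    specs = rule_specs(xinf_ip, pyp_ip)
    return [["iptables", "-D", CHAIN] + spec for spec in reversed(specs)]


# Example IPs are placeholders; real values come from docker inspect.
for command in delete_commands("172.18.0.3", "172.18.0.2"):
    print(" ".join(command))
```

Deleting the DROP rule first restores connectivity immediately, which mirrors the order of the three cleanup commands above.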

@XprobeBot XprobeBot added this to the v2.x milestone Jun 18, 2026
@qinrui777 qinrui777 requested a review from qinxuye June 18, 2026 03:19

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces a Docker Compose setup for both online and offline modes, including a local PyPI server (pypiserver) to host offline wheels. Key feedback includes fixing the pypiserver command in docker-compose.yaml to ensure wheels are served, using --chown in the Dockerfile to avoid permission errors, correcting invalid CUDA index URLs, and resolving script robustness issues in list-packages.sh (such as potential hangs, pipeline failures, and incorrect relative paths).


Comment thread docker-compose.yaml
Comment thread xinference/deploy/docker/pypiserver/Dockerfile.pypiserver
Comment thread xinference/deploy/docker/pypiserver/Dockerfile.pypiserver Outdated
Comment thread xinference/deploy/docker/pypiserver/list-packages.sh
Comment thread xinference/deploy/docker/pypiserver/list-packages.sh
Comment thread xinference/deploy/docker/pypiserver/list-packages.sh
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

@qinxuye qinxuye left a comment

Contributor

One issue remains around the offline package mirror build.

Comment on lines +49 to +58
pip download --dest /data/packages/ --no-cache-dir \
"$pkg" 2>/dev/null \
&& echo "OK $pkg" || echo "SKIP $pkg" ;; \
*) \
pip download --dest /data/packages/ --no-cache-dir \
--extra-index-url https://download.pytorch.org/whl/cu130 \
--extra-index-url https://wheels.vllm.ai/0.19.0/cu130 \
--extra-index-url https://xorbitsai.github.io/xllamacpp/whl/cu128 \
"$pkg" 2>/dev/null \
&& echo "OK $pkg" || echo "SKIP $pkg" ;; \

Contributor

This still makes the pypiserver image a best-effort mirror: every pip download failure is hidden by 2>/dev/null and converted into SKIP, so the workflow can publish an image that is missing required wheels and only fails later in offline installs. Since this image is meant to support offline use, please collect failed packages and fail the build for required downloads, or otherwise make the best-effort behavior explicit and avoid publishing it as a complete offline mirror.
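
One way to do what this comment asks is to collect failures during the download loop and return a non-zero exit status at the end. The following is a minimal sketch, not the PR's implementation: `pip_download`, `mirror`, and `exit_code` are hypothetical names, and the downloader is injectable so the loop can be exercised without network access.

```python
import subprocess
import sys
from typing import Callable, Iterable, List


def pip_download(pkg: str, dest: str = "/data/packages/") -> bool:
    """Run 'pip download' for one requirement; stderr stays visible in the log."""
    result = subprocess.run(
        [sys.executable, "-m", "pip", "download", "--dest", dest, "--no-cache-dir", pkg]
    )
    return result.returncode == 0


def mirror(packages: Iterable[str], download: Callable[[str], bool] = pip_download) -> List[str]:
    """Attempt every package and return the ones that failed."""
    failed: List[str] = []
    for pkg in packages:
        ok = download(pkg)
        print(("OK   " if ok else "FAIL ") + pkg)
        if not ok:
            failed.append(pkg)
    return failed


def exit_code(failed: List[str]) -> int:
    """Non-zero when anything is missing, so the image build stops."""
    return 1 if failed else 0


# Dry run with a stand-in downloader that rejects one package.
missing = mirror(["requests", "no-such-wheel"], download=lambda pkg: pkg != "no-such-wheel")
print("exit status:", exit_code(missing))
```

In a Dockerfile RUN step, the script would end with `sys.exit(exit_code(failed))`, which fails the build and therefore blocks the publish step of the workflow.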

Collaborator Author

Fixed in the latest commit. Build result: https://github.com/qinrui777/inference/actions/runs/28079279864

  • Stopped swallowing stderr: removed 2>/dev/null from both pip invocations, so failures are now visible in the build log.
  • Recorded failures: each SKIP now appends the package name to /tmp/failed.txt, and the file is then printed to the console with cat.

@rogercloud rogercloud left a comment

Contributor

Requesting changes for the release gating and offline package integrity issues below.

Comment thread .github/workflows/build-pypiserver-image.yaml
Comment thread xinference/deploy/docker/pypiserver/Dockerfile.pypiserver Outdated
Comment thread docker-compose.yaml

@rogercloud rogercloud left a comment

Contributor

Three design-level concerns about whether the offline mirror is actually usable in a true air-gapped deployment. These are independent of the silent-failure / release-gating points already raised.

# ============================================================================
# Stage 1 — download every wheel needed by xinference model engines
# ============================================================================
FROM python:3.12-slim AS downloader

@rogercloud rogercloud Jun 24, 2026

Contributor

[Updated — downgraded to a minor note.] Downloader Python should track the runtime image.

I first read the in-repo Dockerfile (base vllm/vllm-openai:v0.17.1, Python 3.10) and worried this python:3.12 downloader would produce wheels that can't be installed at runtime. Inspecting the actually-published xprobe/xinference:latest config, however, it is built on a newer vLLM base with PYTHON_VERSION=3.12 and CUDA_VERSION=13.0.2. So today the downloader (3.12) matches the runtime interpreter and there is no ABI mismatch — and the cu130 extra indexes are correct for this image.

The only residual concern is forward-looking: this python:3.12-slim is hardcoded and decoupled from whatever Python the xinference base image actually ships. Because model venvs are created with python_path=sys.executable (xinference/core/worker.py), a future base-image Python bump would silently make the mirrored wheels uninstallable, and the failure would only surface in an offline install. Consider deriving/pinning the downloader Python (and CUDA tag) from the target image, or documenting the coupling, so they can't drift. Not a blocker.
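
To make the coupling checkable, the build could verify that every mirrored wheel's filename tags are usable by the runtime interpreter. The following is a rough sketch, assuming CPython only and ignoring the platform tag entirely; `wheel_tags` and `installable_on` are hypothetical helpers, and a real check would more likely use the `packaging` library.

```python
import re
from typing import Tuple


def wheel_tags(filename: str) -> Tuple[str, str, str]:
    """Split a wheel filename into (python tag, abi tag, platform tag)."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) < 5:
        raise ValueError(f"unexpected wheel name: {filename!r}")
    return parts[-3], parts[-2], parts[-1]


def installable_on(filename: str, py: Tuple[int, int]) -> bool:
    """Rough interpreter check; platform compatibility is NOT evaluated."""
    python_tag, abi_tag, _platform = wheel_tags(filename)
    major, minor = py
    for tag in python_tag.split("."):  # compressed tags such as 'py2.py3'
        match = re.fullmatch(r"(py|cp)(\d)(\d*)", tag)
        if match is None or int(match.group(2)) != major:
            continue
        impl, tag_minor = match.group(1), match.group(3)
        if impl == "py":
            # Generic tags: 'py3' fits any 3.x, 'py3N' fits 3.N and later.
            if tag_minor == "" or int(tag_minor) <= minor:
                return True
        elif abi_tag == "abi3":
            # Stable ABI: built against 3.N, usable on 3.N and later.
            if tag_minor and int(tag_minor) <= minor:
                return True
        elif tag_minor and int(tag_minor) == minor:
            # Version-specific ABI: exact minor version required.
            return True
    return False


# Filename taken from the install log earlier in this thread.
print(installable_on("av-17.1.0-cp311-abi3-manylinux_2_28_x86_64.whl", (3, 12)))
```

Running such a check against the Python version reported by the target image would turn a silent drift into a build-time failure.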

Comment thread docker-compose.yaml
environment:
# - XINFERENCE_MODEL_SRC=modelscope
- XINFERENCE_HOME=/data
- PIP_INDEX_URL=${PIP_INDEX_URL:-https://pypi.org/simple}

@rogercloud rogercloud Jun 24, 2026

Contributor

Offline mode is incomplete for vllm/sglang. Pointing PIP_INDEX_URL at the local pypiserver only redirects pip's primary index. When xinference bootstraps a vllm/sglang model venv it injects engine-specific extra indexes from ENGINE_VIRTUALENV_EXTRA_INDEX_URLS (apply_engine_virtualenv_settings in xinference/core/utils.py) — https://wheels.vllm.ai/0.19.0/cu130 and https://download.pytorch.org/whl/cu130, with index_strategy=unsafe-best-match. These are passed to uv as explicit --extra-index-url args, so in a true air-gapped deployment those hosts are unreachable and venv creation for the heavy GPU engines fails despite the local mirror.

Note: setting a PIP_EXTRA_INDEX_URL environment variable does not help — get_pip_config_args() (xinference/utils.py) reads pip config list (config files only), not env vars, so the env var never reaches settings and the engine injection still happens.

Suggested fix (no xinference code change). inherit_pip_config (xinference/core/worker.py, ~L1707) runs before apply_engine_virtualenv_settings (~L1714), and the engine injection only fires when settings.extra_index_url is None. So provide a pip config file in the container (not an env var) that points both indexes at the local mirror:

# /etc/pip.conf
[global]
index-url = http://pypiserver:8080/simple/
extra-index-url = http://pypiserver:8080/simple/
trusted-host = pypiserver

Then pip config list -> get_pip_config_args() -> inherit_pip_config sets settings.extra_index_url to the mirror (non-None), so apply_engine_virtualenv_settings skips injecting the wheels.vllm.ai / download.pytorch.org internet indexes, and venv creation resolves entirely from pypiserver (the mirror already holds those cu130 wheels by name). Mount this file (e.g. to /etc/pip.conf, or via PIP_CONFIG_FILE) in the offline profile.
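
The ordering argument can be illustrated with a toy model. This is not xinference code: `ToySettings`, `toy_inherit_pip_config`, and `toy_apply_engine_settings` are hypothetical stand-ins that only reproduce the two properties described above (inherit runs first, and engine indexes are injected only when no extra index is set). The index URLs are the ones quoted in this comment.

```python
from typing import Dict, List, Optional

MIRROR = "http://pypiserver:8080/simple/"

# Engine defaults as quoted in the review comment.
ENGINE_EXTRA_INDEXES: Dict[str, List[str]] = {
    "vllm": [
        "https://wheels.vllm.ai/0.19.0/cu130",
        "https://download.pytorch.org/whl/cu130",
    ],
}


class ToySettings:
    """Minimal stand-in for the virtualenv settings object."""

    def __init__(self) -> None:
        self.index_url: Optional[str] = None
        self.extra_index_url: Optional[List[str]] = None


def toy_inherit_pip_config(settings: ToySettings, pip_config: Dict[str, str]) -> None:
    """Step 1: copy values from the pip config file, if any are present."""
    if settings.index_url is None and "index-url" in pip_config:
        settings.index_url = pip_config["index-url"]
    if settings.extra_index_url is None and "extra-index-url" in pip_config:
        settings.extra_index_url = pip_config["extra-index-url"].split()


def toy_apply_engine_settings(settings: ToySettings, engine: str) -> None:
    """Step 2: inject engine indexes only when nothing was inherited."""
    if settings.extra_index_url is None:
        settings.extra_index_url = list(ENGINE_EXTRA_INDEXES.get(engine, []))


# Without a config file the internet indexes are injected.
online = ToySettings()
toy_inherit_pip_config(online, {})
toy_apply_engine_settings(online, "vllm")

# With the pip.conf above, the mirror is inherited and injection is skipped.
offline = ToySettings()
toy_inherit_pip_config(offline, {"index-url": MIRROR, "extra-index-url": MIRROR})
toy_apply_engine_settings(offline, "vllm")

print(online.extra_index_url)
print(offline.extra_index_url)
```

The model also shows why the mechanism is fragile: if either the call order or the None guard changes, the offline case silently falls back to the internet indexes.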

Two caveats:

  1. This does not fix sglang's sgl_kernel: it is referenced as a direct GitHub URL at venv-creation time (not an index lookup), so a name-indexed pypiserver cannot serve it and pip will still try to reach github.com. That needs xinference core to express such deps as index-resolvable names (or mirror + rewrite). So the config-file fix unblocks vllm, not sglang.
  2. The mechanism is implicit (relies on the inherit-before-apply ordering and the "only inject when None" guard). A cleaner long-term fix is an explicit offline switch in xinference that disables engine extra-index injection — out of scope for this PR.

'#vllm_dependencies#'|vllm_dependencies)
printf 'vllm>=0.11.2\n' ;;
'#sglang_dependencies#'|sglang_dependencies)
printf 'pybase64\nzmq\npartial_json_parser\nsentencepiece\ndill\nninja\nnumpy>=2.4.1\nsglang>=0.5.6\nsgl_kernel\n' ;;

@rogercloud rogercloud Jun 24, 2026

Contributor

Duplicated source of truth (already drifting). This script — and the --extra-index-url list in Dockerfile.pypiserver — is a hand-maintained copy of ENGINE_VIRTUALENV_PACKAGES / ENGINE_VIRTUALENV_EXTRA_INDEX_URLS from xinference/core/virtual_env_manager.py. Two sources of truth will drift, and the air-gapped install is the worst place to discover it.

Concrete drift today:

  1. sglang sgl_kernel (this line). The source defines three entries — two pinned CUDA-13.0 direct-URL wheels (sgl_kernel-0.3.21+cu130-cp310-abi3-...x86_64.whl / ...aarch64.whl, gated by cuda_version == "13.0" and platform_machine == ...) plus a sgl_kernel ; cuda_version < "13.0" fallback. This line collapses all of that to a bare sgl_kernel. Since the published runtime image is CUDA 13.0.2, the runtime resolves to the pinned GitHub URL wheel, not PyPI sgl_kernel — so the mirror downloads the wrong artifact. (And see the note below: even mirroring the right wheel doesn't make it servable.)

  2. Extra indexes: global vs per-engine, plus an out-of-band source. The source applies extra indexes per engine (vllm → [wheels.vllm.ai/0.19.0/cu130, download.pytorch.org/whl/cu130], sglang → [download.pytorch.org/whl/cu130]) with index_strategy = unsafe-best-match. The Dockerfile instead applies download.pytorch.org/whl/cu130 + wheels.vllm.ai/0.19.0/cu130 + xorbitsai.github.io/xllamacpp/whl/cu128 to every package and passes no index strategy. Note (a) xorbitsai.github.io/xllamacpp/whl/cu128 is not in the source at all and its cu128 disagrees with the cu130 used everywhere else, and (b) a transformers-/diffusers-only model can now pull a cu130-tagged torch into the mirror.

  3. Version specs are a manual copy. vllm>=0.11.2, transformers>=4.53.3, etc. happen to match the source right now, but they are hand-transcribed; the next bump in ENGINE_VIRTUALENV_PACKAGES silently leaves the mirror behind, surfacing only during an offline install.
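
Until the copy is removed, a CI check could at least report when the two lists disagree. The following is a minimal sketch, assuming both sides are available as lists of requirement strings; `normalize` and `drift` are hypothetical helpers, and the sample data contains only the entries quoted in this comment (the two direct-URL sgl_kernel wheels are left out because their full URLs are not shown here).

```python
from typing import Dict, Iterable, List


def normalize(requirement: str) -> str:
    """Strip all whitespace and lowercase so formatting differences are ignored."""
    return "".join(requirement.split()).lower()


def drift(source: Iterable[str], copy: Iterable[str]) -> Dict[str, List[str]]:
    """Return entries present on only one side after normalization."""
    src = {normalize(item) for item in source}
    cpy = {normalize(item) for item in copy}
    return {
        "missing_from_copy": sorted(src - cpy),
        "not_in_source": sorted(cpy - src),
    }


# Sample data: the marker-gated fallback from the source versus the bare name
# emitted by list-packages.sh.
source_entries = ["sglang>=0.5.6", 'sgl_kernel ; cuda_version < "13.0"']
script_entries = ["sglang>=0.5.6", "sgl_kernel"]

report = drift(source_entries, script_entries)
print(report)
```

A non-empty report would fail the job, which surfaces the mismatch at review time instead of during an offline install.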

Suggested fix — generate from the source instead of re-encoding in bash. Replace the hardcoded placeholder/index logic with a small Python helper that imports and reuses the real definitions, so the list is exactly what the runtime computes:

from xinference.core.virtual_env_manager import (
    ENGINE_VIRTUALENV_PACKAGES,
    ENGINE_VIRTUALENV_EXTRA_INDEX_URLS,
    expand_engine_dependency_placeholders,
    filter_virtualenv_packages_by_markers,
)
# for each engine: expand placeholders, then
# filter_virtualenv_packages_by_markers(expanded, engine, cuda_version="13.0")

This makes the package list (incl. the sgl_kernel URLs) and the per-engine extra indexes track the source automatically, killing all three drifts above. The downloader stage would need the module importable — e.g. COPY . /src + pip install --no-deps -e /src (please confirm the import surface stays light enough to avoid pulling heavy deps).
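
Once the per-engine data is available, the download step can attach extra indexes only to the engine that needs them. The sketch below uses a local dictionary as a stand-in for ENGINE_VIRTUALENV_EXTRA_INDEX_URLS (values as quoted in point 2 above) so it runs without importing xinference; `download_argv` is a hypothetical helper that only builds the command line.

```python
from typing import Dict, List

# Stand-in for the real per-engine mapping; values quoted from this review.
EXTRA_INDEX_URLS: Dict[str, List[str]] = {
    "vllm": [
        "https://wheels.vllm.ai/0.19.0/cu130",
        "https://download.pytorch.org/whl/cu130",
    ],
    "sglang": ["https://download.pytorch.org/whl/cu130"],
}


def download_argv(engine: str, requirement: str, dest: str = "/data/packages/") -> List[str]:
    """Build a 'pip download' command with only that engine's extra indexes."""
    argv = ["pip", "download", "--dest", dest, "--no-cache-dir"]
    for url in EXTRA_INDEX_URLS.get(engine, []):
        argv += ["--extra-index-url", url]
    argv.append(requirement)
    return argv


# A transformers-only requirement gets no CUDA indexes at all.
print(" ".join(download_argv("transformers", "transformers>=4.53.3")))
print(" ".join(download_argv("vllm", "vllm>=0.11.2")))
```

Keeping the mapping keyed by engine avoids the case described in point 2, where a transformers-only model pulls a cu130-tagged torch into the mirror.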

Caveat (related to the offline concern on docker-compose.yaml). Fixing the drift still won't make sglang work offline: the sgl_kernel cu130 dependency is referenced by a direct GitHub URL at venv-creation time, and a name-indexed pypiserver cannot serve a direct-URL wheel — the runtime will still try to reach github.com. Truly supporting these engines offline needs xinference core to express such deps as index-resolvable names (or to mirror + rewrite them), which is beyond this script.

4 participants