feat: add pypiserver image, docker-compose setup, and CI pipeline by qinrui777 · Pull Request #5048 · xorbitsai/inference

qinrui777 · 2026-06-18T03:19:09Z

Change list

Add Docker Compose with online/offline modes via pypiserver profile
Add multi-stage Dockerfile to build pypiserver with all model wheels
Add list-packages.sh to extract packages from model JSONs by platform
Add GitHub Actions workflow to build/push multi-arch pypiserver images
Update README.md with Docker Compose setup instructions

Simulate Offline Locally

Pull images first (requires internet this one time), then start and verify:

# 1. Create .env and start
cat > .env << EOF
PIP_INDEX_URL=http://pypiserver:8080/simple/
PIP_TRUSTED_HOST=pypiserver
EOF
docker compose --profile pypiserver up -d

# 2. Verify xinference can reach pypiserver
docker exec xinference curl -s http://pypiserver:8080/simple/

Block Internet from xinference Container

To simulate the air-gap at the container level while keeping host access:

# 3. Get container IPs
XINF_IP=$(docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' xinference)
PYP_IP=$(docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' pypiserver)

# 4. Allow established connections (required for host ↔ xinference to work)
sudo iptables -I DOCKER-USER 1 -s $XINF_IP -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT

# 5. Allow xinference ↔ pypiserver
sudo iptables -I DOCKER-USER 2 -s $XINF_IP -d $PYP_IP -j ACCEPT

# 6. Drop all other outbound from xinference (blocks internet)
sudo iptables -I DOCKER-USER 3 -s $XINF_IP -j DROP

# 7. Verify internet is blocked — should say "OFFLINE"
docker exec xinference curl -s --connect-timeout 3 https://pypi.org \
  && echo "STILL ONLINE" \
  || echo "OFFLINE — good"

# 8. Verify pip is configured to use pypiserver
docker exec xinference python -c "import os; print(os.environ.get('PIP_INDEX_URL'))"
# → http://pypiserver:8080/simple/

# 9. Install a package from pypiserver — should succeed
docker exec xinference pip install --force-reinstall --no-deps --timeout 10 requests

# 10. Try reaching PyPI via pip — should fail (no internet)
docker exec xinference pip install --timeout 5 --index-url https://pypi.org/simple/ pip \
  && echo "STILL ONLINE — something is wrong" \
  || echo "OFFLINE — pip correctly blocked"

Manually install the packages needed when deploying the `qwen3.5` model with the `vllm` engine

#------ in offline xinference container ----

$ cat > qwen35-requirements.txt << EOF
av==17.1.0
certifi==2026.5.20
filelock==3.29.4
fsspec==2026.6.0
hf-xet==1.5.1
huggingface-hub==0.36.2
idna==3.18
msgpack==1.2.0
narwhals==2.22.1
numba==0.65.1
numpy==2.4.6
packaging==26.2
pillow==12.2.0
platformdirs==4.10.0
safetensors==0.8.0
scikit-learn==1.9.0
scipy==1.17.1
soundfile==0.14.0
tqdm==4.68.2
transformers==4.57.6
typing-extensions==4.15.0
urllib3==2.7.0
EOF

$ pip install -r qwen35-requirements.txt
Looking in indexes: http://pypiserver:8080/simple/
Collecting av==17.1.0 (from -r qwen35-requirements.txt (line 1))
  Downloading av-17.1.0-cp311-abi3-manylinux_2_28_x86_64.whl (35.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 35.4/35.4 MB 920.6 MB/s  0:00:00
Collecting certifi==2026.5.20 (from -r qwen35-requirements.txt (line 2))
  Downloading certifi-2026.5.20-py3-none-any.whl (134 kB)
Collecting filelock==3.29.4 (from -r qwen35-requirements.txt (line 3
mistral-common 1.11.2 requires numpy<2.4,>=1.25; python_version <= "3.12", but you have numpy 2.4.6 which is incompatible.
Successfully installed av-17.1.0 certifi-2026.5.20 filelock-3.29.4 fsspec-2026.6.0 hf-xet-1.5.1 huggingface-hub-0.36.2 idna-3.18 msgpack-1.2.0 narwhals-2.22.1 numba-0.65.1 numpy-2.4.6 pillow-12.2.0 platformdirs-4.10.0 safetensors-0.8.0 scikit-learn-1.9.0 scipy-1.17.1 soundfile-0.14.0 tqdm-4.68.2 transformers-4.57.6 urllib3-2.7.0
...

Clean up rules when done

sudo iptables -D DOCKER-USER -s $XINF_IP -j DROP
sudo iptables -D DOCKER-USER -s $XINF_IP -d $PYP_IP -j ACCEPT
sudo iptables -D DOCKER-USER -s $XINF_IP -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
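
The cleanup commands have to mirror the insert commands exactly, so one option is to generate both from a single rule list. This is a minimal sketch, assuming the same three DOCKER-USER rules as above. `build_rules`, `insert_commands`, and `delete_commands` are hypothetical helper names, and the IPs in the usage note are placeholders:

```python
from typing import List


def build_rules(xinf_ip: str, pyp_ip: str) -> List[List[str]]:
    """Rule specs in evaluation order (iptables stops at the first match)."""
    return [
        ["-s", xinf_ip, "-m", "conntrack", "--ctstate", "ESTABLISHED,RELATED", "-j", "ACCEPT"],
        ["-s", xinf_ip, "-d", pyp_ip, "-j", "ACCEPT"],
        ["-s", xinf_ip, "-j", "DROP"],
    ]


def insert_commands(rules: List[List[str]], chain: str = "DOCKER-USER") -> List[List[str]]:
    """Build `iptables -I CHAIN <position> ...` for positions 1..n."""
    return [
        ["iptables", "-I", chain, str(pos), *spec]
        for pos, spec in enumerate(rules, start=1)
    ]


def delete_commands(rules: List[List[str]], chain: str = "DOCKER-USER") -> List[List[str]]:
    """Build `iptables -D CHAIN ...` for each rule, removing the DROP first."""
    return [["iptables", "-D", chain, *spec] for spec in reversed(rules)]
```

For example, `for cmd in delete_commands(build_rules("172.18.0.3", "172.18.0.2")): print("sudo", *cmd)` prints the three cleanup lines without relying on `$XINF_IP` still being set in the shell.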

gemini-code-assist

Code Review

This pull request introduces a Docker Compose setup for both online and offline modes, including a local PyPI server (pypiserver) to host offline wheels. Key feedback includes fixing the pypiserver command in docker-compose.yaml to ensure wheels are served, using --chown in the Dockerfile to avoid permission errors, correcting invalid CUDA index URLs, and resolving script robustness issues in list-packages.sh (such as potential hangs, pipeline failures, and incorrect relative paths).


qinxuye

One issue remains around the offline package mirror build.

qinxuye · 2026-06-24T04:40:43Z

+             pip download --dest /data/packages/ --no-cache-dir \
+               "$pkg" 2>/dev/null \
+               && echo "OK  $pkg" || echo "SKIP $pkg" ;; \
+           *) \
+             pip download --dest /data/packages/ --no-cache-dir \
+               --extra-index-url https://download.pytorch.org/whl/cu130 \
+               --extra-index-url https://wheels.vllm.ai/0.19.0/cu130 \
+               --extra-index-url https://xorbitsai.github.io/xllamacpp/whl/cu128 \
+               "$pkg" 2>/dev/null \
+               && echo "OK  $pkg" || echo "SKIP $pkg" ;; \


This still makes the pypiserver image a best-effort mirror: every pip download failure is hidden by 2>/dev/null and converted into SKIP, so the workflow can publish an image that is missing required wheels and only fails later in offline installs. Since this image is meant to support offline use, please collect failed packages and fail the build for required downloads, or otherwise make the best-effort behavior explicit and avoid publishing it as a complete offline mirror.

qinrui777 · Jun 24, 2026

Fixed in the latest commit. Build result: https://github.com/qinrui777/inference/actions/runs/28079279864

Stopped swallowing stderr — removed 2>/dev/null from both pip invocations, so failures are now visible in the build log.

Recorded failures: each SKIP now appends the package name to /tmp/failed.txt, and the file is then printed to the console with cat.
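
The collect-then-fail behavior requested in this thread could look roughly like the following. This is a sketch under stated assumptions, not the PR's Dockerfile logic: `pip_download`, `download_all`, and `exit_code` are hypothetical names, and the downloader is injectable so the loop can be exercised without network access.

```python
import subprocess
import sys
from typing import Callable, Iterable, List


def pip_download(pkg: str, dest: str) -> bool:
    """Run `pip download` for one requirement; stderr stays visible in the log."""
    result = subprocess.run(
        [sys.executable, "-m", "pip", "download", "--dest", dest, "--no-cache-dir", pkg]
    )
    return result.returncode == 0


def download_all(
    packages: Iterable[str],
    dest: str,
    download: Callable[[str, str], bool] = pip_download,
) -> List[str]:
    """Attempt every package and return the ones that failed."""
    failed: List[str] = []
    for pkg in packages:
        ok = download(pkg, dest)
        print(("OK   " if ok else "FAIL ") + pkg)
        if not ok:
            failed.append(pkg)
    return failed


def exit_code(failed: List[str]) -> int:
    """Non-zero when any required download failed, so the build stops."""
    return 1 if failed else 0
```

A build step would end with something like `sys.exit(exit_code(download_all(packages, "/data/packages/")))`, so a missing wheel fails the image build instead of surfacing later in an offline install.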

rogercloud

Requesting changes for the release gating and offline package integrity issues below.

rogercloud

Three design-level concerns about whether the offline mirror is actually usable in a true air-gapped deployment. These are independent of the silent-failure / release-gating points already raised.

rogercloud · 2026-06-24T10:20:05Z

+# ============================================================================
+# Stage 1 — download every wheel needed by xinference model engines
+# ============================================================================
+FROM python:3.12-slim AS downloader


[Updated — downgraded to a minor note.] Downloader Python should track the runtime image.

I first read the in-repo Dockerfile (base vllm/vllm-openai:v0.17.1, Python 3.10) and worried this python:3.12 downloader would produce wheels that can't be installed at runtime. Inspecting the actually-published xprobe/xinference:latest config, however, it is built on a newer vLLM base with PYTHON_VERSION=3.12 and CUDA_VERSION=13.0.2. So today the downloader (3.12) matches the runtime interpreter and there is no ABI mismatch — and the cu130 extra indexes are correct for this image.

The only residual concern is forward-looking: this python:3.12-slim is hardcoded and decoupled from whatever Python the xinference base image actually ships. Because model venvs are created with python_path=sys.executable (xinference/core/worker.py), a future base-image Python bump would silently make the mirrored wheels uninstallable, and the failure would only surface in an offline install. Consider deriving/pinning the downloader Python (and CUDA tag) from the target image, or documenting the coupling, so they can't drift. Not a blocker.
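
One way to make that coupling explicit is a guard in the downloader stage that fails fast on a mismatch. This is a minimal sketch, assuming the runtime image's Python version is passed in as a string (for example from its `PYTHON_VERSION` value). `parse_major_minor` and `check_downloader_python` are hypothetical names:

```python
import sys
from typing import Optional, Tuple


def parse_major_minor(version: str) -> Tuple[int, int]:
    """Turn '3.12' or '3.12.4' into (3, 12)."""
    parts = version.strip().split(".")
    return int(parts[0]), int(parts[1])


def check_downloader_python(
    target_version: str,
    current: Optional[Tuple[int, int]] = None,
) -> None:
    """Raise if the downloader interpreter differs from the runtime image's."""
    if current is None:
        current = (sys.version_info.major, sys.version_info.minor)
    expected = parse_major_minor(target_version)
    if current != expected:
        raise RuntimeError(
            f"downloader runs Python {current[0]}.{current[1]}, "
            f"but the runtime image reports {expected[0]}.{expected[1]}"
        )
```

Running `check_downloader_python("3.12")` at the start of the download step would turn a future base-image bump into a build failure rather than an offline install failure.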

rogercloud · 2026-06-24T10:20:05Z

+    environment:
+      # - XINFERENCE_MODEL_SRC=modelscope
+      - XINFERENCE_HOME=/data
+      - PIP_INDEX_URL=${PIP_INDEX_URL:-https://pypi.org/simple}


Offline mode is incomplete for vllm/sglang. Pointing PIP_INDEX_URL at the local pypiserver only redirects pip's primary index. When xinference bootstraps a vllm/sglang model venv it injects engine-specific extra indexes from ENGINE_VIRTUALENV_EXTRA_INDEX_URLS (apply_engine_virtualenv_settings in xinference/core/utils.py) — https://wheels.vllm.ai/0.19.0/cu130 and https://download.pytorch.org/whl/cu130, with index_strategy=unsafe-best-match. These are passed to uv as explicit --extra-index-url args, so in a true air-gapped deployment those hosts are unreachable and venv creation for the heavy GPU engines fails despite the local mirror.

Note: setting a PIP_EXTRA_INDEX_URL environment variable does not help — get_pip_config_args() (xinference/utils.py) reads pip config list (config files only), not env vars, so the env var never reaches settings and the engine injection still happens.

Suggested fix (no xinference code change). inherit_pip_config (xinference/core/worker.py, ~L1707) runs before apply_engine_virtualenv_settings (~L1714), and the engine injection only fires when settings.extra_index_url is None. So provide a pip config file in the container (not an env var) that points both indexes at the local mirror:

# /etc/pip.conf
[global]
index-url = http://pypiserver:8080/simple/
extra-index-url = http://pypiserver:8080/simple/
trusted-host = pypiserver

Then pip config list -> get_pip_config_args() -> inherit_pip_config sets settings.extra_index_url to the mirror (non-None), so apply_engine_virtualenv_settings skips injecting the wheels.vllm.ai / download.pytorch.org internet indexes, and venv creation resolves entirely from pypiserver (the mirror already holds those cu130 wheels by name). Mount this file (e.g. to /etc/pip.conf, or via PIP_CONFIG_FILE) in the offline profile.

Two caveats:

This does not fix sglang's sgl_kernel: it is referenced as a direct GitHub URL at venv-creation time (not an index lookup), so a name-indexed pypiserver cannot serve it and pip will still try to reach github.com. That needs xinference core to express such deps as index-resolvable names (or mirror + rewrite). So the config-file fix unblocks vllm, not sglang.

The mechanism is implicit (relies on the inherit-before-apply ordering and the "only inject when None" guard). A cleaner long-term fix is an explicit offline switch in xinference that disables engine extra-index injection — out of scope for this PR.
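
The ordering argument above can be illustrated with a toy model. This is not xinference's code: `ToySettings`, `toy_inherit_pip_config`, and `toy_apply_engine_settings` are hypothetical stand-ins that only mimic the "inherit first, inject only when None" behavior described in this comment, using the vllm index URLs quoted there.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Engine indexes as quoted in the review comment (vllm only, for brevity).
ENGINE_EXTRA_INDEXES: Dict[str, List[str]] = {
    "vllm": [
        "https://wheels.vllm.ai/0.19.0/cu130",
        "https://download.pytorch.org/whl/cu130",
    ],
}


@dataclass
class ToySettings:
    index_url: Optional[str] = None
    extra_index_url: Optional[List[str]] = None


def toy_inherit_pip_config(settings: ToySettings, pip_config: Dict[str, str]) -> None:
    """Step 1: copy values found in pip config files into the settings."""
    if "index-url" in pip_config:
        settings.index_url = pip_config["index-url"]
    if "extra-index-url" in pip_config:
        settings.extra_index_url = [pip_config["extra-index-url"]]


def toy_apply_engine_settings(settings: ToySettings, engine: str) -> None:
    """Step 2: inject engine indexes only when none were inherited."""
    if settings.extra_index_url is None:
        settings.extra_index_url = list(ENGINE_EXTRA_INDEXES.get(engine, []))


def resolve(pip_config: Dict[str, str], engine: str) -> ToySettings:
    """Run both steps in the order the comment describes."""
    settings = ToySettings()
    toy_inherit_pip_config(settings, pip_config)
    toy_apply_engine_settings(settings, engine)
    return settings
```

With only `index-url` set, `resolve(...)` still ends up with the internet indexes. With `extra-index-url` also pointing at the mirror, the guard in step 2 leaves the mirror URL untouched, which is the effect the config-file suggestion relies on.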

rogercloud · 2026-06-24T10:20:05Z

+    '#vllm_dependencies#'|vllm_dependencies)
+      printf 'vllm>=0.11.2\n' ;;
+    '#sglang_dependencies#'|sglang_dependencies)
+      printf 'pybase64\nzmq\npartial_json_parser\nsentencepiece\ndill\nninja\nnumpy>=2.4.1\nsglang>=0.5.6\nsgl_kernel\n' ;;


Duplicated source of truth (already drifting). This script — and the --extra-index-url list in Dockerfile.pypiserver — is a hand-maintained copy of ENGINE_VIRTUALENV_PACKAGES / ENGINE_VIRTUALENV_EXTRA_INDEX_URLS from xinference/core/virtual_env_manager.py. Two sources of truth will drift, and the air-gapped install is the worst place to discover it.

Concrete drift today:

sglang sgl_kernel (this line). The source defines three entries — two pinned CUDA-13.0 direct-URL wheels (sgl_kernel-0.3.21+cu130-cp310-abi3-...x86_64.whl / ...aarch64.whl, gated by cuda_version == "13.0" and platform_machine == ...) plus a sgl_kernel ; cuda_version < "13.0" fallback. This line collapses all of that to a bare sgl_kernel. Since the published runtime image is CUDA 13.0.2, the runtime resolves to the pinned GitHub URL wheel, not PyPI sgl_kernel — so the mirror downloads the wrong artifact. (And see the note below: even mirroring the right wheel doesn't make it servable.)

Extra indexes: global vs per-engine, plus an out-of-band source. The source applies extra indexes per engine (vllm → [wheels.vllm.ai/0.19.0/cu130, download.pytorch.org/whl/cu130], sglang → [download.pytorch.org/whl/cu130]) with index_strategy = unsafe-best-match. The Dockerfile instead applies download.pytorch.org/whl/cu130 + wheels.vllm.ai/0.19.0/cu130 + xorbitsai.github.io/xllamacpp/whl/cu128 to every package and passes no index strategy. Note (a) xorbitsai.github.io/xllamacpp/whl/cu128 is not in the source at all and its cu128 disagrees with the cu130 used everywhere else, and (b) a transformers-/diffusers-only model can now pull a cu130-tagged torch into the mirror.

Version specs are a manual copy. vllm>=0.11.2, transformers>=4.53.3, etc. happen to match the source right now, but they are hand-transcribed; the next bump in ENGINE_VIRTUALENV_PACKAGES silently leaves the mirror behind, surfacing only during an offline install.

Suggested fix — generate from the source instead of re-encoding in bash. Replace the hardcoded placeholder/index logic with a small Python helper that imports and reuses the real definitions, so the list is exactly what the runtime computes:

from xinference.core.virtual_env_manager import (
    ENGINE_VIRTUALENV_PACKAGES,
    ENGINE_VIRTUALENV_EXTRA_INDEX_URLS,
    expand_engine_dependency_placeholders,
    filter_virtualenv_packages_by_markers,
)

# for each engine: expand placeholders, then
# filter_virtualenv_packages_by_markers(expanded, engine, cuda_version="13.0")

This makes the package list (incl. the sgl_kernel URLs) and the per-engine extra indexes track the source automatically, killing all three drifts above. The downloader stage would need the module importable — e.g. COPY . /src + pip install --no-deps -e /src (please confirm the import surface stays light enough to avoid pulling heavy deps).

Caveat (related to the offline concern on docker-compose.yaml). Fixing the drift still won't make sglang work offline: the sgl_kernel cu130 dependency is referenced by a direct GitHub URL at venv-creation time, and a name-indexed pypiserver cannot serve a direct-URL wheel — the runtime will still try to reach github.com. Truly supporting these engines offline needs xinference core to express such deps as index-resolvable names (or to mirror + rewrite them), which is beyond this script.
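
To make the per-engine index point and the direct-URL caveat concrete, here is a small sketch. It hardcodes the per-engine indexes quoted in this comment instead of importing them, so it is self-contained. `download_argv`, `is_direct_url`, and `split_requirements` are hypothetical helpers, and the wheel URL in the usage note is a placeholder, not the real sgl_kernel URL.

```python
from typing import Dict, List, Tuple

# Per-engine extra indexes as described in the comment above.
ENGINE_INDEXES: Dict[str, List[str]] = {
    "vllm": [
        "https://wheels.vllm.ai/0.19.0/cu130",
        "https://download.pytorch.org/whl/cu130",
    ],
    "sglang": ["https://download.pytorch.org/whl/cu130"],
}


def is_direct_url(requirement: str) -> bool:
    """True for 'https://...whl' or 'name @ https://...' style requirements."""
    return "://" in requirement


def download_argv(engine: str, requirement: str, dest: str) -> List[str]:
    """Build a `pip download` command that uses only this engine's indexes."""
    argv = ["pip", "download", "--dest", dest, "--no-cache-dir"]
    for url in ENGINE_INDEXES.get(engine, []):
        argv += ["--extra-index-url", url]
    argv.append(requirement)
    return argv


def split_requirements(requirements: List[str]) -> Tuple[List[str], List[str]]:
    """Separate index-resolvable names from direct-URL wheels."""
    by_name = [r for r in requirements if not is_direct_url(r)]
    by_url = [r for r in requirements if is_direct_url(r)]
    return by_name, by_url
```

For instance, `split_requirements(["sglang>=0.5.6", "https://example.invalid/sgl_kernel.whl"])` puts the second entry in the direct-URL bucket. Those entries are the ones a name-indexed pypiserver cannot serve, so a build could at least report them explicitly.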

qinrui777 added 25 commits June 9, 2026 10:28

New dockerfile for pypiserver and new workflow

a3ea80d

Upgrade action ver to avoid nodejs 20 warning

04396c9

support building arm64 iamge

ce44afe

seprate pypi packages into different type, build diff arch image with…

001291d

… diff packages

fix CMake Error while building

fca3dde

update download script

2d0ea8f

fix: ERROR: Could not find a version that satisfies the requirement o…

f389eaf

…nnxruntime==1.16.0

fix

4fe5190

fix again

f97be80

change to multi arch runner

7f3b8f5

add debug steps

a4d28fb

refactor workflow, remove build xinference package steps

1a83ce9

update

ce1aefb

fix eva-decord not found error

7ecb52d

fix

3cb0583

fix

c8ff380

refactor(pypiserver): nest packages by category (common/compiled/cu{v…

42834d8

…er}/external) and remove cu124

free disk for fixing errors

6c533e0

try to fix disk issue

8637652

debug: using only cu128

c3fb7e6

refactor pypiserver build flow

41f680a

restructure file path

0f783b6

feat/add build pypiserver feature

e8a430c

feat/adjust the image tag naming

86a222e

revert unused changes

ce6201a

XprobeBot added the feature label Jun 18, 2026

XprobeBot added this to the v2.x milestone Jun 18, 2026

qinrui777 requested a review from qinxuye June 18, 2026 03:19

gemini-code-assist Bot reviewed Jun 18, 2026


Update xinference/deploy/docker/pypiserver/list-packages.sh

77468c1

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

qinxuye reviewed Jun 24, 2026


bld: Stopped swallowing stderr,Recorded failures

c20062e

rogercloud requested changes Jun 24, 2026


Comment thread .github/workflows/build-pypiserver-image.yaml

Comment thread xinference/deploy/docker/pypiserver/Dockerfile.pypiserver Outdated

Comment thread docker-compose.yaml

rogercloud reviewed Jun 24, 2026


qinrui777 added 2 commits June 25, 2026 09:57

bld: add check-release-permission

fa11776

bld: make this fail the build when some packages failed downloads

f88b1a7


4 participants
