Production inference using vLLM with tensor parallelism across dual MI60 GPUs.
Why vLLM?#
After evaluating Ollama, llama.cpp, and vLLM for MI60 inference, vLLM emerged as the best choice:
-
Tensor Parallelism: Native support for splitting large models across multiple GPUs. With dual MI60s (64GB total), I can run 70B parameter models that wouldn’t fit on a single GPU.
-
PagedAttention: More efficient memory management, allowing higher GPU utilization (90%) without OOM errors.
-
OpenAI-Compatible API: Drop-in replacement for OpenAI API, making integration seamless.
-
AWQ Quantization Support: Native support for 4-bit inference with minimal quality loss.
-
Production-Ready: Continuous batching, request scheduling, and health endpoints.
ROCm Compatibility#
The MI60 uses the gfx906 architecture, requiring a specialized vLLM build. I use the community image nalanzeyu/vllm-gfx906:v0.11.2-rocm6.3 which includes ROCm 6.3 support and gfx906-specific optimizations.
Example: big-chat Configuration#
Running Llama 3.3 70B across both MI60 GPUs using tensor parallelism:
services:
vllm:
image: nalanzeyu/vllm-gfx906:v0.11.2-rocm6.3
container_name: vllm
devices:
- /dev/kfd:/dev/kfd
- /dev/dri/card1:/dev/dri/card1
- /dev/dri/card2:/dev/dri/card2
- /dev/dri/renderD128:/dev/dri/renderD128
- /dev/dri/renderD129:/dev/dri/renderD129
group_add:
- "44" # video group
- "992" # render group
shm_size: 16g
environment:
- HIP_VISIBLE_DEVICES=0,1
command:
- python
- -m
- vllm.entrypoints.openai.api_server
- --model
- casperhansen/llama-3.3-70b-instruct-awq
- --tensor-parallel-size
- "2"
- --max-model-len
- "32768"
- --gpu-memory-utilization
- "0.9"Configuration Choices#
| Parameter | Value | Rationale |
|---|---|---|
tensor-parallel-size |
2 | Split model across both GPUs |
max-model-len |
32768 | Balance context length with memory |
gpu-memory-utilization |
0.9 | Leave 10% headroom for KV cache growth |
| Model | llama-3.3-70b-instruct-awq | AWQ 4-bit quantization fits in 64GB |
AWQ Quantization#
I use AWQ (Activation-Aware Weight Quantization) models because:
- 4-bit weights reduce memory footprint by ~4x vs FP16
- Minimal quality loss compared to GPTQ or naive quantization
- Native vLLM support with optimized HIP kernels
- Tensor parallel compatible when model dimensions align
Note: Not all AWQ models support tensor parallelism. The model’s hidden dimensions must be divisible by
(group_size × tensor_parallel_size). Llama models work well; some MoE models don’t.
Embeddings Service#
Each configuration includes an embeddings service using Infinity:
embeddings:
image: michaelf34/infinity:latest
container_name: infinity-embeddings
ports:
- "8080:7997"
command:
- v2
- --model-id
- nomic-ai/nomic-embed-text-v1.5
- --device
- cpuThis provides OpenAI-compatible embeddings at http://localhost:8080 for RAG and semantic search.