<details>
<summary><strong>🤖 Detailed Setup Guide for AI Agents</strong></summary>

## Complete Setup Instructions for Self-Hosting Qwen3-Coder-480B
This guide provides step-by-step instructions for AI agents to set up Qwen3-Coder-480B on Vast.ai with maximum context window.
### Prerequisites
- Vast.ai account with billing configured
- SSH client installed
- 2-3 hours for initial setup
### Step 1: Create Vast.ai Instance
1. Go to [Vast.ai](https://vast.ai)
2. Search for instances with (see the CLI sketch after this list):
   - 8x H200 GPUs (minimum 4x H200)
   - vLLM template pre-installed
   - 1TB+ disk space
   - Good DLPerf score (>80)
3. Select the instance and configure:
   ```
   Docker Image: Leave default vLLM image
   Jupyter: Optional (uses port 8080)
   SSH: Required
   Direct SSH: Recommended
   ```
4. Deploy the instance and note:
   - SSH port and IP
   - Instance ID
   - Cost per hour
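
The same search can also be done from the command line with the `vastai` CLI. The query below is only a sketch: the field names (`gpu_name`, `num_gpus`, `disk_space`, `dlperf`) mirror the web filters above and may need adjusting to the CLI's current query syntax.
```bash
pip install vastai
vastai set api-key YOUR_VASTAI_API_KEY   # key from the Vast.ai account page
# Query fields are assumptions mirroring the web filters; adjust if the CLI rejects them
vastai search offers 'gpu_name=H200 num_gpus>=8 disk_space>=1000 dlperf>80'
```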
### Step 2: SSH Configuration
Add to `~/.ssh/config`:
```bash
Host vastai-qwen
    HostName [YOUR_IP]
    Port [YOUR_PORT]
    User root
    StrictHostKeyChecking no
    UserKnownHostsFile /dev/null
```
Create SSH tunnel:
```bash
ssh -f -N -L 8000:localhost:8000 vastai-qwen
```
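The plain `ssh -f -N` tunnel dies silently if the connection drops. If `autossh` is installed locally, a keep-alive variant of the same tunnel (optional; not required by the rest of the guide) is:
```bash
# Re-create the tunnel automatically whenever the SSH connection drops
autossh -M 0 -f -N \
  -o "ServerAliveInterval 30" -o "ServerAliveCountMax 3" \
  -L 8000:localhost:8000 vastai-qwen
```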
### Step 3: Deploy vLLM with Qwen3-480B
SSH into the instance:
```bash
ssh vastai-qwen
```
Check GPU status:
```bash
nvidia-smi
```
Create deployment script:
```bash
cat > deploy-qwen3.sh << 'EOF'
#!/bin/bash
# Kill any existing vLLM processes
pkill -f vllm.entrypoints.openai.api_server || true
# Start vLLM with optimal settings
/venv/main/bin/python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 \
    --served-model-name qwen3-coder \
    --tensor-parallel-size 8 \
    --enable-expert-parallel \
    --gpu-memory-utilization 0.95 \
    --max-model-len 400000 \
    --rope-scaling '{"rope_type":"yarn","factor":1.53,"original_max_position_embeddings":262144}' \
    --download-dir /workspace/models \
    --host 0.0.0.0 \
    --port 8000 \
    --trust-remote-code \
    --dtype float16 \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --max-num-batched-tokens 32768 \
    > vllm.log 2>&1 &
echo "vLLM deployment started. Check vllm.log for progress."
EOF
chmod +x deploy-qwen3.sh
./deploy-qwen3.sh
```
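Even after the weights are on disk, loading 480B parameters across 8 GPUs takes a while. A small polling loop (a sketch; adjust the timeout to taste) confirms when the API is actually serving:
```bash
# Poll the OpenAI-compatible endpoint until vLLM responds (give up after ~30 minutes)
for i in $(seq 1 180); do
  if curl -sf http://localhost:8000/v1/models > /dev/null; then
    echo "vLLM is ready"; break
  fi
  sleep 10
done
```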
### Step 4: Monitor Model Download
The model download takes 1-2 hours for ~480GB of weights:
```bash
# Watch download progress
tail -f vllm.log | grep -E "Downloading|Loading|Progress"
# Check disk usage
watch -n 5 'df -h /workspace'
```
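If you want a rough ETA rather than raw progress lines, sampling the download directory twice gives an estimate; the 480GB total below is the guide's figure, not a measured value:
```bash
# Estimate download speed and remaining time from disk growth over one minute
BEFORE=$(du -sb /workspace/models | cut -f1)
sleep 60
AFTER=$(du -sb /workspace/models | cut -f1)
RATE=$(( (AFTER - BEFORE) / 60 ))                  # bytes per second
REMAINING=$(( 480 * 1024 * 1024 * 1024 - AFTER ))  # assumes ~480GB total
if [ "$RATE" -gt 0 ]; then
  echo "~$(( RATE / 1024 / 1024 )) MB/s, ETA ~$(( REMAINING / RATE / 60 )) min"
else
  echo "No progress in the last minute - check vllm.log"
fi
```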
### Step 5: Disable Vast.ai Authentication
Vast.ai fronts the instance with a Caddy proxy that adds authentication. Disable it so clients can reach vLLM directly:
```bash
# Stop Caddy to remove authentication
supervisorctl stop caddy
# Verify direct access works
curl http://localhost:8000/v1/models
```
### Step 6: Configure AI Coding Clients
#### For Cline (VS Code Extension):
1. Install the Cline extension in VS Code
2. Open Cline settings
3. Configure:
```
API Provider: OpenAI Compatible
Base URL: http://localhost:8000/v1
API Key: not-needed
Model: qwen3-coder
```
#### For Cursor:
1. Open Cursor settings
2. Add custom model:
```json
{
  "openai_api_key": "not-needed",
  "openai_api_base": "http://localhost:8000/v1",
  "model": "qwen3-coder"
}
```
#### For Command Line (qwen CLI):
Create config at `~/.config/qwen/config.json`:
```json
{
  "providers": {
    "qwen3-local": {
      "type": "openai",
      "base_url": "http://localhost:8000/v1",
      "api_key": "not-needed",
      "models": [{
        "id": "qwen3-coder",
        "name": "Qwen3-Coder-480B (400k context)",
        "context_window": 400000,
        "max_tokens": 16384
      }]
    }
  },
  "default_provider": "qwen3-local",
  "default_model": "qwen3-coder"
}
```
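Other OpenAI-compatible tools usually only need the standard environment variables. Assuming the client reads `OPENAI_BASE_URL` (newer SDKs) or `OPENAI_API_BASE` (older ones), pointing it at the tunnel looks like:
```bash
# Point any OpenAI-compatible client at the local tunnel
export OPENAI_BASE_URL="http://localhost:8000/v1"
export OPENAI_API_BASE="http://localhost:8000/v1"   # older SDKs/tools
export OPENAI_API_KEY="not-needed"
```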
### Step 7: Test the Deployment
Test with curl:
```bash
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-coder",
    "messages": [{"role": "user", "content": "Write a Python hello world"}],
    "max_tokens": 100
  }'
```
Test the context window:
```bash
# Create large context test
python3 << 'EOF'
import requests
import json

# Build a prompt of roughly 450k characters (~100k tokens)
large_context = "The quick brown fox jumps over the lazy dog. " * 10000
messages = [
    {"role": "system", "content": large_context},
    {"role": "user", "content": "Summarize the above in one sentence."}
]
response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={"model": "qwen3-coder", "messages": messages, "max_tokens": 50}
)
print(json.dumps(response.json(), indent=2))
EOF
```
### Step 8: Performance Optimization
Monitor GPU utilization:
```bash
# Real-time GPU monitoring
watch -n 1 nvidia-smi
# Check vLLM metrics
curl http://localhost:8000/metrics
```
Optimize for your use case (a rough throughput check is sketched below):
- **For speed**: Reduce `max_model_len` to 100k-200k
- **For context**: Keep it at 400k but expect slower responses
- **For cost**: Use 4x H200 instead of 8x (limited to 190k context)
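
To see what a given configuration actually delivers, here is a rough throughput check; it times a single request and reads the `usage` field from the response, needing only `curl` and `python3`:
```bash
# Time one generation and estimate output tokens/sec from the usage field
START=$(date +%s)
RESPONSE=$(curl -s -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3-coder",
       "messages": [{"role": "user", "content": "Write a quicksort in Python"}],
       "max_tokens": 512}')
END=$(date +%s)
echo "$RESPONSE" | python3 -c "
import json, sys
usage = json.load(sys.stdin)['usage']
elapsed = max($END - $START, 1)
tps = usage['completion_tokens'] / elapsed
print(f\"{usage['completion_tokens']} tokens in {elapsed}s -> {tps:.1f} tok/s\")
"
```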
### Step 9: Troubleshooting
Common issues and solutions:
#### Model won't load
```bash
# Check available memory
nvidia-smi
# Solution: Reduce --gpu-memory-utilization to 0.90
```
#### Authentication errors
```bash
# Ensure Caddy is stopped
supervisorctl status
supervisorctl stop caddy
```
#### Context too large errors
```bash
# Reduce max_model_len in the deployment script
# 4x H200: max 190000
# 8x H200: max 400000
```
#### Slow responses
```bash
# Check batch settings
# Reduce --max-num-batched-tokens to 16384
# Enable streaming in the client
```
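Streaming is the quickest perceived-latency improvement. The endpoint follows the OpenAI server-sent-events format, so it can be tested directly with `curl -N`:
```bash
# Tokens arrive incrementally as server-sent events instead of one final blob
curl -N -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-coder",
    "messages": [{"role": "user", "content": "Explain binary search briefly"}],
    "max_tokens": 200,
    "stream": true
  }'
```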
### Step 10: Cost Monitoring
Track usage and costs:
```bash
# Create usage tracker
cat > track_usage.py << 'EOF'
#!/usr/bin/env python3
import time
import datetime

start_time = datetime.datetime.now()
hourly_rate = 12.40  # Adjust based on your instance

while True:
    elapsed = datetime.datetime.now() - start_time
    hours = elapsed.total_seconds() / 3600
    cost = hours * hourly_rate
    print(f"\rRunning for: {elapsed} | Cost: ${cost:.2f}", end="")
    time.sleep(60)
EOF
chmod +x track_usage.py
./track_usage.py
```
### Advanced: Context Window Tuning
For different context windows, adjust these parameters:
#### 100k context (fastest):
```bash
--max-model-len 100000 \
--rope-scaling '{"rope_type":"yarn","factor":1.0,"original_max_position_embeddings":262144}'
```
#### 256k context (native):
```bash
--max-model-len 262144 \
--rope-scaling '{"rope_type":"yarn","factor":1.0,"original_max_position_embeddings":262144}'
```
#### 400k context (current):
```bash
--max-model-len 400000 \
--rope-scaling '{"rope_type":"yarn","factor":1.53,"original_max_position_embeddings":262144}'
```
#### 760k context (maximum, requires 16+ H200s):
```bash
--max-model-len 760000 \
--rope-scaling '{"rope_type":"yarn","factor":2.9,"original_max_position_embeddings":262144}'
```
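The YaRN factors above are simply the target window divided by the native 262144-token window (rounded); the same arithmetic gives the factor for any other target:
```bash
# factor = desired context length / native context length (262144)
python3 -c "print(round(400000 / 262144, 2))"   # 1.53
python3 -c "print(round(760000 / 262144, 2))"   # 2.9
```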
### Maintenance
Regular maintenance tasks:
```bash
# Check logs for errors
tail -n 100 vllm.log | grep ERROR
# Monitor disk space
df -h /workspace
# Restart vLLM if needed
pkill -f vllm.entrypoints.openai.api_server
./deploy-qwen3.sh
# Remove model files not modified in the last 7 days
find /workspace/models -type f -mtime +7 -delete
```
### Security Notes
1. **SSH Tunnel**: Always use the SSH tunnel; never expose port 8000 directly
2. **API Key**: Even though the value is just "not-needed", don't leave the field blank
3. **Firewall**: Ensure only the SSH port is open on Vast.ai
4. **Monitoring**: Check access logs regularly
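
Since every client reaches the server through the SSH tunnel, an optional hardening step (not part of the original deployment script) is to bind vLLM to loopback only, so nothing outside the instance can hit port 8000 even if it gets exposed:
```bash
# Bind vLLM to loopback only; the SSH tunnel still works because sshd
# forwards to localhost on the instance
sed -i 's/--host 0.0.0.0/--host 127.0.0.1/' deploy-qwen3.sh
./deploy-qwen3.sh
```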
### Cost-Saving Tips
1. **Pause when not using**: Vast.ai charges by the minute (see the CLI sketch below)
2. **Use 4x H200**: Sufficient for 190k context at half the cost
3. **Spot instances**: 90% cheaper but risk of termination
4. **Optimize context**: Most tasks don't need 400k tokens
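
Tip 1 can be scripted with the `vastai` CLI using the instance ID noted in Step 1. Stopping halts GPU billing while idle (storage is still charged); the exact subcommand names are an assumption worth verifying against `vastai --help`:
```bash
# Stop GPU billing while idle, resume later (storage remains billed)
vastai stop instance YOUR_INSTANCE_ID
vastai start instance YOUR_INSTANCE_ID
```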
This completes the detailed setup guide for AI agents. The entire process should take 2-3 hours including model download time.
</details>