## Complete Setup Instructions for Self-Hosting Qwen3-Coder-480B
This guide provides step-by-step instructions for AI agents to set up Qwen3-Coder-480B on Vast.ai with maximum context window.
### Prerequisites
- Vast.ai account with billing configured
- SSH client installed
- 2-3 hours for initial setup
### Step 1: Create Vast.ai Instance
1. Go to [Vast.ai](https://vast.ai)
2. Search for instances with:
   - 8x H200 GPUs (minimum 4x H200)
   - vLLM template pre-installed
   - 1TB+ disk space
   - Good DLPerf score (>80)
3. Select instance and configure:
   ```
   Docker Image: Leave default vLLM image
   Jupyter: Optional (uses port 8080)
   SSH: Required
   Direct SSH: Recommended
   ```
4. Deploy instance and note:
   - SSH port and IP
   - Instance ID
   - Cost per hour
### Step 2: SSH Configuration
Add to `~/.ssh/config`:
```bash
Host vastai-qwen
    HostName [YOUR_IP]
    Port [YOUR_PORT]
    User root
    StrictHostKeyChecking no
    UserKnownHostsFile /dev/null
```
Create SSH tunnel:
```bash
ssh -f -N -L 8000:localhost:8000 vastai-qwen
```
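To confirm the tunnel is up, a quick local check (the `curl` will only succeed once vLLM is serving in Step 3):
```bash
# The background ssh process should be visible (bracket trick avoids matching grep itself)
ps aux | grep "[8]000:localhost:8000"
# Once vLLM is running, requests through the tunnel should return the model list
curl -s http://localhost:8000/v1/models
```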
### Step 3: Deploy vLLM with Qwen3-480B
SSH into instance:
```bash
ssh vastai-qwen
```
Check GPU status:
```bash
nvidia-smi
```
Create deployment script:
```bash
cat > deploy-qwen3.sh << 'EOF'
#!/bin/bash
# Kill any existing vLLM processes
pkill -f vllm.entrypoints.openai.api_server || true
# Start vLLM with optimal settings
/venv/main/bin/python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 \
    --served-model-name qwen3-coder \
    --tensor-parallel-size 8 \
    --enable-expert-parallel \
    --gpu-memory-utilization 0.95 \
    --max-model-len 400000 \
    --rope-scaling '{"rope_type":"yarn","factor":1.53,"original_max_position_embeddings":262144}' \
    --download-dir /workspace/models \
    --host 0.0.0.0 \
    --port 8000 \
    --trust-remote-code \
    --dtype float16 \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --max-num-batched-tokens 32768 \
    > vllm.log 2>&1 &
echo "vLLM deployment started. Check vllm.log for progress."
EOF
chmod +x deploy-qwen3.sh
./deploy-qwen3.sh
```
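After launching the script, a quick sanity check that the server process actually started and is writing its log:
```bash
# The vLLM process should be listed, and the log should show startup messages
pgrep -f vllm.entrypoints.openai.api_server
tail -n 20 vllm.log
```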
### Step 4: Monitor Model Download
The first start downloads roughly 480GB of weights, which takes 1-2 hours:
```bash
# Watch download progress
tail -f vllm.log | grep -E "Downloading|Loading|Progress"
# Check disk usage
watch -n 5 'df -h /workspace'
```
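Rather than watching the log by hand, a simple readiness loop (run on the instance) waits until the OpenAI-compatible endpoint responds:
```bash
# Poll until vLLM answers; expect this to take 1-2 hours on the first run
until curl -sf http://localhost:8000/v1/models > /dev/null; do
  echo "$(date) - vLLM not ready yet"
  sleep 60
done
echo "vLLM is serving requests"
```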
### Step 5: Disable Vast.ai Authentication
Vast.ai's template fronts the instance with a Caddy proxy that enforces authentication. Disable it so clients can reach vLLM directly:
```bash
# Stop Caddy to remove authentication
supervisorctl stop caddy
# Verify direct access works
curl http://localhost:8000/v1/models
```
### Step 6: Configure AI Coding Clients
#### For Cline (VS Code Extension):
1. Install Cline extension in VS Code
2. Open Cline settings
3. Configure:
```
API Provider: OpenAI Compatible
Base URL: http://localhost:8000/v1
API Key: not-needed
Model: qwen3-coder
```
#### For Cursor:
1. Open Cursor settings
2. Add custom model:
```json
{
  "openai_api_key": "not-needed",
  "openai_api_base": "http://localhost:8000/v1",
  "model": "qwen3-coder"
}
```
#### For Command Line (qwen CLI):
Create config at `~/.config/qwen/config.json`:
```json
{
  "providers": {
    "qwen3-local": {
      "type": "openai",
      "base_url": "http://localhost:8000/v1",
      "api_key": "not-needed",
      "models": [{
        "id": "qwen3-coder",
        "name": "Qwen3-Coder-480B (400k context)",
        "context_window": 400000,
        "max_tokens": 16384
      }]
    }
  },
  "default_provider": "qwen3-local",
  "default_model": "qwen3-coder"
}
```
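Any other OpenAI-compatible client works the same way. A minimal sketch using the official `openai` Python package (an assumption: it is installed via `pip install openai`, v1 or later):
```bash
python3 << 'EOF'
from openai import OpenAI

# vLLM ignores the API key, but the client library requires a non-empty value
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="qwen3-coder",
    messages=[{"role": "user", "content": "Write a Python function that reverses a string."}],
    max_tokens=200,
)
print(response.choices[0].message.content)
EOF
```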
### Step 7: Test the Deployment
Test with curl:
```bash
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3-coder",
"messages": [{"role": "user", "content": "Write a Python hello world"}],
"max_tokens": 100
}'
```
Test context window:
```bash
# Create large context test
python3 << 'EOF'
import requests
import json
# Build a prompt of ~100k tokens: the sentence below is 45 characters, so ~450k characters total
large_context = "The quick brown fox jumps over the lazy dog. " * 10000
messages = [
    {"role": "system", "content": large_context},
    {"role": "user", "content": "Summarize the above in one sentence."},
]
response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={"model": "qwen3-coder", "messages": messages, "max_tokens": 50},
)
print(json.dumps(response.json(), indent=2))
EOF
```
### Step 8: Performance Optimization
Monitor GPU utilization:
```bash
# Real-time GPU monitoring
watch -n 1 nvidia-smi
# Check vLLM metrics
curl http://localhost:8000/metrics
```
Optimize for your use case:
- **For speed**: Reduce max_model_len to 100k-200k
- **For context**: Keep at 400k but expect slower responses
- **For cost**: Use 4x H200 instead of 8x (limited to 190k context)
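A rough way to check throughput after changing these settings is to time a single request and read the `usage` field of the response (a sketch; absolute numbers depend on load and context length):
```bash
python3 << 'EOF'
import time
import requests

payload = {
    "model": "qwen3-coder",
    "messages": [{"role": "user", "content": "Write a merge sort in Python with comments."}],
    "max_tokens": 512,
}
start = time.time()
r = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
elapsed = time.time() - start

usage = r.json().get("usage", {})
completion_tokens = usage.get("completion_tokens", 0)
print(f"Latency: {elapsed:.1f}s | completion tokens: {completion_tokens} | "
      f"throughput: {completion_tokens / elapsed:.1f} tok/s")
EOF
```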
### Step 9: Troubleshooting
Common issues and solutions:
#### Model won't load
```bash
# Check available memory
nvidia-smi
# Solution: Reduce --gpu-memory-utilization to 0.90
```
#### Authentication errors
```bash
# Ensure Caddy is stopped
supervisorctl status
supervisorctl stop caddy
```
#### Context too large errors
```bash
# Reduce max_model_len in deployment script
# 4x H200: max 190000
# 8x H200: max 400000
```
#### Slow responses
```bash
# Check batch settings
# Reduce --max-num-batched-tokens to 16384
# Enable streaming in client
```
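To confirm streaming works on the server side before changing client settings, pass `"stream": true` and use `curl -N` so output is not buffered (tokens arrive as server-sent events):
```bash
curl -N -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-coder",
    "messages": [{"role": "user", "content": "Count from 1 to 20."}],
    "max_tokens": 100,
    "stream": true
  }'
```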
### Step 10: Cost Monitoring
Track usage and costs:
```bash
# Create usage tracker
cat > track_usage.py << 'EOF'
#!/usr/bin/env python3
import datetime
import time

start_time = datetime.datetime.now()
hourly_rate = 12.40  # USD per hour; adjust to your instance

while True:
    elapsed = datetime.datetime.now() - start_time
    hours = elapsed.total_seconds() / 3600
    cost = hours * hourly_rate
    print(f"\rRunning for: {elapsed} | Cost: ${cost:.2f}", end="", flush=True)
    time.sleep(60)
EOF
chmod +x track_usage.py
./track_usage.py
```
### Advanced: Context Window Tuning
For different context windows, adjust these parameters:
#### 100k context (fastest):
```bash
--max-model-len 100000 \
--rope-scaling '{"rope_type":"yarn","factor":1.0,"original_max_position_embeddings":262144}'
```
#### 256k context (native):
```bash
--max-model-len 262144 \
--rope-scaling '{"rope_type":"yarn","factor":1.0,"original_max_position_embeddings":262144}'
```
#### 400k context (current):
```bash
--max-model-len 400000 \
--rope-scaling '{"rope_type":"yarn","factor":1.53,"original_max_position_embeddings":262144}'
```
#### 760k context (maximum, requires 16+ H200s):
```bash
--max-model-len 760000 \
--rope-scaling '{"rope_type":"yarn","factor":2.9,"original_max_position_embeddings":262144}'
```
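The YaRN `factor` in these presets is simply the target context length divided by the model's native 262,144-token window, clamped at 1.0 when no extension is needed; a quick check that reproduces the values above:
```bash
python3 << 'EOF'
# factor = target length / native length (262144); matches the presets: 400k -> ~1.53, 760k -> ~2.9
NATIVE = 262144
for target in (100_000, 262_144, 400_000, 760_000):
    factor = max(1.0, target / NATIVE)
    print(f"max-model-len {target:>7}: rope factor ~{factor:.2f}")
EOF
```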
### Maintenance
Regular maintenance tasks:
```bash
# Check logs for errors
tail -n 100 vllm.log | grep ERROR
# Monitor disk space
df -h /workspace
# Restart vLLM if needed
pkill -f vllm.entrypoints.openai.api_server
./deploy-qwen3.sh
# Clean up old model files (caution: this also deletes the active weights once they
# are older than 7 days, forcing a re-download on the next restart)
find /workspace/models -type f -mtime +7 -delete
```
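For unattended instances, the health and disk checks can be scheduled with cron (a sketch; the intervals and log paths are assumptions):
```bash
# Add via `crontab -e` on the instance
*/5 * * * * curl -sf http://localhost:8000/v1/models > /dev/null || echo "$(date) vLLM unreachable" >> /workspace/vllm-health.log
0 * * * * df -h /workspace >> /workspace/disk-usage.log
```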
### Security Notes
1. **SSH Tunnel**: Always use SSH tunnel, never expose port 8000 directly
2. **API Key**: vLLM ignores it, but most clients reject an empty value, so use a placeholder such as `not-needed` rather than leaving it blank
3. **Firewall**: Ensure only SSH port is open on Vast.ai
4. **Monitoring**: Check access logs regularly
### Cost-Saving Tips
1. **Pause when not using**: Vast.ai charges by the minute
2. **Use 4x H200**: Sufficient for 190k context at half the cost
3. **Spot instances**: 90% cheaper but risk of termination
4. **Optimize context**: Most tasks don't need 400k tokens
This completes the detailed setup guide for AI agents. The entire process should take 2-3 hours including model download time.