## Complete Setup Instructions for Self-Hosting Qwen3-Coder-480B
This guide provides step-by-step instructions for AI agents to set up Qwen3-Coder-480B on Vast.ai with maximum context window.
### Prerequisites
- Vast.ai account with billing configured
- SSH client installed
- 2-3 hours for initial setup
### Step 1: Create Vast.ai Instance
1. Go to [Vast.ai](https://vast.ai)
2. Search for instances with:
   - 8x H200 GPUs (minimum 4x H200)
   - vLLM template pre-installed
   - 1TB+ disk space
   - Good DLPerf score (>80)
3. Select instance and configure:
   ```
   Docker Image: Leave default vLLM image
   Jupyter: Optional (uses port 8080)
   SSH: Required
   Direct SSH: Recommended
   ```
4. Deploy instance and note:
   - SSH port and IP
   - Instance ID
   - Cost per hour
### Step 2: SSH Configuration
Add to `~/.ssh/config`:
```bash
Host vastai-qwen
    HostName [YOUR_IP]
    Port [YOUR_PORT]
    User root
    StrictHostKeyChecking no
    UserKnownHostsFile /dev/null
```
Create SSH tunnel:
```bash
ssh -f -N -L 8000:localhost:8000 vastai-qwen
```
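To confirm the tunnel is up, a quick local check (the `curl` will only succeed once vLLM is serving in Step 3):
```bash
# The background ssh process should be visible (bracket trick avoids matching grep itself)
ps aux | grep "[8]000:localhost:8000"
# Once vLLM is running, requests through the tunnel should return the model list
curl -s http://localhost:8000/v1/models
```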
### Step 3: Deploy vLLM with Qwen3-480B
SSH into instance:
```bash
ssh vastai-qwen
```
Check GPU status:
```bash
nvidia-smi
```
Create deployment script:
```bash
cat > deploy-qwen3.sh << 'EOF'
#!/bin/bash
# Kill any existing vLLM processes
pkill -f vllm.entrypoints.openai.api_server || true
# Start vLLM with optimal settings
/venv/main/bin/python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 \
    --served-model-name qwen3-coder \
    --tensor-parallel-size 8 \
    --enable-expert-parallel \
    --gpu-memory-utilization 0.95 \
    --max-model-len 400000 \
    --rope-scaling '{"rope_type":"yarn","factor":1.53,"original_max_position_embeddings":262144}' \
    --download-dir /workspace/models \
    --host 0.0.0.0 \
    --port 8000 \
    --trust-remote-code \
    --dtype float16 \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --max-num-batched-tokens 32768 \
    > vllm.log 2>&1 &
echo "vLLM deployment started. Check vllm.log for progress."
EOF
chmod +x deploy-qwen3.sh
./deploy-qwen3.sh
```
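After launching the script, a quick sanity check that the server process actually started and is writing its log:
```bash
# The vLLM process should be listed, and the log should show startup messages
pgrep -f vllm.entrypoints.openai.api_server
tail -n 20 vllm.log
```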
### Step 4: Monitor Model Download
The first start downloads roughly 480GB of weights, which takes 1-2 hours:
```bash
# Watch download progress
tail -f vllm.log | grep -E "Downloading|Loading|Progress"
# Check disk usage
watch -n 5 'df -h /workspace'
```
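Rather than watching the log by hand, a simple readiness loop (run on the instance) waits until the OpenAI-compatible endpoint responds:
```bash
# Poll until vLLM answers; expect this to take 1-2 hours on the first run
until curl -sf http://localhost:8000/v1/models > /dev/null; do
  echo "$(date) - vLLM not ready yet"
  sleep 60
done
echo "vLLM is serving requests"
```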
### Step 5: Disable Vast.ai Authentication
Vast.ai's template fronts the instance with a Caddy proxy that enforces authentication. Disable it so clients can reach vLLM directly:
```bash
# Stop Caddy to remove authentication
supervisorctl stop caddy
# Verify direct access works
curl http://localhost:8000/v1/models
```
### Step 6: Configure AI Coding Clients
#### For Cline (VS Code Extension):
1. Install Cline extension in VS Code
2. Open Cline settings
3. Configure:
```
API Provider: OpenAI Compatible
Base URL: http://localhost:8000/v1
API Key: not-needed
Model: qwen3-coder
```
#### For Cursor:
1. Open Cursor settings
2. Add custom model:
```json
{
  "openai_api_key": "not-needed",
  "openai_api_base": "http://localhost:8000/v1",
  "model": "qwen3-coder"
}
```
#### For Command Line (qwen CLI):
Create config at `~/.config/qwen/config.json`:
```json
{
  "providers": {
    "qwen3-local": {
      "type": "openai",
      "base_url": "http://localhost:8000/v1",
      "api_key": "not-needed",
      "models": [{
        "id": "qwen3-coder",
        "name": "Qwen3-Coder-480B (400k context)",
        "context_window": 400000,
        "max_tokens": 16384
      }]
    }
  },
  "default_provider": "qwen3-local",
  "default_model": "qwen3-coder"
}
```
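Any other OpenAI-compatible client works the same way. A minimal sketch using the official `openai` Python package (an assumption: it is installed via `pip install openai`, v1 or later):
```bash
python3 << 'EOF'
from openai import OpenAI

# vLLM ignores the API key, but the client library requires a non-empty value
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="qwen3-coder",
    messages=[{"role": "user", "content": "Write a Python function that reverses a string."}],
    max_tokens=200,
)
print(response.choices[0].message.content)
EOF
```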
### Step 7: Test the Deployment
Test with curl:
```bash
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3-coder",
"messages": [{"role": "user", "content": "Write a Python hello world"}],
"max_tokens": 100
}'
```
Test context window:
```bash
# Create large context test
python3 << 'EOF'
import requests
import json
# Build a prompt of ~100k tokens: the sentence below is 45 characters, so ~450k characters total
large_context = "The quick brown fox jumps over the lazy dog. " * 10000
messages = [
    {"role": "system", "content": large_context},
    {"role": "user", "content": "Summarize the above in one sentence."},
]
response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={"model": "qwen3-coder", "messages": messages, "max_tokens": 50},
)
print(json.dumps(response.json(), indent=2))
EOF
```
### Step 8: Performance Optimization
Monitor GPU utilization:
```bash
# Real-time GPU monitoring
watch -n 1 nvidia-smi
# Check vLLM metrics
curl http://localhost:8000/metrics
```
Optimize for your use case:
- **For speed**: Reduce max_model_len to 100k-200k
- **For context**: Keep at 400k but expect slower responses
- **For cost**: Use 4x H200 instead of 8x (limited to 190k context)
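A rough way to check throughput after changing these settings is to time a single request and read the `usage` field of the response (a sketch; absolute numbers depend on load and context length):
```bash
python3 << 'EOF'
import time
import requests

payload = {
    "model": "qwen3-coder",
    "messages": [{"role": "user", "content": "Write a merge sort in Python with comments."}],
    "max_tokens": 512,
}
start = time.time()
r = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
elapsed = time.time() - start

usage = r.json().get("usage", {})
completion_tokens = usage.get("completion_tokens", 0)
print(f"Latency: {elapsed:.1f}s | completion tokens: {completion_tokens} | "
      f"throughput: {completion_tokens / elapsed:.1f} tok/s")
EOF
```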
### Step 9: Troubleshooting
Common issues and solutions:
#### Model won't load
```bash
# Check available memory
nvidia-smi
# Solution: Reduce --gpu-memory-utilization to 0.90
```
#### Authentication errors
```bash
# Ensure Caddy is stopped
supervisorctl status
supervisorctl stop caddy
```
#### Context too large errors
```bash
# Reduce max_model_len in deployment script
# 4x H200: max 190000
# 8x H200: max 400000
```
#### Slow responses
```bash
# Check batch settings
# Reduce --max-num-batched-tokens to 16384
# Enable streaming in client
```
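To confirm streaming works on the server side before changing client settings, pass `"stream": true` and use `curl -N` so output is not buffered (tokens arrive as server-sent events):
```bash
curl -N -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-coder",
    "messages": [{"role": "user", "content": "Count from 1 to 20."}],
    "max_tokens": 100,
    "stream": true
  }'
```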
### Step 10: Cost Monitoring
Track usage and costs:
```bash
# Create usage tracker
cat > track_usage.py << 'EOF'
#!/usr/bin/env python3
import datetime
import time

start_time = datetime.datetime.now()
hourly_rate = 12.40  # USD per hour; adjust to your instance

while True:
    elapsed = datetime.datetime.now() - start_time
    hours = elapsed.total_seconds() / 3600
    cost = hours * hourly_rate
    print(f"\rRunning for: {elapsed} | Cost: ${cost:.2f}", end="", flush=True)
    time.sleep(60)
EOF
chmod +x track_usage.py
./track_usage.py
```
### Advanced: Context Window Tuning
For different context windows, adjust these parameters:
#### 100k context (fastest):
```bash
--max-model-len 100000 \
--rope-scaling '{"rope_type":"yarn","factor":1.0,"original_max_position_embeddings":262144}'
```
#### 256k context (native):
```bash
--max-model-len 262144 \
--rope-scaling '{"rope_type":"yarn","factor":1.0,"original_max_position_embeddings":262144}'
```
#### 400k context (current):
```bash
--max-model-len 400000 \
--rope-scaling '{"rope_type":"yarn","factor":1.53,"original_max_position_embeddings":262144}'
```
#### 760k context (maximum, requires 16+ H200s):
```bash
--max-model-len 760000 \
--rope-scaling '{"rope_type":"yarn","factor":2.9,"original_max_position_embeddings":262144}'
```
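The YaRN `factor` in these presets is simply the target context length divided by the model's native 262,144-token window, clamped at 1.0 when no extension is needed; a quick check that reproduces the values above:
```bash
python3 << 'EOF'
# factor = target length / native length (262144); matches the presets: 400k -> ~1.53, 760k -> ~2.9
NATIVE = 262144
for target in (100_000, 262_144, 400_000, 760_000):
    factor = max(1.0, target / NATIVE)
    print(f"max-model-len {target:>7}: rope factor ~{factor:.2f}")
EOF
```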
### Maintenance
Regular maintenance tasks:
```bash
# Check logs for errors
tail -n 100 vllm.log | grep ERROR
# Monitor disk space
df -h /workspace
# Restart vLLM if needed
pkill -f vllm.entrypoints.openai.api_server
./deploy-qwen3.sh
# Clean up old model files (caution: this also deletes the active weights once they
# are older than 7 days, forcing a re-download on the next restart)
find /workspace/models -type f -mtime +7 -delete
```
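For unattended instances, the health and disk checks can be scheduled with cron (a sketch; the intervals and log paths are assumptions):
```bash
# Add via `crontab -e` on the instance
*/5 * * * * curl -sf http://localhost:8000/v1/models > /dev/null || echo "$(date) vLLM unreachable" >> /workspace/vllm-health.log
0 * * * * df -h /workspace >> /workspace/disk-usage.log
```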
### Security Notes
1. **SSH Tunnel**: Always use SSH tunnel, never expose port 8000 directly
2. **API Key**: vLLM ignores it, but most clients reject an empty value, so use a placeholder such as `not-needed` rather than leaving it blank
3. **Firewall**: Ensure only SSH port is open on Vast.ai
4. **Monitoring**: Check access logs regularly
### Cost-Saving Tips
1. **Pause when not using**: Vast.ai charges by the minute
2. **Use 4x H200**: Sufficient for 190k context at half the cost
3. **Spot instances**: 90% cheaper but risk of termination
4. **Optimize context**: Most tasks don't need 400k tokens
This completes the detailed setup guide for AI agents. The entire process should take 2-3 hours including model download time.