@jph00
Last active June 11, 2025 02:50
Aider evals
| Model | % correct | Cost | Time per case |
|---|---:|---:|---:|
| gemini-2.5-pro-preview-06-05 (32k think) | 83.1% | $49.88 | 200.3s |
| o3 (high) + gpt-4.1 | 82.7% | ? | 110.0s |
| o3 (high) | 79.6% | $22.20 | 113.8s |
| gemini-2.5-pro-preview-06-05 (default think) | 79.1% | $45.60 | 175.2s |
| Gemini 2.5 Pro Preview 05-06 | 76.9% | $37.41 | 165.3s |
| Gemini 2.5 Pro Preview 03-25 | 72.9% | $0.00 | 45.3s |
| claude-opus-4-20250514 (32k thinking) | 72.0% | $65.75 | 44.1s |
| o4-mini (high) | 72.0% | $19.64 | 176.5s |
| DeepSeek R1 (0528) | 71.4% | $4.80 | 716.6s |
| claude-opus-4-20250514 (no think) | 70.7% | $68.63 | 42.5s |
| claude-3-7-sonnet-20250219 (32k thinking tokens) | 64.9% | $36.83 | 105.2s |
| DeepSeek R1 + claude-3-5-sonnet-20241022 | 64.0% | $13.29 | 251.6s |
| o1-2024-12-17 (high) | 61.7% | $186.50 | 133.2s |
| claude-sonnet-4-20250514 (32k thinking) | 61.3% | $26.58 | 79.9s |
| claude-3-7-sonnet-20250219 (no thinking) | 60.4% | $17.72 | 28.3s |
| o3-mini (high) | 60.4% | $18.16 | 124.6s |
| Qwen3 235B A22B diff, no think, Alibaba API | 59.6% | $0.00 | 45.4s |
| DeepSeek R1 | 56.9% | $5.42 | 113.7s |
| claude-sonnet-4-20250514 (no thinking) | 56.4% | $15.82 | 29.8s |
| gemini-2.5-flash-preview-05-20 (24k think) | 55.1% | $8.56 | 53.9s |
| DeepSeek V3 (0324) | 55.1% | $1.12 | 290.0s |
| Quasar Alpha | 54.7% | $0.00 | 14.8s |
| o3-mini (medium) | 53.8% | $8.86 | 47.2s |
| Grok 3 Beta | 53.3% | $11.03 | 15.3s |
| Optimus Alpha | 52.9% | $0.00 | 18.4s |
| gpt-4.1 | 52.4% | $9.86 | 20.5s |
| claude-3-5-sonnet-20241022 | 51.6% | $14.41 | 21.4s |
| Grok 3 Mini Beta (high) | 49.3% | $0.73 | 79.1s |
| DeepSeek Chat V3 (prev) | 48.4% | $0.34 | 34.8s |
| gemini-2.5-flash-preview-04-17 (default) | 47.1% | $1.85 | 50.1s |
| chatgpt-4o-latest (2025-03-29) | 45.3% | $19.74 | 10.3s |
| gpt-4.5-preview | 44.9% | $183.18 | 113.5s |
| gemini-2.5-flash-preview-05-20 (no think) | 44.0% | $1.14 | 12.2s |
| Qwen3 32B | 40.0% | $0.76 | 372.2s |
| gemini-exp-1206 | 38.2% | $0.00 | 45.5s |
| Gemini 2.0 Pro exp-02-05 | 35.6% | $0.00 | 34.8s |
| Grok 3 Mini Beta (low) | 34.7% | $0.79 | 35.1s |
| o1-mini-2024-09-12 | 32.9% | $18.58 | 34.7s |
| gpt-4.1-mini | 32.4% | $1.99 | 19.5s |
| claude-3-5-haiku-20241022 | 28.0% | $6.06 | 31.8s |
| chatgpt-4o-latest (2025-02-15) | 27.1% | $14.37 | 12.4s |
| QwQ-32B + Qwen 2.5 Coder Instruct | 26.2% | $0.00 | 137.4s |
| gpt-4o-2024-08-06 | 23.1% | $7.03 | 16.0s |
| gemini-2.0-flash-exp | 22.2% | $0.00 | 12.2s |
| qwen-max-2025-01-25 | 21.8% |  | 39.5s |
| QwQ-32B | 20.9% | $0.00 | 228.6s |
| gemini-2.0-flash-thinking-exp-01-21 | 18.2% | $0.00 | 24.2s |
| gpt-4o-2024-11-20 | 18.2% | $6.74 | 12.1s |
| DeepSeek Chat V2.5 | 17.8% | $0.51 | 184.0s |
| Qwen2.5-Coder-32B-Instruct | 16.4% | $0.00 | 42.0s |
| Llama 4 Maverick | 15.6% | $0.00 | 20.5s |
| yi-lightning | 12.9% | $0.00 | 146.7s |
| command-a-03-2025-quality | 12.0% | $0.00 | 85.1s |
| Codestral 25.01 | 11.1% | $1.98 | 9.3s |
| openhands-lm-32b-v0.1 | 10.2% | $0.00 | 195.6s |
| gpt-4.1-nano | 8.9% | $0.43 | 12.0s |
| Qwen2.5-Coder-32B-Instruct | 8.0% | $0.00 | 84.4s |
| gemma-3-27b-it | 4.9% | $0.00 | 79.7s |
| gpt-4o-mini-2024-07-18 | 3.6% | $0.32 | 17.3s |