Model | % correct | Cost | Time per case |
---|---|---|---|
gemini-2.5-pro-preview-06-05 (32k think) | 83.1% | $49.88 | 200.3s |
o3 (high) + gpt-4.1 | 82.7% | ? | 110.0s |
o3 (high) | 79.6% | $22.20 | 113.8s |
gemini-2.5-pro-preview-06-05 (default think) | 79.1% | $45.60 | 175.2s |
Gemini 2.5 Pro Preview 05-06 | 76.9% | $37.41 | 165.3s |
Gemini 2.5 Pro Preview 03-25 | 72.9% | $0.00 | 45.3s |
claude-opus-4-20250514 (32k thinking) | 72.0% | $65.75 | 44.1s |
o4-mini (high) | 72.0% | $19.64 | 176.5s |
DeepSeek R1 (0528) | 71.4% | $4.80 | 716.6s |
claude-opus-4-20250514 (no think) | 70.7% | $68.63 | 42.5s |
claude-3-7-sonnet-20250219 (32k thinking tokens) | 64.9% | $36.83 | 105.2s |
DeepSeek R1 + claude-3-5-sonnet-20241022 | 64.0% | $13.29 | 251.6s |
o1-2024-12-17 (high) | 61.7% | $186.50 | 133.2s |
claude-sonnet-4-20250514 (32k thinking) | 61.3% | $26.58 | 79.9s |
claude-3-7-sonnet-20250219 (no thinking) | 60.4% | $17.72 | 28.3s |
o3-mini (high) | 60.4% | $18.16 | 124.6s |
Qwen3 235B A22B diff, no think, Alibaba API | 59.6% | $0.00 | 45.4s |
DeepSeek R1 | 56.9% | $5.42 | 113.7s |
claude-sonnet-4-20250514 (no thinking) | 56.4% | $15.82 | 29.8s |
gemini-2.5-flash-preview-05-20 (24k think) | 55.1% | $8.56 | 53.9s |
DeepSeek V3 (0324) | 55.1% | $1.12 | 290.0s |
Quasar Alpha | 54.7% | $0.00 | 14.8s |
o3-mini (medium) | 53.8% | $8.86 | 47.2s |
Grok 3 Beta | 53.3% | $11.03 | 15.3s |
Optimus Alpha | 52.9% | $0.00 | 18.4s |
gpt-4.1 | 52.4% | $9.86 | 20.5s |
claude-3-5-sonnet-20241022 | 51.6% | $14.41 | 21.4s |
Grok 3 Mini Beta (high) | 49.3% | $0.73 | 79.1s |
DeepSeek Chat V3 (prev) | 48.4% | $0.34 | 34.8s |
gemini-2.5-flash-preview-04-17 (default) | 47.1% | $1.85 | 50.1s |
chatgpt-4o-latest (2025-03-29) | 45.3% | $19.74 | 10.3s |
gpt-4.5-preview | 44.9% | $183.18 | 113.5s |
gemini-2.5-flash-preview-05-20 (no think) | 44.0% | $1.14 | 12.2s |
Qwen3 32B | 40.0% | $0.76 | 372.2s |
gemini-exp-1206 | 38.2% | $0.00 | 45.5s |
Gemini 2.0 Pro exp-02-05 | 35.6% | $0.00 | 34.8s |
Grok 3 Mini Beta (low) | 34.7% | $0.79 | 35.1s |
o1-mini-2024-09-12 | 32.9% | $18.58 | 34.7s |
gpt-4.1-mini | 32.4% | $1.99 | 19.5s |
claude-3-5-haiku-20241022 | 28.0% | $6.06 | 31.8s |
chatgpt-4o-latest (2025-02-15) | 27.1% | $14.37 | 12.4s |
QwQ-32B + Qwen 2.5 Coder Instruct | 26.2% | $0.00 | 137.4s |
gpt-4o-2024-08-06 | 23.1% | $7.03 | 16.0s |
gemini-2.0-flash-exp | 22.2% | $0.00 | 12.2s |
qwen-max-2025-01-25 | 21.8% | | 39.5s |
QwQ-32B | 20.9% | $0.00 | 228.6s |
gemini-2.0-flash-thinking-exp-01-21 | 18.2% | $0.00 | 24.2s |
gpt-4o-2024-11-20 | 18.2% | $6.74 | 12.1s |
DeepSeek Chat V2.5 | 17.8% | $0.51 | 184.0s |
Qwen2.5-Coder-32B-Instruct | 16.4% | $0.00 | 42.0s |
Llama 4 Maverick | 15.6% | $0.00 | 20.5s |
yi-lightning | 12.9% | $0.00 | 146.7s |
command-a-03-2025-quality | 12.0% | $0.00 | 85.1s |
Codestral 25.01 | 11.1% | $1.98 | 9.3s |
openhands-lm-32b-v0.1 | 10.2% | $0.00 | 195.6s |
gpt-4.1-nano | 8.9% | $0.43 | 12.0s |
Qwen2.5-Coder-32B-Instruct | 8.0% | $0.00 | 84.4s |
gemma-3-27b-it | 4.9% | $0.00 | 79.7s |
gpt-4o-mini-2024-07-18 | 3.6% | $0.32 | 17.3s |
Last active: June 11, 2025 02:50
Aider evals