Aider evals

Model	% correct	Cost	Time per case
gemini-2.5-pro-preview-06-05 (32k think)	83.1%	$49.88	200.3s
o3 (high) + gpt-4.1	82.7%	?	110.0s
o3 (high)	79.6%	$22.20	113.8s
gemini-2.5-pro-preview-06-05 (default think)	79.1%	$45.60	175.2s
Gemini 2.5 Pro Preview 05-06	76.9%	$37.41	165.3s
Gemini 2.5 Pro Preview 03-25	72.9%	$0.00	45.3s
claude-opus-4-20250514 (32k thinking)	72.0%	$65.75	44.1s
o4-mini (high)	72.0%	$19.64	176.5s
DeepSeek R1 (0528)	71.4%	$4.80	716.6s
claude-opus-4-20250514 (no think)	70.7%	$68.63	42.5s
claude-3-7-sonnet-20250219 (32k thinking tokens)	64.9%	$36.83	105.2s
DeepSeek R1 + claude-3-5-sonnet-20241022	64.0%	$13.29	251.6s
o1-2024-12-17 (high)	61.7%	$186.50	133.2s
claude-sonnet-4-20250514 (32k thinking)	61.3%	$26.58	79.9s
claude-3-7-sonnet-20250219 (no thinking)	60.4%	$17.72	28.3s
o3-mini (high)	60.4%	$18.16	124.6s
Qwen3 235B A22B diff, no think, Alibaba API	59.6%	$0.00	45.4s
DeepSeek R1	56.9%	$5.42	113.7s
claude-sonnet-4-20250514 (no thinking)	56.4%	$15.82	29.8s
gemini-2.5-flash-preview-05-20 (24k think)	55.1%	$8.56	53.9s
DeepSeek V3 (0324)	55.1%	$1.12	290.0s
Quasar Alpha	54.7%	$0.00	14.8s
o3-mini (medium)	53.8%	$8.86	47.2s
Grok 3 Beta	53.3%	$11.03	15.3s
Optimus Alpha	52.9%	$0.00	18.4s
gpt-4.1	52.4%	$9.86	20.5s
claude-3-5-sonnet-20241022	51.6%	$14.41	21.4s
Grok 3 Mini Beta (high)	49.3%	$0.73	79.1s
DeepSeek Chat V3 (prev)	48.4%	$0.34	34.8s
gemini-2.5-flash-preview-04-17 (default)	47.1%	$1.85	50.1s
chatgpt-4o-latest (2025-03-29)	45.3%	$19.74	10.3s
gpt-4.5-preview	44.9%	$183.18	113.5s
gemini-2.5-flash-preview-05-20 (no think)	44.0%	$1.14	12.2s
Qwen3 32B	40.0%	$0.76	372.2s
gemini-exp-1206	38.2%	$0.00	45.5s
Gemini 2.0 Pro exp-02-05	35.6%	$0.00	34.8s
Grok 3 Mini Beta (low)	34.7%	$0.79	35.1s
o1-mini-2024-09-12	32.9%	$18.58	34.7s
gpt-4.1-mini	32.4%	$1.99	19.5s
claude-3-5-haiku-20241022	28.0%	$6.06	31.8s
chatgpt-4o-latest (2025-02-15)	27.1%	$14.37	12.4s
QwQ-32B + Qwen 2.5 Coder Instruct	26.2%	$0.00	137.4s
gpt-4o-2024-08-06	23.1%	$7.03	16.0s
gemini-2.0-flash-exp	22.2%	$0.00	12.2s
qwen-max-2025-01-25	21.8%	?	39.5s
QwQ-32B	20.9%	$0.00	228.6s
gemini-2.0-flash-thinking-exp-01-21	18.2%	$0.00	24.2s
gpt-4o-2024-11-20	18.2%	$6.74	12.1s
DeepSeek Chat V2.5	17.8%	$0.51	184.0s
Qwen2.5-Coder-32B-Instruct	16.4%	$0.00	42.0s
Llama 4 Maverick	15.6%	$0.00	20.5s
yi-lightning	12.9%	$0.00	146.7s
command-a-03-2025-quality	12.0%	$0.00	85.1s
Codestral 25.01	11.1%	$1.98	9.3s
openhands-lm-32b-v0.1	10.2%	$0.00	195.6s
gpt-4.1-nano	8.9%	$0.43	12.0s
Qwen2.5-Coder-32B-Instruct	8.0%	$0.00	84.4s
gemma-3-27b-it	4.9%	$0.00	79.7s
gpt-4o-mini-2024-07-18	3.6%	$0.32	17.3s
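
To compare rows on more than one column, the tab-separated table can be loaded with a few lines of Python. The sketch below is only an illustration, not part of the original results. The function names and the "points per dollar" ratio are hypothetical. It assumes that a blank or "?" cost means the cost is unknown, and it skips $0.00 rows, since the table does not say whether a zero cost means free or unreported.

```python
import csv
import io

# A few rows in the table's tab-separated format; paste the full table to analyse all of it.
SAMPLE = (
    "Model\t% correct\tCost\tTime per case\n"
    "o3 (high) + gpt-4.1\t82.7%\t?\t110.0s\n"
    "o3 (high)\t79.6%\t$22.20\t113.8s\n"
    "DeepSeek R1 (0528)\t71.4%\t$4.80\t716.6s\n"
    "Gemini 2.5 Pro Preview 03-25\t72.9%\t$0.00\t45.3s\n"
)


def parse_cost(cell):
    """Return the cost in dollars, or None when the cell is blank or '?'."""
    cell = cell.strip()
    if cell in ("", "?"):
        return None
    return float(cell.lstrip("$"))


def parse_rows(text):
    """Turn the tab-separated table into a list of dicts with numeric fields."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    rows = []
    for rec in reader:
        rows.append({
            "model": rec["Model"],
            "pct_correct": float(rec["% correct"].rstrip("%")),
            "cost": parse_cost(rec["Cost"]),
            "seconds": float(rec["Time per case"].rstrip("s")),
        })
    return rows


def points_per_dollar(rows):
    """Hypothetical metric: percentage points per dollar, best first.

    Rows with an unknown (None) or zero cost are skipped (assumption, see above).
    """
    scored = [(r["model"], r["pct_correct"] / r["cost"]) for r in rows if r["cost"]]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


rows = parse_rows(SAMPLE)
for model, ratio in points_per_dollar(rows):
    print(f"{model}: {ratio:.2f} points per dollar")
```

The same parsed rows can be sorted by "seconds" instead to rank models by time per case.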

jph00/aider-results.md