LLM Benchmarks March 2025

Model Scores

| #  | Model                                     | BI  | Compliance | Code | Reason | Score | Features         |
|----|-------------------------------------------|-----|------------|------|--------|-------|------------------|
| 1  | openai/o3-mini-2025-01-31                 | 45% | 70%        | 100% | 74%    | 76.7% | SO, Reason       |
| 2  | google/gemini-2.5-pro-preview-03-25       | 45% | 70%        | 93%  | 68%    | 71.1% | Reason           |
| 3  | anthropic/claude-3.7-sonnet:thinking      | 54% | 32%        | 100% | 67%    | 70.4% | Reason           |
| 4  | openai/o1-2024-12-17                      | 45% | 70%        | 84%  | 67%    | 70.0% | SO, Reason       |
| 5  | deepseek/deepseek-r1                      | 27% | 64%        | 100% | 63%    | 66.1% | SO, Reason, Open |
| 6  | deepseek/deepseek-r1-distill-llama-70b    | 36% | 32%        | 96%  | 56%    | 60.0% | Open             |
| 7  | deepseek/deepseek-chat-v3-0324            | 45% | 60%        | 70%  | 55%    | 59.6% | Reason, Open     |
| 8  | anthropic/claude-3.7-sonnet               | 45% | 47%        | 65%  | 55%    | 56.5% |                  |
| 9  | openai/gpt-4o-2024-11-20                  | 36% | 55%        | 62%  | 55%    | 53.6% | SO               |
| 10 | openai/gpt-4.5-preview-2025-02-27         | 45% | 47%        | 62%  | 53%    | 51.9% | SO               |
| 11 | deepseek/deepseek-chat                    | 36% | 47%        | 58%  | 49%    | 50.6% | SO, Open         |
| 12 | openai/gpt-4o-2024-08-06                  | 18% | 62%        | 63%  | 52%    | 50.5% | SO               |
| 13 | microsoft/phi-4                           | 36% | 62%        | 57%  | 48%    | 49.7% | Open             |
| 14 | meta-llama/llama-4-maverick               | 27% | 42%        | 70%  | 44%    | 49.1% | SO, Open         |
| 15 | qwen/qwen-max                             | 45% | 45%        | 45%  | 50%    | 46.3% |                  |
| 16 | google/gemma-3-27b-it                     | 27% | 27%        | 70%  | 43%    | 45.0% | Open             |
| 17 | anthropic/claude-3.5-sonnet               | 36% | 32%        | 57%  | 44%    | 43.6% |                  |
| 18 | meta-llama/llama-3.1-70b-instruct         | 36% | 50%        | 44%  | 43%    | 42.6% | SO, Open         |
| 19 | meta-llama/llama-3.3-70b-instruct         | 27% | 50%        | 48%  | 41%    | 40.8% | SO, Open         |
| 20 | google/gemini-2.0-flash-001               | 27% | 24%        | 57%  | 38%    | 40.7% |                  |
| 21 | qwen/qwq-32b                              | 36% | 52%        | 41%  | 37%    | 40.0% | SO, Reason, Open |
| 22 | qwen/qwen-2.5-72b-instruct                | 27% | 30%        | 47%  | 39%    | 39.2% | SO, Open         |
| 23 | mistralai/mistral-small-3.1-24b-instruct  | 36% | 42%        | 41%  | 39%    | 39.2% | SO, Open         |
| 24 | qwen/qwen2.5-32b-instruct                 | 27% | 20%        | 53%  | 36%    | 36.6% | Open             |
| 25 | qwen/qwen-2.5-coder-32b-instruct          | 18% | 35%        | 54%  | 39%    | 36.5% | SO, Open         |
| 26 | meta-llama/llama-3.1-405b-instruct        | 18% | 55%        | 40%  | 38%    | 35.5% | SO, Open         |
| 27 | google/gemma-3-12b-it                     | 9%  | 17%        | 61%  | 30%    | 33.4% | Open             |
| 28 | qwen/qwen-plus                            | 18% | 25%        | 40%  | 31%    | 31.7% |                  |
| 29 | mistralai/mixtral-8x22b-instruct          | 9%  | 27%        | 47%  | 28%    | 29.2% | SO, Open         |
| 30 | openai/gpt-4o-mini-2024-07-18             | 9%  | 32%        | 41%  | 30%    | 28.4% | SO               |
| 31 | mistral/mistral-small-24b-instruct-2501   | 27% | 22%        | 33%  | 30%    | 27.8% | SO, Open         |
| 32 | qwen/qwen-turbo                           | 0%  | 15%        | 41%  | 20%    | 21.9% |                  |
| 33 | deepseek/deepseek-r1-distill-qwen-32b     | 9%  | 22%        | 29%  | 17%    | 21.2% | SO, Open         |
| 34 | meta-llama/llama-4-scout                  | 9%  | 25%        | 22%  | 16%    | 18.0% | SO, Open         |
| 35 | mistral/ministral-8b                      | 18% | 0%         | 20%  | 13%    | 14.8% | SO, Open         |
| 36 | meta-llama/llama-3.2-3b-instruct          | 0%  | 17%        | 16%  | 11%    | 10.6% | SO, Open         |
| 37 | mistralai/mistral-large-2411              | 0%  | 0%         | 0%   | 0%     | 0.0%  | SO, Open         |
|    | Averages                                  | 27% | 38%        | 55%  | 41%    |       |                  |

Gemini-2.5 Pro Preview - takes 2nd place!

Google has released a couple of notable multimodal models. Let’s start with Gemini-2.5 Pro Preview (already available on Vertex AI). This is DeepMind’s most advanced LLM, designed to internally reason through complex problems before answering. This chain-of-thought approach yields high accuracy on difficult tasks, excelling in coding, math, and scientific problem-solving.

This large thinking model features native multimodality (it can work with documents, images, audio, and video) and has a theoretical context limit of 1M tokens (in practice, the effective context might be much smaller if the tasks are cognitively challenging).

The model debuted at the top of LLM Arena (a leaderboard where humans pick the chatbot answers they like best).

Our reasoning benchmark is not so much about chatting as about solving precise business problems. Still, Gemini 2.5 Pro managed to take second place, beating Claude 3.7 Sonnet Reasoning and OpenAI o1 (not pro).

And the most interesting part: Gemini 2.5 Pro worked without the support of Structured Outputs (because Google only supports a tiny subset of JSON Schema), yet it still managed to reach second place.
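For context, this is roughly what working without Structured Outputs means in practice: the schema is only described in the prompt, and the reply has to be parsed and validated on our side. A minimal sketch, assuming an OpenAI-compatible client pointed at OpenRouter (the model id, prompts, and schema below are illustrative placeholders, not our actual benchmark harness):

```python
import json
from openai import OpenAI  # any OpenAI-compatible client

# Illustrative setup; base_url, api_key and model id are placeholders.
client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="...")

schema = {
    "type": "object",
    "properties": {
        "reasoning": {"type": "string"},
        "answer": {"type": "string"},
    },
    "required": ["reasoning", "answer"],
}

# Without Structured Outputs, we can only *ask* for schema-compliant JSON in the prompt...
resp = client.chat.completions.create(
    model="google/gemini-2.5-pro-preview-03-25",
    messages=[
        {"role": "system", "content": "Reply only with JSON matching this schema: " + json.dumps(schema)},
        {"role": "user", "content": "Which business unit had the highest margin last quarter?"},
    ],
)

# ...and then parse and validate the answer ourselves instead of relying on constrained decoding.
data = json.loads(resp.choices[0].message.content)
assert set(schema["required"]) <= data.keys()
```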

DeepSeek V3-0324 - a progressive update over the previous version

DeepSeek V3-0324 is an ultra-large model (685B parameters) that uses a Mixture-of-Experts architecture. It activates specialized “experts” for different query types, giving it broad knowledge and skills while remaining somewhat efficient for its size.

In theory, anybody can download this model and run it on local hardware. In practice, its sheer size makes that impractical: the entire model still needs to sit in GPU VRAM so that it can switch between different experts.
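A quick back-of-the-envelope calculation shows why (a sketch based only on the published 685B parameter count; real deployments also need memory for the KV cache and activations):

```python
# Rough weight-memory estimate for hosting a 685B-parameter MoE model locally.
# Every expert must stay resident in VRAM, even though only a few fire per token.
params = 685e9

for precision, bytes_per_weight in [("bf16", 2), ("fp8", 1), ("int4", 0.5)]:
    gb = params * bytes_per_weight / 1e9
    print(f"{precision}: ~{gb:,.0f} GB of weights")

# bf16 ~1,370 GB, fp8 ~685 GB, int4 still ~340+ GB: multi-GPU territory in every case.
```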

The authors claim that the V3-0324 release made major leaps in logic and knowledge tasks, with significantly higher scores on benchmarks like MMLU, GPQA, and AIME than its predecessor.

In our enterprise benchmark we see a similar scale of improvement: the model got better across the board.

NB: at the moment there are no OpenRouter providers running V3 Chat with Structured Outputs, so it was tested without SO support. Still, it managed to improve its score.

Llama 4 Models from Meta - nothing remarkable

The recently released Llama 4 models are Meta’s take on open-source multimodal intelligence.

Llama 4 introduces MoE to the Llama family, greatly boosting efficiency. The Llama 4 Maverick model leverages 400B total parameters with only ~17B active per query, yielding faster responses and lower inference cost without sacrificing quality. It has 128 experts under the hood and targets a 1M-token context size.

The Llama 4 Scout model uses 16 experts, keeping its size to 109B parameters and targeting a 10M-token context length.
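The headline numbers follow from top-k expert routing: for each token, a small router picks a handful of experts, so only a fraction of the total weights participate in the forward pass, even though all of them have to stay loaded. A toy sketch of the idea (dimensions and k are illustrative; this is not Meta’s actual router):

```python
import numpy as np

rng = np.random.default_rng(0)

def route(token_hidden, router_weights, k=2):
    """Pick the top-k experts for one token and return softmax mixing weights."""
    logits = token_hidden @ router_weights      # one score per expert
    top_k = np.argsort(logits)[-k:]             # indices of the chosen experts
    w = np.exp(logits[top_k] - logits[top_k].max())
    return top_k, w / w.sum()

hidden = rng.standard_normal(256)               # toy hidden size
router = rng.standard_normal((256, 128))        # 128 experts, as in Maverick
experts, mix = route(hidden, router)
print(experts, mix)  # only these experts' FFN weights are "active" for this token
```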

Both models target 12 primary languages (English, French, German, Arabic, Hindi, Indonesian, Italian, Portuguese, Spanish, Tagalog, Thai, and Vietnamese), but were pretrained on over 200 languages (via the No Language Left Behind initiative).

However, despite powerful linguistic and multimodal capabilities, Llama 4 didn’t score high on our enterprise benchmark.

This is fine and expected for two reasons:

  • Historically, Llama models have never scored high on our benchmarks; fine-tunes that do better usually come out later.

  • Llama 4 models were not trained on reasoning workloads. Unlike reasoning models, they couldn’t make use of the reasoning slots within the custom chain-of-thought schemas; they just tried to jump straight to the answer (a sketch of such a schema follows this list).
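For readers unfamiliar with the idea, here is a minimal sketch of what such a custom chain-of-thought schema looks like (illustrative field names, not the exact schema used in the benchmark): the model is asked to fill the reasoning slots before it is allowed to produce the final answer.

```python
from pydantic import BaseModel

class TaskAnswer(BaseModel):
    relevant_facts: list[str]    # slot 1: what in the input actually matters
    reasoning_steps: list[str]   # slot 2: step-by-step derivation
    final_answer: str            # only now the actual answer

# Reasoning-trained models fill the slots with useful intermediate work;
# Llama 4 tended to skip straight to final_answer, wasting the slots.
print(TaskAnswer.model_json_schema())
```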

We’ll wait for r1 distills on top of llama-4 pruned expert trees.

Google Gemma 3 - this is where things get interesting

Gemma 3 27B is Google’s latest open model; it delivers state-of-the-art results for its size, rivaling or beating models many times larger.

In human evaluations (Chatbot Arena), it outscored a 405B-parameter Llama 3 and the 685B DeepSeek V3, all while running on a single GPU. It is also multimodal, supports context up to 128k tokens, and was trained for function calling.

The numbers check out on our side, too. This small, open model punches well above its weight, beating much larger models on our benchmark.

Best part? Gemma-3-27B was tested without Structured Outputs, yet it managed to follow a fairly complex schema and make use of the custom chain-of-thought slots in it.

Its smaller sibling, Gemma-3-12B, delivered similar results: it beats models of much larger size, also without Structured Outputs.

It looks like Google DeepMind has discovered some “secret sauce” that lets them reliably train state-of-the-art models at different sizes. Obviously, having a powerful LLM in Google Cloud is interesting for business, but having small open models that benefit from reasoning and perform really well is even better.

Best part? Google didn’t stop there.

You can download Gemma-3-27B-it from Hugging Face and run it on your own server. It is ~55GB of weights, which you’ll need to load into the GPU in bf16 format (two bytes per weight). That works out to roughly 60GB of VRAM for text-based tasks and ~70GB for vision tasks, which means an 80GB H100.

However, if you want to run this model on a smaller budget, there is an alternative. Google has also released a much smaller special version of this model, saved in GGUF format with Q4_0 quantisation (roughly 4 bits per weight). Thanks to Quantization-Aware Training (QAT), it preserves quality close to the original while significantly reducing the memory required to load the model.
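The arithmetic behind these numbers is straightforward (a sketch of the weight memory only; KV cache, activations, and the vision encoder add to it):

```python
# Weight-memory estimate for Gemma-3-27B: bf16 vs the QAT Q4_0 GGUF.
params = 27e9

bf16_gb = params * 2 / 1e9    # 2 bytes per weight  -> ~54 GB
q4_gb   = params * 0.5 / 1e9  # ~4 bits per weight  -> ~14 GB

print(f"bf16 weights: ~{bf16_gb:.0f} GB, Q4_0 weights: ~{q4_gb:.0f} GB")
# ~54 GB vs ~14 GB: the quantised version can fit on a single 24 GB consumer GPU
# (with modest context), instead of needing an 80 GB H100.
```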

This overall push towards open, powerful, and small (pick all three) models is great. It is essential for building business systems with Trustworthy AI under the hood. We’ll see how things evolve from here.

This trend also aligns well with our strategic focus for this year.

New direction - Enterprise Reasoning and Robotic Process Automation

The AI Case portfolio, LLM Benchmarks, Enterprise Challenges, and various events are all our vehicles for pushing AI R&D forward together with a talented community around the world.

Meet us at IBM!

By the way, if you are in Vienna on June 18th, come join IBM, Cloudera, and us for the “Designing Trustworthy AI” event. We’ll discuss a range of topics, from data governance to agents and public AI R&D efforts.

Previously, we focused on generic business tasks and reasoning. This was reflected in the BI, Compliance, and Reason categories of the LLM benchmark and culminated in the Enterprise RAG Challenge.

Next, we are going to ground our tasks closer to everyday enterprise challenges. We’ll focus more on Robotic Process Automation (RPA) and Enterprise Reasoning.

RPA can be described as “automating repetitive SAP workflows via human interfaces”. Historically, this was done with rules and browser automation. Recent progress in Large Language Models gives a new angle on the problem: operators and visual agents.
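To make the idea concrete, here is a conceptual observe-decide-act loop of such a visual agent. Everything in it (capture_screen, ask_llm, perform, and the action format) is a hypothetical stub: a sketch of the pattern, not any specific operator product or library API.

```python
from typing import Any

def capture_screen() -> bytes:
    return b"<screenshot>"           # stub: grab the pixels a human operator would see

def ask_llm(goal: str, image: bytes, history: list[dict[str, Any]]) -> dict[str, Any]:
    return {"type": "done"}          # stub: the model would return click/type/scroll/done

def perform(action: dict[str, Any]) -> None:
    pass                             # stub: drive the UI via mouse, keyboard or browser

def run_task(goal: str, max_steps: int = 20) -> None:
    history: list[dict[str, Any]] = []
    for _ in range(max_steps):
        action = ask_llm(goal, capture_screen(), history)   # observe + decide
        if action["type"] == "done":                        # agent declares the workflow finished
            break
        perform(action)                                     # act through the human interface
        history.append(action)

run_task("Create a purchase order for 10 laptops")
```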

We are going to invest part of this year to dive deeper into this problem:

  • Import visual automation cases into the LLM benchmark under an RPA column

  • Work on a dedicated Operator/Agent benchmark together with our industry peers.

  • Ultimately, set up and run an Enterprise RPA Challenge.

Our partners and customers are quite interested in the possibilities of AI automation in modern enterprise software: SAP, Salesforce, ServiceNow. We are going to explore this field further, share our findings with the community and, perhaps, push the state of the art forward together here.

Stay up to date with the TIMETOACT GROUP newsletter!