September has been exciting! In this edition of the TIMETOACT GROUP LLM Benchmark we’ll talk about pushing the state of the art.
We have a lot of ground to cover. Let’s proceed.
The benchmarks evaluate the models in terms of their suitability for digital product development. The higher the score, the better.
☁️ - Cloud models with proprietary license
✅ - Open source models that can be run locally without restrictions
🦙 - Local models with Llama license
Model | Code | CRM | Docs | Integrate | Marketing | Reason | Final | Cost | Speed |
---|---|---|---|---|---|---|---|---|---|
GPT o1-preview v1/2024-09-12 ☁️ | 95 | 92 | 94 | 96 | 88 | 87 | 92 | 52.32 € | 0.08 rps |
GPT o1-mini v1/2024-09-12 ☁️ | 93 | 96 | 94 | 85 | 82 | 87 | 90 | 8.15 € | 0.16 rps |
Google Gemini 1.5 Pro v2 ☁️ | 86 | 97 | 94 | 100 | 78 | 74 | 88 | 1.00 € | 1.18 rps |
GPT-4o v1/2024-05-13 ☁️ | 90 | 96 | 100 | 89 | 78 | 74 | 88 | 1.21 € | 1.44 rps |
GPT-4o v3/dyn-2024-08-13 ☁️ | 90 | 97 | 100 | 81 | 79 | 78 | 88 | 1.22 € | 1.21 rps |
GPT-4 Turbo v5/2024-04-09 ☁️ | 86 | 99 | 98 | 100 | 88 | 43 | 86 | 2.45 € | 0.84 rps |
GPT-4o v2/2024-08-06 ☁️ | 90 | 84 | 97 | 92 | 82 | 59 | 84 | 0.63 € | 1.49 rps |
Google Gemini 1.5 Pro 0801 ☁️ | 84 | 92 | 79 | 100 | 70 | 74 | 83 | 0.90 € | 0.83 rps |
Qwen 2.5 72B Instruct ⚠️ | 79 | 92 | 94 | 100 | 71 | 59 | 83 | 0.10 € | 0.66 rps |
Llama 3.1 405B Hermes 3🦙 | 68 | 93 | 89 | 100 | 88 | 53 | 82 | 0.54 € | 0.49 rps |
GPT-4 v1/0314 ☁️ | 90 | 88 | 98 | 70 | 88 | 45 | 80 | 7.04 € | 1.31 rps |
GPT-4 v2/0613 ☁️ | 90 | 83 | 95 | 70 | 88 | 45 | 78 | 7.04 € | 2.16 rps |
Claude 3 Opus ☁️ | 69 | 88 | 100 | 78 | 76 | 58 | 78 | 4.69 € | 0.41 rps |
Claude 3.5 Sonnet ☁️ | 72 | 83 | 89 | 85 | 80 | 58 | 78 | 0.94 € | 0.09 rps |
GPT-4 Turbo v4/0125-preview ☁️ | 66 | 97 | 100 | 85 | 75 | 43 | 78 | 2.45 € | 0.84 rps |
GPT-4o Mini ☁️ | 63 | 87 | 80 | 70 | 100 | 65 | 78 | 0.04 € | 1.46 rps |
Meta Llama3.1 405B Instruct🦙 | 81 | 93 | 92 | 70 | 75 | 48 | 76 | 2.39 € | 1.16 rps |
GPT-4 Turbo v3/1106-preview ☁️ | 66 | 75 | 98 | 70 | 88 | 60 | 76 | 2.46 € | 0.68 rps |
DeepSeek v2.5 236B ⚠️ | 57 | 80 | 91 | 78 | 88 | 57 | 75 | 0.03 € | 0.42 rps |
Google Gemini 1.5 Flash v2 ☁️ | 64 | 96 | 89 | 75 | 81 | 44 | 75 | 0.06 € | 2.01 rps |
Google Gemini 1.5 Pro 0409 ☁️ | 68 | 97 | 96 | 85 | 75 | 26 | 74 | 0.95 € | 0.59 rps |
Meta Llama 3.1 70B Instruct f16🦙 | 74 | 89 | 90 | 70 | 75 | 48 | 74 | 1.79 € | 0.90 rps |
GPT-3.5 v2/0613 ☁️ | 68 | 81 | 73 | 81 | 81 | 50 | 72 | 0.34 € | 1.46 rps |
Meta Llama 3 70B Instruct🦙 | 81 | 83 | 84 | 60 | 81 | 45 | 72 | 0.06 € | 0.85 rps |
Mistral Large 123B v2/2407 ☁️ | 68 | 79 | 68 | 75 | 75 | 70 | 72 | 0.86 € | 1.02 rps |
Google Gemini 1.5 Pro 0514 ☁️ | 73 | 96 | 79 | 100 | 25 | 60 | 72 | 1.07 € | 0.92 rps |
Google Gemini 1.5 Flash 0514 ☁️ | 32 | 97 | 100 | 75 | 72 | 52 | 71 | 0.06 € | 1.77 rps |
Google Gemini 1.0 Pro ☁️ | 66 | 86 | 83 | 78 | 88 | 28 | 71 | 0.37 € | 1.36 rps |
Meta Llama 3.2 90B Vision🦙 | 74 | 84 | 87 | 78 | 71 | 32 | 71 | 0.23 € | 1.10 rps |
GPT-3.5 v3/1106 ☁️ | 68 | 70 | 71 | 78 | 78 | 58 | 70 | 0.24 € | 2.33 rps |
GPT-3.5 v4/0125 ☁️ | 63 | 87 | 71 | 78 | 78 | 43 | 70 | 0.12 € | 1.43 rps |
Qwen1.5 32B Chat f16 ⚠️ | 70 | 90 | 82 | 78 | 78 | 20 | 69 | 0.97 € | 1.66 rps |
Cohere Command R+ ☁️ | 63 | 80 | 76 | 70 | 70 | 58 | 69 | 0.83 € | 1.90 rps |
Gemma 2 27B IT ⚠️ | 61 | 72 | 87 | 70 | 89 | 32 | 69 | 0.07 € | 0.90 rps |
Mistral 7B OpenChat-3.5 v3 0106 f16 ✅ | 68 | 87 | 67 | 70 | 88 | 25 | 67 | 0.32 € | 3.39 rps |
Gemma 7B OpenChat-3.5 v3 0106 f16 ✅ | 63 | 67 | 84 | 60 | 81 | 46 | 67 | 0.21 € | 5.09 rps |
Meta Llama 3 8B Instruct f16🦙 | 79 | 62 | 68 | 70 | 80 | 41 | 67 | 0.32 € | 3.33 rps |
Mistral 7B OpenChat-3.5 v2 1210 f16 ✅ | 63 | 73 | 72 | 69 | 88 | 30 | 66 | 0.32 € | 3.40 rps |
Mistral 7B OpenChat-3.5 v1 f16 ✅ | 58 | 72 | 72 | 70 | 88 | 33 | 65 | 0.49 € | 2.20 rps |
GPT-3.5-instruct 0914 ☁️ | 47 | 92 | 69 | 62 | 88 | 33 | 65 | 0.35 € | 2.15 rps |
GPT-3.5 v1/0301 ☁️ | 55 | 82 | 69 | 78 | 82 | 26 | 65 | 0.35 € | 4.12 rps |
Llama 3 8B OpenChat-3.6 20240522 f16 ✅ | 76 | 51 | 76 | 60 | 88 | 38 | 65 | 0.28 € | 3.79 rps |
Mistral Nemo 12B v1/2407 ☁️ | 54 | 58 | 51 | 100 | 75 | 49 | 64 | 0.03 € | 1.22 rps |
Meta Llama 3.2 11B Vision🦙 | 70 | 71 | 65 | 70 | 71 | 36 | 64 | 0.04 € | 1.49 rps |
Starling 7B-alpha f16 ⚠️ | 58 | 66 | 67 | 70 | 88 | 34 | 64 | 0.58 € | 1.85 rps |
Llama 3 8B Hermes 2 Theta🦙 | 61 | 73 | 74 | 70 | 85 | 16 | 63 | 0.05 € | 0.55 rps |
Yi 1.5 34B Chat f16 ⚠️ | 47 | 78 | 70 | 70 | 86 | 26 | 63 | 1.18 € | 1.37 rps |
Claude 3 Haiku ☁️ | 64 | 69 | 64 | 70 | 75 | 35 | 63 | 0.08 € | 0.52 rps |
Meta Llama 3.1 8B Instruct f16🦙 | 57 | 74 | 62 | 70 | 74 | 32 | 61 | 0.45 € | 2.41 rps |
Qwen2 7B Instruct f32 ⚠️ | 50 | 81 | 81 | 60 | 66 | 31 | 61 | 0.46 € | 2.36 rps |
Mistral Small v3/2409 ☁️ | 43 | 75 | 71 | 75 | 75 | 26 | 61 | 0.06 € | 0.81 rps |
Claude 3 Sonnet ☁️ | 72 | 41 | 74 | 70 | 78 | 28 | 61 | 0.95 € | 0.85 rps |
Mixtral 8x22B API (Instruct) ☁️ | 53 | 62 | 62 | 100 | 75 | 7 | 60 | 0.17 € | 3.12 rps |
Mistral Pixtral 12B ✅ | 53 | 69 | 73 | 60 | 64 | 40 | 60 | 0.03 € | 0.83 rps |
Codestral Mamba 7B v1 ✅ | 53 | 66 | 51 | 100 | 71 | 17 | 60 | 0.30 € | 2.82 rps |
Anthropic Claude Instant v1.2 ☁️ | 58 | 75 | 65 | 75 | 65 | 16 | 59 | 2.10 € | 1.49 rps |
Cohere Command R ☁️ | 45 | 66 | 57 | 70 | 84 | 27 | 58 | 0.13 € | 2.50 rps |
Anthropic Claude v2.0 ☁️ | 63 | 52 | 55 | 60 | 84 | 34 | 58 | 2.19 € | 0.40 rps |
Qwen1.5 7B Chat f16 ⚠️ | 56 | 81 | 60 | 50 | 60 | 36 | 57 | 0.29 € | 3.76 rps |
Mistral Large v1/2402 ☁️ | 37 | 49 | 70 | 78 | 84 | 25 | 57 | 0.58 € | 2.11 rps |
Microsoft WizardLM 2 8x22B ⚠️ | 48 | 76 | 79 | 50 | 62 | 22 | 56 | 0.13 € | 0.70 rps |
Qwen1.5 14B Chat f16 ⚠️ | 50 | 58 | 51 | 70 | 84 | 22 | 56 | 0.36 € | 3.03 rps |
Anthropic Claude v2.1 ☁️ | 29 | 58 | 59 | 78 | 75 | 32 | 55 | 2.25 € | 0.35 rps |
Llama2 13B Vicuna-1.5 f16🦙 | 50 | 37 | 55 | 60 | 82 | 37 | 53 | 0.99 € | 1.09 rps |
Mistral 7B Instruct v0.1 f16 ☁️ | 34 | 71 | 69 | 59 | 62 | 23 | 53 | 0.75 € | 1.43 rps |
Mistral 7B OpenOrca f16 ☁️ | 54 | 57 | 76 | 25 | 78 | 27 | 53 | 0.41 € | 2.65 rps |
Meta Llama 3.2 3B🦙 | 52 | 71 | 66 | 70 | 44 | 14 | 53 | 0.01 € | 1.25 rps |
Google Recurrent Gemma 9B IT f16 ⚠️ | 58 | 27 | 71 | 60 | 56 | 23 | 49 | 0.89 € | 1.21 rps |
Codestral 22B v1 ✅ | 38 | 47 | 44 | 78 | 66 | 13 | 48 | 0.06 € | 4.03 rps |
Llama2 13B Hermes f16🦙 | 50 | 24 | 37 | 74 | 60 | 42 | 48 | 1.00 € | 1.07 rps |
IBM Granite 34B Code Instruct f16 ☁️ | 63 | 49 | 34 | 70 | 57 | 7 | 47 | 1.07 € | 1.51 rps |
Mistral Small v2/2402 ☁️ | 33 | 42 | 45 | 92 | 56 | 8 | 46 | 0.06 € | 3.21 rps |
DBRX 132B Instruct ⚠️ | 43 | 39 | 43 | 77 | 59 | 10 | 45 | 0.26 € | 1.31 rps |
Mistral Medium v1/2312 ☁️ | 41 | 43 | 44 | 61 | 62 | 12 | 44 | 0.81 € | 0.35 rps |
Meta Llama 3.2 1B🦙 | 32 | 40 | 33 | 40 | 68 | 51 | 44 | 0.02 € | 1.69 rps |
Llama2 13B Puffin f16🦙 | 37 | 15 | 44 | 70 | 56 | 39 | 43 | 4.70 € | 0.23 rps |
Mistral Small v1/2312 (Mixtral) ☁️ | 10 | 67 | 63 | 52 | 56 | 8 | 43 | 0.06 € | 2.21 rps |
Microsoft WizardLM 2 7B ⚠️ | 53 | 34 | 42 | 59 | 53 | 13 | 42 | 0.02 € | 0.89 rps |
Mistral Tiny v1/2312 (7B Instruct v0.2) ☁️ | 22 | 47 | 59 | 38 | 62 | 8 | 39 | 0.05 € | 2.39 rps |
Gemma 2 9B IT ⚠️ | 45 | 25 | 47 | 34 | 68 | 13 | 38 | 0.02 € | 0.88 rps |
Meta Llama2 13B chat f16🦙 | 22 | 38 | 17 | 60 | 75 | 6 | 36 | 0.75 € | 1.44 rps |
Mistral 7B Zephyr-β f16 ✅ | 37 | 34 | 46 | 59 | 29 | 4 | 35 | 0.46 € | 2.34 rps |
Meta Llama2 7B chat f16🦙 | 22 | 33 | 20 | 60 | 50 | 18 | 34 | 0.56 € | 1.93 rps |
Mistral 7B Notus-v1 f16 ⚠️ | 10 | 54 | 25 | 52 | 48 | 4 | 32 | 0.75 € | 1.43 rps |
Orca 2 13B f16 ⚠️ | 18 | 22 | 32 | 22 | 67 | 20 | 30 | 0.95 € | 1.14 rps |
Mistral 7B v0.1 f16 ☁️ | 0 | 9 | 48 | 53 | 52 | 12 | 29 | 0.87 € | 1.23 rps |
Mistral 7B Instruct v0.2 f16 ☁️ | 11 | 30 | 54 | 12 | 58 | 8 | 29 | 0.96 € | 1.12 rps |
Google Gemma 2B IT f16 ⚠️ | 33 | 28 | 16 | 57 | 15 | 20 | 28 | 0.30 € | 3.54 rps |
Microsoft Phi 3 Medium 4K Instruct 14B f16 ⚠️ | 5 | 34 | 30 | 11 | 47 | 8 | 22 | 0.82 € | 1.32 rps |
Orca 2 7B f16 ⚠️ | 22 | 0 | 26 | 20 | 52 | 4 | 21 | 0.78 € | 1.38 rps |
Google Gemma 7B IT f16 ⚠️ | 0 | 0 | 0 | 9 | 62 | 0 | 12 | 0.99 € | 1.08 rps |
Meta Llama2 7B f16🦙 | 0 | 5 | 22 | 3 | 28 | 2 | 10 | 0.95 € | 1.13 rps |
Yi 1.5 9B Chat f16 ⚠️ | 0 | 4 | 29 | 9 | 0 | 8 | 8 | 1.41 € | 0.76 rps |
Here is what the columns mean:

Code - Can the model generate code and help with programming?
CRM - How well does the model support work with product catalogs and marketplaces?
Docs - How well can the model work with large documents and knowledge bases?
Integrate - Can the model easily interact with external APIs, services and plugins?
Marketing - How well can the model support marketing activities, e.g. brainstorming, idea generation and text generation?
Reason - How well can the model reason and draw conclusions in a given context?
Cost - The estimated cost of running the workload. For cloud-based models, we calculate the cost according to the pricing. For on-premises models, we estimate the cost based on GPU requirements for each model, GPU rental cost, model speed, and operational overhead (see the sketch after this list).
Speed - The estimated speed of the model in requests per second (without batching). The higher the speed, the better.
OpenAI has released a radically new type of model, o1-preview, followed by o1-mini. These models differ from all other LLMs out there: they run their own chain-of-thought routine for each request. This allows the model to decompose complex problems into smaller tasks and really think the answers through.
That approach shines, for example, in complex full-stack software engineering challenges. Compared to the “ordinary” GPT-4, o1 feels like an experienced mid-level software engineer that requires surprisingly little hand-holding.
There is one downside to this “chain of thought under the hood” process: o1 produces high-quality results, but they take more time and cost a lot more. Just look at the comparative pricing in the Cost column.
We look forward to seeing other LLM vendors take note of this trick and release their own LLMs with tuned chain-of-thought routines.
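To make the idea more tangible, here is a naive sketch of what an explicit decompose-then-solve loop looks like when built on top of a regular chat model. This is purely an illustration of the general pattern, not OpenAI's internal o1 routine; the model name and prompts are placeholders:

```python
# Naive "decompose, then solve" loop built on a regular chat model.
# Illustration of the general pattern only - NOT OpenAI's internal o1 routine.
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder: any ordinary chat model
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def solve_with_decomposition(task: str) -> str:
    # 1. Let the model break the problem into smaller sub-tasks.
    plan = ask(f"Break this task into a short numbered list of sub-tasks:\n{task}")
    # 2. Work through the sub-tasks one by one, carrying notes forward.
    notes = ""
    for step in (line for line in plan.splitlines() if line.strip()):
        notes += "\n" + ask(f"Task: {task}\nNotes so far:{notes}\nNow do: {step}")
    # 3. Produce the final answer from the accumulated intermediate reasoning.
    return ask(f"Task: {task}\nReasoning notes:{notes}\nWrite the final answer.")
```

Every intermediate step adds extra tokens and extra round trips, which is exactly why this style of processing is slower and more expensive than a single completion.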
While we are on the topic of top results and cloud vendors, there is another new model in the TOP-3. Google has managed to catch up with the rate of progress and release a highly competitive model - Gemini 1.5 Pro v2 (002).
This model systematically improves on the previous version across multiple categories: Code, CRM, Docs, and Marketing. It is also the cheapest model in the TOP-6 of our benchmark.
Practitioners already praise this model for its great multi-lingual skills, while Google Cloud users are happy to have a top-tier LLM available in their cloud.
For a long time, it felt like OpenAI and Anthropic were the only companies that could really push the state of the art in top-tier LLMs. It also felt like large, mammoth companies were just too slow and old-school to release something worthy. Google was eventually able to prove this wrong.
This is how the progress of Google models looks over time:
Now it doesn’t feel out of the ordinary to expect models of similar quality from Amazon or Microsoft. Perhaps this will spur a new round of competition, with further price drops and further quality improvements.
Enough with the cloud vendors. Let’s talk about local models now.
(Local models are the models that you can download and run on your own hardware.)
The recently released Qwen 2.5 72B Instruct is surprisingly good. It is the first local model to beat Claude 3.5 Sonnet on our business tasks, and it costs less than the other models at the top of the table.
Starting from this benchmark, we’ll use OpenRouter pricing as the base price for locally-capable LLMs. This allows us to estimate workload costs based on the real-world market. It also factors in any meaningful performance optimisations that LLM vendors apply to improve their margins.
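As a rough sketch of how such a market-based estimate works - the per-token prices and token counts below are placeholders, not actual OpenRouter rates:

```python
# Workload cost estimate from market-based (OpenRouter-style) per-token pricing.
# Prices and token counts are illustrative placeholders, not actual rates.

PRICE_PER_MTOK_EUR = {            # (input, output) price per million tokens
    "qwen-2.5-72b-instruct": (0.35, 0.40),
}

def workload_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    price_in, price_out = PRICE_PER_MTOK_EUR[model]
    return (prompt_tokens * price_in + completion_tokens * price_out) / 1_000_000

# Example: a workload with 200k prompt tokens and 50k completion tokens
print(f"{workload_cost('qwen-2.5-72b-instruct', 200_000, 50_000):.2f} EUR")  # 0.09 EUR
```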
Qwen 2.5 72B follows instructions diligently (compared to Sonnet 3.5 or older GPT-4 versions) and has a decent Reason score. This Chinese model still has gaps in the Code and Marketing categories.
DeepSeek 2.5 didn’t perform nearly as well in our product benchmarks, despite its huge size of 236B parameters. It lands roughly at the level of older GPT-4 and Gemini 1.5 Pro versions.
This is actually outstanding news: more and more local models are reaching the level of GPT-4 Turbo intelligence. And the fact that the smaller Qwen 72B model has beaten it by a big margin is worth a separate celebration 🚀
We think this won’t be the last celebration of this kind this year.
Meta has just released the new Llama 3.2 model range.
The larger models are now multi-modal. This came at the cost of cognitive capabilities in text-driven business tasks, compared to the previous model versions. Llama 3.2 is still far from the top.
If we look at the table:

- Llama 3.2 90B Vision performs at the level of Llama 3/3.1 70B, but with a worse Reason score.
- Llama 3.2 11B Vision performs at the level of the previous 8B, but with a worse Reason score.
This doesn’t make the new models worse - they have more capabilities now. Our benchmark currently tests only text-based business tasks. Vision tasks will be added in v2.
Having said that, there is a small nuance that really makes this Llama 3.2 release outstanding. The size of that nuance is 1B and 3B - the sizes of the new tiny Llama 3.2 models designed to run in resource-constrained environments and on the edge (optimised for ARM processors, Qualcomm and MediaTek hardware). Despite the resource constraints, these models feature a 128k-token context and surprisingly high response quality in business tasks.
For example, do you remember the huge DBRX 132B Instruct model that claimed to be “a new state-of-the-art for established open LLMs”? Well, the Llama 3.2 1B model catches up with it in our benchmark, and the 3B beats it by a big margin. Just look at the neighbours of these models in the table:
Keep in mind that these benchmark results are for the base Llama versions. Customised fine-tunes tend to improve overall scores even further.
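To give a feel for how lightweight these tiny models are to run, here is a minimal local-inference sketch using the Hugging Face transformers pipeline (assuming the meta-llama/Llama-3.2-1B-Instruct repository id and access to the gated weights):

```python
# Minimal local inference sketch for the smallest Llama 3.2 model.
# Assumes access to the gated meta-llama weights on Hugging Face.
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.2-1B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",   # the 1B model fits on a small GPU or even runs on CPU
)

messages = [
    {"role": "user", "content": "Summarise this product review in one sentence: ..."}
]
output = generator(messages, max_new_tokens=128)
print(output[0]["generated_text"][-1]["content"])
```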
As you can see, progress doesn’t stand still. We expect this trend to continue, with more and more companies managing to pack better cognitive capability into smaller models.
To visualise this trend, we’ve plotted all releases of locally-capable models on a timeline and grouped them by the rough hardware requirements for running them. For each group we’ve computed the current trend (linear regression).
Note: this grouping is very rough. We are using the most frequent hardware combinations that we’ve seen among our customers and within our AI research. We also assume that inference runs under fp16, without further quantisation, and with enough spare VRAM to keep some context around.
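A minimal sketch of that computation - the hardware groups, release dates and scores below are made-up examples, not our actual data:

```python
# Per-hardware-group trend of benchmark scores over time (illustrative data only).
from datetime import date
from scipy.stats import linregress

# (hardware group, release date, final benchmark score) - placeholder values
releases = [
    ("single 24GB GPU", date(2023, 10, 1), 53),
    ("single 24GB GPU", date(2024, 4, 1), 61),
    ("single 24GB GPU", date(2024, 9, 25), 64),
    ("8x 80GB GPUs", date(2024, 4, 18), 72),
    ("8x 80GB GPUs", date(2024, 7, 23), 76),
]

for group in sorted({g for g, _, _ in releases}):
    xs, ys = zip(*[(d.toordinal(), score) for g, d, score in releases if g == group])
    trend = linregress(xs, ys)
    # slope is "score points per day"; scale to points per year for readability
    print(f"{group}: +{trend.slope * 365:.1f} benchmark points per year")
```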
Here are a few observations.
These observations are obvious - you don’t need a chart to figure them out. However, visualisations make the rate of progress easier to grasp. It can then be communicated to customers and accounted for in long-term plans.
Discover the transformative power of the best LLMs and revolutionize your digital products with AI! Stay future-focused, boost efficiency, and gain a clear competitive edge. We help you elevate your business value to the next level.
Martin Warnung
martin.warnung@timetoact.at