The best Large Language Models of December 2024

The TIMETOACT GROUP LLM Benchmarks highlight the most powerful AI language models for digital product development. Discover which large language models performed best in December 2024.

We’ve been benchmarking LLMs on business automation tasks for a year and a half now. It feels only appropriate that at the end of 2024, right when we are planning Benchmark v2, you get to see our old benchmarks beaten. You can probably already guess the name of the winning model. But let’s not get ahead of ourselves.

  • Benchmarking Llama 3.3, Amazon Nova - nothing outstanding

  • Google Gemini Exp 1206, Gemini 2.0 Flash Experimental - TOP 10

  • DeepSeek v3

  • Manual benchmark of OpenAI o1 pro - Gold Standard

  • Base o1 (medium reasoning effort) - 3rd place

  • Our thoughts about recently announced o3

  • Our predictions for the 2025 landscape of LLMs in business integration

  • Enterprise RAG Challenge will take place on February 27th

LLM Benchmarks | December 2024

The benchmarks evaluate the models in terms of their suitability for digital product development. The higher the score, the better.

☁️ - Cloud models with proprietary license
✅ - Open source models that can be run locally without restrictions
🦙 - Local models with Llama license

Code - Can the model generate code and help with programming?

Cost - The estimated cost of running the workload. For cloud-based models, we calculate the cost according to the pricing. For on-premises models, we estimate the cost based on GPU requirements for each model, GPU rental cost, model speed, and operational overhead.

CRM - How well does the model support work with product catalogs and marketplaces?

Docs - How well can the model work with large documents and knowledge bases?

Integrate - Can the model easily interact with external APIs, services and plugins?

Marketing - How well can the model support marketing activities, e.g. brainstorming, idea generation and text generation?

Reason - How well can the model reason and draw conclusions in a given context?

Speed - The estimated speed of the model in requests per second (without batching). The higher the speed, the better.

| Model | Code | CRM | Docs | Integrate | Marketing | Reason | Final | Cost | Speed |
|---|---|---|---|---|---|---|---|---|---|
| 1. GPT o1 pro (manual) ☁️ | 100 | 100 | 97 | 100 | 95 | 87 | 97 | 200.00 € | 1.00 rps |
| 2. GPT o1-preview v1/2024-09-12 ☁️ | 95 | 92 | 94 | 95 | 88 | 87 | 92 | 52.32 € | 0.08 rps |
| 3. GPT o1 v1/2024-12-17 ☁️ | 100 | 95 | 94 | 91 | 82 | 83 | 91 | 30.63 € | 0.17 rps |
| 4. GPT o1-mini v1/2024-09-12 ☁️ | 93 | 96 | 94 | 83 | 82 | 87 | 89 | 8.15 € | 0.16 rps |
| 5. GPT-4o v3/2024-11-20 ☁️ | 86 | 97 | 94 | 95 | 88 | 72 | 89 | 0.63 € | 1.14 rps |
| 6. GPT-4o v1/2024-05-13 ☁️ | 90 | 96 | 100 | 92 | 78 | 74 | 88 | 1.21 € | 1.44 rps |
| 7. Google Gemini 1.5 Pro v2 ☁️ | 86 | 97 | 94 | 99 | 78 | 74 | 88 | 1.00 € | 1.18 rps |
| 8. X-AI Grok 2 v2/1212 ⚠️ | 66 | 95 | 97 | 97 | 88 | 78 | 87 | 0.58 € | 0.99 rps |
| 9. GPT-4 Turbo v5/2024-04-09 ☁️ | 86 | 99 | 98 | 96 | 88 | 43 | 85 | 2.45 € | 0.84 rps |
| 10. Google Gemini 2.0 Flash Exp ☁️ | 63 | 96 | 100 | 100 | 82 | 62 | 84 | 0.03 € | 0.85 rps |
| 11. Google Gemini Exp 1121 ☁️ | 70 | 97 | 97 | 95 | 72 | 72 | 84 | 0.89 € | 0.49 rps |
| 12. GPT-4o v2/2024-08-06 ☁️ | 90 | 84 | 97 | 86 | 82 | 59 | 83 | 0.63 € | 1.49 rps |
| 13. Google Gemini 1.5 Pro 0801 ☁️ | 84 | 92 | 79 | 100 | 70 | 74 | 83 | 0.90 € | 0.83 rps |
| 14. Qwen 2.5 72B Instruct ⚠️ | 79 | 92 | 94 | 97 | 71 | 59 | 82 | 0.10 € | 0.66 rps |
| 15. Llama 3.1 405B Hermes 3 🦙 | 68 | 93 | 89 | 98 | 88 | 53 | 81 | 0.54 € | 0.49 rps |
| 16. Claude 3.5 Sonnet v2 ☁️ | 82 | 97 | 93 | 84 | 71 | 57 | 81 | 0.95 € | 0.09 rps |
| 17. GPT-4 v1/0314 ☁️ | 90 | 88 | 98 | 73 | 88 | 45 | 80 | 7.04 € | 1.31 rps |
| 18. X-AI Grok 2 v1/1012 ⚠️ | 63 | 93 | 87 | 90 | 88 | 58 | 80 | 1.03 € | 0.31 rps |
| 19. GPT-4 v2/0613 ☁️ | 90 | 83 | 95 | 73 | 88 | 45 | 79 | 7.04 € | 2.16 rps |
| 20. DeepSeek v3 671B ⚠️ | 62 | 95 | 97 | 85 | 75 | 55 | 78 | 0.03 € | 0.49 rps |
| 21. GPT-4o Mini ☁️ | 63 | 87 | 80 | 73 | 100 | 65 | 78 | 0.04 € | 1.46 rps |
| 22. Claude 3.5 Sonnet v1 ☁️ | 72 | 83 | 89 | 87 | 80 | 58 | 78 | 0.94 € | 0.09 rps |
| 23. Claude 3 Opus ☁️ | 69 | 88 | 100 | 74 | 76 | 58 | 77 | 4.69 € | 0.41 rps |
| 24. Meta Llama3.1 405B Instruct 🦙 | 81 | 93 | 92 | 75 | 75 | 48 | 77 | 2.39 € | 1.16 rps |
| 25. GPT-4 Turbo v4/0125-preview ☁️ | 66 | 97 | 100 | 83 | 75 | 43 | 77 | 2.45 € | 0.84 rps |
| 26. Google LearnLM 1.5 Pro Experimental ⚠️ | 48 | 97 | 85 | 96 | 64 | 72 | 77 | 0.31 € | 0.83 rps |
| 27. GPT-4 Turbo v3/1106-preview ☁️ | 66 | 75 | 98 | 73 | 88 | 60 | 76 | 2.46 € | 0.68 rps |
| 28. Google Gemini Exp 1206 ☁️ | 52 | 100 | 85 | 77 | 75 | 69 | 76 | 0.88 € | 0.16 rps |
| 29. Qwen 2.5 32B Coder Instruct ⚠️ | 43 | 94 | 98 | 98 | 76 | 46 | 76 | 0.05 € | 0.82 rps |
| 30. DeepSeek v2.5 236B ⚠️ | 57 | 80 | 91 | 80 | 88 | 57 | 75 | 0.03 € | 0.42 rps |
| 31. Meta Llama 3.1 70B Instruct f16 🦙 | 74 | 89 | 90 | 75 | 75 | 48 | 75 | 1.79 € | 0.90 rps |
| 32. Google Gemini 1.5 Flash v2 ☁️ | 64 | 96 | 89 | 76 | 81 | 44 | 75 | 0.06 € | 2.01 rps |
| 33. Google Gemini 1.5 Pro 0409 ☁️ | 68 | 97 | 96 | 80 | 75 | 26 | 74 | 0.95 € | 0.59 rps |
| 34. Meta Llama 3 70B Instruct 🦙 | 81 | 83 | 84 | 67 | 81 | 45 | 73 | 0.06 € | 0.85 rps |
| 35. GPT-3.5 v2/0613 ☁️ | 68 | 81 | 73 | 87 | 81 | 50 | 73 | 0.34 € | 1.46 rps |
| 36. Amazon Nova Lite ⚠️ | 67 | 78 | 74 | 94 | 62 | 62 | 73 | 0.02 € | 2.19 rps |
| 37. Mistral Large 123B v2/2407 ☁️ | 68 | 79 | 68 | 75 | 75 | 70 | 72 | 0.57 € | 1.02 rps |
| 38. Google Gemini Flash 1.5 8B ☁️ | 70 | 93 | 78 | 67 | 76 | 48 | 72 | 0.01 € | 1.19 rps |
| 39. Google Gemini 1.5 Pro 0514 ☁️ | 73 | 96 | 79 | 100 | 25 | 60 | 72 | 1.07 € | 0.92 rps |
| 40. Google Gemini 1.5 Flash 0514 ☁️ | 32 | 97 | 100 | 76 | 72 | 52 | 72 | 0.06 € | 1.77 rps |
| 41. Google Gemini 1.0 Pro ☁️ | 66 | 86 | 83 | 79 | 88 | 28 | 71 | 0.37 € | 1.36 rps |
| 42. Meta Llama 3.2 90B Vision 🦙 | 74 | 84 | 87 | 77 | 71 | 32 | 71 | 0.23 € | 1.10 rps |
| 43. GPT-3.5 v3/1106 ☁️ | 68 | 70 | 71 | 81 | 78 | 58 | 71 | 0.24 € | 2.33 rps |
| 44. Claude 3.5 Haiku ☁️ | 52 | 80 | 72 | 75 | 75 | 68 | 70 | 0.32 € | 1.24 rps |
| 45. Meta Llama 3.3 70B Instruct 🦙 | 74 | 78 | 74 | 77 | 71 | 46 | 70 | 0.10 € | 0.71 rps |
| 46. GPT-3.5 v4/0125 ☁️ | 63 | 87 | 71 | 77 | 78 | 43 | 70 | 0.12 € | 1.43 rps |
| 47. Cohere Command R+ ☁️ | 63 | 80 | 76 | 72 | 70 | 58 | 70 | 0.83 € | 1.90 rps |
| 48. Mistral Large 123B v3/2411 ☁️ | 68 | 75 | 64 | 76 | 82 | 51 | 70 | 0.56 € | 0.66 rps |
| 49. Qwen1.5 32B Chat f16 ⚠️ | 70 | 90 | 82 | 76 | 78 | 20 | 69 | 0.97 € | 1.66 rps |
| 50. Gemma 2 27B IT ⚠️ | 61 | 72 | 87 | 74 | 89 | 32 | 69 | 0.07 € | 0.90 rps |
| 51. Mistral 7B OpenChat-3.5 v3 0106 f16 ✅ | 68 | 87 | 67 | 74 | 88 | 25 | 68 | 0.32 € | 3.39 rps |
| 52. Meta Llama 3 8B Instruct f16 🦙 | 79 | 62 | 68 | 70 | 80 | 41 | 67 | 0.32 € | 3.33 rps |
| 53. Gemma 7B OpenChat-3.5 v3 0106 f16 ✅ | 63 | 67 | 84 | 58 | 81 | 46 | 67 | 0.21 € | 5.09 rps |
| 54. GPT-3.5-instruct 0914 ☁️ | 47 | 92 | 69 | 69 | 88 | 33 | 66 | 0.35 € | 2.15 rps |
| 55. Amazon Nova Pro ⚠️ | 64 | 78 | 82 | 79 | 52 | 41 | 66 | 0.22 € | 1.34 rps |
| 56. GPT-3.5 v1/0301 ☁️ | 55 | 82 | 69 | 81 | 82 | 26 | 66 | 0.35 € | 4.12 rps |
| 57. Llama 3 8B OpenChat-3.6 20240522 f16 ✅ | 76 | 51 | 76 | 65 | 88 | 38 | 66 | 0.28 € | 3.79 rps |
| 58. Mistral 7B OpenChat-3.5 v1 f16 ✅ | 58 | 72 | 72 | 71 | 88 | 33 | 66 | 0.49 € | 2.20 rps |
| 59. Mistral 7B OpenChat-3.5 v2 1210 f16 ✅ | 63 | 73 | 72 | 66 | 88 | 30 | 65 | 0.32 € | 3.40 rps |
| 60. Qwen 2.5 7B Instruct ⚠️ | 48 | 77 | 80 | 68 | 69 | 47 | 65 | 0.07 € | 1.25 rps |
| 61. Starling 7B-alpha f16 ⚠️ | 58 | 66 | 67 | 73 | 88 | 34 | 64 | 0.58 € | 1.85 rps |
| 62. Mistral Nemo 12B v1/2407 ☁️ | 54 | 58 | 51 | 99 | 75 | 49 | 64 | 0.03 € | 1.22 rps |
| 63. Meta Llama 3.2 11B Vision 🦙 | 70 | 71 | 65 | 70 | 71 | 36 | 64 | 0.04 € | 1.49 rps |
| 64. Llama 3 8B Hermes 2 Theta 🦙 | 61 | 73 | 74 | 74 | 85 | 16 | 64 | 0.05 € | 0.55 rps |
| 65. Claude 3 Haiku ☁️ | 64 | 69 | 64 | 75 | 75 | 35 | 64 | 0.08 € | 0.52 rps |
| 66. Yi 1.5 34B Chat f16 ⚠️ | 47 | 78 | 70 | 74 | 86 | 26 | 64 | 1.18 € | 1.37 rps |
| 67. Liquid: LFM 40B MoE ⚠️ | 72 | 69 | 65 | 63 | 82 | 24 | 63 | 0.00 € | 1.45 rps |
| 68. Meta Llama 3.1 8B Instruct f16 🦙 | 57 | 74 | 62 | 74 | 74 | 32 | 62 | 0.45 € | 2.41 rps |
| 69. Qwen2 7B Instruct f32 ⚠️ | 50 | 81 | 81 | 61 | 66 | 31 | 62 | 0.46 € | 2.36 rps |
| 70. Claude 3 Sonnet ☁️ | 72 | 41 | 74 | 74 | 78 | 28 | 61 | 0.95 € | 0.85 rps |
| 71. Mistral Small v3/2409 ☁️ | 43 | 75 | 71 | 74 | 75 | 26 | 61 | 0.06 € | 0.81 rps |
| 72. Mistral Pixtral 12B ✅ | 53 | 69 | 73 | 63 | 64 | 40 | 60 | 0.03 € | 0.83 rps |
| 73. Mixtral 8x22B API (Instruct) ☁️ | 53 | 62 | 62 | 97 | 75 | 7 | 59 | 0.17 € | 3.12 rps |
| 74. Anthropic Claude Instant v1.2 ☁️ | 58 | 75 | 65 | 77 | 65 | 16 | 59 | 2.10 € | 1.49 rps |
| 75. Codestral Mamba 7B v1 ✅ | 53 | 66 | 51 | 97 | 71 | 17 | 59 | 0.30 € | 2.82 rps |
| 76. Inflection 3 Productivity ⚠️ | 46 | 59 | 39 | 70 | 79 | 61 | 59 | 0.92 € | 0.17 rps |
| 77. Anthropic Claude v2.0 ☁️ | 63 | 52 | 55 | 67 | 84 | 34 | 59 | 2.19 € | 0.40 rps |
| 78. Cohere Command R ☁️ | 45 | 66 | 57 | 74 | 84 | 27 | 59 | 0.13 € | 2.50 rps |
| 79. Amazon Nova Micro ⚠️ | 58 | 68 | 64 | 71 | 59 | 31 | 59 | 0.01 € | 2.41 rps |
| 80. Qwen1.5 7B Chat f16 ⚠️ | 56 | 81 | 60 | 56 | 60 | 36 | 58 | 0.29 € | 3.76 rps |
| 81. Mistral Large v1/2402 ☁️ | 37 | 49 | 70 | 83 | 84 | 25 | 58 | 0.58 € | 2.11 rps |
| 82. Microsoft WizardLM 2 8x22B ⚠️ | 48 | 76 | 79 | 59 | 62 | 22 | 58 | 0.13 € | 0.70 rps |
| 83. Qwen1.5 14B Chat f16 ⚠️ | 50 | 58 | 51 | 72 | 84 | 22 | 56 | 0.36 € | 3.03 rps |
| 84. MistralAI Ministral 8B ✅ | 56 | 55 | 41 | 82 | 68 | 30 | 55 | 0.02 € | 1.02 rps |
| 85. Anthropic Claude v2.1 ☁️ | 29 | 58 | 59 | 78 | 75 | 32 | 55 | 2.25 € | 0.35 rps |
| 86. Mistral 7B OpenOrca f16 ☁️ | 54 | 57 | 76 | 36 | 78 | 27 | 55 | 0.41 € | 2.65 rps |
| 87. MistralAI Ministral 3B ✅ | 50 | 48 | 39 | 89 | 60 | 41 | 54 | 0.01 € | 1.02 rps |
| 88. Llama2 13B Vicuna-1.5 f16 🦙 | 50 | 37 | 55 | 62 | 82 | 37 | 54 | 0.99 € | 1.09 rps |
| 89. Mistral 7B Instruct v0.1 f16 ☁️ | 34 | 71 | 69 | 63 | 62 | 23 | 54 | 0.75 € | 1.43 rps |
| 90. Meta Llama 3.2 3B 🦙 | 52 | 71 | 66 | 71 | 44 | 14 | 53 | 0.01 € | 1.25 rps |
| 91. Google Recurrent Gemma 9B IT f16 ⚠️ | 58 | 27 | 71 | 64 | 56 | 23 | 50 | 0.89 € | 1.21 rps |
| 92. Codestral 22B v1 ✅ | 38 | 47 | 44 | 84 | 66 | 13 | 49 | 0.06 € | 4.03 rps |
| 93. Qwen: QwQ 32B Preview ⚠️ | 43 | 32 | 74 | 52 | 48 | 40 | 48 | 0.05 € | 0.63 rps |
| 94. Llama2 13B Hermes f16 🦙 | 50 | 24 | 37 | 75 | 60 | 42 | 48 | 1.00 € | 1.07 rps |
| 95. IBM Granite 34B Code Instruct f16 ☁️ | 63 | 49 | 34 | 67 | 57 | 7 | 46 | 1.07 € | 1.51 rps |
| 96. Meta Llama 3.2 1B 🦙 | 32 | 40 | 33 | 53 | 68 | 51 | 46 | 0.02 € | 1.69 rps |
| 97. Mistral Small v2/2402 ☁️ | 33 | 42 | 45 | 88 | 56 | 8 | 46 | 0.06 € | 3.21 rps |
| 98. Mistral Small v1/2312 (Mixtral) ☁️ | 10 | 67 | 63 | 65 | 56 | 8 | 45 | 0.06 € | 2.21 rps |
| 99. DBRX 132B Instruct ⚠️ | 43 | 39 | 43 | 74 | 59 | 10 | 45 | 0.26 € | 1.31 rps |
| 100. NVIDIA Llama 3.1 Nemotron 70B Instruct 🦙 | 68 | 54 | 25 | 72 | 28 | 21 | 45 | 0.09 € | 0.53 rps |
| 101. Mistral Medium v1/2312 ☁️ | 41 | 43 | 44 | 59 | 62 | 12 | 44 | 0.81 € | 0.35 rps |
| 102. Microsoft WizardLM 2 7B ⚠️ | 53 | 34 | 42 | 66 | 53 | 13 | 43 | 0.02 € | 0.89 rps |
| 103. Llama2 13B Puffin f16 🦙 | 37 | 15 | 44 | 67 | 56 | 39 | 43 | 4.70 € | 0.23 rps |
| 104. Mistral Tiny v1/2312 (7B Instruct v0.2) ☁️ | 22 | 47 | 59 | 53 | 62 | 8 | 42 | 0.05 € | 2.39 rps |
| 105. Gemma 2 9B IT ⚠️ | 45 | 25 | 47 | 36 | 68 | 13 | 39 | 0.02 € | 0.88 rps |
| 106. Meta Llama2 13B chat f16 🦙 | 22 | 38 | 17 | 65 | 75 | 6 | 37 | 0.75 € | 1.44 rps |
| 107. Mistral 7B Zephyr-β f16 ✅ | 37 | 34 | 46 | 62 | 29 | 4 | 35 | 0.46 € | 2.34 rps |
| 108. Meta Llama2 7B chat f16 🦙 | 22 | 33 | 20 | 62 | 50 | 18 | 34 | 0.56 € | 1.93 rps |
| 109. Mistral 7B Notus-v1 f16 ⚠️ | 10 | 54 | 25 | 60 | 48 | 4 | 33 | 0.75 € | 1.43 rps |
| 110. Orca 2 13B f16 ⚠️ | 18 | 22 | 32 | 29 | 67 | 20 | 31 | 0.95 € | 1.14 rps |
| 111. Mistral 7B Instruct v0.2 f16 ☁️ | 11 | 30 | 54 | 25 | 58 | 8 | 31 | 0.96 € | 1.12 rps |
| 112. Mistral 7B v0.1 f16 ☁️ | 0 | 9 | 48 | 63 | 52 | 12 | 31 | 0.87 € | 1.23 rps |
| 113. Google Gemma 2B IT f16 ⚠️ | 33 | 28 | 16 | 47 | 15 | 20 | 27 | 0.30 € | 3.54 rps |
| 114. Microsoft Phi 3 Medium 4K Instruct 14B f16 ⚠️ | 5 | 34 | 30 | 32 | 47 | 8 | 26 | 0.82 € | 1.32 rps |
| 115. Orca 2 7B f16 ⚠️ | 2 | 20 | 26 | 26 | 52 | 4 | 22 | 0.78 € | 1.38 rps |
| 116. Google Gemma 7B IT f16 ⚠️ | 0 | 0 | 0 | 66 | 2 | 0 | 11 | 0.99 € | 1.08 rps |
| 117. Meta Llama2 7B f16 🦙 | 0 | 5 | 22 | 3 | 28 | 2 | 10 | 0.95 € | 1.13 rps |
| 118. Yi 1.5 9B Chat f16 ⚠️ | 0 | 4 | 29 | 17 | 0 | 8 | 10 | 1.41 € | 0.76 rps |

Benchmarking Llama 3.3, Amazon Nova, Gemini 1206

We’ll cover these models in one go.

Meta Llama 3.3 70B Instruct - 45th place.

Llama 3.3 70B Instruct held 40th place at the moment of its release, but a few better models have shown up since. This is a common pattern: if a company doesn’t keep releasing better models, it gets pushed down by competitors pretty fast.

Llama 3.3 70B has a decent Reason score, just below Llama 3.1 405B and the older Llama 3.1 70B; however, it doesn’t follow instructions in business tasks all that well. This is a typical problem for Llama models. It would normally be fixed by good fine-tunes, except that the market has started realising that the ROI of fine-tuning models in practice is lower than it seemed. So we don’t expect its position to change anytime soon.

Amazon Nova - bad

Amazon has released its own line of LLMs: Amazon Nova Micro, Lite and Pro. They are very cheap to run and quite useless, taking 79th, 36th and 55th place respectively.

Do you know what the silver lining here is? These bad models achieve the quality of GPT-3.5, which was ground-breaking back in the day. So the models aren’t that bad; progress just keeps moving the goalposts faster than we notice.

Google Gemini Experimental 1206 and 2.0 Flash Experimental

Google Gemini Experimental 1206 - not so good

Google Gemini Experimental 1206 took 28th place, which is much worse than Google Gemini 1.5 Pro v2. The latter is really good, if you manage to get past all the Google quirks. Still, that’s acceptable, because 1206 is only an experimental model, not an official release.

Yet, this matches the quality level of some GPT-4 Turbo versions!

Google Gemini 2.0 Flash Experimental is a more interesting model

It is still experimental, but it made its way into the TOP-10 of our benchmark!

Compared to the previous version of Flash (Gemini 1.5 Flash), this experimental model improved its reasoning capability from 44 to 62, while increasing the overall score from 75 to 84.

Google Gemini 2.0 Flash also pays a lot of attention to instructions (which is really important for Structured Output / Custom Chain of Thought patterns) and has achieved a perfect 100 score in Docs and Integrate. It is the first model to do so.

Google DeepMind writes that the model was created for automation and "agentic experiences" (whatever that means). It also has a 1M-token input context.

This model also potentially has the lowest usage cost among the top 19 models. The model in 20th place, DeepSeek v3 671B, is another cost-wise contender.

“Potentially”, because the price for Google Gemini 2.0 Flash is not known at this moment, so we assume it’s the same as for Flash 1.5.

Google continues to surprise us in a good way, continuously releasing new models that make it into the TOP-10. This has the side effect of pushing old favourites (Mistral and Anthropic) out of the spotlight. It doesn’t make those models worse; on the contrary, it means we are getting more options to choose from!

DeepSeek v3

DeepSeek v3 is a recently released Mixture-of-Experts (MoE) language model with 671B total parameters. It tries to be efficient at inference, so only 37B parameters are activated for each token. This is reflected in the cost of running the model. It is also locally capable: you can download it and run it on your own servers, provided there are enough GPUs to host the weights.

DeepSeek v3 has improved over the scores of its predecessor DeepSeek v2.5 (now at 30th place). It can solve business automation tasks in the CRM category at a score of 95 (up from 80). Its ability to solve software engineering tasks improved from 57 to 62 (although the model still has a long way to go to catch up with good old Claude 3.5 Sonnet v2 at 82).

Even though DeepSeek v3 activates only 37B parameters per token, this doesn’t make it easier to launch the model locally. Mixture of Experts (MoE) means “faster inference”, not “lower VRAM requirements”: since different tokens activate different experts, all of the weights still have to be held in GPU memory. We would need something like 8xH200 GPUs to run model inference locally. This makes the model not so suitable for local use.
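
To see why, here is a back-of-the-envelope estimate (our own rough sketch with assumed overhead factors, not an official sizing guide):

```python
# Rough VRAM estimate for hosting DeepSeek v3 locally.
# Assumption: all 671B parameters stay resident, even though only 37B are active per token.
import math

params_billions = 671       # total parameters
bytes_per_param = 1         # FP8 weights
weights_gb = params_billions * bytes_per_param  # ~671 GB for the weights alone
serving_gb = weights_gb * 1.25                  # assumed headroom for KV cache and activations
h200_memory_gb = 141                            # memory of a single NVIDIA H200

print(math.ceil(serving_gb / h200_memory_gb))   # -> 6 GPUs, so a standard 8xH200 node is the realistic target
```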

What is peculiar about DeepSeek v3 is that it is the first model of this scale to use an FP8 mixed-precision training framework. This approach enables training new LLMs faster, cheaper and with lower VRAM requirements. It should potentially also lead to better out-of-the-box quantisation at inference. Let’s see if that technique helps to create more small and powerful local models.

Manual benchmark of OpenAI o1 pro - Gold Standard

Let’s move on to the hero of this LLM Benchmark: o1 pro from OpenAI. But before we proceed, there is an important caveat. There are 6 different flavours of the OpenAI o1 model - don’t mix them up!

  • o1-mini - the smallest and cheapest of all reasoning models. It is available both in ChatGPT UI and over the API.

  • o1-preview - a really capable reasoning version that was previously available in the ChatGPT UI. It is no longer available there; the o1 base model replaced it. o1-preview is still available directly through the API.

  • o1 - this is the model that replaced o1-preview in the ChatGPT UI. In the UI, this version has more limited reasoning capability by default, making it less capable than o1-preview. The model isn’t widely accessible via the API yet (only for Tier 5 accounts). In the API, o1 has three possible reasoning_effort configurations: high, medium and low. The higher the effort, the more expensive and capable the model becomes.

  • o1 pro - the most powerful model of them all. It is available in the ChatGPT UI for a monthly cost of $200. It isn’t available via the API yet.

There you have it: 4 flavours of the o1 model, plus 2 additional reasoning-effort configurations (high and low) for the base o1 - 6 in total.

This section focuses solely on o1 pro. As an exception, this model was not tested via the API (because it isn’t available there yet), but manually through the ChatGPT UI. Here is how it was done:

  • We took the results from the o1-mini benchmark and selected only the tasks where o1-mini made mistakes. Since o1 pro is way more capable, we assumed that if o1-mini got something right, o1 pro would also answer correctly. This way we didn’t have to run a few hundred tasks from the entire benchmark manually, only a few dozen.

  • We made sure to disable custom instructions in the ChatGPT UI. Local memory was also disabled.

  • We converted benchmark requests from the API format to a textual format and launched them manually by copy-pasting.

This is where we’ve encountered the first gotchas.

First of all, o1 pro is embedded deeply in the ChatGPT UI, which tries to be convenient. For example, if a task has to return YAML, the UI will format it as Markdown, breaking the response completely. We had to fix answers like that manually.

Second, we have historically formatted few-shot samples like this:

System: Task explanation
User: sample request 1
Assistant: sample response 1
User: sample request 2
Assistant: sample response 2 
User: real request

We can’t do role-based prompting in the ChatGPT UI. Besides, the System prompt isn’t even accessible in the o1 line of models, to prevent reasoning tokens from leaking to end users (they are generated by the models without alignment and guardrails). The model isn’t only designed to protect its System prompt (also called the Platform prompt in the latest documentation); it also tries to work with the user via the dialogue.

This led to an interesting outcome: the model gave lower priority to the System instructions and tried to find patterns in past conversations with the user. It could occasionally find them and arrive at incorrect conclusions, leading to low Integrate scores.

So we had to start formatting o1 pro tasks like this:

# Task
Task explanation
## Example
User: sample request 1
Assistant: sample response 1
## Example
User: sample request 2
Assistant: sample response 2
# Request
real request
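
For reference, here is a minimal sketch (a hypothetical helper, not our actual benchmark code) of how API-style chat messages can be rendered into this textual format:

```python
# Renders API-style chat messages into the "# Task / ## Example / # Request" layout above.
def render_for_chat_ui(messages: list[dict]) -> str:
    lines, examples, pending_user = [], [], None
    for msg in messages:
        role, content = msg["role"], msg["content"]
        if role == "system":
            lines += ["# Task", content]
        elif role == "user":
            pending_user = content
        elif role == "assistant":                 # a completed few-shot pair
            examples.append((pending_user, content))
            pending_user = None
    for user, assistant in examples:
        lines += ["## Example", f"User: {user}", f"Assistant: {assistant}"]
    lines += ["# Request", pending_user or ""]    # the trailing user message is the real request
    return "\n".join(lines)

messages = [
    {"role": "system", "content": "Task explanation"},
    {"role": "user", "content": "sample request 1"},
    {"role": "assistant", "content": "sample response 1"},
    {"role": "user", "content": "real request"},
]
print(render_for_chat_ui(messages))
```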

That said, what were the results?

o1 pro reached the very top of our benchmark, achieving an almost perfect 97 (the remaining 3 points can be attributed to ambiguous tasks in our benchmark).

Within our benchmark, which measures the capabilities of LLMs in business automation tasks, this model is like a gold ingot: it is perfect and expensive. It is overkill.

As always, this is good news for two reasons:

  • We have clearly arrived at the point where LLMs can easily solve all tasks in our business automation challenges (from 18 months ago). Now we just need to wait for comparable models that are cheaper to run.

  • While developing the second version of the LLM benchmark, we can keep current o1 pro capabilities in sight and formulate new tasks that will challenge even o1 pro. This will make the evaluation complexity curve smoother, helping the entire benchmark be more representative of business automation tasks.

Benchmark of o1 (base) - 🥉TOP-3

Do you remember the disclaimer about different flavours of o1 models above?

This benchmark focuses on the o1 (base) model, which was tested through the API with a reasoning_effort of medium. This is not necessarily the same model configuration as what is available via the ChatGPT UI.

The difference is not only in the compute limitations; there is also a new chain of command (rules of robotics, as implemented by OpenAI for its reasoning models): Platform > Developer > User > Tool.

The o1 base model was tested automatically via the API, just like all the other models (except o1 pro). It took 3rd place - slightly better than o1-mini, slightly worse than o1-preview.

reasoning_effort was set to medium (the default value) and max_tokens was set to 25,000 (as per OpenAI’s recommendation).
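
A minimal sketch of such a call with the openai Python SDK (v1.x) is shown below; it assumes your account tier already has API access to o1:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="o1",
    reasoning_effort="medium",    # "low" | "medium" (default) | "high"
    max_completion_tokens=25000,  # reserve generous room for the hidden reasoning tokens
    messages=[
        {"role": "developer", "content": "Task explanation"},  # o1 uses "developer" instead of "system"
        {"role": "user", "content": "real request"},
    ],
)
print(response.choices[0].message.content)
```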

What is peculiar is that o1 base takes 3rd place both in capability and in cost. This makes for a very interesting curve: at the very top, reasoning capability is a function of cost.

o1-preview works better than o1 base because it generates more tokens; it costs more, but the result is also better. o1 pro just thinks deeper and more thoroughly in general.

This trend also backs up recent research from Hugging Face on scaling test-time compute. It is about improving the quality of a 3B model to the level of a 70B one by spending more time on reasoning (and generating possible answers). So we can probably expect more LLM providers to start offering smarter models for an extra price (you pay for the reasoning tokens).
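
In its simplest form, the idea looks like best-of-N sampling. Here is a toy sketch (our own simplification; the Hugging Face work uses search guided by a trained reward model, whereas we use the model itself as a crude verifier):

```python
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"  # stand-in for a small model

def best_of_n(question: str, n: int = 8) -> str:
    # Spend more test-time compute: sample several candidate answers.
    candidates = [
        client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "user", "content": question}],
            temperature=1.0,
        ).choices[0].message.content
        for _ in range(n)
    ]

    # Score each candidate, then return the highest-rated one.
    def score(answer: str) -> float:
        verdict = client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "user", "content":
                f"Question: {question}\nAnswer: {answer}\n"
                "Rate the correctness from 0 to 10. Reply with the number only."}],
        ).choices[0].message.content
        try:
            return float(verdict.strip())
        except ValueError:
            return 0.0

    return max(candidates, key=score)
```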

Perhaps afterwards there will also be new ways to run intensive reasoning locally (similar to how it happened with local structured outputs).

What about recently announced o3?

OpenAI recently announced its new model o3, which solves tasks from the ARC-AGI benchmark really well.

Why are there o1 and o3, but no o2? A naming conflict with the O2 telecom company could be the root cause.

What is ARC-AGI? It is a set of challenges that attempts to compare human intellect with machine intelligence. The website claims that solving ARC-AGI would be even more outstanding than the invention of the transformer architecture.

Below is an example of one task. To solve it, the machine needs to figure out the rules and produce a pixel-perfect response.

As the story goes, o3 was able to solve almost all tasks from this benchmark. This is something that wasn’t considered possible before.

This theoretically makes o3 the best LLM. However, we believe it won’t have a noticeable impact on business automation tasks in companies any time soon. There is a catch: cost.

Take a look at the chart below from the ARC-AGI announcement. It maps the performance of different models against the cost of solving a single task.

The cost scale is logarithmic. The cost of solving a single task with o3 high (tuned) was around 3,200 USD per pixel-perfect response.

We mentioned earlier that o1 pro is the gold standard: it’s perfect in business automation and already too expensive to be practical in most cases. o3 pushes the envelope even further.

Yet, adoption of LLMs works out well in cases where we gain a lot of value from automation. This business value is currently achievable in mundane tasks where LLMs are cheaper, more patient and more accurate than humans. These are simple and easily verifiable tasks like data extraction from documents, request classification, code generation, review of standard contracts, etc.

The real issue here is cost-efficiency. o3 from OpenAI is not cheap at all, so it will not have a big impact. However, it might pave the way for improving the quality of other models, e.g. via the generation of high-quality synthetic data that could be used for training.

Our Predictions for 2025

These are opinionated predictions, based on the patterns observed among our AI Cases.

The hype around fine-tuning LLMs will die out

Fine-tuning LLMs was frequently mentioned as a way “to train an LLM on your company data” or “teach an LLM new tricks”. Even OpenAI offers fine-tuning as a service.

In theory everything looks simple: just feed the LLM a lot of documents, and it will learn from them. What happens in practice: instead of getting better accuracy, teams suddenly end up with models that hallucinate a lot more. Most of the time, they underestimate the complexity of preparing proper data and following the training regimen.

Among our AI Cases, only one project has successfully fine-tuned an LLM (we are not counting embedding models, of course). They had a lot of carefully prepared data for the task, and it still took them quite a few iterations.

We believe that in 2025 both businesses and software consultants/vendors will start to realise the real complexity and cost of fine-tuning LLMs. They will also understand the power that a good foundational LLM can already provide out of the box, especially if one leverages powerful inference patterns like Structured Outputs and custom Chain-of-Thought.

The hype around Autonomous Agents will start fading away

We aren’t claiming that autonomous agents are impossible. If enough effort is invested, it is possible to deliver something like that.

However, the concept of an autonomous agent is not very practical. It is a complex product to design, build and integrate while still ensuring predictable quality.

Please let us stress one fact: agents are not technically complex. In essence, an agent is just a series of prompts that pass control and context to each other while using external tools (see the sketch below). Yet, due to the shape of the product, it is really hard to set up a cost-efficient process for delivering trustworthy agent-driven products. In practice, things just start falling apart, and budgets run out before the systems stop hallucinating.
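
To illustrate what “a series of prompts using external tools” means, here is a minimal sketch of an agent loop (the tool name and logic are our own illustration, using the openai SDK’s function-calling API):

```python
import json
from openai import OpenAI

client = OpenAI()

def lookup_order(order_id: str) -> str:
    return json.dumps({"order_id": order_id, "status": "shipped"})  # stub for a real system

tools = [{
    "type": "function",
    "function": {
        "name": "lookup_order",
        "description": "Look up an order by its id",
        "parameters": {"type": "object",
                       "properties": {"order_id": {"type": "string"}},
                       "required": ["order_id"]},
    },
}]

messages = [{"role": "user", "content": "Where is order 42?"}]
while True:
    reply = client.chat.completions.create(
        model="gpt-4o", messages=messages, tools=tools
    ).choices[0].message
    messages.append(reply)              # context is passed along on every turn
    if not reply.tool_calls:            # no tool requested: this is the final answer
        print(reply.content)
        break
    for call in reply.tool_calls:       # control passes to a tool, then back to the model
        args = json.loads(call.function.arguments)
        messages.append({"role": "tool", "tool_call_id": call.id,
                         "content": lookup_order(**args)})
```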

Vendors will continue talking about agents in 2025 and selling “enterprise-ready agent frameworks” (they have investments to recoup), but we believe that the hype will start fading.

Will there be AGI in 2025? What about LLM trends?

There will be no AGI in 2025. General intelligence is an even harder task to solve, especially since we are getting particularly adept at moving the goalposts of “what is AGI?”. As the creators of ARC-AGI have written: “You'll know AGI is here when the exercise of creating tasks that are easy for regular humans but hard for AI becomes simply impossible.” And they are already working on v2 of their benchmark.

Still, a lot of companies will continue trying to compete with OpenAI for the spot of the most intelligent model. There is even a chance that Google will eventually dethrone OpenAI.

Just look at the trends of quality increases among models in 2024 (by different providers and cost categories):

We believe that, as a new shortcut for improving model reasoning, more AI vendors will start providing reasoning capabilities similar to the o1 models. This will be a temporary workaround to boost model accuracy quickly and without heavy investments: just throw more compute at the problem, let the model think longer before answering, and charge more for the API.

However, we also believe that the upcoming hype around smart reasoning models that are outrageously expensive will also start fading. It is just not very practical.

We also believe that AI vendors will start exposing more advanced features in their LLMs. Everybody already has large context and prompt caching (which already makes dedicated RAG impractical in many cases). But there are still powerful features that are not rolled out widely:

  • Structured Outputs (constrained decoding) - a powerful way to increase the quality of LLM answers in complex scenarios, especially when coupled with a custom chain of thought (see the sketch after this list). At the moment only OpenAI has a usable implementation. Google is still catching up with its mildly unusable controlled generation, which uses the VertexAI API format under the hood.

  • Document reasoning with VLMs. The latest LLMs are no longer text-only; they can also accept images or audio. This makes it possible to handle complex documents with tables and charts. Anthropic already has a flavour of this capability: it internally sends documents both as text and as images to its Sonnet 3.5 model, which works as a Vision-Language Model (VLM).

  • Native integration of LLMs with other tools - similar to how OpenAI's Assistants API allows its LLMs to use built-in RAG and a code execution sandbox. Anthropic is also trying to enter the playing field by introducing the Model Context Protocol (a standard for connecting LLMs to data sources and external tools, inspired by the Language Server Protocol).
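
Here is a minimal sketch of the Structured Outputs pattern coupled with a custom chain-of-thought field, using the openai Python SDK and pydantic (the schema and model name are our own illustration):

```python
from pydantic import BaseModel
from openai import OpenAI

class InvoiceExtraction(BaseModel):
    reasoning: str          # custom chain of thought: the model fills this in before answering
    vendor: str
    total_amount_eur: float

client = OpenAI()
completion = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[
        {"role": "system",
         "content": "Extract invoice fields. Think step by step in 'reasoning' first."},
        {"role": "user", "content": "ACME GmbH, Rechnungsbetrag: 1.234,56 EUR"},
    ],
    response_format=InvoiceExtraction,  # constrained decoding against this JSON schema
)
print(completion.choices[0].message.parsed)
```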

We also expect that AI vendors will try to include unique features in their LLM APIs in order to attract users. There will be some standardisation (e.g. Google is testing VertexAI access via OpenAI libraries) and some non-conformity at the same time (e.g. compare how differently prompt caching works at Google, OpenAI and Anthropic).

The whole situation is going to resemble the browser wars. Ultimately, standards will start emerging, but until then we can expect a lot of quirks, frequent migration pains and evolving features.

Fortunately, if we look beyond a single provider at the market situation in general, bigger patterns start to emerge. By optimising for the generic trends of the AI market, we can reduce the risk of making costly decisions and heading towards dead ends.

One last bit of news for the next year: we are planning to run the second round of our Enterprise RAG Challenge at the end of February!

Enterprise RAG Challenge is a friendly competition where we compare how different RAG architectures answer questions about business documents.

We held the first round of this challenge last summer. The results were impressive: with just 16 participating teams we were able to compare different RAG architectures and discover the power of using structured outputs in business tasks.

The second round is scheduled for February 27th. Mark your calendars!

Transform Your Digital Projects with the Best AI Language Models!

Discover the transformative power of the best LLMs and revolutionize your digital products with AI! Stay future-focused, boost efficiency, and gain a clear competitive edge. We help you elevate your business value to the next level.

 


Martin Warnung

martin.warnung@timetoact.at
