Daniel has agreed to publish the source code for his solution. As soon as it is available, we will update the GitHub repository with the links. The status of the source code release can be seen in the source column.
Under the hood, Daniel’s solution uses the GPT-4o model with Structured Outputs. The pre-fill phase benefits from the fact that the possible types of questions were shared publicly with all participants in the form of a public question generator. The solution prepares a checklist of the types of information to extract, enforces the data types with Structured Outputs, and runs the extraction against all documents. Large documents are split based on their size.
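A minimal sketch of such a pre-fill step with the OpenAI Python SDK and Pydantic is shown below; the `CompanyChecklist` fields and the model snapshot name are illustrative assumptions, not Daniel’s actual schema:

```python
# Sketch of the checklist pre-fill step (illustrative schema, not the actual one).
from openai import OpenAI
from pydantic import BaseModel


class CompanyChecklist(BaseModel):
    # Hypothetical fields derived from the public question generator.
    company_name: str | None
    total_revenue: float | None
    net_income: float | None
    number_of_employees: int | None


client = OpenAI()


def prefill_checklist(document_text: str) -> CompanyChecklist:
    # Structured Outputs guarantees the reply matches the Pydantic schema.
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-2024-08-06",
        messages=[
            {"role": "system", "content": "Extract the checklist fields from the annual report."},
            {"role": "user", "content": document_text},
        ],
        response_format=CompanyChecklist,
    )
    return completion.choices[0].message.parsed
```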
During the question answering phase, each question is passed to GPT-4o together with the pre-filled checklist data. The resulting answer is shaped into the proper schema, again using Structured Outputs.
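Building on the sketch above, the answering step could look roughly like this; the `Answer` schema and prompt wording are assumptions:

```python
class Answer(BaseModel):
    # Hypothetical answer schema; the challenge expects typed answers per question.
    value: str
    currency: str | None


def answer_question(question: str, checklist: CompanyChecklist) -> Answer:
    # Pass the pre-filled checklist along with the question and enforce the schema.
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-2024-08-06",
        messages=[
            {"role": "system", "content": "Answer using only the pre-filled checklist data."},
            {"role": "user", "content": f"Checklist: {checklist.model_dump_json()}\nQuestion: {question}"},
        ],
        response_format=Answer,
    )
    return completion.choices[0].message.parsed
```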
The solution was a bit on the expensive side: the information pre-fill for 20 PDFs consumed almost $6, while answering 40 questions took $2.44.
In this challenge we don’t place any limits on the cost of the solution, but we encourage participants to capture and share cost data. Readers can then prioritise the resulting solutions based on their own criteria.
Second Best - Classic RAG with GPT-4o
The second-best solution came from Ilya Rice. It scored 76 points, achieved with GPT-4o and a classical LangChain-based RAG. It used one of the best embedding models, text-embedding-3-large from OpenAI, together with custom Chain-of-Thought prompts. The solution used fitz (PyMuPDF) for text parsing and chunked texts by character count.
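A rough sketch of such a pipeline is shown below; the chunk sizes, separator and package layout are assumptions, and the actual prompts and parameters were Ilya’s own:

```python
# Sketch of a classic RAG pipeline: fitz for parsing, character-count chunking,
# text-embedding-3-large for retrieval. Chunk sizes are illustrative.
import fitz  # PyMuPDF
from langchain_text_splitters import CharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS


def load_pdf_text(path: str) -> str:
    # Extract plain text from every page of the PDF.
    with fitz.open(path) as doc:
        return "\n".join(page.get_text() for page in doc)


splitter = CharacterTextSplitter(separator="\n", chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_text(load_pdf_text("annual_report.pdf"))

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
store = FAISS.from_texts(chunks, embeddings)
retriever = store.as_retriever(search_kwargs={"k": 5})
```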
Third Best Solution - Checklists with Gemini Flash
The third-best solution was provided by Artem Nurmukhametov. His solution was architecturally similar to Daniel’s, but used multi-stage processing for the checklists. It used the Gemini Flash model from Google to drive the system.
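A hedged sketch of what one such checklist stage could look like with the Gemini API; the model name, prompt and checklist items are assumptions, not Artem’s actual implementation:

```python
# Sketch of a single checklist stage on Gemini Flash, asking for JSON output.
import json
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash")


def run_checklist_stage(document_text: str, checklist_items: list[str]) -> dict:
    # Ask the model to fill in one stage of the checklist as a JSON object.
    prompt = (
        "Extract the following items from the annual report as a JSON object:\n"
        + "\n".join(f"- {item}" for item in checklist_items)
        + "\n\nReport:\n"
        + document_text
    )
    response = model.generate_content(
        prompt,
        generation_config=genai.GenerationConfig(response_mime_type="application/json"),
    )
    return json.loads(response.text)
```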
The solution was also on the expensive side, consuming $4 for the full challenge run.
As you may have noticed, two out of the three top solutions used the Checklist pattern and Knowledge Mapping to benefit from the fact that the domain is known in advance. While this is a common situation in businesses (we can use Domain-Driven Design and iterative product development to capture a similar level of detail), it puts classical RAG systems at a disadvantage.
To compensate for that, in the next round of the Enterprise RAG Challenge we will rework the question generator to have much more variability, making it prohibitively expensive to “cheat” by simply using Knowledge Mapping.
Best On-Premise Solution
As you may have noticed, most of the solutions used the GPT-4o LLM from OpenAI. According to our benchmarks, it is one of the best and most cost-effective LLMs currently available.
However, in the real world, companies are sometimes interested in solutions that can run completely on-premises. This can be desirable for various reasons: cost, IP protection or compliance.
Locality comes at a cost - local models like Llama are less capable than cloud-based models like OpenAI GPT-4 or Claude 3.5 Sonnet. To compensate, local AI systems are starting to leverage advanced techniques that are sometimes possible only with local models - precise guidance, fine-tuning (full fine-tuning, not the adapters that OpenAI employs), custom mixtures and ensembles of experts, or wide beam search.
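To illustrate what precise guidance means in practice, here is a minimal sketch using a GBNF grammar with llama-cpp-python to constrain a local model’s output; the model path and grammar are purely illustrative and are not taken from any of the submitted solutions:

```python
# Sketch: grammar-constrained decoding, which requires direct access to the
# model's sampling loop - something only a locally hosted model provides.
from llama_cpp import Llama, LlamaGrammar

# Restrict the model to answering strictly "yes" or "no".
grammar = LlamaGrammar.from_string('root ::= "yes" | "no"')

llm = Llama(model_path="models/llama-3-8b-instruct.Q4_K_M.gguf")  # illustrative path
result = llm(
    "Does the 2022 annual report mention share buybacks? Answer yes or no.",
    grammar=grammar,
    max_tokens=4,
)
print(result["choices"][0]["text"])
```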
It can be hard to compare the effective accuracy of drastically different approaches. The Enterprise RAG Challenge allows us to start comparing them against the same baseline.
Sixth place is taken by a fully local system with a score of 69. The gap between this system and the winner is much smaller than we expected!
Under the hood, this system uses the Qwen-72B LLM, which is quite popular in parts of Europe and Asia. The overall architecture is based on ReAct agent loops from LangChain with a RAG-driven query engine. Table data from PDFs was converted to XML, and RecursiveCharacterTextSplitter was used for text chunking.
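A rough sketch of such a ReAct loop in LangChain, assuming Qwen-72B is served behind an OpenAI-compatible endpoint; the endpoint URL, model name, embedding model and tool description are assumptions, not the actual submission:

```python
# Sketch of a ReAct agent over a RAG retriever, with a locally served Qwen model.
from langchain import hub
from langchain.agents import AgentExecutor, create_react_agent
from langchain.tools.retriever import create_retriever_tool
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="http://localhost:8000/v1",  # e.g. vLLM serving Qwen-72B (assumption)
    api_key="not-needed",
    model="Qwen2-72B-Instruct",
)

# Minimal in-memory index for illustration; the real system indexes the
# XML-converted tables and RecursiveCharacterTextSplitter chunks.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
store = FAISS.from_texts(["Revenue in 2022 was 1.2 billion EUR."], embeddings)
retriever = store.as_retriever()

search_tool = create_retriever_tool(
    retriever,
    name="search_annual_reports",
    description="Search chunks of the annual report PDFs.",
)

prompt = hub.pull("hwchase17/react")  # standard ReAct prompt template
agent = create_react_agent(llm, [search_tool], prompt)
executor = AgentExecutor(agent=agent, tools=[search_tool], verbose=True)

print(executor.invoke({"input": "What was the company's total revenue in 2022?"}))
```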
The table has two other solutions that can run fully on-premises. These are marked with a ⭐ in the "Local" column.
Round 2 - This Fall
The first round was done within a small circle of peers, to test-drive and polish the experience. The reception was much better than we expected.
We are planning to host the next round of the Enterprise RAG Challenge later this fall. This round will be announced publicly and will include a few small balance changes:
The question generator will be rebalanced to produce fewer questions that result in an N/A answer. We’ll still keep a few around to catch hallucination cases.
We will generate more questions and ensure greater diversity of possible questions. This will make the competition more challenging for approaches based on Knowledge Mapping and the Checklist LLM pattern.
All changes will be made public and shared as open source before the start of the competition. Every participant will be able to use that knowledge to prepare for the competition.
In addition, the source code of solutions from TIMETOACT GROUP will be shared for everybody to benefit from.
We will also try to gather more data from the participants and make it more consistent.
All of that should make the results from the next round more valuable, helping to push forward our shared understanding of what it takes to build high-quality AI solutions for the enterprise in practice.
Strategic Outlook
We are heading into the end of the summer holidays and into a new busy period for business. What can we expect in the coming months in the world of “LLMs for the Enterprise”?
First of all, architectural approaches for solving customer problems will continue evolving. As we have seen in the RAG Challenge, there isn’t a single best option that clearly beats everything else. Radically different architectures are currently competing: solutions based on Knowledge Mapping, classical vector-based RAGs, and systems with dedicated agents and knowledge graphs.
Looking at the architecture alone, it is not possible to tell in advance whether it will be the best solution. The number of lines of code is not a clear indicator either.
On the architecture side alone, there is still room to improve the quality of LLM-driven solutions.
However, LLM patterns and practices will not be the only factor driving future quality improvements. Let’s not forget that Large Language Models are continuously getting better and cheaper.