8th place in Enterprise RAG Challenge 2025:

Answering Business Questions with LLMs

Key Insights
  • PDF extraction is tough: Annual reports contain vital data, but extracting structured information automatically from PDFs is complex due to their human-centric formatting.
  • Retrieval quality is crucial: AI-generated answers strongly depend on correctly finding relevant pages within large documents.

  • Approach comparison:

    • Naive approach (Gemini 2.0 Flash): Quick and easy but limited with large files or multiple documents.

    • Dense retrieval (Vector Databases): Highly accurate, scalable, but requires extensive preprocessing.

    • Multi-agent systems: Improved accuracy by combining multiple AI models, but added complexity and cost.

  • Best performing solution: Combining Gemini for retrieval and OpenAI for answer generation delivered the best balance of accuracy and ease of use, ranking 8th on the final leaderboard.

3/11/25

Final leaderboard with TAT employees' submissions marked. Hours: time it took the team to produce the results; R: Retrieval Score (max 100); G: Generation Score (max 100); Score: final score (R/3 + G, max 133); AI: the team leveraged Rinat Abdullin's AI Research (through communities or TimeToAct); Lcl: a local model was used.

The Enterprise RAG Challenge, held on 27.02.2025, required answering company-specific questions about annual reports using large language models (LLMs) and sophisticated retrieval-augmented generation (RAG) techniques.

Participants had to extract precise information from extensive PDF documents, which often span 60 to 100 pages, contain complex formatting, and include tables and illustrations (as you can see below). Parsing data from PDFs is a challenge in itself: these files are optimized for visual presentation rather than structured data extraction, making it difficult to retrieve relevant text and numerical information accurately.

Example of a challenging PDF page to parse (Playtech plc Annual Report and Financial Statements 2022, p. 62)

Here are some example questions from the challenge to be answered with such PDFs:

A successful submission must not only provide an accurate answer but also reference the page numbers containing the relevant information. This enables two evaluation criteria (as can be seen in the leaderboard above):

  1. Generation Score (G): Measures the correctness of the extracted answer.
  2. Retrieval Score (R): Assesses whether the retrieval mechanism identified the correct locations in the document.

The generation score is inherently constrained by the retrieval score - if the wrong pages are extracted, even the best model cannot produce the correct answer. A higher retrieval score increases the likelihood of accurate answers.

A properly formatted submission looks like this:

To generate high-quality answers with reliable references, I built several LLM-powered approaches. You can find my implementations and experiments in the source code repository.

Naive Gemini 2.0 Flash Approach

📌 Ranking: 57th Place (across all submissions)
📌 Code: GitHub Repository
📌 Keywords: Gemini 2.0 Flash, chain of thought, structured outputs

Overview:
For the first, naive approach, I took inspiration from a blog post by Philipp Schmid. The key idea was to leverage the one-million-token context window of Gemini 2.0 Flash, which is free within certain usage limits (details).

Google's genAI package enables direct PDF file uploads into the model's context. This meant that each query was passed to Gemini along with a custom system prompt and the full PDF file, allowing the model to process the document in a single step.

Applied LLM Techniques:
To ensure machine-readable responses, I enforced structured outputs (SO), which included:

  • Chain of thought (CoT) reasoning: Encourages the model to explicitly think through the steps before generating an answer. This also helps with debugging and refining prompts.

  • Final answer: The extracted response.

  • Reference list: Page numbers where relevant information was found.
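
A minimal sketch of this setup, assuming the google-genai Python SDK and Pydantic for the response schema; class names, prompts, and the example question are illustrative, and parameter names may differ slightly between SDK versions:

```python
from google import genai
from google.genai import types
from pydantic import BaseModel


class AnswerSchema(BaseModel):
    chain_of_thought: str      # explicit reasoning before the answer (CoT)
    final_answer: str          # the extracted response
    references: list[int]      # page numbers containing the relevant information


client = genai.Client(api_key="YOUR_API_KEY")

# Upload the full annual report; Gemini 2.0 Flash holds it in its one-million-token context.
report = client.files.upload(file="annual_report.pdf")

response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=[report, "What was the company's total revenue in 2022?"],
    config=types.GenerateContentConfig(
        system_instruction="Answer strictly from the report and cite the page numbers used.",
        response_mime_type="application/json",
        response_schema=AnswerSchema,   # enforce structured output
    ),
)

answer = AnswerSchema.model_validate_json(response.text)
```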

Challenges

One major limitation was API rate limits and file size constraints. Some requests, especially those involving multiple companies, failed because they required loading five separate PDFs into context - exceeding the model’s capacity.

As Philipp Schmid outlined, optimizing PDFs before upload or splitting large files could have helped mitigate these issues. However, due to time constraints, I was unable to iterate systematically on a test set or analyze CoT outputs for fine-tuned prompt engineering. This limited the ability to refine the model’s reasoning and response accuracy.

Pros & Cons

✅ Pros

  • Easy setup – No pre-processing required.

  • Cost-effective – Free within usage limits.

  • Fast deployment – Minimal development effort.

❌ Cons

  • Rate limits & file size issues – Needs optimization for large or multi-document queries.

  • Lack of iterative refinement – No systematic debugging or CoT analysis.

Multi-agent Approaches

While the naive approach provided an easy way to process PDFs, it struggled with multi-company queries and API rate limits. To improve retrieval accuracy and scalability, I experimented with more advanced multi-agent systems.

The process follows a structured pipeline:

  1. Routing: An OpenAI GPT-4o-based router agent first identifies the relevant companies in a given query and formulates extended subqueries for each (a sketch follows this list).

  2. Company-Specific Retrieval: Specialized agents fetch relevant information for each company.

  3. Merging: A final merger agent compiles all responses into a final structured answer.
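
The routing step could look roughly like the sketch below, using OpenAI structured outputs; the Pydantic models, prompts, and example question are illustrative, not the original implementation:

```python
from openai import OpenAI
from pydantic import BaseModel


class Subquery(BaseModel):
    company: str   # company the subquery refers to
    query: str     # extended, self-contained question for that company


class RoutingResult(BaseModel):
    subqueries: list[Subquery]


client = OpenAI()


def route(question: str) -> RoutingResult:
    """Identify the relevant companies and formulate one extended subquery per company."""
    completion = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": "Identify every company the question refers to and write one "
                           "detailed, self-contained subquery per company.",
            },
            {"role": "user", "content": question},
        ],
        response_format=RoutingResult,  # parsed directly into the Pydantic model
    )
    return completion.choices[0].message.parsed


routing = route("Which company reported higher total assets in 2022, Playtech or Poste Italiane?")
for sub in routing.subqueries:
    print(sub.company, "->", sub.query)
```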

Dense Retrieval

Keywords: Custom chunking, Qdrant vector database, query extension, chain of thought, structured outputs

For this approach, each company-specialized agent used the Qdrant vector database for retrieval. PDFs were first converted to Markdown format using docling, offering advanced PDF understanding and parsing, especially for tables.
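
A minimal conversion sketch with docling (the pipeline options used in the actual submission may have differed):

```python
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("annual_report.pdf")      # layout-aware parsing, incl. tables
markdown = result.document.export_to_markdown()

with open("annual_report.md", "w", encoding="utf-8") as f:
    f.write(markdown)
```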

Custom Chunking Strategy

The extracted markdown content was split into custom chunks (a simplified sketch follows the list):

  • Chapter-based segmentation to retain context.
  • Paragraph-based splitting, only when necessary.
  • Tables preserved as whole units, with the preceding paragraph stored as metadata.
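
A simplified sketch of this chunking logic; the real heuristics were more involved, and the regex and threshold below are illustrative:

```python
import re


def chunk_markdown(markdown: str, max_paragraphs: int = 8) -> list[dict]:
    """Simplified chapter-based chunking: split on Markdown headings, keep tables
    as whole units (with the preceding paragraph as metadata), and fall back to
    paragraph-based splits only when a chapter grows too long."""
    chunks = []
    chapters = re.split(r"\n(?=#{1,3} )", markdown)          # chapter-based segmentation
    for chapter in chapters:
        blocks = [b for b in chapter.split("\n\n") if b.strip()]
        buffer, prev_paragraph = [], ""
        for block in blocks:
            if block.lstrip().startswith("|"):               # Markdown table: keep whole
                chunks.append({"text": block, "context": prev_paragraph})
            else:
                buffer.append(block)
                prev_paragraph = block
                if len(buffer) >= max_paragraphs:            # split only when necessary
                    chunks.append({"text": "\n\n".join(buffer), "context": ""})
                    buffer = []
        if buffer:
            chunks.append({"text": "\n\n".join(buffer), "context": ""})
    return chunks
```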

The chunks were embedded using OpenAI's "text-embedding-3-small" model. Based on the identified company and the router's refined query, the vector database retrieved the top five relevant chunks per question.
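
A sketch of the embedding and retrieval step, assuming one Qdrant collection per company; the collection layout and parameter values are illustrative:

```python
from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

openai_client = OpenAI()
qdrant = QdrantClient(path="qdrant_data")   # local, on-disk Qdrant instance


def embed(texts: list[str]) -> list[list[float]]:
    response = openai_client.embeddings.create(model="text-embedding-3-small", input=texts)
    return [item.embedding for item in response.data]


def index_company(company: str, chunks: list[str]) -> None:
    """Create one collection per company and store its embedded chunks."""
    qdrant.recreate_collection(
        collection_name=company,
        vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
    )
    qdrant.upsert(
        collection_name=company,
        points=[
            PointStruct(id=i, vector=vector, payload={"text": chunk})
            for i, (vector, chunk) in enumerate(zip(embed(chunks), chunks))
        ],
    )


def retrieve(company: str, query: str, top_k: int = 5) -> list[str]:
    """Return the top-k chunks for the router's refined query."""
    hits = qdrant.query_points(collection_name=company, query=embed([query])[0], limit=top_k)
    return [hit.payload["text"] for hit in hits.points]
```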

Challenges

  • One PDF required manual reshaping due to an incompatible page format.

  • Two reports were too large (one had 1,043 pages); a quick fix was to process only the first 150 pages.

  • Parsing all PDFs took ~10 hours.

  • Some extracted chunks contained artefacts, requiring post-processing before embedding.

Future Improvements

A more structured, iterative development cycle with a test set could have helped fine-tune the custom chunking strategy and retrieval parameters (e.g., the number of retrieved chunks). A hybrid retrieval approach using keyword search in the PDF could have boosted results further; for example, a human could easily find the value of total assets by searching for “total assets” in the PDF. Re-ranking the retrieved results with another agent could also be beneficial.

✅ Pros

  1. Structured chunking strategy – Custom chunking improves context retention and table preservation and offers full control.

  2. No need to load the full PDF into context – Only relevant chunks are retrieved, reducing token usage and avoiding API rate limits, enabling handling of lengthy reports.

❌ Cons

  1. Expensive pre-processing – Requires PDF parsing, markdown conversion, chunking, and embedding, increasing complexity and cost.

  2. Longer setup time – Initial document processing took ~10 hours.

  3. Chunk retrieval tuning required – The number of retrieved chunks, embedding quality, and query formulation need fine-tuning for optimal results.

  4. Not fully automated – Some manual fixes were needed for problematic PDFs (e.g., reshaping incompatible reports).

IBM Generation Approach

📌 Ranking: 94th (across all submissions)
📌 Code: GitHub Repository
📌 Keywords: multi-agent, dense retrieval, OpenAI GPT-4o, IBM

In this approach, the retrieved chunks from the dense retrieval pipeline were passed to IBM's "granite-20b-code-instruct" model, which served as the company-specialized agent.

  • This agent generated company-specific answers, but only as plain text; it was instructed to output JSON, since structured outputs were not supported.

  • The output was then sent to a GPT-4o-based merger agent, which reformatted it into structured outputs suitable for final submission.

  • Chain of thought reasoning was applied at both stages.

  • References were not implemented due to lack of time, negatively impacting the retrieval score and hence the ranking of this approach.

✅ Pros

  1. More sophisticated retrieval method.

❌ Cons

  1. Expensive pre-processing due to PDF conversion, chunking, and embedding.

  2. No structured outputs from the model, which makes data handling more challenging.

OpenAI Generation Approach

📌 Ranking: 28th Place (across all submissions)
📌 Code: GitHub Repository
📌 Keywords: multi-agent, dense retrieval, OpenAI GPT-4o

Similar to the IBM-based approach, but instead of IBM’s model, OpenAI’s GPT-4o was used for company-specific responses.

  • Retrieved chunks (along with the extended query) were sent to GPT-4o.

  • Generated responses included chunk IDs, which were then mapped back to PDF page numbers using fuzzy search (see the sketch below).

  • A merger model then compiled the final structured answer.
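
A sketch of how such a fuzzy mapping from chunk text back to page numbers could look, assuming pypdf and rapidfuzz (the original code may have used different libraries):

```python
from pypdf import PdfReader
from rapidfuzz import fuzz


def map_chunk_to_page(chunk_text: str, pdf_path: str) -> int:
    """Return the 1-based page number whose extracted text best matches the chunk,
    using fuzzy partial matching to tolerate Markdown-conversion artefacts."""
    reader = PdfReader(pdf_path)
    best_page, best_score = 1, 0.0
    probe = chunk_text[:500]   # a prefix is usually enough to identify the page
    for number, page in enumerate(reader.pages, start=1):
        score = fuzz.partial_ratio(probe, page.extract_text() or "")
        if score > best_score:
            best_page, best_score = number, score
    return best_page
```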

✅ Pros

More sophisticated retrieval method, offering higher accuracy.

❌ Cons

Expensive pre-processing due to PDF conversion, chunking, and embedding.

Gemini-based Retrieval + OpenAI-based Generation

📌 Ranking: 19th Place (across all submissions), 8th Place across teams
📌 Code: GitHub Repository
📌 Keywords: multi-agent, Gemini 2.0 Flash, OpenAI GPT-4o

Instead of relying on vector-based retrieval, this method directly queried Gemini 2.0 Flash with the full annual report of the relevant company in context.

Key Difference from the Naive Gemini Approach:

  • The router agent extended queries before passing them to Gemini.

  • Multi-company queries were split into separate subqueries, avoiding rate limits.

Challenges

  1. Rate limits still occurred for "Poste Italiane" due to its 1,000+ page report.

  2. A quick fix could have been to optimize or truncate the PDF (sketched below), but this was not implemented due to time constraints.
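
A sketch of how such a truncation quick fix could look with pypdf (not part of the submitted solution; file names are illustrative):

```python
from pypdf import PdfReader, PdfWriter


def truncate_pdf(source: str, target: str, max_pages: int = 150) -> None:
    """Keep only the first max_pages pages so the upload stays within Gemini's limits."""
    reader = PdfReader(source)
    writer = PdfWriter()
    for page in reader.pages[:max_pages]:
        writer.add_page(page)
    with open(target, "wb") as f:
        writer.write(f)


truncate_pdf("poste_italiane_annual_report.pdf", "poste_italiane_annual_report_truncated.pdf")
```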

✅ Pros

  1. No expensive pre-processing (faster & cheaper).

  2. Achieved the highest scores among tested methods.

❌ Cons

  • API rate limits still need to be handled for extremely large PDFs.

Final Overview & Thoughts

*) Rank on the leaderboard with all submissions; counting only the best submission per team, this corresponds to 8th place on the final leaderboard.

Key Takeaways

  • For simple use cases, Gemini 2.0 Flash is quick and easy but struggles with large reports and multi-company queries.

  • Dense retrieval provides a scalable and modular solution, reducing API costs by not requiring the full PDF in context, and ultimately achieved the highest retrieval score among the approaches.

  • Multi-agent systems (especially OpenAI-based) achieve higher accuracy but require more processing time and cost.

  • Gemini + OpenAI merging performed best in rankings but still requires rate limit handling for massive PDFs.

  • Better final answer generation could have boosted scores. While the retrieval mechanism performed well (as indicated by the high retrieval score), iterative prompt engineering against a test set could have further improved the accuracy of the generated answers.

The multi-agent approaches significantly improved retrieval accuracy and final answer quality compared to simpler single-agent methods. However, they also introduced higher costs and complexity due to their multi-step nature. The dense retrieval approach, while structured and scalable, further amplified these issues with its pre-processing and embedding overhead.

The Gemini-based retrieval with OpenAI generation eliminated many of these challenges by removing the need for dense retrieval, leading to the best scores with minimal setup - making it a highly efficient hybrid solution.

Despite the challenges and room for optimization, the ERC 2025 offered a great opportunity to prototype different solutions, experiment with LLM-powered retrieval, and learn a lot under real-world constraints. Getting my “hands dirty” with RAG techniques, multi-agent collaboration, and retrieval models in such a short timeframe was an amazing experience! 🚀