8th Place in the Enterprise RAG Challenge 2025: Answering Business Questions with LLMs

Key Insights
  • PDF extraction is tough: Annual reports contain vital data, but extracting structured information automatically from PDFs is complex due to their human-centric formatting.
  • Retrieval quality is crucial: AI-generated answers strongly depend on correctly finding relevant pages within large documents.

  • Approach comparison:

    • Naive approach (Gemini 2.0 Flash): Quick and easy but limited with large files or multiple documents.

    • Dense retrieval (Vector Databases): Highly accurate, scalable, but requires extensive preprocessing.

    • Multi-agent systems: Improved accuracy by combining multiple AI models, adding complexity and cost.

  • Best performing solution: Combining Gemini for retrieval and OpenAI for answer generation delivered the best balance of accuracy and ease of use, ranking 8th on the final leaderboard.

3/11/25

Final leaderboard with TAT employees' submissions marked. Hours: time the team needed to produce the results; R: Retrieval Score (max 100); G: Generation Score (max 100); Score: final score (R/3 + G, max 133); AI: the team leveraged Rinat Abdullin's AI Research (through communities or TimeToAct); Lcl: a local model was used.

The Enterprise RAG Challenge, held on 27.02.2025, required answering company-specific questions based on annual reports using large language models (LLMs) and sophisticated retrieval-augmented generation (RAG) techniques.

Participants had to extract precise information from extensive PDF documents, which often span 60 to 100 pages, contain complex formatting, and include tables and illustrations (as you can see below). Parsing data from PDFs is a challenge in itself: these files are optimized for visual presentation rather than structured data extraction, making it difficult to retrieve relevant text and numerical information accurately.

Example of a challenging PDF page to parse (Playtech plc Annual Report and Financial Statements 2022, p. 62)

Here are some example questions from the challenge to be answered with such PDFs:

A successful submission must not only provide an accurate answer but also reference the page numbers containing the relevant information. This enables two evaluation criteria (as can be seen in the leaderboard above):

  1. Generation Score (G): Measures the correctness of the extracted answer.
  2. Retrieval Score (R): Assesses whether the retrieval mechanism identified the correct locations in the document.

The generation score is inherently constrained by the retrieval score - if the wrong pages are extracted, even the best model cannot produce the correct answer. A higher retrieval score increases the likelihood of accurate answers.

A properly formatted submission looks like this:

To generate high-quality answers with reliable references, I built several LLM-powered approaches. You can find my implementations and experiments in the source code repository.

Naive Gemini 2.0 Flash Approach

📌 Ranking: 57th Place (across all submissions)
📌 Code: GitHub Repository
📌 Keywords: Gemini 2.0 Flash, chain of thought, structured outputs

Overview:
For the first, naive approach, I took inspiration from a blog post by Philipp Schmid. The key idea was to leverage the one-million-token context window of Gemini 2.0 Flash, which is free within certain usage limits (details).

Google's genAI package enables direct PDF file uploads into the model's context. This meant that each query was passed to Gemini along with a custom system prompt and the full PDF file, allowing the model to process the document in a single step.
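A minimal sketch of this setup with the google-genai Python package, assuming a placeholder API key, file name, and question (the exact upload signature may differ between package versions):

```python
from google import genai

# Hedged sketch: upload a full annual report and ask a single challenge question.
client = genai.Client(api_key="YOUR_API_KEY")  # placeholder key

# Upload the PDF so it can be passed into the model's context.
report = client.files.upload(file="annual_report.pdf")  # placeholder file name

question = "What was the total revenue (in USD) of the company in 2022?"  # example question

response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=[report, question],
)
print(response.text)
```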

Applied LLM Techniques:
To ensure machine-readable responses, I enforced structured outputs (SO), which included the following fields (a schema sketch follows the list):

  • Chain of thought (CoT) reasoning: Encourages the model to explicitly think through the steps before generating an answer. This also helps with debugging and refining prompts.

  • Final answer: The extracted response.

  • Reference list: Page numbers where relevant information was found.
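A hedged sketch of how such a structured output could be enforced with a Pydantic schema and Gemini's response-schema option (field names are illustrative, not necessarily the exact schema used in the submission):

```python
from pydantic import BaseModel
from google import genai

class Answer(BaseModel):
    chain_of_thought: str   # explicit reasoning steps before the answer
    final_answer: str       # the extracted response
    references: list[int]   # page numbers containing the relevant information

client = genai.Client(api_key="YOUR_API_KEY")  # placeholder key
response = client.models.generate_content(
    model="gemini-2.0-flash",
    # In the real pipeline, the uploaded PDF file object would also be part of `contents`.
    contents=["Answer strictly based on the attached annual report.",
              "What was the total revenue in 2022?"],
    config={
        "response_mime_type": "application/json",
        "response_schema": Answer,
    },
)
answer = Answer.model_validate_json(response.text)
print(answer.final_answer, answer.references)
```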

Challenges

One major limitation was API rate limits and file size constraints. Some requests, especially those involving multiple companies, failed because they required loading five separate PDFs into context - exceeding the model’s capacity.

As Philipp Schmid outlined, optimizing PDFs before upload or splitting large files could have helped mitigate these issues. However, due to time constraints, I was unable to iterate systematically on a test set or analyze CoT outputs for fine-tuned prompt engineering. This limited the ability to refine the model’s reasoning and response accuracy.

Pros & Cons

✅ Pros

  • Easy setup – No pre-processing required.

  • Cost-effective – Free within usage limits.

  • Fast deployment – Minimal development effort.

❌ Cons

  • Rate limits & file size issues – Needs optimization for large or multi-document queries.

  • Lack of iterative refinement – No systematic debugging or CoT analysis.

Multi-agent Approaches

While the naive approach provided an easy way to process PDFs, it struggled with multi-company queries and API rate limits. To improve retrieval accuracy and scalability, I experimented with more advanced multi-agent systems.

The process follows a structured pipeline:

  1. Routing: An OpenAI GPT-4o-based router agent first identifies the relevant companies in a given query and formulates extended subqueries for each company (a sketch of this step follows the list).

  2. Company-Specific Retrieval: Specialized agents fetch relevant information for each company.

  3. Merging: A final merger agent compiles all responses into a final structured answer.
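A hedged sketch of the routing step with OpenAI structured outputs (the schema, prompts, and example question are illustrative assumptions, not the exact implementation):

```python
from pydantic import BaseModel
from openai import OpenAI

class CompanySubquery(BaseModel):
    company: str    # company the subquery targets
    subquery: str   # extended, self-contained question for that company

class RouterOutput(BaseModel):
    subqueries: list[CompanySubquery]

client = OpenAI()
completion = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Identify the companies in the question and write one extended subquery per company."},
        {"role": "user", "content": "Which company reported higher total assets in 2022, Playtech or Poste Italiane?"},
    ],
    response_format=RouterOutput,
)
routing = completion.choices[0].message.parsed  # RouterOutput instance
```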

Dense Retrieval

Keywords: Custom chunking, Qdrant vector database, query extension, chain of thought, structured outputs

For this approach, each company-specialized agent used the Qdrant vector database for retrieval. PDFs were first converted to Markdown using docling, which offers advanced PDF understanding and parsing, especially for tables.

Custom Chunking Strategy

The extracted Markdown content was split into custom chunks (a simplified sketch follows the list):

  • Chapter-based segmentation to retain context.
  • Paragraph-based splitting, only when necessary.
  • Tables preserved as whole units, with metadata (the paragraphs preceding the table).
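A simplified sketch of such a chunking pass over the docling Markdown output; real heading detection, table metadata, and size limits would be more involved:

```python
def chunk_markdown(markdown: str, max_chars: int = 2000) -> list[str]:
    """Split a Markdown report into chapter-based chunks, keeping tables whole."""
    chunks: list[str] = []
    current: list[str] = []
    for line in markdown.splitlines():
        # Start a new chunk at each chapter heading to retain context.
        if line.startswith("#") and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
        # Split overly long chapters only at blank lines (paragraph boundaries);
        # table rows are never blank, so tables stay together as whole units.
        if line == "" and sum(len(part) for part in current) > max_chars:
            chunks.append("\n".join(current))
            current = []
    if current:
        chunks.append("\n".join(current))
    return chunks
```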

The chunks were embedded using OpenAI's "text-embedding-3-small" model. Based on the company identified by the router and its refined query, the vector database retrieved the top five relevant chunks per question.
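A hedged sketch of the embedding and retrieval step with OpenAI embeddings and Qdrant; the collection name, payload fields, and example texts are assumptions:

```python
from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

openai_client = OpenAI()
qdrant = QdrantClient(":memory:")  # in-memory instance, just for the sketch

qdrant.create_collection(
    collection_name="annual_reports",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),  # text-embedding-3-small dimension
)

def embed(texts: list[str]) -> list[list[float]]:
    response = openai_client.embeddings.create(model="text-embedding-3-small", input=texts)
    return [item.embedding for item in response.data]

# Index chunks together with company metadata so retrieval can stay company-specific.
chunks = ["Total assets amounted to ...", "Revenue in 2022 was ..."]  # placeholder chunks
qdrant.upsert(
    collection_name="annual_reports",
    points=[
        PointStruct(id=i, vector=vector, payload={"company": "Playtech", "text": chunk})
        for i, (vector, chunk) in enumerate(zip(embed(chunks), chunks))
    ],
)

# Retrieve the top five chunks for the router's extended subquery.
hits = qdrant.search(
    collection_name="annual_reports",
    query_vector=embed(["What were Playtech's total assets in 2022?"])[0],
    limit=5,
)
```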

Challenges

  • One PDF required manual reshaping due to an incompatible page format.

  • Two reports were too large (one had 1,043 pages); a quick fix was to process only the first 150 pages.

  • Parsing all PDFs took ~10 hours.

  • Some extracted chunks contained artefacts, requiring post-processing before embedding.

Future Improvements

A more structured, iterative development cycle with a test set could have helped fine-tune the custom chunking strategy and retrieval parameters (e.g., the number of retrieved chunks). A hybrid retrieval approach using keyword search in the PDF could have boosted results further: for example, a human could easily find the value of total assets by searching the PDF for “total assets”. Re-ranking the retrieved chunks with another agent could also be beneficial.
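As an illustration of such a keyword pass, a hedged sketch with pypdf (not part of the submitted pipeline):

```python
from pypdf import PdfReader

def pages_containing(pdf_path: str, keyword: str) -> list[int]:
    """Return 1-based page numbers whose extracted text contains the keyword."""
    reader = PdfReader(pdf_path)
    return [
        number + 1
        for number, page in enumerate(reader.pages)
        if keyword.lower() in (page.extract_text() or "").lower()
    ]

# Example: pages_containing("annual_report.pdf", "total assets")
```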

✅ Pros

  1. Structured chunking strategy – Custom chunking improves context retention and table preservation and offers full control.

  2. No need to load the full PDF into context – Only relevant chunks are retrieved, reducing token usage and avoiding API rate limits, enabling handling of lengthy reports.

❌ Cons

  1. Expensive pre-processing – Requires PDF parsing, markdown conversion, chunking, and embedding, increasing complexity and cost.

  2. Longer setup time – Initial document processing took ~10 hours.

  3. Chunk retrieval tuning required – The number of retrieved chunks, embedding quality, and query formulation need fine-tuning for optimal results.

  4. Not fully automated – Some manual fixes were needed for problematic PDFs (e.g., reshaping incompatible reports).

IBM Generation Approach

📌 Ranking: 94th (across all submissions)
📌 Code: GitHub Repository
📌 Keywords: multi-agent, dense retrieval, OpenAI GPT-4o, IBM

In this approach, the retrieved chunks from the dense retrieval pipeline were passed to IBM's "granite-20b-code-instruct" model, which served as the company-specialized agent.

  • This agent generated company-specific answers, but only as plain text; it was instructed to output JSON data, since structured outputs were not supported.

  • The output was then sent to a GPT-4o-based merger agent, which reformatted it into structured outputs suitable for final submission.

  • Chain of thought reasoning was applied at both stages.

  • References were not implemented due to lack of time, negatively impacting the retrieval score and hence the ranking of this approach.

✅ Pros

  1. More sophisticated retrieval method.

❌ Cons

  1. Expensive pre-processing due to PDF conversion, chunking, and embedding.

  2. The model does not support structured outputs, which makes data handling more challenging.

OpenAI Generation Approach

📌 Ranking: 28th Place (across all submissions)
📌 Code: GitHub Repository
📌 Keywords: multi-agent, dense retrieval, OpenAI GPT-4o

This approach is similar to the IBM-based one, but OpenAI's GPT-4o was used for the company-specific responses instead of IBM's model.

  • Retrieved chunks (along with the extended query) were sent to GPT-4o.

  • Generated responses included chunk IDs, which were then mapped back to PDF page numbers using fuzzy search (a sketch follows the list).

  • A merger model then compiled the final structured answer.
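A hedged sketch of how a retrieved chunk could be mapped back to a page number with fuzzy matching (using rapidfuzz here; the actual implementation may differ):

```python
from rapidfuzz import fuzz

def find_page(chunk_text: str, pages: dict[int, str]) -> int:
    """Return the page number whose extracted text best matches the chunk."""
    return max(pages, key=lambda number: fuzz.partial_ratio(chunk_text, pages[number]))

# pages: mapping of page number -> extracted page text (e.g., from pypdf)
# find_page("Total assets amounted to 1,234m EUR", pages)
```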

✅ Pros

More sophisticated retrieval method, offering higher accuracy.

❌ Cons

Expensive pre-processing due to PDF conversion, chunking, and embedding.

Gemini-based Retrieval + OpenAI-based Generation

📌 Ranking: 19th Place (across all submissions), 8th Place across teams
📌 Code: GitHub Repository
📌 Keywords: multi-agent, Gemini 2.0 Flash, OpenAI GPT-4o

Instead of relying on vector-based retrieval, this method directly queried Gemini 2.0 Flash with the full annual report of the relevant company in context.

Key Difference from the Naive Gemini Approach:

  • The router agent extended queries before passing them to Gemini.

  • Multi-company queries were split into separate subqueries, avoiding rate limits.

Challenges

  1. Rate limits still occurred for "Poste Italiane" due to its 1,000+ page report.

  2. A quick fix could have been to optimize or truncate the PDF, but this was not implemented due to time constraints.

✅ Pros

  1. No expensive pre-processing (faster & cheaper).

  2. Achieved the highest scores among tested methods.

❌ Cons

  • API rate limits still need to be handled for extremely large PDFs.

Final Overview & Thoughts

*) Rank on the leaderboard with all submissions; placed 8th on the final leaderboard when considering only the best submission of each team.

Key Takeaways

  • For simple use cases, Gemini 2.0 Flash is quick and easy but struggles with large reports and multi-company queries.

  • Dense retrieval provides a scalable and modular solution and reduces API costs by not requiring the full PDF in context, ultimately leading to the highest retrieval score among the approaches.

  • Multi-agent systems (especially OpenAI-based) achieve higher accuracy but require more processing time and cost.

  • Gemini + OpenAI merging performed best in rankings but still requires rate limit handling for massive PDFs.

  • Better final answer generation could have boosted scores. While the retrieval mechanism performed well (as indicated by the high retrieval score), iterative prompt engineering against a test set could have further improved the accuracy of generated answers.

The multi-agent approaches significantly improved retrieval accuracy and final answer quality compared to simpler single-agent methods. However, they also introduced higher costs and complexity due to their multi-step nature. The dense retrieval approach, while structured and scalable, further amplified these issues with its pre-processing and embedding overhead.

The Gemini-based retrieval with OpenAI generation eliminated many of these challenges by removing the need for dense retrieval, leading to the best scores with minimal setup - making it a highly efficient hybrid solution.

Despite the challenges and room for optimization, the ERC 2025 offered a great opportunity to prototype different solutions, experiment with LLM-powered retrieval, and learn a lot under real-world constraints. Getting my “hands dirty” with RAG techniques, multi-agent collaboration, and retrieval models in such a short timeframe was an amazing experience! 🚀