
Answering Business Questions with LLMs
- PDF extraction is tough: Annual reports contain vital data, but extracting structured information automatically from PDFs is complex due to their human-centric formatting.
- Retrieval quality is crucial: AI-generated answers strongly depend on correctly finding the relevant pages within large documents.
- Approach comparison:
  - Naive approach (Gemini 2.0 Flash): Quick and easy, but limited with large files or multiple documents.
  - Dense retrieval (vector databases): Highly accurate and scalable, but requires extensive preprocessing.
  - Multi-agent systems: Improved accuracy by combining multiple AI models, at the cost of added complexity and expense.
- Best performing solution: Combining Gemini for retrieval and OpenAI for answer generation delivered the best balance of accuracy and ease of use, ranking 8th on the final leaderboard.

Final leaderboard with TAT employees' submissions marked. Hours: time it took the team to produce the results; R: retrieval score (max. 100); G: generation score (max. 100); Score: final score (R/3 + G, max. 133); AI: team leveraged Rinat Abdullin's AI research (through communities or TimeToAct); Lcl: a local model was used.
The Enterprise RAG Challenge, held on 27.02.2025, required answering company-specific questions about annual reports using large language models (LLMs) and sophisticated retrieval-augmented generation (RAG) techniques.
Participants must extract precise information from extensive PDF documents, which often span 60 to 100 pages, contain complex formatting, and include tables and illustrations (as you can see below). Parsing data from PDFs is a challenge in itself. These files are optimized for visual presentation rather than structured data extraction, making it difficult to retrieve relevant text and numerical information accurately.

Example of a challenging PDF page to parse (Playtech plc Annual Report and Financial Statements 2022, p. 62)
Here are some example questions from the challenge to be answered with such PDFs:

A successful submission must not only provide an accurate answer but also reference the page numbers containing the relevant information. This enables two evaluation criteria (as can be seen in the leaderboard above):
- Generation Score (G): Measures the correctness of the extracted answer.
- Retrieval Score (R): Assesses whether the retrieval mechanism identified the correct locations in the document.
The generation score is inherently constrained by the retrieval score - if the wrong pages are extracted, even the best model cannot produce the correct answer. A higher retrieval score increases the likelihood of accurate answers.
A properly formatted submission looks like this:

To generate high-quality answers with reliable references, I built several LLM-powered approaches. You can find my implementations and experiments in the source code repository.
Naive Gemini 2.0 Flash Approach
📌 Ranking: 57th Place (across all submissions)
📌 Code: GitHub Repository
📌 Keywords: Gemini 2.0 Flash, chain of thought, structured outputs
Overview:
For the first, naive approach, I took inspiration from a blog post by Philipp Schmid. The key idea was to leverage the one-million-token context window of Gemini 2.0 Flash, which is free within certain usage limits (details).
Google's genAI package enables direct PDF file uploads into the model's context. This meant that each query was passed to Gemini along with a custom system prompt and the full PDF file, allowing the model to process the document in a single step.
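A minimal sketch of this single-step setup with the google-genai SDK is shown below. The file name, API key placeholder, and question are illustrative, and exact argument names (e.g. of files.upload) can vary slightly between SDK versions.

```python
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")

# Upload the full annual report once; the File API keeps it available for prompts.
report = client.files.upload(file="annual_report.pdf")

# Pass the uploaded PDF plus the question in a single request.
response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=[
        report,
        "What was the total revenue in FY2022? "
        "Reply with the value and the page numbers you used.",
    ],
)
print(response.text)
```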
Applied LLM Techniques:
To ensure machine-readable responses, I enforced structured outputs (SO), which included:
- Chain of thought (CoT) reasoning: Encourages the model to explicitly think through the steps before generating an answer. This also helps with debugging and refining prompts.
- Final answer: The extracted response.
- Reference list: Page numbers where relevant information was found.
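A sketch of such a structured-output schema is given below. The field names are my own, not the original code; the google-genai SDK accepts a Pydantic model as response schema, which enforces the CoT / answer / references structure.

```python
from pydantic import BaseModel
from google import genai

class Answer(BaseModel):
    chain_of_thought: str   # explicit reasoning steps, useful for prompt debugging
    final_answer: str       # the extracted value or statement
    references: list[int]   # page numbers supporting the answer

client = genai.Client(api_key="YOUR_API_KEY")
response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=["<question and uploaded PDF go here>"],
    config={
        "response_mime_type": "application/json",
        "response_schema": Answer,
    },
)
answer = Answer.model_validate_json(response.text)
```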
Challenges
One major limitation was API rate limits and file size constraints. Some requests, especially those involving multiple companies, failed because they required loading five separate PDFs into context - exceeding the model’s capacity.
As Philipp Schmid outlined, optimizing PDFs before upload or splitting large files could have helped mitigate these issues. However, due to time constraints, I was unable to iterate systematically on a test set or analyze CoT outputs for fine-tuned prompt engineering. This limited the ability to refine the model’s reasoning and response accuracy.
Pros & Cons
✅ Pros
- Easy setup – No pre-processing required.
- Cost-effective – Free within usage limits.
- Fast deployment – Minimal development effort.
❌ Cons
- Rate limits & file size issues – Needs optimization for large or multi-document queries.
- Lack of iterative refinement – No systematic debugging or CoT analysis.

Multi-agent Approaches
While the naive approach provided an easy way to process PDFs, it struggled with multi-company queries and API rate limits. To improve retrieval accuracy and scalability, I experimented with more advanced multi-agent systems.
The process follows a structured pipeline:
1. Routing: An OpenAI GPT-4o-based router agent first identifies the relevant companies in a given query and formulates an extended subquery for each.
2. Company-specific retrieval: Specialized agents fetch relevant information for each company.
3. Merging: A merger agent compiles all responses into one final structured answer.
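Below is a hedged sketch of the routing step using OpenAI structured outputs. The schema, prompt wording, and example question are assumptions, not the original implementation.

```python
from openai import OpenAI
from pydantic import BaseModel

class Subquery(BaseModel):
    company: str            # company whose report must be consulted
    extended_query: str     # reformulated, self-contained question for that company

class RouterOutput(BaseModel):
    subqueries: list[Subquery]

client = OpenAI()
completion = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Identify every company mentioned in the question "
                                      "and write one extended subquery per company."},
        {"role": "user", "content": "Which company reported higher total assets in 2022: "
                                    "Playtech or Poste Italiane?"},
    ],
    response_format=RouterOutput,
)
plan = completion.choices[0].message.parsed
# Each subquery is then handled by a company-specific retrieval agent,
# and the per-company answers are passed to the merger agent.
```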
Dense Retrieval
📌 Keywords: Custom chunking, Qdrant vector database, query extension, chain of thought, structured outputs
For this approach, each company-specialized agent used the Qdrant vector database for retrieval. PDFs were first converted to Markdown using docling, which offers advanced PDF understanding and parsing, especially for tables.
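The conversion step roughly follows docling's basic usage; the file paths below are placeholders.

```python
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("reports/playtech_2022.pdf")

# Export the parsed document (including tables) as Markdown for chunking.
markdown = result.document.export_to_markdown()
with open("parsed/playtech_2022.md", "w", encoding="utf-8") as f:
    f.write(markdown)
```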
Custom Chunking Strategy
The extracted Markdown content was split into custom chunks:
- Chapter-based segmentation to retain context.
- Paragraph-based splitting, only when necessary.
- Tables preserved as whole units, together with metadata (the paragraphs preceding the table).
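The following is a simplified illustration of that chunking idea, not the original implementation: split at Markdown headings, keep table-bearing sections intact, and fall back to paragraph splitting only for oversized sections.

```python
import re

def chunk_markdown(md: str, max_chars: int = 4000) -> list[str]:
    chunks = []
    # Chapter-level segmentation: split at Markdown headings (levels 1-3).
    sections = re.split(r"\n(?=#{1,3} )", md)
    for section in sections:
        if "|" in section and "---" in section:
            # Section contains a Markdown table: keep it whole, including the
            # preceding context paragraphs as lightweight metadata.
            chunks.append(section)
        elif len(section) <= max_chars:
            chunks.append(section)
        else:
            # Paragraph-based splitting only when a section is too long.
            chunks.extend(p for p in section.split("\n\n") if p.strip())
    return [c for c in chunks if c.strip()]
```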
The chunks were embedded using OpenAI's "text-embedding-3-small" model. Based on the company identified by the router and its refined query, the vector database retrieved the top five relevant chunks per question.
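A sketch of the indexing and top-five retrieval with qdrant-client is shown below; collection names, payload fields, and the in-memory client are assumptions for illustration.

```python
from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

openai_client = OpenAI()
qdrant = QdrantClient(":memory:")  # placeholder; a real deployment would use a server URL

def embed(texts: list[str]) -> list[list[float]]:
    resp = openai_client.embeddings.create(model="text-embedding-3-small", input=texts)
    return [item.embedding for item in resp.data]

# Index the chunks of one company (done once during pre-processing).
qdrant.create_collection(
    collection_name="playtech",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)
chunks = ["...chunk text...", "...another chunk..."]
qdrant.upsert(
    collection_name="playtech",
    points=[
        PointStruct(id=i, vector=vec, payload={"text": chunk})
        for i, (chunk, vec) in enumerate(zip(chunks, embed(chunks)))
    ],
)

# At query time: retrieve the top five chunks for the router's extended query.
hits = qdrant.search(
    collection_name="playtech",
    query_vector=embed(["What were the total assets in 2022?"])[0],
    limit=5,
)
context = "\n\n".join(hit.payload["text"] for hit in hits)
```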

Challenges
- One PDF required manual reshaping due to an incompatible page format.
- Two reports were too large (one had 1,043 pages); a quick fix was to process only the first 150 pages.
- Parsing all PDFs took ~10 hours.
- Some extracted chunks contained artefacts, requiring post-processing before embedding.
Future Improvements
A more structured, iterative development cycle with a test set could have helped fine-tune the custom chunking strategy and the retrieval parameters (e.g., the number of retrieved chunks). A hybrid retrieval approach that adds keyword search within the PDF could have further boosted results: a human could easily find the value of total assets simply by searching for "total assets" in the PDF. Re-ranking the retrieved chunks with another agent could also have been beneficial. A hypothetical sketch of the hybrid idea follows below.
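This hybrid idea was not implemented; the sketch below only illustrates the concept of keyword pre-filtering followed by dense re-ranking, with all names chosen for illustration.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def hybrid_retrieve(query: str, keyword: str, chunks: list[str], top_k: int = 5) -> list[str]:
    # Keyword pre-filter (e.g. "total assets"), falling back to all chunks if nothing matches.
    candidates = [c for c in chunks if keyword.lower() in c.lower()] or chunks
    # Dense re-ranking of the surviving candidates by cosine similarity to the query.
    q_vec = embed([query])[0]
    c_vecs = embed(candidates)
    scores = c_vecs @ q_vec / (np.linalg.norm(c_vecs, axis=1) * np.linalg.norm(q_vec))
    order = np.argsort(scores)[::-1][:top_k]
    return [candidates[i] for i in order]
```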
✅ Pros
- Structured chunking strategy – Custom chunking improves context retention and table preservation and offers full control.
- No need to load the full PDF into context – Only relevant chunks are retrieved, reducing token usage, avoiding API rate limits, and enabling the handling of lengthy reports.
❌ Cons
- Expensive pre-processing – Requires PDF parsing, Markdown conversion, chunking, and embedding, increasing complexity and cost.
- Longer setup time – Initial document processing took ~10 hours.
- Chunk retrieval tuning required – The number of retrieved chunks, embedding quality, and query formulation need fine-tuning for optimal results.
- Not fully automated – Some manual fixes were needed for problematic PDFs (e.g., reshaping incompatible reports).
IBM Generation Approach
📌 Ranking: 94th Place (across all submissions)
📌 Code: GitHub Repository
📌 Keywords: multi-agent, dense retrieval, OpenAI GPT-4o, IBM
In this approach, the retrieved chunks from the dense retrieval pipeline were passed to IBM's "granite-20b-code-instruct" model, which served as the company-specialized agent.
This agent generated company-specific answers as plain text only; it was instructed to output JSON, since structured outputs were not supported.
The output was then sent to a GPT-4o-based merger agent, which reformatted it into structured outputs suitable for final submission.
Chain of thought reasoning was applied at both stages.
References were not implemented due to lack of time, negatively impacting the retrieval score and hence the ranking of this approach.
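The sketch below illustrates this "plain-text JSON" workaround. The granite call itself is abstracted behind a hypothetical call_granite() helper (the real code used IBM's API client), and the prompt template and field names are assumptions.

```python
import json
import re

PROMPT_TEMPLATE = (
    "Answer the question using only the context below.\n"
    "Respond with a single JSON object of the form "
    '{{"reasoning": "...", "answer": "..."}}\n\n'
    "Context:\n{context}\n\nQuestion: {question}"
)

def parse_loose_json(raw: str) -> dict:
    """Extract the first JSON object from free-form model output, with a fallback."""
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if not match:
        return {"reasoning": "", "answer": raw.strip()}
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return {"reasoning": "", "answer": raw.strip()}

# raw = call_granite(PROMPT_TEMPLATE.format(context=chunks, question=question))  # hypothetical helper
# parsed = parse_loose_json(raw)   # then handed to the GPT-4o merger agent
```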
✅ Pros
- More sophisticated retrieval method.
❌ Cons
- Expensive pre-processing due to PDF conversion, chunking, and embedding.
- No structured outputs supported by the model, which makes data handling more challenging.
OpenAI Generation Approach
📌 Ranking: 28th Place (across all submissions)
📌 Code: GitHub Repository
📌 Keywords: multi-agent, dense retrieval, OpenAI GPT-4o
Similar to the IBM-based approach, but instead of IBM’s model, OpenAI’s GPT-4o was used for company-specific responses.
- Retrieved chunks (along with the extended query) were sent to GPT-4o.
- Generated responses included chunk IDs, which were then mapped back to PDF page numbers using fuzzy search.
- A merger model then compiled the final structured answer.
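The page mapping can be sketched as a fuzzy match of the referenced chunk text against per-page text. rapidfuzz is used here as one possible library, and page_texts (page number to raw page text) is an assumed input.

```python
from rapidfuzz import fuzz

def find_page(chunk_text: str, page_texts: dict[int, str]) -> int:
    """Return the page whose text best matches the chunk (partial fuzzy ratio)."""
    best_page, best_score = -1, -1.0
    for page, text in page_texts.items():
        score = fuzz.partial_ratio(chunk_text[:500], text)
        if score > best_score:
            best_page, best_score = page, score
    return best_page
```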
✅ Pros
- More sophisticated retrieval method, offering higher accuracy.
❌ Cons
- Expensive pre-processing due to PDF conversion, chunking, and embedding.

Gemini-based Retrieval + OpenAI-based Generation
📌 Ranking: 19th Place (across all submissions), 8th Place across teams
📌 Code: GitHub Repository
📌 Keywords: multi-agent, Gemini 2.0 Flash, OpenAI GPT-4o
Instead of relying on vector-based retrieval, this method directly queried Gemini 2.0 Flash with the full annual report of the relevant company in context.
Key Difference from the Naive Gemini Approach:
- The router agent extended queries before passing them to Gemini.
- Multi-company queries were split into separate subqueries, avoiding rate limits.
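A sketch of this pipeline is shown below. It assumes the router has already produced (company, extended query) pairs and that each company's report has been uploaded once; router, reports, and merge_with_gpt4o are hypothetical names standing in for the actual components.

```python
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")

def ask_company(report_file, extended_query: str) -> str:
    """One Gemini call per company keeps each request within rate and size limits."""
    response = client.models.generate_content(
        model="gemini-2.0-flash",
        contents=[report_file, extended_query],
    )
    return response.text

# subqueries = router(question)       # e.g. [("Playtech", "What were Playtech's total assets in 2022?"), ...]
# partial_answers = [ask_company(reports[company], query) for company, query in subqueries]
# final_answer = merge_with_gpt4o(question, partial_answers)   # GPT-4o merger agent
```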
Challenges
Rate limits still occurred for "Poste Italiane" due to its 1,000+ page report.
A quick fix could have been to optimize or truncate the PDF, but this was not implemented due to time constraints.
✅ Pros
- No expensive pre-processing (faster & cheaper).
- Achieved the highest scores among the tested methods.
❌ Cons
- Still needs handling of API rate limits for extremely large PDFs.

Final Overview & Thoughts

*) Rank on the leaderboard across all submissions; placed 8th on the final leaderboard when considering only the best submission of each team.
Key Takeaways
- For simple use cases, Gemini 2.0 Flash is quick and easy but struggles with large reports and multi-company queries.
- Dense retrieval provides a scalable and modular solution, reducing API costs by not requiring the full PDF in context, and ultimately led to the highest retrieval score among the approaches.
- Multi-agent systems (especially OpenAI-based) achieve higher accuracy but require more processing time and cost.
- Gemini + OpenAI merging performed best in the rankings but still requires rate limit handling for massive PDFs.
- Better final answer generation could have boosted scores: while the retrieval mechanism performed well (as indicated by the high retrieval score), prompt engineering within an iterative assessment loop could have further improved the accuracy of the generated answers.
The multi-agent approaches significantly improved retrieval accuracy and final answer quality compared to simpler single-agent methods. However, they also introduced higher costs and complexity due to their multi-step nature. The dense retrieval approach, while structured and scalable, further amplified these issues with its pre-processing and embedding overhead.
The Gemini-based retrieval with OpenAI generation eliminated many of these challenges by removing the need for dense retrieval, leading to the best scores with minimal setup - making it a highly efficient hybrid solution.