LLM Performance Series: Batching

Starting with the September TIMETOACT GROUP Austria LLM Benchmark, we put special emphasis on enterprise workloads. These include the types of LLM tasks that are common to business digitalisation at scale:

  • generation of marketing texts that present a specific product to a given audience within the JTBD (Jobs to be Done) framework;

  • information retrieval and Q&A systems;

  • CRM automation;

  • automated lead generation.

While many customers are happy using ChatGPT from OpenAI or Microsoft Azure, some are still interested in running models locally, completely under their control. This is what the benchmarks are for! They help track the improvement of SOTA (State of the Art) models and pick the best ones for the task at hand.

While cloud models are usually managed (pay-as-you-go), running models locally requires a different investment scheme. Companies either rent GPUs per hour or buy them and install them in their own servers.

Each of these options involves a different type of investment, and each comes with a different Return on Investment.

GPUs aren’t exactly cheap or easy to get these days. So we want our customers to make the best use of their investments.

This means that while evaluating models for different projects, we need to take into account not only their accuracy on specific tasks, but also their performance and cost. To help with that, we have added two corresponding columns: cost and speed.

Cost is a relative number that estimates how much money it would cost to run the entire benchmark on a given configuration. Since our benchmarks represent a mix of real business workloads, costs should help to compare different models.

For SaaS models (like OpenAI) we use the price per token (prompt + completion). For local models, we estimate how much it would cost to rent a sufficiently large GPU from a major cloud vendor for the duration of the benchmark.
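To make the two cost schemes concrete, here is a minimal sketch of how such estimates could be computed. The function names, prices, and durations are our own illustrative assumptions, not the benchmark's actual figures.

```python
def saas_cost(prompt_tokens: int, completion_tokens: int,
              price_prompt_per_1k: float, price_completion_per_1k: float) -> float:
    """Cost of a SaaS model run, priced per 1000 prompt/completion tokens."""
    return (prompt_tokens / 1000) * price_prompt_per_1k \
         + (completion_tokens / 1000) * price_completion_per_1k

def local_cost(benchmark_hours: float, gpu_rent_per_hour: float) -> float:
    """Cost of renting a GPU for the full duration of the benchmark."""
    return benchmark_hours * gpu_rent_per_hour

# Illustrative numbers only (hypothetical prices, not benchmark data):
print(saas_cost(120_000, 30_000, 0.03, 0.06))  # 5.4  (currency units)
print(local_cost(2.5, 2.0))                    # 5.0  (2.5 h at 2.0/h)
```

Both estimates land in the same unit, which is what makes SaaS and local configurations comparable in a single cost column.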

Speed represents the number of requests per second that we can get from a model while running benchmarks in single-inference mode (batch size of 1).

But can we get better results? Indeed, there are a number of performance optimisations that can further boost the performance and even the quality of these models. The first one is batching.

Performance optimisation: batching

GPU batching is based on the fact that a GPU is a very special piece of hardware. In certain cases, it doesn't care much whether it needs to process 1 request or 10, as long as there is enough memory. The computations will take roughly the same amount of time.

In other words, if we have 10 requests, we can run them on the same GPU in roughly the same amount of time as a single one.
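A simple way to build intuition for this is a toy timing model: one forward pass has a large fixed cost (launching kernels, reading the weights) that gets amortised over the batch, plus a small per-request cost. The numbers below are simulated stand-ins, not measurements.

```python
def simulated_forward_time(batch_size: int) -> float:
    """Toy timing model for one GPU forward pass: a large fixed cost
    amortised over the whole batch, plus a small per-request cost."""
    FIXED_OVERHEAD = 0.010   # s: kernel launches, reading weights once
    PER_REQUEST = 0.0002     # s: extra compute per batched request
    return FIXED_OVERHEAD + PER_REQUEST * batch_size

for batch_size in (1, 10, 100):
    t = simulated_forward_time(batch_size)
    print(f"batch={batch_size:>3}  time={t * 1000:.1f} ms  "
          f"requests/s={batch_size / t:.0f}")
```

Because the fixed overhead dominates, ten requests cost nowhere near ten times as much as one, so throughput grows rapidly with batch size until the per-request term (or memory) takes over.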

To illustrate the concept, we took a recently released Llama 2 model - Nous Hermes 13B. It is open for commercial use and strikes a good balance between accuracy and cost.

The workload involved generating 20 additional tokens in a text completion. The model was run on an Nvidia A100 80GB PCIe GPU using the HuggingFace transformers library. We tested batch sizes from 4 to 400 in steps of 4. Here are the results:

As you can see, adding more requests doesn't result in a proportional increase in processing time. This leads to ever-growing throughput, measured in tokens per second (TPS), until the number hits a ceiling around 2500 tokens per second.
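The throughput metric itself is simply the total number of generated tokens across the batch divided by wall-clock time. A small helper (our own formulation, not code from the benchmark; the elapsed times below are illustrative) makes that explicit:

```python
def tokens_per_second(batch_size: int, new_tokens: int, elapsed_s: float) -> float:
    """Total tokens generated across the whole batch per wall-clock second."""
    return batch_size * new_tokens / elapsed_s

# With 20 new tokens per request (elapsed times chosen for illustration):
print(tokens_per_second(1, 20, 0.1))     # 200.0  TPS at batch size 1
print(tokens_per_second(400, 20, 3.2))   # 2500.0 TPS at a large batch
```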

If we stayed at batch size of 1, we would’ve never made it past 200 tokens per second.

Note: GPU memory consumption jumps back and forth because memory gets released only when there is memory pressure. Watch the lower peaks to see the absolute minimum consumption.

Doesn't this mean that the highest batch size is always the best? Not necessarily. So far we have fixed the number of tokens for the LLM to generate at 20. In some tasks, like the generation of marketing texts, we would like to get more.

As we generate longer texts, GPU memory requirements tend to grow while the speed decreases. This comes from the fact that after generating one token, we need to pass the entire text back to the model to generate the next one. Rinse and repeat.
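This feedback loop is the essence of autoregressive decoding. A minimal sketch, with a hypothetical `next_token` function standing in for the model's forward pass:

```python
def next_token(sequence: list[int]) -> int:
    """Stand-in for a model forward pass. A real model would return the
    most likely next token id given the whole sequence so far."""
    return sequence[-1] + 1  # dummy rule, just to make the loop runnable

def generate(prompt: list[int], max_new_tokens: int) -> list[int]:
    sequence = list(prompt)
    for _ in range(max_new_tokens):
        # The entire sequence goes back into the model for every new token,
        # so each step works on a longer input than the last.
        sequence.append(next_token(sequence))
    return sequence

print(generate([1, 2, 3], 4))  # [1, 2, 3, 4, 5, 6, 7]
```

Each iteration re-reads a longer sequence, which is why both latency and memory use climb as the completion length grows.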

Let's see how our generation capabilities change as we prompt the model for more tokens.

The chart below shows that prompting for more tokens increases computation time. However, we can lower that time a bit by making our batch sizes smaller:

How does this translate into efficiency of the models? Here is another chart that compares throughput (tokens per second) of the same experiments:

The best throughput comes with a large batch size. However, as the completion length increases (more iterations are required), we need to lower the batch size in order to stay within a 10-second time budget for the entire completion.
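Given measurements like those in the chart, choosing a batch size becomes a lookup: take the largest batch whose completion still fits the latency budget. A sketch with made-up timing data (the table below is hypothetical; real values would come from profiling your own GPU):

```python
# Hypothetical completion times in seconds, keyed by
# (batch_size, completion_length). Not real benchmark data.
measured = {
    (400, 20): 3.2, (400, 200): 14.0,
    (100, 20): 1.1, (100, 200): 8.5,
    (20, 20): 0.4,  (20, 200): 4.0,
}

def best_batch_size(completion_length: int, budget_s: float = 10.0) -> int:
    """Largest measured batch size whose completion fits the time budget."""
    fitting = [b for (b, n), t in measured.items()
               if n == completion_length and t <= budget_s]
    return max(fitting)

print(best_batch_size(20))    # 400: short completions fit at full batch
print(best_batch_size(200))   # 100: longer completions force a smaller batch
```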

Depending on the expected prompt length in our solutions, we can optimise for the highest throughput and lowest latency, leading to better customer satisfaction and a better ROI. Charts like the one above help with such tasks.

But can we get even better performance? There are still many other knobs to tune, for example quantisation, speculative execution, and different inference runtimes. Each comes with its own trade-offs. Stay tuned!
