Boosting speed of scikit-learn regression algorithms

Numerous posts on the web discuss techniques for speeding up the training time of well-known machine learning algorithms. Surprisingly, there is little information available about the prediction phase. Yet from a practical standpoint, this aspect is highly relevant. Once a satisfactory regression algorithm has been trained, it is typically deployed in a real-world system. In such cases, the speed at which predictions are obtained becomes crucial. Faster prediction times enable real-time or near-real-time decision-making, enhance user experience in interactive applications, and facilitate efficient processing of large-scale datasets. Therefore, optimizing inference speed can have significant implications for various domains and applications.

The purpose of this blog post is to investigate the performance and prediction speed behavior of popular regression algorithms, i.e. models that predict numerical values based on a set of input variables. Considering that scikit-learn is the most extensively utilized machine learning framework [1], our focus will be on exploring methods to accelerate its algorithms' predictions.

Benchmarking regression algorithms

To assess the current state of popular regression algorithms, we selected four popular regression datasets from Kaggle [2], along with an internal dataset from our company. These datasets vary in sample size as well as in the number and type of features, capturing performance across different data structures.

To ensure fair comparisons, we need to optimize hyperparameters before testing to unlock the models' full potential. We will benchmark the following regression algorithms:

The different versions of regularized linear regression, such as lasso, ridge, and elastic net, are not analyzed separately as they were comparable to pure linear regression in terms of prediction speed and accuracy in a pre-evaluation step.
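To give an idea of the kind of hyperparameter optimization performed before benchmarking, here is a minimal sketch using scikit-learn's grid search with an RMSE-based score. The synthetic data and the parameter grid are illustrative only, not the ones used in the benchmark:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Illustrative synthetic regression data
rng = np.random.default_rng(0)
X, y = rng.normal(size=(300, 8)), rng.normal(size=300)

# Cross-validated grid search; scoring uses negated RMSE so that
# "higher is better", as scikit-learn expects
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [None, 10]},
    scoring="neg_root_mean_squared_error",
    cv=3,
)
search.fit(X, y)
best_model = search.best_estimator_  # tuned model used for benchmarking
```

The same pattern applies to each benchmarked algorithm with its own parameter grid.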

Prediction speed vs. accuracy

The plot below displays the benchmarking results on our company's internal dataset. We can observe a sweet spot in the bottom left, where both the error - measured via root mean square error (RMSE) - and prediction times are low. Simple neural networks (Multilayer Perceptron, MLP) and gradient boosted regression trees demonstrate good performance in both dimensions. Random forest also shows decent accuracy but has the highest prediction time. Other algorithms exhibit reasonable prediction speed but relatively high errors.

However, it is crucial to try different algorithms on the specific dataset at hand. Accuracy and prediction time heavily depend on the number of features used, their transformations, as well as the model's parameters. Linear models, for example, may perform well with properly transformed features, while larger MLPs might exhibit longer prediction times. Nevertheless, algorithms like random forest and k-NN are by construction expected to be slower at inference.

How to speed up inference

Generally, scikit-learn models are already optimized through compiled Cython extensions or optimized computing libraries [3]. However, there are additional ways to accelerate prediction latency, apart from using faster hardware. In this blog post, we’ll benchmark the following optimizations:

Data-based approaches:

Implementation-based approaches:

Furthermore, we want to mention the following optimization approaches, which we did not include in our benchmark, partly because they are problem-specific:

Data-based approaches:

Model-based approaches:

Implementation-based approaches:

  • Implement the prediction step with given weights independently, potentially in a faster programming language, to avoid unnecessary overhead

  • Use cloud services for prediction, such as Google ML, Amazon ML or MS Azure

As you can see, there are numerous ways to influence inference time, ranging from fundamental approaches to simple tricks. Changing the data structure and implementing algorithms from scratch optimized for efficiency may be more involved, while the latter approaches can be easily applied even to existing systems that use scikit-learn.

Note that none of the above approaches affects prediction quality, with the exception of reducing the number of features and the model complexity. For these two, it is important to evaluate the trade-off between prediction speed and quality.

In this blog post, we mostly benchmark approaches that do not affect prediction quality, and therefore focus on evaluating the speedup in the next section.

Evaluating some speedup tricks

Check out the technical appendix to see how the time measurement is performed.

Reducing the number of features by half (in our case from 106 to 53 features) only leads to small decreases in inference time for KNN and SVR, while it has a major influence on the MLP. Disabling scikit-learn's finiteness check, which takes just one line of code, improves prediction speed more significantly: as can be seen below, inference time can be reduced by up to 40% depending on the algorithm. Utilizing the Intel extension for scikit-learn, also requiring only one line of code, results in substantial speed improvements for random forest, SVR, and the KNN regressor. For the latter two algorithms, a time reduction of more than 50% could be achieved, while for random forest, prediction time decreases by an impressive 98%. In the plots below, no values are shown for the other algorithms, as the Intel extension currently does not support them.
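Both one-liners look as follows in practice. The sketch below uses an illustrative linear model on synthetic data; note that disabling the finiteness check skips scikit-learn's validation that inputs contain no NaN or infinite values, so it is only safe when you can guarantee clean data:

```python
import numpy as np
import sklearn
from sklearn.linear_model import LinearRegression

# Illustrative model on synthetic data
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(1000, 10)), rng.normal(size=1000)
model = LinearRegression().fit(X_train, y_train)

# One-liner 1: disable the finiteness check (no NaN/inf validation)
sklearn.set_config(assume_finite=True)

# One-liner 2: the Intel extension is enabled by patching scikit-learn
# *before* importing the estimators (requires scikit-learn-intelex):
#   from sklearnex import patch_sklearn
#   patch_sklearn()

pred = model.predict(X_train[0:1])  # validation overhead now skipped
```

Alternatively, `sklearn.config_context(assume_finite=True)` limits the disabled check to a `with` block instead of changing the global configuration.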

As can be seen below, the greatest potential lies in bulk inference. By predicting several samples simultaneously (here: 100 or 1000 samples at once), the average prediction time per sample decreases significantly for most of the algorithms. Overall, bulk prediction can lead to up to 200-fold speed increases in this test setting. This approach is particularly effective for the MLP as well as for linear and tree-based methods, greatly accelerating their performance.
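The effect is easy to reproduce. The sketch below compares the per-sample cost of atomic and bulk prediction for an illustrative ridge regression on synthetic data; the per-call overhead (input validation, array setup) is paid once per `predict` call, so spreading it over 1000 samples shrinks the per-sample time dramatically:

```python
import timeit

import numpy as np
from sklearn.linear_model import Ridge

# Illustrative model on synthetic data
rng = np.random.default_rng(0)
X, y = rng.normal(size=(2000, 10)), rng.normal(size=2000)
model = Ridge().fit(X, y)

single = X[0:1]     # atomic prediction: one sample per call
batch = X[0:1000]   # bulk prediction: 1000 samples per call

t_atomic = timeit.timeit(lambda: model.predict(single), number=200) / 200
t_bulk = timeit.timeit(lambda: model.predict(batch), number=200) / 200
per_sample_bulk = t_bulk / len(batch)  # amortized cost per sample
```

On typical hardware, `per_sample_bulk` comes out orders of magnitude below `t_atomic`, in line with the benchmark results above.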

Summary

Fast predictions are crucial for various use cases, in particular real-time predictions. Moreover, investing in efficiency usually pays off: it reduces energy consumption, saving money while lowering carbon emissions.

In this blog post we have explored multiple ways to achieve faster prediction times. Firstly, the dimensionality of the data and the chosen algorithm have a major influence on inference speed and scaling behaviour. However, various tricks can accelerate even existing scikit-learn code. Disabling scikit-learn's finite-data validation or utilizing the Intel extension for supported algorithms can already yield considerable improvements, depending on the algorithm. The most substantial gains, however, come from addressing fundamental aspects, such as reducing the number of features (in particular for high-dimensional data), implementing bulk prediction, or using custom prediction methods. These strategies can result in speed increases by factors of several hundred.

In our small test setting, we additionally showed that a small neural network, a gradient boosted regressor, and a random forest appear to be the most promising choices in terms of both accuracy and speed when the above-mentioned speedup tricks are applied.


Sources

[1] https://storage.googleapis.com/kaggle-media/surveys/Kaggle%20State%20of%20Machine%20Learning%20and%20Data%20Science%202020.pdf 

[2] House sales: House Sales in King County, USA; red wine quality: Red Wine Quality; avocado prices: Avocado Prices; medical insurance costs: Medical Cost Personal Datasets

[3] Computing with scikit-learn — scikit-learn 0.23.2 documentation

 


Technical Appendix

Speedtests were performed with all unnecessary background processes stopped.

 

Inference time measurement for one test sample (“atomic prediction”):

import timeit
import numpy as np

n = 500 # number of consecutive runs
r = 10 # number of repeats of the above

pred_times = timeit.repeat(stmt=lambda: model.predict(X_test[0:1]),
  repeat=r, number=n)
pred_times = np.array(pred_times) / n # divide by number of runs
pred_time = np.min(pred_times) # take minimum of all repetitions

Inference time measurement for several samples at once (“bulk prediction”):

n = 50 # number of consecutive runs
r = 5 # number of repeats of above

X_test_sample = X_test[0:1000] # 100 or 1000
pred_times = timeit.repeat(stmt=lambda: model.predict(X_test_sample), 
  repeat=r, number=n)
pred_times = np.array(pred_times) / n # divide by number of runs
pred_times = pred_times / len(X_test_sample) # divide by number of samples
pred_time = np.min(pred_times) # take minimum of all repetitions

Here, “model” refers to the scikit-learn models mentioned above, each trained on the first 10,000 observations of the “house sales” data using default model parameters.

 

Versions used:

  • Python: 3.9.7

  • Scikit-learn: 1.0.2

  • Scikit-learn-intelex: 2021.20210714.120553
