Boosting speed of scikit-learn regression algorithms

When browsing the web, numerous posts can be found discussing techniques to speed up the training time of well-known machine learning algorithms. Surprisingly, there is limited information available regarding the prediction phase. However, from a practical standpoint, this aspect holds great relevance. Once a satisfactory regression algorithm has been trained, it is typically deployed in a real-world system. In such cases, the speed at which predictions are obtained becomes crucial. Faster prediction times enable real-time or near-real-time decision-making, enhance user experience in interactive applications, and facilitate efficient processing of large-scale datasets. Therefore, optimizing inference speed can have significant implications for various domains and applications.

The purpose of this blog post is to investigate the performance and prediction speed behavior of popular regression algorithms, i.e. models that predict numerical values based on a set of input variables. Considering that scikit-learn is the most extensively utilized machine learning framework [1], our focus will be on exploring methods to accelerate its algorithms' predictions.

Benchmarking regression algorithms

To assess the current state of popular regression algorithms, we selected four popular regression datasets from Kaggle [2], along with an internal dataset from our company. These datasets vary in sample size as well as in the number and type of features, so the benchmark captures performance across different data structures.

To ensure fair comparisons, we optimized each model's hyperparameters before testing to unlock its full potential. We will benchmark the following regression algorithms:

  • Linear regression

  • k-nearest neighbors (k-NN)

  • Support vector regression (SVR)

  • Random forest

  • Gradient boosted regression trees

  • Multilayer perceptron (MLP)

The different versions of regularized linear regression, such as lasso, ridge, and elastic net, are not analyzed separately as they were comparable to pure linear regression in terms of prediction speed and accuracy in a pre-evaluation step.

Prediction speed vs. accuracy

The plot below displays the benchmarking results on our company's internal dataset. We can observe a sweet spot in the bottom left, where both the error - measured via root mean square error (RMSE) - and prediction times are low. Simple neural networks (Multilayer Perceptron, MLP) and gradient boosted regression trees demonstrate good performance in both dimensions. Random forest also shows decent accuracy but has the highest prediction time. Other algorithms exhibit reasonable prediction speed but relatively high errors.

However, it is crucial to try different algorithms on the specific dataset at hand. Accuracy and prediction time depend heavily on the number of features used, their transformations, and the model's parameters. Linear models, for example, may perform well with properly transformed features, while larger MLPs might exhibit longer prediction times. Nevertheless, algorithms like random forest and k-NN are by construction expected to be slower at inference.

How to speed up inference

Generally, scikit-learn models are already optimized through compiled Cython extensions or optimized computing libraries [3]. However, there are additional ways to accelerate prediction latency, apart from using faster hardware. In this blog post, we’ll benchmark the following optimizations:

Data-based approaches:

  • Reduce the number of features

Implementation-based approaches:

  • Disable scikit-learn's finite-data validation

  • Use the Intel extension for scikit-learn

  • Use bulk prediction, i.e. predict many samples per call instead of one at a time

Furthermore, we want to mention the following optimization approaches, which we did not include in our benchmark, partly because they are problem-specific:

Data-based approaches:

Model-based approaches:

  • Reduce model complexity

Implementation-based approaches:

  • Implement the prediction step with given weights independently, potentially in a faster programming language, to avoid unnecessary overhead

  • Use cloud services for prediction, such as Google ML, Amazon ML or MS Azure
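The first of these two bullets can be sketched for a linear model: once training is done, the learned weights can be extracted and applied with a plain dot product, bypassing scikit-learn's per-call validation and dispatch overhead. This is a minimal illustration on synthetic data; fast_predict is our own helper, not a scikit-learn API.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X, y = rng.normal(size=(500, 4)), rng.normal(size=500)
model = LinearRegression().fit(X, y)

# Extract the learned weights once after training ...
coef, intercept = model.coef_, model.intercept_

def fast_predict(x):
    # ... then predict with a plain dot product, skipping input validation.
    return x @ coef + intercept

# Matches model.predict, without the per-call overhead.
assert np.allclose(fast_predict(X[:3]), model.predict(X[:3]))
```

The same idea ports directly to a faster language, since only the weight vector and intercept need to be exported.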

As you can see, there are numerous ways to influence inference time, ranging from fundamental design decisions to simple tricks. Changing the data structure or implementing an algorithm from scratch with efficiency in mind is more involved, while the implementation-based tricks can easily be applied even to existing systems that already use scikit-learn.

Note that none of the above approaches affects prediction quality, except for reducing the number of features and reducing model complexity. For these two, it is important to evaluate the trade-off between prediction speed and quality.

In this blog post, we mostly benchmark approaches that do not affect prediction quality, and therefore focus on evaluating the speedup in the next section.

Evaluating some speedup tricks

Check out the technical appendix to see how the time measurement is performed.

Reducing the number of features by half (in our case from 106 to 53) leads only to small reductions in inference time for KNN and SVR, while it has a major influence on the MLP. Disabling scikit-learn's finite-data validation, which takes just one line of code, improves prediction speed more noticeably: as can be seen below, inference time can be reduced by up to 40%, depending on the algorithm. Utilizing the Intel extension for scikit-learn, likewise a one-line change, results in substantial speed improvements for random forest, SVR, and the KNN regressor. For the latter two algorithms, a time reduction of more than 50% could be achieved, while for random forest, prediction time decreases by an impressive 98%. No values are shown for the remaining algorithms in the plots below, as the Intel extension does not currently support them.
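Both one-liners look as follows. This is a minimal sketch on synthetic data; the Intel patch is shown commented out because it requires the optional scikit-learn-intelex package and must run before the estimator imports.

```python
import numpy as np
from sklearn import set_config
from sklearn.linear_model import LinearRegression

# One-liner 1: skip the finite-data validation performed on every predict call.
# Only safe if your inputs are guaranteed to contain no NaN/inf values.
set_config(assume_finite=True)

# One-liner 2 (requires scikit-learn-intelex, must run before estimator imports):
# from sklearnex import patch_sklearn; patch_sklearn()

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 10)), rng.normal(size=1000)
model = LinearRegression().fit(X, y)
preds = model.predict(X[:5])  # now runs without the finiteness check
```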

As can be seen below, the greatest potential lies in bulk inference. By predicting several samples simultaneously (here: 100 or 1000 samples at once), the average prediction time per sample decreases significantly for most of the algorithms. Overall, bulk prediction can yield up to 200-fold speedups in this test setting. The approach is particularly effective for the MLP as well as for linear and tree-based methods.
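The effect is easy to reproduce. The sketch below, using synthetic data and a small gradient boosted model, compares the average per-sample time of single-sample calls against one bulk call:

```python
import timeit
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(42)
X, y = rng.normal(size=(2000, 8)), rng.normal(size=2000)
model = GradientBoostingRegressor(n_estimators=50, random_state=0).fit(X, y)

# Atomic prediction: one sample per call (note the 2-D slice X[:1], not X[0]).
t_atomic = timeit.timeit(lambda: model.predict(X[:1]), number=100) / 100

# Bulk prediction: 1000 samples per call, averaged per sample.
t_bulk = timeit.timeit(lambda: model.predict(X[:1000]), number=100) / 100 / 1000

print(f"per-sample time  atomic: {t_atomic:.2e}s  bulk: {t_bulk:.2e}s")
```

The per-call overhead (input validation, dispatch) is paid once per batch instead of once per sample, which is where the speedup comes from.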

Summary

Fast predictions are crucial for various use cases, in particular when it comes to real-time predictions. Moreover, investing in efficiency always pays off by reducing energy consumption, thus saving money and at the same time lowering carbon emissions.

In this blog post we have explored multiple ways to achieve faster prediction times. Firstly, the dimensionality of the data and the chosen algorithm have a major influence on inference speed and scaling behaviour. However, various tricks can accelerate even existing scikit-learn code. Disabling scikit-learn's finite data validation or utilizing the Intel extension for supported algorithms can already yield considerable improvements, depending on the algorithm. The most substantial gains, however, come from addressing fundamental aspects such as reducing the number of features (in particular for high-dimensional data), implementing bulk prediction, or writing a custom prediction routine. These strategies can speed up inference by factors of several hundred.

In our small test setting, we additionally showed that a small neural network, a gradient boosted regressor, and a random forest appear to be the most promising choices in terms of both accuracy and speed when the above-mentioned speedup tricks are applied.


Sources

[1] Kaggle: State of Machine Learning and Data Science 2020, https://storage.googleapis.com/kaggle-media/surveys/Kaggle%20State%20of%20Machine%20Learning%20and%20Data%20Science%202020.pdf

[2] Kaggle datasets: House Sales in King County, USA; Red Wine Quality; Avocado Prices; Medical Cost Personal Datasets

[3] Computing with scikit-learn — scikit-learn 0.23.2 documentation

 


Technical Appendix

Speedtests were performed with all unnecessary background processes stopped.

 

Inference time measurement for one test sample (“atomic prediction”):

import timeit
import numpy as np

n = 500  # number of consecutive runs
r = 10   # number of repeats of the above

# Note the 2-D slice X_test[0:1]: predict expects a 2-D array, not a single row.
pred_times = timeit.repeat(stmt=lambda: model.predict(X_test[0:1]),
  repeat=r, number=n)
pred_times = np.array(pred_times) / n  # divide by number of runs
pred_time = np.min(pred_times)  # take minimum of all repetitions

Inference time measurement for several samples at once (“bulk prediction”):

n = 50 # number of consecutive runs
r = 5 # number of repeats of above

X_test_sample = X_test[0:1000] # 100 or 1000
pred_times = timeit.repeat(stmt=lambda: model.predict(X_test_sample), 
  repeat=r, number=n)
pred_times = np.array(pred_times) / n # divide by number of runs
pred_times = pred_times / len(X_test_sample) # divide by number of samples
pred_time = np.min(pred_times) # take minimum of all repetitions

Here, “model” refers to the scikit-learn models mentioned above, trained on the first 10,000 observations of the “house sales” data using default model parameters.

 

Versions used:

  • Python: 3.9.7

  • Scikit-learn: 1.0.2

  • Scikit-learn-intelex: 2021.20210714.120553
