State of Fast Feedback in Data Science Projects

Let’s talk about productivity in Data Science and Machine Learning (DSML) projects.

DSML projects can be quite different from software projects: a lot of R&D in a rapidly evolving landscape, and work with data, distributions, and probabilities instead of just code. However, they have one thing in common: an iterative development process matters a lot.

For example, in software engineering, rapid iterations help a lot when debugging complex issues or working through a tricky problem. In product development, the ability to roll out a new version quickly can be a deal-breaker for customer satisfaction. Paul Graham covers this eloquently in his essay “Beating the Averages”.

Likewise, in Data Science and Machine Learning projects, iterations help data scientists rapidly test their theories and converge on a solution that creates value. If we assume that 87% of data science projects fail (which looks about right to me), then a fast feedback loop could help teams get to the successful 13% faster.

Yet the industry has a problem here.

Let’s use a basic data science pipeline as an example. It will have a predefined structure to make collaboration between different teams in the department easier.

The pipeline will include the following steps:

  1. Initialise the pipeline run, deriving any per-run variables from the initial config

  2. Load and prepare the training data

  3. Perform model training

  4. Evaluate the model on a separate dataset

  5. Prepare the model for use

  6. Run batch prediction against the resulting model

The de facto language for such pipelines is Python. We can provide a minimal implementation as a plain console application, like the sketch below, and run it locally. On my laptop it takes ~0.3-0.5 sec.
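
A minimal sketch of what such a console application could look like, assuming placeholder step implementations (the function names and the toy dataset are illustrative, not the actual project code):

```python
# Minimal local pipeline: each step is a plain function, chained in __main__.
# All step bodies are placeholders standing in for real data loading/training.
import time


def init_run(config: dict) -> dict:
    """Initialise the pipeline run, deriving per-run variables from the config."""
    return {**config, "run_id": "local-run"}


def load_data(run_config: dict) -> list:
    """Load and prepare the training data (toy dataset here)."""
    return [(x, 2 * x) for x in range(100)]


def train_model(dataset: list) -> dict:
    """Perform model training (stubbed out)."""
    return {"weights": [2.0]}


def evaluate_model(model: dict, dataset: list) -> float:
    """Evaluate the model on a separate dataset (placeholder metric)."""
    return 0.0


def prepare_model(model: dict) -> dict:
    """Prepare the model for use (serialisation, registration, etc.)."""
    return model


def batch_predict(model: dict, dataset: list) -> list:
    """Run batch prediction against the resulting model."""
    return [model["weights"][0] * x for x, _ in dataset]


if __name__ == "__main__":
    start = time.perf_counter()
    cfg = init_run({"learning_rate": 0.01})
    data = load_data(cfg)
    model = train_model(data)
    evaluate_model(model, data)
    packaged = prepare_model(model)
    batch_predict(packaged, data)
    print(f"Pipeline finished in {time.perf_counter() - start:.3f}s")
```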

That is good enough.

If the computation overhead of a real pipeline is 5 minutes, then we could run up to 12 iterations in an hour.

However, the industry-standard way to run these pipelines is via Kubeflow (an ML toolkit for Kubernetes). Google Vertex is one of the most stable implementations.

If we map our pipeline components to a Kubeflow pipeline, we’ll get something like this:
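
As a rough sketch, assuming the KFP v2 SDK (the component names and string-typed artifacts below are simplifications for illustration), the mapping could look like this:

```python
# Sketch of the same steps as a Kubeflow (KFP v2) pipeline.
# Each @dsl.component runs in its own container; bodies are placeholders.
from kfp import compiler, dsl


@dsl.component
def init_run(config: str) -> str:
    """Initialise the run, deriving per-run variables from the config."""
    return config


@dsl.component
def load_data(run_config: str) -> str:
    """Load and prepare the training data."""
    return "dataset-reference"


@dsl.component
def train_model(dataset: str) -> str:
    """Perform model training."""
    return "model-reference"


@dsl.component
def evaluate_model(model: str, dataset: str) -> float:
    """Evaluate the model on a separate dataset."""
    return 0.0


@dsl.component
def prepare_model(model: str) -> str:
    """Prepare the model for use."""
    return "packaged-model-reference"


@dsl.component
def batch_predict(model: str) -> str:
    """Run batch prediction against the resulting model."""
    return "predictions-reference"


@dsl.pipeline(name="basic-dsml-pipeline")
def basic_pipeline(config: str = "{}"):
    run = init_run(config=config)
    data = load_data(run_config=run.output)
    model = train_model(dataset=data.output)
    evaluate_model(model=model.output, dataset=data.output)
    packaged = prepare_model(model=model.output)
    batch_predict(model=packaged.output)


if __name__ == "__main__":
    # Compile to a spec that Vertex AI Pipelines (or any KFP backend) can run.
    compiler.Compiler().compile(basic_pipeline, "basic_pipeline.json")
```

Unlike the single local process above, each step now runs as a separate container on the cluster, so every step adds scheduling and artifact-passing overhead to the run.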

How many experiments per hour can we run here?

At this point, the computation overhead doesn’t even matter. Since a run takes 33 minutes, we can complete hardly more than one experiment per hour.

The execution takes ~5000x longer on Vertex than on a local machine (33 minutes is roughly 2000 seconds, versus ~0.4 seconds locally). Although that is paid compute time, the biggest hit is not financial but a loss of productivity.

And that is the most frustrating problem with the state of data science pipelines today. Major hosting players make more money from less efficient pipelines, which might reduce their incentive to prioritize performance improvements. This, in turn, hurts the ability of small data science teams to maintain fast feedback loops and innovate efficiently.
