
2. Task Description

Konstantinos Loizas edited this page Jan 19, 2022 · 3 revisions

Daimler AG, a group of companies, is one of the world's leading organizations in the mobility sector. The company recognized that free and open-source software (FOSS) has become a key component of its products and decided to contribute to the community by participating in various global projects (https://github.com/Daimler/daimler-foss#daimler-group-foss-projects). Furthermore, many of the Daimler Group companies are now developing open-source software products of their own. Mercedes Benz AG, the most recognizable Daimler brand, launched several FOSS projects to benefit from open-source development and to give back to the international open-source community.

The Data & Analytics platform (hereafter referred to as DnA or the “platform”) is one of the first FOSS projects of Mercedes Benz. At the company level, the main objective of this product is to enable everyone, Data Scientists and non-specialists alike, to create Machine Learning and/or Deep Learning (ML/DL) models in an effective, efficient and compliant way. More specifically, the DnA platform aims to provide a toolkit that allows anyone to create, manage and share ML solutions without spending time on configuration steps that are unrelated to ML. When the platform is used internally, this also includes GDPR-compliant data access.

The application consists of two main parts:

  1. The Solution section, where you can create, manage and share solutions, and access other solutions.
  2. The Workspace section, where you can:
     • Create a Workspace and write your code in either:
       • a free, open-source Jupyter Notebook, or
       • Dataiku (if you hold a license).
     • Make use of a variety of services (API malware scan, etc.).

This Master’s Thesis concerns the use and optimization of the Jupyter Notebook to transform and deploy any ML model as a microservice. According to Haviv (https://youtu.be/P6m2Z8WdsBk), a significant percentage of AI projects (around 85%) never leave the research environment and reach the production pipeline. In practice, this means an economic loss for the company, as the developed models often cannot be used in a real-life environment.

Several reasons can lead to this situation. Developing and training a Machine Learning solution locally is only the first step before the company, or any community, can actually use the model. Further steps include packaging the application (probably into a container), scaling out, tuning, instrumenting, and possibly automating the whole process (Figure 1). The problem is that, while writing code and creating a model in a Jupyter Notebook is a simple process implementation-wise, everything else is not. Developing an ML solution in the workspace may require only a couple of Data Scientists, if not just one, and the development time can be counted in weeks. That is not the case for the subsequent steps. Data Scientists usually do not have (and are not required to have) the developer skills needed to deploy their models. Consequently, they often must work for months together with developers and software engineers, not on the actual solution but on its configuration. That translates to costs in time, effort and resources, and ultimately in money.

Figure 1

The main goal of this thesis is to study, develop and implement a possible solution to the problem described above by automating the deployment steps inside the Jupyter Notebook. The outline of an ideal solution is as follows:

A Data Scientist can use the DnA platform to write their code in the Jupyter Notebook and deploy their model as a microservice with a single click of a button.
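To make the target concrete, the following is a minimal sketch of what "a model as a microservice" could look like once the deployment steps are automated. It is not the DnA platform's implementation: the `predict` function is a hypothetical stand-in for a trained model, and the `/predict` route, the `features` field and the port are assumptions chosen purely for illustration. Only the Python standard library is used.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(features):
    """Hypothetical stand-in for a trained model: returns the mean of the inputs."""
    return sum(features) / len(features)


def handle_predict(body: bytes):
    """Turn a raw JSON request body into an (HTTP status, JSON response body) pair."""
    try:
        payload = json.loads(body)
        features = [float(x) for x in payload["features"]]
        if not features:
            raise ValueError("empty feature list")
    except (ValueError, KeyError, TypeError) as exc:
        # Malformed JSON, missing "features" key, or non-numeric values.
        return 400, json.dumps({"error": str(exc)}).encode()
    return 200, json.dumps({"prediction": predict(features)}).encode()


class PredictHandler(BaseHTTPRequestHandler):
    """Exposes handle_predict on POST /predict (route name is an assumption)."""

    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        status, response = handle_predict(self.rfile.read(length))
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(response)))
        self.end_headers()
        self.wfile.write(response)


def serve(port=8080):
    """Start the service; blocks until the process is interrupted."""
    HTTPServer(("0.0.0.0", port), PredictHandler).serve_forever()
```

Calling `serve()` starts the service, and a client can then send a POST request to `/predict` with a body such as `{"features": [1, 2, 3]}`. Keeping the request logic in `handle_predict`, separate from the HTTP handler, lets it be tested inside a notebook cell without starting a server. The packaging, scaling and instrumentation steps described above are deliberately left out of this sketch.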

For development and testing purposes, a real-life use case, the Chronos problem, will be used.