Advanced Topics in Software Systems (SYS4BIGML)

The current focus of this course is on principles for engineering ML systems in the Computing Continuum.

Overview

This is an advanced course for master's and PhD students. Big Data and Machine Learning (ML) applications and services, and their reliability and robustness, depend strongly on the underlying systems that power them. On the one hand, techniques for supporting performance engineering, configuration management, testing and debugging of Big Data and ML are extremely important. On the other hand, large-scale distributed systems and new computing models have evolved with new hardware and infrastructure architectures, such as edge systems, tensor processing units, and quantum computing systems. This leads to the computing continuum for advanced Big Data and ML applications and services. Developing and optimizing Big Data and ML applications and services in such systems and models requires an in-depth understanding of the systems and of their roles for Big Data and ML.

Target participants/learners

The course is for students in doctoral and master's studies. At Aalto, the course is for students in the Doctoral Programme in Science and the CCIS Master's Programme.

This course provides advanced knowledge about computing and software systems that is useful for the big data and machine learning domains. It therefore connects to various other courses, such as Big Data Platforms, Cloud Computing, and Deep Learning, and to the Master's thesis, by providing complementary in-depth knowledge of system aspects.

Required previous knowledge

Students should have knowledge of cloud computing, big data, operating systems, distributed systems and machine learning. It is therefore important that students have passed courses on these topics, such as Cloud Computing, Big Data Platforms, Operating Systems, and Machine Learning. Students are also expected to have very good programming skills.

Content

First, key system requirements arising from the complexity, reliability, and robustness of Big Data and ML applications and services will be analyzed and presented. Based on that, we will learn techniques for supporting performance engineering, configuration management, testing and debugging of Big Data and ML. Such techniques are extremely important; they are cross-cutting topics for the course, regardless of the underlying systems empowering Big Data and ML applications and services.

Second, selected areas in systems for Big Data and ML will be presented. We will examine the computing continuum model, dataflows/programming frameworks and orchestration techniques. We will examine the state of the art, strengths and weaknesses of concepts and techniques. We will focus on engineering frameworks that can be used to develop Big Data and ML applications, according to the above-mentioned cross-cutting topics.

For each selected area, we will focus on the following aspects:

- understanding and applying key principles, techniques, and concepts
- analyzing/evaluating/creating (new) methods/techniques

Focused Areas in 2024

- Computing continuum (edge systems, edge-cloud systems)
- Design and evaluation for systems robustness, reliability, resilience and elasticity for Big Data/ML (also with engineering work)
- Observability and explainability for ML applications (also with engineering work)
- Dataflows and orchestration frameworks for Big Data/ML (also with engineering work)
- Quality of analytics for ML in the edge-cloud continuum (also with engineering work)

Course Plan and Teaching methods

We define the generic plan of the course as follows:

- Lectures given by teachers: students must provide study logs.
- Hands-on tutorials given by teachers: the goal is to give some concrete examples of the techniques discussed in the lectures. However, since this is a research-oriented course, students can also practice similar problems with a different software stack.
- Project topic proposal and presentation: students must identify a topic related to the content of the course and present it.
- Topic implementation and demonstration: students will implement the topic and demonstrate the project.
- Students will make public material about the topic project available in Git spaces (e.g., in Aalto, GitHub, GitLab, ...).

As this is an advanced, research-oriented course, students are evaluated on a pass/fail basis. Passing the course requires students to (i) participate in lectures and hands-on tutorials, (ii) pass the study logs, (iii) pass the project topic presentation, and (iv) pass the final demonstration.

Fall 2024 - Schedule

Responsible teacher: Hong-Linh Truong
Other teacher/assistant: Minh-Tri Nguyen
Other teacher/assistant: Hong-Tri Nguyen

Tentative slots

Date	Place	Content	Lead person
11.09.2024	U359b	Course overview, lecture 1 discussion	Linh Truong
18.09.2024	U359b	Lecture 2 discussion	Linh Truong
25.09.2024	U359b	Hands-on tutorial 1	Minh-Tri Nguyen
02.10.2024	U359b	Lecture 3 discussion	Linh Truong
09.10.2024	U359b	Hands-on tutorial 2	Minh-Tri Nguyen
23.10.2024	R030A133 T5	Lecture 4 + Hands-on tutorial 3	Hong-Tri Nguyen
30.10.2024	R030A133 T5	Project topic discussion	Linh Truong, Minh-Tri Nguyen, Hong-Tri Nguyen
Flexible		Discussion about topics and possible hands-on	All
06.11.2024	R030A133 T5	Checkpoint 1: Topic progress presentation	All
20.11.2024	R030A133 T5	Checkpoint 2: Topic progress/prefinal check	All
Flexible		Discussion about project progress	All
04.12.2024	R030A133 T5	Final project demonstration	All
11.12.2024		Final report/code delivery	Individual

Lectures/Discussions

Lecture 0: Introduction to Federated Learning
Lecture 1: Robustness, Reliability, Resilience and Elasticity for Machine Learning Systems in Edge-Cloud Continuum
- Slides: Robustness, Reliability, Resilience and Elasticity (R3E) for Machine Learning Systems in Edge-Cloud Continuum
- Key reading 1: R3E - An Approach to Robustness, Reliability, Resilience and Elasticity Engineering for End-to-End Machine Learning Systems
- Key reading 2: The New Frontier of Machine Learning Systems
- Key reading 3: Hidden Technical Debt in Machine Learning Systems
- Key reading 4: Declarative Machine Learning Systems
- Key reading 5: Technology readiness levels for machine learning systems
- Key reading 6: Serving deep neural networks at the cloud edge for vision applications on mobile platforms
- Key reading 7: From the Edge to the Cloud: Model Serving in ML.NET
- Key reading 8: Machine Learning at Facebook: Understanding Inference at the Edge
- Key reading 9: Distributing Deep Neural Networks with Containerized Partitions at the Edge
- Key reading 10: A survey of federated learning for edge computing: Research problems and solutions
- Key reading 11: Joost Verbraeken, Matthijs Wolting, Jonathan Katzy, Jeroen Kloppenburg, Tim Verbelen, and Jan S. Rellermeyer. 2020. A Survey on Distributed Machine Learning. ACM Comput. Surv. 53, 2, Article 30 (March 2021)
Lecture 2: Monitoring, Observability and Explainability for ML Systems
- Slides: Monitoring, Observability and Experimenting for Machine Learning Systems
- Key reading 1: Benchmarking big data systems: A survey
- Key reading 2: MLPerf Training Benchmark
- Key reading 3: Data Validation for Machine Learning
- Key reading 4: Developments in MLflow: A System to Accelerate the Machine Learning Lifecycle and ModelDB: a system for machine learning model management
- Key reading 5: Machine Learning Testing: Survey, Landscapes and Horizons
- Site 1: MLCommons
- The collection of Putting Machine Learning into Production Systems is also useful
Lecture 3: Coordination Models and Techniques for Machine Learning Systems
- Slides: Coordination Models and Techniques for Big Data and Machine Learning Systems
- Key reading 1: Cirrus: a Serverless Framework for End-to-end ML Workflows
- Key reading 2: Towards ML Engineering: A Brief History Of TensorFlow Extended (TFX)
- Key reading 3: Orchestrating Big Data Analysis Workflows in the Cloud: Research Challenges, Survey, and Future Directions
- Key reading 4: KeystoneML: Optimizing Pipelines for Large-Scale Advanced Analytics
- Key reading 5: Jeff Smith. 2018. Machine Learning Systems: Designs that scale (1st. ed.). Manning Publications Co., USA.
- Key reading 6: Prediction-Serving Systems
Lecture 4: Vulnerability Diagnostics for Machine Learning Systems in Edge-Cloud Continuum

If you need the sources of the slides for your teaching, please contact Linh Truong.

Hands-on tutorials

We have a few hands-on tutorials for the course that students can carry out as part of their studies. Note that only 1-2 hands-on tutorials will be arranged by the teacher and teaching assistants.

Project ideas presentations

Students will propose their own project ideas. This is an important aspect of a research-oriented course. If a student cannot propose an idea, the teacher will suggest some concrete ideas.

Final project demonstration

The final project demonstration is organized as an "event" where all students demonstrate their work and discuss the experiences gained in their projects.
List of the student projects

Guides

How to write study/learning logs

Reading list

Interesting and relevant papers and sites

Previous course versions

Citation (if you use the material):

Hong-Linh Truong, Advanced Topics in Software Systems, https://github.com/rdsea/sys4bigml, 2020 (BIB Entry)

Copyrights/Licences: the lecture slides and course structure/info use CC BY 4.0. Individual tutorials have their own licenses (Apache License 2.0).

Contact

Linh Truong
