CS 598: Systems for Generative AI (S'25)

Logistics

Lectures: 0216 Siebel Center for Computer Science, WF: 12:30 PM – 01:45 PM

Member (NetID)           Role         Office Hours
Fan Lai (fanlai)         Instructor   3128 Siebel Center. F 2:00 PM – 3:00 PM
Chengsong Zhang (cz81)   TA           Zoom. Time TBD
Jimmy Shong (jimmys2)    TA           Zoom. Time TBD

Piazza: ALL communication regarding this course must be via Piazza. This includes questions, discussions, announcements, and private messages.

Presentation slides and paper summaries should be emailed to [email protected].

Course Description

Learning Objectives: This course will introduce the key concepts and the state-of-the-art in practical, scalable, and fault-tolerant software systems for emerging Generative AI (GenAI). At the end of the course you will be able to:

  • Critique and evaluate the design details of state-of-the-art GenAI systems
  • Develop and utilize tools to profile and understand the performance of GenAI systems
  • Propose new research ideas on topics related to supporting practical GenAI

Structure: The course will be a mix of lectures, student presentations, seminar-style discussions, and a semester-long project on GenAI topics. We will cover GenAI topics from top conferences that take a systems view of the relevant challenges, including:

  • Basics of GenAI models from a systems perspective;
  • Systems for GenAI lifecycle (pre-training, training, fine-tuning/alignment, inference serving, and grounding);
  • GenAI for systems, etc.

Note that this course is NOT focused on AI methods. Instead, we will focus on how one can build software systems so that existing AI methods can be used in practice and new AI methods can emerge.

Prerequisites: Students are expected to have good programming skills and must have taken at least one undergraduate-level systems-related course (from operating systems, databases, distributed systems, or networking). Having an undergraduate ML/AI course is helpful but not required.

Tentative Schedule and Reading List

This is an evolving list and subject to change due to the breakneck pace of GenAI innovations.

Date Readings Presenter Companion Reviewer
Jan 22
(GenAI Systems)
Introduction
How to Read a Paper (Required)
How to Give a Bad Talk (Required)
Writing Reviews for Systems Conferences
The Shift from Models to Compound AI Systems
Presenter: Fan
GenAI Basics
Jan 24
(LLM Fundamentals)
The Illustrated Transformer (Required)
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (Required)
Attention Is All You Need
The Transformer Family Version 2.0
Presenter: Jimmy
Jan 29
(Diffusion Fundamentals)
The Illustrated Stable Diffusion (Required)
Scalable Diffusion Models with Transformers (Required)
Adding Conditional Control to Text-to-Image Diffusion Models
Hierarchical Text-Conditional Image Generation with CLIP Latents
Presenter: Chengsong
Jan 31 No Lecture / Work on Project Proposal
Worse is Better (Required)
Hints and Principles for Computer System Design
Feb 5
(LMMs)
Multimodality and Large Multimodal Models (LMMs) (Required)
Visual Instruction Tuning
DeepSpeed-VisualChat: Multi-Round Multi-Image Interleave Chat via Multi-Modal Causal Attention
Feb 7
(MoE)
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer (Required)
DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale (Required)
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
Scaling Vision-Language Models with Sparse Mixture of Experts
Feb 12
(Video Generation)
VideoPoet: A Large Language Model for Zero-Shot Video Generation (Required)
CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer (Required)
Movie Gen: A Cast of Media Foundation Models
Pre-Training
Feb 14
(Training Infra)
The Llama 3 Herd of Models (Sec 1-4.2, Required)
DeepSeek-V3 Technical Report (Sec 3.1-3.2, 3.4, Required)
Gemini: A Family of Highly Capable Multimodal Models
Feb 19
(Model Parallelism)
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism (Required)
Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning (Required)
LightSeq: Sequence Level Parallelism for Distributed Training of Long Context Transformers
Feb 21
(Infra for Parallelism)
MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs (Required)
RDMA over Ethernet for Distributed AI Training at Meta Scale (Required)
GEMINI: Fast Failure Recovery in Distributed Training with In-Memory Checkpoints
Feb 26
(Energy Efficiency)
Perseus: Removing Energy Bloat from Large Model Training (Required)
GREEN: Carbon-efficient Resource Scheduling for Machine Learning Clusters (Required)
Power-aware Deep Learning Model Serving with μ-Serve
Feb 28
(Training Simulation)
SimAI: Unifying Architecture Design and Performance Tuning for Large-Scale Large Language Model Training with Scalability and Precision (Required)
Vidur: A large-scale simulation framework for LLM inference (Required)
Pathways: Asynchronous Distributed Dataflow for ML
Alignment & Post-Training Optimization
Mar 5
(Sys for RLHF)
Training language models to follow instructions with human feedback (Required)
RLHFuse: Efficient RLHF Training for Large Language Models with Inter- and Intra-Stage Fusion (Required)
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Mar 7
(Serving LoRAs)
LoRA: Low-Rank Adaptation of Large Language Models (Required)
dLoRA: Dynamically Orchestrating Requests and Adapters for LoRA LLM Serving (Required)
Stylus: Automatic Adapter Selection for Diffusion Models
Mar 12
(Quantization)
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale (Required)
AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration (Required)
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
Grounding
Mar 14
(Optimizing Throughput)
Efficient Memory Management for Large Language Model Serving with PagedAttention (Required)
Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve (Required)
SGLang: Efficient Execution of Structured Language Model Programs
Mar 26
(Optimizing User Experience)
PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU (Required)
Andes: Defining and Enhancing Quality-of-Experience in LLM-Based Text Streaming Services (Required)
LLM in a flash: Efficient Large Language Model Inference with Limited Memory
Mar 28
(Speculative Decoding)
Fast Inference from Transformers via Speculative Decoding (Required)
SpecInfer: Accelerating Generative Large Language Model Serving with Tree-based Speculative Inference and Verification (Required)
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
Apr 2 Mid-Semester Presentations
Apr 4 Mid-Semester Presentations
Inference
Apr 9
(RAG)
REALM: Retrieval-Augmented Language Model Pre-Training (Required)
CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion (Required)
MemGPT: Towards LLMs as Operating Systems
Apr 11
(KV Cache)
SnapKV: LLM Knows What You are Looking for Before Generation (Required)
H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models (Required)
Efficient Streaming Language Models with Attention Sinks
Apr 16
(Caching GenAI)
Approximate Caching for Efficiently Serving Text-to-Image Diffusion Models (Required)
CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving
Apr 18
(Compound AI)
OMS-DPM: Optimizing the Model Schedule for Diffusion Probabilistic Models (Required)
Self-Reflection in LLM Agents: Effects on Problem-Solving Performance (Required)
Vulcan: Automatic Query Planning for Live ML Analytics
Apr 23
(LLM for Systems)
NetLLM: Adapting Large Language Models for Networking (Required)
Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models (Required)
LLM-ABR: Designing Adaptive Bitrate Algorithms via Large Language Models
Apr 25
(New LLM Paradigms)
Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters (Required)
Linear-Time Sequence Modeling with Selective State Spaces (Required)
MiniMax-01: Scaling Foundation Models with Lightning Attention
Apr 30 Final Presentations
May 2 Final Presentations
May 7 Final Presentations
May 15 Final Report Submission Deadline

Tentative Grading

Groups: All activities of this course, except your own participation :), will be performed in groups of 4-5 students. Form a group and declare your group's membership and paper preferences by Jan 31. After this date, we will form groups from the remaining students.

Component                            Weight
Participation                        15%
Paper Presentation & Discussion      15%
Paper Summary                        10%
Project Proposal                     5%
Project Mid-Semester Presentations   10%
Project Final Presentations          10%
Project Final Report                 35%

Academic integrity: The University's Honor Code applies to all activities related to this course. All material you submit in this course (reading responses, project reports, and presentation materials) must be your own. If you use someone else’s material, you must cite them properly.

AI Tool Policy: AI tools may be used for grammar checking and for refining initial brainstorms, but the final reviews and code must be authored by the student. Students are responsible for the entire content of their submissions and must adhere to the Academic Integrity Policy.

Policies

Participation

Before Each Lecture: Each lecture will include one or two required readings that everyone must read. There will also be optional related reading(s) that only the presenter(s) need to be familiar with; they are optional for the rest of the class. You are required to submit one insightful question about each presented paper before each lecture.

During Lectures: Active participation is crucial both for your own understanding and for improving the overall quality of the course. You are expected to attend all lectures (up to 2 absences are allowed for legitimate reasons) and, more importantly, to participate in class discussions. Not everyone must contribute every day, but everyone is expected to have something to share over the semester.

After Lectures: Participation also involves contributing to discussions on Piazza. The group responsible for the summary should initiate the (remaining) discussion, and the rest of the members are encouraged to participate.

Student Lectures

The course will be conducted as a seminar. Only one group will present in each class. Each group will be assigned at least one lecture over the course of the semester. Presentations should last at most 40 minutes without interruption. However, presenters should expect questions and interruptions throughout.

In the presentation, you should:

  • Provide a brief background to motivate the problem (you can simplify this by referencing previous talks).
  • Present the high-level idea, approach, and/or insight of the required reading (using examples whenever appropriate).
  • Discuss technical details so that the audience can understand the key points without carefully reading the paper (the evaluations can be skimmed quickly).
  • Explain how the required reading differs from related work and from the additional reading.
  • Identify strengths and weaknesses of the required reading and propose directions for future research.

The slides for a presentation must be emailed to the instructor team (in *.pptx format) at least 24 hours prior to the corresponding class.

Post-Presentation Panel Discussion

To foster a deeper understanding of the papers and encourage critical thinking, each lecture will be followed by a panel discussion. This discussion will involve three distinct roles played by different student groups, simulating an interactive and dynamic scholarly exchange.

Roles and Responsibilities

  1. The Authors
  • Group Assignment: The 'Companion' group will write the summary and play the role of the paper's authors.
  • Responsibility: As authors, you are expected to defend your paper against critiques, answer questions, and discuss how you might improve or extend your research in the future, akin to writing a rebuttal during the peer-review process.
  2. The Reviewers
  • Group Assignment: The 'Reviewer' group will write the summary and will be assigned to one slot to play the role of reviewers.
  • Responsibility: Reviewers critically assess the paper, posing challenging questions and highlighting potential weaknesses or areas for further investigation. Your goal is to engage in a constructive critique of the paper, simulating a peer review scenario.
  3. Rest of the Class (including the presenters)
  • Responsibility: During the panel discussions, feel free to actively ask questions and engage in the dialogue.

Lecture Summaries

Each group will also be assigned to write summaries for roughly two lectures: one in the 'Companion' role and the other in the 'Reviewer' role. The summaries assigned to a group will not cover the reading that the group presented.

A paper summary must address the following questions in sufficient detail (2-3 pages):

  • What is the problem addressed in the lecture, and why is this problem important?
  • What is the state of related works in this topic?
  • What is the proposed solution, and what key insight guides their solution?
  • What is one (or more) drawback or limitation of the proposal?
  • What are potential directions for future research?

Late summaries will not be counted. The summary for a paper must be emailed to the instructor team within 24 hours after its presentation.

You should use this format for writing your summary. Use a Google Doc to enable in-line comments and suggestions.

Allocate enough time for your reading, discuss as a group, write the summary carefully, and finally, include key observations from the class discussion.

Project

You will have to complete substantive work on an instructor-approved problem and make an original contribution. Surveys are not permitted as projects; instead, each project must contain a survey of background and related work.

You must meet the following milestones (unless otherwise specified in future announcements) to ensure a high-quality project at the end of the semester:

  • Turn in a 2-page draft proposal (template), plus as many pages as needed for references, by February 26. Remember to include the names and UIUC email addresses of the group members.
  • Each group must present mid-semester progress during class hours on April 2 and April 4.
  • Each group must turn in an 8-page final report and your code via email on or before 6:00PM CST on May 15. The report must be submitted as a PDF file, with formatting similar to that of the papers you've read in the class. The self-contained (i.e., include ALL dependencies) code must be submitted as a zip file. Each zip file containing the code must include a README file with a step-by-step guide on how to compile and run the provided code.
  • You can find instructions for accessing GPU resources here.

Acknowledgements

This course is heavily inspired by other excellent system seminar courses, particularly UMich CSE 585. Acknowledgments to SymbioticLab.
