COMP 620

Guest Lectures


Date: Aug 29, 2:00 PM–2:40 PM
Lecturer: Yuke Wang [Slides]
Topic: Introduction

Date: Sep 5, 2:00 PM–2:40 PM
Lecturer: Boyuan Feng [Slides]
Topic: FlexAttention
Bio: Boyuan Feng is a PyTorch Core Developer working on PyTorch Compiler, Inductor, CUDAGraph, and Flex Attention.
Lecture Abstract: FlexAttention is a novel compiler-driven programming model that allows implementing the majority of attention variants in a few lines of idiomatic PyTorch code. Since its release in PyTorch 2.5.0, many ML researchers have used it to customize their attention kernels without writing kernel code. In this talk, we present recent advances in FlexAttention. More details are available in our MLSys'25 paper (https://arxiv.org/pdf/2412.05496) and on the PyTorch Blog (https://pytorch.org/blog/flexattention-for-inference/).
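
For readers new to the programming model, here is a minimal sketch of how an attention variant is expressed with FlexAttention (assuming PyTorch 2.5 or later and a CUDA device); the relative-bias score_mod and causal mask below are illustrative choices, not material from the talk.

```python
# Minimal FlexAttention sketch (assumes PyTorch >= 2.5 and a CUDA device).
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

B, H, S, D = 2, 8, 1024, 64
q, k, v = (torch.randn(B, H, S, D, device="cuda", dtype=torch.float16) for _ in range(3))

# An attention variant is written as a score modification over (batch, head, q_idx, kv_idx).
def relative_bias(score, b, h, q_idx, kv_idx):
    return score + 0.01 * (q_idx - kv_idx)   # toy relative positional bias

# Sparsity patterns (here: causal masking) are expressed separately as a block mask.
def causal(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx

block_mask = create_block_mask(causal, B=None, H=None, Q_LEN=S, KV_LEN=S)
out = flex_attention(q, k, v, score_mod=relative_bias, block_mask=block_mask)
# In practice flex_attention is wrapped in torch.compile so the variant is fused into one kernel.
```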

Date: Sep 19, 2:00 PM–2:40 PM
Lecturer: Yue Guan [Slides]
Topic: Mercury & KPerfIR
Bio: Yue Guan is a postdoctoral researcher at the University of California, San Diego, working with Prof. Yufei Ding in the Picasso Lab. He received his Ph.D. in Computer Science from Shanghai Jiao Tong University under the supervision of Prof. Jingwen Leng. His research focuses on efficient deep learning systems, spanning model compression, compiler optimization, and system design. His work has been published in top venues such as SOSP, OSDI, ASPLOS, and HPCA.
Lecture Abstract: The rapid growth of large language models (LLMs) requires better compilers for efficient use of multi-GPU systems. In this talk, I will introduce Mercury, a compiler that manages remote GPU memory as part of the memory hierarchy to optimize computation, storage, and communication. I will also present KPerfIR, a tool that adds profiling directly into the compilation process to help analyze GPU kernel performance. These approaches show how integrating optimization and performance analysis into compilers can improve the scalability and efficiency of LLMs.
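
KPerfIR's distinguishing feature is that profiling is embedded in the compilation process itself. As a rough, external analogue of the kind of per-kernel measurement involved, the sketch below times a single GPU kernel from the host with CUDA events in plain PyTorch; this is not KPerfIR's intra-kernel instrumentation or API.

```python
# Host-side GPU kernel timing with CUDA events: a stand-in for the kind of per-kernel
# performance measurement discussed in the talk (illustrative only, not KPerfIR).
import torch

a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")

torch.matmul(a, b)                 # warm-up so one-time setup cost is not measured
torch.cuda.synchronize()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
c = torch.matmul(a, b)
end.record()
torch.cuda.synchronize()
print(f"matmul kernel time: {start.elapsed_time(end):.3f} ms")
```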

Date: Oct 17, 2:00 PM–2:40 PM
Lecturer: Liangyu Zhao [Slides]
Topic: ForestColl
Bio: Liangyu Zhao is a fourth-year PhD student at the University of Washington, advised by Prof. Arvind Krishnamurthy. His research focuses on machine learning systems, with an emphasis on network communication for distributed machine learning. He is currently a research scientist intern on the Meta AI & Systems Co-Design team.
Lecture Abstract: As modern DNN models grow ever larger, collective communications between accelerators (allreduce, etc.) emerge as a significant performance bottleneck. Designing efficient communication schedules is challenging given today's heterogeneous and diverse network fabrics. We present ForestColl, a tool that generates throughput-optimal schedules for any network topology. ForestColl constructs broadcast/aggregation spanning trees as the communication schedule, achieving theoretical optimality. Its schedule generation runs in strongly polynomial time and is highly scalable. ForestColl supports any network fabric, including both switching fabrics and direct accelerator connections. We evaluated ForestColl on multi-box AMD MI250 and NVIDIA DGX A100 platforms. ForestColl showed significant improvements over the vendors' own optimized communication libraries, RCCL and NCCL, across various settings and in LLM training. ForestColl also outperformed other state-of-the-art schedule generation techniques, producing more efficient schedules at substantially faster generation speed.
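
To make the spanning-tree idea concrete, here is a toy Python sketch of an allreduce performed as an aggregation up a single tree followed by a broadcast back down it. ForestColl itself constructs many such trees to saturate every link of an arbitrary topology and proves throughput optimality; none of that machinery is shown here.

```python
# Toy illustration of tree-based collectives: allreduce = reduce up a spanning tree,
# then broadcast the result back down. Illustrative only, not ForestColl's algorithm.
from collections import deque

def bfs_tree(adj, root):
    """Return parent pointers of a BFS spanning tree of graph `adj` rooted at `root`."""
    parent = {root: None}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return parent

def tree_allreduce(adj, values, root=0):
    parent = bfs_tree(adj, root)
    children = {u: [v for v, p in parent.items() if p == u] for u in adj}
    def reduce_up(u):            # aggregation phase: sum subtree values toward the root
        return values[u] + sum(reduce_up(c) for c in children[u])
    total = reduce_up(root)
    return {u: total for u in adj}   # broadcast phase: every node receives the reduced value

# 4 GPUs connected in a ring, each holding one number
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
print(tree_allreduce(ring, {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0}))  # every node ends with 10.0
```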

Date: Oct 31, 2:00 PM–2:40 PM
Lecturer: Zhuang Wang [Slides]
Topic: Gemini
Bio: Zhuang Wang is an Applied Scientist at Amazon Web Services AI. He received his Ph.D. in Computer Science from Rice University in 2023, fortunately advised by Prof. T. S. Eugene Ng. His current research interests focus on efficient training and inference systems for large language models.
Lecture Abstract: Frequent failures are observed during large model training due to the large scale of resources involved and the extended training time. This talk presents Gemini, a distributed training system that enables fast failure recovery for large model training by checkpointing to the CPU memory of the host machines, which offers much larger aggregate bandwidth. However, two challenges prevent naïvely checkpointing to CPU memory. First, the availability of checkpoints in CPU memory cannot be guaranteed when failures occur. Second, since training and checkpointing share the same network, checkpoint traffic can interfere with training traffic and harm training throughput. To address these two challenges, we propose: 1) a provably near-optimal checkpoint placement strategy to maximize the probability of failure recovery from checkpoints in CPU memory; and 2) a checkpoint traffic scheduling algorithm to minimize, if not eliminate, the interference of checkpoint traffic on model training. Our evaluation shows that Gemini achieves the optimal checkpoint frequency, i.e., every iteration, and incurs no overhead on training throughput for large model training.
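
As context for the mechanism Gemini builds on, the sketch below snapshots model and optimizer state into host (CPU) memory rather than remote persistent storage. Gemini's actual contributions, deciding where across machines each checkpoint copy is placed and scheduling checkpoint traffic around training traffic, are not represented here.

```python
# Minimal in-memory (CPU RAM) checkpointing sketch; illustrative, not Gemini's implementation.
import torch
import torch.nn as nn

model = nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.AdamW(model.parameters())

def checkpoint_to_cpu(model, optimizer):
    """Copy model/optimizer state into host memory (pinned buffers in practice, so the
    copies can overlap with compute)."""
    return {
        "model": {k: v.detach().cpu() for k, v in model.state_dict().items()},
        "optim": optimizer.state_dict(),  # optimizer tensors could be moved to CPU the same way
    }

def restore_from_cpu(model, optimizer, snapshot):
    model.load_state_dict(snapshot["model"])
    optimizer.load_state_dict(snapshot["optim"])

snapshot = checkpoint_to_cpu(model, optimizer)   # taken every iteration in Gemini's setting
restore_from_cpu(model, optimizer, snapshot)     # replayed after a failure
```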

Date: Nov 14, 2:00 PM–2:40 PM
Lecturer: Yixin Dong [Slides]
Topic: XGrammar
Bio: Yixin Dong is a Ph.D. student at Carnegie Mellon University, advised by Prof. Tianqi Chen, and a part-time researcher at xAI. His research focuses on building efficient and verifiable LLM agents. Before that, he received his B.Eng. in Computer Science from Shanghai Jiao Tong University. Yixin is also a major contributor to several widely adopted open-source projects, including Apache TVM, MLC-LLM, and XGrammar.
Lecture Abstract: XGrammar has become the de facto standard for guided decoding in the industry. Guided decoding aims to ensure that the outputs of large language models conform to user-defined structures or grammars by applying additional token masks during decoding. Over the past year since the release of XGrammar, we have made significant improvements. In this talk, I will introduce two exciting advancements. First, on the performance side, we have accelerated grammar compilation and mask generation by leveraging the Earley parser, cross-grammar caching, and JIT grammar compilation. Second, in terms of supported structures, we have designed the Structural Tag, a JSON-based DSL that can describe complex structures, naturally aligning with the needs of LLM agents and now publicly available across all major LLM engines. Finally, we will look ahead to the future goals of XGrammar.
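
The core mechanism, masking out grammar-disallowed tokens at every decoding step, can be sketched generically as below. The allowed_token_ids function is a hypothetical placeholder for a grammar matcher such as XGrammar's; the real XGrammar API (bitmask generation, Structural Tags) is not reproduced here.

```python
# Generic guided-decoding sketch: forbid tokens the grammar does not allow before sampling.
import torch

def allowed_token_ids(generated_ids, vocab_size):
    # Hypothetical placeholder: a real matcher advances a grammar (e.g. a JSON schema or
    # Structural Tag) over the generated prefix and returns the currently legal token ids.
    return torch.arange(vocab_size)

def guided_decode_step(logits, generated_ids):
    vocab_size = logits.shape[-1]
    mask = torch.full_like(logits, float("-inf"))
    mask[allowed_token_ids(generated_ids, vocab_size)] = 0.0  # keep allowed tokens only
    probs = torch.softmax(logits + mask, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()

logits = torch.randn(32000)                       # logits for one decoding step
next_token = guided_decode_step(logits, generated_ids=[1, 2, 3])
```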

Date: Nov 21, 2:00 PM–2:40 PM
Lecturer: Song Bian [Slides]
Topic: Scaling Laws Meet Model Architecture
Bio: Song Bian is a final-year Ph.D. student at the University of Wisconsin-Madison, advised by Prof. Shivaram Venkataraman. His research focuses on building efficient training systems and designing inference-efficient large language models.
Lecture Abstract: Scaling the number of parameters and the size of training data has proven to be an effective strategy for improving large language model (LLM) performance. Yet, as these models grow increasingly powerful and widely deployed, the cost of inference has become a pressing concern. In view of this, I ask the following question: Can we explicitly capture the trade-off between inference efficiency and accuracy of large language models? In this talk, I will demonstrate that the architecture of large language models significantly affects their inference efficiency. Motivated by this observation, we propose a conditional scaling law that extends the Chinchilla framework by incorporating architectural factors. We also introduce a search framework for discovering model architectures that are both inference-efficient and accurate. Finally, using the proposed scaling law and search framework, we predict optimized model architectures that outperform LLaMA-3.2 in both accuracy and inference throughput, under the same training budget.
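
For reference, the Chinchilla framework the talk builds on fits a parametric loss L(N, D) = E + A / N^alpha + B / D^beta in the parameter count N and token count D. The sketch below evaluates that fit with the commonly cited Hoffmann et al. (2022) coefficients and shows, purely illustratively, how an architecture-dependent term could be attached; it is not the conditional scaling law proposed in the talk.

```python
# Chinchilla-style parametric loss, L(N, D) = E + A / N**alpha + B / D**beta,
# using the commonly cited Hoffmann et al. (2022) fit. The "conditional" variant is an
# illustrative placeholder, not the scaling law from the talk.
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def chinchilla_loss(N, D):
    """Predicted pretraining loss for N parameters trained on D tokens."""
    return E + A / N**alpha + B / D**beta

def conditional_loss(N, D, arch_penalty=0.0):
    """Hypothetical extension: fold in an architecture-dependent term (e.g. depth/width or
    attention/FFN split) so accuracy can be traded against inference cost."""
    return chinchilla_loss(N, D) + arch_penalty

print(chinchilla_loss(1e9, 20e9))   # roughly 2.6 for a 1B-parameter model on 20B tokens
```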