COMP 620

Guest Lectures


Date: Aug 29, 2:00 PM–2:40 PM
Lecturer: Yuke Wang [Slides]
Topic: Introduction

Date: Sep 5, 2:00 PM–2:40 PM
Lecturer: Boyuan Feng [Slides]
Topic: FlexAttention
Bio: Boyuan Feng is a PyTorch Core Developer working on PyTorch Compiler, Inductor, CUDAGraph, and Flex Attention.
Lecture Abstract: FlexAttention is a novel compiler-driven programming model that allows implementing the majority of attention variants in a few lines of idiomatic PyTorch code. Since its release in PyTorch 2.5.0, many ML researchers have used it to customize their attention kernels without writing kernel code. In this talk, we present recent advances in FlexAttention. More details are available in our MLSys'25 paper (https://arxiv.org/pdf/2412.05496) and on the PyTorch Blog (https://pytorch.org/blog/flexattention-for-inference/).
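
For readers new to the programming model, here is a minimal sketch of how an attention variant is expressed with FlexAttention (assuming PyTorch 2.5 or later and a CUDA device); the relative-bias score_mod and causal mask below are illustrative choices, not material from the talk.

```python
# Minimal FlexAttention sketch (assumes PyTorch >= 2.5 and a CUDA device).
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

B, H, S, D = 2, 8, 1024, 64
q, k, v = (torch.randn(B, H, S, D, device="cuda", dtype=torch.float16) for _ in range(3))

# An attention variant is written as a score modification over (batch, head, q_idx, kv_idx).
def relative_bias(score, b, h, q_idx, kv_idx):
    return score + 0.01 * (q_idx - kv_idx)   # toy relative positional bias

# Sparsity patterns (here: causal masking) are expressed separately as a block mask.
def causal(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx

block_mask = create_block_mask(causal, B=None, H=None, Q_LEN=S, KV_LEN=S)
out = flex_attention(q, k, v, score_mod=relative_bias, block_mask=block_mask)
# In practice flex_attention is wrapped in torch.compile so the variant is fused into one kernel.
```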

Date: Sep 19, 2:00 PM–2:40 PM
Lecturer: Yue Guan [Slides]
Topic: Mercury & KPerfIR
Bio: Yue Guan is a postdoctoral researcher at the University of California, San Diego, working with Prof. Yufei Ding in the Picasso Lab. He received his Ph.D. in Computer Science from Shanghai Jiao Tong University under the supervision of Prof. Jingwen Leng. His research focuses on efficient deep learning systems, spanning model compression, compiler optimization, and system design. His work has been published in top venues such as SOSP, OSDI, ASPLOS, and HPCA.
Lecture Abstract: The rapid growth of large language models (LLMs) requires better compilers for efficient use of multi-GPU systems. In this talk, I will introduce Mercury, a compiler that manages remote GPU memory as part of the memory hierarchy to optimize computation, storage, and communication. I will also present KPerfIR, a tool that adds profiling directly into the compilation process to help analyze GPU kernel performance. These approaches show how integrating optimization and performance analysis into compilers can improve the scalability and efficiency of LLMs.
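
KPerfIR's distinguishing feature is that profiling is embedded in the compilation process itself. As a rough, external analogue of the kind of per-kernel measurement involved, the sketch below times a single GPU kernel from the host with CUDA events in plain PyTorch; this is not KPerfIR's intra-kernel instrumentation or API.

```python
# Host-side GPU kernel timing with CUDA events: a stand-in for the kind of per-kernel
# performance measurement discussed in the talk (illustrative only, not KPerfIR).
import torch

a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")

torch.matmul(a, b)                 # warm-up so one-time setup cost is not measured
torch.cuda.synchronize()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
c = torch.matmul(a, b)
end.record()
torch.cuda.synchronize()
print(f"matmul kernel time: {start.elapsed_time(end):.3f} ms")
```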

Date: Oct 17, 2:00 PM–2:40 PM
Lecturer: Liangyu Zhao [Slides]
Topic: ForestColl
Bio: Liangyu Zhao is a fourth-year PhD student at the University of Washington, advised by Prof. Arvind Krishnamurthy. His research focuses on machine learning systems, with an emphasis on network communication for distributed machine learning. He is currently a research scientist intern on the Meta AI & Systems Co-Design team.
Lecture Abstract: As modern DNN models grow ever larger, collective communications between accelerators (allreduce, etc.) emerge as a significant performance bottleneck. Designing efficient communication schedules is challenging given today's heterogeneous and diverse network fabrics. We present ForestColl, a tool that generates throughput-optimal schedules for any network topology. ForestColl constructs broadcast/aggregation spanning trees as the communication schedule, achieving theoretical optimality. Its schedule generation runs in strongly polynomial time and is highly scalable. ForestColl supports any network fabric, including both switching fabrics and direct accelerator connections. We evaluated ForestColl on multi-box AMD MI250 and NVIDIA DGX A100 platforms. ForestColl showed significant improvements over the vendors' own optimized communication libraries, RCCL and NCCL, across various settings and in LLM training. ForestColl also outperformed other state-of-the-art schedule generation techniques, producing more efficient schedules at substantially faster generation speed.
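
To make the spanning-tree idea concrete, here is a toy Python sketch of an allreduce performed as an aggregation up a single tree followed by a broadcast back down it. ForestColl itself constructs many such trees to saturate every link of an arbitrary topology and proves throughput optimality; none of that machinery is shown here.

```python
# Toy illustration of tree-based collectives: allreduce = reduce up a spanning tree,
# then broadcast the result back down. Illustrative only, not ForestColl's algorithm.
from collections import deque

def bfs_tree(adj, root):
    """Return parent pointers of a BFS spanning tree of graph `adj` rooted at `root`."""
    parent = {root: None}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return parent

def tree_allreduce(adj, values, root=0):
    parent = bfs_tree(adj, root)
    children = {u: [v for v, p in parent.items() if p == u] for u in adj}
    def reduce_up(u):            # aggregation phase: sum subtree values toward the root
        return values[u] + sum(reduce_up(c) for c in children[u])
    total = reduce_up(root)
    return {u: total for u in adj}   # broadcast phase: every node receives the reduced value

# 4 GPUs connected in a ring, each holding one number
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
print(tree_allreduce(ring, {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0}))  # every node ends with 10.0
```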

Date: Oct 31, 2:00 PM–2:40 PM
Lecturer: Zhuang Wang [Slides]
Topic: Gemini
Bio: Zhuang Wang is an Applied Scientist at Amazon Web Services AI. He received his Ph.D. in Computer Science from Rice University in 2023, fortunately advised by Prof. T. S. Eugene Ng. His current research interests focus on efficient training and inference systems for large language models.
Lecture Abstract: Frequent failures are observed during large model training due to the large scale of resources involved and the extended training time. This talk presents Gemini, a distributed training system that enables fast failure recovery for large model training by checkpointing to the CPU memory of the host machines, which offers much larger aggregate bandwidth. However, two challenges prevent naïvely checkpointing to CPU memory. First, the availability of checkpoints in CPU memory cannot be guaranteed when failures occur. Second, since training and checkpointing share the same network, checkpoint traffic can interfere with training traffic and harm training throughput. To address these two challenges, we propose: 1) a provably near-optimal checkpoint placement strategy to maximize the probability of failure recovery from checkpoints in CPU memory; and 2) a checkpoint traffic scheduling algorithm to minimize, if not eliminate, the interference of checkpoint traffic on model training. Our evaluation shows that Gemini achieves the optimal checkpoint frequency, i.e., every iteration, and incurs no overhead on training throughput for large model training.
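
As context for the mechanism Gemini builds on, the sketch below snapshots model and optimizer state into host (CPU) memory rather than remote persistent storage. Gemini's actual contributions, deciding where across machines each checkpoint copy is placed and scheduling checkpoint traffic around training traffic, are not represented here.

```python
# Minimal in-memory (CPU RAM) checkpointing sketch; illustrative, not Gemini's implementation.
import torch
import torch.nn as nn

model = nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.AdamW(model.parameters())

def checkpoint_to_cpu(model, optimizer):
    """Copy model/optimizer state into host memory (pinned buffers in practice, so the
    copies can overlap with compute)."""
    return {
        "model": {k: v.detach().cpu() for k, v in model.state_dict().items()},
        "optim": optimizer.state_dict(),  # optimizer tensors could be moved to CPU the same way
    }

def restore_from_cpu(model, optimizer, snapshot):
    model.load_state_dict(snapshot["model"])
    optimizer.load_state_dict(snapshot["optim"])

snapshot = checkpoint_to_cpu(model, optimizer)   # taken every iteration in Gemini's setting
restore_from_cpu(model, optimizer, snapshot)     # replayed after a failure
```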

Date: Nov 14, 2:00 PM–2:40 PM
Lecturer: Yixin Dong [Slides]
Topic: XGrammar
Bio: Yixin Dong is a Ph.D. student at Carnegie Mellon University, advised by Prof. Tianqi Chen, and a part-time researcher at xAI. His research focuses on building efficient and verifiable LLM agents. Before that, he received his B.Eng. in Computer Science from Shanghai Jiao Tong University. Yixin is also a major contributor to several widely adopted open-source projects, including Apache TVM, MLC-LLM, and XGrammar.
Lecture Abstract: XGrammar has become the de facto standard for guided decoding in the industry. Guided decoding aims to ensure that the outputs of large language models conform to user-defined structures or grammars by applying additional token masks during decoding. Over the past year since the release of XGrammar, we have made significant improvements. In this talk, I will introduce two exciting advancements. First, on the performance side, we have accelerated grammar compilation and mask generation by leveraging the Earley parser, cross-grammar caching, and JIT grammar compilation. Second, in terms of supported structures, we have designed the Structural Tag, a JSON-based DSL that can describe complex structures, naturally aligning with the needs of LLM agents and now publicly available across all major LLM engines. Finally, we will look ahead to the future goals of XGrammar.
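
The core mechanism, masking out grammar-disallowed tokens at every decoding step, can be sketched generically as below. The allowed_token_ids function is a hypothetical placeholder for a grammar matcher such as XGrammar's; the real XGrammar API (bitmask generation, Structural Tags) is not reproduced here.

```python
# Generic guided-decoding sketch: forbid tokens the grammar does not allow before sampling.
import torch

def allowed_token_ids(generated_ids, vocab_size):
    # Hypothetical placeholder: a real matcher advances a grammar (e.g. a JSON schema or
    # Structural Tag) over the generated prefix and returns the currently legal token ids.
    return torch.arange(vocab_size)

def guided_decode_step(logits, generated_ids):
    vocab_size = logits.shape[-1]
    mask = torch.full_like(logits, float("-inf"))
    mask[allowed_token_ids(generated_ids, vocab_size)] = 0.0  # keep allowed tokens only
    probs = torch.softmax(logits + mask, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()

logits = torch.randn(32000)                       # logits for one decoding step
next_token = guided_decode_step(logits, generated_ids=[1, 2, 3])
```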

Date: Nov 21, 2:00 PM–2:40 PM
Lecturer: Song Bian [Slides]
Topic: Scaling Laws Meet Model Architecture
Bio: Song Bian is a final-year Ph.D. student at the University of Wisconsin-Madison, advised by Prof. Shivaram Venkataraman. His research focuses on building efficient training systems and designing inference-efficient large language models.
Lecture Abstract: Scaling the number of parameters and the size of training data has proven to be an effective strategy for improving large language model (LLM) performance. Yet, as these models grow increasingly powerful and widely deployed, the cost of inference has become a pressing concern. In view of this, I ask the following question: Can we explicitly capture the trade-off between inference efficiency and accuracy of large language models? In this talk, I will demonstrate that the architecture of large language models significantly affects their inference efficiency. Motivated by this observation, we propose a conditional scaling law that extends the Chinchilla framework by incorporating architectural factors. We also introduce a search framework for discovering model architectures that are both inference-efficient and accurate. Finally, using the proposed scaling law and search framework, we predict optimized model architectures that outperform LLaMA-3.2 in both accuracy and inference throughput, under the same training budget.
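
For reference, the Chinchilla framework the talk builds on fits a parametric loss L(N, D) = E + A / N^alpha + B / D^beta in the parameter count N and token count D. The sketch below evaluates that fit with the commonly cited Hoffmann et al. (2022) coefficients and shows, purely illustratively, how an architecture-dependent term could be attached; it is not the conditional scaling law proposed in the talk.

```python
# Chinchilla-style parametric loss, L(N, D) = E + A / N**alpha + B / D**beta,
# using the commonly cited Hoffmann et al. (2022) fit. The "conditional" variant is an
# illustrative placeholder, not the scaling law from the talk.
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def chinchilla_loss(N, D):
    """Predicted pretraining loss for N parameters trained on D tokens."""
    return E + A / N**alpha + B / D**beta

def conditional_loss(N, D, arch_penalty=0.0):
    """Hypothetical extension: fold in an architecture-dependent term (e.g. depth/width or
    attention/FFN split) so accuracy can be traded against inference cost."""
    return chinchilla_loss(N, D) + arch_penalty

print(chinchilla_loss(1e9, 20e9))   # roughly 2.6 for a 1B-parameter model on 20B tokens
```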