COMP 468/568

Schedule

| Week | Lecture 1 | Lecture 2 |
|------|-----------|-----------|
| Jan 14/16 | Introduction to Deep Learning Systems; GPU/Accelerator Overview [Slides] | S26-Week-1-Lect2-CIFAR10.pptx: CIFAR-10 example and basic DL workflow |
| Jan 21/23 | Modern GPU Architecture for Deep Learning | S26-Week-2-Lect2-GEMM.pptx: Matrix multiplication and GPU compute cores |
| Jan 28/30 | CUDA Programming and Kernel Optimization | S26-Week-3-Lect2-TensorCore.pptx: Tensor Core programming and mixed precision |
| Feb 4/6 | Memory Hierarchy, Caching, and Memory Management | S26-Week-4-Lect2-Convolution.pptx: Convolution kernel optimization and dataflow |
| Feb 11/13 | Tensor Operations and Optimized Kernels | S26-Week-5-Lect2-Transformer.pptx: Transformer model and attention computation |
| Feb 18/20 | Compiler Techniques for Deep Learning (IR, Operator Fusion) | S26-Week-6-Lect2-SparseMM.pptx: Sparse matrix multiplication and graph IR |
| Feb 25/27 | Distributed Training Fundamentals; Data vs. Model Parallelism | S26-Week-8-Lect2-DistTrain.pptx: Distributed training implementation |
| Mar 4/6 | Communication Optimizations and Scheduling | S26-Week-9-Lect2-Diffusion.pptx: Diffusion models and compute scheduling |
| Mar 11/13 | Systems for Large Models (LLMs, Diffusion Models) | S26-Week-10-Lect2-DLRM.pptx: Deep Learning Recommendation Models |
| Mar 18/20 | Profiling, Benchmarking, and Performance Analysis | (TBD: profiling demo or tool lecture) |
| Mar 25/27 | Case Studies: GNN Acceleration, Diffusion Models, Recommender Systems | (Recap combining the prior GNN, diffusion, and DLRM lectures) |
| – Apr 24 | Course Presentation | Course Presentation |

External Talks

| Date | Lecturer | Topic | Materials |
|------|----------|-------|-----------|
| Feb 18 | Jianming Tong | FEATHER | [TBD] |
| Feb 18 | Hongzhen Chen | [TBD] | [TBD] |
| Mar 4 | Zishen Wan | [TBD] | [TBD] |
Bio: Jianming Tong (https://jianmingtong.github.io/) is a 4th-year PhD candidate at Georgia Tech and a visiting researcher at MIT. He focuses on full-stack optimizations spanning model, system, compiler, and hardware to improve both the efficiency and the privacy of AI systems. He proposed a framework that approximates non-linear ML operators as polynomials compatible with Homomorphic Encryption (HE) without sacrificing model utility, enabling privacy-preserving ML via HE (MLSys'23); developed the CROSS compiler, which converts HE workloads into AI workloads that existing Google TPUs can accelerate, bringing immediate, scalable, low-cost privacy-preserving capability to existing AI stacks; and designed a dataflow-layout co-switching reconfigurable accelerator for efficient inference of dynamic AI workloads (ISCA'24). These works have been deployed at NVIDIA, Google, and IBM, and recognized by the Qualcomm Innovation Fellowship and the Machine Learning and Systems Rising Star award.
Lecture Abstract: The inference efficiency of diverse ML models on spatial accelerators boils down to the execution of different dataflows (i.e., different tiling, ordering, parallelism, and shapes). Using the optimal dataflow for every layer of a workload can reduce latency by up to two orders of magnitude relative to a suboptimal dataflow. Unfortunately, reconfiguring hardware for different dataflows involves on-chip data layout reordering and datapath reconfiguration, non-trivial overheads that keep ML accelerators from exploiting different dataflows and result in suboptimal performance. To address this challenge, we propose FEATHER, an innovative accelerator that leverages a novel spatial array termed NEST and a novel multi-stage reduction network called BIRRD to perform flexible data reduction with layout reordering under the hood, enabling seamless switching between optimal dataflows with negligible latency and resource overhead. To systematically evaluate the performance interaction between dataflows and layouts, we enhance Timeloop, a state-of-the-art dataflow cost-modeling and search framework, with layout assessment capabilities, and term the result Layoutloop. We model FEATHER in Layoutloop and also deploy FEATHER end-to-end on the edge ZCU104 FPGA. In Layoutloop, FEATHER delivers 1.27–2.89x inference latency speedup and 1.3–6.43x energy efficiency improvement over various state-of-the-art designs such as NVDLA, SIGMA, and Eyeriss on ResNet-50 and MobileNet-V3. On practical FPGA devices, FEATHER achieves 2.65x/3.91x higher throughput than the Xilinx DPU/Gemmini. Remarkably, these performance and energy-efficiency gains come at only 6% area overhead over a fixed-dataflow Eyeriss-like accelerator. Our code is available at https://github.com/maeri-project/FEATHER.
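To make the dataflow terminology in the abstract concrete, below is a minimal C sketch (not from the FEATHER codebase; the function names and tile size `T` are illustrative assumptions) showing two loop-nest dataflows for the same GEMM. The loop order and tiling determine which operand stays resident in local buffers; on a spatial accelerator, each choice also implies a different on-chip data layout, which is exactly the switching overhead FEATHER targets.

```c
/* Hypothetical illustration: two loop-nest "dataflows" for the same GEMM
 * C[M][N] += A[M][K] * B[K][N]. Both compute identical results; they differ
 * only in tiling and loop ordering, i.e., in which data stays "stationary". */
#include <stddef.h>

#define M 64
#define N 64
#define K 64
#define T 8   /* tile size, chosen arbitrarily for this sketch */

/* Output-stationary: each C[i][j] is accumulated to completion before moving
 * on, so partial sums never leave the innermost level of the hierarchy. */
void gemm_output_stationary(const float A[M][K], const float B[K][N],
                            float C[M][N]) {
    for (size_t i = 0; i < M; i++)
        for (size_t j = 0; j < N; j++) {
            float acc = 0.0f;                 /* output element stays local */
            for (size_t k = 0; k < K; k++)
                acc += A[i][k] * B[k][j];
            C[i][j] += acc;
        }
}

/* Weight-stationary (tiled): a T x T tile of B is held on chip and reused
 * across all rows of A before being evicted; the tile loops are hoisted out. */
void gemm_weight_stationary(const float A[M][K], const float B[K][N],
                            float C[M][N]) {
    for (size_t k0 = 0; k0 < K; k0 += T)
        for (size_t j0 = 0; j0 < N; j0 += T)  /* B tile [k0,k0+T) x [j0,j0+T) */
            for (size_t i = 0; i < M; i++)
                for (size_t k = k0; k < k0 + T; k++)
                    for (size_t j = j0; j < j0 + T; j++)
                        C[i][j] += A[i][k] * B[k][j];
}
```

Which ordering is faster depends on the layer's shape and the buffer sizes; FEATHER's contribution is making the switch between such dataflows (and the accompanying layout reordering) nearly free.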