COMP 468/568

Schedule

| Week | Lecture 1 | Lecture 2 |
|------|-----------|-----------|
| Jan 14/16 | Introduction to Deep Learning Systems; GPU/Accelerator Overview [Slides] | S26-Week-1-Lect2-CIFAR10.pptx: CIFAR-10 example and basic DL workflow |
| Jan 21/23 | Modern GPU Architecture for Deep Learning | S26-Week-2-Lect2-GEMM.pptx: Matrix multiplication and GPU compute cores |
| Jan 28/30 | CUDA Programming and Kernel Optimization | S26-Week-3-Lect2-TensorCore.pptx: Tensor Core programming and mixed precision |
| Feb 4/6 | Memory Hierarchy, Caching, and Memory Management | S26-Week-4-Lect2-Convolution.pptx: Convolution kernel optimization and dataflow |
| Feb 11/13 | Tensor Operations and Optimized Kernels | S26-Week-5-Lect2-Transformer.pptx: Transformer model and attention computation |
| Feb 18/20 | Compiler Techniques for Deep Learning (IR, Operator Fusion) | S26-Week-6-Lect2-SparseMM.pptx: Sparse matrix multiplication and graph IR |
| Feb 25/27 | Distributed Training Fundamentals; Data vs. Model Parallelism | S26-Week-8-Lect2-DistTrain.pptx: Distributed training implementation |
| Mar 4/6 | Communication Optimizations and Scheduling | S26-Week-9-Lect2-Diffusion.pptx: Diffusion models and compute scheduling |
| Mar 11/13 | Systems for Large Models (LLMs, Diffusion Models) | S26-Week-10-Lect2-DLRM.pptx: Deep Learning Recommendation Models |
| Mar 18/20 | Profiling, Benchmarking, and Performance Analysis | (TBD: profiling demo or tool lecture) |
| Mar 25/27 | Case Studies: GNN Acceleration, Diffusion Models, Recommender Systems | (Recap combining the prior GNN, diffusion, and DLRM lectures) |
| – Apr 24 | Course Presentation | Course Presentation |

External Talks

| Date | Lecturer | Topic | Materials |
|------|----------|-------|-----------|
| Feb 18 | Jianming Tong | FEATHER | [TBD] |
| Feb 18 | Hongzhen Chen | [TBD] | [TBD] |
| Mar 4 | Zishen Wan | [TBD] | [TBD] |
Bio: Jianming Tong (https://jianmingtong.github.io/) is a 4th-year PhD candidate at Georgia Tech and a visiting researcher at MIT. He focuses on full-stack optimizations spanning model, system, compiler, and hardware to improve both the efficiency and the privacy of AI systems. He proposed a framework that approximates non-linear ML operators as polynomials compatible with Homomorphic Encryption (HE) without sacrificing model utility, enabling privacy-preserving ML via HE (MLSys'23); developed the CROSS compiler, which converts HE workloads into AI workloads that existing Google TPUs can accelerate, bringing immediate, scalable, low-cost privacy-preserving capability to existing AI stacks; and designed a dataflow-layout co-switching reconfigurable accelerator for efficient inference of dynamic AI workloads (ISCA'24). These works have been deployed at NVIDIA, Google, and IBM, and recognized by the Qualcomm Innovation Fellowship and the Machine Learning and Systems Rising Star award.
Lecture Abstract: The inference efficiency of diverse ML models on spatial accelerators boils down to the execution of different dataflows (i.e., different tiling, ordering, parallelism, and shapes). Using the optimal dataflow for every layer of a workload can reduce latency by up to two orders of magnitude relative to a suboptimal dataflow. Unfortunately, reconfiguring hardware for different dataflows involves on-chip data layout reordering and datapath reconfiguration, non-trivial overheads that keep ML accelerators from exploiting different dataflows and result in suboptimal performance. To address this challenge, we propose FEATHER, an innovative accelerator that leverages a novel spatial array termed NEST and a novel multi-stage reduction network called BIRRD to perform flexible data reduction with layout reordering under the hood, enabling seamless switching between optimal dataflows with negligible latency and resource overhead. To systematically evaluate the performance interaction between dataflows and layouts, we enhance Timeloop, a state-of-the-art dataflow cost-modeling and search framework, with layout assessment capabilities, and term the result Layoutloop. We model FEATHER in Layoutloop and also deploy FEATHER end-to-end on the edge ZCU104 FPGA. In Layoutloop, FEATHER delivers 1.27–2.89x inference latency speedup and 1.3–6.43x energy efficiency improvement over various state-of-the-art designs such as NVDLA, SIGMA, and Eyeriss on ResNet-50 and MobileNet-V3. On practical FPGA devices, FEATHER achieves 2.65x/3.91x higher throughput than the Xilinx DPU/Gemmini. Remarkably, these performance and energy-efficiency gains come at only 6% area overhead over a fixed-dataflow Eyeriss-like accelerator. Our code is available at https://github.com/maeri-project/FEATHER.
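To make the dataflow terminology in the abstract concrete, below is a minimal C sketch (not from the FEATHER codebase; the function names and tile size `T` are illustrative assumptions) showing two loop-nest dataflows for the same GEMM. The loop order and tiling determine which operand stays resident in local buffers; on a spatial accelerator, each choice also implies a different on-chip data layout, which is exactly the switching overhead FEATHER targets.

```c
/* Hypothetical illustration: two loop-nest "dataflows" for the same GEMM
 * C[M][N] += A[M][K] * B[K][N]. Both compute identical results; they differ
 * only in tiling and loop ordering, i.e., in which data stays "stationary". */
#include <stddef.h>

#define M 64
#define N 64
#define K 64
#define T 8   /* tile size, chosen arbitrarily for this sketch */

/* Output-stationary: each C[i][j] is accumulated to completion before moving
 * on, so partial sums never leave the innermost level of the hierarchy. */
void gemm_output_stationary(const float A[M][K], const float B[K][N],
                            float C[M][N]) {
    for (size_t i = 0; i < M; i++)
        for (size_t j = 0; j < N; j++) {
            float acc = 0.0f;                 /* output element stays local */
            for (size_t k = 0; k < K; k++)
                acc += A[i][k] * B[k][j];
            C[i][j] += acc;
        }
}

/* Weight-stationary (tiled): a T x T tile of B is held on chip and reused
 * across all rows of A before being evicted; the tile loops are hoisted out. */
void gemm_weight_stationary(const float A[M][K], const float B[K][N],
                            float C[M][N]) {
    for (size_t k0 = 0; k0 < K; k0 += T)
        for (size_t j0 = 0; j0 < N; j0 += T)  /* B tile [k0,k0+T) x [j0,j0+T) */
            for (size_t i = 0; i < M; i++)
                for (size_t k = k0; k < k0 + T; k++)
                    for (size_t j = j0; j < j0 + T; j++)
                        C[i][j] += A[i][k] * B[k][j];
}
```

Which ordering is faster depends on the layer's shape and the buffer sizes; FEATHER's contribution is making the switch between such dataflows (and the accompanying layout reordering) nearly free.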