Weekly Summary 3
Dec 31, 2023 12:30 pm UTC+8
USTC High-Tech Zone Campus
1 This Week's Work
1.1 Theoretical Research:
- While digging into Nsight Systems and Nsight Compute, I realized that I have never systematically studied GPU architecture and CUDA programming, and there are many basic questions I cannot answer:
  - General-purpose GPU design
    - As an accelerator, what problem was the GPU originally designed to solve? Simply parallel computation?
    - What classic general-purpose hardware designs emerged for that goal, analogous to the CPU's memory hierarchy, ROB, and port model?
    - Likely candidates: SIMT, shared memory, a large shared register file
    - What is the instruction execution flow of a GPU program? Is it also a multi-stage pipeline?
    - Why are AI programs a good fit for GPUs (high parallelism, heavy memory traffic?), and how do they actually run on one? (e.g., CNNs, Transformers, forward/backward propagation, gradient updates)
  - NVIDIA GPU and CUDA programming concepts (see the short CUDA sketch after this list)
    - What special designs do NVIDIA GPUs have? (ray-tracing units, 3D units? Tensor Cores?)
    - Blocks and grids
    - Warps and threads
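To pin down the grid/block/warp/thread terminology above, here is a minimal CUDA vector-add sketch (the kernel name, sizes, and launch configuration are illustrative, not taken from any real experiment): a grid is made of blocks, each block is made of threads, and the hardware executes the threads of a block in 32-wide warps (SIMT).

#include <cstdio>
#include <cuda_runtime.h>

// Each thread computes one element; the kernel is launched as a grid of blocks.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    int warp = threadIdx.x / warpSize;              // warp index inside the block (warpSize == 32)
    (void)warp;                                     // computed only to show the warp/thread relation
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *c;
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int threadsPerBlock = 256;                              // 8 warps per block
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAdd<<<blocksPerGrid, threadsPerBlock>>>(a, b, c, n); // grid of blocks of threads
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);                            // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}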
- DSE on GPU using Interval model
- Three papers and one formula:
- GPUMech: GPU Performance Modeling Technique based on Interval Analysis
- MDM: The GPU Memory Divergence Model
- GCoM: A Detailed GPU Core Model for Accurate Analytical Modeling of Modern GPUs
- The underlying principle is also based on the interval model plus a memory trace (a generic form of the interval-model formula is sketched below).
- Modeling accuracy is within roughly 10% of GPUSIM.
- Compared with GPUSIM, the time cost drops dramatically: for a real run of about 100 ms, collecting the trace takes about 10 s, i.e., roughly a 100x overhead.
- As with the ZSim simulator, this not only provides DSE capability: it retains the TB-scale trace data while saving a great deal of DSE time.
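As a rough sketch of the kind of formula these interval-model papers build on (a generic interval-analysis form, not the exact equation from GPUMech, MDM, or GCoM), the total cycle count of a kernel is estimated from the trace as

    C_{\text{total}} \;\approx\; \frac{N_{\text{insn}}}{W_{\text{issue}}} \;+\; \sum_{e \in \text{stall events}} n_e \, p_e

where N_insn is the number of dynamic instructions in the trace, W_issue the sustainable issue rate, n_e the occurrence count of stall event e (e.g., long-latency memory accesses or branch divergence), and p_e the average penalty of e that remains exposed after overlap with other warps.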
- Related ideas from lab mates:
- CZW: DSE on PIM
- TBX: GPU model formula based on the interval model
- YFY: tensor core & shared memory model
- My offer decision between Huawei and Baidu.
1.2 Hands-on Practice:
- Used a plugin in PPT to fix fonts and formatting.
- Got SD inference up and running and did a simple hotspot analysis.
2 Next Week's Task Priorities
- Finish the title and fill in the content of my graduation thesis, referencing "Cross-Architecture Automatic Critical Path Detection For In-Core Performance Analysis".
- AI for System 3: go back to the three papers on GPU analytical models to build a deep understanding. After this I will continue our journey to analyze LLM and PowerInfer.
- Of course, I still have a lot of ACSA lab website design and documentation work remaining.
2.1 Team
- Baidu compiler team, 晓光 (Xiaoguang).
- A dozen or so people; backend in Shanghai, frontend in Beijing.
- 尖酸小黄鸭 ("sharp-tongued little yellow duck", a nickname): T9, half a year.
- T5, a master's graduate from Tsinghua: AI compiler frontend, op → machine code → kernel; loop fusion.
2.2 Job Responsibilities
- Robustness (stability under input perturbations, reliability when handling error cases), correctness, and an AI-native mindset.
- Backend: code array.
2.2.1 AI Compiler
- Intermediate theory: algebraic structures, written down formally. Loop fusion across multiple ops; mapping to an ADT (Abstract Data Type?).
- TVM autotune op rules.
- Negative optimizations in torch: formal definitions of operators and their constraints; fusing ops A and B (see the minimal fusion sketch below).
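To make "fusing ops A and B" concrete, here is a minimal CUDA sketch of the idea (the operator choice, a scale followed by a bias add, is invented for illustration and is not Baidu's actual compiler output): the unfused version launches two kernels and round-trips an intermediate tensor through global memory, while the fused version does both elementwise ops in a single pass and keeps the intermediate value in a register.

#include <cuda_runtime.h>

// Unfused: op A (scale) and op B (bias add) run as two kernels, and the
// intermediate result tmp is written to and read back from global memory.
__global__ void opA_scale(const float* x, float* tmp, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tmp[i] = x[i] * s;
}
__global__ void opB_bias(const float* tmp, float* y, float b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = tmp[i] + b;
}

// Fused: one kernel, one global read and one global write per element;
// the intermediate value lives only in a register.
__global__ void fused_scale_bias(const float* x, float* y, float s, float b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = x[i] * s + b;
}

int main() {
    const int n = 1 << 20;
    float *x, *tmp, *y;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&tmp, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) x[i] = 1.0f;

    int tpb = 256, bpg = (n + tpb - 1) / tpb;
    // Unfused pipeline: two launches plus extra global-memory traffic through tmp.
    opA_scale<<<bpg, tpb>>>(x, tmp, 2.0f, n);
    opB_bias<<<bpg, tpb>>>(tmp, y, 1.0f, n);
    // Fused pipeline: one launch, no intermediate tensor.
    fused_scale_bias<<<bpg, tpb>>>(x, y, 2.0f, 1.0f, n);
    cudaDeviceSynchronize();

    cudaFree(x); cudaFree(tmp); cudaFree(y);
    return 0;
}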
Hi Jiang,
Here’s a summary of my activities this week:
- Nvidia Nsight Systems and Nsight Compute: While using Nsight Systems and Nsight Compute, I realized that I lack basic knowledge of GPU design and some related concepts, such as the relationship between warps and threads. I also found myself struggling to understand the GPU performance model and how it differs from the CPU performance model used by llvm-mca.
- Design Space Exploration (DSE) on GPU: Feeling confused, I asked Fuyan Yuan to give me a simplified explanation of his work on Design Space Exploration (DSE) on GPU. While learning from him, I discovered that the formulas mentioned in the papers were exactly what I had been searching for. I believe Buxin Tu can incorporate these formulas into the GPU model.
- Job Offer from Huawei HR: On Thursday, I received a great job offer from Huawei HR. I spent two days seeking advice from peers and senior leaders, and I finally decided to accept the offer; I will be joining Huawei next year.
Next Few Weeks’ Work:
- Graduation Thesis: Complete the title and content of my graduation thesis, taking inspiration from "Cross-Architecture Automatic Critical Path Detection For In-Core Performance Analysis."
- AI for System 3: Revisit the three papers on GPU analysis models to deepen my understanding. Following this, I will continue our journey to analyze LLM and PowerInfer.
- ACSA Lab Website Design and Documentation: I still have various tasks related to ACSA lab website design and documentation that need attention.
Feel free to reach out if you have any questions or suggestions.
Best regards,
Shaojie Tan