Publications (Conf/Journal/Patent)

Also available at Google Scholar, DBLP and ORCID.

Note: (Co-)Supervised Student, Corresponding-Author^✉, ar/acceptance_rate

Selected Publications

2026

[C36] Yuhao Gu, Zhongchun Zheng, Nong Xiao^✉, Yutong Lu and Xianwei Zhang^✉
coMulator: Coordinate Cross-architecture Compilation and Emulation to Accelerate Dynamic Binary Translation (CCF-A, ar/25.3%)
The 59th IEEE/ACM International Symposium on Microarchitecture (MICRO), Athens, Greece, October 2026.

[C35] Han Huang, Lanshu Huang, Xianjie Chen, Xianwei Zhang^✉ and Yutong Lu^✉
HSPref: Efficient Software Prefetching for ARM SME Outer-Products (CCF-A, ar/25.3%)
The 59th IEEE/ACM International Symposium on Microarchitecture (MICRO), Athens, Greece, October 2026.

[C34] Zejia Lin, Hongxin Xu, Guanyi Chen, Zhiguang Chen, Yutong Lu^✉ and Xianwei Zhang^✉
Bullet: Boosting GPU Utilization for LLM Serving via Dynamic Spatial-Temporal Orchestration (CCF-A, ar/10.6%)
The 31st ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Pittsburgh, PA, United States, March 2026.

[C33] Xuanteng Huang, Fan Li, Riyang Hu, Jianchang Zhang, Yuan Peng, Yang Zhou, Fangying Chen and Xianwei Zhang^✉
FusedRec: Fused Embedding Communication for Distributed Recommendation Training on GPUs (CCF-A, ar/17.6%)
The 40th Annual AAAI Conference on Artificial Intelligence (AAAI), Singapore, January 2026.

2025

[J7] Wenxuan Pan, Zejia Lin, Jiangsu Du^✉ and Xianwei Zhang^✉
HuntKTm: Hybrid Scheduling and Automatic Management for Efficient Kernel Execution on Modern GPUs (CCF-A)
ACM Transactions on Architecture and Code Optimization (TACO), Volume 22, Issue 4, Article 161.

[C29] Hongxin Xu⁼, Tianyu Guo⁼ and Xianwei Zhang^✉ (⁼Equal Contribution)
DynaPipe: Dynamic Layer Redistribution for Efficient Serving of LLMs with Pipeline Parallelism (CCF-A, ar/24.5%)
The Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS), San Diego, CA, United States, December 2025.

[C28] Yuhao Gu, Haoquan Chen, Xianjie Chen, Jiangsu Du, Zhiguang Chen, Nong Xiao^✉, Xianwei Zhang^✉ and Yutong Lu
coMtainer: Compilation-assisted HPC Container Images with Enhanced Adaptability (CCF-A, ar/21.2%)
The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC), St. Louis, MO, United States, November 2025.

[C27] Tianyu Guo, Xianwei Zhang^✉, Jiangsu Du, Zhiguang Chen^✉, Nong Xiao and Yutong Lu
gLLM: Global Balanced Pipeline Parallelism Systems for Distributed LLMs Serving with Token Throttling (CCF-A, ar/21.2%)
The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC), St. Louis, MO, United States, November 2025.

[C26] Han Huang, Jiabin Xie, Guangnan Feng, Xianwei Zhang, Dan Huang, Zhiguang Chen and Yutong Lu^✉
HStencil: Matrix-Vector Stencil Computation with Interleaved Outer Product and MLA (CCF-A, ar/21.2%)
The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC), St. Louis, MO, United States, November 2025.

[C25] Xuanteng Huang, Jiangsu Du, Nong Xiao and Xianwei Zhang^✉
PaSK: Cold Start Mitigation for Inference with Proactive and Selective Kernel Loading on GPUs (CCF-A, ar/23%)
The 62nd ACM/IEEE Design Automation Conference (DAC), San Francisco, CA, United States, June 2025.

[C24] Kan Wu, Zejia Lin, Mengyue Xi, Zhongchun Zheng, Wenxuan Pan, Xianwei Zhang^✉ and Yutong Lu^✉
GoPTX: Fine-grained GPU Kernel Fusion by PTX-level Instruction Flow Weaving (CCF-A, ar/23%)
The 62nd ACM/IEEE Design Automation Conference (DAC), San Francisco, CA, United States, June 2025.

[C23] Yuhao Gu, Chunyu Chen, Jiangsu Du, Xiaoxi Zhang and Xianwei Zhang^✉
ORFA: Exploring WebAssembly as a Turing Complete Query Language for Web APIs (CCF-A, Oral, ar/19.8%)
The ACM Web Conference (WWW), Sydney, NSW, Australia, April 2025.

2024

[C19] Tianyu Guo, Xuanteng Huang, Kan Wu, Xianwei Zhang^✉ and Nong Xiao
SMILE: LLC-based Shared Memory Expansion to Improve GPU Thread Level Parallelism (CCF-A)
The 61st ACM/IEEE Design Automation Conference (DAC), San Francisco, CA, United States, June 2024.

[C18] Yuanxin Wei, Jiangsu Du^✉, Jiazhi Jiang, Xiao Shi, Xianwei Zhang, Dan Huang^✉, Nong Xiao and Yutong Lu
APTMoE: Affinity-aware Pipeline Tuning for MoE Models on Bandwidth-constrained GPU Nodes (CCF-A)
The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC), Atlanta, GA, United States, November 2024.

2023

[C14] Zejia Lin, Zewei Mo, Xuanteng Huang, Xianwei Zhang^✉ and Yutong Lu
KeSCo: Compiler-based Kernel Scheduling for Multi-task GPU Applications (CCF-B)
The IEEE 41st International Conference on Computer Design (ICCD), Washington DC, United States, November 2023.

2022

[C13] Tianao Ge, Zewei Mo, Kan Wu, Xianwei Zhang^✉ and Yutong Lu
RollBin: Reducing Code-size via Loop Rerolling at Binary Level (CCF-B)
The 23rd ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES), San Diego, California, United States, June 2022.

2018

[C8] Anthony Gutierrez, Brad Beckmann, Alexandru Dutu, Joseph Gross, Michael LeBeane, John Kalamatianos, Onur Kayiran, Matthew Poremba, Brandon Potter, Sooraj Puthoor, Matt Sinclair, Mark Wyse, Jieming Yin, Xianwei Zhang, Akshay Jain and Tim Rogers
Lost in Abstraction: Pitfalls of Analyzing GPUs at the Intermediate Language Level (CCF-A)
The 24th IEEE International Symposium on High-Performance Computer Architecture (HPCA), Vienna, Austria, February 2018.

2017

[C7] Xianwei Zhang, Youtao Zhang, Bruce R. Childers and Jun Yang
DrMP: Mixed Precision-aware DRAM for High Performance Approximate and Precise Computing (CCF-B)
The 26th International Conference on Parallel Architectures and Compilation Techniques (PACT), Portland, Oregon, USA, September 2017.

2016

[C6] Xianwei Zhang, Youtao Zhang, Bruce R. Childers and Jun Yang
Restore Truncation for Performance Improvement in Future DRAM Systems (CCF-A)
The 22nd IEEE International Symposium on High-Performance Computer Architecture (HPCA), Barcelona, Spain, March 2016.

2013

[C1] Xianwei Zhang, Lei Jiang, Youtao Zhang, Chuanjun Zhang and Jun Yang
WoM-SET: Lowering Write Power of Proactive-SET based PCM Write Strategy Using WoM Code (CCF-C, Best Paper Award)
The International Symposium on Low Power Electronics and Design (ISLPED), Beijing, China, September 2013.

All Publications

2026

[C36] Yuhao Gu, Zhongchun Zheng, Nong Xiao^✉, Yutong Lu and Xianwei Zhang^✉
coMulator: Coordinate Cross-architecture Compilation and Emulation to Accelerate Dynamic Binary Translation (CCF-A, ar/25.3%)
The 59th IEEE/ACM International Symposium on Microarchitecture (MICRO), Athens, Greece, October 2026.

[C35] Han Huang, Lanshu Huang, Xianjie Chen, Xianwei Zhang^✉ and Yutong Lu^✉
HSPref: Efficient Software Prefetching for ARM SME Outer-Products (CCF-A, ar/25.3%)
The 59th IEEE/ACM International Symposium on Microarchitecture (MICRO), Athens, Greece, October 2026.

[C34] Zejia Lin, Hongxin Xu, Guanyi Chen, Zhiguang Chen, Yutong Lu^✉ and Xianwei Zhang^✉
Bullet: Boosting GPU Utilization for LLM Serving via Dynamic Spatial-Temporal Orchestration (CCF-A, ar/10.6%)
The 31st ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Pittsburgh, PA, United States, March 2026.

[C33] Xuanteng Huang, Fan Li, Riyang Hu, Jianchang Zhang, Yuan Peng, Yang Zhou, Fangying Chen and Xianwei Zhang^✉
FusedRec: Fused Embedding Communication for Distributed Recommendation Training on GPUs (CCF-A, ar/17.6%)
The 40th Annual AAAI Conference on Artificial Intelligence (AAAI), Singapore, January 2026.

[C32] Xianwei Zhang⁼, Xuanteng Huang⁼ and Nong Xiao^✉ (⁼Equal Contribution)
FEDCM: Fine-grained Kernel Scheduling and Management to Improve GPU Sharing (CCF-B, ar/25%)
Design, Automation and Test in Europe Conference (DATE), Verona, Italy, April 2026.

[C31] Tengyang Zheng, Han Huang, Junru Chen, Xianwei Zhang^✉ and Yutong Lu^✉
SMEAtten: Fast and Memory-Efficient Outer Product-based Attention on ARMv9 CPUs with SME (CCF-B, ar/26.2%)
The 32nd International European Conference on Parallel and Distributed Computing (Euro-Par), Pisa, Italy, August 2026.

[C30] Bingjie Liu, Zhongchun Zheng and Xianwei Zhang^✉
TensorAgent: Multi-Agent Framework for Automated Tensor Core Code Generation (CCF-C)
International Symposium on Advanced Parallel Processing Technology (APPT), Brussels, Belgium, July 2026.

[J8] Wenyuan Liang, Hengzhong Liang, Han Huang and Xianwei Zhang^✉
Porting LULESH to SYCL: Practical Insights and Performance Analysis (CCF-C)
CCF Transactions on High Performance Computing (THPC), 2026.

2025

[J7] Wenxuan Pan, Zejia Lin, Jiangsu Du^✉ and Xianwei Zhang^✉
HuntKTm: Hybrid Scheduling and Automatic Management for Efficient Kernel Execution on Modern GPUs (CCF-A)
ACM Transactions on Architecture and Code Optimization (TACO), Volume 22, Issue 4, Article 161.

[C29] Hongxin Xu⁼, Tianyu Guo⁼ and Xianwei Zhang^✉ (⁼Equal Contribution)
DynaPipe: Dynamic Layer Redistribution for Efficient Serving of LLMs with Pipeline Parallelism (CCF-A, ar/24.5%)
The Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS), San Diego, CA, United States, December 2025.

[C28] Yuhao Gu, Haoquan Chen, Xianjie Chen, Jiangsu Du, Zhiguang Chen, Nong Xiao^✉, Xianwei Zhang^✉ and Yutong Lu
coMtainer: Compilation-assisted HPC Container Images with Enhanced Adaptability (CCF-A, ar/21.2%)
The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC), St. Louis, MO, United States, November 2025.

[C27] Tianyu Guo, Xianwei Zhang^✉, Jiangsu Du, Zhiguang Chen^✉, Nong Xiao and Yutong Lu
gLLM: Global Balanced Pipeline Parallelism Systems for Distributed LLMs Serving with Token Throttling (CCF-A, ar/21.2%)
The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC), St. Louis, MO, United States, November 2025.

[C26] Han Huang, Jiabin Xie, Guangnan Feng, Xianwei Zhang, Dan Huang, Zhiguang Chen and Yutong Lu^✉
HStencil: Matrix-Vector Stencil Computation with Interleaved Outer Product and MLA (CCF-A, ar/21.2%)
The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC), St. Louis, MO, United States, November 2025.

[C25] Xuanteng Huang, Jiangsu Du, Nong Xiao and Xianwei Zhang^✉
PaSK: Cold Start Mitigation for Inference with Proactive and Selective Kernel Loading on GPUs (CCF-A, ar/23%)
The 62nd ACM/IEEE Design Automation Conference (DAC), San Francisco, CA, United States, June 2025.

[C24] Kan Wu, Zejia Lin, Mengyue Xi, Zhongchun Zheng, Wenxuan Pan, Xianwei Zhang^✉ and Yutong Lu^✉
GoPTX: Fine-grained GPU Kernel Fusion by PTX-level Instruction Flow Weaving (CCF-A, ar/23%)
The 62nd ACM/IEEE Design Automation Conference (DAC), San Francisco, CA, United States, June 2025.

[C23] Yuhao Gu, Chunyu Chen, Jiangsu Du, Xiaoxi Zhang and Xianwei Zhang^✉
ORFA: Exploring WebAssembly as a Turing Complete Query Language for Web APIs (CCF-A, Oral, ar/19.8%)
The ACM Web Conference (WWW), Sydney, NSW, Australia, April 2025.

[C22] Mengyue Xi, Jingyi He and Xianwei Zhang^✉
CacheC: LLM-based GPU Cache Management to Enhance Kernel Concurrency (CCF-B)
The 31st International European Conference on Parallel and Distributed Computing (Euro-Par), Dresden, Germany, August 2025.

[C21] Tianyu Guo, Hande Dong^✉, Yichong Leng, Feng Liu, Cheater Lin, Nong Xiao and Xianwei Zhang^✉
EFIM: Efficient Serving of LLMs for Infilling Tasks with Improved KV Cache Reuse (CCF-B)
The 31st International European Conference on Parallel and Distributed Computing (Euro-Par), Dresden, Germany, August 2025.

[C20] Mengyue Xi, Tianyu Guo, Xuanteng Huang, Zejia Lin and Xianwei Zhang^✉
Mpache: Interaction Aware Multi-level Cache Bypassing on GPUs (CCF-C, ar/28.6%)
The 30th Asia and South Pacific Design Automation Conference (ASP-DAC), Tokyo Odaiba Miraikan, Japan, January 2025.

[J6] Hengzhong Liang, Han Huang and Xianwei Zhang^✉
SuCL: Supply Unified Communication Layer to Improve SYCL-based Heterogeneous Computing (CCF-C)
CCF Transactions on High Performance Computing (THPC), 2025.

[J5] Pin Chen, Qing Mo, Zexin Xu, Xianwei Zhang and Yutong Lu^✉
Star-gen: An HPC-AI Framework for Constructing Large-scale Computational Materials Database (CCF-C)
CCF Transactions on High Performance Computing (THPC), 2025.

2024

[C19] Tianyu Guo, Xuanteng Huang, Kan Wu, Xianwei Zhang^✉ and Nong Xiao
SMILE: LLC-based Shared Memory Expansion to Improve GPU Thread Level Parallelism (CCF-A)
The 61st ACM/IEEE Design Automation Conference (DAC), San Francisco, CA, United States, June 2024.

[C18] Yuanxin Wei, Jiangsu Du^✉, Jiazhi Jiang, Xiao Shi, Xianwei Zhang, Dan Huang^✉, Nong Xiao and Yutong Lu
APTMoE: Affinity-aware Pipeline Tuning for MoE Models on Bandwidth-constrained GPU Nodes (CCF-A)
The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC), Atlanta, GA, United States, November 2024.

[C17] Zejia Lin, Aoyuan Sun, Xianwei Zhang^✉ and Yutong Lu
MixPert: Optimizing Mixed-precision Floating-point Emulation on GPU Integer Tensor Cores (CCF-B)
The 25th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES), Copenhagen, Denmark, June 2024.

[C16] Zhaowen Shan, Xuanteng Huang, Zheng Zhou and Xianwei Zhang^✉
openLG: A Tunable and Efficient Open-source LSTM on GPUs (CCF-C)
The International Joint Conference on Neural Networks (IJCNN), Yokohama, Japan, June 2024.

[C15] Zhongchun Zheng, Yuan Wu and Xianwei Zhang^✉
mLOOP: Optimize Loop Unrolling in Compilation with a ML-based Approach
The 17th International Conference on Networking, Architecture, and Storage (NAS), Guangzhou, China, November 2024.

2023

[C14] Zejia Lin, Zewei Mo, Xuanteng Huang, Xianwei Zhang^✉ and Yutong Lu
KeSCo: Compiler-based Kernel Scheduling for Multi-task GPU Applications (CCF-B)
The IEEE 41st International Conference on Computer Design (ICCD), Washington DC, United States, November 2023.

[J4] Xuanteng Huang, Xianwei Zhang^✉, Panfei Yang^✉ and Nong Xiao
Benchmarking GPU Tensor Cores on General Matrix Multiplication Kernels through CUTLASS
Applied Sciences, December 2023.

[J3] Xi Zhang^✉, Xiaohu Guo, Yue Weng, Xianwei Zhang, Yutong Lu and Zhong Zhao
Hybrid MPI and CUDA Paralleled Finite Volume Unstructured CFD Simulations on a Multi-GPU System (CCF-C)
Future Generation Computer Systems (FGCS), 139 (2023), February 2023.

[W5] Lianghong Huang, Zejia Lin, Wei Liu^✉ and Xianwei Zhang^✉
Hay: Enhancing GPU Sharing Performance With Two-Level Scheduling for Ray
The 29th IEEE International Conference on Parallel and Distributed Systems (ICPADS, short), Hainan, China, December 2023.

2022

[C13] Tianao Ge, Zewei Mo, Kan Wu, Xianwei Zhang^✉ and Yutong Lu
RollBin: Reducing Code-size via Loop Rerolling at Binary Level (CCF-B)
The 23rd ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES), San Diego, California, United States, June 2022.

[C12] Zewei Mo, Zejia Lin, Xianwei Zhang^✉ and Yutong Lu
moTuner: A Compiler-based Auto-tuning Approach for Mixed-precision Operators (CCF-C)
The 19th ACM International Conference on Computing Frontiers (CF), Turin, Piedmont, Italy, May 2022.

[C11] Yue Weng, Tianao Ge, Xianwei Zhang^✉ and Yutong Lu
RAISE: Efficient GPU Resource Management via Hybrid Scheduling (CCF-C)
The 22nd IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGrid), Taormina (Messina), Italy, May 2022.

2021

[J2] Yue Weng, Xi Zhang, Xiaohu Guo, Xianwei Zhang^✉, Yutong Lu and Yang Liu
Effects of Mesh Loop Modes on Performance of Unstructured Finite Volume GPU Simulations
Advances in Aerodynamics (AIA), 3(21), 2021.

2020

[W4] Xianwei Zhang and Evgeny Shcherbakov
DELTA: Validate GPU Memory Profiling with Microbenchmarks
The International Symposium on Memory Systems (MemSys, short), Washington D.C., USA, October 2020.

2019

[C10] Tuan Ta, Xianwei Zhang, Anthony Gutierrez and Brad Beckmann
Autonomous Data-Race-Free GPU Testing
IEEE International Symposium on Workload Characterization (IISWC), Orlando, Florida, USA, November 2019.

[C9] Xianwei Zhang, Rujia Wang, Youtao Zhang and Jun Yang
Boosting Chipkill Capability under Retention-error Induced Reliability Emergency (CCF-C)
The 24th Asia and South Pacific Design Automation Conference (ASP-DAC), Tokyo, Japan, January 2019.

[W3] John Alsop, Matt Sinclair, Srikant Bharadwaj, Anthony Gutierrez, Xianwei Zhang, Brad Beckmann, Alex Dutu, Onur Kayiran, Michael LeBeane, Brandon Potter, Sooraj Puthoor and Tsung Tai Yeh
Optimizing GPU Cache Policies for MI Workloads
IEEE International Symposium on Workload Characterization (IISWC, short), Orlando, Florida, USA, November 2019.

2018

[C8] Anthony Gutierrez, Brad Beckmann, Alexandru Dutu, Joseph Gross, Michael LeBeane, John Kalamatianos, Onur Kayiran, Matthew Poremba, Brandon Potter, Sooraj Puthoor, Matt Sinclair, Mark Wyse, Jieming Yin, Xianwei Zhang, Akshay Jain and Tim Rogers
Lost in Abstraction: Pitfalls of Analyzing GPUs at the Intermediate Language Level (CCF-A)
The 24th IEEE International Symposium on High-Performance Computer Architecture (HPCA), Vienna, Austria, February 2018.

2017

[C7] Xianwei Zhang, Youtao Zhang, Bruce R. Childers and Jun Yang
DrMP: Mixed Precision-aware DRAM for High Performance Approximate and Precise Computing (CCF-B)
The 26th International Conference on Parallel Architectures and Compilation Techniques (PACT), Portland, Oregon, USA, September 2017.

[J1] Xianwei Zhang, Youtao Zhang, Bruce R. Childers and Jun Yang
On the Restore Time Variations of Future DRAM Memory (CCF-B)
ACM Trans. on Design Automation of Electronic Systems (TODAES), 22(2), February 2017.

2016

[W2] Xianwei Zhang, Youtao Zhang, Bruce R. Childers and Jun Yang
AWARD: Approximation-aWAre Restore in Further Scaling DRAM
The International Symposium on Memory Systems (MemSys, extended abstract), Washington D.C., USA, October 2016.

[C6] Xianwei Zhang, Youtao Zhang, Bruce R. Childers and Jun Yang
Restore Truncation for Performance Improvement in Future DRAM Systems (CCF-A)
The 22nd IEEE International Symposium on High-Performance Computer Architecture (HPCA), Barcelona, Spain, March 2016.

2015

[C5] Xianwei Zhang, Youtao Zhang, Bruce R. Childers and Jun Yang
Exploiting DRAM Restore Time Variations in Deep Sub-micron Scaling (CCF-B)
The IEEE Conference on Design, Automation and Test in Europe (DATE), Grenoble, France, March 2015.

[C4] Xianwei Zhang, Youtao Zhang and Jun Yang
DLB: Dynamic Lane Borrowing for Improving Bandwidth and Performance in Hybrid Memory Cube (CCF-B)
The 33rd IEEE International Conference on Computer Design (ICCD), New York City, USA, October 2015.

[C3] Xianwei Zhang, Youtao Zhang and Jun Yang
TriState-SET: Proactive SET for Improved Performance in MLC Phase Change Memories (CCF-B)
The 33rd IEEE International Conference on Computer Design (ICCD), New York City, USA, October 2015.

[C2] Xianwei Zhang, Lei Zhao, Youtao Zhang and Jun Yang
Exploit Common Source-Line to Construct Energy Efficient Domain Wall Memory based Caches (CCF-B)
The 33rd IEEE International Conference on Computer Design (ICCD), New York City, USA, October 2015.

[W1] Xianwei Zhang, Youtao Zhang and Jun Yang
Adaptive Lane Borrowing of Hybrid Memory Cube
The 52nd ACM/IEEE Design Automation Conference (DAC, wip), San Francisco, California, USA, June 2015.

2013

[C1] Xianwei Zhang, Lei Jiang, Youtao Zhang, Chuanjun Zhang and Jun Yang
WoM-SET: Lowering Write Power of Proactive-SET based PCM Write Strategy Using WoM Code (CCF-C, Best Paper Award)
The International Symposium on Low Power Electronics and Design (ISLPED), Beijing, China, September 2013.

[arXiv]

[A7] Yipeng Ouyang, Xin Huang, Bingjie Liu, Zhongchun Zheng, Yuhao Gu and Xianwei Zhang
Benchmarks are Not Enough: Ramp for Runtime Assessing of Agentic Models in Production Systems

[A6] Yipeng Ouyang, Yi Xiao, Yuhao Gu and Xianwei Zhang
SkCC: Portable and Secure Skill Compilation for Cross-Framework LLM Agents

[A5] Yuhao Gu, Zhongchun Zheng, Nong Xiao, Yutong Lu and Xianwei Zhang
Partial Cross-Compilation and Mixed Execution for Accelerating Dynamic Binary Translation

[A4] Tianyu Guo, Tianming Xu, Xianjie Chen, Junru Chen, Nong Xiao and Xianwei Zhang
RServe: Overlapping Encoding and Prefill for Efficient LMM Inference

[A3] Zejia Lin, Hongxin Xu, Guanyi Chen, Zhiguang Chen, Yutong Lu and Xianwei Zhang
Boosting LLM Serving through Spatial-Temporal GPU Resource Sharing (ASPLOS’2026)

[A2] Zhongchun Zheng, Kan Wu, Long Cheng, Lu Li, Rodrigo C. O. Rocha, Tianyi Liu, Wei Wei, Jianjiang Zeng, Xianwei Zhang and Yaoqing Gao
VecTrans: LLM Transformation Framework for Better Auto-vectorization on High-performance CPU

[A1] Tian Wu, Liming Wang, Zijian Wen, Xiaoxi Zhang, Jingpu Duan, Xianwei Zhang and Jinhang Zuo
Accelerating Edge Inference for Distributed MoE Models with Latency-Optimized Expert Placement

[Patent]

[P8] Xianwei Zhang, John Kalamatianos and Bradford Beckmann
GPU Cache Management based on Lightweight Locality Type Detection
US11487671B2, 2022.

[P7] Sooraj Puthoor, Kishore Punniyamurthy, Onur Kayiran, Xianwei Zhang, Yasuko Eckert, Johnathan Alsop and Bradford Michael Beckmann
Memory Request Priority Assignment Techniques for Parallel Processors
US11507522B2, 2022.

[P6] Seyed Mohammad Seyedzadehdelcheh, Xianwei Zhang, Bradford Beckmann and Shomit N. Das
Data Compression System Using Base Values and Methods Thereof
US11740791B2, 2023.

[P5] Anthony T. Gutierrez, Sergey Blagodurov, Scott A. Moe, Xianwei Zhang, Jieming Yin and Matthew D. Sinclair
Selecting a Precision Level for Executing a Workload in an Electronic Device
US11150899B2, 2021.

[P4] Xianwei Zhang, Yuhao Gu, Zhiguang Chen and Yutong Lu
A Programmable Data Exchange System
ZL 2025 1 0077351.0

[P3] Zewei Mo, Xianwei Zhang, Tianao Ge and Yutong Lu
A Compiler-Based Automatic Multi-Stream Scheduling Method for Kernel Functions
ZL 2022 1 0172808.2

[P2] Zewei Mo, Xianwei Zhang and Tianao Ge
An Error-Controllable Automated Optimization Method for Mixed-Precision Operators
ZL 2021 1 1551663.9

[P1] Tianao Ge, Xianwei Zhang, Zewei Mo and Yutong Lu
A Loop-Folding-Based Binary Code Size Optimizer
ZL 2022 1 0154571.5

GPU

2026

[C32] Xianwei Zhang⁼, Xuanteng Huang⁼ and Nong Xiao^✉ (⁼Equal Contribution)
FEDCM: Fine-grained Kernel Scheduling and Management to Improve GPU Sharing (CCF-B, ar/25%)
Design, Automation and Test in Europe Conference (DATE), Verona, Italy, April 2026.

[C30] Bingjie Liu, Zhongchun Zheng and Xianwei Zhang^✉
TensorAgent: Multi-Agent Framework for Automated Tensor Core Code Generation (CCF-C)
International Symposium on Advanced Parallel Processing Technology (APPT), Brussels, Belgium, July 2026.

[J8] Wenyuan Liang, Hengzhong Liang, Han Huang and Xianwei Zhang^✉
Porting LULESH to SYCL: Practical Insights and Performance Analysis (CCF-C)
CCF Transactions on High Performance Computing (THPC), 2026.

2025

[C22] Mengyue Xi, Jingyi He and Xianwei Zhang^✉
CacheC: LLM-based GPU Cache Management to Enhance Kernel Concurrency (CCF-B)
The 31st International European Conference on Parallel and Distributed Computing (Euro-Par), Dresden, Germany, August 2025.

[C20] Mengyue Xi, Tianyu Guo, Xuanteng Huang, Zejia Lin and Xianwei Zhang^✉
Mpache: Interaction Aware Multi-level Cache Bypassing on GPUs (CCF-C, ar/28.6%)
The 30th Asia and South Pacific Design Automation Conference (ASP-DAC), Tokyo Odaiba Miraikan, Japan, January 2025.

2024

[C17] Zejia Lin, Aoyuan Sun, Xianwei Zhang^✉ and Yutong Lu
MixPert: Optimizing Mixed-precision Floating-point Emulation on GPU Integer Tensor Cores (CCF-B)
The 25th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES), Copenhagen, Denmark, June 2024.

2023

[J4] Xuanteng Huang, Xianwei Zhang^✉, Panfei Yang^✉ and Nong Xiao
Benchmarking GPU Tensor Cores on General Matrix Multiplication Kernels through CUTLASS
Applied Sciences, December 2023.

[J3] Xi Zhang^✉, Xiaohu Guo, Yue Weng, Xianwei Zhang, Yutong Lu and Zhong Zhao
Hybrid MPI and CUDA Paralleled Finite Volume Unstructured CFD Simulations on a Multi-GPU System (CCF-C)
Future Generation Computer Systems (FGCS), 139 (2023), February 2023.

2022

[C12] Zewei Mo, Zejia Lin, Xianwei Zhang^✉ and Yutong Lu
moTuner: A Compiler-based Auto-tuning Approach for Mixed-precision Operators (CCF-C)
The 19th ACM International Conference on Computing Frontiers (CF), Turin, Piedmont, Italy, May 2022.

[C11] Yue Weng, Tianao Ge, Xianwei Zhang^✉ and Yutong Lu
RAISE: Efficient GPU Resource Management via Hybrid Scheduling (CCF-C)
The 22nd IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGrid), Taormina (Messina), Italy, May 2022.

High-Performance Computing

2026

[C31] Tengyang Zheng, Han Huang, Junru Chen, Xianwei Zhang^✉ and Yutong Lu^✉
SMEAtten: Fast and Memory-Efficient Outer Product-based Attention on ARMv9 CPUs with SME (CCF-B, ar/26.2%)
The 32nd International European Conference on Parallel and Distributed Computing (Euro-Par), Pisa, Italy, August 2026.

[J8] Wenyuan Liang, Hengzhong Liang, Han Huang and Xianwei Zhang^✉
Porting LULESH to SYCL: Practical Insights and Performance Analysis (CCF-C)
CCF Transactions on High Performance Computing (THPC), 2026.

2025

[J6] Hengzhong Liang, Han Huang and Xianwei Zhang^✉
SuCL: Supply Unified Communication Layer to Improve SYCL-based Heterogeneous Computing (CCF-C)
CCF Transactions on High Performance Computing (THPC), 2025.

[J5] Pin Chen, Qing Mo, Zexin Xu, Xianwei Zhang and Yutong Lu^✉
Star-gen: An HPC-AI Framework for Constructing Large-scale Computational Materials Database (CCF-C)
CCF Transactions on High Performance Computing (THPC), 2025.

2023

[J3] Xi Zhang^✉, Xiaohu Guo, Yue Weng, Xianwei Zhang, Yutong Lu and Zhong Zhao
Hybrid MPI and CUDA Paralleled Finite Volume Unstructured CFD Simulations on a Multi-GPU System (CCF-C)
Future Generation Computer Systems (FGCS), 139 (2023), February 2023.

2022

[C12] Zewei Mo, Zejia Lin, Xianwei Zhang^✉ and Yutong Lu
moTuner: A Compiler-based Auto-tuning Approach for Mixed-precision Operators (CCF-C)
The 19th ACM International Conference on Computing Frontiers (CF), Turin, Piedmont, Italy, May 2022.

Machine Learning and System

2026

[C31] Tengyang Zheng, Han Huang, Junru Chen, Xianwei Zhang^✉ and Yutong Lu^✉
SMEAtten: Fast and Memory-Efficient Outer Product-based Attention on ARMv9 CPUs with SME (CCF-B, ar/26.2%)
The 32nd International European Conference on Parallel and Distributed Computing (Euro-Par), Pisa, Italy, August 2026.

[C30] Bingjie Liu, Zhongchun Zheng and Xianwei Zhang^✉
TensorAgent: Multi-Agent Framework for Automated Tensor Core Code Generation (CCF-C)
International Symposium on Advanced Parallel Processing Technology (APPT), Brussels, Belgium, July 2026.

2025

[C21] Tianyu Guo, Hande Dong^✉, Yichong Leng, Feng Liu, Cheater Lin, Nong Xiao and Xianwei Zhang^✉
EFIM: Efficient Serving of LLMs for Infilling Tasks with Improved KV Cache Reuse (CCF-B)
The 31st International European Conference on Parallel and Distributed Computing (Euro-Par), Dresden, Germany, August 2025.

2024

[C16] Zhaowen Shan, Xuanteng Huang, Zheng Zhou and Xianwei Zhang^✉
openLG: A Tunable and Efficient Open-source LSTM on GPUs (CCF-C)
The International Joint Conference on Neural Networks (IJCNN), Yokohama, Japan, June 2024.

Architecture and Hw/Sw Co-design

2026

[C32] Xianwei Zhang⁼, Xuanteng Huang⁼ and Nong Xiao^✉ (⁼Equal Contribution)
FEDCM: Fine-grained Kernel Scheduling and Management to Improve GPU Sharing (CCF-B, ar/25%)
Design, Automation and Test in Europe Conference (DATE), Verona, Italy, April 2026.

[C31] Tengyang Zheng, Han Huang, Junru Chen, Xianwei Zhang^✉ and Yutong Lu^✉
SMEAtten: Fast and Memory-Efficient Outer Product-based Attention on ARMv9 CPUs with SME (CCF-B, ar/26.2%)
The 32nd International European Conference on Parallel and Distributed Computing (Euro-Par), Pisa, Italy, August 2026.

2025

[C22] Mengyue Xi, Jingyi He and Xianwei Zhang^✉
CacheC: LLM-based GPU Cache Management to Enhance Kernel Concurrency (CCF-B)
The 31st International European Conference on Parallel and Distributed Computing (Euro-Par), Dresden, Germany, August 2025.

[C20] Mengyue Xi, Tianyu Guo, Xuanteng Huang, Zejia Lin and Xianwei Zhang^✉
Mpache: Interaction Aware Multi-level Cache Bypassing on GPUs (CCF-C, ar/28.6%)
The 30th Asia and South Pacific Design Automation Conference (ASP-DAC), Tokyo Odaiba Miraikan, Japan, January 2025.

2024

[C17] Zejia Lin, Aoyuan Sun, Xianwei Zhang^✉ and Yutong Lu
MixPert: Optimizing Mixed-precision Floating-point Emulation on GPU Integer Tensor Cores (CCF-B)
The 25th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES), Copenhagen, Denmark, June 2024.

2022

[C12] Zewei Mo, Zejia Lin, Xianwei Zhang^✉ and Yutong Lu
moTuner: A Compiler-based Auto-tuning Approach for Mixed-precision Operators (CCF-C)
The 19th ACM International Conference on Computing Frontiers (CF), Turin, Piedmont, Italy, May 2022.

[C11] Yue Weng, Tianao Ge, Xianwei Zhang^✉ and Yutong Lu
RAISE: Efficient GPU Resource Management via Hybrid Scheduling (CCF-C)
The 22nd IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGrid), Taormina (Messina), Italy, May 2022.

2019

[C9] Xianwei Zhang, Rujia Wang, Youtao Zhang and Jun Yang
Boosting Chipkill Capability under Retention-error Induced Reliability Emergency (CCF-C)
The 24th Asia and South Pacific Design Automation Conference (ASP-DAC), Tokyo, Japan, January 2019.

2017

[J1] Xianwei Zhang, Youtao Zhang, Bruce R. Childers and Jun Yang
On the Restore Time Variations of Future DRAM Memory (CCF-B)
ACM Trans. on Design Automation of Electronic Systems (TODAES), 22(2), February 2017.

2015

[C5] Xianwei Zhang, Youtao Zhang, Bruce R. Childers and Jun Yang
Exploiting DRAM Restore Time Variations in Deep Sub-micron Scaling (CCF-B)
The IEEE Conference on Design, Automation and Test in Europe (DATE), Grenoble, France, March 2015.

[C4] Xianwei Zhang, Youtao Zhang and Jun Yang
DLB: Dynamic Lane Borrowing for Improving Bandwidth and Performance in Hybrid Memory Cube (CCF-B)
The 33rd IEEE International Conference on Computer Design (ICCD), New York City, USA, October 2015.

[C3] Xianwei Zhang, Youtao Zhang and Jun Yang
TriState-SET: Proactive SET for Improved Performance in MLC Phase Change Memories (CCF-B)
The 33rd IEEE International Conference on Computer Design (ICCD), New York City, USA, October 2015.

[C2] Xianwei Zhang, Lei Zhao, Youtao Zhang and Jun Yang
Exploit Common Source-Line to Construct Energy Efficient Domain Wall Memory based Caches (CCF-B)
The 33rd IEEE International Conference on Computer Design (ICCD), New York City, USA, October 2015.

Compilation and CodeOpt

2026

[C32] Xianwei Zhang⁼, Xuanteng Huang⁼ and Nong Xiao^✉ (⁼Equal Contribution)
FEDCM: Fine-grained Kernel Scheduling and Management to Improve GPU Sharing (CCF-B, ar/25%)
Design, Automation and Test in Europe Conference (DATE), Verona, Italy, April 2026.

[C30] Bingjie Liu, Zhongchun Zheng and Xianwei Zhang^✉
TensorAgent: Multi-Agent Framework for Automated Tensor Core Code Generation (CCF-C)
International Symposium on Advanced Parallel Processing Technology (APPT), Brussels, Belgium, July 2026.

2022

[C12] Zewei Mo, Zejia Lin, Xianwei Zhang^✉ and Yutong Lu
moTuner: A Compiler-based Auto-tuning Approach for Mixed-precision Operators (CCF-C)
The 19th ACM International Conference on Computing Frontiers (CF), Turin, Piedmont, Italy, May 2022.

arXiv

[A7] Yipeng Ouyang, Xin Huang, Bingjie Liu, Zhongchun Zheng, Yuhao Gu and Xianwei Zhang
Benchmarks are Not Enough: Ramp for Runtime Assessing of Agentic Models in Production Systems

[A6] Yipeng Ouyang, Yi Xiao, Yuhao Gu and Xianwei Zhang
SkCC: Portable and Secure Skill Compilation for Cross-Framework LLM Agents

[A5] Yuhao Gu, Zhongchun Zheng, Nong Xiao, Yutong Lu and Xianwei Zhang
Partial Cross-Compilation and Mixed Execution for Accelerating Dynamic Binary Translation

[A4] Tianyu Guo, Tianming Xu, Xianjie Chen, Junru Chen, Nong Xiao and Xianwei Zhang
RServe: Overlapping Encoding and Prefill for Efficient LMM Inference

[A3] Zejia Lin, Hongxin Xu, Guanyi Chen, Zhiguang Chen, Yutong Lu and Xianwei Zhang
Boosting LLM Serving through Spatial-Temporal GPU Resource Sharing (ASPLOS’2026)

[A1] Tian Wu, Liming Wang, Zijian Wen, Xiaoxi Zhang, Jingpu Duan, Xianwei Zhang and Jinhang Zuo
Accelerating Edge Inference for Distributed MoE Models with Latency-Optimized Expert Placement

Issued Patents

[P8] Xianwei Zhang, John Kalamatianos and Bradford Beckmann
GPU Cache Management based on Lightweight Locality Type Detection
US11487671B2, 2022.

[P6] Seyed Mohammad Seyedzadehdelcheh, Xianwei Zhang, Bradford Beckmann and Shomit N. Das
Data Compression System Using Base Values and Methods Thereof
US11740791B2, 2023.

[P4] Xianwei Zhang, Yuhao Gu, Zhiguang Chen and Yutong Lu
A Programmable Data Exchange System
ZL 2025 1 0077351.0

[P3] Zewei Mo, Xianwei Zhang, Tianao Ge and Yutong Lu
A Compiler-Based Automatic Multi-Stream Scheduling Method for Kernel Functions
ZL 2022 1 0172808.2

[P2] Zewei Mo, Xianwei Zhang and Tianao Ge
An Error-Controllable Automated Optimization Method for Mixed-Precision Operators
ZL 2021 1 1551663.9

[P1] Tianao Ge, Xianwei Zhang, Zewei Mo and Yutong Lu
A Loop-Folding-Based Binary Code Size Optimizer
ZL 2022 1 0154571.5