News

July 2026: Zecheng Li will present our TypeCraft: A Lightweight Data Type Profiler with High Resolution work at OSDI’26 in Seattle, WA, USA.
June 2026: Yanbo Zhao will present our Leveraging AI Ecosystem for Portable and Sustainable GPU Kernels in HPC work at ARRAY’26.
May 2026: Zhaonan Meng presented our STTID: High-Performance Sparse Tensor-Train Interpolative Decomposition work at IPDPS’26 in New Orleans, LA, USA.
Mar 03-06, 2026: Jiajia Li presented at the SIAM Conference on Parallel Processing for Scientific Computing (PP26) in Berlin, Germany.
Nov 16-21, 2025: Jiajia Li attended SC’25 in St. Louis, MO as a PC member and session chair, with an accepted paper.
Oct 21-22, 2025: Jiajia Li presented at the ORNL Core Universities AI Workshop (AI-CORE) at the University of Tennessee, Knoxville.
Oct 16-19, 2025: Jiajia Li participated in the Scialog Quantum Matter and Information (QMI) meeting as a Scialog Fellow.
Sep 12 & 23, 2025: NSF PPoSS CROSS Project 2nd Annual Meeting with Advisory Board
Sep 17-19, 2025: Jiajia Li gave a presentation at Toulouse Tensor Workshop 2025

Activities

Nov 16-21, 2025: NSF CROSS Project PI Meeting at SC’25
Sep 12, 23, 2025: NSF PPoSS CROSS Project 2nd Annual Meeting with Advisory Board, Online (Agenda HERE)
Nov 17-22, 2024: NSF PPoSS CROSS Project PI Meeting at SC’24
Sep 3, 6, 2024: NSF PPoSS CROSS Project Annual Meeting with Advisory Board, Online (Meeting Report HERE)
XTensor 2024: 1st Workshop on Cross-stack Optimization of Tensor Methods, San Diego, CA

Projects

CROSS: Collaborative Research: PPoSS: LARGE: Cross-layer Coordination and Optimization for Scalable and Sparse Tensor Networks (CROSS)

Lead PI: Jiajia Li; PIs: Frank Mueller (NCSU), Dong Li (UC Merced), Lizhong Chen (OSU)
NSF PPoSS project, 09/07/2023 – 08/31/2028, Total amount: $5,000,000
CROSS: Collaborative Research: PPoSS: Planning: Cross-layer Coordination and Optimization for Scalable and Sparse Tensor Networks (CROSS)

Lead PI: Jiajia Li; PIs: Frank Mueller (NCSU), Dong Li (UC Merced), Lizhong Chen (OSU)
NSF PPoSS project, 10/01/2022 – 09/30/2024, Total amount: $250,000

Publications

TIDES: Tiered Block-Sparse Tensor Contraction with Streaming Transpose on GPUs.
Zecheng Li, Sri Harshavardhan Reddy Deverapalli, Yanbo Zhao, Frank Mueller, Karl Pierce, and Jiajia Li.

The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC). 2026
STTID: High-Performance Sparse Tensor-Train Interpolative Decomposition.
Zhaonan Meng, Miles Stoudenmire, Karl Pierce, Frank Mueller, and Jiajia Li.

IEEE International Parallel and Distributed Processing Symposium (IPDPS). 2026
Leveraging AI Ecosystem for Portable and Sustainable GPU Kernels in HPC.
Yanbo Zhao, Zhaonan Meng, Sai Krishna Teja Varma Manthena, Xu Liu, Ajay Panyala, and Jiajia Li.

ACM SIGPLAN International Workshop on Libraries, Languages and Compilers for Array Programming (ARRAY), co-located with PLDI. 2026
SmartDispatch: Dynamic Substitution of NumPy-style APIs on Heterogenous CPU-GPU Systems.
Jinku Cui, Yueming Hao, Shuyin Jiao, Jiajia Li, and Xu Liu.

ACM International Conference on the Foundations of Software Engineering (FSE). 2026
G-HEMP: Fast Multi-GPU Private Inference for Large-Scale GCNs with Homomorphic Encryption.
Ran Ran, Zhaoting Gong, Zhaowei Li, Xianting Lu, Jiajia Li, and Wujie Wen.

Conference on Machine Learning and Systems (MLSys). 2026
Diagonal-Budgeted Trotterization for Efficient Quantum Hamiltonian Simulation.
Srikar Chundury, Blake Burgstahler, Jiajia Li, In-Saeng Suh, and Frank Mueller.

ACM International Conference on Supercomputing (ICS). 2026
TypeCraft: A Lightweight Data Type Profiler with High Resolution.
Zecheng Li, Xu Liu, Namhyung Kim, Blake Jones, Alexey Alexandrov, and Jiajia Li.

USENIX Symposium on Operating Systems Design and Implementation (OSDI). 2026
From 2^N to N^2: Tree-Free Scalable Sparse Symmetric Tucker Decomposition.
Yongseok Soh, Shruti Shivakumar, Jiajia Li, Jee Choi, and Ramakrishnan Kannan.

International Conference on Parallel Processing (ICPP). 2026
ChunkFlow: Communication-Aware Chunked Prefetching for Layerwise Offloading in Distributed Diffusion Transformer Inference.
Han Meng, Danny Liu, and Dong Li.

ArXiv. 2026
Distributed Generative Inference of LLM at Internet Scales with Multi-Dimensional Communication Optimization.
Jiu Chen, Shuangyan Yang, Xu Xiong, Hexiao Duan, Xinran Zhang, Jie Ren, and Dong Li.

ArXiv. 2026
CoCoDiff: Optimizing Collective Communications for Distributed Diffusion Transformer Inference Under Ulysses Sequence Parallelism.
Bin Ma, Xingjian Ding, Tekin Bicer, Pengfei Su, and Dong Li.

ArXiv. 2026
TierBPF: Page Migration Admission Control for Tiered Memory via eBPF.
Xi Wang, Tal Zussman, Yuang Xu, Bin Ma, Asaf Cidon, and Dong Li.

ArXiv. 2026
RedSan: Redundant Memory Instruction Sanitizer for GPU Programs.
Yanbo Zhao, Yueming Hao, Zecheng Li, Shuyin Jiao, Xu Liu, and Jiajia Li.

The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC). 2025
Systolic Array Acceleration of Diagonal-Optimized Sparse-Sparse Matrix Multiplication for Efficient Quantum Simulation.
Yuchao Su, Srikar Chundury, Jiajia Li, and Frank Mueller.

ArXiv. 2025
ZenFlow: Enabling Stall-Free Offloading Training via Asynchronous Updates.
Tingfeng Lan, Yusen Wu, Bin Ma, Zhaoyuan Sun, Rui Yang, Tekin Bicer, Dong Li and Yue Cheng.

ArXiv. 2025
NeuronMM: High-Performance Matrix Multiplication for LLM Inference on AWS Trainium.
Dinghong Song, Jierui Xu, Weichu Yang, Pengfei Su and Dong Li.

ArXiv. 2025
Machine Learning-Guided Memory Optimization for DLRM Inference on Tiered Memory.
Bin Ma, Jie Ren, Shuangyan Yang, Benjamin Francis, Ehsan Ardestani, Min Si, and Dong Li.

31st International Symposium on High-Performance Computer Architecture. 2025
cMPI: Using CXL Memory Sharing for MPI One-Sided and Two-Sided Inter-Node Communications.
Xi (Sherry) Wang, Bin Ma, Jongryool Kim, Byungil Koh, Hoshik Kim, and Dong Li.

37th ACM/IEEE International Conference for High Performance Computing, Performance Measurement, Modeling and Tools. 2025
Performance Studies of Hypergraph Neural Networks (HGNNs) Using Tensor Representations.
Ahmed Taimoor.

M.S. Thesis, North Carolina State University. 2025
CONQURE: A Co-Execution Environment for Quantum and Classical Resources.
Atulya Mahesh and Frank Mueller.

International Workshop on Integrating High-Performance and Quantum Computing. 2025
Fully Parallelized BP Decoding for Quantum LDPC Codes Can Outperform BP-OSD.
Ming Wang, Ang Li, and Frank Mueller.

arXiv preprint arXiv:2507.00254. 2025
OpenMP-Q: Quantum Task Offloading in OpenMP.
Swastik Mittal, Atulya Mahesh, and Frank Mueller.

International Workshop on OpenMP. 2025
Swift: High-performance sparse tensor contraction for scientific applications.
Andrew Ensinger, Gabriel Kulp, Victor Agostinelli, Dennis Lyakhov, and Lizhong Chen.

arXiv preprint arXiv:2410.10094. 2024
FLAASH: Flexible Accelerator Architecture for Sparse High-Order Tensor Contraction.
Gabriel Kulp, Andrew Ensinger, and Lizhong Chen.

arXiv preprint arXiv:2404.16317. 2024
FlexMem: Adaptive Page Profiling and Migration for Tiered Memory.
Dong Xu, Junhee Ryu, Jinho Baek, Kwangsik Shin, Pengfei Su, and Dong Li.

30th USENIX Annual Technical Conference. 2024 [pdf]
MTM: Rethinking Memory Profiling and Migration for Multi-Tiered Large Memory Systems.
Jie Ren, Dong Xu, Junhee Ryu, Kwangsik Shin, Daewoo Kim, and Dong Li.

European Conference on Computer Systems. 2024 [pdf]
Enabling Large Dynamic Neural Network Training with Learning-based Memory Management.
Jie Ren, Dong Xu, Shuangyan Yang, Jiacheng Zhao, Zhicheng Li, Christian Navasca, Chenxi Wang, Harry Xu, and Dong Li.

30th International Symposium on High-Performance Computer Architecture (acceptance rate: 18%). 2024 [pdf]
Efficient Tensor Offloading for Large Deep-Learning Model Training based on Compute Express Link.
Dong Xu, Yuan Feng, Kwangsik Shin, Daewoo Kim, Hyeran Jeon, and Dong Li.

30th USENIX Annual Technical Conference (acceptance rate: 15.7%). 2024 [pdf]
FASTEN: Fast GPU-accelerated Segmented Matrix Multiplication for Heterogenous Graph Neural Networks
Keren Zhou, Karthik Ganapathi Subramanian, Po-Hsun Lin, Matthias Fey, Binqian Yin, Jiajia Li.

The 38th ACM International Conference on Supercomputing (ICS). 2024
DiaQ: Efficient State-Vector Quantum Simulation
Srikar Chundury, Jiajia Li, In-Saeng Suh and Frank Mueller.

ArXiv. 2024
Parallel Sparse Tensor-times-Vector on Cerebras WSE-2
Sai Krishna Teja Varma Manthena.

M.S. Thesis, North Carolina State University. 2024
DiaQ: A Novel Quantum-Tailored Numerical Format
Srikar Chundury.

M.S. Thesis, North Carolina State University. 2024
A PEPS plugin for TNQVM
Srikar Chundury, J. Lietz, E. A. C. Perez, A. Shehata, In-Saeng Suh and Frank Mueller.

IEEE International Conference on Quantum Computing and Engineering (QCE). 2023

Description

High-dimensional data computation and analytics are gaining importance in various domains, such as quantum chemistry/physics, quantum circuit simulation, social networks, healthcare, and machine/deep learning. Tensors, a representation of high-dimensional data, have become increasingly crucial. While extensive research has focused on tensor methods like decompositions and factorizations for low-dimensional data, there is a notable lack of development in tensor networks that cater to high-dimensional data (over ten dimensions) and can extract physically meaningful latent variables. The challenges arise from their complicated mathematical nature, extremely high computational complexity, and domain-specific difficulties. This project aims to bridge this critical gap by devising efficient tensor networks, especially for sparse data, which are prevalent in many real-world applications. The impacts of the project encompass three aspects: 1) improving data compression, computation, memory usage, and interpretability of tensor networks; 2) fostering enduring and collaborative partnerships among academia, national research labs, and industry with a shared focus on the aforementioned applications; and 3) broadening education avenues by designing relevant new courses, training undergraduate and graduate students, organizing workshops, and enhancing K-12 outreach.

This project delves into Cross-layer cooRdination and Optimization for Scalable and Sparse Tensor Networks (CROSS) designed for heterogeneous systems equipped with diverse accelerators like Graphics Processing Units (GPUs), Tensor Processing Units (TPUs) and Field Programmable Gate Arrays (FPGAs), and various memories such as dynamic and non-volatile random-access memories. This research aims to study sparsity within widely used tensor networks by incorporating constraints, regularization, dictionary, and domain knowledge. In addition to sparsity challenges, sparse tensor networks also face problems such as dimensionality, exacerbated data randomness and irregular program and memory access behaviors. This research tackles these challenges from four dimensions: (1) memory heterogeneity-aware representations and data (re-)arrangement, (2) balanced sparse tensor contraction algorithms with smart page arrangement, (3) memoization and intelligent allocation to reduce computational cost, and (4) specialized accelerator architectures for sparse tensor networks. The optimized sparse tensor networks represent a synergistic effort combining expertise from high-performance computing, algorithms, compilers, computer architecture and performance modeling. The proposed solutions are evaluated under diverse application scenarios and across a wide range of hardware environments to demonstrate their effectiveness and applicability in real-world settings.
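
As a rough illustration of the sparse tensor contraction mentioned above, the following is a minimal Python sketch and not the project's implementation. It assumes a simple coordinate-style storage (a dict from index tuples to nonzero values) and contracts two tensors over one shared mode. The function name contract_sparse and its parameters are hypothetical, chosen only for this example.

```python
from collections import defaultdict


def contract_sparse(a, b, a_axis, b_axis):
    """Contract sparse tensors a and b over a single shared mode.

    a, b: dicts mapping index tuples to nonzero values (assumed storage format).
    a_axis, b_axis: position of the contracted mode in each index tuple.
    Returns a dict whose keys are a's free indices followed by b's free indices.
    """
    # Group b's nonzeros by their contracted index so that each nonzero
    # of a only visits the entries of b it can actually pair with.
    b_by_key = defaultdict(list)
    for idx, val in b.items():
        free = idx[:b_axis] + idx[b_axis + 1:]
        b_by_key[idx[b_axis]].append((free, val))

    out = defaultdict(float)
    for idx, val in a.items():
        free_a = idx[:a_axis] + idx[a_axis + 1:]
        for free_b, val_b in b_by_key.get(idx[a_axis], ()):
            out[free_a + free_b] += val * val_b
    return dict(out)


# A 2-mode tensor contracted with a 3-mode tensor over a's mode 1 and b's mode 0.
a = {(0, 1): 2.0, (1, 2): 3.0}
b = {(1, 0, 0): 4.0, (2, 1, 1): 5.0, (0, 0, 1): 7.0}
c = contract_sparse(a, b, a_axis=1, b_axis=0)
print(c)  # {(0, 0, 0): 8.0, (1, 1, 1): 15.0}
```

Only nonzero pairs that share a contracted index do any work, so the cost tracks the number of matching nonzeros rather than the dense tensor sizes. The irregular, data-dependent lookups in the inner loop are a small-scale instance of the irregular memory access behavior the description refers to.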

Team

Dr. Jiajia Li, Assistant Professor @ NCSU
Dr. Frank Mueller, Professor @ NCSU
Dr. Dong Li, Associate Professor @ UC Merced
Dr. Lizhong Chen, Professor @ Oregon State
Zecheng Li, PhD student @ NCSU
Zhaonan Meng, PhD student @ NCSU
Rahmy Salman, PhD student @ NCSU
Devadatta Mandaogane, PhD student @ NCSU
Yanbo Zhao, PhD student @ NCSU
Srikar Chundury, PhD student @ NCSU
Yuchao Su, PhD student @ NCSU
Blake Burgstahler, PhD student @ NCSU
Xi (Sherry) Wang, PhD student @ UC Merced
Dinghong Song, PhD student @ UC Merced
Bin Ma, PhD student @ UC Merced
Anthony Kung, PhD student @ Oregon State
Adrian Alupoaei, PhD student @ Oregon State
Raymond Baartmans, PhD student @ Oregon State

Advisory Committee

Contact

Please feel free to email me at jiajia.li@ncsu.edu if you are interested in collaborating on this project.

NSF CROSS Project

Collaborative Research: PPoSS: LARGE: Cross-layer Coordination and Optimization for Scalable and Sparse Tensor Networks