NSF CROSS Project

Collaborative Research: PPoSS: LARGE: Cross-layer Coordination and Optimization for Scalable and Sparse Tensor Networks

News

  • July 2026: Zecheng Li will present our TypeCraft: A Lightweight Data Type Profiler with High Resolution work at OSDI’26 in Seattle, WA, USA.
  • June 2026: Yanbo Zhao will present our Leveraging AI Ecosystem for Portable and Sustainable GPU Kernels in HPC work at ARRAY’26.
  • May 2026: Zhaonan Meng presented our STTID: High-Performance Sparse Tensor-Train Interpolative Decomposition work at IPDPS’26 in New Orleans, LA, USA.
  • Mar 03-06, 2026: Jiajia Li presented at the SIAM Conference on Parallel Processing for Scientific Computing (PP26) in Berlin, Germany.
  • Nov 16-21, 2025: Jiajia Li attended SC’25 in St. Louis, MO as a PC member and session chair, with an accepted paper.
  • Oct 21-22, 2025: Jiajia Li presented at the ORNL Core Universities AI Workshop (AI-CORE) at the University of Tennessee, Knoxville.
  • Oct 16-19, 2025: Jiajia Li participated in the Scialog Quantum Matter and Information (QMI) meeting as a Scialog Fellow.
  • Sep. 12 & 23, 2025: NSF PPoSS CROSS Project 2nd Annual Meeting with Advisory Board.
  • Sep 17-19, 2025: Jiajia Li gave a presentation at the Toulouse Tensor Workshop 2025.

Activities

Projects

Publications

  • STTID: High-Performance Sparse Tensor-Train Interpolative Decomposition.

    Zhaonan Meng, Miles Stoudenmire, Karl Pierce, Frank Mueller, and Jiajia Li.

    IEEE International Parallel and Distributed Processing Symposium (IPDPS).

  • Leveraging AI Ecosystem for Portable and Sustainable GPU Kernels in HPC.

    Yanbo Zhao, Zhaonan Meng, Sai Krishna Teja Varma Manthena, Xu Liu, Ajay Panyala, and Jiajia Li.

    ACM SIGPLAN International Workshop on Libraries, Languages and Compilers for Array Programming (ARRAY), co-located with PLDI.

  • SmartDispatch: Dynamic Substitution of NumPy-style APIs on Heterogenous CPU-GPU Systems.

    Jinku Cui, Yueming Hao, Shuyin Jiao, Jiajia Li, and Xu Liu.

    ACM International Conference on the Foundations of Software Engineering (FSE).

  • G-HEMP: Fast Multi-GPU Private Inference for Large-Scale GCNs with Homomorphic Encryption.

    Ran Ran, Zhaoting Gong, Zhaowei Li, Xianting Lu, Jiajia Li, and Wujie Wen.

    Conference on Machine Learning and Systems (MLSys).

  • Diagonal-Budgeted Trotterization for Efficient Quantum Hamiltonian Simulation.

    Srikar Chundury, Blake Burgstahler, Jiajia Li, In-Saeng Suh, and Frank Mueller.

    ACM International Conference on Supercomputing (ICS).

  • TypeCraft: A Lightweight Data Type Profiler with High Resolution.

    Zecheng Li, Xu Liu, Namhyung Kim, Blake Jones, Alexey Alexandrov, and Jiajia Li.

    USENIX Symposium on Operating Systems Design and Implementation (OSDI).

  • RedSan: Redundant Memory Instruction Sanitizer for GPU Programs.

    Yanbo Zhao, Yueming Hao, Zecheng Li, Shuyin Jiao, Xu Liu, and Jiajia Li.

    The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC).

  • Systolic Array Acceleration of Diagonal-Optimized Sparse-Sparse Matrix Multiplication for Efficient Quantum Simulation.

    Yuchao Su, Srikar Chundury, Jiajia Li, and Frank Mueller.

    ArXiv.

  • ZenFlow: Enabling Stall-Free Offloading Training via Asynchronous Updates.

    Tingfeng Lan, Yusen Wu, Bin Ma, Zhaoyuan Sun, Rui Yang, Tekin Bicer, Dong Li, and Yue Cheng.

    ArXiv.

  • NeuronMM: High-Performance Matrix Multiplication for LLM Inference on AWS Trainium.

    Dinghong Song, Jierui Xu, Weichu Yang, Pengfei Su, and Dong Li.

    ArXiv.

  • Machine Learning-Guided Memory Optimization for DLRM Inference on Tiered Memory.

    Bin Ma, Jie Ren, Shuangyan Yang, Benjamin Francis, Ehsan Ardestani, Min Si, and Dong Li.

    31st International Symposium on High-Performance Computer Architecture.

  • cMPI: Using CXL Memory Sharing for MPI One-Sided and Two-Sided Inter-Node Communications.

    Xi (Sherry) Wang, Bin Ma, Jongryool Kim, Byungil Koh, Hoshik Kim, and Dong Li.

    37th ACM/IEEE International Conference for High Performance Computing, Performance Measurement, Modeling and Tools.

  • Performance Studies of Hypergraph Neural Networks (HGNNs) Using Tensor Representations.

    Ahmed Taimoor.

    M.S. Thesis, North Carolina State University.

  • CONQURE: A Co-Execution Environment for Quantum and Classical Resources.

    Atulya Mahesh and Frank Mueller.

    International Workshop on Integrating High-Performance and Quantum Computing.

  • Fully Parallelized BP Decoding for Quantum LDPC Codes Can Outperform BP-OSD.

    Ming Wang, Ang Li, and Frank Mueller.

    arXiv preprint arXiv:2507.00254.

  • OpenMP-Q: Quantum Task Offloading in OpenMP.

    Swastik Mittal, Atulya Mahesh, and Frank Mueller.

    International Workshop on OpenMP.

  • Swift: High-performance sparse tensor contraction for scientific applications.

    Andrew Ensinger, Gabriel Kulp, Victor Agostinelli, Dennis Lyakhov, and Lizhong Chen.

    arXiv preprint arXiv:2410.10094.

  • FLAASH: Flexible Accelerator Architecture for Sparse High-Order Tensor Contraction.

    Gabriel Kulp, Andrew Ensinger, and Lizhong Chen.

    arXiv preprint arXiv:2404.16317.

  • FlexMem: Adaptive Page Profiling and Migration for Tiered Memory.

    Dong Xu, Junhee Ryu, Jinho Baek, Kwangsik Shin, Pengfei Su, and Dong Li.

    30th USENIX Annual Technical Conference.

  • MTM: Rethinking Memory Profiling and Migration for Multi-Tiered Large Memory Systems.

    Jie Ren, Dong Xu, Junhee Ryu, Kwangsik Shin, Daewoo Kim, and Dong Li.

    European Conference on Computer Systems.

  • Enabling Large Dynamic Neural Network Training with Learning-based Memory Management.

    Jie Ren, Dong Xu, Shuangyan Yang, Jiacheng Zhao, Zhicheng Li, Christian Navasca, Chenxi Wang, Harry Xu, and Dong Li.

    30th International Symposium on High-Performance Computer Architecture (acceptance rate: 18%).

  • Efficient Tensor Offloading for Large Deep-Learning Model Training based on Compute Express Link.

    Dong Xu, Yuan Feng, Kwangsik Shin, Daewoo Kim, Hyeran Jeon, and Dong Li.

    30th USENIX Annual Technical Conference (acceptance rate: 15.7%).

  • FASTEN: Fast GPU-accelerated Segmented Matrix Multiplication for Heterogenous Graph Neural Networks.

    Keren Zhou, Karthik Ganapathi Subramanian, Po-Hsun Lin, Matthias Fey, Binqian Yin, and Jiajia Li.

    The 38th ACM International Conference on Supercomputing (ICS).

  • DiaQ: Efficient State-Vector Quantum Simulation.

    Srikar Chundury, Jiajia Li, In-Saeng Suh, and Frank Mueller.

    ArXiv.

  • Parallel Sparse Tensor-times-Vector on Cerebras WSE-2.

    Sai Krishna Teja Varma Manthena.

    M.S. Thesis, North Carolina State University.

  • DiaQ: A Novel Quantum-Tailored Numerical Format.

    Srikar Chundury.

    M.S. Thesis, North Carolina State University.

  • A PEPS plugin for TNQVM.

    Srikar Chundury, J. Lietz, E. A. C. Perez, A. Shehata, In-Saeng Suh, and Frank Mueller.

    IEEE International Conference on Quantum Computing and Engineering (QCE).

Description

High-dimensional data computation and analytics are gaining importance in various domains, such as quantum chemistry/physics, quantum circuit simulation, social networks, healthcare, and machine/deep learning. Tensors, a representation of high-dimensional data, have become increasingly crucial. While extensive research has focused on tensor methods like decompositions and factorizations for low-dimensional data, there is a notable lack of development in tensor networks that cater to high-dimensional data (over ten dimensions) and can extract physically meaningful latent variables. The challenges arise from their complicated mathematical nature, extremely high computational complexity, and domain-specific difficulties. This project aims to bridge this critical gap by devising efficient tensor networks, especially for sparse data, which are prevalent in many real-world applications. The impacts of the project encompass three aspects: 1) improving data compression, computation, memory usage, and interpretability of tensor networks; 2) fostering enduring and collaborative partnerships among academia, national research labs, and industry with a shared focus on the aforementioned applications; and 3) broadening education avenues by designing relevant new courses, training undergraduate and graduate students, organizing workshops, and enhancing K-12 outreach.


This project delves into Cross-layer cooRdination and Optimization for Scalable and Sparse Tensor Networks (CROSS) designed for heterogeneous systems equipped with diverse accelerators like Graphics Processing Units (GPUs), Tensor Processing Units (TPUs) and Field Programmable Gate Arrays (FPGAs), and various memories such as dynamic and non-volatile random-access memories. This research aims to study sparsity within widely used tensor networks by incorporating constraints, regularization, dictionary, and domain knowledge. In addition to sparsity challenges, sparse tensor networks also face problems such as dimensionality, exacerbated data randomness and irregular program and memory access behaviors. This research tackles these challenges from four dimensions: (1) memory heterogeneity-aware representations and data (re-)arrangement, (2) balanced sparse tensor contraction algorithms with smart page arrangement, (3) memoization and intelligent allocation to reduce computational cost, and (4) specialized accelerator architectures for sparse tensor networks. The optimized sparse tensor networks represent a synergistic effort combining expertise from high-performance computing, algorithms, compilers, computer architecture and performance modeling. The proposed solutions are evaluated under diverse application scenarios and across a wide range of hardware environments to demonstrate their effectiveness and applicability in real-world settings.


Team

Advisory Committee

Contact

Please feel free to email me at jiajia.li@ncsu.edu if you are interested in collaborating on this project.