Research

You can also find my articles on my Google Scholar profile.

METIS: Fast Quality-Aware RAG Systems with Configuration Adaptation
Siddhant Ray, Rui Pan, Zhuohan Gu*, Kuntai Du, Ganesh Ananthanarayanan, Ravi Netravali, Junchen Jiang
SOSP 2025
[paper] [slides]

DroidSpeak: KV Cache Sharing for Cross-LLM Communication and Multi-LLM Serving
Yuhan Liu, Yuyang Huang, Jiayi Yao, Shaoting Feng, Zhuohan Gu, Kuntai Du, Hanchen Li, Yihua Cheng, Junchen Jiang, Shan Lu, Madan Musuvathi, Esha Choukse
NSDI 2026 (to appear)
[paper]

LLMSteer: Improving Long-Context LLM Inference by Steering Attention on Reused Contexts
Zhuohan Gu*, Jiayi Yao*, Kuntai Du, Junchen Jiang
NeurIPS 2024 Workshop on Machine Learning for Systems
[paper] [poster]

AdaptCache: KV Cache Native Storage Hierarchy for Low-Delay and High-Quality Language Model Serving
Shaoting Feng, Hanchen Li, Kuntai Du, Zhuohan Gu, Yuhan Liu, Jiayi Yao, Siddhant Ray, Samuel Shen, Yihua Cheng, Ganesh Ananthanarayanan, Junchen Jiang
SOSP 2025 Workshop on Big Memory (BigMem)
[paper]

Transformer-based Predictions for Sudden Network Changes
Siddhant Ray, Zhuohan Gu, Xi Jiang, Junchen Jiang, Nick Feamster
NSDI 2024 Poster Session
[poster]

An Introduction to Loewner Energy
Zhuohan Gu, Dadu Chen
UChicago Math REU, 2024
[paper]

A Study in Markov Chains, Loop-Erased Random Walk, and Loop Soups
Zhuohan Gu
UChicago Math REU, 2023
[paper]

LMCache: The first open-source Knowledge Delivery Network (KDN) for LLM applications
Speeds up LLM applications by up to 8x, at 8x lower cost.
[code] [website]

vLLM Production Stack
Scale from a single vLLM instance to a distributed vLLM deployment without changing any application code.
[code]

KV Cache Compression and Streaming for Multimodal Large Language Models (MLLMs)
Knowledge Streaming from LLMs to Environments
[website]