Research

You can also find my publications on my Google Scholar profile.

Featured Recent Papers

  • PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents
    Zhuohan Gu, Qizheng Zhang, Omar Khattab, Samuel Madden
    arXiv preprint, 2026
  • DroidSpeak: KV Cache Sharing for Cross-LLM Communication and Multi-LLM Serving
    Yuhan Liu, Yuyang Huang, Jiayi Yao, Shaoting Feng, Zhuohan Gu, Kuntai Du, Hanchen Li, Yihua Cheng, Junchen Jiang, Shan Lu, Madan Musuvathi, Esha Choukse
    NSDI 2026
  • METIS: Fast Quality-Aware RAG Systems with Configuration Adaptation
    Siddhant Ray, Rui Pan, Zhuohan Gu, Kuntai Du, Ganesh Ananthanarayanan, Ravi Netravali, Junchen Jiang
    SOSP 2025

Featured Recent Projects

  • Carnot: Interpretable, Interactive, and Optimized Execution of Deep Research Queries
    Enterprise-grade system that improves the quality, cost, and latency of execution plans generated by Deep Research agents over private data warehouses.

Other Papers

  • AdaptCache: KV Cache Native Storage Hierarchy for Low-Delay and High-Quality Language Model Serving
    Shaoting Feng, Hanchen Li, Kuntai Du, Zhuohan Gu, Yuhan Liu, Jiayi Yao, Siddhant Ray, Samuel Shen, Yihua Cheng, Ganesh Ananthanarayanan, Junchen Jiang
    SOSP 2025 Workshop on Big Memory (BigMem)
  • LLMSteer: Improving Long-Context LLM Inference by Steering Attention on Reused Contexts
    Zhuohan Gu*, Jiayi Yao*, Kuntai Du, Junchen Jiang
    NeurIPS 2024 Workshop on Machine Learning for Systems
  • Transformer-based Predictions for Sudden Network Changes
    Siddhant Ray, Zhuohan Gu, Xi Jiang, Junchen Jiang, Nick Feamster
    NSDI 2024 Poster Session
  • An Introduction to Loewner Energy
    Zhuohan Gu, Dadu Chen
    UChicago Math REU, 2024
  • A Study in Markov Chains, Loop-Erased Random Walk, and Loop Soups
    Zhuohan Gu
    UChicago Math REU, 2023

Other Projects

  • LMCache: The first open-source Knowledge Delivery Network (KDN) for LLM applications
    Makes LLM applications up to 8x faster, at 8x lower cost.
  • vLLM Production Stack
    Scales from a single vLLM instance to a distributed vLLM deployment without changing any application code.
  • KV Cache Compression and Streaming for Multimodal Large Language Models (MLLMs)
  • Knowledge Streaming from LLMs to Environments