Senior AI Researcher at Qualcomm AI Research
I work on efficient inference systems for large language models, especially KV cache selection, speculative decoding, and long-context serving under resource constraints, with continuing interests in reinforcement learning for combinatorial optimization.
KV cache and long-context inference
Work on KV cache selection, eviction, and long-context serving under memory constraints.
Examples: query-oriented KV selection (QuoKA), KV cache eviction for long contexts (KeyDiff).
Speculative decoding
Systems work on draft-model alignment and recursive decoding for faster language model serving.
Examples: recursive speculative decoding, draft-model alignment for speculative decoding.
Reinforcement learning for combinatorial optimization
Earlier work on reinforcement-learning methods for routing and scheduling, and on open-source libraries for combinatorial optimization.
Examples: parallel autoregressive policies for multi-agent optimization (PARCO), an RL library for combinatorial optimization (RL4CO).
Recent paper acceptances, workshop activity, and research updates.
Training in industrial engineering, optimization, and machine learning at KAIST.