News Source
EXCERPT:
For years, the way large language models handle inference has been stuck inside a box — literally. The high-bandwidth RDMA networks that make modern LLM serving work have confined both prefill and decode to the same datacenter, sometimes even the same rack. A team of researchers at Moonshot AI and Tsinghua University is making the case that this constraint is about to break down — and that the right architecture can already exploit that shift.
The research team introduces Prefill-as-a-Service (PrfaaS), a cross-datacenter serving architecture that selectively offloads long-context prefill to standalone, compute-dense prefill clusters and transfers the resulting KVCache over commodity Ethernet to local PD clusters for decode. The result, in a case study using an internal 1T-parameter hybrid model, is 54% higher serving throughput than a homogeneous PD baseline and 32% higher than a naive heterogeneous setup — while consuming only a fraction of available cross-datacenter bandwidth. The research team note that when compared at equal hardware cost, the throughput gain is approximately 15%, reflecting that the full 54% advantage comes partly from pairing higher-compute H200 GPUs for prefill with H20 GPUs for decode.
