vllm源码学习

本学习主要围绕vllm的一些特性展开

kv_cache特性

路径vllm/v1/core/kv_cache_manager.py

Request 到来
  ↓
Scheduler 判断本轮能不能调度
  ↓
KVCacheManager.get_computed_blocks()
  ↓
查 Prefix Cache，得到可复用 blocks
  ↓
KVCacheManager.allocate_slots()
  ↓
向 Block Pool 申请新 blocks
  ↓
返回 block_ids / block table 给执行层
  ↓
PagedAttention kernel 根据 block table 读取非连续 KV blocks
  ↓
生成结束后 free(request)，释放或保留可缓存 blocks

这段代码是 vLLM v1 架构中核心的显存管理层实现。
它位于调度器（Scheduler）和底层物理显存池（Block Pool）之间，
主要负责 PagedAttention 机制下的显存块分配、回收，以及 Prefix Caching（前缀缓存）功能。

代码主要由两个类构成：

KVCacheBlocks：调度器与管理器之间的隔离层
KVCacheManager：缓存与分配的控制中枢

@dataclass
class KVCacheBlocks:
    # 这是一个数据类，本质上是对物理显存块列表的封装。
    # 它的作用是隐藏 KVCacheManager 内部的数据结构，
    # 只向上层调度器暴露稳定接口。

    blocks: tuple[Sequence[KVCacheBlock], ...]
    # blocks 是一个元组，支持多个 kv_cache_group。
    # 这么设计是为了兼容未来不同组使用不同 block_size 的架构。

    def __add__(self, other: "KVCacheBlocks") -> "KVCacheBlocks":
        # 把两个 KVCacheBlocks 对象按组拼接起来。
        # 这个操作常用于把新算出的块和已有块合并。
        return KVCacheBlocks(
            tuple(
                list(itertools.chain(blk1, blk2))
                for blk1, blk2 in zip(self.blocks, other.blocks)
            )
        )

    @overload
    def get_block_ids(
        self,
        allow_none: Literal[False] = False,
    ) -> tuple[list[int], ...]: ...

    @overload
    def get_block_ids(
        self,
        allow_none: Literal[True] = True,
    ) -> tuple[list[int], ...] | None: ...

    def get_block_ids(
        self,
        allow_none: bool = False,
    ) -> tuple[list[int], ...] | None:
        # 把 KVCacheBlocks 转换成 block_id 列表，方便调度器读取。
        # 如果 allow_none=True 且所有组都为空，就直接返回 None。
        if allow_none and all(len(group) == 0 for group in self.blocks):
            return None
        return tuple([blk.block_id for blk in group] for group in self.blocks)

    def get_unhashed_block_ids(self) -> list[int]:
        # 返回尚未进行哈希，也就是还没有进入前缀缓存管理的块 ID。
        # 这里默认只支持单组场景。
        assert len(self.blocks) == 1, "Only one group is supported"
        return [block.block_id for block in self.blocks[0] if block.block_hash is None]

    def get_unhashed_block_ids_all_groups(self) -> list[list[int]]:
        # 按组返回未哈希块 ID，并跳过 padding block。
        return [
            [
                block.block_id
                for block in group
                if block.block_hash is None and not block.is_null
            ]
            for group in self.blocks
        ]

    def new_empty(self) -> "KVCacheBlocks":
        # 创建一个与当前组数一致、但不包含任何块的新对象。
        return KVCacheBlocks(tuple(() for _ in range(len(self.blocks))))

class KVCacheManager:
    def __init__(
        self,
        kv_cache_config: KVCacheConfig,
        max_model_len: int,
        hash_block_size: int,
        max_num_batched_tokens: int | None = None,
        enable_caching: bool = True,
        use_eagle: bool = False,
        log_stats: bool = False,
        enable_kv_cache_events: bool = False,
        dcp_world_size: int = 1,
        pcp_world_size: int = 1,
        metrics_collector: KVCacheMetricsCollector | None = None,
    ) -> None:
        # KVCacheManager 是管理 KV Cache 生命周期的大脑。
        # 它处理常规生成请求，也兼容投机解码（Speculative Decoding）、
        # 多模态外部缓存接入等复杂场景。

        self.max_model_len = max_model_len
        # 如果调度器没有传入上限，就回退到 max_model_len。
        if max_num_batched_tokens is None:
            max_num_batched_tokens = max_model_len

        self.enable_caching = enable_caching
        self.use_eagle = use_eagle
        self.log_stats = log_stats
        self.metrics_collector = metrics_collector

        # prefix_cache_stats 只在开启日志时创建。
        self.prefix_cache_stats = PrefixCacheStats() if log_stats else None

        # coordinator 是真正负责底层 block pool 和 cache 组织的协作者。
        self.coordinator = get_kv_cache_coordinator(
            kv_cache_config=kv_cache_config,
            max_model_len=self.max_model_len,
            max_num_batched_tokens=max_num_batched_tokens,
            use_eagle=self.use_eagle,
            enable_caching=self.enable_caching,
            enable_kv_cache_events=enable_kv_cache_events,
            dcp_world_size=dcp_world_size,
            pcp_world_size=pcp_world_size,
            hash_block_size=hash_block_size,
            metrics_collector=self.metrics_collector,
        )
        self.num_kv_cache_groups = len(kv_cache_config.kv_cache_groups)
        self.block_pool = self.coordinator.block_pool
        self.kv_cache_config = kv_cache_config

        # 提前构造一个空 KVCacheBlocks，避免频繁创建空对象带来 GC 开销。
        self.empty_kv_cache_blocks = KVCacheBlocks(
            tuple(() for _ in range(self.num_kv_cache_groups))
        )

    @property
    def usage(self) -> float:
        # 返回当前 KV cache 使用率。
        return self.block_pool.get_usage()

    def make_prefix_cache_stats(self) -> PrefixCacheStats | None:
        # 读取并重置 prefix cache 的统计信息。
        # 如果没有开启 log_stats，就直接返回 None。
        if not self.log_stats:
            return None
        stats = self.prefix_cache_stats
        self.prefix_cache_stats = PrefixCacheStats()
        return stats

    def get_computed_blocks(self, request: Request) -> tuple[KVCacheBlocks, int]:
        # 核心机制一：前缀缓存匹配。
        # 当启用了 enable_caching 时，这个方法会在处理新请求前被调用。
        # 它会根据 request.block_hashes 去底层 coordinator 查找是否已经存在
        # 可复用的 KV Cache。

        # 典型场景是 System Prompt 复用或多轮对话。
        # 如果命中，就直接返回已经缓存好的物理块列表，以及命中的 token 数量。
        # 这样可以跳过对应前缀的 Attention 计算，从而显著降低首字延迟（TTFT）。

        if not self.enable_caching or request.skip_reading_prefix_cache:
            return self.empty_kv_cache_blocks, 0

        max_cache_hit_length = request.num_tokens - 1
        computed_blocks, num_new_computed_tokens = (
            self.coordinator.find_longest_cache_hit(
                request.block_hashes, max_cache_hit_length
            )
        )

        if self.log_stats:
            assert self.prefix_cache_stats is not None
            self.prefix_cache_stats.record(
                num_tokens=request.num_tokens,
                num_hits=num_new_computed_tokens,
                preempted=request.num_preemptions > 0,
            )

        return self.create_kv_cache_blocks(computed_blocks), num_new_computed_tokens

def allocate_slots(
    self,
    request: Request,
    num_new_tokens: int,
    num_new_computed_tokens: int = 0,
    new_computed_blocks: KVCacheBlocks | None = None,
    num_lookahead_tokens: int = 0,
    num_external_computed_tokens: int = 0,
    delay_cache_blocks: bool = False,
    num_encoder_tokens: int = 0,
    full_sequence_must_fit: bool = False,
) -> KVCacheBlocks | None:
    # 核心机制二：复杂的槽位分配。
    # 这是整个类里最核心、也最复杂的函数。
    # 它负责为一个请求在当前推理步申请物理显存块。

    # 代码中把请求在显存中的逻辑布局划分得很细：
    # 已经计算或缓存的部分：
    #   comp      = 当前请求之前已经计算完的 token
    #   new_comp  = 刚刚通过前缀缓存匹配到的 token
    #   ext_comp  = 由外部组件（比如多模态 connector）计算并传入的 token
    # 需要新分配空间的部分：
    #   new       = 当前步骤即将输入或生成的新 token
    #   lookahead = 为投机解码预留的额外槽位，用来存放多个未来草稿 token

    # 分配流程可以概括为四步：
    # 1. 容量预检：提前判断 full sequence 是否能放下，避免请求中途 OOM。
    # 2. 清理废弃块：比如滑动窗口外的旧 token。
    # 3. 物理分配：向 block pool 申请真正的物理块。
    # 4. 状态固化：把已经确认的 token 写入前缀缓存池，供后续请求复用。

    # 当加载外部 KV 数据时，可能没有 new tokens，但仍然需要分配槽位。
    if num_new_tokens == 0 and num_external_computed_tokens == 0:
        raise ValueError(
            "num_new_tokens must be greater than 0 when there are no external computed tokens"
        )

    # new_computed_blocks 为空时，统一用空对象，避免后面写分支判断。
    if new_computed_blocks is not None:
        new_computed_block_list = new_computed_blocks.blocks
    else:
        new_computed_block_list = self.empty_kv_cache_blocks.blocks

    # 统计当前已经计算完成的 token 数量。
    num_local_computed_tokens = request.num_computed_tokens + num_new_computed_tokens
    total_computed_tokens = min(
        num_local_computed_tokens + num_external_computed_tokens,
        self.max_model_len,
    )

    # full_sequence_must_fit 表示：必须提前确认整段序列都能放下。
    if full_sequence_must_fit:
        full_num_tokens = min(request.num_tokens, self.max_model_len)

        num_blocks_to_allocate = self.coordinator.get_num_blocks_to_allocate(
            request_id=request.request_id,
            num_tokens=full_num_tokens,
            new_computed_blocks=new_computed_block_list,
            num_encoder_tokens=num_encoder_tokens,
            total_computed_tokens=total_computed_tokens,
            num_tokens_main_model=full_num_tokens,
            apply_admission_cap=True,
        )
        if num_blocks_to_allocate > self.block_pool.get_num_free_blocks():
            return None

    # 新请求本轮主模型总 token 数。
    num_tokens_main_model = total_computed_tokens + num_new_tokens
    # 需要预留的总槽位数，包含 lookahead。
    num_tokens_need_slot = min(
        num_tokens_main_model + num_lookahead_tokens, self.max_model_len
    )

    # 先释放 attention 不再需要的旧块，比如滑动窗口外的 token。
    self.coordinator.remove_skipped_blocks(request.request_id, total_computed_tokens)

    num_blocks_to_allocate = self.coordinator.get_num_blocks_to_allocate(
        request_id=request.request_id,
        num_tokens=num_tokens_need_slot,
        new_computed_blocks=new_computed_block_list,
        num_encoder_tokens=num_encoder_tokens,
        total_computed_tokens=num_local_computed_tokens + num_external_computed_tokens,
        num_tokens_main_model=num_tokens_main_model,
    )

    if num_blocks_to_allocate > self.block_pool.get_num_free_blocks():
        return None

    # 如果有新算出的前缀块或者外部块，先把它们挂到请求上，避免后续分配失败。
    if (
        new_computed_block_list is not self.empty_kv_cache_blocks.blocks
        or num_external_computed_tokens > 0
    ):
        self.coordinator.allocate_new_computed_blocks(
            request_id=request.request_id,
            new_computed_blocks=new_computed_block_list,
            num_local_computed_tokens=num_local_computed_tokens,
            num_external_computed_tokens=num_external_computed_tokens,
        )

    # 真正分配当前步需要的新块。
    new_blocks = self.coordinator.allocate_new_blocks(
        request.request_id,
        num_tokens_need_slot,
        num_tokens_main_model,
        num_encoder_tokens,
    )

    # 如果禁用了缓存，或者这是延迟缓存路径，就只返回新块，不写回 prefix cache。
    if not self.enable_caching or delay_cache_blocks:
        return self.create_kv_cache_blocks(new_blocks)

    # 只缓存最终确认的 token，避免把可能被 reject 的 draft token 写进去。
    num_tokens_to_cache = min(
        total_computed_tokens + num_new_tokens,
        request.num_tokens,
    )
    self.coordinator.cache_blocks(request, num_tokens_to_cache)

    return self.create_kv_cache_blocks(new_blocks)

def free(self, request: Request) -> None:
    # 核心机制三：资源回收与维护。
    # 当请求生成完毕或被中止时，将其持有的所有物理块交还给底层池。
    self.coordinator.free(request.request_id)


def remove_skipped_blocks(self, request_id: str, total_computed_tokens: int) -> None:
    # 移除 attention 计算中不再需要的块，并用 null block 替换被删除位置。
    self.coordinator.remove_skipped_blocks(request_id, total_computed_tokens)


def evict_blocks(self, block_ids: set[int]) -> None:
    # 从前缀缓存中驱逐指定块，通常用于缓存池满了之后执行 LRU 式淘汰。
    self.block_pool.evict_blocks(block_ids)


def reset_prefix_cache(self) -> bool:
    # 重置 prefix cache。训练权重更新或基准测试时会用到。
    if not self.block_pool.reset_prefix_cache():
        return False
    if self.log_stats:
        assert self.prefix_cache_stats is not None
        self.prefix_cache_stats.reset = True
    return True


def get_num_common_prefix_blocks(self, running_request_id: str) -> list[int]:
    # 统计当前正在运行的所有请求共享了多少个前缀块。
    # 这个指标对于调度策略优化和缓存效率评估非常关键。
    return self.coordinator.get_num_common_prefix_blocks(running_request_id)


def take_events(self) -> list[KVCacheEvent]:
    # 取出 block pool 中积累的 KV cache 事件，通常用于统计、调试或上报。
    return self.block_pool.take_events()


def get_blocks(self, request_id: str) -> KVCacheBlocks:
    # 获取某个请求当前持有的 blocks，并重新封装成 KVCacheBlocks。
    return self.create_kv_cache_blocks(self.coordinator.get_blocks(request_id))


def get_block_ids(self, request_id: str) -> tuple[list[int], ...]:
    # 直接返回某个请求的 block_id 列表，方便上层调度器读取。
    return self.get_blocks(request_id).get_block_ids()


def cache_blocks(self, request: Request, num_computed_tokens: int) -> None:
    # 把某个请求已经确认完成的 blocks 写入 prefix cache。
    if self.enable_caching:
        self.coordinator.cache_blocks(request, num_computed_tokens)


def create_kv_cache_blocks(self, blocks: tuple[list[KVCacheBlock], ...]) -> KVCacheBlocks:
    # 只有在 blocks 非空时才创建新对象，否则直接复用空对象，减少额外分配。
    return KVCacheBlocks(blocks) if any(blocks) else self.empty_kv_cache_blocks


def take_new_block_ids(self) -> list[int]:
    # 取出新分配的 attention block IDs，供后续做清零或初始化处理。
    ids: list[int] = []
    for mgr in self.coordinator.single_type_managers:
        ids.extend(mgr.take_new_block_ids())
    return ids


def new_step_starts(self) -> None:
    # 当前一步开始时调用，用来推进 coordinator 的内部状态机。
    self.coordinator.new_step_starts()

def allocate_slots(
    self,
    request: Request,
    num_new_tokens: int,
    num_new_computed_tokens: int = 0,
    new_computed_blocks: KVCacheBlocks | None = None,
    num_lookahead_tokens: int = 0,
    num_external_computed_tokens: int = 0,
    delay_cache_blocks: bool = False,
    num_encoder_tokens: int = 0,
    full_sequence_must_fit: bool = False,
) -> KVCacheBlocks | None:
    # 核心机制二：复杂的槽位分配。
    # 这段逻辑负责为当前请求在本轮推理中申请、回收并缓存 KV blocks。
    # """Add slots for a request with new tokens to append.

    # Args:
    #     request: The request to allocate slots.
    #     num_new_tokens: The number of new tokens to be allocated and computed.
    #     num_new_computed_tokens: The number of new computed tokens just
    #         hitting the prefix caching, excluding external tokens.
    #     new_computed_blocks: The cached blocks for the above new computed
    #         tokens, grouped as a tuple by kv cache groups.
    #     num_lookahead_tokens: The number of speculative tokens to allocate.
    #         This is used by spec decode proposers with kv-cache such
    #         as eagle.
    #     num_external_computed_tokens: The number of tokens that their
    #         KV caches are not cached by vLLM but cached by the connector.
    #     delay_cache_blocks: Whether to skip caching the blocks. This is
    #         used by P/D when allocating blocks used in a KV transfer
    #         which will complete in a future step.
    #     num_encoder_tokens: The number of encoder tokens to allocate for
    #         cross-attention in encoder-decoder models(e.g., Whisper).
    #         For decoder-only models, this should be 0.
    #     full_sequence_must_fit: Only allocate blocks if the KV cache has enough
    #         free blocks to hold the full sequence, accounting for prefix cache hits
    #         and sliding window. Used as an admission gate to prevent over-admitting
    #         requests when chunked prefill would otherwise only check the first chunk

    # Blocks layout:
    # ```
    # ----------------------------------------------------------------------
    # | < comp > | < new_comp > | < ext_comp >  | < new >  | < lookahead > |
    # ----------------------------------------------------------------------
    #                                           |   < to be computed >     |
    # ----------------------------------------------------------------------
    #                           |            < to be allocated >           |
    # ----------------------------------------------------------------------
    #                           | < to be cached (roughly, |
    #                           | details below)>          |
    # ----------------------------------------------------------------------
    # | Prefix-cached tokens from either vLLM   |
    # | or connector. Can be safely removed if  |
    # | they are outside sliding window.        |
    # ----------------------------------------------------------------------
    # |   < cached by vLLM >    | not cached by |
    #                           | vLLM, but     |
    # | ref_cnt  | ref_cnt not  | cached by     |
    # | increased| increased yet| connector     |
    # ----------------------------------------------------------------------
    # ```

    # Abbrivations:

    # ```
    # comp      = request.num_computed_tokens
    # new_comp  = num_new_computed_tokens
    #           = len(new_computed_blocks) * block_size
    # ext_comp  = num_external_computed_tokens, cached by the connector
    # new       = num_new_tokens, including unverified draft tokens
    # lookahead = num_lookahead_tokens
    # ```

    # NOTE: for new tokens which include both verified and unverified draft
    # tokens, we only cache the verified tokens (by capping the number at
    # `request.num_tokens`).

    # The allocation has three stages:
    # - Free unnecessary blocks in `comp` and check
    #    if we have sufficient free blocks (return None if not).
    # - Handle prefix tokens (`comp + new_comp + ext_comp`):
    #     - Free unnecessary blocks (e.g. outside sliding window)
    #     - Allocate new blocks for `ext_comp` tokens inside
    #       sliding window
    # - Allocate new blocks for tokens to be computed (`new + lookahead`)

    # Returns:
    #     A list of new allocated blocks.
    # """
    # # When loading KV data asynchronously, we may have zero new tokens to
    # # compute while still allocating slots for externally computed tokens.
    # if num_new_tokens == 0 and num_external_computed_tokens == 0:
    #     raise ValueError(
    #         "num_new_tokens must be greater than 0 when there are no "
    #         "external computed tokens"
    #     )

    # if new_computed_blocks is not None:
    #     new_computed_block_list = new_computed_blocks.blocks
    # else:
    #     new_computed_block_list = self.empty_kv_cache_blocks.blocks

    # # The number of computed tokens is the number of computed tokens plus
    # # the new prefix caching hits
    # num_local_computed_tokens = (
    #     request.num_computed_tokens + num_new_computed_tokens
    # )
    # total_computed_tokens = min(
    #     num_local_computed_tokens + num_external_computed_tokens,
    #     self.max_model_len,
    # )

    # if full_sequence_must_fit:
    #     # First check and fail if the full request sequence won't fit.
    #     full_num_tokens = min(request.num_tokens, self.max_model_len)

    #     num_blocks_to_allocate = self.coordinator.get_num_blocks_to_allocate(
    #         request_id=request.request_id,
    #         num_tokens=full_num_tokens,
    #         new_computed_blocks=new_computed_block_list,
    #         num_encoder_tokens=num_encoder_tokens,
    #         total_computed_tokens=total_computed_tokens,
    #         num_tokens_main_model=full_num_tokens,
    #         apply_admission_cap=True,
    #     )
    #     if num_blocks_to_allocate > self.block_pool.get_num_free_blocks():
    #         return None

    # num_tokens_main_model = total_computed_tokens + num_new_tokens
    # num_tokens_need_slot = min(
    #     num_tokens_main_model + num_lookahead_tokens, self.max_model_len
    # )

    # # Free the blocks that are skipped during the attention computation
    # # (e.g., tokens outside the sliding window).
    # # We can do this even if we cannot schedule this request due to
    # # insufficient free blocks.
    # # Should call this function before allocating new blocks to reduce
    # # the number of evicted blocks.
    # 先释放滑动窗口外等不再需要的旧块，即使本轮最终分配失败也可以做。
    self.coordinator.remove_skipped_blocks(
        request.request_id, total_computed_tokens
    )

    num_blocks_to_allocate = self.coordinator.get_num_blocks_to_allocate(
        request_id=request.request_id,
        num_tokens=num_tokens_need_slot,
        new_computed_blocks=new_computed_block_list,
        num_encoder_tokens=num_encoder_tokens,
        total_computed_tokens=num_local_computed_tokens
        + num_external_computed_tokens,
        num_tokens_main_model=num_tokens_main_model,
    )

    if num_blocks_to_allocate > self.block_pool.get_num_free_blocks():
        # Cannot allocate new blocks
        return None

    # 如果有新命中的前缀块或外部计算块，先挂到 request 上，避免后续分配失败时状态丢失。
    if (
        new_computed_block_list is not self.empty_kv_cache_blocks.blocks
        or num_external_computed_tokens > 0
    ):
        # Append the new computed blocks to the request blocks until now to
        # avoid the case where the new blocks cannot be allocated.
        self.coordinator.allocate_new_computed_blocks(
            request_id=request.request_id,
            new_computed_blocks=new_computed_block_list,
            num_local_computed_tokens=num_local_computed_tokens,
            num_external_computed_tokens=num_external_computed_tokens,
        )

    new_blocks = self.coordinator.allocate_new_blocks(
        request.request_id,
        num_tokens_need_slot,
        num_tokens_main_model,
        num_encoder_tokens,
    )

    # P/D: delay caching blocks if we have to recv from
    # remote. Update state for locally cached blocks.
    if not self.enable_caching or delay_cache_blocks:
        return self.create_kv_cache_blocks(new_blocks)

    # 只把最终确认会保留的 token 写回 prefix cache，避免把草稿 token 过早缓存。
    # NOTE(woosuk): We want to commit (cache) up to num_local_computed_tokens
    # + num_external_computed_tokens + num_new_tokens, but must exclude
    # "non-committable" tokens (e.g., draft tokens that could be rejected).
    # Therefore, we cap the number at `request.num_tokens`, ensuring only
    # "finalized" tokens are cached.
    num_tokens_to_cache = min(
        total_computed_tokens + num_new_tokens,
        request.num_tokens,
    )
    self.coordinator.cache_blocks(request, num_tokens_to_cache)

    return self.create_kv_cache_blocks(new_blocks)

def free(self, request: Request) -> None:
    # 请求结束或被中止时，把它持有的 blocks 交还给底层池。
    """Free the blocks allocated for the request.
    We free the blocks in reverse order so that the tail blocks are evicted
    first when caching is enabled.

    Args:
        request: The request to free the blocks.
    """
    self.coordinator.free(request.request_id)

def remove_skipped_blocks(
    self, request_id: str, total_computed_tokens: int
) -> None:
    # 移除 attention 计算中已经不需要的块，通常用于滑动窗口场景。
    """Remove the blocks that are no longer needed from `blocks` and replace
    the removed blocks with null_block.

    Args:
        request_id: The request ID.
        total_computed_tokens: The total number of computed tokens, including
            local computed tokens and external computed tokens.
    """
    self.coordinator.remove_skipped_blocks(request_id, total_computed_tokens)

def evict_blocks(self, block_ids: set[int]) -> None:
    # 按 block_id 强制驱逐前缀缓存中的指定块。
    """evict blocks from the prefix cache by their block IDs.

    Args:
        block_ids: Set of block IDs to evict from cache.
    """
    self.block_pool.evict_blocks(block_ids)

def reset_prefix_cache(self) -> bool:
    # 重置 prefix cache，常用于权重更新后的失效清理或 benchmark。
    """Reset prefix cache. This function may be used in RLHF
    flows to invalidate prefix caching after the weights are updated,
    or used for resetting prefix caching status for benchmarking.

    Returns:
        bool: True if the prefix cache is successfully reset,
        False otherwise.
    """
    if not self.block_pool.reset_prefix_cache():
        return False
    if self.log_stats:
        assert self.prefix_cache_stats is not None
        self.prefix_cache_stats.reset = True
    return True

def get_num_common_prefix_blocks(self, running_request_id: str) -> list[int]:
    # 统计当前运行请求共享了多少个前缀块，方便判断缓存命中程度。
    """Calculate the number of common prefix blocks for each kv cache group.

    The function selects a running request and iterates through its blocks.
    A block is considered a common prefix block if ALL requests with
    allocated KV cache share it (i.e., ref_cnt equals the number of entries
    in req_to_blocks).

    NOTE(woosuk): The number of requests with allocated KV cache is **greater
    than or equal to** the number of requests scheduled in the current step.
    This is because having allocated KV cache only indicates that:
    1. The request has not yet finished, and
    2. The request holds its blocks unfreed.

    While all scheduled requests must have allocated KV cache, the inverse
    is not necessarily true. There may be requests with allocated KV cache
    that are not scheduled in the current step.

    This can result in an edge case where the number of common prefix blocks
    is 0, even though all scheduled requests share a common prefix. This
    occurs because there may be unscheduled requests that do not share the
    common prefix. Currently, this case cannot be easily detected, so the
    function returns 0 in such cases.

    Args:
        running_request_id: The request ID of any running request, used to
            identify the common prefix blocks.

    Returns:
        list[int]: The number of common prefix blocks for each kv cache
        group.
    """
    return self.coordinator.get_num_common_prefix_blocks(running_request_id)

def take_events(self) -> list[KVCacheEvent]:
    # 取出 block pool 中积累的 KV cache 事件，通常用于统计或调试。
    """Take the KV cache events from the block pool.

    Returns:
        A list of KV cache events.
    """
    return self.block_pool.take_events()

def get_blocks(self, request_id: str) -> KVCacheBlocks:
    # 获取某个请求当前持有的 blocks，并重新封装成 KVCacheBlocks。
    """Get the blocks of a request."""
    return self.create_kv_cache_blocks(self.coordinator.get_blocks(request_id))

def get_block_ids(self, request_id: str) -> tuple[list[int], ...]:
    # 直接返回某个请求的 block_id 列表。
    """Get the block ids of a request."""
    return self.get_blocks(request_id).get_block_ids()

def cache_blocks(self, request: Request, num_computed_tokens: int) -> None:
    # 把已经确认完成的 blocks 写入 prefix cache。
    """Cache the blocks for the request, if enabled.

    Args:
        request: The request to cache the blocks.
        num_computed_tokens: The number of computed tokens, including tokens
            that are already cached and tokens to be cached.
    """
    if self.enable_caching:
        self.coordinator.cache_blocks(request, num_computed_tokens)

def create_kv_cache_blocks(
    self, blocks: tuple[list[KVCacheBlock], ...]
) -> KVCacheBlocks:
    # 只有在 blocks 非空时才创建新对象，否则直接复用空对象以减少分配。
    # Only create new KVCacheBlocks for non-empty blocks
    return KVCacheBlocks(blocks) if any(blocks) else self.empty_kv_cache_blocks

def take_new_block_ids(self) -> list[int]:
    # 取出新分配的 attention block IDs，供后续清零或初始化使用。
    """Drain and return new attention block IDs for zeroing."""
    ids: list[int] = []
    for mgr in self.coordinator.single_type_managers:
        ids.extend(mgr.take_new_block_ids())
    return ids

def new_step_starts(self) -> None:
    # 新一步开始时推进 coordinator 的内部状态机。
    """Called when a new step is started."""
    self.coordinator.new_step_starts()

总结：
这段代码展示了 vLLM 如何把“理论上的 PagedAttention”落地成“工业级的显存调度系统”。
它通过哈希匹配实现跨请求的内存共享（Prefix Caching），并通过非常细致的槽位类型划分，
支持了当前大模型推理领域中的多种前沿加速技术。

文末问答

以下是对 vLLM KV Cache 核心问题的详细解答。这里的“代码实例”只保留两类：一类是本文前面已经贴出的源码片段；另一类会明确标注为“示意代码”，避免把机制说明误写成“这段源码里就直接实现了它”。
一、核心架构：PagedAttention 原理

显存浪费问题与 PagedAttention 的解决思路
• 传统痛点： • 外部碎片：传统框架（如早期 FasterTransformer）需要提前按模型的最大序列长度（如 2048 或 4096）为每个请求预分配连续显存。但大部分请求往往达不到最大长度，导致预留的显存空置。 • 内部碎片：即使按实际生成长度动态分配，由于内存对齐或分块策略粗糙，也会产生浪费。 • 共享困难：不同请求即使 System Prompt 相同，由于显存必须连续，也无法在物理层面共享这段 Prompt 的 KV Cache。
• PagedAttention 方案：将 KV Cache 拆分为固定大小的“物理块”（Physical Blocks），每个块仅存储少量 Token（如 16 个）。通过引入“逻辑块”（Logical Blocks）到物理块的映射表（Block Table），实现了按需动态分配显存，显著缓解外部碎片，使显存分配从“按最大长度连续预留”变成“按 block 按需分配；内部碎片被限制在序列的最后一个块中（若 block_size=16，最多浪费 15 个 Token 的空间）。
代码实例：

num_blocks_to_allocate = self.coordinator.get_num_blocks_to_allocate(
    request_id=request.request_id,
    num_tokens=num_tokens_need_slot,
    new_computed_blocks=new_computed_block_list,
    num_encoder_tokens=num_encoder_tokens,
    total_computed_tokens=num_local_computed_tokens + num_external_computed_tokens,
    num_tokens_main_model=num_tokens_main_model,
)
if num_blocks_to_allocate > self.block_pool.get_num_free_blocks():
    return None

PagedAttention 的执行流程与底层的 CUDA Kernel
• 执行流程：在 Attention 计算阶段，系统不再向底层算子传递一个巨大的连续张量，而是传递预先分配好的巨大显存池（K/V Block Pool）以及当前请求的 Block Table（记录了逻辑顺序到物理索引的映射）。
• Kernel 内寻址：在定制的 PagedAttention CUDA Kernel 中，GPU 线程在计算 Query 与对应位置 Key 的内积时，会动态计算该位置所在的逻辑块 logical_block_id = pos / block_size 和块内偏移 offset = pos % block_size。接着，通过查 Block Table 获取 physical_block_id，直接跨越不连续的显存地址抓取数据。全程实现零拷贝（Zero-copy）。
示意代码（说明访问过程，非本文源码原样）：

logical_block_id = pos // block_size
offset = pos % block_size
physical_block_id = block_table[logical_block_id]

k_ptr = k_cache_pool[physical_block_id][offset]
v_ptr = v_cache_pool[physical_block_id][offset]

block_size 的权衡与影响
• 过大的弊端：会增加内部碎片。例如 block_size = 256，如果一个请求刚好生成了 257 个 Token，就会占用两个物理块，第二个块浪费了 255 个 Token 的空间。
• 过小的弊端：会导致 Block Table 变得极长，增加 CPU 端的调度开销；同时在 GPU Kernel 执行时，频繁的查表和非连续内存访问会降低访存带宽的利用率。
• 工业界选择：通常设为 16 或 32，这是显存利用率与 Kernel 访存效率之间的一个最佳甜点。
示意代码（用于说明权衡，本文源码并没有直接写“block_size 应该取多少”）：

def estimate(num_tokens: int, block_size: int) -> tuple[int, int]:
    num_blocks = (num_tokens + block_size - 1) // block_size
    tail_waste = num_blocks * block_size - num_tokens
    return num_blocks, tail_waste

# 例如 257 个 token：
# block_size=16  -> 17 个块，尾块浪费 15
# block_size=256 -> 2 个块，尾块浪费 255

二、显存分配与调度策略

遇到 OOM 时的调度处理机制（抢占机制）
当物理块池耗尽时，vLLM 的调度器会暂停低优先级（通常是最晚到达的）请求，执行抢占（Preemption）。主要有两种策略：
• Swap（交换）：将受害者请求已生成的 KV Cache 从 GPU 的 HBM 拷贝到 CPU 的系统内存中。当高优先级请求执行完毕释放出物理块后，再将数据从 CPU 换回 GPU 恢复执行。
• Recomputation（重计算）：直接丢弃受害者请求在 GPU 上的 KV Cache。当系统资源充足轮到它恢复执行时，将已经生成的序列作为新的 Prompt 输入，重新经历一次 Prefill 阶段来恢复 KV Cache。
代码实例：
1
2
if num_blocks_to_allocate > self.block_pool.get_num_free_blocks():
return None
连续批处理（Continuous Batching）中的资源分配
• vLLM 在 iteration（即生成一个 Token 的推理步）级别进行调度。调度器每次循环都会评估当前等待队列中的 Prefill 请求和正在执行的 Decode 请求。
• 调度器会计算当前所有运行中请求的 num_tokens_to_allocate，并对比 block_pool.get_num_free_blocks()。只要空闲块足够，系统就会为老请求分配 Decode 阶段需要的新块，并同时将新请求的 Prefill 任务加入到当前的 Batch 中一起计算，极大地提高了 GPU 计算单元的饱和度。
代码实例：
1
2
3
4
5
6
new_blocks = self.coordinator.allocate_new_blocks(
request.request_id,
num_tokens_need_slot,
num_tokens_main_model,
num_encoder_tokens,
)
三、前缀缓存 (Prefix Caching)
自动前缀缓存 (APC) 的工作原理
• 哈希匹配：vLLM 为每个 Token 序列计算哈希值。为了高效匹配，系统底层使用基数树（Radix Tree）来管理这些哈希值与物理块的对应关系。树的每个节点对应一个物理块。
• 复用条件：当新请求到来时，系统会以 block_size 为粒度计算其前缀的哈希，并在基数树中检索。如果找到完全匹配的路径，系统不会分配新的显存，而是直接将该请求的 Block Table 指向这些已存在的物理块，并将物理块的引用计数（Reference Count）加 1。
代码实例：
1
2
3
4
5
computed_blocks, num_new_computed_tokens = (
self.coordinator.find_longest_cache_hit(
request.block_hashes, max_cache_hit_length
)
)
Prefix Caching 的驱逐与淘汰策略
• 系统采用 LRU（Least Recently Used，最近最少使用）策略。
• 每个物理块都有一个访问时间戳。当物理显存水位过高，需要申请新块而空闲块不足时，调度器会扫描所有 ref_count == 0（即当前没有正在活跃的请求依赖这个块）的物理块，按照时间戳优先淘汰最久未被使用的块，将其回收到空闲池中。
代码实例（本文这段代码只展示“驱逐入口”，LRU 排序细节在更下层）：
1
2
def evict_blocks(self, block_ids: set[int]) -> None:
self.block_pool.evict_blocks(block_ids)
四、进阶与复杂场景适配

Beam Search / 并行采样下的 Copy-on-Write 机制
• 内存共享：在生成初始阶段，同一个 Prompt 衍生出的多条路径会完全共享同一组 KV 物理块，此时这些块的 ref_count = N（比如并行度为 4，则计数为 4）。
• 写时复制 (CoW)：当这些序列开始生成各自不同的 Token，需要修改当前所在的物理块（如果该块尚未填满）或分配新块时，系统检测到该块的 ref_count > 1。此时，调度器会向系统申请一个新的空闲物理块，将原块的数据拷贝过来，然后在这个新块上写入新的 Token，同时将原块的 ref_count 减 1。这大幅降低了 Beam Search 的显存翻倍问题。
示意代码（本文贴出的 KVCacheManager 片段没有直接展开 CoW 实现）：

if block.ref_count > 1:
    new_block = allocate_empty_block()
    copy_block(src=block, dst=new_block)
    block.ref_count -= 1
    new_block.ref_count = 1
    write_token(new_block, token)
else:
    write_token(block, token)

投机解码（Speculative Decoding）的槽位管理
• Lookahead 预分配：在 allocate_slots 时，不仅为当前的 Token 分配空间，还会为草稿模型预测的多个未来 Token 预留 lookahead 物理块。
• 状态隔离与回滚：草稿 Token 的 KV Cache 会被写入物理块，但它们在被目标大模型验证之前，处于“未提交”状态（不会加入到全局前缀缓存树中）。如果验证阶段某些 Token 被拒绝（Reject），调度器会调整请求的逻辑长度，直接释放被拒绝 Token 占用的物理块，实现了显存的高效回收。
代码实例（这两行能直接支撑“预留 lookahead 槽位”和“延迟写入缓存”这两个点）：

num_tokens_need_slot = min(
    num_tokens_main_model + num_lookahead_tokens, self.max_model_len
)

if not self.enable_caching or delay_cache_blocks:
    return self.create_kv_cache_blocks(new_blocks)

滑动窗口注意力 (Sliding Window Attention, SWA) 的适配
• 在类似 Mistral 这样采用 SWA 的模型中，由于当前 Token 只与过去窗口大小为 W 的历史 Token 计算注意力，超过窗口的旧 Token 实际上已经不需要参与计算了。
• vLLM 的 KVCacheManager 提供了一个 remove_skipped_blocks 机制。当请求的生成长度超过 W 时，调度器会识别出那些由于滑动窗口前进而彻底失效的逻辑块，并将其对应的物理块引用解除。如果该物理块没有被其他前缀缓存请求依赖，就可以提前被释放并放回内存池，使该请求的显存占用维持在一个恒定的上限。
代码实例：
1
2
3
4
def remove_skipped_blocks(
self, request_id: str, total_computed_tokens: int
) -> None:
self.coordinator.remove_skipped_blocks(request_id, total_computed_tokens)
block 是什么时候分配的？
结合前面的 allocate_slots() 来看，block 不是请求一进来就一次性全部分完，而是在 prefill / decode 推进过程中按需分配。函数会先根据当前已经计算的 token、这一步新增的 token、以及 num_lookahead_tokens 计算 num_tokens_need_slot，再调用 get_num_blocks_to_allocate() 和 allocate_new_blocks() 判断并申请真正需要的新 block。
logical token 怎么映射 physical block？
逻辑上先把 token 位置映射到逻辑块，再通过 Block Table 找到对应的 physical block。本文源码虽然没有展开 kernel 本体，但 KVCacheBlocks.get_block_ids() 暴露的就是“当前请求对应哪些物理块 ID”，这正是调度层把逻辑顺序交给底层物理块寻址的桥梁。
request 结束后怎么回收？
request 结束后，不是简单把一大片连续显存直接删掉，而是把它持有的 physical blocks 归还给底层 block pool。你前面“代码说明”里提到的 free(request) 就是在做这件事；如果某些 block 仍被 Prefix Cache 或其他请求共享，那么它们会等到底层引用关系解除后再进入真正可回收状态。
attention kernel 怎么读 KV？
PagedAttention 的关键就是 kernel 不要求 K/V 连续存放。它会先根据 token 位置算出 logical_block_id 和块内 offset，再通过 Block Table 找到 physical_block_id，最后去对应 block 的 offset 位置取 K/V。也正因为如此，vLLM 才能把 KV Cache 拆成分页块来管理，而不是强依赖一整段连续显存。
为什么 Prefix Caching 通常按 block 粒度命中，而不是按 token 粒度命中？
从本文源码看，KVCacheBlocks、get_block_ids()、get_unhashed_block_ids() 暴露的核心对象都是 block，而 get_computed_blocks() 也是基于 request.block_hashes 去查命中。因此工程上的复用单位天然就是 block：这样可以直接复用整块物理 KV，调度、引用计数和回收路径都更简单；如果细到 token 粒度，元数据和访存组织会复杂很多。
为什么前缀缓存通常只能复用“完整前缀”，而不是中间任意一段？
因为 get_computed_blocks() 做的是“从当前请求开头开始找最长 cache hit”，也就是从序列起点往后匹配连续前缀。Transformer 的 KV 本身也是按时间顺序累积的，所以最稳定的复用方式是“从开头开始的连续公共前缀”，而不是把中间某一段单独剪出来拼接。
num_new_computed_tokens 和 num_new_tokens 的区别是什么？
这两个量在 allocate_slots() 里是分开处理的。num_new_computed_tokens 表示通过 Prefix Cache 新命中的 token 数，它们已经有现成 KV，会进入 num_local_computed_tokens；num_new_tokens 则表示当前这一步真正还要新增槽位、参与本轮计算或生成的 token 数。前者是“命中复用”，后者是“实际新增”。
为什么 allocate_slots() 里要区分 local computed 和 external computed？
因为源码里 total_computed_tokens 是由 num_local_computed_tokens + num_external_computed_tokens 一起算出来的。这说明 vLLM 允许一部分 KV 不是主模型在当前进程里一步步算出来的，而是由外部组件提前提供的，比如多模态 encoder。这样容量判断和 block 分配时，才能把这部分上下文也算进去。
为什么源码里要有 full_sequence_must_fit 这种 admission check？
因为有些场景不能接受“先跑起来，跑到一半再发现后面放不下”。allocate_slots() 里会在这个开关打开时，先估算整段序列需要多少 block，再和 get_num_free_blocks() 比较。这样可以减少中途 OOM 或频繁抢占，换来更稳定的时延表现。
delay_cache_blocks 的作用是什么？
从 if not self.enable_caching or delay_cache_blocks: return ... 这段逻辑可以看出，它表示“这轮先分到 block，但先不要把这些 block 固化进 Prefix Cache”。这在投机解码里很重要，因为 lookahead 里的草稿 token 还没有被最终确认，过早写入缓存会污染后续请求的命中结果。
Prefix Caching 为什么不等于“所有重复请求都能加速”？
因为 get_computed_blocks() 依赖的是 request.block_hashes 的严格匹配，而不是“看起来差不多”就能复用。只要 system prompt、模板、特殊 token 或前处理插入字段不一样，block hash 就会不同，最终就不会命中。所以它复用的是“严格一致的前缀 block”，不是“语义相似的 prompt”。
Prefix Caching 的收益为什么通常主要体现在 TTFT，而不是 decode 吞吐？
从源码流程上看，get_computed_blocks() 是在请求进入时先查前缀命中，命中后直接跳过前面那一大段本该做的 prefill 计算，所以最直接改善的是首 token 前的等待时间。decode 阶段每轮只新增很少 token，因此 Prefix Cache 对 decode 吞吐的改善通常没有对 prefill 那么显著。
为什么说 Continuous Batching 和 KV Cache 是强耦合的？
因为 Continuous Batching 想让不同请求在不同 iteration 上动态进出 batch，就必须让每个请求的上下文能被低成本保存、扩展和回收。本文 allocate_slots() 做的正是这件事：算 block 需求、清理无效 block、申请新 block、必要时延迟固化到缓存。没有这套分页式显存管理，动态合批的收益会被显存管理成本吃掉。
如果两个请求前缀相同，但采样参数不同，还能复用同一份前缀 KV 吗？
通常可以。因为本文讨论的 Prefix Caching 命中条件来自输入 token 序列对应的 block_hashes，而不是 temperature、top-p、top-k 这类采样参数。采样参数影响的是“后续选哪个 token”，不会改变已经存在的前缀 token 的 K/V 内容。
面试里如果被问“KV Cache 真正占显存的主要矛盾是什么”，怎么答？
可以结合本文 allocate_slots() 的参数来答：真正决定显存压力的，不只是“开没开 cache”，而是“总上下文长度、并发请求数、每轮新增 token、lookahead 预留、外部已计算 token”等因素叠加后的 block 需求。vLLM 的核心价值不是凭空减少理论 KV 体积，而是用 PagedAttention 把这些 KV 变成“可分页、可复用、可延迟固化、可及时回收”的资源，从而让同样的显存承载更多真实请求。

代码说明

这段代码是 vLLM v1 架构中核心的显存管理层实现。它位于调度器（Scheduler）和底层物理显存池（Block Pool）之间，主要负责 PagedAttention 机制下的显存块分配、回收，以及 Prefix Caching（前缀缓存）功能。

代码主要由两个类构成，下面按模块拆解。

1. KVCacheBlocks：调度器与管理器的隔离层

这是一个数据类，本质上是对物理显存块列表的封装，用来把 KVCacheManager 内部的数据结构隐藏起来，只向上层调度器暴露稳定接口。

设计目的：作为 Scheduler 的返回对象，避免调度器直接接触内部 block pool 的复杂结构。
多组支持：blocks 是一个元组，支持多个 kv_cache_group。这样可以兼容未来不同组采用不同 block_size 的架构。
核心功能：
- get_block_ids：把对象转换成物理块 ID 列表，方便调度器使用。
- get_unhashed_block_ids：返回尚未进行哈希、也就是还没有进入前缀缓存管理的块 ID。
- get_unhashed_block_ids_all_groups：按组返回未哈希块 ID，并跳过 padding block。

2. KVCacheManager：缓存与分配的控制中枢

这是管理 KV Cache 生命周期的大脑。它不仅处理常规的生成请求，还兼容投机解码（Speculative Decoding）、多模态外部缓存接入等复杂场景。

3. 核心机制一：前缀缓存匹配 `get_computed_blocks`

当系统启用了 enable_caching 时，这个方法会在处理新请求前被调用。

它会根据请求的 block_hashes，去底层 coordinator 里查找是否已经存在可复用的 KV Cache。
典型场景是 System Prompt 复用或多轮对话。如果命中，就直接返回已经缓存好的物理块列表，以及命中的 token 数量。
这样可以跳过对应前缀的 Attention 计算，从而显著降低首字延迟（TTFT）。

4. 核心机制二：复杂的槽位分配 `allocate_slots`

这是整个类里最核心、也最复杂的函数。它负责为一个请求在当前推理步申请物理显存块。代码中把请求在显存中的逻辑布局划分得很细，可以理解成下面几段：

已经计算或缓存的部分：
- comp：当前请求之前已经计算完的 token。
- new_comp：刚刚通过前缀缓存匹配到的 token。
- ext_comp：由外部组件（比如多模态 connector）计算并传入的 token。
需要新分配空间的部分：
- new：当前步骤即将输入或生成的新 token。
- lookahead：专门为投机解码预留的额外槽位，用来存放模型一次性预测出的多个未来草稿 token。

分配流程可以概括为四步：

容量预检：如果设置了 full_sequence_must_fit，会提前计算该请求直到最大长度所需的总块数。如果系统当前空闲块不足，就直接拒绝分配，防止请求中途 OOM。
清理废弃块：调用 remove_skipped_blocks。如果模型使用了滑动窗口注意力，超出窗口范围的旧 token 占用的块会被提前释放。
物理分配：通过 coordinator.allocate_new_blocks 向系统申请真正的物理块。
状态固化：将当前已经确认无误的 token，尤其是排除了投机解码中可能被拒绝的草稿 token 后的部分，推入前缀缓存池，供后续请求复用。

5. 核心机制三：资源回收与维护

free(request)：当请求生成完毕或被中止时，将其持有的所有物理块交还给底层池。
evict_blocks(block_ids)：强制从前缀缓存中驱逐指定块，通常用于缓存池满了之后执行最近最少使用（LRU）式淘汰。
get_num_common_prefix_blocks：统计当前正在运行的所有请求共享了多少个前缀块，这对于调度策略优化和缓存效率评估非常关键。

kvcache的特点

KV cache 本身并不会减少单个 token decode 时必须读取的历史 K/V 总量。PagedAttention 主要解决的是显存管理问题：减少预分配浪费、支持动态 batch、支持 prefix cache 复用和请求级回收。

decode 阶段仍然容易 memory-bound，因为每生成一个 token 都要读取历史 K/V，访存量随 seq_len 增长。PagedAttention 通过更好的显存管理提高并发承载能力，但 block table 间接寻址也会带来一定 kernel 访问开销。vLLM 的设计是在显存利用率、动态调度能力和 attention kernel 访存效率之间做权衡。