Shared batch-assembly helpers used by the batching strategies.
The strategies (DecodeMaximal, PrefillPriority, Balanced) differ only in
the order in which decode and prefill run and in how the budget is split
between them. The per-slot assembly of decode tokens and prefill chunks is
identical, so it lives here.
Performance notes
These helpers run on every tick of the server loop, once per active slot, and the prefill helper runs once per token of every prompt. To keep the loop linear in the number of entries:
- A running n_entries count is threaded through the accumulators instead of calling length/1 on the growing entries list. Entries are prepended and reversed once by the caller, so a freshly added entry's final batch index is exactly the current n_entries.
- Prompt length comes from the cached slot.n_prompt_tokens rather than length(slot.prompt_tokens).
- Prefill chunks are sliced from slot.prompt_tokens_tuple (O(1) random access) rather than Enum.slice/3 on a list (O(prefill_pos)).
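The tuple-slicing point in the last bullet can be illustrated with a minimal sketch. The module and function names (ChunkSketch, slice_tuple/3) are hypothetical and not part of this module; only the idea of reading a chunk with elem/2 instead of walking a list comes from the notes above.

```elixir
defmodule ChunkSketch do
  # Hypothetical helper: returns `count` tokens starting at index `pos`.
  # elem/2 is constant time, so the cost is O(count) regardless of `pos`,
  # whereas Enum.slice/3 on a list must first walk `pos` elements.
  def slice_tuple(tokens_tuple, pos, count) when count > 0 do
    for i <- pos..(pos + count - 1), do: elem(tokens_tuple, i)
  end

  def slice_tuple(_tokens_tuple, _pos, _count), do: []
end

prompt = [11, 22, 33, 44, 55]
prompt_tuple = List.to_tuple(prompt)

# Both calls return the same chunk; only the access cost differs.
IO.inspect(ChunkSketch.slice_tuple(prompt_tuple, 2, 2))
IO.inspect(Enum.slice(prompt, 2, 2))
```

The tuple is built once per prompt, so the conversion cost is paid up front rather than on every tick.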
All helpers take and return the 4-tuple {entries, n_entries, slots, budget}.
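As a rough sketch of how such an accumulator might be threaded, the example below prepends entries, carries the running count, and reverses once at the end. The module name (AccSketch), the add_entry/3 function, and the entry shape are assumptions made for illustration; only the {entries, n_entries, slots, budget} tuple comes from the text.

```elixir
defmodule AccSketch do
  # Hypothetical entry shape. The batch index is taken from the running
  # count, so length/1 is never called on the growing list.
  def add_entry({entries, n_entries, slots, budget}, token, seq_id) when budget > 0 do
    entry = %{token: token, seq_id: seq_id, index: n_entries}
    {[entry | entries], n_entries + 1, slots, budget - 1}
  end

  # Budget exhausted: return the accumulator unchanged.
  def add_entry(acc, _token, _seq_id), do: acc
end

{entries, n_entries, _slots, budget} =
  {[], 0, %{}, 2}
  |> AccSketch.add_entry(101, 0)
  |> AccSketch.add_entry(102, 1)
  |> AccSketch.add_entry(103, 2)

# The caller reverses once; each entry's index then matches its position.
batch = Enum.reverse(entries)
IO.inspect({batch, n_entries, budget})
```

Because every helper takes and returns the same tuple shape, the strategies can pipe the accumulator through the helpers in whichever order they prefer.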
Summary
Functions
Adds one decode token for each generating slot (lowest seq_id first) until the budget is exhausted. Streams each token piece to the slot's subscriber.
Fills the remaining budget with prefill chunks for each prefilling slot (lowest seq_id first). Only the last token of a slot's final chunk requests logits.
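The prefill behaviour described above could look roughly like the following sketch. It is not the module's actual implementation: the module name (PrefillSketch), the function names, the :state field, the entry shape, and the assumption that slots is a map keyed by seq_id are all hypothetical. The fields n_prompt_tokens, prompt_tokens_tuple and prefill_pos are the ones named in the performance notes. The range syntax with a step requires Elixir 1.12 or later.

```elixir
defmodule PrefillSketch do
  # Visits prefilling slots in ascending seq_id order and spends the
  # remaining budget on their prompt chunks.
  def add_prefill_chunks({entries, n_entries, slots, budget}) do
    slots
    |> Map.values()
    |> Enum.filter(&(&1.state == :prefilling))
    |> Enum.sort_by(& &1.seq_id)
    |> Enum.reduce({entries, n_entries, slots, budget}, &add_chunk/2)
  end

  defp add_chunk(_slot, {_, _, _, budget} = acc) when budget <= 0, do: acc

  defp add_chunk(slot, {entries, n_entries, slots, budget}) do
    count = min(slot.n_prompt_tokens - slot.prefill_pos, budget)
    last_pos = slot.n_prompt_tokens - 1

    new_entries =
      for pos <- slot.prefill_pos..(slot.prefill_pos + count - 1)//1 do
        # Only the final prompt token asks for logits.
        %{token: elem(slot.prompt_tokens_tuple, pos), seq_id: slot.seq_id,
          pos: pos, logits: pos == last_pos}
      end

    slot = %{slot | prefill_pos: slot.prefill_pos + count}

    {Enum.reverse(new_entries) ++ entries, n_entries + count,
     Map.put(slots, slot.seq_id, slot), budget - count}
  end
end

make_slot = fn id, tokens ->
  %{seq_id: id, state: :prefilling, prefill_pos: 0,
    prompt_tokens_tuple: List.to_tuple(tokens), n_prompt_tokens: length(tokens)}
end

slots = %{1 => make_slot.(1, [7, 8, 9]), 0 => make_slot.(0, [5, 6])}
{entries, n_entries, _slots, budget} = PrefillSketch.add_prefill_chunks({[], 0, slots, 4})
IO.inspect({Enum.reverse(entries), n_entries, budget})
```

With a budget of 4, slot 0 finishes its two-token prompt (so its second token requests logits), while slot 1 receives only a partial chunk and requests none until a later tick.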