# `LlamaCppEx.ModelManager.Budget`
[🔗](https://github.com/nyo16/llama_cpp_ex/blob/main/lib/llama_cpp_ex/model_manager/budget.ex#L1)

Advisory, placement-aware memory budgeting for `LlamaCppEx.ModelManager`.

Given a budget, the footprint a new model would occupy across devices, and what
resident models already use, this module decides whether the new model fits.
The manager refuses loads that would exceed the budget.

## Budget shapes

  * `:infinity` / `nil` — no limit.
  * an **integer** — a single **combined** pool: RAM plus all VRAM counts
    against one number (backward-compatible with the original single-pool
    budget).
  * `:auto` — RAM ≈ 80% of system memory, and **per-GPU** VRAM from each
    device's free memory (via `LlamaCppEx.devices/0`).
  * a map `%{ram: ram, vram: vram}` — explicit **per-device** budget. `ram`
    and each VRAM entry may be a byte limit or `:infinity`. `:auto` is also
    accepted for `ram` and for `vram` as a whole. Otherwise `vram` may be a
    list `[b0, b1, ...]` (indexed by GPU) or a map `%{gpu_index => bytes}`.
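
For illustration, the shapes listed above can be written as plain Elixir terms. The byte counts are made-up values, and the `budgets` list exists only to collect the examples; each element is something that could be given as the `:memory_budget` option.

```elixir
# Illustrative values only; the byte counts are invented for the example.
gib = 1024 * 1024 * 1024

budgets = [
  # No limit.
  :infinity,
  # One combined pool: RAM plus all VRAM count against 24 GiB.
  24 * gib,
  # Derive RAM and per-GPU VRAM limits from the system.
  :auto,
  # Per-device, list form: GPU 0 gets 16 GiB, GPU 1 gets 8 GiB.
  %{ram: 32 * gib, vram: [16 * gib, 8 * gib]},
  # Per-device, map form keyed by GPU index.
  %{ram: :auto, vram: %{1 => 8 * gib}},
  # Unlimited RAM, auto-detected VRAM.
  %{ram: :infinity, vram: :auto}
]
```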

Placement across GPUs is derived from a model's `:n_gpu_layers`, `:split_mode`,
`:tensor_split`, and `:main_gpu`. It is advisory — quantization, compute
buffers, fragmentation, and exact KV growth are approximated, and partial
offload (`0 < n_gpu_layers < n_layers`) is treated coarsely as fully offloaded.

# `budget`

```elixir
@type budget() ::
  %{mode: :unlimited}
  | %{mode: :combined, limit: non_neg_integer()}
  | %{
      mode: :per_device,
      ram: limit(),
      vram: :infinity | %{required(non_neg_integer()) => limit()}
    }
```

# `limit`

```elixir
@type limit() :: non_neg_integer() | :infinity
```

# `placement`

```elixir
@type placement() :: %{
  ram: non_neg_integer(),
  vram: %{required(non_neg_integer()) => non_neg_integer()}
}
```

# `add_usage`

```elixir
@spec add_usage(placement(), placement()) :: placement()
```

Folds a placement into a usage accumulator.
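
A minimal sketch of what such a fold might look like, written as a standalone module so it runs without the library. `UsageSketch` is a hypothetical stand-in, not the library's code. It assumes the empty accumulator is `%{ram: 0, vram: %{}}` and that the accumulator is the first argument.

```elixir
defmodule UsageSketch do
  # Hypothetical stand-ins for the documented empty_usage/0 and add_usage/2.
  def empty_usage, do: %{ram: 0, vram: %{}}

  # Sums RAM and merges per-GPU VRAM, adding bytes where both sides use a GPU.
  def add_usage(acc, placement) do
    %{
      ram: acc.ram + placement.ram,
      vram: Map.merge(acc.vram, placement.vram, fn _gpu, a, b -> a + b end)
    }
  end
end

# Two resident models with invented footprints.
resident = [
  %{ram: 1_000, vram: %{0 => 4_000}},
  %{ram: 500, vram: %{0 => 1_000, 1 => 2_000}}
]

used = Enum.reduce(resident, UsageSketch.empty_usage(), &UsageSketch.add_usage(&2, &1))
# used == %{ram: 1_500, vram: %{0 => 5_000, 1 => 2_000}}
```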

# `check`

```elixir
@spec check(budget(), placement(), placement()) ::
  :ok | {:error, {:insufficient_memory, keyword()}}
```

Decides whether `placement` fits within `budget`, given what resident models
already use (`used`).

Returns `:ok` or
`{:error, {:insufficient_memory, device: device, required: r, available: a}}`,
where `device` is `:total` (combined budget), `:ram`, or `{:gpu, index}`.
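
The decision can be sketched roughly as follows. `CheckSketch` is a hypothetical standalone module that mirrors the documented budget shapes and error tuple. It is not the library's implementation, and it assumes that a GPU with no entry in a per-device VRAM map has no memory available.

```elixir
defmodule CheckSketch do
  # Illustrative only: mirrors the documented shapes, not the library's code.
  def check(%{mode: :unlimited}, _placement, _used), do: :ok

  # Combined pool: RAM plus all VRAM count against a single limit.
  def check(%{mode: :combined, limit: limit}, placement, used) do
    fits(:total, total(placement), limit, total(used))
  end

  # Per-device: RAM first, then each GPU the placement touches.
  def check(%{mode: :per_device, ram: ram, vram: vram}, placement, used) do
    with :ok <- fits(:ram, placement.ram, ram, used.ram) do
      Enum.reduce_while(placement.vram, :ok, fn {gpu, bytes}, :ok ->
        limit = gpu_limit(vram, gpu)

        case fits({:gpu, gpu}, bytes, limit, Map.get(used.vram, gpu, 0)) do
          :ok -> {:cont, :ok}
          error -> {:halt, error}
        end
      end)
    end
  end

  defp total(%{ram: ram, vram: vram}), do: ram + Enum.sum(Map.values(vram))

  # Assumption: a GPU missing from the budget map has no VRAM available.
  defp gpu_limit(:infinity, _gpu), do: :infinity
  defp gpu_limit(vram, gpu), do: Map.get(vram, gpu, 0)

  defp fits(_device, _required, :infinity, _used), do: :ok

  defp fits(device, required, limit, used) do
    available = max(limit - used, 0)

    if required <= available do
      :ok
    else
      {:error, {:insufficient_memory, device: device, required: required, available: available}}
    end
  end
end

budget = %{mode: :per_device, ram: 8_000, vram: %{0 => 6_000}}
in_use = %{ram: 1_000, vram: %{0 => 5_000}}

result = CheckSketch.check(budget, %{ram: 2_000, vram: %{0 => 4_000}}, in_use)
# result == {:error, {:insufficient_memory, [device: {:gpu, 0}, required: 4_000, available: 1_000]}}
```

Here RAM fits (2,000 of 7,000 remaining bytes), but GPU 0 has only 1,000 bytes left, so the GPU is the device reported.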

# `distribute`

```elixir
@spec distribute(non_neg_integer(), keyword(), non_neg_integer()) :: placement()
```

Estimates how a model's footprint is distributed across RAM and GPUs.

Returns `%{ram: bytes, vram: %{gpu_index => bytes}}`.

## Options (from the load opts)

  * `:mode` - `:server` adds a coarse KV-cache estimate; `:direct` adds none.
  * `:n_gpu_layers` - `0` keeps the model in RAM; anything else (incl. `-1`)
    is treated as fully offloaded to GPU(s).
  * `:split_mode` - `:none` (default) places everything on `:main_gpu`;
    `:layer`/`:row` splits by `:tensor_split`.
  * `:tensor_split` - per-GPU weights; when empty, an equal split across the
    `n_gpus` detected devices.
  * `:main_gpu` - target device for `:none` (default `0`).
  * `:offload_kqv` - whether the KV cache lives on GPU (default `true`).
  * `:n_ctx`, `:n_parallel` - feed the KV-cache estimate.
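
The placement rules above can be sketched as a standalone module. `DistributeSketch` is hypothetical and covers only the model bytes: the KV-cache estimate, `:mode`, and `:offload_kqv` are left out because their formulas are not given here. It also assumes the arguments are the model size in bytes, the load opts, and the number of detected GPUs, that `:n_gpu_layers` defaults to `0`, and that a machine without GPUs keeps everything in RAM.

```elixir
defmodule DistributeSketch do
  # Illustrative only: places `bytes` of model data per the documented options.
  def distribute(bytes, opts, n_gpus) do
    # Assumption: a default of 0 layers; the library's default may differ.
    n_gpu_layers = Keyword.get(opts, :n_gpu_layers, 0)

    cond do
      # No offload (or no GPU detected): everything stays in RAM.
      n_gpu_layers == 0 or n_gpus == 0 ->
        %{ram: bytes, vram: %{}}

      # :none puts the whole model on :main_gpu.
      Keyword.get(opts, :split_mode, :none) == :none ->
        %{ram: 0, vram: %{Keyword.get(opts, :main_gpu, 0) => bytes}}

      # :layer / :row split by :tensor_split weights.
      true ->
        %{ram: 0, vram: split(bytes, weights(opts, n_gpus))}
    end
  end

  # An empty :tensor_split means an equal split across the detected GPUs.
  defp weights(opts, n_gpus) do
    case Keyword.get(opts, :tensor_split, []) do
      [] -> List.duplicate(1, n_gpus)
      split -> split
    end
  end

  # Shares are proportional to the weights and rounded up per GPU.
  defp split(bytes, weights) do
    total = Enum.sum(weights)

    weights
    |> Enum.with_index()
    |> Map.new(fn {w, gpu} -> {gpu, ceil(bytes * w / total)} end)
  end
end

opts = [n_gpu_layers: -1, split_mode: :layer, tensor_split: [3, 1]]
placement = DistributeSketch.distribute(8_000, opts, 2)
# placement == %{ram: 0, vram: %{0 => 6_000, 1 => 2_000}}
```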

# `empty_usage`

```elixir
@spec empty_usage() :: placement()
```

An empty usage accumulator.

# `resolve`

```elixir
@spec resolve(term(), [map()]) :: budget()
```

Resolves a `:memory_budget` option into an internal budget.

`gpu_devices` is the GPU subset of `LlamaCppEx.devices/0` (maps with
`:gpu_index`, `:memory_free`); it is only consulted for `:auto` VRAM.
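
As a rough illustration of the VRAM part of that resolution, the sketch below normalises each documented `vram` shape to a `%{gpu_index => limit}` map. `ResolveSketch` is a hypothetical module, and it assumes `:auto` takes each device's `:memory_free` as-is; the library may apply a safety margin.

```elixir
defmodule ResolveSketch do
  # Illustrative only: normalises the documented `vram` shapes.
  def vram(:infinity, _gpu_devices), do: :infinity

  # :auto is the only shape that consults the device list.
  def vram(:auto, gpu_devices) do
    Map.new(gpu_devices, fn dev -> {dev.gpu_index, dev.memory_free} end)
  end

  # List form: position in the list is the GPU index.
  def vram(list, _gpu_devices) when is_list(list) do
    list |> Enum.with_index() |> Map.new(fn {limit, gpu} -> {gpu, limit} end)
  end

  # Map form is already keyed by GPU index.
  def vram(map, _gpu_devices) when is_map(map), do: map
end

# Invented device maps with the two keys mentioned above.
devices = [
  %{gpu_index: 0, memory_free: 12_000},
  %{gpu_index: 1, memory_free: 6_000}
]

auto_vram = ResolveSketch.vram(:auto, devices)
# auto_vram == %{0 => 12_000, 1 => 6_000}

list_vram = ResolveSketch.vram([8_000, :infinity], devices)
# list_vram == %{0 => 8_000, 1 => :infinity}
```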
