# `LlamaCppEx.ModelManager`
[🔗](https://github.com/nyo16/llama_cpp_ex/blob/main/lib/llama_cpp_ex/model_manager.ex#L1)

Holds multiple models resident and routes requests to them by id.

The manager is a node-wide singleton `GenServer` that owns an ETS table of
loaded models. Following the otp-thinking ETS pattern, **lifecycle writes are
serialized through the GenServer, while inference-time lookups read the ETS
table directly from the calling process**, so the manager never becomes a
throughput bottleneck for `generate/3`, `stream/3`, `chat/3`, or `embed/3`.
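
The write-through-GenServer, read-from-ETS split described above can be sketched in isolation. This is a minimal illustration under assumed names (`RegistrySketch`, its table name, and `put/2` are hypothetical), not the manager's actual implementation:

```elixir
defmodule RegistrySketch do
  # Hypothetical module illustrating the pattern; not the library's code.
  use GenServer

  @table :registry_sketch_models

  def start_link(opts \\ []) do
    GenServer.start_link(__MODULE__, opts, name: __MODULE__)
  end

  # Write path: serialized through the GenServer process.
  def put(id, value), do: GenServer.call(__MODULE__, {:put, id, value})

  # Read path: runs in the caller's process and never messages the GenServer.
  def lookup(id) do
    case :ets.lookup(@table, id) do
      [{^id, value}] -> {:ok, value}
      [] -> {:error, :not_loaded}
    end
  end

  @impl true
  def init(_opts) do
    # :protected lets any process read, but only the owning process write.
    :ets.new(@table, [:named_table, :set, :protected, read_concurrency: true])
    {:ok, %{}}
  end

  @impl true
  def handle_call({:put, id, value}, _from, state) do
    :ets.insert(@table, {id, value})
    {:reply, :ok, state}
  end
end
```

Because `lookup/1` touches only the ETS table, many callers can read concurrently while a slow write is still queued in the GenServer's mailbox.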

It is a singleton by design: the client API targets the manager by its module
name, and the backing `Registry`/`DynamicSupervisor` use fixed names. Start at
most one per node — `init/1` refuses a second instance with a clear error.

Because the slow parts of `load/3` (Hub download + native model load) run in a
supervised `Task` rather than the GenServer process, a long load does **not**
block other lifecycle calls (`unload/1`, `set_default/1`, or concurrent
`load/3`s). The memory-budget reservation and the ETS commit are still
serialized on the GenServer, so resident models are always accounted for. (The
budget remains advisory: a model's footprint is reserved only once its size is
known, which happens after its source is resolved, so two models *downloading*
at the same time may briefly under-count each other.)

Start it as part of `LlamaCppEx.ModelSupervisor`, which also starts the
`Registry` and `DynamicSupervisor` that server-backed models need:

    children = [
      {LlamaCppEx.ModelSupervisor,
       memory_budget: :auto,
       models: [
         {"chat", {:hub, "Qwen/Qwen3-0.6B-GGUF", "Qwen3-0.6B-Q8_0.gguf"}, n_gpu_layers: -1},
         {"embed", {:path, "/models/nomic-embed.gguf"}, capabilities: [:embed]}
       ]}
    ]

## Backing modes

  * `:server` (default for generation/chat) — backs the model with a supervised
    `LlamaCppEx.Server`, giving continuous batching, streaming, prefix cache,
    and telemetry.
  * `:direct` (auto-selected when `:embed` is in `:capabilities`) — holds the
    `%LlamaCppEx.Model{}` and runs stateless `LlamaCppEx.generate/3` /
    `LlamaCppEx.Embedding.embed/2`. Mandatory for embeddings, since the server
    has no embedding path.

## Routing

  * Explicit id: `generate("chat", prompt)`.
  * Default model: `generate(:default, prompt)` routes to the model marked
    `default: true` at load, or set via `set_default/1`.
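
The two routing forms combine naturally into a fallback: try an explicit id first, then the default model. The helper below is a hypothetical sketch, not part of the library. It assumes that an unknown id surfaces from `generate/3` as `{:error, :not_loaded}`, and it takes the manager module as an argument only so the logic can be exercised without a loaded model:

```elixir
defmodule RoutingSketch do
  # Hypothetical helper, not part of the library. `manager` defaults to the
  # documented module; any module exposing the same generate/3 can be passed.
  def generate_with_fallback(id, prompt, opts \\ [], manager \\ LlamaCppEx.ModelManager) do
    case manager.generate(id, prompt, opts) do
      # Assumption: an unknown id surfaces as {:error, :not_loaded}.
      {:error, :not_loaded} when id != :default ->
        manager.generate(:default, prompt, opts)

      other ->
        other
    end
  end
end
```

Any other result, including errors from the default model itself, is returned unchanged.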

## Unloading and memory

Model cleanup is GC-based: `unload/1` stops the backing server (dropping its
context and model refs) and removes the ETS entry, then forces a GC. Because
reclamation is by garbage collection, **any caller still holding a `%Model{}`
obtained via `fetch_model/1` keeps the underlying model alive** past `unload/1`.
Prefer id-based dispatch and avoid holding raw refs.

Loads are checked against an advisory memory budget (see
`LlamaCppEx.ModelManager.Budget`); over-budget loads are refused with
`{:error, {:insufficient_memory, ...}}`.
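
A caller will usually want to tell a budget refusal apart from other load failures. The sketch below is a hypothetical helper that matches only on the `:insufficient_memory` tag, because the remaining elements of that tuple are not specified here:

```elixir
defmodule LoadResultSketch do
  # Hypothetical helper: turns load/3 return shapes into a log-friendly string.
  def describe({:ok, id}), do: "loaded #{inspect(id)}"

  # Matches on the tag alone; the rest of the tuple is treated as opaque.
  def describe({:error, reason})
      when is_tuple(reason) and tuple_size(reason) >= 1 and
             elem(reason, 0) == :insufficient_memory do
    "refused by memory budget: #{inspect(reason)}"
  end

  def describe({:error, reason}), do: "load failed: #{inspect(reason)}"
end
```

For example, `LoadResultSketch.describe(LlamaCppEx.ModelManager.load("chat", source, opts))` would cover all three outcomes in one call.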

# `id`

```elixir
@type id() :: term()
```

A model identifier. Any term works as a key; strings (e.g. `"chat"`) or atoms
are conventional. Ids flow through as raw terms and are never converted to
atoms, so user-supplied strings are safe.

# `source`

```elixir
@type source() :: LlamaCppEx.ModelManager.Entry.source()
```

# `chat`

```elixir
@spec chat(id(), [LlamaCppEx.Chat.message()], keyword()) ::
  {:ok, String.t()} | {:error, term()}
```

Routes a chat request to model `id` (or `:default`).

# `child_spec`

Returns a specification to start this module under a supervisor.

See `Supervisor`.

# `default`

```elixir
@spec default() :: id() | nil
```

Returns the current default model id, or `nil`.

# `embed`

```elixir
@spec embed(id(), String.t(), keyword()) :: {:ok, [float()]} | {:error, term()}
```

Routes an embedding request to model `id` (or `:default`).

The model must have been loaded with `:embed` in its `:capabilities` (which
forces `:direct` mode).
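
Since `embed/3` returns a plain list of floats, a common next step is to compare two embeddings. The sketch below is a hypothetical helper, not part of the library. It embeds two texts and computes their cosine similarity, and the manager module is a parameter only so the logic can be tried against a stub:

```elixir
defmodule EmbedSketch do
  # Hypothetical helper: embeds two texts through the manager and returns
  # {:ok, cosine_similarity} or the first error encountered.
  def similarity(id, a, b, manager \\ LlamaCppEx.ModelManager) do
    with {:ok, va} <- manager.embed(id, a, []),
         {:ok, vb} <- manager.embed(id, b, []) do
      {:ok, cosine(va, vb)}
    end
  end

  # Cosine similarity over two equal-length float lists.
  # Raises ArithmeticError if either vector is all zeros.
  def cosine(va, vb) do
    dot =
      va
      |> Enum.zip(vb)
      |> Enum.reduce(0.0, fn {x, y}, acc -> acc + x * y end)

    dot / (norm(va) * norm(vb))
  end

  defp norm(v), do: :math.sqrt(Enum.reduce(v, 0.0, fn x, acc -> acc + x * x end))
end
```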

# `fetch_model`

```elixir
@spec fetch_model(id()) :: {:ok, LlamaCppEx.Model.t()} | {:error, term()}
```

Returns the raw `%LlamaCppEx.Model{}` for advanced use.

Holding the returned ref keeps the model alive past `unload/1` — prefer
id-based dispatch where possible.

# `generate`

```elixir
@spec generate(id(), String.t(), keyword()) :: {:ok, String.t()} | {:error, term()}
```

Routes a generation request to model `id` (or `:default`).

Dispatches to `LlamaCppEx.Server.generate/3` (`:server` mode) or
`LlamaCppEx.generate/3` (`:direct` mode).

# `info`

```elixir
@spec info(id()) :: {:ok, map()} | {:error, :not_loaded}
```

Returns a sanitized view of one model (a map with no raw refs), or
`{:error, :not_loaded}`.

# `list`

```elixir
@spec list() :: [map()]
```

Lists resident models as sanitized maps (no raw refs).

# `load`

```elixir
@spec load(id(), source(), keyword()) :: {:ok, id()} | {:error, term()}
```

Loads a model and keeps it resident under `id`.

## Options

  * `:mode` - `:server` or `:direct`. Defaults to `:direct` when
    `:capabilities` includes `:embed`, otherwise `:server`.
  * `:capabilities` - List of `:generate`, `:chat`, `:embed`. Defaults to
    `[:generate, :chat]`.
  * `:default` - When `true`, mark this model as the default route.
  * Hub options (`:cache_dir`, `:token`, `:revision`, `:force`) when `source`
    is `{:hub, repo, file}`.
  * Any `LlamaCppEx.Model.load/2` or `LlamaCppEx.Server.start_link/1` options
    (e.g. `:n_gpu_layers`, `:n_ctx`, `:n_parallel`).
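
The documented defaults for `:capabilities` and `:mode` can be written as a small pure function. This is a sketch of the rule as stated above, under a hypothetical module name. It covers only the default and says nothing about how an explicit `:mode` is validated:

```elixir
defmodule ModeDefaultSketch do
  # Documented default when :capabilities is not given.
  @default_capabilities [:generate, :chat]

  # Returns the mode that applies when the caller passes no explicit :mode:
  # :direct if :embed is among the capabilities, otherwise :server.
  def default_mode(opts) do
    capabilities = Keyword.get(opts, :capabilities, @default_capabilities)
    if :embed in capabilities, do: :direct, else: :server
  end
end
```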

# `loaded?`

```elixir
@spec loaded?(id()) :: boolean()
```

Returns `true` if the model is loaded and its status is `:ready`, and `false`
otherwise.

# `route`

```elixir
@spec route(id()) ::
  {:ok, {:server, pid(), LlamaCppEx.ModelManager.Entry.t()}}
  | {:ok, {:direct, LlamaCppEx.Model.t(), LlamaCppEx.ModelManager.Entry.t()}}
  | {:error, term()}
```

Resolves `id` (or `:default`) to its dispatch target.

Returns `{:ok, {:server, pid, entry}}`, `{:ok, {:direct, model, entry}}`, or
`{:error, :not_loaded | {:not_ready, status}}`. Primarily for dispatch and
testing.
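
A caller that inspects `route/1` has four shapes to handle. The sketch below is a hypothetical helper that pattern-matches each one. The `:loading` status in the usage note is only an example value, since the possible statuses are not listed here:

```elixir
defmodule RouteSketch do
  # Hypothetical helper: describes each documented result shape of route/1.
  def describe({:ok, {:server, pid, _entry}}) when is_pid(pid),
    do: "server-backed at #{inspect(pid)}"

  def describe({:ok, {:direct, _model, _entry}}),
    do: "direct (in-process model)"

  def describe({:error, :not_loaded}),
    do: "not loaded"

  def describe({:error, {:not_ready, status}}),
    do: "not ready (#{inspect(status)})"
end
```

For instance, `RouteSketch.describe({:error, {:not_ready, :loading}})` yields `"not ready (:loading)"`.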

# `set_default`

```elixir
@spec set_default(id()) :: :ok | {:error, :not_loaded}
```

Sets the default model used by `:default` routing.

# `start_link`

```elixir
@spec start_link(keyword()) :: GenServer.on_start()
```

Starts the manager. Normally started by `LlamaCppEx.ModelSupervisor`.

## Options

  * `:memory_budget` - `:infinity` (default), `:auto` (~80% of system RAM), or
    a byte limit.
  * `:models` - List of `{id, source}` or `{id, source, opts}` to auto-load
    after start. `source` is `{:path, p}` or `{:hub, repo, file}`.
  * `:io` - Backend module (`LlamaCppEx.ModelManager.Backend`). Defaults to
    `LlamaCppEx.ModelManager.ModelIO`; overridden in tests.
  * `:name` - GenServer name. Defaults to `LlamaCppEx.ModelManager`.
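
The two accepted shapes of a `:models` entry, and the two source shapes, can be checked before they reach the manager. The module below is a hypothetical sketch. It assumes paths, repos, and file names are binaries, as in the examples on this page:

```elixir
defmodule ModelsOptSketch do
  # Hypothetical helper: normalizes a :models entry to {id, source, opts}.
  def normalize({id, source}), do: {id, source, []}
  def normalize({id, source, opts}) when is_list(opts), do: {id, source, opts}

  # Checks a source against the two documented shapes.
  # Assumption: path, repo, and file are binaries.
  def valid_source?({:path, path}) when is_binary(path), do: true
  def valid_source?({:hub, repo, file}) when is_binary(repo) and is_binary(file), do: true
  def valid_source?(_other), do: false
end
```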

# `stream`

```elixir
@spec stream(id(), String.t(), keyword()) :: Enumerable.t()
```

Routes a streaming generation request to model `id` (or `:default`).

Raises `ArgumentError` if the model is not loaded and ready (a lazy stream
cannot carry an error tuple).
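
Callers that prefer tuples over exceptions can wrap the call. The helper below is a hypothetical sketch. It assumes the `ArgumentError` is raised when the stream is built; if it were instead raised lazily during enumeration, this wrapper would not catch it:

```elixir
defmodule StreamSketch do
  # Hypothetical helper: runs a zero-arity function that builds a stream and
  # converts an ArgumentError raised while building it into an error tuple.
  def safe(stream_fun) when is_function(stream_fun, 0) do
    {:ok, stream_fun.()}
  rescue
    e in ArgumentError -> {:error, Exception.message(e)}
  end
end
```

Usage would look like `StreamSketch.safe(fn -> LlamaCppEx.ModelManager.stream("chat", "Hello", []) end)`, after which the `{:ok, stream}` branch can be enumerated as usual.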

# `unload`

```elixir
@spec unload(id(), timeout()) :: :ok | {:error, :not_loaded}
```

Unloads a model and frees its backing resources (GC-based).

Stopping a backing server can take a moment for large models, so this accepts
an optional `timeout` (default 30s) rather than the 5s `GenServer` default.

