Needle is a 26-million-parameter model built specifically for function calling. On single-shot function call benchmarks, it outperforms FunctionGemma-270M, Qwen-0.6B, Granite-350M, and LFM2.5-350M, models roughly 10 to 23 times its size. Cactus Compute published the model and the paper last week; the Hacker News thread for the GitHub repository has 649 points.
The small function-calling model space has been moving fast, but Needle sits at the extreme end of the size-to-performance tradeoff, which makes it worth understanding.
Architecture
The model has 26M parameters across a 12-layer encoder and an 8-layer decoder. The encoder has no feed-forward networks — just attention. The decoder uses 8 heads with 4 key-value heads and a 512 hidden dimension. Vocabulary is 8,192 BPE tokens.
That vocabulary size is narrow by design. Needle is not trying to model natural language breadth. It is trying to recognize function schemas, match arguments to parameters, and emit correctly structured calls — tasks where a small, focused vocabulary is an advantage rather than a limitation.
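The decoder figures quoted in this section can be made concrete with a little arithmetic. The sketch below is illustrative only: `DecoderAttentionShape` is a hypothetical name, not something from the Needle codebase, and it assumes the 512 hidden dimension is split evenly across the 8 query heads, which is the usual convention but is not stated in the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecoderAttentionShape:
    """Hypothetical container for the decoder numbers quoted in the article."""
    hidden_dim: int = 512
    num_heads: int = 8      # query heads
    num_kv_heads: int = 4   # key-value heads

    @property
    def head_dim(self) -> int:
        # Assumption: hidden_dim is divided evenly across the query heads.
        return self.hidden_dim // self.num_heads

    @property
    def queries_per_kv_head(self) -> int:
        # How many query heads share each key-value head.
        return self.num_heads // self.num_kv_heads

    @property
    def kv_values_per_token_per_layer(self) -> int:
        # Keys plus values, each num_kv_heads * head_dim wide.
        return 2 * self.num_kv_heads * self.head_dim


shape = DecoderAttentionShape()
full_mha = DecoderAttentionShape(num_kv_heads=8)  # comparison: one KV head per query head
print(shape.head_dim, shape.queries_per_kv_head)
print(shape.kv_values_per_token_per_layer, full_mha.kv_values_per_token_per_layer)
```

Under that assumption, each key-value head serves two query heads, so the per-token key-value cache in each decoder layer is half of what it would be with eight key-value heads.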
Training
Pretraining ran on 16 TPU v6e chips for 27 hours over 200 billion tokens. Post-training on 2 billion function-call-specific tokens took 45 minutes. The model was distilled from Gemini 3.1.
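To put the pretraining figures in perspective, here is a back-of-the-envelope calculation. It only rearranges the numbers quoted above and assumes throughput was uniform across chips and over the whole run, which the source does not claim.

```python
PRETRAIN_TOKENS = 200e9  # 200 billion tokens, as quoted
CHIPS = 16               # TPU v6e chips
HOURS = 27               # wall-clock pretraining time

chip_hours = CHIPS * HOURS
# Average tokens processed per chip per second, assuming uniform throughput.
tokens_per_chip_second = PRETRAIN_TOKENS / (chip_hours * 3600)

print(f"{chip_hours} chip-hours, about {tokens_per_chip_second:,.0f} tokens/s per chip")
```

That works out to 432 chip-hours for pretraining, a budget small enough that retraining or ablating a model of this size is a matter of days rather than weeks.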
Inference speed on Cactus's own hardware: 6,000 tokens per second prefill, 1,200 tokens per second decode.
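Those two throughput figures translate into a rough per-call latency. The sketch below is an estimate under stated assumptions: the prompt and output sizes are hypothetical, the rates are the ones reported for Cactus's own hardware, and fixed costs such as tokenization and model loading are ignored.

```python
PREFILL_TOKENS_PER_SECOND = 6_000  # reported by Cactus on its own hardware
DECODE_TOKENS_PER_SECOND = 1_200   # reported by Cactus on its own hardware


def estimate_latency_seconds(prompt_tokens: int, output_tokens: int) -> float:
    """Prefill time for the prompt plus decode time for the generated call."""
    prefill = prompt_tokens / PREFILL_TOKENS_PER_SECOND
    decode = output_tokens / DECODE_TOKENS_PER_SECOND
    return prefill + decode


# Hypothetical request: 300 tokens of schemas plus query, 60 tokens of emitted call.
latency = estimate_latency_seconds(prompt_tokens=300, output_tokens=60)
print(f"{latency * 1000:.0f} ms")
```

Because decode is five times slower per token than prefill, the length of the emitted call matters more to latency than the length of the schema list.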
Hardware requirements
Quantized to INT4, the model weights are 14MB on disk. No GPU is required. The model runs on a CPU from ordinary system RAM, and the project targets phones, watches, and Raspberry Pi-class devices as deployment platforms. Cactus does not publish a minimum RAM specification, but with 14MB of INT4 weights the working memory footprint should be well under 1GB.
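The on-disk size can be sanity-checked from the parameter count. This is a minimal sketch using decimal megabytes; the helper name is made up for illustration, and the explanation for the gap between the computed value and the reported 14MB is an assumption, not something the source states.

```python
PARAMS = 26_000_000  # 26M parameters, as quoted


def packed_weight_megabytes(num_params: int, bits_per_weight: int) -> float:
    """Raw size of the weights alone, in decimal megabytes."""
    return num_params * bits_per_weight / 8 / 1_000_000


int4_mb = packed_weight_megabytes(PARAMS, 4)
fp16_mb = packed_weight_megabytes(PARAMS, 16)
print(f"INT4: {int4_mb:.0f} MB, FP16: {fp16_mb:.0f} MB")
```

The raw INT4 weights come to 13 MB. The remaining megabyte in the reported figure would plausibly be quantization scales and file metadata, though that breakdown is a guess.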
What this is and is not
Needle is a routing and dispatch model, not a reasoning model. It handles the structured part of an agentic pipeline — receiving a task, selecting the right tool, and formatting the call correctly — but it does not generate reasoning, handle ambiguous instructions, or maintain conversation context.
The benchmarks are single-shot function calling: given a schema and a task, produce the correct function call in one pass. That is a specific and real capability. It is also a narrow one. In conversational settings where tool selection depends on context accumulated across multiple turns, the benchmark advantage shrinks.
Cactus trained the model across 15 tool categories, including timers, messaging, navigation, reminders, smart home control, calendar, search, weather, music, calls, alarms, settings, and notes. These are the categories where single-shot dispatch is sufficient and where the input is usually short and well-scoped.
The documentation acknowledges the model can be "finicky" on ambiguous input. In testing reported on Hacker News, asking it to "contact my boss to say I'll be late" caused it to set a timer rather than select the email tool. Providing explicit context (a contact name and email address) fixed the selection. The model works well when the schema is unambiguous and the input maps cleanly to one tool; it struggles when disambiguation requires inference from context the model has not seen.
Multi-step operations also trip it up. Asking it to "set a timer for one hour, then remind me in one hour" produced two parallel one-hour timers rather than a chained operation. This is expected given the single-shot architecture — Needle does not maintain state or reason about sequences.
The practical use case is cost-sensitive pipelines where function dispatch is the bottleneck. If you are paying for GPT-4o or Claude Sonnet API calls and a significant fraction of those calls are just resolving which tool to call with which arguments, replacing that step with a 26M local model is a meaningful cost reduction. The large model handles reasoning and response; Needle handles dispatch.
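The split described above, with a small local model for dispatch and a large model for everything else, might look like the following sketch. Everything here is hypothetical scaffolding: `ToolCall`, `handle_request`, the stub functions, and the tool schema shape are invented for illustration and are not Needle's API. In a real pipeline the `dispatch` callable would wrap the local model and `reason` would wrap a hosted LLM.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ToolCall:
    name: str
    arguments: dict


def handle_request(
    query: str,
    tools: list,
    dispatch: Callable[[str, list], Optional[ToolCall]],
    reason: Callable[[str], str],
    tool_impls: dict,
) -> str:
    """Try local dispatch first; fall back to the large model if no known tool fits."""
    call = dispatch(query, tools)
    known_names = {tool["name"] for tool in tools}
    if call is None or call.name not in known_names:
        return reason(query)
    return tool_impls[call.name](**call.arguments)


# Stubs standing in for the local dispatch model and the hosted LLM.
def fake_dispatch(query: str, tools: list) -> Optional[ToolCall]:
    if "timer" in query:
        return ToolCall("set_timer", {"minutes": 10})
    return None


def fake_reason(query: str) -> str:
    return f"[large model answer to: {query}]"


tools = [{"name": "set_timer", "parameters": {"minutes": "integer"}}]
tool_impls = {"set_timer": lambda minutes: f"timer set for {minutes} min"}

dispatched = handle_request("set a timer for 10 minutes", tools, fake_dispatch, fake_reason, tool_impls)
fallback = handle_request("why is the sky blue?", tools, fake_dispatch, fake_reason, tool_impls)
print(dispatched)
print(fallback)
```

Checking the returned tool name against the schema list before executing anything is a cheap guard: a small model that picks a nonexistent tool falls through to the large model rather than raising an error.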
A second pattern that came up on HN: pairing Whisper for transcription with Needle for tool dispatch on a local device. The combined pipeline handles voice-to-function-call without any network round-trip — useful for Home Assistant integrations or embedded voice agents where latency and privacy matter.
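The voice pipeline can be sketched as two injected stages. This is an outline under assumptions, not a tested integration: `voice_to_call` and both stubs are hypothetical, and the dictionary shape of the returned call is invented for the example. Passing the stages in as callables keeps the sketch runnable without either model installed.

```python
from typing import Callable


def voice_to_call(
    audio_path: str,
    transcribe: Callable[[str], str],
    dispatch: Callable[[str], dict],
) -> dict:
    """Transcribe audio locally, then hand the text to the dispatch model."""
    text = transcribe(audio_path).strip()
    if not text:
        # Nothing was said (or nothing was recognized): do not dispatch.
        return {"name": None, "arguments": {}, "transcript": ""}
    call = dispatch(text)
    return {**call, "transcript": text}


# With the openai-whisper package installed, `transcribe` could be built as:
#   import whisper
#   asr = whisper.load_model("base")
#   transcribe = lambda path: asr.transcribe(path)["text"]
# The stubs below stand in for both models so the example runs anywhere.
def stub_transcribe(path: str) -> str:
    return " turn off the kitchen lights "


def stub_dispatch(text: str) -> dict:
    return {"name": "set_light", "arguments": {"room": "kitchen", "on": False}}


result = voice_to_call("clip.wav", stub_transcribe, stub_dispatch)
print(result)
```

Keeping the transcript alongside the call makes it easier to debug misfires, since a wrong tool choice can then be traced to either the transcription or the dispatch step.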
Running it
The model runs locally. Clone the repository, run the setup script, then launch the playground:
git clone https://github.com/cactus-compute/needle.git
cd needle
source ./setup
needle playground
That opens a web interface at localhost:7860 for testing schemas and queries. For programmatic use:
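from needle import SimpleAttentionNetwork, load_checkpoint, generate
# Load the checkpoint weights and the model configuration
params, config = load_checkpoint("checkpoints/needle.pkl")
model = SimpleAttentionNetwork(config)
# NOTE: `tokenizer` is not defined in this snippet; load it as described in the repository README before calling generate()
result = generate(model, params, tokenizer, query="set a timer for 10 minutes", tools=[...])
Cactus also publishes the package on PyPI: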
pip install cactus-needle
The repository includes example code for Playwright-style function schemas and a comparison table against the benchmark models. The documentation is clear about what the model is not: the note that "larger models excel in conversational settings" is in the README, not buried in a paper appendix.
The source is at github.com/cactus-compute/needle.