Computers and Computation

Language models are bad at arithmetic. This is well-established. A 2B parameter model asked to multiply 87 by 24 will confidently produce something close to the right answer, but not the right answer. The interesting question is what to do about it.

There are broadly two schools of thought. The first says: make the model itself compute. Train it so that the transformer’s forward pass is the computation. Tzamos et al. at Percepta recently demonstrated this taken to its extreme, building a computer inside a transformer that can execute arbitrary C programs for millions of steps using 2D attention heads. They compiled a WebAssembly (WASM, a portable low-level bytecode format) interpreter directly into transformer weights, achieving 100% deterministic execution at 33K tokens/sec on CPU and solving the world’s hardest Sudoku without a single probabilistic error.

The result is technically impressive but the community response has been a mix of fascination and skepticism. The core issue is that the weights are compiled, not learned. There’s no gradient descent involved, which prompted the obvious objection: if you want to run bytecode, just run bytecode. The authors claim the execution trace is differentiable for backpropagation, but the model uses non-differentiable hard attention, and whether approximate differentiable variants would actually work remains undemonstrated. There are no performance benchmarks against native execution or tool-calling approaches. And the fundamental efficiency question is unresolved: they’ve turned O(1) memory access into O(log n) operations via their HullKVCache scheme. Transformers executing instructions will always be slower than just running them on a regular computer.

The second school says: don’t make the model compute, make it delegate. Teach it to recognise when it needs a calculator, write the expression, and let an external system evaluate it. This is less ambitious but more practical. You don’t need to solve the differentiability problem, you don’t need to compile interpreters into weights, and you get exact arithmetic for free because the calculator is just a regular computer.

This post describes an implementation of the delegation approach. I fine-tuned Qwen 3.5-2B to emit <<calc: expr>> tokens during generation, built a runtime interceptor that evaluates the expressions and injects results back into the generation context, and tested three different eval backends: CPU, MLX GPU ops, and a custom Metal compute shader. The result is a 78% relative improvement in math accuracy, from 35.3% to 62.8%.

The idea of augmenting language models with external tools dates back to Toolformer (Schick et al., 2023), which showed that models can learn to insert API calls (calculator, search engine, translator) inline during generation. Toolformer used a self-supervised approach: generate candidate API calls, filter by whether the API result reduces perplexity, then fine-tune on the filtered data.

Calcformer (Kadlcik et al., 2023) narrowed the focus to arithmetic specifically, training models to emit calculator tokens in a format similar to what I use here. They released the Calc-X dataset — 328K math examples with inline calculator annotations — which I use as one of my data sources.

The approach here differs from Toolformer in that it uses straightforward SFT rather than self-supervised tool insertion, and differs from Calcformer in the inference mechanism (token-level streaming interception rather than post-hoc evaluation) and in the exploration of GPU-resident eval backends.

Dataset

The training data comes from three sources, combined into a single dataset of 9,691 chat-formatted examples.

Calc-X (HuggingFace). I pulled math word problems and arithmetic examples from the Calc-X dataset, transforming their XML gadget format into the <<calc: expr>> token format:

# Calc-X original format
<gadget id="calculator">7 * 8</gadget><output>56</output>

# Transformed format
<<calc: 7 * 8>> = 56
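
The transformation itself is mechanical. A rough sketch of the rewrite (the regex and function name here are illustrative, not the repo's actual code):

import re

# Illustrative: rewrite Calc-X's <gadget>/<output> pairs as calc tokens.
GADGET = re.compile(
    r'<gadget id="calculator">(.*?)</gadget>\s*<output>(.*?)</output>',
    re.DOTALL,
)

def to_calc_tokens(text: str) -> str:
    return GADGET.sub(
        lambda m: f"<<calc: {m.group(1).strip()}>> = {m.group(2).strip()}",
        text,
    )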

Every expression is validated with a sandboxed Python evaluator before inclusion. Any example where the expression fails to parse or produces an incorrect result is discarded.

Template-generated examples. Seven categories of synthetic problems with randomised parameters: multi-step word problems, unit conversions, date/time calculations, finance (interest, loans), science (statistics, physics), string operations, and trivial arithmetic. Each template pre-computes all intermediate values in Python, so the dataset is 100% correct by construction.
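
To illustrate the shape of these templates (the example below is hypothetical rather than one of the seven categories verbatim), each one samples parameters, computes the answer in Python, and emits a chat-formatted example:

import random

# Hypothetical unit-conversion template: the answer is computed in Python,
# so the calc token and its result are correct by construction.
def unit_conversion_example():
    km = random.randint(2, 500)
    m = km * 1000
    question = f"How many metres are in {km} kilometres?"
    answer = (f"<<calc: {km} * 1000>> = {m}\n"
              f"There are {m} metres in {km} kilometres.")
    return {"messages": [
        {"role": "user", "content": question},
        {"role": "assistant", "content": answer},
    ]}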

Negative examples. About 26% of the training data consists of general knowledge questions (“What are some tips for learning a new language?”, “Explain the difference between RAM and ROM”) where the model should produce a normal text response with no calc tokens. This is important. Without negatives, the model learns to use the calculator for everything, including questions that don’t involve computation. Ask it about the fall of the Roman Empire and it’ll try <<calc: roman_empire + 1000_years>> and wonder why it gets a syntax error.

Split   Total   Calc examples    General knowledge
Train   7,752   5,719 (73.8%)    2,033 (26.2%)
Valid     969     749 (77.3%)      220 (22.7%)
Test      970     723 (74.5%)      247 (25.5%)

One deliberate design choice: all math uses calc tokens, even trivial arithmetic. 2 + 2 gets <<calc: 2 + 2>>, not 4. This enforces consistent delegation and avoids the model learning a threshold for “this is hard enough to need a calculator.” In practice, the overhead of evaluating 2 + 2 externally is negligible, and the consistency makes both training and evaluation cleaner.

Model and Training

The base model is Qwen 3.5-2B, a hybrid architecture with 18 GatedDeltaNet layers (linear attention with a custom Metal kernel for efficient inference on Apple Silicon) and 6 full attention layers.

Two special tokens, <<calc: and >>, are added to the tokenizer vocabulary. The corresponding embedding rows are initialised as the mean of all existing embeddings rather than random, which gives the model a reasonable starting point in the embedding space.
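
A minimal MLX sketch of that initialisation (the function and the way the resized embedding matrix is handled are illustrative):

import mlx.core as mx

# Assumes the embedding matrix has already been resized by `num_new` rows;
# fill those rows with the mean of the original vocabulary's embeddings.
def init_new_token_rows(embed_weight: mx.array, num_new: int) -> mx.array:
    old = embed_weight[:-num_new]
    mean_row = mx.mean(old, axis=0, keepdims=True)               # (1, hidden_dim)
    new_rows = mx.broadcast_to(mean_row, (num_new, old.shape[1]))
    return mx.concatenate([old, new_rows], axis=0)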

LoRA configuration:

rank:             32
alpha:            64  (effective scale: 2.0)
target layers:    last 16 of 24 transformer blocks
target modules:   all Linear layers (full mode)
learning rate:    2e-5 with cosine decay
warmup:           100 steps (linear 1e-7 → 2e-5)
batch size:       1
max sequence len:  2048
iterations:       3,000

Training was done on a Modal A100 using HuggingFace PEFT, taking about 86 minutes. The loss is standard cross-entropy for next-token prediction with no special weighting for calc tokens vs regular tokens.
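
In PEFT terms, the configuration above corresponds roughly to the following sketch (the target module names and base_model are assumptions that depend on how the model's layers are named; this is not the exact training script):

from peft import LoraConfig, get_peft_model

# Approximate translation of the settings above; module names are assumed.
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,                           # effective scale 64/32 = 2.0
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    layers_to_transform=list(range(8, 24)),  # last 16 of 24 blocks
    lora_dropout=0.0,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)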

One complication with LoRA on this model: the GatedDeltaNet layers in Apple’s mlx_lm use a custom Metal kernel for fast inference, but this kernel has no backward pass. It’s designed as an inference-only optimisation, with a pure-ops fallback (gated_delta_ops) for when you need gradients. This means two training modes:

  • Full mode: LoRA on all Linear layers. Requires model.train() which disables the Metal kernel and falls back to the auto-differentiable ops path. About 7x slower but produces better adapters.
  • MLP-only mode: LoRA on MLP + full-attention layers only. Wraps GatedDeltaNet output with mx.stop_gradient() so the fast kernel stays active during forward passes. Faster iteration but the linear attention layers are effectively frozen.

The final model uses full mode.
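
For the MLP-only mode, the stop_gradient wrap amounts to something like this MLX sketch (the wrapper class is hypothetical; the point is that gradients never reach the linear attention layers, so the fast kernel can stay in the forward path):

import mlx.core as mx

# Hypothetical wrapper: the forward pass still runs the inference-only
# Metal kernel, but no gradients flow into the GatedDeltaNet parameters.
class FrozenGatedDeltaNet:
    def __init__(self, layer):
        self.layer = layer

    def __call__(self, x, *args, **kwargs):
        return mx.stop_gradient(self.layer(x, *args, **kwargs))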

Inference: Token-Level Interception

The core mechanism is simple. During autoregressive generation, the interceptor watches the output stream for completed <<calc: expr>> patterns. When one is detected:

  1. Generation is paused
  2. The expression is extracted and evaluated
  3. = {result} is injected into the generated text
  4. The full context (prompt + generated text so far) is re-encoded
  5. Generation continues from the new context

import re
import mlx.core as mx

CALC_PATTERN = re.compile(r'<<calc:\s*(.*?)\s*>>')

generated_text = ""
full_context = prompt
total_generated_tokens = 0
search_from = 0  # don't re-match calc tokens that were already evaluated

while total_generated_tokens < max_tokens:
    input_ids = mx.array(tokenizer.encode(full_context))

    for token, _ in generate_step(input_ids, model, sampler=sampler):
        token_id = token.item() if hasattr(token, "item") else token
        token_text = tokenizer.decode([token_id])
        generated_text += token_text
        total_generated_tokens += 1

        match = CALC_PATTERN.search(generated_text, search_from)
        if match:
            result = eval_fn(match.group(1))
            injection = f" = {format_result(result)}"
            generated_text = (generated_text[:match.end()] + injection
                              + generated_text[match.end():])
            search_from = match.end() + len(injection)
            full_context = prompt + generated_text
            break  # re-encode the full context and continue generating

There’s no KV cache reuse after injection. The full context is re-encoded from scratch. This is deliberately simple. The model typically generates 1-3 calc expressions per response, so the re-encoding overhead is minimal compared to the generation time itself.

Eval Backends

I implemented three backends for evaluating the extracted expressions, primarily to understand where computation time actually goes.

CPU safe_eval. A sandboxed Python evaluator built on the ast module. The expression string is parsed into an abstract syntax tree (so 7 * 8 + 2 becomes a tree with + at the root, 7 * 8 on the left, 2 on the right), validated by a whitelist visitor that rejects anything dangerous (imports, attribute access, exec/eval, file operations), then evaluated by a second tree walker that maps nodes to Python operators and math functions. Supports arithmetic, comparisons, trig functions, logarithms, and list/tuple literals. Safety limits: 500 char max expression length, 20 max nesting depth, 1e100 magnitude cap.
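
A stripped-down sketch of that evaluator (illustrative, not the repo's exact safe_eval; it keeps the whitelist idea but folds validation and evaluation into a single tree walk):

import ast
import math
import operator

_BINOPS = {ast.Add: operator.add, ast.Sub: operator.sub,
           ast.Mult: operator.mul, ast.Div: operator.truediv,
           ast.Pow: operator.pow, ast.Mod: operator.mod}
_FUNCS = {"sqrt": math.sqrt, "log": math.log, "sin": math.sin, "cos": math.cos}

def safe_eval(expr: str, max_len: int = 500) -> float:
    if len(expr) > max_len:
        raise ValueError("expression too long")
    tree = ast.parse(expr, mode="eval")

    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINOPS:
            return _BINOPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
                and node.func.id in _FUNCS):
            return _FUNCS[node.func.id](*[walk(a) for a in node.args])
        raise ValueError(f"disallowed node: {type(node).__name__}")  # whitelist

    result = walk(tree.body)
    if abs(result) > 1e100:
        raise ValueError("magnitude cap exceeded")
    return result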

MLX ops. Parses the expression AST and maps operations to mx.* calls (mx.add, mx.multiply, mx.sqrt, etc.), keeping computation on the GPU. No CPU round-trip for the eval itself.
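
The same idea sketched with MLX ops (illustrative; the real backend covers more operators and functions than this):

import ast
import mlx.core as mx

_MX_BINOPS = {ast.Add: mx.add, ast.Sub: mx.subtract,
              ast.Mult: mx.multiply, ast.Div: mx.divide}

def mlx_eval(expr: str) -> float:
    def walk(node):
        if isinstance(node, ast.Constant):
            return mx.array(float(node.value))
        if isinstance(node, ast.BinOp) and type(node.op) in _MX_BINOPS:
            return _MX_BINOPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError("unsupported expression")
    out = walk(ast.parse(expr, mode="eval").body)
    mx.eval(out)        # force the lazy graph to execute
    return out.item()   # single scalar back to Python for injection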

Metal shader. A custom Metal compute kernel that implements a stack-based postfix evaluator. Expressions are converted from infix to postfix notation via the shunting-yard algorithm on the CPU side, then dispatched to the GPU as a flat float buffer where positive values are operands and negative values are opcodes:

#define MAX_STACK_DEPTH 32
// Opcode sentinels; the exact values are illustrative and must match the
// host-side encoder.
#define OP_END (-1.0f)
#define OP_ADD (-2.0f)

kernel void evaluate_postfix(
    device const float* expr_buf    [[buffer(0)]],
    device float*       result_buf  [[buffer(1)]],
    device const uint*  params      [[buffer(2)]],  // params[0]: tokens per expression
    uint                tid         [[thread_position_in_grid]])
{
    const uint expr_len    = params[0];
    const uint expr_offset = tid * expr_len;  // one expression per thread

    float stack[MAX_STACK_DEPTH];
    int sp = 0;

    for (uint i = 0; i < expr_len; i++) {
        float tok = expr_buf[expr_offset + i];
        if (tok == OP_END) break;
        if (tok == OP_ADD) {
            float b = stack[--sp]; float a = stack[--sp];
            stack[sp++] = a + b;
        } else if (tok >= 0.0f) {
            stack[sp++] = tok;  // operand
        }
        // ... other opcodes
    }
    result_buf[tid * 2] = stack[0];
}

The shader is compiled at runtime via MTLDevice.newLibraryWithSource() through PyObjC, and supports batched evaluation (N expressions in one dispatch).
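
The host-side encoding looks roughly like this sketch (the expression is assumed to be tokenised already; opcode values are illustrative and just need to agree with the kernel's constants):

# Operands are stored as-is; operators become negative sentinel values.
OP_END, OP_ADD, OP_SUB, OP_MUL, OP_DIV = -1.0, -2.0, -3.0, -4.0, -5.0
OPCODES = {"+": OP_ADD, "-": OP_SUB, "*": OP_MUL, "/": OP_DIV}
PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}

def to_postfix_buffer(tokens: list[str]) -> list[float]:
    """Shunting-yard: infix tokens -> flat float buffer for the kernel.
    Handles binary operators and parentheses only."""
    output, ops = [], []
    for tok in tokens:
        if tok in OPCODES:
            while ops and PRECEDENCE.get(ops[-1], 0) >= PRECEDENCE[tok]:
                output.append(OPCODES[ops.pop()])
            ops.append(tok)
        elif tok == "(":
            ops.append(tok)
        elif tok == ")":
            while ops[-1] != "(":
                output.append(OPCODES[ops.pop()])
            ops.pop()  # discard the "("
        else:
            output.append(float(tok))  # operand
    while ops:
        output.append(OPCODES[ops.pop()])
    return output + [OP_END]

# "7 * (8 + 2)" tokenised -> [7.0, 8.0, 2.0, OP_ADD, OP_MUL, OP_END]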

I also wrote a GPU-parallel pattern scanner kernel (scan_calc_tokens) that searches token buffers for <<calc: and >> sequences using atomic counters, but it turned out to be unnecessary. CPU regex on the decoded text is fast enough for streamed generation.

Results

All evaluations use live calc interception during generation, testing the full production pipeline.

Model                        Math accuracy      Avg tok/s
Base Qwen 3.5-2B (no calc)   35.3% (255/723)    47.5
SFT + CPU safe_eval          62.8% (454/723)    43.2
SFT + MLX ops                59.6% (431/723)    43.7
SFT + Metal shader           60.0% (434/723)    43.3

Answer correctness is determined by extracting the final numeric result from the model’s output and comparing it to the ground truth.
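
Concretely, the check has roughly this shape (a simplified sketch; tolerance handling and number formats vary):

import re

NUMBER = re.compile(r'-?\d+(?:\.\d+)?')

def final_number(text: str):
    """Last number in the output, ignoring thousands separators."""
    matches = NUMBER.findall(text.replace(",", ""))
    return float(matches[-1]) if matches else None

def is_correct(output: str, ground_truth: float, tol: float = 1e-6) -> bool:
    pred = final_number(output)
    return pred is not None and abs(pred - ground_truth) <= tol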

A few things stand out.

Backend choice doesn’t matter. The accuracy differences between backends (62.8% vs 59.6% vs 60.0%) look like they could tell a story about float32 precision, but on 723 examples the gaps aren’t statistically significant. Different backends produce slightly different intermediate values; those values get injected back into the context as text and can nudge the model down different generation paths. The variation is better explained by this downstream divergence than by the eval precision itself. All three backends produce ~43 tok/s, and individual expression evaluation takes 20-160 microseconds depending on the backend while token generation runs at ~23 milliseconds per token. The eval is two to three orders of magnitude faster than the bottleneck. At this scale (single GPU, 2B model, no KV cache reuse after injection) the backend genuinely doesn’t matter. At larger scale, with batched inference across multiple GPUs and KV cache surgery instead of full re-encoding, the eval overhead might start to show up. But for this setup, pick whichever is simplest.

The base model is bad at math. 35.3% accuracy on problems that require multi-step arithmetic. This isn’t surprising for a 2B model, but it establishes the baseline. The fine-tuned model with calc interception reaches ~62%, a 78% relative improvement.

Format compliance is high. 96.7% of test examples correctly use (or correctly avoid) calc tokens. 99.7% of generated calc expressions are syntactically valid Python. The model learned the format well.

Discussion

What the 37.2% failure rate looks like. Not all failures are arithmetic errors. Some are reasoning failures: the model sets up the wrong equations, misidentifies what quantity to compute, or stops after computing an intermediate result instead of the final answer. The calc token approach fixes computation errors but doesn’t fix reasoning errors. A model that correctly delegates 87 * 24 to the calculator still fails if the problem required 87 + 24.

Why not GRPO? I experimented with Group Relative Policy Optimisation as a second training stage (the adapter checkpoints are in the repo) but found that SFT alone was sufficient for learning the format and delegation behaviour. GRPO might help for improving the reasoning that happens around the calc tokens, but that wasn’t the focus here.

The re-encoding tradeoff. Re-encoding the full context after each calc injection is wasteful in theory. You’re throwing away the KV cache and reprocessing tokens you’ve already seen. A more efficient approach would surgically insert the result tokens into the existing KV cache. But for 2B model inference on Apple Silicon where the context lengths are modest (under 2K tokens), the re-encoding adds maybe 50-100ms per injection. Not worth the complexity of KV cache surgery for this use case.

Apple Silicon as a research platform. The entire pipeline (data generation, fine-tuning locally via MLX or Modal A100 for production, inference, evaluation) runs on an M4 MacBook. MLX makes this practical. The Metal shader was an interesting exercise but ultimately unnecessary for this workload. The Python AST evaluator at 160 microseconds per expression is fast enough when token generation takes 23 milliseconds.

Conclusion

Teaching a small language model to delegate computation to an external tool is straightforward with SFT. The inline token format (<<calc: expr>>) is learnable from a modest dataset (~8K examples), the model achieves high format compliance (96.7%), and the accuracy improvement is substantial (35.3% to 62.8%). The remaining errors are predominantly reasoning failures, not computation failures, which suggests the next gains come from improving how the model formulates problems, not from improving the calculator.

For anyone reproducing this: the CPU eval (Python’s ast module) is the simplest backend and works just as well as the GPU alternatives. The speed difference is irrelevant when generation dominates, and there’s no reason to complicate things.

References

Kadlčík, M., Štefánik, M., et al. (2023). Calc-X and Calcformers: Exploiting Arithmetical Chain of Thought in Large Language Models. EMNLP 2023.

Schick, T., Dwivedi-Yu, J., et al. (2023). Toolformer: Language Models Can Teach Themselves to Use Tools. NeurIPS 2023.

Citation

Please cite this work as:

Leeney, Will. "Computers and Computation". Will Leeney (March 2026). https://willleeney.com/blog/computers-and-computation

Or use the BibTeX citation:

@article{leeney2026computers,
  title = {Computers and Computation},
  author = {Leeney, Will},
  journal = {willleeney.com},
  year = {2026},
  month = {March},
  url = "https://willleeney.com/blog/computers-and-computation"
}