Autoresearch-Charged Action Search

[Figure: the autoresearch experiment loop, in which objectives flow into experiments, measurements, and improved results]

A few months ago I wrote about building semantic search for StackOne’s 10,000+ actions. The approach worked: synonym-enriched embeddings with MiniLM, 84% Hit@5 across 9,340 actions, sub-10ms latency. We shipped it and customers noticed the improvement.

Except I couldn’t leave it alone.

The system we shipped was good at matching natural language queries to StackOne connector actions. But I had no idea how it compared to anything else. Were we at 80% of what’s possible, or 30%? The only way to find out was to benchmark properly. And the thing I really wanted was to set the objectives and let something else do the optimisation. I’d been building autonomous research loops for other domains, so it was time to point one at our own product.

This post is that story. Before the methodology: the headline results, so you know what you’re reading toward.

Benchmark | Score | What it tests
ToolRet-full | 54.4 NDCG@10 | 44,453 tools, paraphrased queries; the largest public tool-retrieval benchmark
MetaTool | 93.8% NDCG@5 | 200 tools, realistic agent task queries: “when should the agent call a tool at all?”
Held-out connectors | 92.8% Hit@1 | 15% of connectors excluded entirely from training; zero-shot generalisation to tools the model never saw

The model (109M params) is currently #1 on ToolRet-full, ahead of Qwen3-Embedding-8B (48.1), NV-Embed-v1 (42.6), GritLM-7B (39.2), and Anthropic Tool Use (30.0). The held-out connector result is the one I care most about in practice: a model that memorises training connectors but falls apart on new ones is useless in a growing connector library.

The rest of this post is how we got there.

The benchmark

The first step was a proper evaluation harness. Not just our internal test set, but a gauntlet. It covers our mock-connectors dataset (998 tools, 1,843 hard queries) plus three public benchmarks: MetaTool (200 tools, 2,000 queries), ToolBench (10,439 tools, 451 queries), and ToolRet-full (44,453 tools, 7,961 queries).
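
For reference, the metrics reported throughout this post can be sketched in a few lines. This is a minimal illustration, not the harness itself: it assumes binary relevance, and the function names and the third tool name are made up for the example.

```python
import math

def hit_at_k(ranked, relevant, k):
    """1.0 if any relevant tool appears in the top k results, else 0.0."""
    return 1.0 if any(t in relevant for t in ranked[:k]) else 0.0

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant tool, or 0.0 if none is retrieved.
    MRR is the mean of this value over all queries."""
    for i, t in enumerate(ranked, start=1):
        if t in relevant:
            return 1.0 / i
    return 0.0

def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance nDCG@k: DCG of this ranking divided by the DCG of an ideal one."""
    dcg = sum(
        1.0 / math.log2(i + 1)
        for i, t in enumerate(ranked[:k], start=1)
        if t in relevant
    )
    ideal = sum(1.0 / math.log2(i + 1) for i in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0

# One query whose correct tool is ranked second.
ranked = ["jira_create_issue", "github_create_issue", "github_list_issues"]
relevant = {"github_create_issue"}

print(hit_at_k(ranked, relevant, 1), hit_at_k(ranked, relevant, 5))
print(reciprocal_rank(ranked, relevant), ndcg_at_k(ranked, relevant, 10))
```

Hit@1 is the strictest of the three: a correct tool in second place scores zero, while MRR and nDCG still give partial credit.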

Then I added competitors. Not toy baselines — actual published methods:

  • Semantic Router (Aurelio Labs): embed descriptions as “routes” with centroid matching
  • Tool2Vec (Berkeley): synthetic-query embeddings
  • LangGraph BigTool: graph-based with an LLM in the loop
  • Toolshed RAG: query expansion + cross-encoder reranking

And for good measure, Anthropic’s native tool use — give Claude Sonnet the full tool catalog and let it pick. About $100 in API calls for that one.

The first results

On our mock-connectors dataset, the initial numbers were humbling:

Strategy | Hit@1 (global)
Semantic Router | 35.4%
Tool2Vec | 29.4%
StackOne v1 (synonym enrichment) | 27.6%
Toolshed RAG | 26.8%

To be clear, v1 wasn’t bad. It was built for a specific job: matching queries to actions within a known connector. And it did that job well (59% Hit@1 scoped, 84% Hit@5 on our production test set). But in a global search across all tools, the synonym dictionaries and domain terms didn’t help as much as I’d assumed. Semantic Router, which does nothing clever at all and just embeds raw descriptions with the same MiniLM model, beats it by about 8 points (35.4% vs 27.6%).
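
The difference between scoped and global search is easy to see in a toy example. The sketch below is an assumption-laden illustration, not our production code: the 2-d vectors stand in for real embeddings, and the `search` function and index layout are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def search(query_vec, index, connector=None, k=5):
    """Rank tools by cosine similarity to the query.
    With `connector` set, only that connector's tools are candidates (scoped);
    otherwise every tool in the index competes (global)."""
    candidates = [t for t in index if connector is None or t["connector"] == connector]
    ranked = sorted(candidates, key=lambda t: cosine(query_vec, t["vec"]), reverse=True)
    return [t["name"] for t in ranked[:k]]

# Toy 2-d vectors standing in for real embeddings.
index = [
    {"name": "github_create_issue", "connector": "github", "vec": [0.9, 0.1]},
    {"name": "jira_create_issue", "connector": "jira", "vec": [1.0, 0.0]},
    {"name": "github_list_issues", "connector": "github", "vec": [0.2, 0.8]},
]
query = [1.0, 0.05]

print(search(query, index))                      # global: every tool competes
print(search(query, index, connector="github"))  # scoped: GitHub tools only
```

In this toy index the two "create issue" tools are nearly identical in embedding space, so the global search puts the Jira action first while the scoped search cannot make that mistake. That is the kind of confusion a global Hit@1 number exposes.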

The off-the-shelf methods all clustered between 27% and 35%. The LLM-based approaches were slower (~2,500ms per query) and not dramatically better. Every approach had hit the same ceiling.

The training pivot

This is where the autoresearch loop earned its keep. The objective was simple: maximise Hit@1 on held-out connectors across all benchmarks, keep inference under 10ms, keep the model small enough to deploy on a Lambda. The loop handled data preparation, training runs, ablation studies, and evaluation.
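
To make the shape of that objective concrete, here is a deliberately simplified sketch of one pass of such a loop: maximise held-out Hit@1 subject to a latency budget. It is not the actual autoresearch system, which also decides which experiments to run. The function name, run names, and result numbers are all made up.

```python
def run_loop(configs, run_experiment, max_latency_ms=10.0):
    """Run each config, discard runs that break the latency budget, and keep
    the one with the best held-out Hit@1. Returns (best, history)."""
    best = None
    history = []
    for cfg in configs:
        result = run_experiment(cfg)
        history.append((cfg, result))
        if result["latency_ms"] >= max_latency_ms:
            continue  # violates the constraint, so never a candidate
        if best is None or result["hit_at_1"] > best[1]["hit_at_1"]:
            best = (cfg, result)
    return best, history

# Stand-in for a real training + evaluation run; these numbers are invented.
fake_results = {
    "run_a": {"hit_at_1": 0.50, "latency_ms": 4.0},
    "run_b": {"hit_at_1": 0.55, "latency_ms": 4.0},
    "run_c": {"hit_at_1": 0.60, "latency_ms": 25.0},  # most accurate, but too slow
}
best, history = run_loop(list(fake_results), lambda cfg: fake_results[cfg])
print(best)
```

The point of writing constraints into the loop rather than checking them afterwards is that an over-budget run can never become the baseline that later experiments build on.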

The insight that mattered was data engineering, not architecture.

Instead of runtime heuristics (synonym dictionaries, category mappings), we built training data that teaches the model the same distinctions:

  • Cross-connector hard negatives: same action suffix from different connectors. For example, “github_create_issue” vs “jira_create_issue”. The model has to learn that these are different tools, not just “create issue”
  • Same-connector hard negatives: same connector, different action. Forces the model to distinguish between closely related operations
  • LLM-generated paraphrases: 3 per instruction. “Make a new employee”, “add a team member”, “onboard a hire”. Exactly what the synonym dictionary was trying to do, but learned rather than hand-coded
  • Connector-level holdout: 15% of entire connectors held out for validation, not just queries. The model has to generalise to tools it’s never seen
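
The two kinds of hard negatives above can be sketched as a simple selection over the tool catalog. This is an illustration under an assumed naming scheme (connector name before the first underscore), not the actual data pipeline. The helper names and the last two catalog entries are hypothetical.

```python
def split_name(tool):
    """Split 'github_create_issue' into ('github', 'create_issue').
    Assumes the connector is the prefix before the first underscore."""
    connector, _, action = tool.partition("_")
    return connector, action

def hard_negatives(positive, catalog):
    """Return (cross_connector, same_connector) negatives for one positive tool."""
    connector, action = split_name(positive)
    cross, same = [], []
    for tool in catalog:
        if tool == positive:
            continue
        c, a = split_name(tool)
        if a == action and c != connector:
            cross.append(tool)  # same action suffix, different connector
        elif c == connector:
            same.append(tool)   # same connector, different action
    return cross, same

catalog = [
    "github_create_issue",
    "jira_create_issue",
    "github_list_issues",
    "jira_list_projects",
]
cross, same = hard_negatives("github_create_issue", catalog)
print(cross, same)
```

Tools that share neither connector nor action, such as `jira_list_projects` here, are left out on purpose: they are easy negatives and teach the model very little.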

The split matters. If you hold out queries but the model has seen the tools during training, you’re measuring memorisation, not generalisation.
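
A connector-level split is only a few lines, but it is easy to get wrong by shuffling examples instead of connectors. A minimal sketch, assuming each example carries a `connector` field (the field names and synthetic data are hypothetical):

```python
import random

def connector_holdout(examples, holdout_frac=0.15, seed=0):
    """Split examples so that held-out connectors never appear in training."""
    connectors = sorted({ex["connector"] for ex in examples})
    rng = random.Random(seed)
    rng.shuffle(connectors)
    n_holdout = max(1, round(len(connectors) * holdout_frac))
    held_out = set(connectors[:n_holdout])
    train = [ex for ex in examples if ex["connector"] not in held_out]
    val = [ex for ex in examples if ex["connector"] in held_out]
    return train, val, held_out

# Synthetic data: 20 connectors with 3 examples each.
examples = [
    {"connector": f"connector_{i}", "query": f"query {j}", "tool": f"connector_{i}_action_{j}"}
    for i in range(20)
    for j in range(3)
]
train, val, held_out = connector_holdout(examples)
print(len(train), len(val), sorted(held_out))
```

Because the shuffle happens over connector names rather than rows, every example for a held-out connector lands in validation, and no tool from it can leak into training.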

What the loop found

Some of it was predictable. More hard negatives helped (we went from 3 to 7 per example). Doubling the text truncation from 1,000 to 2,000 characters preserved documentation context that turned out to be important.

Some of it surprised me. The fine-tuned MiniLM (22M parameters) beat BGE-base (109M parameters) on our internal data. Less capacity with good training data beats raw model size. Random sampling of training data outperformed every clever curriculum strategy the loop tried. And the biggest single lift (45% on ToolRet-full) came from fixing a format mismatch between training and evaluation. The model was embedding structured descriptions during training but raw documentation during eval. Aligning those was almost half the battle.
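
One way to rule out that kind of mismatch is to route both training data and the evaluation index through a single formatting function. The sketch below shows the idea only. The template and field names are assumptions, since the actual format is not described here. The 2,000-character limit is the truncation length mentioned above.

```python
def format_tool(tool, max_chars=2000):
    """Render a tool as the text that gets embedded.
    Calling this same function when building training examples and when
    building the evaluation index keeps the two formats aligned."""
    text = f"{tool['name']}: {tool['description']}\n{tool.get('docs', '')}".strip()
    return text[:max_chars]

tool = {
    "name": "github_create_issue",
    "description": "Create an issue in a repository",
    "docs": "x" * 5000,  # stand-in for long documentation text
}
train_text = format_tool(tool)  # what the model sees during training
eval_text = format_tool(tool)   # what gets indexed at evaluation time
print(train_text == eval_text, len(train_text))
```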

The loop also ran hard negative mining: run the model, find the queries it gets wrong, extract the tools it confused, and feed those back as targeted training signal. This is the kind of thing you wouldn’t bother doing manually. It’s tedious and iterative, which is exactly what an autonomous loop is good at.
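
The mining step described above can be sketched as follows. This is a simplified illustration, not the loop's real code: `rank_fn` stands in for the model's retrieval call, the rankings are hard-coded, and `github_list_repos` is a hypothetical tool name.

```python
def mine_hard_negatives(queries, rank_fn, per_query=7):
    """For each query the model gets wrong at rank 1, collect the tools it
    ranked above the correct one as new hard negatives."""
    mined = []
    for q in queries:
        ranked = rank_fn(q["text"])
        if ranked and ranked[0] == q["tool"]:
            continue  # already correct, nothing to mine
        if q["tool"] in ranked:
            confused = ranked[: ranked.index(q["tool"])]
        else:
            confused = ranked  # correct tool not retrieved at all
        confused = confused[:per_query]
        if confused:
            mined.append({"query": q["text"], "positive": q["tool"], "negatives": confused})
    return mined

# Hard-coded rankings standing in for real model output.
fake_rankings = {
    "open a bug": ["jira_create_issue", "github_create_issue"],
    "list my repos": ["github_list_repos", "github_create_issue"],
}
queries = [
    {"text": "open a bug", "tool": "github_create_issue"},
    {"text": "list my repos", "tool": "github_list_repos"},
]
mined = mine_hard_negatives(queries, lambda text: fake_rankings[text])
print(mined)
```

Each mined record is a ready-made training triple, so the output of one round can be appended to the training set for the next.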

The final model

The result is what we’re calling v2: a fine-tuned BGE-base model (109M params) trained on combined data from all sources. Here’s how it stacks up.

Mock-Connectors — Scoped (998 tools, know which connector)

Model | Hit@1 | Hit@10 | MRR
v2 (fine-tuned) | 92.8% | 100% | 0.957
v1 (synonyms + MiniLM) | 59.0% | 96.8% | 0.712
BM25 | 36.5% | 91.6% | 0.537

Mock-Connectors — Global (all 998 tools)

Model | Hit@1 | Hit@10 | MRR
v2 (fine-tuned) | 57.3% | 82.3% | 0.661
Anthropic Tool Use | 44.0% | 69.3% | 0.517
v1 (synonyms + MiniLM) | 27.6% | 58.7% | 0.377

ToolRet-full (44,453 tools, 7,961 queries)

Model | nDCG@10 | Params
v2 (combined-v4) | 0.544 | 109M
Qwen3-Embedding-8B | 0.462 | 8,000M
NV-Embed-v1 | 0.427 | 7,000M
GritLM-7B | 0.411 | 7,000M
v1 (synonyms + MiniLM) | ~0.26 | 22M

MetaTool (200 tools, 2,000 queries)

Model | Hit@1 | Hit@5 | nDCG@5
v2 (combined-v4) | 86.9% | 98.6% | 0.938
Anthropic Tool Use | 83.3% | 96.7% | 0.913
v1 (synonyms + MiniLM) | 62.0% | 87.6% | 0.762
BM25 | 23.1% | 52.3% | 0.385

ToolBench (10,439 tools, 451 queries)

Model | Hit@1 | Hit@10 | nDCG@5
v2 (combined-v4) | 20.6% | 96.0% | 0.482
Ada-002 (paper) | - | - | 0.387
v1 (synonyms + MiniLM) | 11.8% | 51.4% | 0.255
BM25 | 8.2% | 36.4% | 0.174

The headline: a 109M parameter model, trained on domain-specific data with proper hard negatives, is #1 on ToolRet-full (54.4 NDCG@10). It beats Qwen3-Embedding-8B, NV-Embed-v1, and GritLM-7B, all roughly 64–73x larger by parameter count. It beats Anthropic’s native tool use by 81% on the same benchmark (54.4 vs 30.0) at 30,000x less latency and near-zero marginal cost. And on held-out connectors, the 15% of connectors excluded entirely from training, it hits 92.8% Hit@1. That last number is the practical test: can the model generalise to tools it has never seen? In a library that ships new connectors every week, that’s the benchmark that actually matters in production.

Deployment

A model that wins benchmarks is useless if you can’t retrain and deploy it reliably. The autoresearch loop runs on Modal GPUs, so we needed signing keys to ensure only our CI pipeline can trigger training runs. Datasets live in S3 with versioning, so every training run is traceable to a specific data snapshot. The pipeline formats and validates datasets before training starts, because a malformed example at row 40,000 shouldn’t surface as a mysterious accuracy regression two hours into a run.
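
The validation step is conceptually simple: read every row before any GPU time is spent, and report all problems with their row numbers. A minimal sketch, assuming JSONL training files with a hypothetical `query` / `positive` / `negatives` schema (the real pipeline's schema and checks are not shown in this post):

```python
import json

REQUIRED = ("query", "positive", "negatives")  # hypothetical schema

def validate_jsonl(lines):
    """Return a list of (row_number, problem) for every malformed example."""
    problems = []
    for n, line in enumerate(lines, start=1):
        try:
            row = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((n, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(row, dict):
            problems.append((n, "row is not an object"))
            continue
        missing = [f for f in REQUIRED if f not in row]
        if missing:
            problems.append((n, f"missing fields: {missing}"))
        elif not isinstance(row["negatives"], list) or row["positive"] in row["negatives"]:
            problems.append((n, "negatives must be a list that excludes the positive"))
    return problems

lines = [
    '{"query": "open a bug", "positive": "github_create_issue", "negatives": ["jira_create_issue"]}',
    '{"query": "open a bug", "positive": "github_create_issue"}',
    "not json",
]
problems = validate_jsonl(lines)
for row_number, problem in problems:
    print(row_number, problem)
```

Collecting every problem instead of stopping at the first one means a single validation pass is enough to fix a bad snapshot.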

The most important piece: performance gates. The pipeline won’t promote a new model unless it beats the current production model on all four benchmarks. No regressions allowed. If it produces something worse, it just doesn’t ship. That’s what makes low-supervision retraining viable.
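
The gate itself can be as small as a single comparison per benchmark. In the sketch below the benchmark keys, the choice of one score per benchmark, and the function name are assumptions made for illustration. The scores are placeholders, not a record of what the real gate compares.

```python
def should_promote(candidate, production, benchmarks):
    """Promote only if the candidate beats production on every benchmark.
    A tie or a regression on any single benchmark blocks promotion."""
    return all(candidate[b] > production[b] for b in benchmarks)

benchmarks = ["mock_connectors", "metatool", "toolbench", "toolret_full"]
production = {"mock_connectors": 0.573, "metatool": 0.869, "toolbench": 0.206, "toolret_full": 0.544}

better = {b: score + 0.01 for b, score in production.items()}
regressed = dict(better, toolbench=0.200)  # better on three, worse on one

print(should_promote(better, production, benchmarks))
print(should_promote(regressed, production, benchmarks))
```

Using `all` rather than an average is the design choice that matters: a large gain on one benchmark cannot pay for a regression on another.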

What I learned

The synonym dictionaries, the domain enrichment, the category context maps: all the hand-crafted work in v1 added maybe 1-2 points over a vanilla embedding baseline. They felt good to write, the kind of thing an engineer builds in a weekend and ships. But they were solving the wrong problem. The real gains came from the training data: connector-aware splits, structured hard negatives, failure-mined examples, and cross-domain data mixing.

The autoresearch loop didn’t discover any single brilliant insight. It just ran a lot of experiments I wouldn’t have run manually (ablations on negative sampling, format alignment checks, curriculum experiments that mostly didn’t work) and the compound effect of all those small wins added up to a model that doubles or triples accuracy depending on the benchmark.

If you have a clear metric and a fast feedback loop, pointing an autonomous research process at your own product is probably the highest-leverage thing you can do. I set the objectives, curated the data sources, and decided when something was good enough to ship. The iteration part is increasingly not my job.