Building a Search Tool: Accelerating Development with AI

AI Engineering · Production ML

At StackOne, we have over 10,000 actions across all our connectors and growing. Some connectors have 2,000 actions alone because the underlying API is massive. Customers were asking for better search, and rightfully so: keyword matching doesn’t work when someone searches “onboard new hire” but the action is called “Create Employee”. We needed semantic search.

I’ve written before about the implications of LLM-assisted coding, but this post focuses on how I’m using LLMs in my workflow right now. I want to walk through how I actually built this feature using Claude Code and reflect on the changing nature of development work. This process probably won’t be relevant in a year or two, but for now, this is how I’m getting the most out of AI-assisted development.

For context on the scale of AI assistance: while Claude runs, I’m typically doing other work. As I write this, I have six other Claude Code sessions running in parallel. Compared to even four months ago - when I had to hand-hold it through POCs - the level of autonomy is remarkable.

The Development Lifecycle

The phases I went through were:

  1. Stakeholder conversations (CTO + product managers)
  2. POC and benchmarking
  3. Local development
  4. API integration
  5. Testing and iteration

This development lifecycle hasn’t really changed: AI is involved throughout the process, but at the top level of the hierarchy, the phases themselves look the same as they always have.

Phase 1: Specification

The spec remains essential. A significant part of a developer’s job is still specification: understanding the actual problem, defining the solution format, surfacing requirements, and identifying edge cases.

Through stakeholder conversations, we identified clear requirements:

  • Problem: 10k actions, keyword search fails for natural language queries
  • Solution: Semantic search
  • Constraints:
    • Storage must be minimal - we can’t add significant overhead to our API
    • Search latency must remain fast
    • Must handle account-specific custom connectors

These constraints naturally defined our benchmark dimensions: accuracy, storage size, and latency. Without these upfront, we’d have no way to evaluate whether different approaches were actually better.

The deeper investigation revealed important architectural details: connectors live in S3, account-specific connectors need special handling, builds trigger every 24 hours plus a manual endpoint, and Turbopuffer would serve as our vector store. Each of these details shaped the eventual solution.

If we wanted to replace large portions of this process with AI, we would need to give it access to essentially all of the business’s knowledge: Slack channels, market positioning, every git repo, plus all documentation and design decisions. That’s feasible, and we’ve actively had conversations about it. But I still think there are human decision points - for nearly any specification there is a Pareto frontier of solutions, and someone has to pick a point on it. Would AI be quicker? Maybe, but designing this specification probably only took about two total human hours, split across three people.

Phase 2: POC

With requirements gathered, I outlined everything to Claude and set it running. The key was being explicit about what success looked like: make a benchmark dataset, define the metrics I care about, test BM25 against semantic search, and iterate on the model. I also knew reranking might help, so I asked it to include that in the comparison.

Methods Tested

| Method | Description |
| --- | --- |
| BM25 Only | Keyword matching via Orama |
| Semantic Only | Local embeddings using all-MiniLM-L6-v2 |
| Enhanced Embeddings | Enriches text with synonyms before embedding |
| Enhanced + Rerank | Enhanced embeddings + cross-encoder reranking |
| Semantic + Rerank | Standard semantic + cross-encoder reranking |

The enhanced embeddings approach enriches action text with synonyms before generating embeddings:

"Create Employee" → "Create Employee add new make insert worker staff team member..."

This is important because it bridges the vocabulary gap between how users think about actions and how actions are actually named.
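A minimal sketch of what this enrichment step looks like in practice. The synonym table and function name here are illustrative stand-ins, not the actual implementation:

```typescript
// Hypothetical synonym table; a real one would be much larger.
const SYNONYMS: Record<string, string[]> = {
  create: ["add", "new", "make", "insert"],
  employee: ["worker", "staff", "team member", "hire"],
};

// Append synonyms for each recognized word so the embedding model sees
// both the action's vocabulary and the user's likely vocabulary.
function enrichActionText(actionName: string): string {
  const extras = actionName
    .toLowerCase()
    .split(/\s+/)
    .flatMap((word) => SYNONYMS[word] ?? []);
  return [actionName, ...extras].join(" ");
}

console.log(enrichActionText("Create Employee"));
// "Create Employee add new make insert worker staff team member hire"
```

The enriched string, not the raw action name, is what gets embedded, so queries like "add new hire" land near "Create Employee" in vector space.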

Benchmark Results

We tested against 103 semantically challenging queries - queries that use synonyms and natural language rather than exact keywords.
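The accuracy numbers below are Hit@k: the fraction of benchmark queries whose expected action appears in the top k results. A sketch of that metric, with an assumed shape for the benchmark cases and search function:

```typescript
// Assumed shape of one benchmark case (not the actual harness).
interface BenchmarkCase {
  query: string;
  expectedActionId: string;
}

// Hit@k: fraction of queries whose expected action is in the top k results.
// `search` is assumed to return action IDs ranked by relevance.
function hitAtK(
  cases: BenchmarkCase[],
  search: (query: string) => string[],
  k: number,
): number {
  const hits = cases.filter((c) =>
    search(c.query).slice(0, k).includes(c.expectedActionId),
  ).length;
  return hits / cases.length;
}
```

Running this over the 103 queries for each method, at k = 1, 3, and 5, produces the tables below.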

All Connectors (9,340 actions)

| Approach | Hit@1 | Hit@3 | Hit@5 | Latency |
| --- | --- | --- | --- | --- |
| Enhanced Embeddings | 56% | 81% | 84% | 6ms |
| Enhanced + Rerank | 56% | 81% | 84% | 36ms |
| Semantic Only | 42% | 58% | 70% | 8ms |
| BM25 Only | 9% | 12% | 21% | 19ms |

Per Connector (filtered search)

| Approach | Hit@1 | Hit@3 | Hit@5 | Latency |
| --- | --- | --- | --- | --- |
| Enhanced Embeddings | 67% | 80% | 90% | 0.9ms |
| Semantic Only | 62% | 75% | 80% | 0.9ms |
| BM25 Only | 47% | 60% | 65% | 0.2ms |

Key Findings:

  1. Enhanced Embeddings wins — 84-90% Hit@5 with ~6ms latency
  2. Reranking provides no benefit — adds 30ms with zero accuracy improvement
  3. BM25 fails on natural language — only 21% Hit@5 (can’t match “onboard” to “create”)
  4. Synonym enrichment adds +14 percentage points over standard semantic search

Storage: ~206 MB total (BM25 index 39 MB, enhanced embeddings 72 MB, model 23 MB)

The results validated the approach. Enhanced embeddings met our accuracy targets with acceptable latency and storage. Could we improve further? Probably. But the goal was getting something deployed and collecting real user feedback, not perfecting a system in isolation.

Phase 3: API Integration

Having deployed the resolutions feature previously, I knew the process. The key insight was pointing Claude Code at both the unified-api and ai-generation repos simultaneously, giving it the full context needed to wire everything together.

Architecture

Build Flow:

GitHub Actions ─────► Lambda ◄───── S3 Configs
(cron/manual)         │              (connectors)

unified-api ────► Transformer ────► Turbopuffer
POST /build       + Embeddings      (vector DB)
                  (MiniLM-L6)
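The build flow above can be sketched as a single function: load configs, embed each action, upsert the vectors. The client interfaces here are illustrative stand-ins for the real S3 and Turbopuffer SDKs, not their actual APIs:

```typescript
// Assumed minimal shapes for a connector action and the vector store client.
interface ActionConfig {
  id: string;
  name: string;
  connector: string;
}

interface VectorStore {
  upsert(
    rows: { id: string; vector: number[]; attributes: Record<string, string> }[],
  ): Promise<void>;
}

// Hypothetical build step: pull connector configs, embed each action's
// (enriched) text, and upsert the resulting vectors.
async function buildIndex(
  loadConfigs: () => Promise<ActionConfig[]>, // e.g. read from S3
  embed: (text: string) => Promise<number[]>, // e.g. all-MiniLM-L6-v2
  store: VectorStore,
): Promise<number> {
  const actions = await loadConfigs();
  const rows = await Promise.all(
    actions.map(async (a) => ({
      id: a.id,
      vector: await embed(a.name), // enriched text in the real pipeline
      attributes: { connector: a.connector, name: a.name },
    })),
  );
  await store.upsert(rows);
  return rows.length;
}
```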

Search Flow:

Client ────► unified-api ────► Lambda
             POST /search      │
             + project_id      ▼
                            Embed Query


             Turbopuffer ◄── Vector Search
             (cosine sim)     + filters


             Results

The build process pulls connector configs from S3, transforms them into indexed actions with enhanced text, generates embeddings, and upserts to Turbopuffer. Search embeds the query, applies connector and project filters, and returns ranked results.
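An in-memory sketch of that search step, assuming the query has already been embedded; the record shape and function names are invented for illustration, and in production the filtering and ranking happen inside Turbopuffer rather than application code:

```typescript
// Assumed shape of an indexed action in the vector store.
interface IndexedAction {
  id: string;
  vector: number[];
  connector: string;
  projectId: string;
}

// Cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

// Apply project/connector filters, then rank by cosine similarity.
function searchActions(
  queryVector: number[],
  index: IndexedAction[],
  filters: { projectId: string; connector?: string },
  topK = 5,
): IndexedAction[] {
  return index
    .filter((a) => a.projectId === filters.projectId)
    .filter((a) => !filters.connector || a.connector === filters.connector)
    .map((a) => ({ action: a, score: cosine(queryVector, a.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK)
    .map((r) => r.action);
}
```

Filtering before ranking is what makes the per-connector numbers in the benchmark so much faster: the candidate set shrinks from ~10k actions to a few hundred.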

Deployment surfaced the usual issues: missing IAM permissions for listBuckets, which required a redeploy. Then I found tests passing silently because there were no custom connectors in the staging bucket - a reminder that passing tests don’t guarantee correctness. After transferring test data and re-validating locally, everything worked as expected.

The New Development Lifecycle

The pattern that’s emerging:

  1. Specification — understand the problem and solution (unchanged from pre-AI)
  2. Context handoff — provide Claude with the spec, relevant code locations, test criteria, benchmark requirements, and decision points where it should pause for input
  3. Verification — review the output and validate it works

I’m also finding value in longer sessions with more context compaction. The CLAUDE.md file that Claude uses, which I initially found annoying, has become really useful as a project tracker and decision point log.

The human role is still specification, verification, and knowing when something is good enough to ship. But execution is increasingly delegated. The interesting question is how this continues to evolve.