Building a Local Agentic AI Development Workflow

A local-first AI development workflow evolved from prompt-only experimentation into a structured system for validation, orchestration, retrieval, observability, and human oversight.

Type: Experiment
Focus areas: Local AI Infrastructure, Agentic AI, Multi-Agent Systems, Validation Architecture, Evaluation-Driven Development, RAG, Human-Centered AI Engineering
Tools or methods: Ollama, VS Code Agentic Coding Workflows, Local LLMs, RAG, Contract-Based Validation, Eval Harnesses, Structured Logging
Status: Ongoing
Timeframe: Multi-phase ongoing exploration
Role: Designer, Systems Architect, Product Strategist, Developer
Visibility: Public experiment

Why This Exists

I wanted to explore whether AI-assisted software development could become more reliable, observable, and operationally grounded without relying entirely on commercial AI platforms or cloud-hosted infrastructure.

The goal was not simply to experiment with AI coding tools. The larger objective was to build a repeatable local-first workflow that combined AI generation with deterministic validation, structured orchestration, and human oversight.

This project became a broader exploration of:

  • Local AI infrastructure
  • Multi-agent coordination
  • Evaluation-driven development
  • Governance and validation systems
  • Human-centered AI engineering workflows

The work also helped me better understand the tradeoffs organizations face when security, privacy, or operational constraints limit the use of commercial AI services.

Context

Most modern AI coding workflows rely heavily on:

  • Cloud-hosted models
  • Prompt-only orchestration
  • Vendor-controlled infrastructure
  • Limited validation and observability

While these systems are powerful, they also create operational concerns around:

  • Reliability
  • Repeatability
  • Governance
  • Security
  • Transparency
  • Cost control

I wanted to explore whether a more structured engineering approach could improve the stability and trustworthiness of AI-assisted development workflows.

Approach

The workflow evolved through several phases.

The workflow evolved from prompt-only experimentation into a more structured system built around validation, orchestration, and iterative refinement.

Phase 1 – Prompt-Only Experimentation

Initial workflows relied heavily on direct prompting and manual review.

This created several recurring problems:

  • Inconsistent outputs
  • Hallucinated structures
  • Weak repeatability
  • Difficult debugging

Phase 2 – Structured Validation

Contract-driven validation layers were introduced to improve output consistency and reduce drift.

This shifted the workflow from “generate and hope” to “generate, validate, refine.”
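As a rough illustration, a generate, validate, refine loop might look like the sketch below. It is not the project's actual code: the `CONTRACT` shape, the `validate` helper, and the `generate` callable (any function that takes a prompt and returns model text) are hypothetical names chosen for the example.

```python
import json
from typing import Callable

# Hypothetical contract: required keys and their expected Python types.
CONTRACT = {"title": str, "steps": list}


def validate(raw: str, contract: dict) -> list[str]:
    """Return a list of contract violations; an empty list means the output passes."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value must be an object"]
    errors = []
    for key, expected in contract.items():
        if key not in data:
            errors.append(f"missing key: {key}")
        elif not isinstance(data[key], expected):
            errors.append(f"{key} must be {expected.__name__}")
    return errors


def generate_validate_refine(
    generate: Callable[[str], str], prompt: str, max_attempts: int = 3
) -> tuple[dict, int]:
    """Call the model, check the contract, and feed violations back until it passes."""
    current_prompt = prompt
    errors: list[str] = []
    for attempt in range(1, max_attempts + 1):
        raw = generate(current_prompt)
        errors = validate(raw, CONTRACT)
        if not errors:
            return json.loads(raw), attempt
        # Refine: append the deterministic findings so the next attempt can fix them.
        current_prompt = prompt + "\nFix these problems: " + "; ".join(errors)
    raise ValueError(f"contract still failing after {max_attempts} attempts: {errors}")
```

The validator is deterministic, so the same output always produces the same verdict. The loop is bounded so a model that never satisfies the contract fails loudly instead of retrying forever.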

Phase 3 – Multi-Agent Coordination

Additional orchestration layers introduced:

  • Specialized agent roles
  • Separation of concerns
  • Targeted regeneration cycles
  • Structured evaluation loops

One of the most important discoveries during this phase was that orchestration alone was not enough. The system also needed clear definitions of quality.

To address this, I developed evaluation criteria and scoring rubrics iteratively, through repeated testing and refinement. In some cases, I used AI-assisted interviews to externalize and formalize what “good” output meant for different workflows.
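One way to make a definition of quality explicit is a weighted rubric of deterministic checks. The sketch below is illustrative only: the `Criterion` type, the example criteria, and the weights are assumptions, not the rubrics used in this project.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Criterion:
    name: str
    weight: float
    check: Callable[[str], bool]  # deterministic check on the output text


# Hypothetical rubric for a generated README section; criteria and weights are illustrative.
RUBRIC = [
    Criterion("has_heading", 0.2, lambda text: text.lstrip().startswith("#")),
    Criterion("mentions_install", 0.5, lambda text: "install" in text.lower()),
    Criterion("under_200_words", 0.3, lambda text: len(text.split()) <= 200),
]


def score(output: str, rubric: list[Criterion]) -> tuple[float, list[str]]:
    """Return a weighted score in [0, 1] and the names of failed criteria."""
    total = sum(c.weight for c in rubric)
    earned = 0.0
    failed = []
    for c in rubric:
        if c.check(output):
            earned += c.weight
        else:
            failed.append(c.name)
    return earned / total, failed
```

Returning the failed criterion names alongside the score lets a regeneration cycle target what was missing rather than starting over.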

System or Process View

Core Components

  • Local LLM infrastructure via Ollama
  • VS Code agentic coding workflows
  • Contract-driven validators
  • Multi-agent orchestration
  • RAG-based retrieval systems
  • Evaluation harnesses
  • Structured logging and observability
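To show how two of these components could fit together, the sketch below calls a local model through Ollama's `/api/generate` endpoint and writes one JSON log record per call. The endpoint, default port, and the `model`, `prompt`, `stream`, and `response` fields follow Ollama's REST API as I understand it. The model name `llama3`, the logger name, and the log fields are placeholders, and the `transport` parameter is a hypothetical seam added so the function can be exercised without a running server.

```python
import json
import logging
import time
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

logger = logging.getLogger("workflow")  # logger name is an assumption


def post_json(url: str, payload: dict) -> dict:
    """Send a JSON POST request and decode the JSON reply (standard library only)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=300) as reply:
        return json.loads(reply.read().decode("utf-8"))


def generate(prompt: str, model: str = "llama3", transport=post_json) -> str:
    """Ask a local model for a completion and emit one structured log record."""
    started = time.perf_counter()
    body = transport(OLLAMA_URL, {"model": model, "prompt": prompt, "stream": False})
    text = body["response"]
    record = {
        "event": "generation",
        "model": model,
        "prompt_chars": len(prompt),
        "response_chars": len(text),
        "seconds": round(time.perf_counter() - started, 3),
    }
    logger.info(json.dumps(record))  # one JSON object per log line
    return text
```

Logging each call as a single JSON object makes it straightforward to filter by model or sort by latency later.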

Architectural Principles

  • Deterministic systems before generative systems
  • Governance before autonomy
  • Human review over blind execution
  • Modular components over monolithic agents
  • Clear operational boundaries between agents

High-level view of the local-first AI workflow showing orchestration, retrieval, validation, and governance layers working together to improve reliability, observability, and iterative refinement.
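The retrieval layer can be pictured with a deliberately small sketch. It ranks documents by bag-of-words cosine similarity, which stands in for a real embedding model. Every name here (`tokenize`, `retrieve`, `build_prompt`) is hypothetical, and the project's actual retrieval setup is not described in this write-up.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> Counter:
    """Lowercase word counts; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, documents: list[str], top_k: int = 2) -> list[str]:
    """Return up to top_k documents most similar to the query, skipping zero-score matches."""
    q = tokenize(query)
    scored = [(cosine(q, tokenize(doc)), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for s, doc in scored[:top_k] if s > 0]


def build_prompt(query: str, documents: list[str]) -> str:
    """Prepend the retrieved context to the question before it goes to the model."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents))
    return f"Use only this context:\n{context}\n\nQuestion: {query}"
```

Keeping retrieval as a plain function with text in and text out means it can be tested and logged independently of any model call.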

What Worked

Several patterns consistently improved reliability and workflow quality:

  • Deterministic validation dramatically reduced hallucinations and structural drift
  • Multi-agent separation of concerns improved maintainability
  • Evaluation harnesses made iteration cycles more measurable
  • Local-first infrastructure reduced operational cost for experimentation
  • Structured observability improved debugging and refinement
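A minimal evaluation harness, sketched under assumptions, shows what "more measurable" can mean in practice. The case format, the `CASES` examples, and the `run_evals` name are invented for illustration. The `generate` argument is any callable that maps a prompt to model output.

```python
from typing import Callable

# Hypothetical eval cases: a prompt plus a deterministic check on the output.
CASES = [
    {
        "id": "returns-json",
        "prompt": "Reply with a JSON object",
        "check": lambda out: out.strip().startswith("{"),
    },
    {
        "id": "mentions-tests",
        "prompt": "Describe the test plan",
        "check": lambda out: "test" in out.lower(),
    },
]


def run_evals(generate: Callable[[str], str], cases: list[dict]) -> dict:
    """Run every case once and summarize the pass rate and failing case ids."""
    failures = []
    for case in cases:
        output = generate(case["prompt"])
        if not case["check"](output):
            failures.append(case["id"])
    passed = len(cases) - len(failures)
    return {
        "total": len(cases),
        "passed": passed,
        "pass_rate": passed / len(cases) if cases else 0.0,
        "failures": failures,
    }
```

Running the same cases before and after a prompt or validator change turns "this feels better" into a pass rate that can be compared across iterations.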

The resulting workflow also became reusable across additional projects and experiments, creating a foundation for future AI exploration work.

AI-generated outputs moved through structured validation, evaluation, and human oversight loops before approval, helping improve reliability, repeatability, and operational trust.
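The approval flow described above can be sketched as a short gate function. This is an assumption-laden illustration: the `Decision` type, the stage names, and the `min_score` threshold of 0.8 are hypothetical, and the three callables stand in for whatever validator, evaluator, and review step a workflow actually uses.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Decision:
    approved: bool
    stage: str  # stage that made the final call
    notes: list[str] = field(default_factory=list)


def approve(
    output: str,
    validate: Callable[[str], list[str]],  # returns contract violations
    evaluate: Callable[[str], float],      # returns a score in [0, 1]
    human_review: Callable[[str], bool],   # returns True if the reviewer accepts
    min_score: float = 0.8,                # illustrative threshold
) -> Decision:
    """Run deterministic checks first, then scoring, then a human gate."""
    errors = validate(output)
    if errors:
        return Decision(False, "validation", errors)
    score = evaluate(output)
    if score < min_score:
        return Decision(False, "evaluation", [f"score {score:.2f} below {min_score:.2f}"])
    if not human_review(output):
        return Decision(False, "human_review", ["rejected by reviewer"])
    return Decision(True, "human_review")
```

Ordering the stages from cheapest to most expensive means a reviewer only sees outputs that have already passed the automated checks.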

What Did Not Work

Several challenges emerged throughout development:

  • Local hardware introduced significant performance limitations
  • Context window management became increasingly difficult as orchestration complexity grew
  • Over-engineering became a recurring risk
  • Different local models behaved inconsistently under similar prompts and evaluation conditions

Running local LLMs entirely on consumer hardware also highlighted practical tradeoffs between privacy, speed, and output quality. Tasks that take seconds using commercial infrastructure or server-grade hardware could take several minutes locally.
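Both the inconsistency between models and the latency cost can be measured with a small side-by-side comparison. The sketch below is hypothetical: `compare_models` and its result fields are invented names, and each model is represented by a plain callable so the comparison does not depend on any particular runtime.

```python
import time
from typing import Callable


def compare_models(
    models: dict[str, Callable[[str], str]],
    prompts: list[str],
    check: Callable[[str], bool],
) -> dict[str, dict]:
    """Run the same prompts through each model; record pass count and mean latency."""
    results = {}
    for name, generate in models.items():
        passed = 0
        elapsed = 0.0
        for prompt in prompts:
            started = time.perf_counter()
            output = generate(prompt)
            elapsed += time.perf_counter() - started
            if check(output):
                passed += 1
        results[name] = {
            "passed": passed,
            "total": len(prompts),
            "mean_seconds": elapsed / len(prompts) if prompts else 0.0,
        }
    return results
```

Holding the prompts and the check constant isolates the model as the only variable, which makes the speed and quality tradeoff visible in one table.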

Even with those limitations, the workflow remained valuable as an experimentation and learning environment where the primary success criterion was understanding what worked, what failed, and why.

Lessons Learned

This project reinforced several important ideas:

  • Prompt engineering is not architecture
  • Reliability emerges from systems, not prompts alone
  • Validation matters more than raw model size
  • AI workflows require operational discipline
  • Human-centered design principles apply to AI tooling itself

Most importantly, building a local-first workflow created a safe environment for experimentation without depending entirely on commercial platforms, hidden system behavior, or ongoing API costs.

That freedom made it easier to test ideas, iterate quickly, fail safely, and better understand how AI systems can be designed more intentionally.

What I Would Do Next

Future iterations would likely focus on:

  • Distributed local inference infrastructure
  • Improved orchestration observability
  • More advanced evaluation frameworks
  • Better memory and context management
  • Expanded reusable agent libraries
  • Stronger governance and audit tooling

Related Writing

Potential related posts:

  • Delivery Is the Strategy
  • Constraints as Design Material
  • You Can’t Have Good AI Without Good IA
  • Future writing on operational AI governance and human-centered AI systems

Digital, design, and AI strategy for mission-driven businesses and nonprofits.

© 2026 OneStrayThought LLC. All rights reserved.