Building Daggerheart Forge: Designing an Agentic AI System for Human-Centered Game Mastering

Daggerheart Forge began as a personal experiment to reduce the operational overhead of Game Mastering while preserving creative control. The project evolved into a larger exploration of agentic AI systems, structured validation, RAG pipelines, and human-centered AI orchestration for narrative content generation.

Type: Experiment
Focus areas: Agentic AI, Multi-Agent Systems, Validation Architecture, Information Architecture, Human-Centered Design, RAG
Tools or methods: Ollama, Mistral, OpenAI API, RAG, Semantic Retrieval, Contract-Based Validation, Eval Harnesses
Status: Ongoing
Timeframe: Multi-phase ongoing exploration
Role: Designer, Systems Architect, Product Strategist, Developer
Visibility: Public experiment

Why This Exists

I have a demanding full-time role, a family, and limited time for campaign preparation. Like many Game Masters, I found myself wanting to create richer content for my players while struggling to consistently find the time and mental energy required to build everything manually.

The original problem was not creativity. I already had the ideas.

The problem was operational overhead.

I wanted a way to rapidly generate campaign support material, maintain consistency with Daggerheart rules and guidelines, reduce repetitive prep work, preserve creative control, reuse and organize previously created content, and support improvisation during live sessions.

Most importantly, I did not want AI replacing the creative role of the Game Master. The goal was always augmentation, not automation.

Daggerheart Forge became an opportunity to explore what a human-centered AI system might look like when the intent is to support creativity rather than replace it.

The Original Assumption

The first version of the project was based on a relatively common assumption in AI experimentation: a sufficiently detailed prompt combined with a capable language model should be able to generate usable TTRPG content.

Initially, the system was extremely simple:

  • local models running through Ollama
  • basic generation scripts
  • a single generation agent
  • no structured validation
  • manual review and cleanup

The idea was straightforward: write a structured prompt, generate adversaries, environments, or equipment, and use the results in game sessions.

At first, this seemed promising. Then the outputs started failing in predictable ways.

Early Failure Modes

Inconsistent Formatting

Outputs varied wildly in structure and completeness. Some content resembled usable game material. Other outputs became long instructional essays or incoherent mechanical descriptions.

D&D Contamination

Because Daggerheart is relatively new and Dungeons & Dragons dominates online TTRPG training data, the models frequently introduced D&D terminology, incompatible mechanics, incorrect rule structures, and unrelated gameplay assumptions.

This became one of the most persistent problems in the system.
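
A contamination check of this kind can be sketched as a simple phrase blocklist. This is a minimal illustration, not the project's actual Contamination Sentinel: the term list and the `find_contamination` name are assumptions made for the example, and a real list would need to be curated against the Daggerheart rules text.

```python
import re

# Illustrative blocklist (an assumption for this sketch, not the project's list).
DND_TERMS = ("armor class", "saving throw", "spell slot", "challenge rating")


def find_contamination(text: str, terms=DND_TERMS) -> list[str]:
    """Return the blocklisted terms that appear in text as whole phrases."""
    found = []
    for term in terms:
        # Word boundaries avoid partial-word hits; "s?" also catches simple plurals.
        pattern = r"\b" + re.escape(term) + r"s?\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(term)
    return found


if __name__ == "__main__":
    sample = "It has Armor Class 15 and forces two saving throws."
    print(find_contamination(sample))
```

A check like this only catches known vocabulary. Incompatible mechanics phrased in neutral language would still need rule-level validation.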

Poor Mechanical Balance

Generated adversaries frequently contained wildly inflated values, unusable action economies, contradictory mechanics, and abilities that violated the intended design philosophy of the game. Some outputs were technically formatted correctly but completely unplayable.

Narrative Weakness

Even when the mechanics were functional, the content often lacked narrative usefulness. The generated material felt generic, instructional, emotionally flat, and disconnected from the campaign world.

The content technically existed, but it did not feel alive.

The Turning Point

The project changed significantly when I realized the problem was not simply model quality. The problem was architectural.

I was asking a single model to understand intent, generate content, validate rules, maintain lore consistency, preserve narrative quality, avoid contamination, and self-correct errors.

That was too much responsibility for a single generation layer.

Asking the model to “be smarter” stopped producing improvements. Making the system more structured started producing them.

From Prompting to Orchestration

The architecture evolved from a single-generation workflow into a multi-agent orchestration system with specialized responsibilities. Instead of one model trying to do everything, the system became a collaborative pipeline.

Figure: High-level system architecture. Daggerheart Forge evolved from a single-generation workflow into a layered orchestration system with specialized generation, validation, retrieval, and repair responsibilities.

GM Intent
  ↓
Structured Prompt + Contracts
  ↓
Specialized Generator Agents
  ├── Adversary Generator
  ├── Environment Generator
  ├── Equipment Generator
  └── NPC Generator
  ↓
Validation Layer
  ├── Rules Keeper
  ├── Lore Keeper
  ├── Content Repair Agent
  └── Contamination Sentinel
  ↓
RAG + Semantic Retrieval
  ↓
Eval Harness
  ↓
Playable Content

Figure: The Daggerheart Forge dashboard, with generated content counts and recent activity, shows the project moving from one-off generation into a reusable content library and operating environment for Game Master preparation.

Specialized Generator Agents

One of the most important architectural decisions was separating generators by content type.

Instead of a single content bot, the system introduced focused agents for adversary generation, environment generation, equipment generation, and NPC generation.

This added complexity, but it improved output consistency, iteration speed, validation accuracy, and narrative quality. Each generator could specialize around role expectations, balance constraints, formatting contracts, narrative patterns, and gameplay intent.
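
One way to picture this separation is a small registry that maps each content type to its own guidance and formatting contract. The sketch below is hypothetical: `GeneratorSpec`, `REGISTRY`, `build_request`, the field names, and the guidance strings are all invented for illustration and are not taken from the project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneratorSpec:
    """Per-content-type settings a specialized generator would carry."""
    content_type: str
    required_fields: tuple[str, ...]
    guidance: str


# Hypothetical specs; the real contracts would be richer than this.
REGISTRY = {
    spec.content_type: spec
    for spec in (
        GeneratorSpec("adversary", ("name", "role", "features"),
                      "Focus on how the threat pressures the table."),
        GeneratorSpec("environment", ("name", "features", "hooks"),
                      "Focus on what the place invites players to do."),
        GeneratorSpec("equipment", ("name", "effect"),
                      "Keep the effect short and easy to adjudicate."),
        GeneratorSpec("npc", ("name", "motive", "voice"),
                      "Give the GM one motive and one speech habit."),
    )
}


def build_request(content_type: str, gm_intent: str) -> dict:
    """Route GM intent to the matching spec and assemble a generation request."""
    spec = REGISTRY.get(content_type)
    if spec is None:
        raise ValueError(f"no generator registered for {content_type!r}")
    return {
        "content_type": spec.content_type,
        "required_fields": list(spec.required_fields),
        "prompt": f"{spec.guidance}\nGM intent: {gm_intent}",
    }


if __name__ == "__main__":
    print(build_request("adversary", "a patient ambusher in the reeds"))
```

Keeping each spec small makes it easier to tune one content type without disturbing the others.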

Figure: The threat intent step captures table behavior and pressure before generation, keeping the system grounded in Game Master intent rather than generic fantasy output.

Validation as a First-Class System

The project improved substantially once validation became explicit rather than implied.

The system introduced deterministic validation, contracts, canonical enums, range checking, contamination detection, and structured repair workflows.

Deterministic checks alone were not enough, because TTRPG systems are not purely binary. Many rules exist inside ranges, relationships, and gameplay expectations rather than strict yes/no correctness.

The solution became deterministic enforcement for structure and safety, paired with advisory review for narrative and experiential quality.
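
The split between deterministic enforcement and advisory review can be sketched as a validator that returns two lists: blockers that stop content from shipping, and advisories a human can weigh. Everything concrete here is an assumption for illustration: the role names, the numeric ranges, and the word-count heuristic are placeholders, not values from the Daggerheart rules or from the project.

```python
from dataclasses import dataclass, field

# Placeholder contract values (assumptions, not real game numbers).
ALLOWED_ROLES = {"bruiser", "minion", "solo"}
RANGES = {"difficulty": (8, 20), "hit_points": (1, 12)}


@dataclass
class Report:
    blockers: list[str] = field(default_factory=list)    # must be fixed
    advisories: list[str] = field(default_factory=list)  # human judgment call

    @property
    def ok(self) -> bool:
        return not self.blockers


def validate_adversary(data: dict) -> Report:
    report = Report()

    # Canonical enum check: deterministic, pass or fail.
    role = data.get("role")
    if role not in ALLOWED_ROLES:
        report.blockers.append(f"role: {role!r} is not a canonical value")

    # Range checks: deterministic, pass or fail.
    for name, (low, high) in RANGES.items():
        value = data.get(name)
        if not isinstance(value, int) or isinstance(value, bool):
            report.blockers.append(f"{name}: missing or not an integer")
        elif not low <= value <= high:
            report.blockers.append(f"{name}: {value} is outside {low}-{high}")

    # Advisory check: a crude stand-in for narrative review.
    description = str(data.get("description") or "")
    if len(description.split()) < 8:
        report.advisories.append("description: may be too thin to improvise from")

    return report


if __name__ == "__main__":
    print(validate_adversary({"role": "wizard", "difficulty": 45, "description": "Big."}))
```

Keeping advisories out of the pass/fail decision lets the deterministic layer stay strict without forcing narrative judgments into yes/no rules.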

Figure: The generation readiness screen runs blockers-only validation before a request is submitted, helping separate intent capture from the backend generation pipeline.

Keeper Agents and Editorial Oversight

The system eventually evolved into something closer to a board of editors than a simple generation workflow.

Rules Keeper

The Rules Keeper validated mechanical compliance, ranges, canonical structures, and gameplay consistency.

Lore Keeper

The Lore Keeper validated world consistency, narrative continuity, and semantic alignment.

Content Repair Agent

The Content Repair Agent handled schema repair, cleanup, formatting correction, and contract compliance adjustments.

This separation of responsibilities dramatically improved output quality.
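
The editorial-board idea can be sketched as a loop in which each keeper reports issues, a repair step attempts fixes, and the draft is rechecked a bounded number of times. This is a hedged sketch, not the project's implementation: the keeper and repair functions below are plain Python stubs standing in for agents, and `KNOWN_REGIONS`, "Emberwood", and "Ash Wolf" are invented for the example.

```python
from typing import Callable

Check = Callable[[dict], list[str]]
Repair = Callable[[dict, list[str]], dict]

KNOWN_REGIONS = {"emberwood"}  # hypothetical campaign lore


def rules_keeper(draft: dict) -> list[str]:
    ok = isinstance(draft.get("hit_points"), int)
    return [] if ok else ["hit_points must be an integer"]


def lore_keeper(draft: dict) -> list[str]:
    ok = str(draft.get("region", "")).lower() in KNOWN_REGIONS
    return [] if ok else ["unknown region"]


def repair_draft(draft: dict, issues: list[str]) -> dict:
    """Schema-level cleanup only: coerce numeric strings, leave lore alone."""
    fixed = dict(draft)
    hp = fixed.get("hit_points")
    if isinstance(hp, str) and hp.strip().isdigit():
        fixed["hit_points"] = int(hp)
    return fixed


KEEPERS: dict[str, Check] = {"rules_keeper": rules_keeper, "lore_keeper": lore_keeper}


def review(draft: dict, keepers: dict[str, Check], repair: Repair,
           max_attempts: int = 2) -> tuple[dict, list[str]]:
    """Return the final draft and any issues that survived repair."""
    for attempt in range(max_attempts + 1):
        issues = [f"{name}: {msg}"
                  for name, check in keepers.items()
                  for msg in check(draft)]
        if not issues:
            return draft, []
        if attempt < max_attempts:
            draft = repair(draft, issues)
    return draft, issues


if __name__ == "__main__":
    draft = {"name": "Ash Wolf", "hit_points": "6", "region": "Emberwood"}
    print(review(draft, KEEPERS, repair_draft))
```

Bounding the retries matters: issues the repair step cannot fix, such as unknown lore, are returned to the human instead of looping forever.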

Figure: The generated content review screen shows deterministic blockers alongside advisory validation notes, separating rules compliance from narrative quality.

RAG and Semantic Grounding

Retrieval-Augmented Generation became increasingly important as the project matured.

RAG was used for rules retrieval, semantic grounding, balance guidance, example retrieval, lore consistency, and gameplay pattern reinforcement.

Without grounding, the models drifted too easily into generic fantasy content, unrelated systems, and mechanically unstable designs.

The retrieval layer helped reinforce both system identity and gameplay philosophy.
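
The grounding step can be illustrated with a toy retriever. To stay dependency-free, this sketch swaps embedding-based semantic retrieval for bag-of-words cosine similarity, so it is a stand-in for the technique rather than the project's retrieval layer. The reference notes are invented placeholders, not rules text, and `retrieve` and `grounded_prompt` are hypothetical names.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> Counter:
    """Lowercased word counts; a crude substitute for an embedding."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query: str, passages: list[str], k: int = 2) -> list[str]:
    """Return the k passages most similar to the query."""
    q = tokenize(query)
    ranked = sorted(passages, key=lambda p: cosine(q, tokenize(p)), reverse=True)
    return ranked[:k]


def grounded_prompt(query: str, passages: list[str], k: int = 2) -> str:
    """Prepend retrieved notes so the model is steered by them, not by defaults."""
    context = "\n".join(f"- {p}" for p in retrieve(query, passages, k))
    return f"Use only these reference notes:\n{context}\n\nRequest: {query}"


if __name__ == "__main__":
    notes = [
        "Swamp environments slow movement and hide lurking threats.",
        "Minion adversaries fall quickly and appear in groups.",
        "Merchants in the capital trade in rare relics.",
    ]
    print(grounded_prompt("a group of minion adversaries", notes, k=1))
```

Word overlap misses paraphrases that an embedding model would catch, which is the practical reason to prefer semantic retrieval once the corpus grows.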

Low Cognitive Load Design

One of the project’s core design goals became reducing cognitive load for Game Masters.

A good adversary or environment was not simply correct. It needed to fit naturally into the world, be quickly scannable, support improvisation, provide narrative prompts, and avoid unnecessary complexity.

Game Masters already juggle pacing, narrative, player engagement, rules interpretation, improvisation, and emotional tone. Dense or confusing content increases operational friction during live play.

Good Game Master tools should reduce cognitive load, not replace creativity.

Figure: Generated content is presented as a table-ready preview with structured mechanics, narrative description, metadata, and export options.

The Tension Between Compliance and Fun

One of the most interesting lessons in the project involved the tension between strict rules compliance and experiential quality.

A small deviation from ideal balance was often acceptable if the encounter was memorable, the narrative supported it, and the scene created interesting decisions.

But inconsistency was destructive. The problem was never slight variance. The problem was unpredictability.

This reinforced an important design lesson: systems do not need perfect rigidity. They need understandable boundaries.

Local Models vs API Models

The project initially relied heavily on local models through Ollama using Mistral variants.

Local models were extremely valuable for experimentation, learning, rapid iteration, understanding system behavior, and architecture exploration. They helped expose validation problems, orchestration needs, prompt limitations, and structural weaknesses.

Eventually, API models significantly improved output quality and generation speed.

The move toward OpenAI API models did not eliminate the need for structure. It reinforced it. Better models still required contracts, retrieval, validation, orchestration, and explicit intent capture.
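
The point that structure survives a model swap can be sketched by putting every backend behind the same small interface and running the same checks on whatever comes back. The names here (`TextBackend`, `CannedBackend`, `generate_checked`, `REQUIRED_KEYS`) are hypothetical, and the canned backend stands in for a real local or hosted model call so the example runs without a network.

```python
import json
from typing import Protocol


class TextBackend(Protocol):
    """Anything that turns a prompt into text: local model, hosted API, or stub."""
    def generate(self, prompt: str) -> str: ...


class CannedBackend:
    """Stand-in backend that returns a fixed reply (for illustration only)."""
    def __init__(self, reply: str) -> None:
        self.reply = reply

    def generate(self, prompt: str) -> str:
        return self.reply


REQUIRED_KEYS = {"name", "role", "description"}  # hypothetical contract


def generate_checked(backend: TextBackend, prompt: str) -> dict:
    """Apply the same contract check regardless of which backend answered."""
    raw = backend.generate(prompt)
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"backend returned non-JSON output: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("backend output must be a JSON object")
    missing = REQUIRED_KEYS - set(data)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return data


if __name__ == "__main__":
    strong = CannedBackend('{"name": "Ash Wolf", "role": "skirmisher", "description": "Lean and patient."}')
    print(generate_checked(strong, "Generate an adversary as JSON."))
```

Because the contract lives outside the backend, swapping models changes output quality but not what counts as acceptable output.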

The Moment It Became Playable

The project finally started feeling genuinely usable once several systems converged: contracts, eval harnesses, RAG grounding, validation layers, specialized generators, and advisory review.
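
An eval harness in this spirit can be very small: a fixed set of intents, each with checks, run against whatever generator is under test. The sketch below is an assumption about shape only. `EvalCase`, `run_evals`, the stub generator, and the checks are invented for illustration and say nothing about the project's actual harness or metrics.

```python
from dataclasses import dataclass
from typing import Callable

CheckFn = Callable[[dict], bool]


@dataclass
class EvalCase:
    name: str
    intent: str
    checks: list[CheckFn]


def run_evals(generate: Callable[[str], dict], cases: list[EvalCase]) -> dict[str, float]:
    """Return the fraction of checks passed for each case."""
    results = {}
    for case in cases:
        output = generate(case.intent)
        passed = sum(1 for check in case.checks if check(output))
        results[case.name] = passed / len(case.checks) if case.checks else 0.0
    return results


def stub_generator(intent: str) -> dict:
    """Placeholder for the real pipeline; echoes the intent into a fixed shape."""
    return {"name": "Ash Wolf", "description": f"Built for: {intent}", "hit_points": 6}


CASES = [
    EvalCase("has_core_fields", "a patient ambusher", [
        lambda out: "name" in out,
        lambda out: isinstance(out.get("hit_points"), int),
    ]),
    EvalCase("reflects_intent", "a patient ambusher", [
        lambda out: "ambusher" in out.get("description", ""),
        lambda out: "features" in out,  # fails for the stub on purpose
    ]),
]

if __name__ == "__main__":
    print(run_evals(stub_generator, CASES))
```

Running the same cases after every prompt, contract, or model change is what turns "it feels better" into a comparison that can be repeated.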

At that point, the generated content stopped feeling random and began feeling intentionally shaped.

Outputs became narratively coherent, mechanically usable, easier to improvise with, more aligned with campaign tone, and significantly lower effort to use in live sessions.

That was the moment the project shifted from interesting experiment to practical tool.

What Still Needs Improvement

Cross-System Relationships

Adversaries and environments still need deeper interconnected behavior. The long-term goal is stronger systemic relationships between world state, factions, encounters, environments, and narrative consequences.

Campaign Knowledge Management

The project is increasingly growing toward a Game Master operating system: a campaign notebook, reusable content catalog, and structured worldbuilding system in one place.

This expands the challenge from generation into knowledge architecture, retrieval design, and continuity management.

What I Would Do Differently

If restarting today, I would focus on one content type first, establish contracts earlier, build validation before expansion, avoid overloading a single generation agent, and implement retrieval and eval systems sooner.

Trying to support adversaries, equipment, and environments all at once created unnecessary complexity early in the project.

Broader Lessons

Daggerheart Forge ultimately became less about tabletop gaming and more about designing AI systems responsibly.

The project reinforced several ideas: AI systems need architecture, prompting alone is insufficient, intent capture matters, validation matters, role separation matters, retrieval matters, and human-centered constraints matter.

Most importantly, creativity does not disappear when structure is introduced. In many cases, thoughtful structure creates more room for creativity.

As both a father and a Game Master, the project helped me create more meaningful experiences with my family while also teaching me more about how AI systems succeed or fail in real-world environments.

The goal was never automation for its own sake. It was creating more room for storytelling.

Related Writing

Related essays and follow-up notes will be added as the Daggerheart Forge work continues.

© 2026 OneStrayThought LLC. All rights reserved.