By The Elder Scripts — Oct 27, 2025

Grounding LLMs in the Logic of Planning

What makes a plan good?

Have you ever thought about this? We humans never stop planning: from stacking toys as children to shipping products at work. A good plan feels intuitive when you see it: it's structured, realistic and flexible enough to adapt when things change. In other words, a good plan is both science and art.

For decades, AI researchers have tried to capture this balance. Symbolic planners can reason precisely about steps and dependencies, but they’re rigid and expensive to design. Large language models, on the other hand, can break down messy problems into steps that sound reasonable, but their plans often fall apart under scrutiny.

The sweet spot lies somewhere in between: human-defined structure with AI-driven flexibility. In this post, I’ll show how I combined elements of Hierarchical Task Network (HTN) planning with LLMs in my project called Rock-n-Roll (https://rocknroll.dev/) to create a more consistent and flexible planning system. I hope that this case study adds a data point for existing research and inspires further work in this promising area.

A quick primer on HTN planning

In the early days of AI, researchers thought of intelligence as the ability to plan, or to decide what to do next given a goal and a set of possible actions. One of the most enduring approaches to this problem is called Hierarchical Task Network (HTN) planning.

The core idea is simple: large, abstract goals are decomposed into smaller, concrete tasks. For example, the goal “launch a product” might be broken down into “design the MVP,” “build core features,” and “publish the landing page.” Each of those can in turn be decomposed into even smaller actions until you reach something executable, like “generate pricing copy” or “set up Stripe.”

This mirrors how humans plan: we reason in layers rather than details. An HTN planner uses methods (rules for decomposing high-level tasks) and operators (atomic actions that can be executed). The result is a plan network, a structured graph showing dependencies and order of execution.
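
To make the distinction between methods and operators concrete, here is a minimal sketch of HTN-style decomposition. It is a toy, not how SHOP or Pyhop work internally: the `METHODS` and `OPERATORS` tables and the `decompose` function are hypothetical names, the task strings are borrowed from the example above, and the sketch ignores world state and alternative methods, both of which a real HTN planner tracks.

```python
from typing import Dict, List, Set

# Hypothetical toy domain. Methods map a compound task to ordered subtasks.
METHODS: Dict[str, List[str]] = {
    "launch a product": [
        "design the MVP",
        "build core features",
        "publish the landing page",
    ],
    "build core features": ["set up Stripe"],
    "publish the landing page": ["generate pricing copy"],
}

# Operators are atomic tasks that can be executed directly.
# (Treating "design the MVP" as atomic is a simplification for this toy.)
OPERATORS: Set[str] = {"design the MVP", "set up Stripe", "generate pricing copy"}


def decompose(task: str) -> List[str]:
    """Recursively expand a task into an ordered list of operators."""
    if task in OPERATORS:
        return [task]
    if task not in METHODS:
        raise ValueError(f"no method or operator for task: {task!r}")
    plan: List[str] = []
    for subtask in METHODS[task]:
        plan.extend(decompose(subtask))
    return plan


print(decompose("launch a product"))
# ['design the MVP', 'set up Stripe', 'generate pricing copy']
```

The failure mode is the important part: a task with no method and no operator raises an error instead of being silently improvised, which is the discipline an LLM on its own does not provide.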

Classical HTN planners, such as SHOP or Pyhop, are highly reliable: if a plan exists, they will find one. But they require a fully defined model of the world: every task, condition, and dependency must be hand-crafted. This rigidity makes them powerful in controlled environments (like robotics or logistics) but impractical for open-ended creative work.

Recent research explores how LLMs can fill this gap. Surveys such as “LLMs as Planning Modelers” by Tantakoun et al. (2025) and “A Survey of Task Planning with Large Language Models” by Zhai (2025) show a growing consensus: LLMs can help generate or refine planning models by translating natural-language descriptions into structured task representations. In other words, they are remarkably good at the decomposition part (breaking vague goals into plausible steps) but less consistent at maintaining logical structure across those steps.

This is precisely where the hybrid HTN + LLM approach becomes attractive. Instead of asking an LLM to produce an entire plan in one pass, we let it handle creative breakdown while the HTN-inspired framework ensures structural consistency. The two complement each other: one brings imagination, the other discipline.

How Rock-n-Roll Uses HTN Planning

Rock-n-Roll is a workspace for product builders. Its goal is to turn loose ideas into structured, actionable plans. These plans include user journeys, feature flows, and implementation steps. Where many generative tools stop at ideation, Rock-n-Roll continues into structured reasoning: it helps users understand how their product fits together before they begin building.

To do that, the system needs to think like a planner. When a user writes, “I want to build an app for independent designers,” the machine must translate that intent into a sequence of connected steps. Each step depends on what came before and enables what follows.

The first version of Rock-n-Roll relied on LLMs to produce the complete plan. The first prompt asked the model to break the target product into candidate user flows. The second prompt was called for each flow and instructed the model to produce a graph of user actions and system states that would implement the flow.

The results were precisely what you'd expect. The LLMs produced plausible graphs, but on closer inspection a user could easily notice that something was wrong: the models would invent nonexistent preconditions or skip necessary steps.

After some research, I stumbled on the idea of using a symbolic planner. This kind of reasoning, from abstract goals down to executable tasks, is precisely what HTN planning was designed to do.

From UX Patterns to Flow Archetypes

Classical HTN planners work in domains such as robotics, where tasks are physical actions: pick up block, move arm, stack object. In product design, however, the atomic actions are conceptual: user/service interaction patterns rather than physical motions.

To make automated planning possible in this domain, we distilled these recurring UX patterns into a library of flow archetypes. Each archetype encapsulates a small, reusable interaction principle such as collect input, validate progress, display feedback, or deliver result. Archetypes have preconditions (what must already be true) and effects (what becomes true afterwards).
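
One way to picture an archetype is as a small record with a set of preconditions and a set of effects. The sketch below is an assumption about shape, not Rock-n-Roll's actual data model: the `Archetype` class, its methods, and the fact names such as `input_collected` are invented for illustration, and effects only add facts (nothing is ever deleted).

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Archetype:
    """A reusable interaction pattern with preconditions and effects."""
    name: str
    preconditions: FrozenSet[str] = frozenset()  # facts that must already hold
    effects: FrozenSet[str] = frozenset()        # facts that hold afterwards

    def applicable(self, state: FrozenSet[str]) -> bool:
        """True when every precondition is present in the current state."""
        return self.preconditions <= state

    def apply(self, state: FrozenSet[str]) -> FrozenSet[str]:
        """Return the state after this archetype runs, or fail loudly."""
        if not self.applicable(state):
            missing = sorted(self.preconditions - state)
            raise ValueError(f"{self.name}: missing preconditions {missing}")
        return state | self.effects


# Hypothetical archetypes; the fact names are invented for this example.
collect_input = Archetype(
    "collect input", effects=frozenset({"input_collected"})
)
deliver_result = Archetype(
    "deliver result",
    preconditions=frozenset({"input_collected"}),
    effects=frozenset({"result_delivered"}),
)

state = collect_input.apply(frozenset())
state = deliver_result.apply(state)
print(sorted(state))  # ['input_collected', 'result_delivered']
```

Because effects and preconditions are plain sets of facts, "can B follow A?" reduces to a subset check, which is what makes the archetypes composable.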

[Figure: First Login archetype, pictured with preconditions and effects]

The goal was to make them both fundamental and composable: small enough to mix across product types, yet expressive enough to describe entire user journeys when linked. They form the vocabulary of product logic, a way for the AI to reason about software behaviour using the same abstractions designers and engineers think in.

A Hybrid Planner Built on These Foundations

Using this vocabulary, Rock-n-Roll implements a hybrid planner that combines symbolic structure with LLM-based inference. The design follows the “guided integration” framework of Puerta-Merino et al. (2025) and the interleaved symbolic-neural approach of Muñoz-Avila et al. (2025).

In this system, the LLM acts as a semantic modeler: it interprets the user’s domain brief, identifies relevant product patterns, and proposes a tentative structure. The symbolic layer then grounds this proposal in formal consistency, checking dependencies and ensuring that outputs from one task feed the inputs of the next.
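
The grounding step can be sketched as a replay of the proposed sequence against an archetype library, reporting every step whose preconditions are not yet satisfied. Everything here is hypothetical: the `LIBRARY` contents, the fact names, and the `check_proposal` function are illustrative assumptions, and the list of steps stands in for whatever the LLM proposes.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical archetype library: name -> (preconditions, effects).
# The conditions attached to each archetype are invented for illustration.
LIBRARY: Dict[str, Tuple[Set[str], Set[str]]] = {
    "sign up": (set(), {"account_exists"}),
    "first login": ({"account_exists"}, {"session_active"}),
    "create portfolio": ({"session_active"}, {"portfolio_created"}),
}


def check_proposal(
    steps: List[str], initial: Iterable[str] = ()
) -> List[Tuple[str, Set[str]]]:
    """Return (step, missing facts) for each step with unmet preconditions."""
    state: Set[str] = set(initial)
    problems: List[Tuple[str, Set[str]]] = []
    for step in steps:
        if step not in LIBRARY:
            raise KeyError(f"proposal uses an unknown archetype: {step!r}")
        preconditions, effects = LIBRARY[step]
        missing = preconditions - state
        if missing:
            problems.append((step, missing))
        state |= effects  # keep going so later gaps are reported too
    return problems


# A proposal that skips a necessary step is caught by the symbolic check.
print(check_proposal(["sign up", "create portfolio"]))
# [('create portfolio', {'session_active'})]
print(check_proposal(["sign up", "first login", "create portfolio"]))
# []
```

An unknown archetype name is treated as an error rather than accepted, so an invented step cannot slip through as if it were part of the vocabulary.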

How the Planning Loop Works

1. Inducing user flows. The process begins with the user’s idea in natural language. This project brief describes the target product, audience and their behaviors. The LLM analyses the description to infer the domain, key goals, and expected user actions, sketching preliminary flows.
2. Matching to archetypes. Each flow segment is mapped to the relevant archetypes. Their preconditions and effects define how steps connect into a coherent plan graph.
3. Stitching the archetypes. A typical user flow consists of several flow archetypes. We stitch them together using their effects and preconditions as interface points.
4. Inserting shims. When a dependency is missing, the planner inserts a shim, a small connector flow (for example, collect user data or confirm details) to bridge the gap.
5. Validation. The system verifies that the plan has no loops or missing links and that every user goal is covered by at least one path.
6. Polishing. Finally, the LLM re-enters to enrich the structure with domain-specific language, better names, and smoother transitions.
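
The symbolic part of this loop (stitching, shim insertion, and validation) can be sketched in a few dozen lines. This is a simplified illustration under stated assumptions, not the production planner: it treats a flow as a linear sequence rather than a graph, the `ARCHETYPES` and `SHIMS` tables and all fact names are invented, and the loop check uses `graphlib` from the Python standard library (3.9+) on a separately supplied edge map.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Set, Tuple

Spec = Tuple[Set[str], Set[str]]  # (preconditions, effects)

# Hypothetical libraries; names and facts are invented for illustration.
ARCHETYPES: Dict[str, Spec] = {
    "browse catalog": (set(), {"item_selected"}),
    "checkout": ({"item_selected", "user_details_known"}, {"order_placed"}),
}
SHIMS: Dict[str, Spec] = {
    "collect user data": (set(), {"user_details_known"}),
}


def stitch(flow: List[str]) -> List[str]:
    """Keep the flow's order, inserting a shim before any step with a gap."""
    state: Set[str] = set()
    plan: List[str] = []
    for step in flow:
        preconditions, effects = ARCHETYPES[step]
        for fact in sorted(preconditions - state):
            if fact in state:  # an earlier shim already provided it
                continue
            shim = next(
                (name for name, (s_pre, s_eff) in SHIMS.items()
                 if fact in s_eff and s_pre <= state),
                None,
            )
            if shim is None:
                raise ValueError(f"no shim provides {fact!r} for {step!r}")
            plan.append(shim)
            state |= SHIMS[shim][1]
        plan.append(step)
        state |= effects
    return plan


def validate(plan: List[str], goals: Set[str]) -> bool:
    """Replay the plan: no missing links, and every goal is reached."""
    catalog = {**ARCHETYPES, **SHIMS}
    state: Set[str] = set()
    for step in plan:
        preconditions, effects = catalog[step]
        if not preconditions <= state:
            return False
        state |= effects
    return goals <= state


def has_loop(edges: Dict[str, Set[str]]) -> bool:
    """Detect a cycle in a plan graph given as node -> predecessor nodes."""
    try:
        tuple(TopologicalSorter(edges).static_order())
    except CycleError:
        return True
    return False


plan = stitch(["browse catalog", "checkout"])
print(plan)  # ['browse catalog', 'collect user data', 'checkout']
print(validate(plan, {"order_placed"}))  # True
```

Note that validation replays the stitched plan from scratch instead of trusting the stitcher, so the two steps fail independently.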

This alternating cycle of symbolic validation and generative refinement allows Rock-n-Roll to produce plans that are both structured and expressive. The framework behaves less like a chatbot producing paragraphs and more like a design collaborator assembling a blueprint.

Evaluation and feedback

Early results look promising. Human feedback favors the new hybrid approach. Early testers note that the produced user flows resemble the product plans common in the industry. The flows are internally consistent and "make sense" for both engineers and designers. One point of improvement testers proposed was to make the flows less elaborate. The planner tends to produce comprehensive graphs, but in practice humans don't always use that level of detail.

Of course, we also implemented automated evaluation using the LLM-as-a-judge approach. Our evaluation setup generates two sets of user flows for each project: one using the hybrid planner, and another using the old LLM-based approach. Then we ask judge LLMs to score and compare the two sets.

We used GPT-5, Claude Sonnet 4.5, and Gemini 2.5 Flash as judges. In around 70% of cases, the judges prefer the flows created by the new planner. On average, they score the new flows ~2.5 points higher on a 1-10 scale.
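
For readers who want to reproduce this kind of comparison, the aggregation itself is simple. The sketch below shows one way to compute a preference rate and a mean score gap from per-judge verdicts. The `Verdict` record and `summarize` function are hypothetical, the sample rows are made-up placeholders rather than our evaluation data, and preference is assumed here to follow from the higher score, which a real setup might ask the judge for separately.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class Verdict:
    project: str
    judge: str
    hybrid_score: float    # 1-10 score for the hybrid planner's flows
    baseline_score: float  # 1-10 score for the LLM-only flows


def summarize(verdicts: List[Verdict]) -> Dict[str, float]:
    """Aggregate verdicts into a preference rate and a mean score gap."""
    if not verdicts:
        raise ValueError("no verdicts to summarize")
    # Assumption: a judge "prefers" whichever set it scored strictly higher.
    wins = sum(v.hybrid_score > v.baseline_score for v in verdicts)
    return {
        "hybrid_preferred_rate": wins / len(verdicts),
        "mean_score_gap": mean(
            v.hybrid_score - v.baseline_score for v in verdicts
        ),
    }


# Made-up placeholder verdicts, NOT real evaluation results.
sample = [
    Verdict("demo-a", "judge-1", 8, 5),
    Verdict("demo-a", "judge-2", 7, 6),
    Verdict("demo-b", "judge-1", 6, 7),
    Verdict("demo-b", "judge-2", 9, 6),
]

print(summarize(sample))
# {'hybrid_preferred_rate': 0.75, 'mean_score_gap': 1.5}
```

Keeping the judge name on every record makes it easy to break the same summary down per judge and check whether one model is driving the result.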

Our rollout plan includes extending the evaluation tests and recruiting more human experts to confirm these initial findings.

Reflections and Future Work

Experiments like this show both the promise and the limits of hybrid planning. LLMs excel at recognising intent and decomposing problems but remain unreliable at enforcing logical consistency. Symbolic systems are the inverse: precise and consistent, yet brittle when faced with ambiguity or creative work.

The productive space lies between these extremes. A planner must use structure to guide creativity and creativity to soften structure’s rigidity. In Rock-n-Roll, this balance emerges naturally: human-designed archetypes define the boundaries, and the LLM fills the conceptual gaps.

There is still room to evolve. Current validation depends on predefined archetypes, preconditions and effects. Future versions could learn these automatically by analysing successful and abandoned projects, refining their sense of what constitutes a good plan in practice, much as Petruzzellis et al. (2025) propose in their adaptive knowledge-graph feedback loops.

Another direction involves closing the loop entirely: allowing the system to generate a plan, simulate or score it, and iterate. Similar cycles are already studied in task-planning research, where planners evaluate intermediate results before committing to execution.

The aim, however, is not to imitate or replace human reasoning but to augment it. Rock-n-Roll’s planner does not dictate what to build; it helps users think more clearly about how to build it. By anchoring language models in structured methods, we make their reasoning more grounded, verifiable, and useful for real design work.