As LLMs begin to automatically design multi-step customer interactions (emails, chats, notifications), a fundamental question arises: 'How do we know whether this AI-generated journey is actually effective?' Traditional evaluation metrics (accuracy, similarity, even LLM-as-a-Judge) often assess style or tone but fail to capture the structural logic of the journey and its alignment with business goals. This article introduces the CDP (Continuity, Deepening, Progression) framework, a set of deterministic metrics that measure journey quality against a predefined taxonomy. You can explore the foundational concepts in the source material.
The Three Pillars of CDP Metrics
CDP evaluates journey quality along three complementary dimensions (minimal code sketches for each follow the list):
- Continuity (C)
  - What it measures: Whether each message fits the context established by prior interactions. It checks for smooth thematic transitions without abrupt jumps.
  - How it's calculated: Assigns weights to transition patterns within the taxonomy tree (e.g., same topic, related topic, forward stage move, backward move) and averages the step-level scores.
- Deepening (D)
  - What it measures: Whether the journey moves from general content toward more specific, detailed, or personalized interactions, thereby deepening the relationship.
  - How it's calculated: A weighted combination of two components:
    - Journey-based Deepening: Measures the increase in depth level within the taxonomy tree from one step to the next.
    - Taxonomy-aware Deepening: Calculates the ratio of possible deeper content items (under visited thematic heads) that are actually explored during the journey.
- Progression (P)
  - What it measures: The directional movement and pace through defined customer journey stages (e.g., Awareness -> Purchase -> Ownership), detecting unnecessary backtracking or stagnation.
  - How it's calculated: Scores are summed based on stage transitions (anchor ID difference) and the relative importance of the current stage, then normalized to a [-1, 1] range using a tanh function.
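To make the Continuity calculation concrete, here is a minimal Python sketch: classify each step-to-step transition, weight it, and average. The transition patterns and weight values are illustrative assumptions, not the framework's actual configuration.

```python
# Hypothetical transition weights; a real CDP setup derives these
# from the taxonomy tree, and the values here are assumptions.
TRANSITION_WEIGHTS = {
    "same_topic": 1.0,
    "related_topic": 0.7,
    "forward_stage": 0.5,
    "backward_stage": -0.5,
}

def continuity_score(transitions: list[str]) -> float:
    """Average the weight of each step-to-step transition pattern."""
    if not transitions:
        return 0.0
    return sum(TRANSITION_WEIGHTS[t] for t in transitions) / len(transitions)

# Example: a journey that mostly stays on the same or related topics.
print(continuity_score(["same_topic", "related_topic", "forward_stage"]))  # ~0.73
```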
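Deepening combines its two components. The sketch below assumes an equal mixing weight (alpha = 0.5) and a toy taxonomy; both are purely illustrative.

```python
def journey_deepening(depths: list[int]) -> float:
    """Journey-based component: fraction of step transitions that
    move to a deeper taxonomy level."""
    if len(depths) < 2:
        return 0.0
    deeper = sum(1 for a, b in zip(depths, depths[1:]) if b > a)
    return deeper / (len(depths) - 1)

def taxonomy_deepening(visited_heads: set[str], visited_items: set[str],
                       deeper_items: dict[str, set[str]]) -> float:
    """Taxonomy-aware component: ratio of deeper content items under
    visited thematic heads that the journey actually explores."""
    candidates: set[str] = set()
    for head in visited_heads:
        candidates |= deeper_items.get(head, set())
    if not candidates:
        return 0.0
    return len(candidates & visited_items) / len(candidates)

def deepening_score(depths, visited_heads, visited_items, deeper_items,
                    alpha=0.5):
    # alpha is an assumed mixing weight, not a value from the source.
    return (alpha * journey_deepening(depths)
            + (1 - alpha) * taxonomy_deepening(visited_heads, visited_items,
                                               deeper_items))

# Toy taxonomy: two thematic heads, each with deeper content items.
deeper = {"financing": {"pre_approval", "trade_in"},
          "test_drive": {"slot_booking"}}
print(deepening_score([1, 1, 2, 2], {"financing", "test_drive"},
                      {"pre_approval", "slot_booking"}, deeper))  # 0.5
```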
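Progression sums signed stage transitions and squashes the total with tanh. The stage-importance weights below are hypothetical placeholders for whatever the business assigns to each stage.

```python
import math

# Hypothetical stage importance; anchor IDs index the journey stages
# (e.g., 0 = Awareness, 1 = Purchase, 2 = Ownership).
STAGE_IMPORTANCE = {0: 0.5, 1: 0.8, 2: 1.0}

def progression_score(anchor_ids: list[int]) -> float:
    """Sum anchor ID differences weighted by the importance of the
    current stage, then normalize to [-1, 1] with tanh."""
    total = 0.0
    for prev, curr in zip(anchor_ids, anchor_ids[1:]):
        total += (curr - prev) * STAGE_IMPORTANCE.get(curr, 1.0)
    return math.tanh(total)

# Mostly forward movement with one backward step near the end.
print(progression_score([0, 0, 1, 2, 2, 1]))  # ~0.76
```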

Applied Example: An Automotive Purchase Journey
Let's walk through a simplified example to see CDP evaluation in action.
Input Journey (LLM-generated message sequence):
1. Take a virtual tour to discover key features and trims.
2. We found a time slot for a test drive that fits your schedule.
3. Upload your income verification and ID to finalize the pre-approval decision.
4. Estimate costs for upcoming maintenance items.
5. Track retention offers as your lease end nears.
6. Add plates and registration info before handover.
Taxonomy Mapping Result: Each message is mapped to a specific node in the taxonomy tree (anchor, thematic head, depth level) based on embedding similarity. CDP scores are computed by analyzing step-to-step transitions.
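Here is a sketch of how that mapping step might look: pick the taxonomy node whose embedding is most similar to the message embedding. The node labels and random vectors stand in for a real taxonomy and a real sentence-embedding model.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy nodes with fabricated embeddings; a real pipeline would embed
# actual taxonomy labels with a sentence-embedding model.
nodes = [
    {"anchor": 0, "head": "virtual_tour", "depth": 1, "vec": rng.normal(size=8)},
    {"anchor": 1, "head": "financing", "depth": 3, "vec": rng.normal(size=8)},
    {"anchor": 2, "head": "maintenance", "depth": 2, "vec": rng.normal(size=8)},
]

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def map_message(message_vec: np.ndarray, nodes: list[dict]) -> dict:
    """Assign a message to the taxonomy node with the most similar embedding."""
    return max(nodes, key=lambda n: cosine(message_vec, n["vec"]))

# A message embedding that happens to lie close to the 'financing' node.
msg_vec = nodes[1]["vec"] + 0.1 * rng.normal(size=8)
hit = map_message(msg_vec, nodes)
print(hit["anchor"], hit["head"], hit["depth"])  # -> 1 financing 3
```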
Interpreting the CDP Signals for This Journey:
- Continuity: Mostly smooth, though the score may dip where stages become intermixed.
- Deepening: Captures moments of diving deeper into a topic, such as the transition from 'scheduling a test drive' to 'uploading documents for pre-approval'.
- Progression: Shows overall forward movement toward purchase, but also reveals structural regression, such as the handover task ('Add plates and registration info before handover') appearing after the lease-end retention step.
These computed CDP scores can be used directly to compare alternative journeys generated by different prompts or models, or to provide automated feedback for continuously improving LLM-based journey generation.
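As one hypothetical way to operationalize that comparison, the sketch below ranks candidate journeys by a weighted sum of their C, D, and P scores. The `cdp_scores` callable is a stub for a full pipeline that maps messages to taxonomy nodes and applies the computations sketched earlier; the weights are assumptions.

```python
def rank_journeys(journeys, cdp_scores, weights=(1.0, 1.0, 1.0)):
    """Order candidate journeys by weighted overall CDP score, best first."""
    def overall(journey):
        c, d, p = cdp_scores(journey)
        return weights[0] * c + weights[1] * d + weights[2] * p
    return sorted(journeys, key=overall, reverse=True)

# Stubbed scores for two candidates generated by different prompts.
stub = {"prompt_A": (0.8, 0.4, 0.6), "prompt_B": (0.7, 0.7, 0.9)}
print(rank_journeys(list(stub), stub.__getitem__))  # -> ['prompt_B', 'prompt_A']
```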

Conclusion: Why Structure Matters
LLMs are already capable of generating fluent and persuasive text. The greater challenge now is ensuring those text sequences form coherent narratives that align with business logic and user experience. The CDP framework doesn't replace stylistic evaluation; it complements it by providing a new, primary signal: structure.
This approach isn't limited to automotive commerce. Any system that generates ordered, goal-oriented content—be it educational course design, healthcare consultation paths, or in-game quest lines—requires a strong structural foundation. CDP offers a way to make that structure explicit, measurable, and actionable. Next time you evaluate an AI-generated sequence, ask not just 'Is the language natural?' but 'Is the structure sound?'