Watch any AI video model launch and you'll see the same trick. A hero shot — usually a single character, beautifully lit, doing something cinematic. The demo works. The marketing works. The model gets called state of the art.

Then you try to use that character in a second shot. And a third. And by the fifth, the face has subtly changed — wider jaw, different eyes, slightly different age. By shot ten, it's basically a different person. The model doesn't know it's the same character. It never did.

This is the wall every AI filmmaker hits within their first week. And it's the wall almost no model fixes by itself.

Why models drift between shots

The honest answer: diffusion models don't have a concept of identity. They have a concept of features. When you describe "a 32-year-old woman with dark wavy hair, freckles, almond eyes, soft cheekbones, wearing a beige trench coat," the model samples something that matches those features. But "matches those features" leaves enormous room for variation.

On the next generation, the model samples again, independently. Same features, different sample. Same description, different person.

This isn't a bug. It's literally how the architecture works. Each output is a fresh sampling from a probability distribution that was conditioned on your prompt. The prompt doesn't carry identity. It carries vibes.
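
A toy sketch of that mechanism, in plain numpy rather than any real diffusion stack. The encoder and the noise are stand-ins, but the structure is the point: the prompt fixes a region, and each call draws a fresh point from it.

```python
import numpy as np

rng = np.random.default_rng()

def encode_prompt(prompt: str) -> np.ndarray:
    # Stand-in for a text encoder: deterministically maps the prompt
    # to a point in a toy 8-dimensional "face-space".
    seed = abs(hash(prompt)) % 2**32
    return np.random.default_rng(seed).normal(size=8)

def generate(prompt: str) -> np.ndarray:
    # Stand-in for a diffusion model: conditioning plus *fresh* noise.
    # Same prompt, independent sample, every single call.
    return encode_prompt(prompt) + rng.normal(scale=0.5, size=8)

prompt = "32-year-old woman, dark wavy hair, freckles, beige trench coat"
shot_1 = generate(prompt)
shot_2 = generate(prompt)
print(np.linalg.norm(shot_1 - shot_2))  # nonzero: two different faces
```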

The myth of "just be more specific in your prompt"

The first instinct is to write a longer prompt. Add freckle counts. Add eye color codes. Describe the exact bridge of the nose. People do this for hours.

It doesn't work. Not because the prompt isn't specific enough, but because the underlying distribution has a finite resolution. There are only so many faces the model can plausibly generate that match "32-year-old woman with dark wavy hair." Even with a 500-word prompt, you're not pointing at one face. You're pointing at a region of face-space.

What reference images actually do

Reference images are the first real fix. Modern AI video tools — Cleom included — let you upload a reference image of your character. The model uses this as a visual conditioning signal, not just text.

This works much better. The model now has a concrete face to anchor to, not a textual description. Variation drops significantly. But it doesn't disappear.
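
One rough way to picture the difference, continuing the toy model above. The weights and noise scales here are made up; the only claim is the shape of the effect: anchoring to a concrete reference shrinks the spread of samples without zeroing it.

```python
import numpy as np

rng = np.random.default_rng(0)
text_mean = rng.normal(size=8)   # what the text prompt pins down
reference = rng.normal(size=8)   # a concrete face from an uploaded image

def sample(anchor: np.ndarray | None = None, weight: float = 0.8) -> np.ndarray:
    if anchor is None:
        mean, spread = text_mean, 0.5          # text only: a wide region
    else:
        # Visual conditioning: pull toward the reference and tighten.
        mean = (1 - weight) * text_mean + weight * anchor
        spread = 0.1                           # smaller, but not zero
    return mean + rng.normal(scale=spread, size=8)

text_only = np.std([sample() for _ in range(100)], axis=0).mean()
with_ref = np.std([sample(reference) for _ in range(100)], axis=0).mean()
print(f"spread, text only: {text_only:.2f}")
print(f"spread, with reference: {with_ref:.2f}")
```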

Here's why: references condition the first frame very well. But video involves motion, head turns, lighting changes, expression shifts. As the character moves through the clip, the model needs to imagine views of them that weren't in the reference. And those imagined views drift.

Add another shot, with a different angle, and the model imagines the character from a new perspective — sometimes inheriting features from the reference, sometimes filling in plausible-but-different details. Drift compounds across shots.

The orchestration problem nobody solves with one model

Real character consistency in AI video requires solving three layered problems:

  • Visual identity locking — keeping the face structurally identical across angles, expressions and lighting
  • Cross-shot consistency — tracking the same identity through scene breaks and time jumps
  • Style anchoring — keeping the same costume, color palette and atmosphere when the prompt changes

No single foundation model solves all three. They might solve two. They might solve one and a half. But getting all three reliably is an infrastructure problem, not a model problem.
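
To make the three layers concrete, here is a minimal sketch of them as separate checks. The embeddings, thresholds, and names are illustrative assumptions, not any particular product's API; in practice the face embedding would come from a dedicated face-recognition model.

```python
from dataclasses import dataclass
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

@dataclass
class Shot:
    face: np.ndarray     # face embedding, e.g. from a recognition model
    style: np.ndarray    # costume / palette / atmosphere features
    scene_id: str

# 1. Visual identity locking: same face despite angle, expression, light.
def identity_locked(a: Shot, b: Shot, threshold: float = 0.85) -> bool:
    return cosine(a.face, b.face) >= threshold

# 2. Cross-shot consistency: the identity survives scene breaks and time
#    jumps, i.e. every consecutive pair of shots still passes the check.
def cross_shot_consistent(shots: list[Shot]) -> bool:
    return all(identity_locked(a, b) for a, b in zip(shots, shots[1:]))

# 3. Style anchoring: within a scene, costume and palette stay stable
#    even when the prompt text changes.
def style_anchored(a: Shot, b: Shot, threshold: float = 0.8) -> bool:
    return a.scene_id == b.scene_id and cosine(a.style, b.style) >= threshold
```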

This is why every serious AI video studio ends up building (or wishing it had) a pipeline around its model. References get processed. Identities get tagged. Style sheets get extracted. The system tracks "this is Character A in scene 3, lighting setup C" and conditions the model accordingly.

The model generates a face. The pipeline remembers it.

How production pipelines fix drift

The pattern most AI video production systems converge on looks like this, with a compact code sketch after the list:

  1. Define identity nodes. Upload one or more references for each character. The pipeline extracts a stable identity embedding.
  2. Tag scenes with identity references. When you write a prompt for shot 4, the pipeline auto-injects the identity embedding of Character A so the model is conditioned on the same face.
  3. Lock style and lighting per scene block. A "scene" is a coherent unit — same set, same time of day, same costumes. The pipeline carries style across shots within a scene automatically.
  4. Validate consistency between generations. After generation, the pipeline compares face embeddings between shots and flags drift. You re-roll the bad ones.
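
Here is that sketch, wiring steps 1, 2, and 4 together (step 3's style lock is elided for brevity). Everything in it is a stand-in: extract_identity() for a real face-embedding model, generate() for the video model itself, and the 0.15 drift threshold is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(1)

def extract_identity(shot: np.ndarray) -> np.ndarray:
    # Stand-in for a real face-embedding model.
    return shot / np.linalg.norm(shot)

def generate(identity: np.ndarray) -> np.ndarray:
    # Stand-in for the video model: conditioned on identity, plus drift.
    return identity + rng.normal(scale=0.1, size=identity.shape)

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

class Pipeline:
    def __init__(self, drift_threshold: float = 0.15):
        self.identities: dict[str, np.ndarray] = {}
        self.drift_threshold = drift_threshold

    def define_identity(self, name: str, references: list[np.ndarray]) -> None:
        # Step 1: fold all references into one stable identity embedding.
        self.identities[name] = np.mean(
            [extract_identity(r) for r in references], axis=0)

    def generate_shot(self, character: str) -> tuple[np.ndarray, bool]:
        # Step 2: condition on the stored embedding, not on prompt text.
        shot = generate(self.identities[character])
        # Step 4: compare embeddings and flag the shot if the face drifted.
        drift = 1.0 - cosine(extract_identity(shot), self.identities[character])
        return shot, drift > self.drift_threshold

pipeline = Pipeline()
pipeline.define_identity("Character A", [rng.normal(size=8) for _ in range(3)])
shot, flagged = pipeline.generate_shot("Character A")
print("re-roll this shot:", flagged)
```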

This is what Cleom does in Pro Canvas. Identity nodes connect to scene nodes. Scene nodes inherit style from the scene block. Generation requests carry the right conditioning automatically. You stop thinking about consistency as a prompt-writing problem and start thinking of it as a graph problem.
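
To give the graph framing a shape, here is a toy version of it. The node types and fields are hypothetical illustrations, not Cleom's actual Pro Canvas internals.

```python
from dataclasses import dataclass, field

@dataclass
class IdentityNode:
    name: str
    embedding_id: str         # handle to the stored identity embedding

@dataclass
class SceneBlock:
    scene_id: str
    style: dict               # locked set, time of day, costume, palette

@dataclass
class SceneNode:
    block: SceneBlock
    cast: list[IdentityNode] = field(default_factory=list)

    def conditioning(self) -> dict:
        # A generation request walks its edges: style is inherited from
        # the scene block, identity from every connected identity node.
        return {
            "style": self.block.style,
            "identities": [c.embedding_id for c in self.cast],
        }

char_a = IdentityNode("Character A", embedding_id="emb_char_a_v1")
scene_3 = SceneBlock("scene_3", {"set": "night exterior", "lighting": "setup C"})
shot_4 = SceneNode(scene_3, cast=[char_a])
print(shot_4.conditioning())  # exactly what the model gets conditioned on
```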

The future is identity infrastructure

The next wave of AI video tools won't compete on raw model quality alone — that wave is already crashing. They'll compete on how well they manage identity, style and scene state across an entire production.

The tools that win will be the ones that turn identity into a first-class object. Not a string in a prompt. Not a single reference image. A persistent, embeddable, queryable identity that follows a character through the entire timeline of a film.

This is unglamorous work. It's not a hero demo. But it's the difference between making one clip and making a movie.

Conclusion

If you've ever opened an AI video tool and given up on a longer project because the character kept changing, you weren't doing it wrong. The tool was missing a layer.

Character consistency isn't solved by better prompting. It's solved by production infrastructure that treats identity as something to maintain — across shots, across scenes, across an entire piece of work. That's what we're building.

The model generates. The pipeline remembers. That's the actual job.