When the camera disappears, what happens to realism? Generative AI can make an image that refers to nothing yet still reads, unmistakably, as cinema. This is a companion walkthrough of AI Cinematic Realism, the framework in brief: what realism becomes after the camera, and how to build, judge, and teach it. The full text, with its sources and studies, is in the book.

A Companion to the Book

For more than a century, cinema rested on a single promise: that what you were watching had been there. That promise is over. This is AI Cinematic Realism (AICR): an account of what realism becomes when the camera disappears, and of the one question worth asking in its place.

Here’s the question that no longer works. Is this real? It’s a forensic reflex. It asks whether a sensor recorded an event; it belongs to the logic of evidence, and it only ever returns a yes or a no. That was the right question when every image had a camera behind it. Generative imagery breaks the premise. The synthetic image refers to nothing. No lens gathered its light. And it is still read, unmistakably, as cinema.
So the useful question becomes: is this true? Does the image carry narrative truth? Does it hold emotional weight? Does it use the machine’s own affordances, its fluidity and its dream logic, to say something a camera never could?
Notice what changes when you switch questions. Real is binary. True is graduated. And anything graduated can be measured, which is exactly where this framework is heading.

If the book reduces to one sentence, it’s this: realism is coherence, and coherence is intention. The image used to borrow its truth from a world it recorded. It can’t anymore: there’s no world behind it. So it has to earn that truth a different way: by holding together under your scrutiny. Perceptually, environmentally, authorially: three kinds of holding-together I’ll come to.
And coherence of that kind is never an accident. Someone builds it. Someone means it. Someone answers for it.
The first edition of this book was an argument for that idea. What this edition adds is the architecture: a way to actually build the truth, to judge it, and to teach it. The rest of the talk is that architecture.

Here’s the shape of the road. Seven movements. First, the rupture: why realism after the camera needs a new account at all. Then the manifesto: eight convictions, the why turned into a how. Then the heart of it, the architecture: the Frame, the strata, the craft, the pillars. After that, the texture of the medium and the maker behind it. Then the ethics. Then the tools: one for judging the work, one for teaching the eye. And finally the studio the whole thing came out of. If you only hold onto one part, make it the third. But these parts are built to come apart, so any one of them stands on its own.

Part one. The rupture. Cinema has carried the promise of realism since the very first films, since spectators supposedly ducked the Lumière train. Here’s the thing that promise always quietly leaned on, and the thing generative AI takes away.

Strip an AI-generated shot of everything cinema once required, and something strange is left over. No lens gathered the light. No sensor recorded an event. Nothing stood in front of the frame to be photographed. And yet you feel its depth. You read its mood. You sense the weight of a moment you know never happened.
Ask the forensic question, was this filmed?, and the answer is a flat no. Ask the cinematic question, does this read as film?, and the answer is an unmistakable yes.
For a century, realism rested on an indexical bond between the image and the world: the photograph as a trace of something that was actually there. That bond is gone. So realism has to be re-grounded: not as a property the medium guarantees, but as a coherence the image manages to sustain.

Once the world stops anchoring the image, realism stops being a property of the medium and becomes a phenomenon of experience. That’s where philosophy earns its place. For Merleau-Ponty, perception isn’t passive reception; it’s active engagement, the body meeting the world. The same is true of the moving image. We don’t simply receive a film; we meet it.
Which means realism doesn’t live in the pixels. It lives in the encounter, in whether the image holds up under our looking. And if meaning comes from how a thing is structured, rather than from a recorded origin it can no longer claim, then we need a vocabulary for that structure. Building that vocabulary is the work of the book.

Part two. The manifesto. Before there was an architecture, there were convictions, eight of them. I wrote them as provocations, for navigating a terrain without maps, where the tools change weekly and the ground keeps shifting. Here they are.

Eight principles. One: realism is not replication. We don’t recreate the world, we simulate its emotional gravity; not resemblance, resonance. Two: the frame is a thought, not a capture. Every image is a synthesis of memory and computation. Three: time is a fluid construct. AI cinema loops, stalls, reverses; its rhythm answers to emotion, not the clock. Four: imperfection is proof of conscious assembly. The glitch isn’t a flaw, it’s the fingerprint of something built on purpose.
Five: emotion can be engineered. A latent vector can mourn, because meaning comes from structure, not origin. Six: the camera is a myth. The cinematic eye has moved into code, into prompts, into generative space. Seven: ethics are embedded. Every scene carries a training set, a bias, a filter, and we stay awake to it. Eight: spectatorship is rewritten. We no longer watch to confirm the world; we watch to confront the constructed. I won’t dwell on all eight. The point is what happened to them next.

Because these were provocations, and in the months after I wrote them, every one of them hardened into structure. Realism is not replication became the governing thesis: felt coherence in place of fidelity. The frame is a thought became the Ideational Frame. Time is a fluid construct became a pillar: temporal implication, and its expansion into synthetic time. Imperfection is proof of conscious assembly named a whole discipline. Emotion can be engineered became a criterion the rubric scores directly. The camera is a myth became the premise the craft grammar inherits and rebuilds. Ethics are embedded became accountable authorship. And spectatorship is rewritten became a pedagogy.
The manifesto stated the convictions. The rest of the book is the architecture that makes them usable, and that architecture is where we go now.

Part three. The architecture. This is the heart of the second edition, the part the first one was missing. Four moves: the Frame, the strata, the craft, the pillars. And watch for the thing that happens at the end of them: the architecture closes on itself.

Four moves, in order. The Frame names what a synthetic image has to achieve: the commitments it inherits from cinema whether it wants them or not. The strata organize those commitments into layers we can actually analyze and teach. The craft grammar supplies the methods: a century of staging and cinematography, repurposed. And the four pillars mark the places where the maker’s own deliberate work matters most. Each move begins in inheritance (everything cinema already knows how to do) and ends in invention, because in the latent space each of these disciplines gains a power the camera never had: the capacity to bend its own rules in service of feeling.

Start with the Frame. The strange fact about a synthetic image is that it arrives already carrying cinema’s commitments. The model has ingested a century of film, and it can’t help reproducing the reasoning of that film. That inheritance is what I call the Ideational Frame: everything a synthetic image is implicitly expected to honor, before the maker makes a single deliberate choice.
There are eight of these inherited commitments: implied temporality, embodied vantage, material plausibility; spatial coherence, atmospheric integration, expressive world-building; narrative implication, and character interiority.
Eight commitments, but they don’t sit as a flat list. They resolve into three layers. And those three layers are the spine of everything else.

The three strata. This is the slide to hold onto. A synthetic image succeeds or fails at three levels, stacked, and it can be flawless at one and bankrupt at another.
The first is perceptual: the image at the speed of the eye, what registers in the half-second before judgment. The morphing hand, the gaze that drifts off its axis, the limb that gains a finger between frames: those are perceptual failures, and they’re violent precisely because they’re pre-rational. We flinch before we reason. The test here isn’t is it sharp? It’s does it hold together at the level of direct seeing?
The second is environmental: how the world is built. Geometry, scale, mood, weather. And here’s the strange freedom: a world doesn’t have to be possible to be coherent. The staircase that returns to itself can read as true, as long as it obeys its own declared logic. What breaks this layer isn’t impossibility; it’s inconsistency.
The third is authorial, the hardest, because it can’t be rendered at all. It’s the sense that the image belongs to a story, that its figures have an inside. No quantity of pixels produces interiority. It’s implied, or it’s absent. The test is does it feel meaningfully authored?
And here’s why this matters beyond description: it’s a diagnostic. When an image disappoints you, you can now locate the failure: surface, world, or meaning. A critic who can name the stratum has moved past looks real and looks fake into language precise enough to teach.

But no single stratum produces realism on its own. It emerges between them, and naming the in-between is what turns three categories into a working model. When the perceptual and the environmental hold together, you get physical believability: a world that looks seen. When the environmental and the authorial align, you get narrative worldbuilding: a place that means something. When the perceptual and the authorial meet, you get stylistic intentionality: a look that reads as a choice rather than an accident of the model.
And when all three hold at once, you get what the framework simply calls cinematic realism: a coherence so complete that the question of the camera never even comes up. That’s also why, later, the rubric overlaps at the seams. The framework is interactive by design, so the instrument that measures it has to be interactive too.
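
For readers who think in data structures, the interplay just described can be sketched as a lookup from the set of strata that hold to the quality that emerges. This is only an illustration, assuming a simple hold/doesn't-hold judgment per stratum; the names Stratum, INTERPLAY, and emergent_quality are hypothetical and do not come from the book.

```python
from enum import Enum
from typing import Iterable, Optional


class Stratum(Enum):
    PERCEPTUAL = "perceptual"
    ENVIRONMENTAL = "environmental"
    AUTHORIAL = "authorial"


P, E, A = Stratum.PERCEPTUAL, Stratum.ENVIRONMENTAL, Stratum.AUTHORIAL

# Hypothetical table: which named quality emerges when a given
# combination of strata holds together (wording taken from the talk).
INTERPLAY = {
    frozenset({P, E}): "physical believability",
    frozenset({E, A}): "narrative worldbuilding",
    frozenset({P, A}): "stylistic intentionality",
    frozenset({P, E, A}): "cinematic realism",
}


def emergent_quality(holding: Iterable[Stratum]) -> Optional[str]:
    """Return the quality that emerges from the strata that hold.

    A single stratum (or none) yields None, reflecting the claim that
    no stratum produces realism on its own.
    """
    return INTERPLAY.get(frozenset(holding))


print(emergent_quality([P, E]))     # physical believability
print(emergent_quality([A]))        # None
print(emergent_quality([P, E, A]))  # cinematic realism
```

The binary judgment is a simplification made for the sketch; the framework itself treats coherence as graduated.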

So how does a maker actually build these layers? Not from nothing. Everything the Frame asks for, cinema has spent a hundred years learning to construct. Eight disciplines, each crossing from what it did in front of a lens to what it becomes when there’s no lens at all.
Mise-en-scène (the director’s total command of the frame, the way Ozu builds meaning in Tokyo Story out of precise, quiet arrangement) becomes total in the latent space, because every element is placed, and the maker answers for all of it. Worldbuilding runs from the authentic replica of Cameron’s Titanic to the painted nightmare of The Cabinet of Dr. Caligari, and becomes the authoring of non-Euclidean geographies whose laws are set by theme, not gravity. The expressive surface (costume, makeup, the hard shadows of film noir) becomes light as authored intent, an emotional spotlight that follows no source but the story’s center. Performance (Bergman staging bodies against the horizon) becomes the orchestration of a presence rather than the direction of a person. Composition (Kurosawa’s diagonals, or the way Bong Joon Ho’s Parasite renders a whole social order in the arrangement of a room) gains the power to make a feeling physically manifest.
And then the strangest inheritance of all: the camera that isn’t there. The lens becomes latent optics: focus governed by emotional importance, not distance. The angle becomes psychological vantage: a horizon that drops as a character gains power, a world that cants while the figure stays level. And movement becomes resonant flow: not a dolly, but the latent space itself reshaping to the pace of a journey.
The classical masters here aren’t nostalgia. They’re the training data of the art form. The point isn’t to copy their conventions; it’s to inherit their soul.

Now, of everything the Frame asks for, four commitments stand apart. These are the four pillars, and they share a quality: they’re the places where the maker has to intervene most consciously, and where the reward for intervening is a power physical cinema couldn’t reach.
First, temporal implication: a before, a during, and an after, without literal photographic motion; and its expansion, synthetic time, where a face can seem to wear several ages at once. Second, spatial coherence: a world a body could step into, expanding into impossible geometries that hold because they keep their own internal law. Third, atmospheric continuity: mood and light binding separate frames into one emotional experience, expanding into synthetic atmospheres, weather that behaves like a character. Fourth, character interiority: a figure that seems to possess a mind, expanding into literalizing the psyche, where the environment itself shifts to mirror an inner state, and inner weather becomes visible weather.
And here’s the closing move. Map these four back onto the strata, and they distribute one, two, one: temporal at the perceptual surface, spatial and atmospheric in the construction of the world, interiority in the shaping of meaning. They land in exactly the layers the strata chapter placed them. The architecture closes on itself.
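
The one, two, one distribution can be checked mechanically. The sketch below encodes the pillar-to-stratum placement exactly as stated above and counts pillars per stratum; the names PILLAR_STRATUM and distribution are hypothetical, chosen only for this example.

```python
from collections import Counter

# Hypothetical encoding of the placement described in the talk:
# each pillar is assigned to the stratum where it does its work.
PILLAR_STRATUM = {
    "temporal implication": "perceptual",
    "spatial coherence": "environmental",
    "atmospheric continuity": "environmental",
    "character interiority": "authorial",
}

STRATA_ORDER = ("perceptual", "environmental", "authorial")


def distribution(mapping=PILLAR_STRATUM):
    """Count how many pillars land in each stratum, in stacking order."""
    counts = Counter(mapping.values())
    return tuple(counts[stratum] for stratum in STRATA_ORDER)


print(distribution())  # (1, 2, 1)
```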

There’s an ethical fact buried in all of this, and it’s worth saying plainly. A camera has an alibi. It only recorded what was there; much of what makes its image cohere comes for free, supplied by a world that was already coherent before the lens arrived. The maker of synthetic cinema has no such alibi. Nothing stands behind the frame to guarantee its logic. To bend time, to legislate a space, to conjure an atmosphere, to turn a psyche inside out: each of those is an authored decision.
And authorship is accountability. The more freely the latent space lets you exceed the camera, the more the coherence of your image becomes a moral fact, and not merely an aesthetic one. Realism is coherence; coherence is intention; and intention, in the end, is answerable.

Part four. Texture, and the maker. Two ideas that cut against the grain of how AI video usually gets talked about. The first is about the glitch. The second is about who’s actually doing the work.

The standard advice for making “cinematic” AI is a checklist: consistent faces, even lighting, smooth sound. But that reduces realism to polish: gloss without gravity. When image-craft outruns intention, you get a hollow sheen: footage that looks convincing and persuades no one. Spectacle standing in for conviction.
The studio taught me a single maxim against this, and it may be the most useful sentence in the book: truth over resolution. The instinct to clear the image, to polish away every artifact, tends to destroy the very atmosphere that was carrying the feeling. A grainy, unstable image can be perfectly coherent if its instability is consistent. A flawless render can be empty. Realism, it turns out, is not the absence of noise. It’s the presence of an atmosphere heavy enough to hold a memory.

There’s a myth that the AI creator is passive: a prompt typist, feeding words into a black box and waiting for a payout. That view reduces the artist to a consumer and the creative act to a transaction. And if you accept it, then yes, AI video is just automation.
I reject the premise. The maker is not a prompt typist; the maker is a moral agent. Someone chooses what to prompt: the intention that comes before the output. Someone curates what to keep, directing the machine’s accidents instead of erasing them. And someone publishes what to release, and answers for every one of those choices. This isn’t a style to imitate. It’s a genre to invent.

Which brings the stakes into focus. The old worry was that anything could be made to look real. The new condition is sharper: everything can be made to feel real. We’re entering a culture of asymmetrical knowledge: the maker may know the full extent of a video’s synthetic nature, and the audience may not. That gap is where the danger lives.
So draw the line clearly. On one side, forgery: forcing AI video into the domain of captured reality, wanting it to pass as evidence. The goal there is deception: to trick the eye. On the other side, filmmaking: a genre with a recognizable aesthetic, one that privileges resonance over mimicry. When the goal is to move the heart instead of fool the eye, the work no longer has to win by hiding that it’s made.
That’s the safety layer of style. It’s how we stop being forgers and start being filmmakers. And notice: I don’t leave that ethic at the level of a principle. I build it into how we score the work.

Part six. Evaluation and pedagogy. A vocabulary can be admired. An instrument can be used: handed to a student, applied by a jury, argued over in a seminar, refined against new work. So here’s the instrument. And then here’s what happens when you turn it around and point it at the student instead of the screen.

The forty-point rubric is the architecture made scorable. Eight criteria, each scored one to five, summing to a total out of forty. Two of them read the perceptual surface: perceptual realism, and temporal coherence. Two read the constructed world: environmental realism, and atmospheric continuity. Two read the authored layer: character realism, and authorial intentionality. And two cut across all three, because they’re properties of the whole image: emotional plausibility, and ethical accountability. That the criteria don’t partition cleanly isn’t a flaw; it’s the interplay principle showing up in the measure.
The total reads through four tiers: not grades, but descriptions of how completely the coherence holds. Thirty-two to forty: highly convincing, the technology dissolves. Twenty-four to thirty-one: strong, but one or two strata waver under scrutiny. Sixteen to twenty-three: developing, real strengths undercut by recurring inconsistency. Eight to fifteen: not yet persuasive, a collection of fragments rather than a coherent whole.
Two rules keep it honest. First, a number never stands alone: every score is paired with a note that says why, because the score is where the conversation starts, not where it ends. And second, this is not a fidelity meter. A perfectly photoreal clip can score low if its world contradicts itself or its figures feel empty. A frankly stylized, openly synthetic sequence can score high if every choice holds together and means something. The rubric rewards cinematic truth, not photographic mimicry.
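
The arithmetic of the rubric is simple enough to sketch in a few lines of Python. What follows is one possible encoding, not the book's own tooling: the eight criteria, the one-to-five range, the tier boundaries, and the rule that every score carries a note come from the description above, while the identifiers (CRITERIA, TIERS, Score, evaluate) are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Mapping, Tuple

# The eight criteria named in the talk (identifier spellings are assumed).
CRITERIA = (
    "perceptual_realism", "temporal_coherence",          # perceptual surface
    "environmental_realism", "atmospheric_continuity",   # constructed world
    "character_realism", "authorial_intentionality",     # authored layer
    "emotional_plausibility", "ethical_accountability",  # cut across all three
)

# Tier floors from the talk: 32-40, 24-31, 16-23, 8-15.
TIERS = (
    (32, "highly convincing"),
    (24, "strong"),
    (16, "developing"),
    (8, "not yet persuasive"),
)


@dataclass(frozen=True)
class Score:
    value: int  # one to five
    note: str   # the "why": a number never stands alone


def evaluate(scores: Mapping[str, Score]) -> Tuple[int, str]:
    """Validate a full score sheet and return (total, tier label)."""
    if set(scores) != set(CRITERIA):
        raise ValueError("score sheet must cover exactly the eight criteria")
    for name, score in scores.items():
        if not 1 <= score.value <= 5:
            raise ValueError(f"{name}: score must be between 1 and 5")
        if not score.note.strip():
            raise ValueError(f"{name}: every score needs a note")
    total = sum(score.value for score in scores.values())
    # The minimum possible total is 8, so a tier is always found.
    tier = next(label for floor, label in TIERS if total >= floor)
    return total, tier


sheet = {name: Score(4, "holds under scrutiny") for name in CRITERIA}
print(evaluate(sheet))  # (32, 'highly convincing')
```

Rejecting a score without a note is a deliberate choice in this sketch: it makes the first rule above something the instrument enforces rather than something the scorer has to remember.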

Now turn the framework around. Almost all the worry about AI in education collects on one word: cheating. Did the student write this, or did the machine? That’s a real question, but it circles a small problem while a larger one goes unnamed. The larger problem is atrophy. When a tool can produce a finished image in seconds, the faculties we once exercised in making things quietly fall out of use. The student stops noticing. Stops testing whether a thing holds together. Stops asking what it means and who it’s for.
So point the three strata not at the machine’s output, but at the student’s own attention, and the rubric becomes a curriculum. The perceptual stratum is the discipline of noticing: catching what’s off before you can even say what; the refusal to look past things. It’s the same attention that catches the flawed step in a proof, or the off note in an argument. The environmental stratum is coherence thinking: asking whether something holds together as a system, whether this world could exist independently of the prompt that produced it. That’s one of the most transferable faculties in all of education: the historian, the scientist, the engineer all do exactly this.
And the authorial stratum is moral agency, what the framework calls aboutness. A point of view, a reason to exist, a stake. AI can generate images without end. What it cannot generate, on its own, is aboutness. That’s the part only a person brings, and so it’s the part education has to protect most carefully.
Noticing, coherence, responsibility for meaning. These aren’t film skills. They’re the foundational faculties of an educated mind, and they’re exactly what frictionless generation lets waste away.

I want to be honest about where this came from, because it wasn’t invented at a desk. It was derived from my own practice: a body of work called the Life-world Series, now past thirty studies. The strata, the Frame, the pillars: none of them were deduced and then illustrated. They were noticed, in the act of trying to make a synthetic image hold together.
The series splits into two eras. The first twenty-five studies were lens-based: a physical camera, beginning in Hong Kong in 2016, chasing the gravity of the ordinary. Then, in 2025, a deliberate “Year Zero”: the move from the camera to the latent space. The medium inverted, from capturing a world to directing a machine that dreams one, but the mission held.
Two studies make the point. In Study 30, I asked the model for an empty corner, and it hallucinated a hand reaching from the shadows: the Ideational Frame made visible, a model so saturated with human-centered cinema it can’t imagine a corner without a witness. I kept the glitch instead of deleting it: accountable authorship as a practical act. In Study 31, a single brass pocket watch is placed in four different worlds, and its meaning transforms each time: the strata held as a controllable instrument.
Across the whole decade, glass lens and latent field, one thing stays constant: the eye, the author. The tools were replaced completely; the author was not. Meaning is the part that doesn’t transfer to the tool.

When the first edition reached its conclusion, the call for a new language was mostly a promise with little behind it yet. This edition tried to keep it. The new language now has a grammar: the Ideational Frame. A structure: the three strata. A craft, drawn from a century of cinema. A discipline: the four pillars of conscious assembly. A measure: the forty-point rubric. And a pedagogy: intentional seeing. What was a call has become a working language. But the point of a language was never the grammar. It’s what the grammar lets you say. And what AI Cinematic Realism exists to let us say is that a synthetic image can be true. Not real, in the forensic sense the camera once guaranteed, but true: coherent, authored, answerable, felt. The architecture is only there to make that truth buildable on purpose.

The deepfake panic exists because bad actors try to force AI video into the domain of captured reality. They want it to pass as evidence; they want deception. This genre refuses that premise. When the goal is not to trick the eye but to move the heart, the work no longer has to win by hiding what it is. We stop being forgers and start being filmmakers. AI Cinematic Realism is not a replacement for cinema. It’s a new language for it. And the realism of the future is still ours to shape.

