I Built an AI Video Pipeline. It's Mostly Enterprise Integration Patterns.

I’ve spent the last few weeks building a documentary video generation pipeline. A topic goes in (say: “Sankebetsu brown bear incident 1915 Hokkaido”) and a finished mp4 comes out (a ten-minute narrated documentary, 27 scenes, original visuals, voiced by a documentary-grade TTS), all running on a homelab cluster.

Between those two endpoints sits a chain of eight passes that hand off to each other. Several fan out to “do N things in parallel and aggregate.” Several produce artifacts measured in MB per scene that should not be shoved through a message broker. A few are the right place to put a human in the loop, where one minute of editorial review saves hours of GPU time downstream.

I built it on Keip, a Kubernetes-native runtime for Spring Integration routes. Most AI pipelines I’ve seen in the wild are written in Python or NodeJS with ad-hoc orchestration; people invent their own splitters, their own waiting, their own retry loops. Every problem in that space already has a name, and the names have been settled for twenty years. Splitting work across scenes is a Splitter. Waiting for all the scene results to come back before assembling is an Aggregator. Stashing the big artifacts somewhere and passing a reference around is a Claim Check. Adding entity metadata to a prompt right before sending it to a model is a Content Enricher. Letting a human approve the outline before downstream burns GPU on a flawed structure is, near enough, an Idempotent Receiver.

I made the general case for this approach in an earlier post (“Enterprise Integration Patterns Aren’t Dead; They’re Running on Kubernetes and Orchestrating AI”), with a smaller scoring-pipeline example. This post is the bigger case study: which specific patterns mapped to which AI video pipeline problems, what Keip buys you concretely, and where the AI world needed something that classic EIP doesn’t quite provide.

What the pipeline does#

The pipeline takes a small JSON payload — slug, title, language, target duration, tone, scene count — and produces a finished mp4. Eight passes, each with a defined input and a defined artifact written to a shared NFS volume so the pipeline’s state is always inspectable on disk.

Research. The title becomes a search query against a self-hosted SearxNG instance. The top hits’ titles, URLs, and content snippets are dumped into the next pass verbatim. The whole job is to put real-world text under the LLM’s nose before it starts inventing.
Story bible. An LLM (Qwen3.6 27B on a local llama.cpp server, served through LiteLLM) reads the research and emits a structured JSON “bible” with entities, sourced facts (each carrying a source_idx pointing into the sources array), a three-act arc, and a few production fields like tone and target duration. Grammar-constrained decoding enforces the schema at the token level so the output is parseable by construction. The bible is frozen at the end of this pass; every downstream pass cites it as the source of truth, which is what eliminates the “the LLM gave me three different spellings of the same person’s name across the script” failure mode.
Outline. The same LLM, with a different prompt and a different grammar schema, produces a per-scene outline from the bible. For an 18-minute target with scene_count=27, that’s 27 scenes with monotonic 40-second time slots, each carrying a title, intent line, and a continuity-carry line that describes what the viewer should know or feel by the end. Constraints (total duration, contiguous timestamps, cold-open paid off by the final scene, enumeration of discrete time-stamped sub-events) are enforced in the prompt rules.
Per-scene narration. The outline is split into 27 individual scene messages. Each one is expanded into a JSON object containing 2-4 sentences of TTS-friendly narration, a visual prompt, alternate visual prompts, and safety notes. The Pass-3 prompt-builder reads the prior two scenes’ narration off disk and includes it as a “RECENT NARRATIONS” section, with explicit instructions to pick up the thread without repeating phrasings or re-introducing entities the listener has just met. That sliding window is what stops the script from feeling like 27 independent paragraphs glued in sequence.
Lint. A small Java component sorts the assembled scenes by id (Spring Integration aggregators release in arrival order, not message-id order) and runs structural plus semantic checks: scene count matches, ids are 1..N contiguous, timestamps are monotonic and within tolerance of the target duration, entity surface forms match the bible canonical forms, and the narration across scenes meets a few basic coherence rules. Lint failures route to a re-roll signal queue instead of crashing the pipeline.
Visuals. Each scene’s visual prompt is published to an AMQP queue consumed by a separate image-render service. That service fetches a stored ComfyUI workflow by name, substitutes the prompt and a random seed into the workflow’s KSampler node, submits the workflow to a local ComfyUI instance running FLUX dev on an AMD MI60, polls until the render completes, then uploads the resulting image to Immich in a per-run album. Pass-5 also enriches the prompt with entity visual descriptors before sending; the Content Enricher section below covers how and why.
Audio. Each scene’s narration is split out the same way and dispatched over AMQP to a self-hosted XTTS-v2 service sharing the same GPU as ComfyUI. Voice is one of the 58 built-in speakers (I use “Damien Black” for the documentary tone). Pass-6 optionally pre-substitutes phonetic spellings for hard-to-pronounce names before the text hits TTS — same Content Enricher pattern as Pass-5, covered below. One wav per scene, written to the shared volume.
Assembly. A small Python service consumes an assembly-request message and runs ffmpeg over the per-scene images and wavs. For each scene it fetches the image from Immich by asset id, reads the wav from the PVC, sizes the clip to the actual audio duration plus a small trailing pad, applies a ken-burns zoompan to give the still image motion (rendered at 2x output resolution then lanczos-downscaled to 1920x1080 so the zoom doesn’t visibly stair-step), adds fade-in/out and a short lead silence for smoother transitions, and concatenates everything into the final mp4.

Eight passes, all needing to talk to each other, several with fan-out to “do N things in parallel and aggregate,” several with expensive artifacts that should not be shoved through RabbitMQ, several with a natural place for a human to step in and approve before downstream burns hours of GPU time.

What’s stock, what I built#

The goal throughout was to assemble, not to construct from scratch. Most of the work in this pipeline is done by open-source services I deploy in the cluster and call:

SearxNG for the Pass-0 research search.
LiteLLM as the LLM gateway, with grammar-constrained decoding passed through to llama.cpp.
llama.cpp serving Qwen3.6 27B (Q4_K_M) on one MI60.
ComfyUI running FLUX dev on the other MI60 for image generation.
Coqui XTTS-v2 for narration audio, sharing the ComfyUI GPU.
Immich as the image store and review surface.
RabbitMQ as the message broker.
An NFS-backed Kubernetes PersistentVolumeClaim as the shared workspace.
Keip as the integration runtime, hosting all the Spring Integration routes as ConfigMap-backed IntegrationRoute custom resources on the cluster.

What I had to write is the connective tissue plus a handful of bridges where no off-the-shelf API existed in the shape I needed:

The route XML. Eight passes worth of Spring Integration routes (channels, splitters, aggregators, transformers, AMQP adapters, HTTP outbound gateways), declared in XML and shipped as ConfigMaps. This is the bulk of the pipeline. It is, very deliberately, configuration rather than code.
A Java continuity-lint component. Wired into the route via <int:service-activator ref="continuityLint" method="lint">. Sorts the assembled scenes by id and runs structural plus semantic checks.
A small Java entity helper. Two static methods called from SpEL inside the per-scene transformers: one appends entity visual_descriptor fragments to FLUX prompts, the other does name-to-pronunciation substitution on narration text before TTS. Maybe sixty lines including imports.
A shared image-render service. A separate Spring Integration route that wraps ComfyUI behind an AMQP request/response interface, fetches workflows stored in ComfyUI’s userdata API by name, substitutes the prompt and seed, polls for completion, and uploads results to Immich. Shared across the video pipeline and an unrelated image-generation pipeline that lives in the same cluster.
An AMQP consumer wrapping XTTS. Coqui’s library is Python and synchronous; the shim is a small Python process that subscribes to an xtts.synthesize-request queue, runs the requested text through XTTS, and publishes the wav bytes back to the caller’s reply queue via correlation-id. The same process also exposes a tiny OpenAI-compatible HTTP endpoint (/v1/audio/speech) for one-off testing and the occasional swap-in of a cloud TTS, but the production pipeline drives it over AMQP — request/response stays inside the broker and Pass-6 doesn’t have to hold an HTTP connection open for the ~20 seconds per scene XTTS takes to render.
A Python assembly service. Consumes assembly-request AMQP messages and orchestrates ffmpeg. Fetches images from Immich, reads wavs from the PVC, applies the zoompan and fades, concatenates the final mp4. Around 400 lines of Python; one of the few places where shelling out from Python was the natural fit.

The split matters because most of what people call “building an AI pipeline” turns out to be writing connective tissue. The components themselves — search, LLM, image gen, TTS, broker, storage — are already excellent. The thing to build is the wiring, and the thing to choose is the wiring runtime.

The patterns that did the heavy lifting#

Splitter, then Aggregator#

The per-scene work is the obvious one. Pass-3 (narration), Pass-6 (audio), and Pass-5 (visuals via a separate render-service route) all use the same shape: split the scene list, do per-scene work in parallel, aggregate the results back into one message, emit a manifest.

In Spring Integration that’s literally a <int:splitter> followed by a <int:aggregator> with a correlation strategy and a release strategy:

<int:splitter input-channel="pass5TriggerSplit"
              output-channel="pass5RenderRequest"/>

<!-- ... per-scene transformer + AMQP outbound ... -->

<int:aggregator input-channel="pass5DoneAggregate"
                output-channel="pass5ManifestChannel"
                correlation-strategy-expression="headers['slug']"
                release-strategy-expression="size() == one.headers['sceneCount']"
                expire-groups-upon-completion="true"
                group-timeout="1800000"/>

Two things to know if you’re new to SI aggregators. First, one.headers is only valid inside the release-strategy expression; downstream of the aggregator the released Message<List<T>> exposes plain headers (merged from the input messages by DefaultAggregatingMessageGroupProcessor, which keeps the first message’s headers). Second, if your splitter can ever be re-run (because the AMQP listener nacks and the message redelivers, say), expire your groups eagerly or you will get aggregator pollution where stale messages from a partial first run mix with a clean second run, and the release predicate never fires. I learned that one the hard way.

Content Enricher for entity metadata#

The bible’s entities[] carries two fields that downstream passes need: a pronunciation field with a phonetic English re-spelling for TTS, and a visual_descriptor field with a dense prompt fragment for image generation. The bear in this run, for instance, has pronunciation: "Keh-sah-GAH-keh" and visual_descriptor: "massive aggressive Hokkaido brown bear, dark brown fur, scarred face, no cartoonish features".

Pass-5 (visuals) calls a tiny static Java method to walk the entity list and append visual descriptors for any entity whose canonical name appears in the scene’s visual prompt. Pass-6 (audio) calls a different method to regex-substitute canonical names with their phonetic spellings before posting to the TTS service. Both are textbook Content Enricher: take a payload, augment it from a side reference (the bible), pass the enriched payload downstream.

<int:transformer input-channel="pass6SceneChannel"
                 output-channel="pass6HttpChannel"
    expression="T(...MessageBuilder).withPayload(
        new ObjectMapper().writeValueAsString(Map.of(
          'input', T(com.example.video.EntityHelper)
                     .applyPronunciation(payload['narration'], headers['entities']),
          'voice', 'Damien Black',
          'response_format', 'wav')))
      .copyHeaders(headers)
      .setHeader('Content-Type', 'application/json')
      .build()"/>

This is the pattern that gave me the biggest jump in fidelity. Without entity-driven enrichment, the model named the bear correctly in the script but the TTS mangled it, and the FLUX renders gave me a friendly cartoon bear that looked like Smokey because the prompt said “bear” without anchoring what kind. Once the enricher injects visual_descriptor into every scene that mentions the bear, the bear comes through as the same fearsome animal in all 27 renders.

Claim Check for the big artifacts#

Per-scene wavs are around 1 MB. Per-scene jpgs are similar. Multiplied by 27 scenes, with multiple passes reading from and writing to the same artifacts, you really do not want to ship the bytes through RabbitMQ.

Classic Claim Check: write the bytes to durable storage, pass a pointer through the messaging system. My pointer is the slug. Every pass that needs the artifacts reads them from an NFS-backed PersistentVolumeClaim at /var/runs/{slug}/, then writes its own outputs to the same directory. Inter-pass messages are tiny: {"slug": "..."} is the entire payload of every gate signal.

The bonus, which I didn’t appreciate until a run failed midway, is that this gives you a free debug surface. When something goes wrong, cat 01-bible.json, cat 02-outline.json, ls scenes/. The pipeline’s intermediate state is always on disk, in a human-readable format, exactly where you’d look for it. No serialization mystery, no “what was in the message that we just acked,” no special tooling.

AMQP signal queues for HITL gates#

The pipeline has four gates where a human can step in: outline-approved, script-approved, visuals-approved, audio-approved. Each is a durable AMQP queue. The “approval” payload is just {"slug": "..."}. The downstream route consumes the signal, reads the relevant artifact from the PVC, and continues.

This is closer to a manual Idempotent Receiver than anything else in the patterns book. The route waits patiently. A human (or, for unattended runs, a watcher script) publishes the signal. The route fires. If you publish twice, the route handles it; if you never publish, the route waits forever (which is the correct behavior at a gate).

I learned to leave at least one of these gates manual no matter how much I want fully unattended operation. The outline gate is where narrative shape gets decided, and 30 seconds of human review there saves hours of GPU time downstream when the LLM made a structural choice you’d have caught at a glance.

The pattern EIP didn’t have#

The classic patterns book assumes that when a transformer produces a message, the message conforms to whatever schema the downstream consumer expects. Datatype Channel. In a normal Spring Integration system you enforce that by writing code that produces the right shape.

LLMs do not work that way by default. The model might emit JSON that looks correct and parses fine, but quietly omit a field you needed, or include one you didn’t ask for, or add prose around the JSON, or hallucinate an extra entity that breaks the downstream join.

The fix that closed the loop for me was grammar-constrained decoding at the model boundary. My LLM gateway is LiteLLM, which accepts a response_format: {type: json_schema, json_schema: {...strict: true}} extra_body parameter and passes it through to the underlying engine. llama.cpp implements the constraint at the token-sampling level: the model literally cannot emit a token sequence that would violate the schema.

I extended the bible schema to require the new pronunciation and visual_descriptor fields with additionalProperties: false. Once the schema enforced it, the model populated both fields cleanly on every run, with Pass-1 producing exactly the structure Pass-5 and Pass-6 needed to consume. The schema is the Datatype Channel contract, enforced at the protocol level instead of as a downstream parse error.

The reasoning gotcha worth flagging: the same llama.cpp build supports a chat_template_kwargs.enable_thinking flag for models with reasoning. With reasoning on and grammar mode on, the model spent its entire token budget on the reasoning block and emitted empty content with finish_reason: length. Grammar mode already constrains structure, so the reasoning block isn’t doing work; disable it when you turn grammar mode on.

Why Keip specifically#

I could have written all of this directly in Spring Boot with the SI XML files in the classpath. Keip’s contribution is that the route XML becomes a Kubernetes ConfigMap, and the entire integration is an IntegrationRoute Custom Resource. The runtime is provided by Keip’s operator; my pod is a thin Spring Boot wrapper that imports the routes and registers any custom beans (the continuity lint, the entity helper) I want to call from SpEL.

What this buys me in practice:

Route changes are kubectl apply followed by a pod rollout. No image rebuild for a prompt tweak or a new transformer.
The route XML lives in a git repo next to the pod’s deployment manifest. Reviewing a pipeline change is reviewing a diff to a file, not picking apart a custom orchestrator.
Standard Kubernetes tooling works because the integration is a Kubernetes workload. Liveness and readiness probes, log collection, Prometheus scraping, secret injection through env vars, ConfigMap and Secret hot-reload, rolling deployments with zero downtime, all behave the way they would for any other pod.
Scaling is the cluster’s job. Each integration runs as its own pod, with its own resource requests and limits, so a heavy Pass-3 doesn’t starve a lightweight assembly pod. When a pass is CPU-bound rather than GPU-bound, kubectl scale --replicas=N adds workers and AMQP fans the work across them with no code change; the splitter and aggregator already handle concurrent consumers correctly.
Self-healing for free. If a pod OOMs or panics, the kubelet restarts it; unacked AMQP messages return to the queue and the new pod picks them up where the old one left off. The transactional message handling at the Spring Integration layer composes with the restart semantics at the Kubernetes layer.
Service discovery and scheduling are solved problems. The LLM gateway, the image-gen service, the TTS service, the broker are all in-cluster, reachable by DNS name. GPU-resident workloads can be pinned to the right node with affinity rules, and network policies isolate routes that shouldn’t talk to each other. None of this required custom plumbing.
I can declare multiple IntegrationRoute CRs against different image tags if I want to A/B a route change. The pod is cheap to spin up.

Compared to writing this in Python with bespoke orchestration, the Spring side gives me thread-safe channels, declarative AMQP wiring, transactional message handling, a well-understood failure model around acks and redeliveries, and a vocabulary that survives googling. Compared to plain Spring Integration, Keip inherits the entire Kubernetes operational story for free: scaling, self-healing, scheduling, observability, and rollout safety are all just there because the integration runs on the same control plane as everything else in the cluster.

This is one of several Keip integrations I run in the homelab for different jobs — a research-feed scoring pipeline, an image-generation service, a health-tracking ingest, and others — all following the same patterns at different scales. I’ll write up the inventory separately.

What I’d tell someone starting#

If you’re building anything more than a single LLM call wrapped in retry logic, you’re building a messaging system. The patterns you’ll want already have names, and the runtimes that implement them already exist.

The three I’d start with, in order:

Claim Check. Use a shared PVC (or S3, if you must) for artifacts. Pass the slug, not the bytes. Your AMQP broker will thank you, and your debug surface becomes ls and cat.
Splitter + Aggregator. The moment you have per-scene, per-chunk, or per-anything work, this is your shape. The aggregator’s release-strategy expression is where the “we have N things now, proceed” logic lives. Don’t reinvent it.
Grammar-constrained decoding. Treat the LLM as a producer that needs schema enforcement at the protocol level. Don’t catch parse errors downstream; prevent them at the source.

And keep a gate. One human-approved checkpoint, somewhere near the front of the pipeline where structural decisions get made, will save more compute than any optimization you can apply downstream.

For the higher-level framing around why this matters, the shift from writing prompts to writing pipelines, see My Job Is to Write Loops.