How to Write Prompts for LTX-2 — ComfyUI Beginner Guide

Tag-based prompts fail on LTX-2. This guide covers the 6-block structure that gets consistent results, copy-paste examples for all 4 generation modes, and a free skill file that lets AI write structured prompts for you.

Prompt examples: 7 ready to use
Generation modes: T2V · I2V · A2V · FLF2V
Models covered: LTX-2 & LTX-2.3

By Earngenix Team · 15 min read

⚡ Quick Answer

To write prompts for LTX-2, structure each prompt like a shot description, not a list of tags. Cover six things: scene setup, action sequence, camera movement, lighting, audio, and visual style. A 5–10 second clip needs roughly 6–10 detailed sentences, written in present tense.

What makes LTX-2 prompting different from every other model

You typed something like: cinematic woman, city at night, rain, moody, 4K. You hit generate. The output flickered, morphed, and looked nothing like your idea. That's not a settings problem — it's the wrong approach entirely.

If you've used Stable Diffusion, FLUX, or Midjourney before, you have one habit to break. Those models reward tag-stacking — short comma-separated lists of keywords and quality modifiers. LTX-2 actively resists that approach.

Why tag-style prompts fail on LTX-2

LTX-2 is a video model. It doesn't blend mood keywords — it generates a scene that unfolds over time. It needs to know what happens, in what order, and from what angle. A list of adjectives gives it nothing to build a sequence from. The result is drift, flicker, incoherent motion, and a clip that has nothing to do with your idea.

Think like a cinematographer, not a keyword stacker

The shift that changes everything: stop thinking about what the video should look like, and start thinking about what a camera operator would physically do.

If a real camera crew could execute your prompt without asking a single follow-up question — the framing, the move, the action, the lighting — your prompt is ready. If they'd have to guess anything, rewrite it. LTX-2 is closer to a virtual camera crew than a keyword blender.

The mental model in one line: "Does a real cinematographer know exactly what to shoot from this description?" If yes, you're done. If no, add one more specific detail.

The 6-block LTX-2 prompt structure

Every strong LTX-2 prompt covers six blocks. You don't need to label them — but you do need to cover all six. Miss one and the model starts inventing things you didn't ask for.

🎬 01 · Scene Start
🎭 02 · Action & Dialogue
📷 03 · Camera Move
💡 04 · Lighting & Mood
🔊 05 · Audio
🎨 06 · Aesthetic Style
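
If you draft prompts in a script, the six blocks can be sketched as a small data structure that refuses to build a prompt while a block is empty. This is only an illustration: `ShotPlan` and `build_prompt` are hypothetical names, not part of LTX-2 or ComfyUI, and the sample text is adapted from the examples in this guide.

```python
from dataclasses import dataclass, fields


@dataclass
class ShotPlan:
    """Hypothetical container: one field per block of the 6-block structure."""
    scene_start: str
    action: str
    camera: str
    lighting: str
    audio: str
    aesthetic: str


def build_prompt(plan: ShotPlan) -> str:
    """Merge the six blocks, in order, into one unlabelled paragraph."""
    parts = []
    for f in fields(plan):
        text = getattr(plan, f.name).strip()
        if not text:
            # An empty block is where the model starts inventing details.
            raise ValueError(f"missing block: {f.name}")
        # Close the block as a full sentence before joining.
        if text[-1] not in ".!?\"'":
            text += "."
        parts.append(text)
    return " ".join(parts)


plan = ShotPlan(
    scene_start="EXT. TOKYO BACK ALLEY, NIGHT. A wide low-angle shot frames a young woman with a red umbrella over her shoulder",
    action="She takes a slow breath, then steps forward into the rain",
    camera="As she moves, the camera slowly pushes in to a medium close-up",
    lighting="Pink and green neon glows above, reflecting in every puddle",
    audio="Rain hammers the pavement and a single car passes far in the background",
    aesthetic="Film noir aesthetic",
)
prompt = build_prompt(plan)
print(prompt)
```

The output is one flowing paragraph with no block labels, which is the form the examples in this guide use.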

Block 1 — Scene Start: anchor the shot before anything moves

Open with a precise establishing description. This is frame one — what the camera sees the moment the clip begins. Define: shot angle and scale, the location with specific detail, who is in the scene (observable details only — no internal states), what they're doing at frame 1, and one key environmental detail.

✗ Weak

"A woman is in a city at night."

No angle, no location, no action. Model guesses everything.

✓ Strong

"EXT. TOKYO BACK ALLEY — NIGHT. A wide low-angle shot frames a young woman in her mid-20s, red umbrella over her shoulder, rain hammering the pavement. Neon signs glow pink and green above, their reflections shimmering in every puddle."

Block 2 — Action: describe motion as cause and effect

Write the core movement as a natural sequence — like stage direction. Use present tense verbs: walks, turns, reaches, exhales. For dialogue, put speech in "quotation marks" and add acting directions between each line. Stick to 1–2 characters — more than two causes drift and artifacts.

✗ Weak

"She says something emotional about waiting."

✓ Strong

"She takes a slow breath: 'I waited three years for this.' Pauses, glances aside. 'Three years.' Her voice drops to a near-whisper. Her jaw tightens."

Block 3 — Camera movement: name the move, timing, and what it reveals

Don't just say "the camera moves." Tell the model exactly when the move starts, what it does, and what it shows afterward. That three-part structure is what produces consistent results.

| Camera move | Best use |
|---|---|
| Slow dolly in / push in | Building intimacy, focusing on the subject |
| Dolly back / pull out | Revealing scale, showing the environment |
| Pan left / right | Following action, off-screen reveals |
| Handheld tracking | Energy, chase sequences, documentary feel |
| Arc / orbit | Hero moments, 360° character reveal |
| Crane up / down | Scale reveals, establishing context |
| Static / locked-off | Tension, stillness, observation |
| Over-the-shoulder | Dialogue scenes, perspective shots |
| Dutch angle | Tension, unease, disorientation |
| Overhead / bird's eye | God's-eye view, isolation, scale |

Camera description example

"The camera opens in a tight close-up on her hands around the coffee cup. As she begins speaking, it slowly pulls back to a medium shot — and on her final line it settles into a wide shot, revealing the empty restaurant around her."
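
The three-part structure (when the move starts, what it does, what it reveals) can be enforced with a tiny helper. This is a minimal sketch; `camera_sentence` is a hypothetical function for drafting prompts, not something LTX-2 or ComfyUI provides.

```python
def camera_sentence(start_cue: str, move: str, reveal: str) -> str:
    """Compose one camera instruction from its three parts: when, what, reveal."""
    parts = {
        "start_cue": start_cue.strip(),
        "move": move.strip(),
        "reveal": reveal.strip(),
    }
    for name, value in parts.items():
        if not value:
            # A missing part is exactly the vagueness the model has to guess at.
            raise ValueError(f"camera description is missing: {name}")
    cue = parts["start_cue"]
    cue = cue[0].upper() + cue[1:]  # capitalise only the first letter
    return f"{cue}, the camera {parts['move']}, revealing {parts['reveal']}."


sentence = camera_sentence(
    "as she begins speaking",
    "slowly pulls back to a medium shot",
    "the empty restaurant around her",
)
print(sentence)
# As she begins speaking, the camera slowly pulls back to a medium shot, revealing the empty restaurant around her.
```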

Block 4 — Lighting: show mood through light, not labels

Never write "the scene feels sad" or "the mood is tense." The model doesn't process emotional labels — it processes what the camera physically sees. Replace every mood word with a lighting and atmosphere description.

| Lighting condition | Look |
|---|---|
| Golden hour | Warm directional, long shadows, amber tones |
| Neon glow | Specify colour: cyan, magenta, amber, red |
| Overcast diffuse | Soft, no hard shadows, muted palette |
| Rim / backlight | Silhouette effect, outline glow |
| Candlelight | Warm, flickering, intimate atmosphere |
| Fluorescent overhead | Clinical, harsh, uncomfortable |

Block 5 — Audio: LTX-2.3 generates sound — don't skip this

LTX-2.3 generates synchronized audio and video in a single pass. Leave the audio block empty and the model invents something. Sometimes fine, usually not. One sentence of audio description prevents a lot of bad surprises.

Describe ambient sound, music (genre + tempo + instrumentation — never name specific songs), dialogue delivery, and foley. "The ambient sound of the street fades to near-silence. A single car passes far in the background." — that's all it takes for most clips.

Block 6 — Aesthetic: name your style and commit to it

LTX-2 locks onto named styles reliably — but they need to be placed clearly, not buried mid-prompt. Put your aesthetic note first or last. Mixing two styles ("cinematic noir but also bright Pixar") produces visual mud. Pick one and commit.

Film noir
16mm grain
Pixar-style 3D
Cyberpunk
Arthouse slow cinema
Documentary realism
Fashion editorial
Vintage VHS

Copy-paste examples — one for each generation mode

All four examples below are structured using the 6-block system. Each covers a different generation mode. Copy the prompt directly into the Gemma text encoder node in your ComfyUI workflow, generate the clip, then swap in your own scene details.

Text-to-Video (T2V) — Cinematic Drama
Prompt — copy and paste into ComfyUI

EXT. COASTAL CLIFF – GOLDEN HOUR. A wide establishing shot frames a lighthouse standing alone on a rocky Atlantic cliff, battered by strong onshore wind. Tall grass whips violently around the base of the structure. The camera begins far back, slowly pushing in over 6 seconds toward a weathered wooden door at the base of the lighthouse. The door rattles in the wind but does not open. No characters are visible — only the structure, the cliff, the churning grey-green sea below, and an enormous orange-gold sky pressing down from above. The audio is all wind — deep, sustained, and surrounding — with the crash of distant waves and the low metallic creak of the lighthouse lantern housing swaying above. The overall colour grade is warm amber at the top of the frame fading to cold slate at the sea. 16mm film grain, anamorphic lens with subtle horizontal flare from the sun. Cinematic, arthouse slow cinema aesthetic.

Image-to-Video (I2V) — Portrait Coming Alive
Prompt — copy and paste into ComfyUI

She slowly exhales, her shoulders dropping slightly from their raised position as the tension releases. Her eyes shift from the middle distance to a direct look into the camera — not confrontational, but searching. The camera holds completely still in the tight medium close-up already established. As she moves, the wind visible in her hair at the start of the clip dies down to a still. The ambient sound of the street scene behind her fades to a near-silence. A single car passes far in the background, its headlights sweeping briefly across the bokeh. No dialogue. The mood is quiet and unresolved. Cinematic colour grade: cool, slightly desaturated, with soft warm fill on the face only.

Audio-to-Video (A2V) — Music Visualisation
Prompt — copy and paste into ComfyUI

A continuous slow-motion shot through an empty ballroom at night, the camera tracking steadily forward down the centre of the room at waist height. Crystal chandeliers hang overhead, each one slowly rotating as the camera passes beneath. Their light fractures into hundreds of warm white points that drift and scatter as the camera moves through them. At each major beat of the music, one chandelier's light pulses brighter for a single frame before returning to its ambient glow. The floor is polished dark wood, perfectly reflecting the chandelier light. No people. No cuts. The camera never stops moving forward. The visual pacing is unhurried and matches the slow, suspended quality of the music perfectly. Luxury editorial aesthetic — ultra-slow motion, sharp detail, rich warm palette.

First-and-Last-Frame (FLF2V) — Day-to-Night Transformation
Prompt — copy and paste into ComfyUI

The scene slowly transitions from late afternoon to full night over the duration of the clip. The primary subject — a cobblestone town square with a central fountain — remains completely static, as does the camera in its wide locked-off establishing shot. The transformation happens entirely through light: warm golden afternoon light gradually drains from the upper right of frame, fading through a brief blue-hour phase where deep indigo shadows fill the square, before the scene settles into full night lit by warm street lanterns and the cold white glow of a full moon. The fountain water, catching the light differently at each phase, shimmers from gold to silver. The audio begins with evening birds and fading traffic, transitions through total silence at the blue hour, then arrives at quiet night: crickets, the distant sound of a café, and the soft splash of the fountain. No motion other than light change. No people. Atmospheric, slow, painterly aesthetic.

For I2V prompts: never re-describe what's already visible in your input image. Describe what happens next — the motion, the camera follow, sounds that emerge as the scene comes alive. The model reads the scene from your image; it only needs to know how things move.

What to avoid in LTX-2 prompts

These show up in almost every beginner's first ten attempts. Each one produces a specific failure mode — knowing the failure helps you fix it fast.

Emotional labels without physical description
✗ "She looks sad"
✓ "She looks down, shoulders drop, exhales slowly"
Why: The model processes what the camera sees, not internal states.

More than 2 characters in simultaneous action
✗ "Three friends argue around a table"
✓ One or two characters. Describe crowd as atmosphere, not actors.
Why: Three or more characters cause merged faces and inconsistent motion.

Readable text, signs, logos
✗ "A Times Square billboard reads SALE"
✓ Avoid any content requiring specific legible text.
Why: LTX-2 cannot reliably generate legible text.

Short prompt for a long video
✗ "A forest scene. Birds fly. Trees sway."
✓ Match word count to clip length. 10 seconds = 8–12 sentences.
Why: The model fills empty duration by inventing random motion.

Contradictory light sources
✗ "Warm golden sunset with cold fluorescent glow"
✓ Pick one dominant source. Add secondary only with clear reason.
Why: Conflicting lighting logic produces flickering and visual mud.

Tag-stacking from Stable Diffusion habits
✗ "cinematic, moody, 4K, masterpiece, best quality, bokeh"
✓ Write in sentences. Describe the scene cinematographically.
Why: LTX-2 needs sequential narrative, not a list of adjectives.
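
Some of these mistakes are mechanical enough to catch before you hit generate. The sketch below is a rough heuristic, not a rule the model itself applies: the thresholds, the word lists, and the `lint_prompt` name are all assumptions chosen for illustration, seeded from the examples above.

```python
import re

# Quality tags taken from the tag-stacking example above; extend as needed.
QUALITY_TAGS = {"4k", "masterpiece", "best quality"}
# Phrases that signal an emotional label rather than a physical description.
MOOD_LABEL = re.compile(
    r"\b(looks|feels|seems)\s+(sad|tense|nervous|happy|angry)\b", re.IGNORECASE
)


def lint_prompt(prompt: str) -> list[str]:
    """Return warning labels for a draft prompt (heuristic, assumed thresholds)."""
    warnings = []
    fragments = [f.strip() for f in prompt.split(",") if f.strip()]
    short = [f for f in fragments if len(f.split()) <= 3]
    # Mostly short comma-separated fragments reads as a tag list, not a shot.
    if len(fragments) >= 4 and len(short) / len(fragments) >= 0.75:
        warnings.append("tag-stacking")
    lowered = prompt.lower()
    if any(tag in lowered for tag in QUALITY_TAGS):
        warnings.append("quality-tags")
    if MOOD_LABEL.search(prompt):
        warnings.append("emotional-label")
    return warnings


print(lint_prompt("cinematic, moody, 4K, masterpiece, best quality, bokeh"))
print(lint_prompt("She looks sad"))
print(lint_prompt("She looks down, shoulders drop, exhales slowly"))
```

A clean result only means none of these three patterns matched. It says nothing about whether the prompt covers all six blocks.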

Standard negative prompt — add this to every generation:
morphing, distortion, warping, flicker, jitter, stutter, shaky camera, erratic motion, temporal artifacts, frame blending, low quality, watermark, text, logo

How to fix a prompt that isn't working

Bad clip. Don't start over — debug it. LTX-2 is built for fast iteration. Work through these five steps before changing anything else.

  1. Simplify before you add anything: Remove one element and regenerate. Find what's causing the problem before stacking more instructions on top of it.
  2. Make the camera move explicit: Vague camera instructions are the most common source of jitter. Name the exact move, when it starts, and what it shows afterward.
  3. Check prompt length against video length: If the output feels rushed or fills time with random motion, your prompt is too short. Add detail to each block.
  4. Move the aesthetic note to the first line: If the visual style is drifting, anchor it at the start of the prompt — before the scene description.
  5. Split complex scenes into two clips: A scene change, two dialogue lines, and a camera reveal in 10 seconds is a lot to ask. Generate two 5-second clips and cut them together in post.
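
Step 3 is easy to automate. The sketch below uses the word-count targets from the length table in the skill file later in this guide. The function names are hypothetical, and because the table's duration bands share their endpoints, assigning a boundary value such as 5 seconds to the shorter band is an assumption.

```python
# (max_seconds, min_words, max_words), from the skill file's length table.
LENGTH_TARGETS = [(5, 50, 80), (8, 80, 120), (12, 120, 170), (20, 170, 230)]


def target_words(seconds: float) -> tuple[int, int]:
    """Return the (min, max) word target for a clip duration."""
    if not 3 <= seconds <= 20:
        raise ValueError("the table only covers clips of 3 to 20 seconds")
    for max_seconds, low, high in LENGTH_TARGETS:
        if seconds <= max_seconds:
            return low, high
    raise AssertionError("unreachable: the range check above covers all bands")


def check_length(prompt: str, seconds: float) -> str:
    """Classify a prompt as 'too short', 'too long', or 'ok' for its clip."""
    low, high = target_words(seconds)
    count = len(prompt.split())  # whitespace-separated word count
    if count < low:
        return "too short"
    if count > high:
        return "too long"
    return "ok"


print(target_words(10))
print(check_length("A forest scene. Birds fly. Trees sway.", 10))
```

Word count is a blunt measure. Treat "ok" as a sanity check, then confirm each of the six blocks is actually covered.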

Use AI to write your LTX-2 prompts — free skill file

Writing a full 6-block prompt from scratch takes practice. A faster approach: paste the skill file below into Claude before you describe your scene, and it builds the structured prompt for you.

The skill file teaches Claude everything in this guide — the block structure, mode-specific rules for T2V, I2V, A2V, and FLF2V, camera vocabulary, lighting language, and what to avoid. You describe your idea in plain language; it outputs a production-ready prompt.

How to use this skill file:

  1. Open a new conversation at claude.ai (free tier works)
  2. Copy the full skill file below and paste it as your first message
  3. Follow up with: "Write an LTX-2 prompt for: [describe your scene in plain language]"
  4. Copy the output into your ComfyUI Gemma text encoder node

📋 LTX-2 Prompt Skill File

Paste this into Claude before describing your scene

---
name: ltx2-prompting
description: Write high-quality prompts for LTX-2 / LTX-2.3 AI video generation. Use this skill when the user wants to generate a video with LTX Studio, LTX API, RunDiffusion, or the LTX-2 / LTX-2.3 model, or when they want to convert an idea, scene, or script into an LTX-2 compatible prompt. Covers all four generation modes: Text-to-Video (T2V), Image-to-Video (I2V), Audio-to-Video (A2V), and First-and-Last-Frame (FLF2V). Triggers include: "write a prompt for LTX", "make a video prompt", "LTX Studio prompt", "I want to generate a video", or any request to turn a creative idea into a structured LTX-2 video generation prompt.
---

This skill guides the writing of effective, production-quality prompts for the LTX-2 and LTX-2.3 video generation models by Lightricks. LTX-2.3 generates synchronized audio and video in a single pass, at native 4K resolution and up to 50 FPS. Strong prompts are the primary lever for output quality.

The user provides a video idea, scene concept, character description, script, or reference images. Transform it into a well-structured LTX-2 prompt using the correct mode and principles below.

---

## Step 0 — Identify the Generation Mode

Before writing a single word of the prompt, determine which mode applies:

| Mode | When to use | What the model needs from the prompt |
|---|---|---|
| **Text-to-Video (T2V)** | No input image; idea only | Everything — subject, action, environment, lighting, camera, audio |
| **Image-to-Video (I2V)** | User has a starting image | Motion only — what happens next; skip describing what's already visible |
| **Audio-to-Video (A2V)** | User has an audio file (music, VO, SFX) | Visual interpretation — what scenes and camera work should accompany the sound |
| **First-and-Last-Frame (FLF2V)** | User has both a start image and an end image | Transition — the journey between the two states (motion path, pacing, camera) |

Ask the user which mode they intend if it's unclear.

---

## Output Format — CRITICAL

**The final prompt output is ALWAYS one single flowing paragraph. No labels. No headers. No line breaks between sections. No [SCENE START], [ACTION], [CAMERA] tags in the output.**

This is the correct output format:
> EXT. COASTAL CLIFF – GOLDEN HOUR. A wide establishing shot frames a lighthouse standing alone on a rocky Atlantic cliff, battered by strong onshore wind. Tall grass whips violently around the base of the structure. The camera begins far back, slowly pushing in over 6 seconds toward a weathered wooden door at the base of the lighthouse. The door rattles in the wind but does not open. No characters are visible — only the structure, the cliff, the churning grey-green sea below, and an enormous orange-gold sky pressing down from above. The audio is all wind — deep, sustained, and surrounding — with the crash of distant waves and the low metallic creak of the lighthouse lantern housing swaying above. The overall colour grade is warm amber at the top of the frame fading to cold slate at the sea. 16mm film grain, anamorphic lens with subtle horizontal flare from the sun. Cinematic, arthouse slow cinema aesthetic.

This is WRONG — never output labelled blocks:
> [SCENE START] EXT. COASTAL CLIFF – GOLDEN HOUR...
> [CAMERA MOVEMENT] The camera begins far back...
> [AUDIO] The audio is all wind...

The labelled blocks below (SCENE START, ACTION, CAMERA, etc.) are a **planning checklist only** — use them internally to make sure you have covered every element, then merge everything into one seamless paragraph before giving the final answer.

**Prompt length must match video length — strictly.** Use these word count targets:

| Video duration | Target prompt length | Rough sentence count |
|---|---|---|
| 3–5 seconds | 50–80 words | 3–5 sentences |
| 5–8 seconds | 80–120 words | 5–8 sentences |
| 8–12 seconds | 120–170 words | 8–11 sentences |
| 12–20 seconds | 170–230 words | 11–15 sentences |

If the user does not specify a duration, default to **5–8 seconds (80–120 words)**. Do not exceed that range unless asked for a longer clip. More words does not mean better output — cover only what matters: subject, action, camera, light, audio, style. Cut anything redundant or decorative.

## Planning Checklist (Internal Use Only — Never Output These Labels)

Cover all six elements when building the prompt, then merge into one paragraph:

```
SCENE START   → shot angle, INT/EXT, location, character(s), initial state
ACTION        → motion sequence beat by beat, dialogue in quotes
CAMERA        → when it moves, how, what it reveals
LIGHTING      → light source, quality, colour, atmosphere
AUDIO         → ambient sound, music, dialogue delivery, foley
AESTHETIC     → named style/genre, film look, colour grade
```

---

## Section-by-Section Guide

## Planning Element 1 — Scene Establishment
Open with a precise establishing description. This anchors everything else.

Define:
- **Shot angle & scale**: medium-low angle, bird's eye, extreme close-up, wide establishing shot, Dutch angle, over-the-shoulder, etc.
- **Interior/Exterior + Location**: INT. / EXT. notation is optional but helpful. Describe the environment precisely — not "a city" but "a rain-soaked Tokyo back alley at midnight, neon signs reflecting in puddles."
- **Who is in the scene**: 1–2 characters max. Include age range, clothing, hair, distinguishing features — only observable details, not internal states.
- **Initial pose or state**: what the character or object is doing at frame 1.
- **Key environmental detail visible at frame 1**: a prop, texture, atmospheric element.

> ✅ "EXT. TOKYO BACK ALLEY – NIGHT. A wide low-angle shot frames a young woman in her mid-20s standing at the entrance of a narrow alley, a red umbrella over her shoulder, rain hammering the pavement around her. Neon signs in Japanese glow pink and green above her, their reflections shimmering in every puddle."
> ❌ "A woman is in a city at night."

---

## Planning Element 2 — Action & Dialogue
Describe the core motion as a **natural, cause-and-effect sequence from beginning to end**. Write it like stage direction.

- Use **present tense verbs**: walks, turns, reaches, says, exhales, pauses, lifts, drops
- For dialogue: place speech in **"quotation marks"** and mention accent/tone/language if relevant
- Break complex dialogue into short phrases with acting directions between each line. This is critical for LTX-2.3:

> ✅ "She takes a slow breath, then begins: 'I waited three years for this.' She pauses, looks to the side. 'Three years.' Her voice drops to a near-whisper. 'And now you are telling me it does not matter?' Her jaw tightens."
> ❌ "She says a long dramatic thing about waiting and being disappointed."

- Limit to **1–2 characters**. More than 2 causes drift and artifacts.
- Use connectors to ensure smooth temporal flow: "as," "then," "while," "just before"
- Close the action arc — give the scene a clear ending beat.

---

## Planning Element 3 — Camera Movement
Be specific and sequential. Tell the model exactly how the camera behaves at each moment, and what it reveals after each move.

**Camera language vocabulary:**

| Movement | Use for |
|---|---|
| Static / locked-off shot | Tension, stillness, observation |
| Slow dolly in / push in | Building intimacy, focus on subject |
| Dolly back / pull back | Revealing scale or environment |
| Pan left / pan right | Following action or revealing what is off-screen |
| Tilt up / tilt down | Scale reveal, dramatic emphasis |
| Handheld tracking shot | Energy, chase, documentary feel |
| Arc / orbit around subject | Hero moment, 360 degree reveal |
| Crane up / crane down | Scale, establishing context |
| Over-the-shoulder | Dialogue, perspective |
| Dutch angle | Tension, unease, disorientation |
| Overhead / bird's eye | God's-eye view, isolation |
| Macro lens / extreme close-up | Texture, detail, emotional intimacy |

**Always describe:** when the move starts, what it does, and what it reveals at the end.

> ✅ "The camera opens in a tight close-up on her hands around the coffee cup. As she begins speaking, it slowly pulls back to a medium shot, and then, on her final line, it settles into a wide shot revealing the empty restaurant around her."

---

## Planning Element 4 — Lighting & Mood
Describe what the camera physically sees. Never use abstract mood labels like "sad" or "tense" — use lighting, texture, and physical detail to convey mood instead.

**Lighting conditions:**
- Golden hour / magic hour (warm, directional, long shadows)
- Overcast diffused light (soft, no shadows, muted palette)
- Dramatic single spotlight (high contrast, deep shadow)
- Neon glow: specify colour — cyan, magenta, amber, red
- Backlighting / rim lighting (silhouette effect, outline glow)
- Candlelight / practical lamp (warm, flickering)
- Hard overhead fluorescent (clinical, uncomfortable)
- Deep underwater blue-green
- Firelight (orange-red, dynamic, flickering)

**Atmospheric elements:**
- Fog, mist, haze, steam rising
- Dust motes suspended in light beams
- Rain, wet surfaces, puddle reflections
- Smoke, heat shimmer, ash falling
- Shallow depth of field / bokeh in background
- Lens flare, anamorphic flare
- Film grain overlay

**Colour palette:**
- Warm amber vs. cold desaturated blue
- High contrast black and white
- Monochromatic (single colour family)
- Neon-saturated complementary pop
- Muted, faded, vintage washed tones

> ✅ "Golden hour light rakes in from the right, casting long warm shadows across the cracked asphalt and leaving the left side of her face in cool shadow. Steam rises from a nearby grate, diffusing the light into a soft haze. The scene is warm in tone, with amber dominating and occasional flashes of blue from a distant police light."

---

## Planning Element 5 — Audio
LTX-2 generates synchronized audio and video in a single pass. Audio prompting is **not optional** — it directly shapes the output. Be as specific about what you hear as what you see.

**Ambient / environmental sound:**
- Rain on pavement, thunder rumbling in the distance
- Wind through trees, leaves rustling
- Coffeeshop murmur, low room tone, clinking cups
- Forest ambience — birds, insects, stream
- City traffic, distant sirens, crowd hum
- Mechanical hum, industrial drone, server room buzz

**Music:**
- Describe genre, tempo, instrumentation — "a slow, mournful cello melody," "an upbeat lo-fi hip-hop beat with a warm vinyl crackle," "tense orchestral strings building to a crescendo"
- Do NOT name specific songs or artists

**Dialogue:**
- Always in quotation marks
- Specify: language, accent, tone, volume, pacing

**Foley / sound design:**
- "Heavy footsteps on wet gravel," "the click of a gun being cocked," "pages turning rapidly," "a distant glass shattering"

**Audio-to-Video specific rules (A2V mode):**
- Describe what scenes and subjects should appear in response to the audio
- Use timing cues: "at 2 seconds, the logo reveals," "the camera moves on the snare hit"
- Motion regularity: "constant speed pan" syncs better with metronomic beats than variable motion
- Add guardrails: "single continuous shot, no cuts" if you do not want the model to invent transitions

---

## Planning Element 6 — Stylised Aesthetic
Name the visual style or genre early and commit to it. LTX-2 responds well to named aesthetics placed clearly at the start or end of the prompt.

**Animation styles:** High-detail 3D animation (Pixar-style), 2D hand-drawn animation, stop-motion / claymation texture, 8-bit pixel art, anime / manga stylization

**Cinematic genres:** Film noir, epic space opera, period drama, psychological thriller, modern romance, documentary / handheld realism, arthouse / experimental, horror

**Visual aesthetics:** Painterly / impressionist texture, cyberpunk, vintage analog film, fashion editorial, surrealist / dreamlike, comic book / graphic novel, luxury / high-end commercial

---

## Mode-Specific Prompt Rules

### TEXT-TO-VIDEO (T2V)
The model generates everything from scratch. You are the only source of visual information.

- Define **everything**: subject, appearance, action, environment, lighting, camera, audio, style
- Start with the strongest visual concept — the most important element goes first
- Use roughly 6–10 sentences for a 5–10 second clip, staying within the word-count targets above
- Avoid vague opening lines like "A beautiful scene of..." — be specific from word one

**Example T2V prompt:**
> EXT. FOGGY LAKE – PRE-DAWN. A lone fisherman rows across a perfectly still, fog-covered lake in the moments before sunrise. The camera glides overhead in a slow crane shot, tracking his slow progress across the dark water. His wooden rowboat creaks softly with each pull of the oars, and a small lantern at the bow casts a warm amber circle that reflects in gentle ripples. As the camera descends to a low water-level angle, reeds along the distant shoreline come into focus, swaying faintly in a cold breeze. A single bird calls somewhere in the mist. The sound design is minimal: oar dips, water laps, the faint creak of wood, and a distant wind. The colour palette is deep navy and slate blue, with the lantern's amber warmth as the only contrast. Documentary realism aesthetic, 16mm film grain, anamorphic lens.

---

### IMAGE-TO-VIDEO (I2V)
Your input image defines the visual starting point. The model knows what the scene looks like — your prompt should describe **what happens next**.

**Key rule:** Do NOT re-describe what is already visible in the image. Describe the transition from stillness to motion.

> ✅ "She slowly raises her right hand toward the light source. The camera gently pushes in as her fingers extend. The ambient sound of rain on glass grows louder, and a soft orchestral swell begins to build."
> ❌ "A woman with dark hair in a white dress stands by a window looking at the rain." (This describes the image — the model already knows this.)

---

### AUDIO-TO-VIDEO (A2V)
Your audio file anchors the timing and rhythm. The model uses your audio as a structural backbone and generates visuals around it.

- Use timing cues to anchor visuals to audio beats
- Describe motion that matches audio energy: fast-paced music → handheld tracking; ambient drone → slow dolly
- Use "single continuous shot" or "no cuts" to prevent unwanted edits

**Example A2V prompt (for a lo-fi hip-hop track):**
> A warm, amber-lit bedroom at 2am. A young woman sits at a wooden desk, studying by the glow of a desk lamp. The camera begins in a medium shot framing her from the side, slowly pushing in over 6 seconds to a close-up on her hand writing in a notebook. As the beat's vinyl crackle plays, dust motes float in the lamp light. She pauses, looks up out the window at rain hitting the glass. Outside the window, a blurred city glows softly. The visual tempo is slow and unhurried, matching the relaxed rhythm of the track. Warm colour grade — amber, honey, and low contrast. Shallow depth of field. Cinematic but intimate.

---

### FIRST-AND-LAST-FRAME (FLF2V)
You provide a **starting image** and an **ending image**. The model builds the motion between them. Your prompt guides the journey — the path, the pacing, and the camera.

Your prompt should cover:
- **Motion path**: how does the subject move from the starting state to the ending state?
- **Camera behaviour**: does it stay static, or follow the transformation?
- **Pacing**: slow and gradual, or sudden mid-point shift?
- **What happens in the middle**: any intermediate actions, reveals, or changes?
- **Audio** that accompanies the transformation

**Example FLF2V prompt:**
> Starting with a woman's face in profile, eyes closed and expression neutral, the scene gently animates as she slowly turns to face the camera over 4 seconds. Her eyes open gradually as warm morning light brightens across her face from left to right, as if a curtain is being drawn. The camera holds completely still in a tight medium close-up throughout. Her expression shifts from rest to a quiet, content smile by the final frame. The audio is a soft piano melody, with birdsong beginning faintly at the midpoint. No cuts — single continuous shot.

---

## Universal Prompt Quality Rules

### DO:
- Write in **present tense** throughout the entire prompt
- Use **specific, observable physical details** — not emotional labels
- Match **prompt length to video length** — longer video = longer prompt
- Place the **style/genre early** to set the visual baseline
- Use **connectors** for temporal flow: "as," "then," "while," "just before," "the moment"
- Describe camera movement with **timing and reveal**: when it starts, what it shows afterward
- **Break dialogue** into short phrases with acting beats between lines

### DON'T:
- Use emotional labels: "she feels sad," "he seems nervous" — describe posture, gesture, and expression instead
- Include readable text, signs, logos, or brand names — LTX-2 cannot reliably generate legible text
- Stack more than 2 characters in simultaneous action
- Mix contradictory light sources without a clear narrative reason
- Over-complicate: more than 8 simultaneous instructions degrade output
- Use passive voice: "camera is moved" — write "camera pushes in"
- Write a 10-word prompt for a 10-second video
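
Several of the rules above can be checked mechanically before you generate. The following is a rough sketch of such a check, not a tool that ships with LTX-2: `lint_prompt`, the phrase list, and the regular expression are all assumptions made for illustration, and the 6-sentence floor only encodes the guideline that a 5–10 second clip needs at least 6 detailed sentences.

```python
import re

# Illustrative phrase list; extend it for your own prompts.
EMOTION_LABELS = ("feels", "seems")
# Matches passive camera phrasing such as "camera is moved".
PASSIVE_CAMERA = re.compile(r"\bcamera (is|was|gets) \w+ed\b", re.IGNORECASE)
MIN_SENTENCES = 6  # guideline for clips of 5 seconds or longer


def lint_prompt(prompt: str, seconds: float) -> list[str]:
    """Return warnings for a prompt, based on the DO / DON'T rules above."""
    warnings = []
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", prompt.strip()) if s]

    # Tag-style prompts are mostly commas and contain at most one sentence.
    if len(sentences) <= 1 and prompt.count(",") >= 3:
        warnings.append("looks like a tag list; rewrite as a shot description")

    if seconds >= 5 and len(sentences) < MIN_SENTENCES:
        warnings.append(f"only {len(sentences)} sentence(s) for a {seconds:g}s clip")

    lowered = prompt.lower()
    for label in EMOTION_LABELS:
        if label in lowered:
            warnings.append(f"emotional label '{label}': describe posture or gesture")

    if PASSIVE_CAMERA.search(prompt):
        warnings.append("passive camera phrasing; write 'camera pushes in' instead")

    return warnings


for warning in lint_prompt("cinematic woman, city at night, rain, moody, 4K", 10):
    print(warning)
```

The checks are deliberately crude string heuristics. They catch the obvious cases (tag lists, very short prompts) and will miss subtler ones, so treat an empty result as "nothing obvious", not as approval.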

---

## Quick-Reference: What LTX-2 Handles Well vs. Poorly

| Works Well | Avoid |
|---|---|
| Single-subject emotional moments | Crowds or 3+ simultaneous characters |
| Clear sequential camera language | Vague or contradictory camera instructions |
| Named visual styles (noir, Pixar, cyberpunk) | Abstract mood labels without visual grounding |
| Atmospheric effects (fog, rain, bokeh, dust) | Readable text, signs, logos, brand names |
| Dialogue broken into short phrases with acting beats | Long single-sentence speeches with no direction |
| Detailed audio descriptions woven into the prompt | No audio description at all |
| Sequential action with clear cause and effect | Simultaneous complex multi-character actions |
| Facial nuance and gestural performance | Internal emotional states |
| Dance and rhythmic / repetitive motion | Complex non-linear physics |
| Consistent lighting logic | Contradictory light sources in one scene |

---

## Iteration Strategy

LTX-2 is built for fast experimentation. When a generation misses:

1. **Simplify first** — remove one element and regenerate before adding anything
2. **Clarify camera** — be more specific about framing, movement timing, and what gets revealed
3. **Anchor style earlier** — move the aesthetic note to the very first sentence
4. **Match prompt length to video length** — if the output feels rushed, your prompt is too short
5. **Break complex dialogue** — split long speeches into short phrases with physical beats between each line
6. **Split into two shots** — if the scene is complex, generate two sequential 5-second clips and cut together
7. **Add specificity** — vague actions produce vague results; name the body part, direction, speed

---

## Helpful Vocabulary Reference

**Camera angles:** wide establishing, medium shot, medium close-up, close-up, extreme close-up, low angle, high angle, Dutch angle, over-the-shoulder, bird's eye / overhead, worm's eye

**Camera moves:** static/locked, slow dolly in, dolly back/pull out, pan left/right, tilt up/down, handheld track, arc/orbit, crane up/down, whip pan, push in, pull back, glide/float

**Lighting:** golden hour, magic hour, blue hour, overcast diffuse, rim light, backlight, hard key light, soft fill, practical lamp, neon glow, candlelight, moonlight, fluorescent overhead

**Atmosphere:** fog, mist, haze, steam, smoke, dust motes, rain, snow, embers, ash, bokeh, depth of field, lens flare, anamorphic streak

**Pacing:** slow motion, real time, time-lapse, lingering, continuous, freeze-frame, fade in, fade out, seamless transition, sudden stop

**Audio:** ambient, foley, score, VO, diegetic, non-diegetic, room tone, reverb, echo, stereo width, intimate mic sound, wide acoustic, distorted radio-style, crisp and dry

**Film look:** 16mm grain, 35mm film, anamorphic, vintage VHS, Super 8, clean digital, stylized digital, RAW log look, high contrast grade, bleach bypass

How to use: Open claude.ai, paste this file as your first message, then follow up with: "Write an LTX-2 prompt for: [describe your scene in plain language]". Copy the output into your ComfyUI text encoder node.

Frequently asked questions

How long should an LTX-2 prompt be?

Match the length to the video duration. A 5–10 second clip needs a minimum of 6–10 detailed sentences. A short prompt for a long video leaves the model without enough direction, so it fills the time by inventing motion that wasn't asked for. When in doubt, write more.

Can I reuse my Stable Diffusion prompts with LTX-2?

No. Stable Diffusion rewards tag stacking: short comma-separated keyword lists. LTX-2 actively resists that approach. Rewrite your prompts as flowing, present-tense shot descriptions that cover scene, motion, camera, and audio. The output difference is significant.

Can I include dialogue in an LTX-2 prompt?

Yes. Place speech in quotation marks and add short acting directions between lines; don't stack all the dialogue in one block. Specify tone, accent, volume, and language where relevant. LTX-2.3 handles multi-line dialogue well when it's broken up this way.
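
As a minimal sketch of that pattern, the hypothetical helper below alternates quoted speech with acting beats. The function name, its parameters, and the sample lines are invented for illustration; the exact phrasing LTX-2 responds to best is something to test yourself.

```python
def dialogue_block(speaker: str, tone: str, lines: list[str], beats: list[str]) -> str:
    """Alternate quoted speech with short acting directions.

    beats[i] is placed after lines[i]; lines without a matching beat get none.
    """
    parts = []
    for i, line in enumerate(lines):
        parts.append(f'{speaker} says {tone}, "{line}"')
        if i < len(beats):
            # Make sure each beat reads as its own sentence.
            parts.append(beats[i].strip().rstrip(".") + ".")
    return " ".join(parts)


text = dialogue_block(
    speaker="The woman",
    tone="in a quiet, steady voice",
    lines=["I waited all night.", "You never came."],
    beats=["She looks down at her hands and exhales"],
)
print(text)
```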

Why does my video flicker or morph?

Three common causes: CFG set too high (keep it at 3.0–3.5 for the dev model, 1.0 for distilled), no negative prompt, or a vague motion description. Add a standard negative prompt to every generation: "morphing, distortion, warping, flicker, jitter, stutter, shaky camera, temporal artifacts."
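
The first two causes are simple to check in code before a run. This sketch assumes you track your own settings in Python: `flicker_causes` and the `dev` / `distilled` keys are labels made up for this example, not ComfyUI node names. The third cause, a vague motion description, still needs a human read.

```python
# Upper CFG bounds quoted in the answer above.
CFG_MAX = {"dev": 3.5, "distilled": 1.0}


def flicker_causes(variant: str, cfg: float, negative_prompt: str) -> list[str]:
    """Return the mechanically checkable causes of flicker for one run's settings."""
    causes = []
    if cfg > CFG_MAX[variant]:
        causes.append(f"CFG {cfg} is above {CFG_MAX[variant]} for the {variant} model")
    if not negative_prompt.strip():
        causes.append("no negative prompt")
    return causes


print(flicker_causes("dev", 7.0, ""))
```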

What is the difference between LTX-2 and LTX-2.3?

LTX-2.3 is the current release. It adds synchronized audio generation in a single pass, improved portrait quality, stronger prompt understanding, better text rendering, and higher resolution up to 2560×1440. Both models use the same 6-block prompting approach covered in this guide.

Do I need a negative prompt?

Yes, more than for image models. LTX-2 invents motion if nothing constrains it. Always include at minimum: flickering, jitter, distortion, warping, blurry, low quality, watermark. Add more specific terms if you keep seeing a particular artifact.

What to do next

Copy Example 1, paste it into your Gemma encoder node, and run it.

That single iteration loop — generate, identify what's off, fix one block — teaches you more than reading ten more guides. The workflow is the next step.
