---
name: ltx2-prompting
description: Write high-quality prompts for LTX-2 / LTX-2.3 AI video generation. Use this skill when the user wants to generate a video with LTX Studio, LTX API, RunDiffusion, or the LTX-2 / LTX-2.3 model, or when they want to convert an idea, scene, or script into an LTX-2 compatible prompt. Covers all four generation modes: Text-to-Video (T2V), Image-to-Video (I2V), Audio-to-Video (A2V), and First-and-Last-Frame (FLF2V). Triggers include: "write a prompt for LTX", "make a video prompt", "LTX Studio prompt", "I want to generate a video", or any request to turn a creative idea into a structured LTX-2 video generation prompt.
---
This skill guides the writing of effective, production-quality prompts for the LTX-2 and LTX-2.3 video generation models by Lightricks. LTX-2.3 generates synchronized audio and video in a single pass, at native 4K resolution and up to 50 FPS. Strong prompts are the primary lever for output quality.
The user provides a video idea, scene concept, character description, script, or reference images. Transform it into a well-structured LTX-2 prompt using the correct mode and principles below.
---
## Step 0 — Identify the Generation Mode
Before writing a single word of the prompt, determine which mode applies:
| Mode | When to use | What the model needs from the prompt |
|---|---|---|
| **Text-to-Video (T2V)** | No input image; idea only | Everything — subject, action, environment, lighting, camera, audio |
| **Image-to-Video (I2V)** | User has a starting image | Motion only — what happens next; skip describing what's already visible |
| **Audio-to-Video (A2V)** | User has an audio file (music, VO, SFX) | Visual interpretation — what scenes and camera work should accompany the sound |
| **First-and-Last-Frame (FLF2V)** | User has both a start image and an end image | Transition — the journey between the two states (motion path, pacing, camera) |
Ask the user which mode they intend if it's unclear.
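The table can be read as a small decision rule over the inputs the user supplies. The sketch below is illustrative only: `Mode` and `pick_mode` are hypothetical names, and returning `None` for combinations the table does not cover (audio plus an image, or an end image alone) is an assumption standing in for "ask the user".

```python
from enum import Enum
from typing import Optional


class Mode(Enum):
    T2V = "Text-to-Video"
    I2V = "Image-to-Video"
    A2V = "Audio-to-Video"
    FLF2V = "First-and-Last-Frame"


def pick_mode(has_start_image: bool = False,
              has_end_image: bool = False,
              has_audio: bool = False) -> Optional[Mode]:
    """Return the mode implied by the user's inputs, or None if unclear."""
    # Assumption: the mode table has no row for audio combined with images.
    if has_audio and (has_start_image or has_end_image):
        return None
    if has_audio:
        return Mode.A2V
    if has_start_image and has_end_image:
        return Mode.FLF2V
    if has_start_image:
        return Mode.I2V
    # Assumption: an end image with no start image matches no row.
    if has_end_image:
        return None
    return Mode.T2V
```

A `None` result means the mode is unclear, so ask the user before writing anything.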
---
## Output Format — CRITICAL
**The final prompt output is ALWAYS one single flowing paragraph. No labels. No headers. No line breaks between sections. No [SCENE START], [ACTION], [CAMERA] tags in the output.**
This is the correct output format:
> EXT. COASTAL CLIFF – GOLDEN HOUR. A wide establishing shot frames a lighthouse standing alone on a rocky Atlantic cliff, battered by strong onshore wind. Tall grass whips violently around the base of the structure. The camera begins far back, slowly pushing in over 6 seconds toward a weathered wooden door at the base of the lighthouse. The door rattles in the wind but does not open. No characters are visible — only the structure, the cliff, the churning grey-green sea below, and an enormous orange-gold sky pressing down from above. The audio is all wind — deep, sustained, and surrounding — with the crash of distant waves and the low metallic creak of the lighthouse lantern housing swaying above. The overall colour grade is warm amber at the top of the frame fading to cold slate at the sea. 16mm film grain, anamorphic lens with subtle horizontal flare from the sun. Cinematic, arthouse slow cinema aesthetic.
This is WRONG — never output labelled blocks:
> [SCENE START] EXT. COASTAL CLIFF – GOLDEN HOUR...
> [CAMERA MOVEMENT] The camera begins far back...
> [AUDIO] The audio is all wind...
The labelled blocks below (SCENE START, ACTION, CAMERA, etc.) are a **planning checklist only** — use them internally to make sure you have covered every element, then merge everything into one seamless paragraph before giving the final answer.
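As a quick self-check before giving the final answer, the format rule can be approximated mechanically. This is a minimal sketch, not part of any LTX tooling: `format_problems` is a hypothetical helper, and the regular expression only catches bracketed all-caps tags of the kind shown in the WRONG example.

```python
import re
from typing import List

# Bracketed all-caps tags such as [SCENE START] or [CAMERA MOVEMENT].
LABEL_PATTERN = re.compile(r"\[[A-Z][A-Z /&-]*\]")


def format_problems(prompt: str) -> List[str]:
    """List the ways a draft violates the single-paragraph rule."""
    problems = []
    # Leading and trailing whitespace is ignored; inner line breaks are not.
    if "\n" in prompt.strip():
        problems.append("contains line breaks")
    if LABEL_PATTERN.search(prompt):
        problems.append("contains bracketed labels")
    return problems
```

An empty list means the draft passes these two checks; it says nothing about the quality of the prose itself.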
**Prompt length must match video length — strictly.** Use these word count targets:
| Video duration | Target prompt length | Rough sentence count |
|---|---|---|
| 3–5 seconds | 50–80 words | 3–5 sentences |
| 5–8 seconds | 80–120 words | 5–8 sentences |
| 8–12 seconds | 120–170 words | 8–11 sentences |
| 12–20 seconds | 170–230 words | 11–15 sentences |
If the user does not specify a duration, default to **5–8 seconds (80–120 words)**. Do not exceed that range unless asked for a longer clip. More words do not mean better output: cover only what matters (subject, action, camera, light, audio, style) and cut anything redundant or decorative.
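The length table and its default can be expressed as a lookup. The sketch below copies the word ranges from the table; the names `LENGTH_TARGETS`, `target_word_range`, and `length_ok` are hypothetical. Two choices are assumptions, not rules from the table: a duration that sits on a shared boundary (5, 8, or 12 seconds) is assigned to the shorter row, and durations outside 3–20 seconds are clamped to the nearest row.

```python
# (max_seconds, (min_words, max_words)) rows copied from the table above.
LENGTH_TARGETS = [
    (5, (50, 80)),
    (8, (80, 120)),
    (12, (120, 170)),
    (20, (170, 230)),
]
DEFAULT_TARGET = (80, 120)  # the 5-8 second default


def target_word_range(duration_s=None):
    """Return (min_words, max_words) for a clip of the given duration."""
    if duration_s is None:
        return DEFAULT_TARGET
    for max_seconds, word_range in LENGTH_TARGETS:
        if duration_s <= max_seconds:
            return word_range
    # Assumption: clips longer than 20 seconds use the longest row.
    return LENGTH_TARGETS[-1][1]


def length_ok(prompt, duration_s=None):
    """Check a draft's whitespace-separated word count against the target."""
    low, high = target_word_range(duration_s)
    return low <= len(prompt.split()) <= high
```

Counting words by splitting on whitespace is a rough measure, which is enough for a target that is itself a range.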
## Planning Checklist (Internal Use Only — Never Output These Labels)
Cover all six elements when building the prompt, then merge into one paragraph:
```
SCENE START → shot angle, INT/EXT, location, character(s), initial state
ACTION → motion sequence beat by beat, dialogue in quotes
CAMERA → when it moves, how, what it reveals
LIGHTING → light source, quality, colour, atmosphere
AUDIO → ambient sound, music, dialogue delivery, foley
AESTHETIC → named style/genre, film look, colour grade
```
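The "cover all six, then merge" step can be pictured as follows. This is a deliberately naive sketch: `merge_plan` and the dictionary keys are hypothetical, and plain concatenation in checklist order is an assumption. A real prompt usually weaves camera, light, and audio into the action rather than stacking them, so treat the result as a first draft to smooth by hand.

```python
CHECKLIST_ORDER = ["scene_start", "action", "camera", "lighting", "audio", "aesthetic"]


def merge_plan(plan):
    """Join the six planning notes into one unlabelled paragraph."""
    missing = [key for key in CHECKLIST_ORDER if not plan.get(key, "").strip()]
    if missing:
        raise ValueError("missing checklist elements: " + ", ".join(missing))
    sentences = []
    for key in CHECKLIST_ORDER:
        # Collapse internal whitespace, including any line breaks.
        text = " ".join(plan[key].split())
        # Close the note with a full stop unless it already ends a sentence
        # or a quoted line of dialogue.
        if text[-1] not in ".!?\"'":
            text += "."
        sentences.append(text)
    return " ".join(sentences)
```

The labels exist only as dictionary keys; none of them reach the returned paragraph.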
---
## Section-by-Section Guide
## Planning Element 1 — Scene Establishment
Open with a wide establishing description. This anchors everything else.
Define:
- **Shot angle & scale**: medium-low angle, bird's eye, extreme close-up, wide establishing shot, Dutch angle, over-the-shoulder, etc.
- **Interior/Exterior + Location**: INT. / EXT. notation is optional but helpful. Describe the environment precisely — not "a city" but "a rain-soaked Tokyo back alley at midnight, neon signs reflecting in puddles."
- **Who is in the scene**: 1–2 characters max. Include age range, clothing, hair, distinguishing features — only observable details, not internal states.
- **Initial pose or state**: what the character or object is doing at frame 1.
- **Key environmental detail visible at frame 1**: a prop, texture, atmospheric element.
> ✅ "EXT. TOKYO BACK ALLEY – NIGHT. A wide low-angle shot frames a young woman in her mid-20s standing at the entrance of a narrow alley, a red umbrella over her shoulder, rain hammering the pavement around her. Neon signs in Japanese glow pink and green above her, their reflections shimmering in every puddle."
> ❌ "A woman is in a city at night."
---
## Planning Element 2 — Action & Dialogue
Describe the core motion as a **natural, cause-and-effect sequence from beginning to end**. Write it like stage direction.
- Use **present tense verbs**: walks, turns, reaches, says, exhales, pauses, lifts, drops
- For dialogue: place speech in **"quotation marks"** and mention accent/tone/language if relevant
- Break complex dialogue into short phrases with acting directions between each line. This is critical for LTX-2.3:
> ✅ "She takes a slow breath, then begins: 'I waited three years for this.' She pauses, looks to the side. 'Three years.' Her voice drops to a near-whisper. 'And now you are telling me it does not matter?' Her jaw tightens."
> ❌ "She says a long dramatic thing about waiting and being disappointed."
- Limit to **1–2 characters**. More than 2 causes drift and artifacts.
- Use connectors to ensure smooth temporal flow: "as," "then," "while," "just before"
- Close the action arc — give the scene a clear ending beat.
---
## Planning Element 3 — Camera Movement
Be specific and sequential. Tell the model exactly how the camera behaves at each moment, and what it reveals after each move.
**Camera language vocabulary:**
| Movement | Use for |
|---|---|
| Static / locked-off shot | Tension, stillness, observation |
| Slow dolly in / push in | Building intimacy, focus on subject |
| Dolly back / pull back | Revealing scale or environment |
| Pan left / pan right | Following action or revealing what is off-screen |
| Tilt up / tilt down | Scale reveal, dramatic emphasis |
| Handheld tracking shot | Energy, chase, documentary feel |
| Arc / orbit around subject | Hero moment, 360-degree reveal |
| Crane up / crane down | Scale, establishing context |
| Over-the-shoulder | Dialogue, perspective |
| Dutch angle | Tension, unease, disorientation |
| Overhead / bird's eye | God's-eye view, isolation |
| Macro lens / extreme close-up | Texture, detail, emotional intimacy |
**Always describe:** when the move starts, what it does, and what it reveals at the end.
> ✅ "The camera opens in a tight close-up on her hands around the coffee cup. As she begins speaking, it slowly pulls back to a medium shot, and then — on her final line — it cuts to a wide shot revealing the empty restaurant around her."
---
## Planning Element 4 — Lighting & Mood
Describe what the camera physically sees. Never use abstract mood labels like "sad" or "tense" — use lighting, texture, and physical detail to convey mood instead.
**Lighting conditions:**
- Golden hour / magic hour (warm, directional, long shadows)
- Overcast diffused light (soft, no shadows, muted palette)
- Dramatic single spotlight (high contrast, deep shadow)
- Neon glow: specify colour — cyan, magenta, amber, red
- Backlighting / rim lighting (silhouette effect, outline glow)
- Candlelight / practical lamp (warm, flickering)
- Hard overhead fluorescent (clinical, uncomfortable)
- Deep underwater blue-green
- Firelight (orange-red, dynamic, flickering)
**Atmospheric elements:**
- Fog, mist, haze, steam rising
- Dust motes suspended in light beams
- Rain, wet surfaces, puddle reflections
- Smoke, heat shimmer, ash falling
- Shallow depth of field / bokeh in background
- Lens flare, anamorphic flare
- Film grain overlay
**Colour palette:**
- Warm amber vs. cold desaturated blue
- High contrast black and white
- Monochromatic (single colour family)
- Neon-saturated complementary pop
- Muted, faded, vintage washed tones
> ✅ "Golden hour light rakes in from the right, casting long warm shadows across the cracked asphalt and leaving the left side of her face in cool shadow. Steam rises from a nearby grate, diffusing the light into a soft haze. The scene is warm in tone, with amber dominating and occasional flashes of blue from a distant police light."
---
## Planning Element 5 — Audio
LTX-2 generates synchronized audio and video in a single pass. Audio prompting is **not optional** — it directly shapes the output. Be as specific about what you hear as what you see.
**Ambient / environmental sound:**
- Rain on pavement, thunder rumbling in the distance
- Wind through trees, leaves rustling
- Coffee shop murmur, low room tone, clinking cups
- Forest ambience — birds, insects, stream
- City traffic, distant sirens, crowd hum
- Mechanical hum, industrial drone, server room buzz
**Music:**
- Describe genre, tempo, instrumentation — "a slow, mournful cello melody," "an upbeat lo-fi hip-hop beat with a warm vinyl crackle," "tense orchestral strings building to a crescendo"
- Do NOT name specific songs or artists
**Dialogue:**
- Always in quotation marks
- Specify: language, accent, tone, volume, pacing
**Foley / sound design:**
- "Heavy footsteps on wet gravel," "the click of a gun being cocked," "pages turning rapidly," "a distant glass shattering"
**Audio-to-Video specific rules (A2V mode):**
- Describe what scenes and subjects should appear in response to the audio
- Use timing cues: "at 2 seconds, the logo reveals," "the camera moves on the snare hit"
- Motion regularity: "constant speed pan" syncs better with metronomic beats than variable motion
- Add guardrails: "single continuous shot, no cuts" if you do not want the model to invent transitions
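When the user supplies a list of moments in the audio, the timing cues and the guardrail can be assembled consistently. The helper below is a hypothetical sketch; `timing_cues` and the `(seconds, event)` input shape are assumptions, and the sentence pattern simply mirrors the "at 2 seconds, the logo reveals" example above.

```python
def timing_cues(cues, no_cuts=False):
    """Turn (seconds, event) pairs into time-ordered cue sentences."""
    parts = []
    # Sort so cues read in playback order regardless of input order.
    for seconds, event in sorted(cues):
        unit = "second" if seconds == 1 else "seconds"
        parts.append(f"At {seconds:g} {unit}, {event}.")
    if no_cuts:
        # Guardrail from the rules above, appended as the final sentence.
        parts.append("Single continuous shot, no cuts.")
    return " ".join(parts)
```

The returned text still needs to be woven into the final paragraph alongside the scene and style description.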
---
## Planning Element 6 — Stylised Aesthetic
Name the visual style or genre explicitly and commit to it. LTX-2 responds well to named aesthetics placed clearly at the start or end of the prompt; if a generation misses the intended look, move the style note to the first sentence.
**Animation styles:** High-detail 3D animation (Pixar-style), 2D hand-drawn animation, stop-motion / claymation texture, 8-bit pixel art, anime / manga stylization
**Cinematic genres:** Film noir, epic space opera, period drama, psychological thriller, modern romance, documentary / handheld realism, arthouse / experimental, horror
**Visual aesthetics:** Painterly / impressionist texture, cyberpunk, vintage analog film, fashion editorial, surrealist / dreamlike, comic book / graphic novel, luxury / high-end commercial
---
## Mode-Specific Prompt Rules
### TEXT-TO-VIDEO (T2V)
The model generates everything from scratch. You are the only source of visual information.
- Define **everything**: subject, appearance, action, environment, lighting, camera, audio, style
- Start with the strongest visual concept — the most important element goes first
- Follow the length targets in the Output Format section: roughly 5–8 sentences for a 5–8 second clip, 8–11 sentences for an 8–12 second clip
- Avoid vague opening lines like "A beautiful scene of..." — be specific from word one
**Example T2V prompt:**
> EXT. FOGGY LAKE – PRE-DAWN. A lone fisherman rows across a perfectly still, fog-covered lake in the moments before sunrise. The camera glides overhead in a slow crane shot, tracking his slow progress across the dark water. His wooden rowboat creaks softly with each pull of the oars, and a small lantern at the bow casts a warm amber circle that reflects in gentle ripples. As the camera descends to a low water-level angle, reeds along the distant shoreline come into focus, swaying faintly in a cold breeze. A single bird calls somewhere in the mist. The sound design is minimal: oar dips, water laps, the faint creak of wood, and a distant wind. The colour palette is deep navy and slate blue, with the lantern's amber warmth as the only contrast. Documentary realism aesthetic, 16mm film grain, anamorphic lens.
---
### IMAGE-TO-VIDEO (I2V)
Your input image defines the visual starting point. The model knows what the scene looks like — your prompt should describe **what happens next**.
**Key rule:** Do NOT re-describe what is already visible in the image. Describe the transition from stillness to motion.
> ✅ "She slowly raises her right hand toward the light source. The camera gently pushes in as her fingers extend. The ambient sound of rain on glass grows louder, and a soft orchestral swell begins to build."
> ❌ "A woman with dark hair in a white dress stands by a window looking at the rain." (This describes the image — the model already knows this.)
---
### AUDIO-TO-VIDEO (A2V)
Your audio file anchors the timing and rhythm. The model uses your audio as a structural backbone and generates visuals around it.
- Use timing cues to anchor visuals to audio beats
- Describe motion that matches audio energy: fast-paced music → handheld tracking; ambient drone → slow dolly
- Use "single continuous shot" or "no cuts" to prevent unwanted edits
**Example A2V prompt (for a lo-fi hip-hop track):**
> A warm, amber-lit bedroom at 2am. A young woman sits at a wooden desk, studying by the glow of a desk lamp. The camera begins in a medium shot framing her from the side, slowly pushing in over 6 seconds to a close-up on her hand writing in a notebook. As the beat's vinyl crackle plays, dust motes float in the lamp light. She pauses, looks up out the window at rain hitting the glass. Outside the window, a blurred city glows softly. The visual tempo is slow and unhurried, matching the relaxed rhythm of the track. Warm colour grade — amber, honey, and low contrast. Shallow depth of field. Cinematic but intimate.
---
### FIRST-AND-LAST-FRAME (FLF2V)
You provide a **starting image** and an **ending image**. The model builds the motion between them. Your prompt guides the journey — the path, the pacing, and the camera.
Your prompt should cover:
- **Motion path**: how does the subject move from the starting state to the ending state?
- **Camera behaviour**: does it stay static, or follow the transformation?
- **Pacing**: slow and gradual, or sudden mid-point shift?
- **What happens in the middle**: any intermediate actions, reveals, or changes?
- **Audio** that accompanies the transformation
**Example FLF2V prompt:**
> Starting with a woman's face in profile, eyes closed and expression neutral, the scene gently animates as she slowly turns to face the camera over 4 seconds. Her eyes open gradually as warm morning light brightens across her face from left to right, as if a curtain is being drawn. The camera holds completely still in a tight medium close-up throughout. Her expression shifts from rest to a quiet, content smile by the final frame. The audio is a soft piano melody, with birdsong beginning faintly at the midpoint. No cuts — single continuous shot.
---
## Universal Prompt Quality Rules
### DO:
- Write in **present tense** throughout the entire prompt
- Use **specific, observable physical details** — not emotional labels
- Match **prompt length to video length** — longer video = longer prompt
- Place the **style/genre early** to set the visual baseline
- Use **connectors** for temporal flow: "as," "then," "while," "just before," "the moment"
- Describe camera movement with **timing and reveal**: when it starts, what it shows afterward
- **Break dialogue** into short phrases with acting beats between lines
### DON'T:
- Use emotional labels: "she feels sad," "he seems nervous" — describe posture, gesture, and expression instead
- Include readable text, signs, logos, or brand names — LTX-2 cannot reliably generate legible text
- Stack more than 2 characters in simultaneous action
- Mix contradictory light sources without a clear narrative reason
- Over-complicate: more than 8 simultaneous instructions degrade output
- Use passive voice: "camera is moved" — write "camera pushes in"
- Write a 10-word prompt for a 10-second video
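Two of these rules lend themselves to a rough automated check. The sketch below is a heuristic only: `CHECKS` and `lint_prompt` are hypothetical names, the patterns are assumptions that cover just the phrasings quoted in this list, and they will miss many violations and can flag harmless wording such as "the camera is locked off".

```python
import re

# Heuristic patterns; both are illustrative assumptions, not a full grammar check.
CHECKS = {
    "emotional label": re.compile(r"\b(feels|seems|is feeling)\b", re.IGNORECASE),
    "passive camera": re.compile(r"\bcamera (is|was|gets) \w+ed\b", re.IGNORECASE),
}


def lint_prompt(prompt):
    """Return the names of the heuristic checks that the draft trips."""
    return [name for name, pattern in CHECKS.items() if pattern.search(prompt)]
```

A flagged draft is a cue to reread the sentence, not proof that it is wrong.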
---
## Quick-Reference: What LTX-2 Handles Well vs. Poorly
| Works Well | Avoid |
|---|---|
| Single-subject emotional moments | Crowds or 3+ simultaneous characters |
| Clear sequential camera language | Vague or contradictory camera instructions |
| Named visual styles (noir, Pixar, cyberpunk) | Abstract mood labels without visual grounding |
| Atmospheric effects (fog, rain, bokeh, dust) | Readable text, signs, logos, brand names |
| Dialogue broken into short phrases with acting beats | Long single-sentence speeches with no direction |
| Detailed audio descriptions woven into the prompt | No audio description at all |
| Sequential action with clear cause and effect | Simultaneous complex multi-character actions |
| Facial nuance and gestural performance | Internal emotional states |
| Dance and rhythmic / repetitive motion | Complex non-linear physics |
| Consistent lighting logic | Contradictory light sources in one scene |
---
## Iteration Strategy
LTX-2 is built for fast experimentation. When a generation misses:
1. **Simplify first** — remove one element and regenerate before adding anything
2. **Clarify camera** — be more specific about framing, movement timing, and what gets revealed
3. **Anchor style earlier** — move the aesthetic note to the very first sentence
4. **Match prompt length to video length** — if the output feels rushed, your prompt is too short
5. **Break complex dialogue** — split long speeches into short phrases with physical beats between each line
6. **Split into two shots** — if the scene is complex, generate two sequential 5-second clips and cut together
7. **Add specificity** — vague actions produce vague results; name the body part, direction, speed
---
## Helpful Vocabulary Reference
**Camera angles:** wide establishing, medium shot, medium close-up, close-up, extreme close-up, low angle, high angle, Dutch angle, over-the-shoulder, bird's eye / overhead, worm's eye
**Camera moves:** static/locked, slow dolly in, dolly back/pull out, pan left/right, tilt up/down, handheld track, arc/orbit, crane up/down, whip pan, push in, pull back, glide/float
**Lighting:** golden hour, magic hour, blue hour, overcast diffuse, rim light, backlight, hard key light, soft fill, practical lamp, neon glow, candlelight, moonlight, fluorescent overhead
**Atmosphere:** fog, mist, haze, steam, smoke, dust motes, rain, snow, embers, ash, bokeh, depth of field, lens flare, anamorphic streak
**Pacing:** slow motion, real time, time-lapse, lingering, continuous, freeze-frame, fade in, fade out, seamless transition, sudden stop
**Audio:** ambient, foley, score, VO, diegetic, non-diegetic, room tone, reverb, echo, stereo width, intimate mic sound, wide acoustic, distorted radio-style, crisp and dry
**Film look:** 16mm grain, 35mm film, anamorphic, vintage VHS, Super 8, clean digital, stylized digital, RAW log look, high contrast grade, bleach bypass