LTX 2.3 ComfyUI Tutorial: Text to Video & Image to Video

Most people spend an hour failing before they generate a single frame. Wrong model file, wrong folder, black frames on output. This guide gives you the exact sequence — free downloadable workflows included.

Workflows included: free download · Models covered: BF16, FP8, GGUF · ComfyUI version: v0.22+

By Earngenix Team · 12 min read

⚡ Quick Answer

LTX 2.3 runs in ComfyUI v0.22 or later with no extra custom nodes (unless you use GGUF — more on that below). Download the workflow file, place the model files in the correct folders, and open the workflow in ComfyUI. The workflow supports both BF16 and FP8 — you switch between them inside the workflow. GGUF uses a separate workflow. Both T2V and I2V have a toggle switch inside the workflow too.

What is LTX 2.3?

LTX 2.3 is a free, open-source video generation model built by Lightricks. At 22B parameters, it handles text-to-video, image-to-video, and native 9:16 portrait output without any cropping — and it runs locally through ComfyUI with no subscription required.

Its throughput is roughly 18× that of WAN 2.2 14B, and audio-to-video is built directly into the generation process, with audio synced during generation rather than tacked on afterwards. ComfyUI is the best way to run it locally: full control over every setting, access to the official workflow, and zero dependency on third-party services.

🎬 Demo — LTX 2.3 output

LTX 2.3 · BF16 Distilled · 2560×1408 · 24 frames

Generated with the prompt below — cinematic framing, clear motion, and a specific scene are the three things LTX 2.3 responds to best.

Demo prompt — copy to try it yourself

"A wide establishing shot frames a lighthouse standing alone on a rocky Atlantic cliff, battered by strong onshore wind. Tall grass whips violently around the base of the structure. The camera begins far back, slowly pushing in over 6 seconds toward a weathered wooden door at the base of the lighthouse. The door rattles in the wind but does not open. No characters are visible — only the structure, the cliff, the churning grey-green sea below, and an enormous orange-gold sky pressing down from above. The audio is all wind — deep, sustained, and surrounding — with the crash of distant waves and the low metallic creak of the lighthouse lantern housing swaying above. The overall colour grade is warm amber at the top of the frame fading to cold slate at the sea. 16mm film grain, anamorphic lens with subtle horizontal flare from the sun. Cinematic, arthouse slow cinema aesthetic."

See the results before you download anything

All three model versions produce solid output. The differences mostly show up at the limits of your hardware. Here's the same prompt run through each version — watch before you commit to a 23–39GB download.

🎬 Same prompt · Three models · Compare the output

BF16 · Highest detail · 24GB+

Sharpest edges, slowest — only worth it over FP8 in very close detail shots.

FP8 ✓ · Recommended · 16GB+

Near-identical to BF16 at normal playback. Best choice for RTX 4070 and above.

GGUF · Fastest · 8–16GB

Slightly softer edges on fine textures. Best option for older or 8–12GB cards.

Comparison prompt — used for all three outputs above

"A medium tracking shot frames a young woman in her late 20s walking slowly along a glistening urban pavement, a light beige trench coat over her shoulders and dark hair slightly damp, a few strands resting naturally across her cheek, holding a transparent umbrella overhead as raindrops patter steadily on its surface. The camera opens in a smooth frontal tracking shot moving with her at walking pace, then arcs gently to her right over four seconds, revealing a colorful flower stand and glowing café frontage behind her, settling into a clean medium side-angle shot. As the camera settles, she slows to a stop beside the stand, reaches forward, and lifts a small bouquet of pale pink flowers, turning it slowly with a quiet smile. A cool breeze shifts a few damp strands of hair across her cheek. Amber light from café windows spills across her coat and pools as long shimmering reflections on the wet pavement below; a single car moves through the background, its headlights drawing a brief warm streak across the road. The audio is steady medium rain on the umbrella canopy and pavement, wet footsteps on stone, muffled café warmth from a nearby doorway, and a sparse two-note piano melody sitting just beneath the ambient layer. Cinematic street photography aesthetic, 35mm film look with fine grain, warm amber and honey colour grade cooling to desaturated teal in the background, anamorphic lens with soft bokeh on background lights"

| Model | Output quality | Generation time | VRAM needed | Use this when... |
| --- | --- | --- | --- | --- |
| Distilled 1.1 BF16 | Highest detail | Slowest | 24GB+ | Max quality, RTX 3090 / 4090 |
| Distilled 1.1 FP8 ✅ | Near-identical to BF16 | Faster | 16GB+ | Best balance: RTX 4070 and up |
| GGUF | Good, softer edges | Fastest | 8–16GB | Older or lower-VRAM cards |

Not sure which to pick? If your card has 16GB or more of VRAM, download the FP8 version and stop reading this column. Only go BF16 if you need absolute maximum quality and have 24GB+ VRAM to spare.

What you need before you start

Two things break most installs before they begin: the wrong ComfyUI version, and not enough VRAM for the chosen model. Check both before downloading anything.

ComfyUI version — check you're on v0.22 or later

LTX 2.3 nodes are built into ComfyUI natively from v0.22 onwards — no custom nodes needed for BF16 and FP8. The version number is shown in the bottom-left corner of the interface.
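If you would rather check the version in a script than by eye, the comparison can be sketched as below. This is a minimal sketch: `parse_version` and `meets_minimum` are hypothetical helper names, and it assumes you paste in the version string exactly as the interface shows it (for example `v0.22.1`).

```python
import re


def parse_version(text: str) -> tuple[int, ...]:
    """Turn a string such as 'v0.22.1' into (0, 22, 1)."""
    numbers = re.findall(r"\d+", text)
    if not numbers:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(n) for n in numbers)


def meets_minimum(installed: str, minimum: str = "0.22") -> bool:
    """True when the installed version is at least the minimum."""
    # Tuples compare element by element, so (0, 22, 1) >= (0, 22) holds.
    return parse_version(installed) >= parse_version(minimum)


if __name__ == "__main__":
    for candidate in ("v0.21.9", "v0.22", "v0.22.1"):
        print(candidate, "ok" if meets_minimum(candidate) else "update ComfyUI")
```

Comparing integer tuples avoids the trap of comparing version strings as text, where "0.9" would sort after "0.22".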

⚠️ If you load a workflow and see red nodes, it's almost always a version issue or a missing node. See the fix red or missing nodes guide for step-by-step instructions.

Which model should you download?

The v1.1 distilled models are newer and better than the original LTX 2.3 release. Use one of these, not the original dev weights, unless you specifically need dev-quality for a final render.

| Version | File | Size | VRAM | Use this when... |
| --- | --- | --- | --- | --- |
| Distilled 1.1 BF16 | ltx-2.3-22b-distilled-1.1_transformer_only_bf16.safetensors | 39.1GB | 24GB+ | Maximum quality, RTX 3090/4090 |
| Distilled 1.1 FP8 ✅ | ltx-2.3-22b-distilled-1.1_transformer_only_fp8_scaled.safetensors | 23.4GB | 16GB+ | Best balance: RTX 4080/4090, 30-series |
| GGUF (Q4_K_M) | ltx-2-3-22b-dev-Q4_K_M.gguf | ~12GB | 8–16GB | Older or lower-VRAM cards |

Running on less than 8GB VRAM?

If your GPU is under 8GB, RunPod lets you rent an A100 for under €1 per hour and run LTX 2.3 at full quality without touching your local machine.
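The VRAM thresholds in this section reduce to a small decision rule. The sketch below only encodes the numbers stated in this guide (24GB+ for BF16, 16GB+ for FP8, 8GB+ for GGUF); the function names are hypothetical and the rule ignores everything else that affects real-world fit, such as system RAM or offloading.

```python
def viable_variants(vram_gb: float) -> list[str]:
    """List the model variants whose stated VRAM floor is met."""
    variants = []
    if vram_gb >= 24:
        variants.append("BF16")
    if vram_gb >= 16:
        variants.append("FP8")
    if vram_gb >= 8:
        variants.append("GGUF")
    return variants


def recommend(vram_gb: float, max_quality: bool = False) -> str:
    """Pick one variant, preferring FP8 unless maximum quality is requested."""
    options = viable_variants(vram_gb)
    if not options:
        return "cloud GPU"  # under 8GB: rent a GPU instead
    if max_quality and "BF16" in options:
        return "BF16"
    if "FP8" in options:
        return "FP8"
    return "GGUF"


if __name__ == "__main__":
    for vram in (6, 12, 16, 24):
        print(f"{vram}GB -> {recommend(vram)}")
```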

Download LTX 2.3 — the right files and folders

Most failed installs come down to a missing file or a file in the wrong folder. Start by downloading the workflow file below, then follow the model download instructions — this way you have the workflow ready and the downloads listed in order.

Step 1 — Download the Workflow

Download the workflow JSON file first. There is one workflow for both BF16 and FP8 — you switch between the models inside the workflow. The GGUF version uses a separate workflow. Each workflow also has a T2V / I2V toggle button inside — you can switch between text-to-video and image-to-video without loading a different file.

💡 To load: open ComfyUI → drag the JSON file onto the canvas, or go to Load in the menu and select the file.

Step 2 — Download the model files

LTX 2.3 needs more than just the main model — here's every file and exactly where it goes. Use the download buttons to get each file directly from HuggingFace.

BF16 and FP8 Models — download the one that matches your VRAM

| File | What it does |
| --- | --- |
| ltx-2.3-22b-distilled-1.1_transformer_only_bf16.safetensors | Main model, BF16 (39GB, 24GB+ VRAM) |
| ltx-2.3-22b-distilled-1.1_transformer_only_fp8_scaled.safetensors | Main model, FP8 recommended (23GB, 16GB+ VRAM) |
| gemma_3_12B_it_fp8_scaled.safetensors | Text encoder 1 ⚠ required |
| ltx-2.3_text_projection_bf16.safetensors | Text encoder 2 (projection) ⚠ required |
| LTX23_audio_vae_bf16.safetensors | Audio VAE |
| LTX23_video_vae_bf16.safetensors | Video VAE |
| taeltx2_3.safetensors | TAE (fast preview VAE) |
| ltx-2.3-22b-distilled-1.1_lora-dynamic_fro09_avg_rank_111_bf16.safetensors | Distilled LoRA (required) |
| ltx-2.3-spatial-upscaler-x2-1.1.safetensors | Spatial upscaler ×2 |

GGUF Model files — use these if you're on 8–16GB VRAM

| File | What it does |
| --- | --- |
| ltx-2-3-22b-dev-Q4_K_M.gguf | GGUF main model Q4_K_M (~12GB) |
| gemma_3_12B_it_fp4_mixed.safetensors | Text encoder 1 ⚠ required (GGUF version) |
| ltx-2.3_text_projection_bf16.safetensors | Text encoder 2 (projection) ⚠ required |
| LTX23_audio_vae_bf16.safetensors | Audio VAE |
| LTX23_video_vae_bf16.safetensors | Video VAE |
| taeltx2_3.safetensors | TAE (fast preview VAE) |
| ltx-2.3-22b-distilled-1.1_lora-dynamic_fro09_avg_rank_111_bf16.safetensors | Distilled LoRA |
| ltx-2-19b-ic-lora-detailer.safetensors | IC LoRA detailer |
| ltx-2.3-spatial-upscaler-x2-1.1.safetensors | Spatial upscaler ×2 |

Step 3 — Put each file in the right folder

Two things catch people out: (1) GGUF models go in checkpoints/, not diffusion_models/ — wrong folder means ComfyUI won't see it. (2) Both text encoder files are required — missing either one causes black frames.

📁 BF16 / FP8 folder structure

ComfyUI/
└── models/
    ├── diffusion_models/
    │   ├── ltx-2.3-22b-distilled-1.1_transformer_only_bf16.safetensors  ← BF16 model
    │   └── ltx-2.3-22b-distilled-1.1_transformer_only_fp8_scaled.safetensors  ← FP8 model (choose one)
    ├── text_encoders/
    │   ├── gemma_3_12B_it_fp8_scaled.safetensors  ← encoder 1 ⚠ required
    │   └── ltx-2.3_text_projection_bf16.safetensors  ← encoder 2 ⚠ required
    ├── vae/
    │   ├── LTX23_audio_vae_bf16.safetensors
    │   ├── LTX23_video_vae_bf16.safetensors
    │   └── taeltx2_3.safetensors
    ├── loras/
    │   └── ltx-2.3-22b-distilled-1.1_lora-dynamic_fro09_avg_rank_111_bf16.safetensors
    └── models/
        └── ltx-2.3-spatial-upscaler-x2-1.1.safetensors  ← upscaler

⚠ Both text encoders are required. Missing either one causes black frames.

📁 GGUF folder structure

ComfyUI/
└── models/
    ├── checkpoints/  ← GGUF goes here, NOT diffusion_models/
    │   └── ltx-2-3-22b-dev-Q4_K_M.gguf
    ├── text_encoders/
    │   ├── gemma_3_12B_it_fp4_mixed.safetensors  ← encoder 1 ⚠ required
    │   └── ltx-2.3_text_projection_bf16.safetensors  ← encoder 2 ⚠ required
    ├── vae/
    │   ├── LTX23_audio_vae_bf16.safetensors
    │   ├── LTX23_video_vae_bf16.safetensors
    │   └── taeltx2_3.safetensors
    ├── loras/
    │   ├── ltx-2.3-22b-distilled-1.1_lora-dynamic_fro09_avg_rank_111_bf16.safetensors
    │   └── ltx-2-19b-ic-lora-detailer.safetensors
    └── models/
        └── ltx-2.3-spatial-upscaler-x2-1.1.safetensors  ← upscaler

⚠ Both text encoders are required. Missing either one causes black frames.
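Since most failed installs come down to a missing or misplaced file, it can help to check the folders with a script before opening ComfyUI. The sketch below covers only the main model, both text encoders, the two VAEs and the distilled LoRA, using the paths listed in this guide. `LAYOUTS` and `missing_files` are hypothetical names, and the script assumes your install root contains a `models/` directory as shown above.

```python
from pathlib import Path

# Paths are relative to ComfyUI/models/ and copied from the folder trees above.
SHARED = [
    "text_encoders/ltx-2.3_text_projection_bf16.safetensors",
    "vae/LTX23_audio_vae_bf16.safetensors",
    "vae/LTX23_video_vae_bf16.safetensors",
    "loras/ltx-2.3-22b-distilled-1.1_lora-dynamic_fro09_avg_rank_111_bf16.safetensors",
]

LAYOUTS = {
    "bf16": SHARED + [
        "diffusion_models/ltx-2.3-22b-distilled-1.1_transformer_only_bf16.safetensors",
        "text_encoders/gemma_3_12B_it_fp8_scaled.safetensors",
    ],
    "fp8": SHARED + [
        "diffusion_models/ltx-2.3-22b-distilled-1.1_transformer_only_fp8_scaled.safetensors",
        "text_encoders/gemma_3_12B_it_fp8_scaled.safetensors",
    ],
    "gguf": SHARED + [
        "checkpoints/ltx-2-3-22b-dev-Q4_K_M.gguf",
        "text_encoders/gemma_3_12B_it_fp4_mixed.safetensors",
    ],
}


def missing_files(comfy_root: str, variant: str) -> list[str]:
    """Return the expected files for a variant that are not on disk."""
    models_dir = Path(comfy_root) / "models"
    return [rel for rel in LAYOUTS[variant] if not (models_dir / rel).is_file()]
```

Calling `missing_files("ComfyUI", "fp8")` from the folder that contains your install returns an empty list when everything checked is in place, or the relative paths that still need attention.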

How to load LTX 2.3 in ComfyUI

You've already downloaded the workflow JSON above. Here's how to open it in ComfyUI.

Load the workflow — step by step

  1. Open ComfyUI in your browser.
  2. Drag and drop the downloaded JSON file directly onto the ComfyUI canvas — it will load automatically.
  3. Alternatively, click the Load button in the ComfyUI menu bar and select the JSON file.
  4. The workflow opens with all nodes connected. If you downloaded the main workflow (BF16 + FP8), you'll see a model selector node — point it to your downloaded model file.
  5. To switch between BF16 and FP8, change the model file in the checkpoint loader node inside the workflow.
  6. To switch between text-to-video and image-to-video, use the T2V / I2V toggle button inside the workflow — no need to load a different file.

One workflow, two models, two modes: The main workflow (BF16 + FP8) handles both text-to-video and image-to-video through a toggle switch. You switch between BF16 and FP8 by selecting your downloaded model file in the checkpoint loader. The GGUF workflow works the same way with its own loader.

🔴 Red nodes or errors on first load?

Two causes cover most cases, so check these first:

  1. Your ComfyUI version must be v0.22+. Check the bottom-left corner of the interface.
  2. Both text encoder files must be in models/text_encoders/. A single missing file produces red nodes, black frames, or silent generation failures.

If you're still getting red nodes after checking both, you likely need a custom node. Follow the step-by-step guide: How to fix red / missing nodes in ComfyUI →
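When a red node points to a missing custom node, it helps to know which node types the workflow file actually asks for. The sketch below lists them. It assumes the workflow was saved in ComfyUI's regular interface export, where the JSON has a top-level `nodes` list and each entry carries a `type` string; if your file is laid out differently the functions return an empty list. `node_types` and `load_node_types` are hypothetical helper names.

```python
import json


def node_types(workflow: dict) -> list[str]:
    """Sorted, de-duplicated node type names found in a workflow dict."""
    nodes = workflow.get("nodes", [])
    return sorted({node["type"] for node in nodes if "type" in node})


def load_node_types(path: str) -> list[str]:
    """Read a workflow JSON file from disk and list its node types."""
    with open(path, encoding="utf-8") as handle:
        return node_types(json.load(handle))
```

Compare the printed names against the nodes your ComfyUI install offers; anything unfamiliar is a candidate for the missing custom node.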

Required nodes

The GGUF workflow requires these two custom nodes. Install them via ComfyUI Manager or from their GitHub repositories:

  • ComfyUI-GGUF
  • ComfyUI-KJNodes

Full installation instructions for both are in the missing nodes guide.

Speed up generation with SageAttention

The workflow uses SageAttention — an optimized attention implementation that can speed up generation by 30–50%. If you haven't installed it, generation will still work but will be slower. Install SageAttention once and all your ComfyUI workflows benefit.
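To get a feel for what that range means in practice, the arithmetic can be sketched as below. It reads the 30–50% figure as a reduction in generation time; the 60-second baseline is a made-up number for illustration, not a measurement, and `time_after_reduction` is a hypothetical helper name.

```python
def time_after_reduction(seconds: float, percent_saved: int) -> float:
    """Generation time left after cutting it by a whole-number percentage."""
    if not 0 <= percent_saved < 100:
        raise ValueError("percent_saved must be between 0 and 99")
    return seconds * (100 - percent_saved) / 100


if __name__ == "__main__":
    baseline = 60  # illustrative baseline in seconds, not a benchmark
    slow_case = time_after_reduction(baseline, 30)
    fast_case = time_after_reduction(baseline, 50)
    print(f"{baseline}s baseline -> between {fast_case:.0f}s and {slow_case:.0f}s")
```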

Text to Video (T2V) with LTX 2.3

Switching between BF16, FP8, and T2V/I2V in the workflow

The main workflow covers everything — you don't need separate files for each mode. Inside the workflow, look for the model checkpoint loader to switch between BF16 and FP8. The T2V / I2V toggle is a button near the top of the workflow canvas.

[Screenshot: workflow overview showing the checkpoint loader (to switch BF16/FP8) and the T2V / I2V toggle button at the top of the canvas]

[Screenshot: close-up of the T2V / I2V toggle button inside the workflow]

Settings that work — what to set and why

These are reliable starting values. Don't change anything until you've confirmed a basic generation works.

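| Setting | Lower-resolution preset | Higher-resolution preset |
| --- | --- | --- |
| Resolution (landscape) | 768×512 | 1280×720 |
| Resolution (portrait) | 544×960 | 720×1280 |
| Frames | 24 (≈ 1 sec at 24fps) | 48 (≈ 1 sec at 48fps) |

Clip length is frame count divided by frame rate, so a 5-second clip needs about 120 frames at 24fps or 240 frames at 48fps.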
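Clip length follows directly from frame count and frame rate, which is worth checking before a long render. A minimal sketch of the arithmetic, with hypothetical function names; it does not account for any frame-count constraints the model or workflow may impose.

```python
import math


def clip_seconds(frames: int, fps: float) -> float:
    """Duration of a clip in seconds."""
    return frames / fps


def frames_for(seconds: float, fps: float) -> int:
    """Smallest whole number of frames that covers the requested duration."""
    return math.ceil(seconds * fps)


if __name__ == "__main__":
    print(clip_seconds(24, 24))  # 24 frames at 24fps
    print(frames_for(5, 24))     # frames needed for 5 seconds at 24fps
```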

How to write prompts for LTX 2.3

LTX 2.3 responds to cinematic, scene-specific language. Vague prompts produce vague motion. The more specific you are about scene, lighting, and camera behaviour, the more consistent the result.

✗ Weak prompt

"a cat playing"

No scene, no lighting, no motion description. The model guesses everything.

✓ Strong prompt

"A tabby cat batting a ball of yarn across a sunlit hardwood floor, slow motion, shallow focus, warm morning light, photorealistic"

Specific scene + lighting + camera behaviour = consistent output.
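One way to make that habit stick is to fill in the parts of a prompt separately and join them at the end, so an empty slot is obvious. The sketch below is only an illustration of that idea: the `ShotPrompt` class and its field names are hypothetical, not part of LTX 2.3 or ComfyUI.

```python
from dataclasses import dataclass


@dataclass
class ShotPrompt:
    subject_action: str  # who or what, and what they are doing
    setting: str         # where it happens
    camera: str          # camera behaviour and motion
    lighting: str        # light quality and direction
    style: str = ""      # optional look or aesthetic

    def render(self) -> str:
        """Join the non-empty parts into a single prompt string."""
        scene = f"{self.subject_action} {self.setting}".strip()
        parts = [scene, self.camera, self.lighting, self.style]
        return ", ".join(p.strip() for p in parts if p.strip())


if __name__ == "__main__":
    prompt = ShotPrompt(
        subject_action="A tabby cat batting a ball of yarn",
        setting="across a sunlit hardwood floor",
        camera="slow motion, shallow focus",
        lighting="warm morning light",
        style="photorealistic",
    )
    print(prompt.render())
```

Filled in this way, the example reproduces the strong prompt above word for word, while the weak prompt would leave the setting, camera and lighting slots empty.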

Want the full breakdown of prompt structure, camera language, and mode-specific tips? Read the complete LTX 2.3 prompting guide →

🎬 T2V output — strong prompt example

T2V · FP8 Distilled
T2V prompt used above

"A wide establishing shot frames a lighthouse standing alone on a rocky Atlantic cliff, battered by strong onshore wind. Tall grass whips violently around the base of the structure. The camera begins far back, slowly pushing in over 6 seconds toward a weathered wooden door at the base of the lighthouse. The door rattles in the wind but does not open. No characters are visible — only the structure, the cliff, the churning grey-green sea below, and an enormous orange-gold sky pressing down from above. The audio is all wind — deep, sustained, and surrounding — with the crash of distant waves and the low metallic creak of the lighthouse lantern housing swaying above. The overall colour grade is warm amber at the top of the frame fading to cold slate at the sea. 16mm film grain, anamorphic lens with subtle horizontal flare from the sun. Cinematic, arthouse slow cinema aesthetic."

Download the T2V / I2V workflow

The main workflow covers both T2V and I2V — switch modes with the toggle button inside. Download the version that matches your hardware.

Image to Video (I2V) with LTX 2.3

I2V works differently from T2V. You provide a still image and the model generates motion from it. Your prompt describes how things move — not what the scene looks like, because the scene is already your image. Use the I2V toggle button inside the workflow to switch to image-to-video mode.

📸 I2V workflow — where to add your image and switch modes

[Screenshot: I2V section of the workflow showing the image loader node and the I2V toggle switched on]

[Image: input image example used for I2V generation]

What makes a good input image

LTX 2.3's I2V is notably better than LTX 2 — sharper edge preservation, more consistent subjects across frames. For best results:

  • Clear subject with no motion blur — the model needs a clean frame to start from.
  • Good, even lighting — extreme contrast or heavy shadows create flickering across frames.
  • Avoid overly busy backgrounds if you want the subject to stay consistent.

Settings and prompting for I2V

Settings are similar to T2V. The key difference is denoise strength: start at 0.85 for subtle motion. Raise it for more movement; lower it to stay closer to the source image.

The prompt describes motion, not the scene. Here's an example using a portrait photo as the input:

I2V prompt example — motion description, not scene description

"A beauty influencer is looking directly at the camera while filming a makeup tutorial. The camera remains completely static throughout the shot. She naturally speaks to the audience, explaining the makeup process with realistic lip movement and subtle facial expressions. While talking, she gently applies lipstick to her lips with smooth, controlled hand movements. She occasionally smiles, slightly tilts her head, blinks naturally, and maintains eye contact with the camera as if teaching viewers. Her hair moves slightly with natural body motion. The lipstick application appears realistic and precise. Her free hand occasionally makes small gestures while explaining the tutorial. Natural breathing, authentic human movement, realistic facial muscles, subtle eyebrow movements, and conversational expressions. Professional beauty content creator energy, friendly and engaging presentation style. The background remains stable and unchanged. Soft studio lighting remains consistent. High facial detail, realistic skin texture, natural eye movement, realistic hand motion, smooth temporal consistency, photorealistic beauty tutorial video, social media influencer style, professional cosmetics advertisement quality, cinematic realism, sharp focus, natural motion, 10-second beauty tutorial clip. Motion Guidance: Camera: Locked-off static shot. Subject: Speaking to camera. Action: Applying lipstick while explaining the process. Expression: Friendly, confident, engaging. Movement: Small hand gestures, blinking, subtle head movements, natural body sway. Pace: Calm and natural, not rushed. "

Input image: close-up portrait of a woman looking at camera. The prompt only describes what moves — the model reads the scene from the image itself.

🎬 I2V output — portrait animation example

I2V · FP8 Distilled · denoise 0.85

Settings: FP8 distilled, 24 frames

Download the I2V workflow

Use the main workflow (BF16 + FP8) and toggle to I2V mode, or use the dedicated GGUF I2V workflow if you're on a lower-VRAM card.

Frequently asked questions

How much VRAM does LTX 2.3 need?

FP8 distilled needs 16GB+. BF16 needs 24GB+. GGUF runs on 8–16GB depending on the quantisation variant. Under 8GB, a cloud GPU like RunPod is the practical option: you can rent an A100 for under €1 per hour.

Is LTX 2.3 free to use, including commercially?

Yes. Model weights are published on HuggingFace under an open licence, and you can use outputs for personal and commercial projects. Always check the current licence file on the HuggingFace repo before publishing commercially, as terms can update.

What is the difference between dev and distilled?

Dev is the full-quality model: more VRAM, slower, better for final renders. Distilled 1.1 is compressed: faster, lower VRAM, and nearly identical quality for most outputs. If you're just getting started, distilled is the right choice.

Why do I get red nodes or black frames?

Two causes cover almost every case: ComfyUI below v0.22, or missing/misplaced Gemma text encoder files. Update ComfyUI to v0.22+, then confirm both text encoder files are in models/text_encoders/ and restart.

How long does generation take?

On an RTX 4090 with FP8 distilled at 8 steps, expect roughly 20–40 seconds for a 5-second clip. BF16 is slower. GGUF is fastest per-step but at lower quality. Installing SageAttention can cut generation time by 30–50%.

Does LTX 2.3 run on a Mac?

MPS (Apple Silicon) is supported but significantly slower than CUDA; most Mac users report 3–5× longer generation times. Mac support is still catching up. Check the official LTX docs for the current status.

Do I need custom nodes?

For BF16 and FP8 models, no custom nodes are needed; ComfyUI v0.22+ has everything built in. The GGUF workflow requires ComfyUI-GGUF and ComfyUI-KJNodes. Install both via ComfyUI Manager or the GitHub links in the red nodes section above.

What to do next

Pick your model, download your workflow, run your first generation.

Once you're generating video reliably, that skill becomes a service. Fiverr and Upwork both have active demand for AI video — here's how to turn it into income.
