AI Image & Video Gen · Water Impact Explorer · Report v1.2

Generative AI: The Visual Cost

From a single image to platform-scale video production: the hydrological reality of AI creative tools, grounded in 2025–2026 benchmark estimates.

~2.4 mL per fast AI image
~95 mL per ultra-quality image
~382 mL per 10s HD video clip
~7.2 L per minute of 4K AI video

Li et al. arXiv:2311.16863 · Zhao et al. arXiv:2509.19222 · MIT Tech Review 2025 · LBNL 2024

Image Quality / Model

Energy per image varies 40× between SD-Turbo draft mode and DALL-E 3 ultra with iterative editing.

SDXL / Midjourney v6 standard: 1024px, ~30 diffusion steps. Most common API tier.

[Interactive controls: diffusion steps (50), data-center WUE (0.25 L/kWh), and images-per-day (1–1k, logarithmic axis).]

Water per Image

[Live calculator outputs: water per image generated, daily and annual totals, daily energy (kWh/day), and the number of images equal to one burger's water footprint.]

Scale Equivalences – Daily

[Live equivalences: 🍔 years of this daily habit per one burger's water footprint · 🚿 vs. a 10-min shower (65 L) · 🏠 vs. US household daily indoor use (341 L) · 💧 equivalent 500 mL bottles.]

Scale Visualization

Per-image and per-clip values vs. real-world water baselines, on a logarithmic scale; updates live with the calculator.

Key Findings

Image & video generation water context: companion to the bra-khet AI Water-Energy Nexus Report v1.2.

Text generation (LLM inference) at 2026 efficiency: 0.26–2.0 mL per query. Image generation adds a fundamentally different compute burden:
  • Fast AI image (SD-Turbo, 0.5 Wh): ~2.4 mL, comparable to a 2026 LLM query at standard efficiency
  • Standard image (SDXL / MJ v6, 3 Wh): ~14.3 mL, ~55× more water than a Gemini query
  • High quality (50 steps + upscale, 10 Wh): ~47.7 mL, nearly a shot glass of water per image
  • Ultra (DALL-E 3 with editing, 20 Wh): ~95.4 mL, about 365× more water than a Gemini query
The core reason: diffusion models run 20–100 denoising steps, each requiring a full U-Net or DiT forward pass. Text models run one forward pass with KV-cache reuse, an order-of-magnitude computational difference for similar output quality.
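The per-tier figures above follow a simple accounting identity: water per image ≈ inference energy × (on-site WUE + upstream grid water intensity). A minimal sketch in Python, assuming the report's 0.25 L/kWh WUE slider default and the LBNL GWIF of 4.52 L/kWh (the function and variable names are illustrative):

```python
# Water per generated image: on-site evaporative cooling (WUE)
# plus upstream thermoelectric water use of the grid (GWIF).
WUE_L_PER_KWH = 0.25   # data-center cooling (report's slider default)
GWIF_L_PER_KWH = 4.52  # US grid average (LBNL 2024)

def water_ml_per_image(energy_wh: float) -> float:
    """mL of water per image for a given inference energy in Wh.
    Note: 1 Wh x 1 L/kWh = 1 mL, so the factors apply directly."""
    return energy_wh * (WUE_L_PER_KWH + GWIF_L_PER_KWH)

tiers_wh = {"fast (SD-Turbo)": 0.5, "standard (SDXL / MJ v6)": 3,
            "high (50 steps + upscale)": 10, "ultra (DALL-E 3)": 20}
for tier, wh in tiers_wh.items():
    print(f"{tier}: {water_ml_per_image(wh):.1f} mL")
```

This reproduces the four tier values above (2.4, 14.3, 47.7, and 95.4 mL).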
Video generation compounds image-gen cost by frame count and temporal consistency overhead:
  • 5s 480p SD clip (~125 frames, 25 Wh): ~119 mL, roughly half a small glass of water
  • 10s 720p HD clip (~240 frames, 80 Wh): ~382 mL, one full 500 mL bottle per clip
  • 30s 1080p FHD clip (~720 frames, 300 Wh): ~1.43 L, three water bottles; equivalent to ~5,500 Gemini queries
  • 60s 4K clip (~1,440 frames, 1,500 Wh): ~7.16 L, nearly two US gallons; matches ~27,500 Gemini queries
Temporal consistency models (Sora, Wan, CogVideo) require cross-frame attention over time, adding 20–40% overhead vs. independent frame generation. Efficient video diffusion (StreamDiffusion, AnimateDiff V3) can reduce this by ~3–5×.
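The same identity extends to video clips; a sketch, assuming the report's per-clip energy figures and its ~0.26 mL-per-query Gemini baseline (names are illustrative):

```python
WUE_PLUS_GWIF = 0.25 + 4.52  # L/kWh, combined factor from the report
GEMINI_ML = 0.26             # mL per 2026 Gemini query (report figure)

def clip_water_ml(energy_wh: float) -> float:
    # 1 Wh x 1 L/kWh = 1 mL, so the combined factor applies directly
    return energy_wh * WUE_PLUS_GWIF

clips_wh = {"5s 480p": 25, "10s 720p": 80, "30s 1080p": 300, "60s 4K": 1500}
for clip, wh in clips_wh.items():
    ml = clip_water_ml(wh)
    print(f"{clip}: {ml:,.0f} mL = {ml / GEMINI_ML:,.0f} Gemini queries")
```

This yields the per-clip values above (119 mL, 382 mL, 1.43 L, 7.16 L) and the Gemini-query equivalences for the two longer clips.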
One hamburger carries ~2,498 L total water footprint (Mekonnen & Hoekstra 2012, Ecosystems):
  • Fast images (2.4 mL each): 1,040,833 images per burger; generating one million quick AI images = one hamburger's water
  • Standard images (14.3 mL each): 174,685 images per burger
  • High images (47.7 mL each): 52,370 images per burger
  • Ultra images (95.4 mL each): 26,185 images per burger
  • 10s HD clips (382 mL each): 6,544 clips per burger, or ~18 years of daily creator-tier video generation
  • 60s 4K clips (7,155 mL each): 349 clips per burger, less than one year of daily 4K generation at Pro Studio rate
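Each burger equivalence above reduces to a single division; a sketch, assuming the 2,498 L footprint from Mekonnen & Hoekstra (the helper name is illustrative):

```python
BURGER_ML = 2498 * 1000  # ~2,498 L total water footprint per hamburger

def items_per_burger(ml_per_item: float) -> int:
    """How many generations carry the same water footprint as one burger."""
    return int(BURGER_ML / ml_per_item)

print(items_per_burger(2.4))   # fast images per burger
print(items_per_burger(382))   # 10s HD clips per burger
print(items_per_burger(7155))  # 60s 4K clips per burger
```

Small differences from the report's clip counts come from rounding the per-item water values before dividing.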
All tool figures apply to cloud API inference in professional data centers (WUE 0.15–0.55 L/kWh + grid GWIF ~4.52 L/kWh). Running Stable Diffusion locally changes the calculus:
  • Your GPU has no data center cooling loop: zero Scope 1 direct water evaporation
  • Scope 2 still applies: your grid's upstream thermoelectric generation (GWIF ~4.52 L/kWh national avg)
  • RTX 4090 at 400 W peak: a 512×512 image in ~0.5 s → ~0.056 Wh → ~0.25 mL per image (Scope 2 only)
  • A consumer GPU generates images at ~10× lower water cost than a cloud API (no cooling overhead, same compute time)
  • This also applies to video: local inference on a 4090 for a 10s clip might use ~8–15 Wh vs. 80 Wh on a data center GPU cluster
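The local-inference point can be checked directly: with no cooling loop at home, only the grid's water intensity applies. A sketch using the RTX 4090 figures above (the function name is illustrative):

```python
GWIF_L_PER_KWH = 4.52  # upstream grid water only; no Scope 1 cooling at home

def local_water_ml(power_w: float, seconds: float) -> float:
    """Scope 2 water (mL) for a local GPU run of given power and duration."""
    energy_wh = power_w * seconds / 3600
    return energy_wh * GWIF_L_PER_KWH  # Wh x L/kWh = mL

# RTX 4090: one 512x512 image in ~0.5 s at ~400 W peak
print(f"{local_water_ml(400, 0.5):.2f} mL per local image")
```

Compare ~0.25 mL locally with ~2.4 mL for the cloud fast tier: the ~10× gap is almost entirely the missing cooling and facility overhead.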
Efficiency techniques are already cutting per-output energy:
  • Diffusion distillation (LCM, TurboSD, Hyper-SD): 4-step inference vs. 30 steps → ~7.5× energy reduction per image at comparable quality
  • Flow matching (SD3, FLUX): deterministic trajectories with fewer NFE (number of function evaluations), 8–12 steps at high quality; ~2–3× energy reduction vs. DDPM
  • Video frame reuse (StreamDiffusion): delta diffusion for slow-motion scenes reduces effective NFE by 40–60% at similar temporal coherence
  • Linear-attention architectures (Mamba, SSMs for video): eliminate quadratic attention scaling → 3–5× lower compute for long-duration video generation
  • Flash-Attention-2 + Triton kernels: 30–40% power efficiency gain at constant throughput via reduced GPU memory bandwidth pressure
  • Trajectory: 2026 "fast" quality (0.5 Wh/image) will be standard quality by 2028 via distillation; today's ultra tier (20 Wh) may drop to ~5 Wh through efficient sampling pipelines
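Most of these levers act through the step count (NFE). A sketch of the implied savings, assuming energy scales roughly linearly with steps (an approximation; the tier numbers come from the report, the function name is illustrative):

```python
def energy_after_step_cut(base_wh: float, base_steps: int,
                          new_steps: int) -> float:
    """Approximate energy if denoising cost scales linearly with NFE."""
    return base_wh * new_steps / base_steps

# Standard SDXL tier (3 Wh at ~30 steps) distilled to a 4-step LCM sampler
lcm_wh = energy_after_step_cut(3, 30, 4)    # 0.4 Wh, a 7.5x reduction
# Flow matching at ~10 steps instead of 30
flux_wh = energy_after_step_cut(3, 30, 10)  # 1.0 Wh, a 3x reduction
print(lcm_wh, flux_wh)
```

At the report's combined 4.77 L/kWh water factor, the distilled 0.4 Wh image would fall from ~14.3 mL to under 2 mL of water.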

Sources

  1. Li, P. et al. (2023). Making AI Less Thirsty: Uncovering and Addressing the Secret Water Footprint of AI Models. arXiv:2311.16863.
  2. Zhao, S. et al. (2025). Energy and Water Consumption in AI-Generated Content: Image and Video Models. arXiv:2509.19222.
  3. Hao, K. (2025, May). How much energy does AI actually use? MIT Technology Review.
  4. Ren, S. et al. (2023). On the Energy and Water Consumption of Generative AI. arXiv:2304.03271.
  5. Lawrence Berkeley National Laboratory (2024). US Data Center Energy Use Report (GWIF 4.52 L/kWh).
  6. Mekonnen, M.M. & Hoekstra, A.Y. (2012). A global assessment of the water footprint of farm animal products. Ecosystems 15(3):401–415.
  7. Rombach, R. et al. (2022). High-Resolution Image Synthesis with Latent Diffusion Models. CVPR 2022 (SD/SDXL baseline architecture).
  8. OpenAI (2024). DALL-E 3 technical report. openai.com/research.
  9. Luo, S. et al. (2023). LCM: Latent Consistency Models. arXiv:2310.04378 (4-step distillation inference).
  10. Esser, P. et al. (2024). Scaling Rectified Flow Transformers for High-Resolution Image Synthesis (SD3 / FLUX). arXiv:2403.03206.
  11. Blattmann, A. et al. (2023). Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets. arXiv:2311.15127.
  12. Bar-Tal, O. et al. (2024). Lumiere: A Space-Time Diffusion Model for Video Generation. arXiv:2401.12945.
  13. AWWA (2022). US daily household indoor water use baseline: 341 L/day.
  14. Jegham, I. et al. (2025). Empirical energy benchmarking of 30 LLMs (Gemini 0.24 Wh/query). arXiv preprint.
  15. Google (2026). Ironwood TPU benchmarks and environmental disclosures. Google Environmental Reports.