Project Description
I’m building a fully automated pipeline that turns any .wav file into a finished, high-quality MP4 music video without the per-video fees of online generators like Neural Frames. The workflow needs to live in a clean Anaconda environment and rely on the latest releases of Flux together with Hunyuan-Video in GGUF format as the core engines.
Here’s the flow I want the script to cover:
• Ingest a user-supplied .wav
• Detect its BPM automatically and segment it into logical beats
• Transcribe the vocals, generate editable lyrics, and let me tweak or overwrite them before render time
• For roughly 65 beat-aligned segments, build a text prompt list (also editable) that feeds Flux + Hunyuan-Video to create matching frames
• Stitch the generated frames and original audio into a single MP4, perfectly synced
• Run headless so I can batch 1,000+ tracks with simple CLI commands; no GUI polish required, just stability, logging, and clear config files
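As a rough illustration of the segmentation step, here is a minimal sketch that cuts a track into about 65 beat-aligned segments. It assumes a constant tempo and a known duration; in the real pipeline the BPM would come from the audio library's beat tracker. The names `Segment` and `beat_aligned_segments` are made up for this example.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    index: int
    start: float  # seconds
    end: float    # seconds


def beat_aligned_segments(bpm: float, duration: float,
                          target_count: int = 65) -> list[Segment]:
    """Split [0, duration] into roughly target_count segments cut on beats.

    Assumption: the tempo is constant, so beats sit on a regular grid.
    """
    if bpm <= 0 or duration <= 0 or target_count <= 0:
        raise ValueError("bpm, duration and target_count must be positive")
    beat_len = 60.0 / bpm
    total_beats = max(1, int(duration / beat_len))
    # Use a whole number of beats per segment so every cut lands on a beat.
    beats_per_segment = max(1, round(total_beats / target_count))
    step = beats_per_segment * beat_len
    count = math.ceil(duration / step - 1e-9)
    return [
        Segment(i, i * step, min((i + 1) * step, duration))
        for i in range(count)
    ]
```

Because each segment spans a whole number of beats, the actual count can drift from 65 for some tempo and length combinations. That is probably acceptable given the "roughly 65" wording.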
Key expectations
• Python 3.11+ in Anaconda, modularized so models can be updated or swapped easily
• Output video must be 1080p or higher, H.264 in MP4 container
• External dependencies such as ffmpeg, whisper, librosa (or your preferred audio library) and any model weights should auto-download or be documented in the environment.yml
• A short README and sample run script that processes one demo song end-to-end
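To make the output expectation concrete, here is a hedged sketch of how the final mux could be assembled as an ffmpeg argument list: numbered frames plus the original .wav, encoded as H.264 in an MP4 at 1080p. The helper name `build_mux_command`, the 24 fps default, and the file naming are assumptions for illustration only.

```python
def build_mux_command(frames_pattern: str, wav_path: str, out_path: str,
                      fps: int = 24, width: int = 1920,
                      height: int = 1080) -> list[str]:
    """Build (but do not run) an ffmpeg command for the final MP4."""
    return [
        "ffmpeg", "-y",
        "-framerate", str(fps), "-i", frames_pattern,  # e.g. frames/%06d.png
        "-i", wav_path,                                # original audio
        "-vf", f"scale={width}:{height}",              # force target resolution
        "-c:v", "libx264", "-pix_fmt", "yuv420p",      # H.264, widely playable
        "-c:a", "aac",
        "-shortest",                                   # stop at the shorter stream
        out_path,
    ]
```

The list can be passed to `subprocess.run(cmd, check=True)` on a machine where ffmpeg is on the PATH. Note that `-shortest` truncates the audio if too few frames were rendered, so a length check before muxing is probably worthwhile.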
Acceptance
I’ll run the suite on a fresh machine, point it at a test .wav, adjust a couple of generated prompts, and get a synchronized MP4 with no manual video editing. If that passes and the code is clearly organized for scaling, the job is complete.
**Project Title:**
AI Music Video Generator (Desktop Software – Python)
---
## Overview
Develop a **desktop application** that generates full-length AI music videos from a WAV audio file and lyrics input. The software must provide **fine-grained manual control over every scene**, while also supporting AI-assisted automation.
The goal is to match and exceed tools like Neural Frames by offering:
* Scene-by-scene control
* Frame-by-frame prompting
* AI image and video generation
* Full synchronization with music and lyrics
---
## Core Functional Requirements
### 1. Audio Input & Analysis
* Import `.wav` audio files
* Automatically analyze:
  * BPM (tempo)
  * Beat structure
  * Song sections (intro, verse, chorus, drop)
* Generate a **timeline of scenes** based on audio segmentation
---
### 2. Lyrics Integration & Sync
* Input lyrics manually (paste text)
* Automatically align the pasted lyrics to audio timestamps using speech-to-text alignment
* Display lyrics synced to timeline
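Once an aligner has produced per-word timestamps, mapping lyrics onto the scene timeline is mostly bookkeeping. The sketch below assumes words arrive as `(text, start_seconds)` pairs and that scene start times are sorted. Both the input shape and the function name `assign_words_to_scenes` are assumptions for illustration.

```python
from bisect import bisect_right


def assign_words_to_scenes(words: list[tuple[str, float]],
                           scene_starts: list[float]) -> list[str]:
    """Return one lyric string per scene.

    Each word goes to the scene whose start time is the latest one
    at or before the word's own start time.
    """
    lines: list[list[str]] = [[] for _ in scene_starts]
    for text, start in words:
        idx = max(0, bisect_right(scene_starts, start) - 1)
        lines[idx].append(text)
    return [" ".join(parts) for parts in lines]
```

For example, with scenes starting at 0, 3 and 6 seconds, a word starting at 3.1 s lands in the second scene, and scenes without vocals get an empty string.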
---
### 3. Scene Timeline Editor
* Visual timeline of the entire song
* Split into 50–100 scenes (auto + manual override)
* Each scene:
  * Clickable
  * Plays only that segment of audio
* User can:
  * Adjust scene duration
  * Merge/split scenes
  * Reorder scenes
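The split and merge operations can be kept independent of the GUI by treating the timeline as a list of immutable scene records. This is only a sketch: the `Scene` fields and the rule that a merged scene keeps the first scene's prompts are assumptions, not requirements from this brief.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Scene:
    start: float              # seconds
    end: float                # seconds
    image_prompt: str = ""
    motion_prompt: str = ""


def split_scene(scenes: list[Scene], index: int, at: float) -> list[Scene]:
    """Split scenes[index] at time `at`; both halves inherit its prompts."""
    scene = scenes[index]
    if not scene.start < at < scene.end:
        raise ValueError("split point must fall strictly inside the scene")
    return (scenes[:index]
            + [replace(scene, end=at), replace(scene, start=at)]
            + scenes[index + 1:])


def merge_scenes(scenes: list[Scene], index: int) -> list[Scene]:
    """Merge scenes[index] with the next scene, keeping the first's prompts."""
    if not 0 <= index < len(scenes) - 1:
        raise IndexError("need a following scene to merge with")
    merged = replace(scenes[index], end=scenes[index + 1].end)
    return scenes[:index] + [merged] + scenes[index + 2:]
```

Both functions return new lists instead of mutating in place, which should make undo/redo in the editor easier to add later.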
---
### 4. Scene Prompt System
Each scene must allow:
* Image prompt input (what to generate visually)
* Motion prompt input (how it animates)
* Ability to preview and edit prompts per scene
---
### 5. Image Generation Engine
* Generate **high-quality images per scene**
* Support modern models (e.g. FLUX / Stable Diffusion-class)
* Batch or single-frame generation
* Save images per scene
---
### 6. Video Generation Engine
* Convert generated images into animated video clips
* Support:
  * Camera movement (zoom, pan, rotation)
  * Motion prompts
* Generate short clips (2–6 seconds per scene)
---
### 7. Clip Management System
* Store all generated clips per scene
* Allow:
  * Regenerate clips
  * Replace clips
  * Preview clips individually
---
### 8. Final Video Assembly
* Automatically stitch all clips together
* Ensure:
  * Correct timing
  * Smooth transitions
* Overlay original WAV audio
* Export final video (MP4)
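One possible way to handle assembly is ffmpeg's concat demuxer: write a list file naming the clips in scene order, then mux it with the original WAV. The sketch below only builds the list text and the argument list. The function names are hypothetical, and it produces hard cuts only, so smooth transitions would need an extra filter stage (for example ffmpeg's `xfade`) that is not covered here.

```python
def concat_list_text(clip_paths: list[str]) -> str:
    """Render the text of an ffmpeg concat-demuxer list file."""
    lines = []
    for path in clip_paths:
        # Inside single quotes, a literal quote is written as '\''.
        escaped = str(path).replace("'", "'\\''")
        lines.append(f"file '{escaped}'")
    return "\n".join(lines) + "\n"


def build_concat_command(list_file: str, wav_path: str,
                         out_path: str) -> list[str]:
    """Build (but do not run) the ffmpeg command for the final export."""
    return [
        "ffmpeg", "-y",
        "-f", "concat", "-safe", "0", "-i", list_file,  # clips in scene order
        "-i", wav_path,                                 # original audio
        "-map", "0:v", "-map", "1:a",                   # video from clips, audio from WAV
        "-c:v", "libx264", "-pix_fmt", "yuv420p",
        "-c:a", "aac",
        "-shortest",
        out_path,
    ]
```

Re-encoding with `libx264` (rather than stream copy) is chosen here on the assumption that clips from different generation runs may not share identical encoder settings.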
---
### 9. Playback System
* Preview:
* Individual scenes
* Full video
* Sync playback with audio
---
## Technical Requirements
### Language & Environment
* Python (Anaconda-compatible)
* Must run locally on Windows
### GUI Framework
* PySide6 (Qt-based modern interface)
### AI Integration
* Must support integration with:
  * Image generation models
  * Video generation models
* Modular backend (models can be swapped)
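A swappable backend could be as simple as a small interface plus a name-to-factory registry, so the rest of the application never imports a specific model. The sketch below is one hedged way to structure that. `ImageBackend`, `register_backend`, `create_backend` and the placeholder `DummyBackend` are all illustrative names, and no real model is loaded.

```python
from typing import Callable, Protocol


class ImageBackend(Protocol):
    def generate(self, prompt: str, out_path: str) -> str:
        """Render `prompt` to `out_path` and return the written path."""
        ...


_BACKENDS: dict[str, Callable[[], ImageBackend]] = {}


def register_backend(name: str):
    """Class decorator that makes a backend selectable by name."""
    def decorator(factory):
        _BACKENDS[name] = factory
        return factory
    return decorator


def create_backend(name: str) -> ImageBackend:
    try:
        return _BACKENDS[name]()
    except KeyError:
        raise ValueError(
            f"unknown backend {name!r}; available: {sorted(_BACKENDS)}"
        ) from None


@register_backend("dummy")
class DummyBackend:
    """Placeholder that returns the target path without running a model."""

    def generate(self, prompt: str, out_path: str) -> str:
        return out_path
```

With this shape, swapping models becomes a config change (the backend name) rather than a code change, and a video backend could follow the same pattern with its own interface.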
### Video Processing
* FFmpeg integration required
---
## File & Project Structure
Each project should store:
* Audio file
* Lyrics
* Scene data (JSON)
* Generated images
* Generated clips
* Final output
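For the scene data, a pair of dataclasses that round-trip through JSON would likely be enough to start with. The field names below (`image_file`, `clip_file`, and so on) are assumptions for illustration. Paths are stored as plain strings relative to the project folder.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class SceneRecord:
    index: int
    start: float                      # seconds
    end: float                        # seconds
    image_prompt: str = ""
    motion_prompt: str = ""
    image_file: Optional[str] = None  # set once an image is generated
    clip_file: Optional[str] = None   # set once a clip is generated


@dataclass
class Project:
    audio_file: str
    lyrics: str = ""
    scenes: list[SceneRecord] = field(default_factory=list)
    final_output: Optional[str] = None

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)

    @classmethod
    def from_json(cls, text: str) -> "Project":
        data = json.loads(text)
        scenes = [SceneRecord(**s) for s in data.pop("scenes", [])]
        return cls(scenes=scenes, **data)
```

Keeping generated images and clips as files on disk and only their paths in the JSON keeps the project file small and easy to diff.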
---
## Advanced Features (Preferred)
* Beat-synced cuts and transitions
* AI-assisted prompt generation
* Style consistency across scenes
* Character continuity (optional)
* GPU acceleration (CUDA support)
---
## Deliverables
* Fully working desktop application
* Clean, maintainable Python code
* Installation/setup instructions
* Ability to run locally without cloud dependency (preferred)
---
## Notes for Developer
* This is not a simple generator — it is a **production tool**
* User control over scenes is critical
* Performance and stability are important
* Modular design is required for future upgrades
---
## Objective
Create a tool that enables a single user to generate professional-quality AI music videos with full creative control, combining automation with manual direction.