---
license: apache-2.0
datasets:
- TempoFunk/webvid-10M
language:
- en
tags:
- text-to-video
base_model:
- ali-vilab/text-to-video-ms-1.7b
---

# caT text to video

Conditionally augmented text-to-video model. It uses pre-trained weights from the ModelScope text-to-video model, augmented with temporal conditioning transformers to extend generated clips and create smooth transitions between them. It also supports prompt interpolation, so the scene can change during clip extensions.

This model was trained at home as a hobby. Do not expect high-quality samples.

## Installation

### Clone the Repository

```bash
git clone https://github.com/motexture/caT-text-to-video-2.3b/
cd caT-text-to-video-2.3b
python3 -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`
pip install -r requirements.txt
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
python run.py
```

Visit the provided URL in your browser to interact with the interface and start generating videos.

## Examples

- A guy is riding a bike -> A guy is riding a motorcycle
- Will Smith is eating a hamburger -> Will Smith is eating an ice cream
- A lion is looking around -> A lion is running
- Darth Vader is surfing on the ocean
- A beautiful anime girl with pink hair -> Anime girl laughing
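Prompt interpolation, as used in the examples above (`prompt A -> prompt B`), blends the text conditioning of two prompts over the course of an extended clip. The sketch below is a minimal conceptual illustration of that idea, not this repository's actual implementation; the toy vectors stand in for real text-encoder embeddings:

```python
import numpy as np

def lerp_prompts(emb_a, emb_b, num_steps):
    """Linearly interpolate between two prompt embeddings.

    Returns one blended embedding per step, moving from emb_a
    (the first prompt) to emb_b (the second prompt).
    """
    weights = np.linspace(0.0, 1.0, num_steps)
    return [(1.0 - w) * emb_a + w * emb_b for w in weights]

# Toy 4-dimensional "embeddings" standing in for text-encoder output
a = np.array([1.0, 0.0, 0.0, 0.0])  # e.g. "A guy is riding a bike"
b = np.array([0.0, 1.0, 0.0, 0.0])  # e.g. "A guy is riding a motorcycle"
blended = lerp_prompts(a, b, 5)
```

In practice the model conditions each extension segment on such blended embeddings, which is why the transition between the two scenes appears gradual rather than as a hard cut.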