Evgeny Zhukov committed
Commit 2ba4412 · 1 Parent(s): 45e557f

Origin: https://github.com/ali-vilab/UniAnimate/commit/d7814fa44a0a1154524b92fce0e3133a2604d333

This view is limited to 50 files because it contains too many changes.
Files changed (50):
  1. .gitattributes +2 -0
  2. UniAnimate/.gitignore +18 -0
  3. UniAnimate/README.md +344 -0
  4. UniAnimate/configs/UniAnimate_infer.yaml +98 -0
  5. UniAnimate/configs/UniAnimate_infer_long.yaml +101 -0
  6. UniAnimate/dwpose/__init__.py +0 -0
  7. UniAnimate/dwpose/onnxdet.py +127 -0
  8. UniAnimate/dwpose/onnxpose.py +360 -0
  9. UniAnimate/dwpose/util.py +336 -0
  10. UniAnimate/dwpose/wholebody.py +48 -0
  11. UniAnimate/environment.yaml +236 -0
  12. UniAnimate/inference.py +18 -0
  13. UniAnimate/requirements.txt +201 -0
  14. UniAnimate/run_align_pose.py +712 -0
  15. UniAnimate/test_func/save_targer_keys.py +108 -0
  16. UniAnimate/test_func/test_EndDec.py +95 -0
  17. UniAnimate/test_func/test_dataset.py +152 -0
  18. UniAnimate/test_func/test_models.py +56 -0
  19. UniAnimate/test_func/test_save_video.py +24 -0
  20. UniAnimate/tools/__init__.py +3 -0
  21. UniAnimate/tools/datasets/__init__.py +2 -0
  22. UniAnimate/tools/datasets/image_dataset.py +86 -0
  23. UniAnimate/tools/datasets/video_dataset.py +118 -0
  24. UniAnimate/tools/inferences/__init__.py +2 -0
  25. UniAnimate/tools/inferences/inference_unianimate_entrance.py +483 -0
  26. UniAnimate/tools/inferences/inference_unianimate_long_entrance.py +508 -0
  27. UniAnimate/tools/modules/__init__.py +7 -0
  28. UniAnimate/tools/modules/autoencoder.py +690 -0
  29. UniAnimate/tools/modules/clip_embedder.py +212 -0
  30. UniAnimate/tools/modules/config.py +206 -0
  31. UniAnimate/tools/modules/diffusions/__init__.py +1 -0
  32. UniAnimate/tools/modules/diffusions/diffusion_ddim.py +1121 -0
  33. UniAnimate/tools/modules/diffusions/diffusion_gauss.py +498 -0
  34. UniAnimate/tools/modules/diffusions/losses.py +28 -0
  35. UniAnimate/tools/modules/diffusions/schedules.py +166 -0
  36. UniAnimate/tools/modules/embedding_manager.py +179 -0
  37. UniAnimate/tools/modules/unet/__init__.py +2 -0
  38. UniAnimate/tools/modules/unet/mha_flash.py +103 -0
  39. UniAnimate/tools/modules/unet/unet_unianimate.py +659 -0
  40. UniAnimate/tools/modules/unet/util.py +1741 -0
  41. UniAnimate/utils/__init__.py +0 -0
  42. UniAnimate/utils/assign_cfg.py +78 -0
  43. UniAnimate/utils/config.py +230 -0
  44. UniAnimate/utils/distributed.py +430 -0
  45. UniAnimate/utils/logging.py +90 -0
  46. UniAnimate/utils/mp4_to_gif.py +16 -0
  47. UniAnimate/utils/multi_port.py +9 -0
  48. UniAnimate/utils/optim/__init__.py +2 -0
  49. UniAnimate/utils/optim/adafactor.py +230 -0
  50. UniAnimate/utils/optim/lr_scheduler.py +58 -0
.gitattributes CHANGED
@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+
+ UniAnimate/data/** filter=lfs diff=lfs merge=lfs -text
UniAnimate/.gitignore ADDED
@@ -0,0 +1,18 @@
+ *.pkl
+ *.pt
+ *.mov
+ *.pth
+ *.mov
+ *.npz
+ *.npy
+ *.boj
+ *.onnx
+ *.tar
+ *.bin
+ cache*
+ .DS_Store
+ *DS_Store
+ outputs/
+ **/__pycache__
+ ***/__pycache__
+ */__pycache__
UniAnimate/README.md ADDED
@@ -0,0 +1,344 @@
1
+ <!-- main documents -->
2
+
3
+
4
+ <div align="center">
5
+
6
+
7
+ # UniAnimate: Taming Unified Video Diffusion Models for Consistent Human Image Animation
8
+
9
+ [Xiang Wang](https://scholar.google.com.hk/citations?user=cQbXvkcAAAAJ&hl=zh-CN&oi=sra)<sup>1</sup>, [Shiwei Zhang](https://scholar.google.com.hk/citations?user=ZO3OQ-8AAAAJ&hl=zh-CN)<sup>2</sup>, [Changxin Gao](https://scholar.google.com.hk/citations?user=4tku-lwAAAAJ&hl=zh-CN)<sup>1</sup>, [Jiayu Wang](#)<sup>2</sup>, [Xiaoqiang Zhou](https://scholar.google.com.hk/citations?user=Z2BTkNIAAAAJ&hl=zh-CN&oi=ao)<sup>3</sup>, [Yingya Zhang](https://scholar.google.com.hk/citations?user=16RDSEUAAAAJ&hl=zh-CN)<sup>2</sup> , [Luxin Yan](#)<sup>1</sup> , [Nong Sang](https://scholar.google.com.hk/citations?user=ky_ZowEAAAAJ&hl=zh-CN)<sup>1</sup>
10
+ <sup>1</sup>HUST &nbsp; <sup>2</sup>Alibaba Group &nbsp; <sup>3</sup>USTC
11
+
12
+
13
+ [🎨 Project Page](https://unianimate.github.io/)
14
+
15
+
16
+ <p align="middle">
17
+ <img src='https://img.alicdn.com/imgextra/i4/O1CN01bW2Y491JkHAUK4W0i_!!6000000001066-2-tps-2352-1460.png' width='784'>
18
+
19
+ Demo cases generated by the proposed UniAnimate
20
+ </p>
21
+
22
+
23
+ </div>
24
+
25
+ ## 🔥 News
26
+ - **[2024/07/19]** 🔥 We added a **<font color=red>noise prior</font>** to the code (refer to line 381: `noise = diffusion.q_sample(random_ref_frame.clone(), getattr(cfg, "noise_prior_value", 939), noise=noise)` in `tools/inferences/inference_unianimate_long_entrance.py`), which can help achieve better appearance preservation (such as the background), especially in long video generation; a minimal sketch of this noise prior is shown right after this news list. In addition, we are considering releasing an upgraded version of UniAnimate if we obtain an open-source license from the company.
27
+ - **[2024/06/26]** For cards with large GPU memory, such as the A100, we support parallel denoising of multiple segments to accelerate long video inference. You can change `context_batch_size: 1` in `configs/UniAnimate_infer_long.yaml` to a value greater than 1, such as `context_batch_size: 4`, which improves the inference speed to a certain extent.
28
+ - **[2024/06/15]** 🔥🔥🔥 By offloading CLIP and VAE and explicitly adding torch.float16 (i.e., set `CPU_CLIP_VAE: True` in `configs/UniAnimate_infer.yaml`), the GPU memory can be greatly reduced. Now generating a 32x768x512 video clip only requires **~12G GPU memory**. Refer to [this issue](https://github.com/ali-vilab/UniAnimate/issues/10) for more details. Thanks to [@blackight](https://github.com/blackight) for the contribution!
29
+ - **[2024/06/13]** **🔥🔥🔥 <font color=red>We released the code and models for human image animation, enjoy it!</font>**
30
+ - **[2024/06/13]** We have submitted the code to the company for approval, and **the code is expected to be released today or tomorrow**.
31
+ - **[2024/06/03]** We initialized this GitHub repository and planned to release the paper.
32
+
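+ Below is a minimal, self-contained sketch of the noise-prior idea mentioned in the 2024/07/19 news item. The real `diffusion.q_sample` in this repository may have a different signature; the toy schedule and tensor shapes here are illustrative assumptions only.
+
+ ```
+ import torch
+
+ def q_sample(x0, t, noise, alphas_cumprod):
+     # Forward-diffuse x0 to timestep t: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * noise
+     a_bar = alphas_cumprod[t]
+     return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
+
+ # Toy linear_sd-style schedule and a stand-in for the encoded reference frame (assumptions).
+ betas = torch.linspace(0.00085 ** 0.5, 0.0120 ** 0.5, 1000) ** 2
+ alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
+ ref_latent = torch.randn(1, 4, 32, 96, 64)
+ noise = torch.randn_like(ref_latent)
+
+ # Analogue of the quoted line: start denoising from the reference latent diffused to a
+ # late timestep (noise_prior_value, e.g. 939) instead of from pure Gaussian noise, which
+ # helps preserve low-frequency appearance such as the background.
+ noise = q_sample(ref_latent.clone(), 939, noise, alphas_cumprod)
+ ```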
33
+
34
+ ## TODO
35
+
36
+ - [x] Release the models, inference code, and pose alignment code.
37
+ - [x] Support generating both short and long videos.
38
+ - [ ] Release the models for longer video generation in one batch.
39
+ - [ ] Release models based on VideoLCM for faster video synthesis.
40
+ - [ ] Train the models on higher-resolution videos.
41
+
42
+
43
+
44
+ ## Introduction
45
+
46
+ <div align="center">
47
+ <p align="middle">
48
+ <img src='https://img.alicdn.com/imgextra/i3/O1CN01VvncFJ1ueRudiMOZu_!!6000000006062-2-tps-2654-1042.png' width='784'>
49
+
50
+ Overall framework of UniAnimate
51
+ </p>
52
+ </div>
53
+
54
+ Recent diffusion-based human image animation techniques have demonstrated impressive success in synthesizing videos that faithfully follow a given reference identity and a sequence of desired movement poses. Despite this, there are still two limitations: i) an extra reference model is required to align the identity image with the main video branch, which significantly increases the optimization burden and model parameters; ii) the generated video is usually short in time (e.g., 24 frames), hampering practical applications. To address these shortcomings, we present a UniAnimate framework to enable efficient and long-term human video generation. First, to reduce the optimization difficulty and ensure temporal coherence, we map the reference image along with the posture guidance and noise video into a common feature space by incorporating a unified video diffusion model. Second, we propose a unified noise input that supports random noised input as well as first frame conditioned input, which enhances the ability to generate long-term video. Finally, to further efficiently handle long sequences, we explore an alternative temporal modeling architecture based on state space model to replace the original computation-consuming temporal Transformer. Extensive experimental results indicate that UniAnimate achieves superior synthesis results over existing state-of-the-art counterparts in both quantitative and qualitative evaluations. Notably, UniAnimate can even generate highly consistent one-minute videos by iteratively employing the first frame conditioning strategy.
55
+
56
+
57
+ ## Getting Started with UniAnimate
58
+
59
+
60
+ ### (1) Installation
61
+
62
+ Install the Python dependencies:
63
+
64
+ ```
65
+ git clone https://github.com/ali-vilab/UniAnimate.git
66
+ cd UniAnimate
67
+ conda create -n UniAnimate python=3.9
68
+ conda activate UniAnimate
69
+ conda install pytorch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 pytorch-cuda=11.8 -c pytorch -c nvidia
70
+ pip install -r requirements.txt
71
+ ```
72
+ We also provide all the dependencies in `environment.yaml`.
73
+
74
+ **Note**: for the Windows operating system, you can refer to [this issue](https://github.com/ali-vilab/UniAnimate/issues/11) to install the dependencies. Thanks to [@zephirusgit](https://github.com/zephirusgit) for the contribution. If you encounter the problem `The shape of the 2D attn_mask is torch.Size([77, 77]), but should be (1, 1).`, please refer to [this issue](https://github.com/ali-vilab/UniAnimate/issues/61) to solve it; thanks to [@Isi-dev](https://github.com/Isi-dev) for the contribution.
75
+
76
+ ### (2) Download the pretrained checkpoints
77
+
78
+ Install ModelScope first (`pip install modelscope`), then download the models in Python:
+ ```
+ from modelscope.hub.snapshot_download import snapshot_download
+ model_dir = snapshot_download('iic/unianimate', cache_dir='checkpoints/')
+ ```
84
+ Then you may need the following command to move the checkpoints into the `checkpoints/` directory:
85
+ ```
86
+ mv ./checkpoints/iic/unianimate/* ./checkpoints/
87
+ ```
88
+
89
+ Finally, the model weights will be organized in `./checkpoints/` as follows:
90
+ ```
91
+ ./checkpoints/
92
+ |---- dw-ll_ucoco_384.onnx
93
+ |---- open_clip_pytorch_model.bin
94
+ |---- unianimate_16f_32f_non_ema_223000.pth
95
+ |---- v2-1_512-ema-pruned.ckpt
96
+ └---- yolox_l.onnx
97
+ ```
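+ As a quick sanity check (a minimal sketch, not part of the released scripts), you can verify that all expected weights are in place before running inference:
+
+ ```
+ from pathlib import Path
+
+ # File names taken from the checkpoint layout listed above.
+ expected = [
+     "dw-ll_ucoco_384.onnx",
+     "open_clip_pytorch_model.bin",
+     "unianimate_16f_32f_non_ema_223000.pth",
+     "v2-1_512-ema-pruned.ckpt",
+     "yolox_l.onnx",
+ ]
+ ckpt_dir = Path("checkpoints")
+ missing = [name for name in expected if not (ckpt_dir / name).exists()]
+ if missing:
+     raise FileNotFoundError(f"Missing checkpoint files in {ckpt_dir}: {missing}")
+ print("All UniAnimate checkpoints found.")
+ ```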
98
+
99
+ ### (3) Pose alignment **(Important)**
100
+
101
+ Rescale the target pose sequence to match the pose of the reference image:
102
+ ```
103
+ # reference image 1
104
+ python run_align_pose.py --ref_name data/images/WOMEN-Blouses_Shirts-id_00004955-01_4_full.jpg --source_video_paths data/videos/source_video.mp4 --saved_pose_dir data/saved_pose/WOMEN-Blouses_Shirts-id_00004955-01_4_full
105
+
106
+ # reference image 2
107
+ python run_align_pose.py --ref_name data/images/musk.jpg --source_video_paths data/videos/source_video.mp4 --saved_pose_dir data/saved_pose/musk
108
+
109
+ # reference image 3
110
+ python run_align_pose.py --ref_name data/images/WOMEN-Blouses_Shirts-id_00005125-03_4_full.jpg --source_video_paths data/videos/source_video.mp4 --saved_pose_dir data/saved_pose/WOMEN-Blouses_Shirts-id_00005125-03_4_full
111
+
112
+ # reference image 4
113
+ python run_align_pose.py --ref_name data/images/IMG_20240514_104337.jpg --source_video_paths data/videos/source_video.mp4 --saved_pose_dir data/saved_pose/IMG_20240514_104337
114
+ ```
115
+ We have already provided the processed target poses for the demo videos in `data/saved_pose`; if you run our demo video example, this step can be skipped. In addition, you need to install onnxruntime-gpu (`pip install onnxruntime-gpu==1.13.1`) to run pose alignment on the GPU.
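+ If you have several reference images, the commands above can also be driven from a small helper script. The loop below is only a convenience sketch (it assumes the demo directory layout shown above) and simply calls `run_align_pose.py` once per image:
+
+ ```
+ import subprocess
+ from pathlib import Path
+
+ source_video = "data/videos/source_video.mp4"
+ for ref in sorted(Path("data/images").glob("*.jpg")):
+     out_dir = Path("data/saved_pose") / ref.stem
+     subprocess.run(
+         ["python", "run_align_pose.py",
+          "--ref_name", str(ref),
+          "--source_video_paths", source_video,
+          "--saved_pose_dir", str(out_dir)],
+         check=True,
+     )
+ ```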
116
+
117
+ **<font color=red>&#10004; Some tips</font>**:
118
+
119
+ - > In pose alignment, the first frame in the target pose sequence is used to calculate the scale coefficient of the alignment. Therefore, if the first frame in the target pose sequence contains the entire face and body pose (including hands and feet), it helps obtain a more accurate estimate and better video generation results.
120
+
121
+
122
+
123
+ ### (4) Run the UniAnimate model to generate videos
124
+
125
+ #### (4.1) Generating video clips (32 frames with 768x512 resolution)
126
+
127
+ Execute the following command to generate video clips:
128
+ ```
129
+ python inference.py --cfg configs/UniAnimate_infer.yaml
130
+ ```
131
+ After this, 32-frame video clips with 768x512 resolution will be generated:
132
+
133
+
134
+ <table>
135
+ <center>
136
+ <tr>
137
+ <td ><center>
138
+ <image height="260" src="assets/1.gif"></image>
139
+ </center></td>
140
+ <td ><center>
141
+ <image height="260" src="assets/2.gif"></image>
142
+ </center></td>
143
+ </tr>
144
+ <tr>
145
+ <td ><center>
146
+ <p>Click <a href="https://cloud.video.taobao.com/vod/play/cEdJVkF4TXRTOTd2bTQ4andjMENYV1hTb2g3Zlpmb1E/Vk9HZHZkdDBUQzZQZWw1SnpKVVVCTlh4OVFON0V5UUVMUDduY1RJak82VE1sdXdHTjNOaHc9PQ">HERE</a> to view the generated video.</p>
147
+ </center></td>
148
+ <td ><center>
149
+ <p>Click <a href="https://cloud.video.taobao.com/vod/play/cEdJVkF4TXRTOTd2bTQ4andjMENYYzNUUWRKR043c1FaZkVHSkpSMnpoeTZQZWw1SnpKVVVCTlh4OVFON0V5UUVMUDduY1RJak82VE1sdXdHTjNOaHc9PQ">HERE</a> to view the generated video.</p>
150
+ </center></td>
151
+ </tr>
152
+ </center>
153
+ </table>
154
+ </center>
155
+
156
+ <!-- <table>
157
+ <center>
158
+ <tr>
159
+ <td ><center>
160
+ <video height="260" controls autoplay loop src="https://cloud.video.taobao.com/vod/play/cEdJVkF4TXRTOTd2bTQ4andjMENYYzNUUWRKR043c1FaZkVHSkpSMnpoeTZQZWw1SnpKVVVCTlh4OVFON0V5UUVMUDduY1RJak82VE1sdXdHTjNOaHc9PQ" muted="false"></video>
161
+ </td>
162
+ <td ><center>
163
+ <video height="260" controls autoplay loop src="https://cloud.video.taobao.com/vod/play/cEdJVkF4TXRTOTd2bTQ4andjMENYV1hTb2g3Zlpmb1E/Vk9HZHZkdDBUQzZQZWw1SnpKVVVCTlh4OVFON0V5UUVMUDduY1RJak82VE1sdXdHTjNOaHc9PQ" muted="false"></video>
164
+ </td>
165
+ </tr>
166
+ </table> -->
167
+
168
+
169
+ **<font color=red>&#10004; Some tips</font>**:
170
+
171
+ - > To run the model, **~12G** ~~26G~~ of GPU memory will be used. If your GPU has less memory than this, you can change `max_frames: 32` in `configs/UniAnimate_infer.yaml` to a smaller value, e.g., 24, 16, or 8. Our model is compatible with all of them.
172
+
173
+
174
+ #### (4.2) Generating video clips (32 frames with 1216x768 resolution)
175
+
176
+ If you want to synthesize higher-resolution results, you can change `resolution: [512, 768]` in `configs/UniAnimate_infer.yaml` to `resolution: [768, 1216]`, and then execute the following command to generate video clips:
177
+ ```
178
+ python inference.py --cfg configs/UniAnimate_infer.yaml
179
+ ```
180
+ After this, 32-frame video clips with 1216x768 resolution will be generated:
181
+
182
+
183
+
184
+ <table>
185
+ <center>
186
+ <tr>
187
+ <td ><center>
188
+ <image height="260" src="assets/3.gif"></image>
189
+ </center></td>
190
+ <td ><center>
191
+ <image height="260" src="assets/4.gif"></image>
192
+ </center></td>
193
+ </tr>
194
+ <tr>
195
+ <td ><center>
196
+ <p>Click <a href="https://cloud.video.taobao.com/vod/play/NTFJUWJ1YXphUzU5b3dhZHJlQk1YZjA3emppMWNJbHhXSlN6WmZHc2FTYTZQZWw1SnpKVVVCTlh4OVFON0V5UUVMUDduY1RJak82VE1sdXdHTjNOaHc9PQ">HERE</a> to view the generated video.</p>
197
+ </center></td>
198
+ <td ><center>
199
+ <p>Click <a href="https://cloud.video.taobao.com/vod/play/cEdJVkF4TXRTOTd2bTQ4andjMENYYklMcGdIRFlDcXcwVEU5ZnR0VlBpRzZQZWw1SnpKVVVCTlh4OVFON0V5UUVMUDduY1RJak82VE1sdXdHTjNOaHc9PQ">HERE</a> to view the generated video.</p>
200
+ </center></td>
201
+ </tr>
202
+ </center>
203
+ </table>
204
+ </center>
205
+
206
+ <!-- <table>
207
+ <center>
208
+ <tr>
209
+ <td ><center>
210
+ <video height="260" controls autoplay loop src="https://cloud.video.taobao.com/vod/play/NTFJUWJ1YXphUzU5b3dhZHJlQk1YZjA3emppMWNJbHhXSlN6WmZHc2FTYTZQZWw1SnpKVVVCTlh4OVFON0V5UUVMUDduY1RJak82VE1sdXdHTjNOaHc9PQ" muted="false"></video>
211
+ </td>
212
+ <td ><center>
213
+ <video height="260" controls autoplay loop src="https://cloud.video.taobao.com/vod/play/cEdJVkF4TXRTOTd2bTQ4andjMENYYklMcGdIRFlDcXcwVEU5ZnR0VlBpRzZQZWw1SnpKVVVCTlh4OVFON0V5UUVMUDduY1RJak82VE1sdXdHTjNOaHc9PQ" muted="false"></video>
214
+ </td>
215
+ </tr>
216
+ </table> -->
217
+
218
+
219
+ **<font color=red>&#10004; Some tips</font>**:
220
+
221
+ - > To run the model, **~21G** ~~36G~~ of GPU memory will be used. Even though our model was trained at 512x768 resolution, we observed that direct inference at 768x1216 usually works and produces satisfactory results. If this leads to inconsistent appearance, you can try a different seed or adjust the resolution to 512x768.
222
+
223
+ - > Although our model was not trained on 48 or 64 frames, we found that the model generalizes well to synthesis of these lengths.
224
+
225
+
226
+
227
+ In the `configs/UniAnimate_infer.yaml` configuration file, you can specify the data, adjust the video length via `max_frames`, and experiment with different diffusion settings, among other options.
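+ For example, instead of editing the YAML by hand you can derive a custom configuration programmatically. This is only a sketch and assumes PyYAML is installed; `UniAnimate_infer_custom.yaml` is a hypothetical output name:
+
+ ```
+ import yaml
+
+ with open("configs/UniAnimate_infer.yaml") as f:
+     cfg = yaml.safe_load(f)
+
+ cfg["max_frames"] = 16            # smaller clips for low-memory GPUs
+ cfg["resolution"] = [768, 1216]   # or keep [512, 768]
+ cfg["seed"] = 42
+
+ with open("configs/UniAnimate_infer_custom.yaml", "w") as f:
+     yaml.safe_dump(cfg, f, sort_keys=False)
+ # then run: python inference.py --cfg configs/UniAnimate_infer_custom.yaml
+ ```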
228
+
229
+
230
+
231
+ #### (4.3) Generating long videos
232
+
233
+
234
+
235
+ If you want to synthesize videos as long as the target pose sequence, you can execute the following command to generate long videos:
236
+ ```
237
+ python inference.py --cfg configs/UniAnimate_infer_long.yaml
238
+ ```
239
+ After this, long videos with 1216x768 resolution will be generated:
240
+
241
+
242
+
243
+ <table>
244
+ <center>
245
+ <tr>
246
+ <td ><center>
247
+ <image height="260" src="assets/5.gif"></image>
248
+ </center></td>
249
+ <td ><center>
250
+ <image height="260" src="assets/6.gif"></image>
251
+ </center></td>
252
+ </tr>
253
+ <tr>
254
+ <td ><center>
255
+ <p>Click <a href="https://cloud.video.taobao.com/vod/play/cEdJVkF4TXRTOTd2bTQ4andjMENYVmJKZUJSbDl6N1FXU01DYTlDRmJKTzZQZWw1SnpKVVVCTlh4OVFON0V5UUVMUDduY1RJak82VE1sdXdHTjNOaHc9PQ">HERE</a> to view the generated video.</p>
256
+ </center></td>
257
+ <td ><center>
258
+ <p>Click <a href="https://cloud.video.taobao.com/vod/play/VUdZTUE5MWtST3VtNEdFaVpGbHN1U25nNEorTEc2SzZROUNiUjNncW5ycTZQZWw1SnpKVVVCTlh4OVFON0V5UUVMUDduY1RJak82VE1sdXdHTjNOaHc9PQ">HERE</a> to view the generated video.</p>
259
+ </center></td>
260
+ </tr>
261
+ </center>
262
+ </table>
263
+ </center>
264
+
265
+ <!-- <table>
266
+ <center>
267
+ <tr>
268
+ <td ><center>
269
+ <video height="260" controls autoplay loop src="https://cloud.video.taobao.com/vod/play/cEdJVkF4TXRTOTd2bTQ4andjMENYVmJKZUJSbDl6N1FXU01DYTlDRmJKTzZQZWw1SnpKVVVCTlh4OVFON0V5UUVMUDduY1RJak82VE1sdXdHTjNOaHc9PQ" muted="false"></video>
270
+ </td>
271
+ <td ><center>
272
+ <video height="260" controls autoplay loop src="https://cloud.video.taobao.com/vod/play/VUdZTUE5MWtST3VtNEdFaVpGbHN1U25nNEorTEc2SzZROUNiUjNncW5ycTZQZWw1SnpKVVVCTlh4OVFON0V5UUVMUDduY1RJak82VE1sdXdHTjNOaHc9PQ" muted="false"></video>
273
+ </td>
274
+ </tr>
275
+ </table> -->
276
+
277
+
278
+
279
+
280
+ <table>
281
+ <center>
282
+ <tr>
283
+ <td ><center>
284
+ <image height="260" src="assets/7.gif"></image>
285
+ </center></td>
286
+ <td ><center>
287
+ <image height="260" src="assets/8.gif"></image>
288
+ </center></td>
289
+ </tr>
290
+ <tr>
291
+ <td ><center>
292
+ <p>Click <a href="https://cloud.video.taobao.com/vod/play/cEdJVkF4TXRTOTd2bTQ4andjMENYV04xKzd3eWFPVGZCQjVTUWdtbTFuQzZQZWw1SnpKVVVCTlh4OVFON0V5UUVMUDduY1RJak82VE1sdXdHTjNOaHc9PQ">HERE</a> to view the generated video.</p>
293
+ </center></td>
294
+ <td ><center>
295
+ <p>Click <a href="https://cloud.video.taobao.com/vod/play/cEdJVkF4TXRTOTd2bTQ4andjMENYWGwxVkNMY1NXOHpWTVdNZDRxKzRuZTZQZWw1SnpKVVVCTlh4OVFON0V5UUVMUDduY1RJak82VE1sdXdHTjNOaHc9PQ">HERE</a> to view the generated video.</p>
296
+ </center></td>
297
+ </tr>
298
+ </center>
299
+ </table>
300
+ </center>
301
+
302
+ <!-- <table>
303
+ <center>
304
+ <tr>
305
+ <td ><center>
306
+ <video height="260" controls autoplay loop src="https://cloud.video.taobao.com/vod/play/cEdJVkF4TXRTOTd2bTQ4andjMENYV04xKzd3eWFPVGZCQjVTUWdtbTFuQzZQZWw1SnpKVVVCTlh4OVFON0V5UUVMUDduY1RJak82VE1sdXdHTjNOaHc9PQ" muted="false"></video>
307
+ </td>
308
+ <td ><center>
309
+ <video height="260" controls autoplay loop src="https://cloud.video.taobao.com/vod/play/cEdJVkF4TXRTOTd2bTQ4andjMENYWGwxVkNMY1NXOHpWTVdNZDRxKzRuZTZQZWw1SnpKVVVCTlh4OVFON0V5UUVMUDduY1RJak82VE1sdXdHTjNOaHc9PQ" muted="false"></video>
310
+ </td>
311
+ </tr>
312
+ </table> -->
313
+
314
+ In the `configs/UniAnimate_infer_long.yaml` configuration file, `test_list_path` should be in the format of `[frame_interval, reference image, driving pose sequence]`, where `frame_interval=1` means that all frames in the target pose sequence will be used to generate the video, and `frame_interval=2` means that one frame is sampled every two frames. `reference image` is the path where the reference image is saved, and `driving pose sequence` is the path where the driving pose sequence is saved.
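+ As a small illustration of this format (a sketch only, using the demo paths from above), each entry can be unpacked and sanity-checked like this:
+
+ ```
+ from pathlib import Path
+
+ # Format: [frame_interval, reference image, driving pose sequence]
+ test_list_path = [
+     [2, "data/images/musk.jpg", "data/saved_pose/musk"],
+     [1, "data/images/WOMEN-Blouses_Shirts-id_00004955-01_4_full.jpg",
+      "data/saved_pose/WOMEN-Blouses_Shirts-id_00004955-01_4_full"],
+ ]
+
+ for frame_interval, ref_image, pose_dir in test_list_path:
+     assert frame_interval >= 1, "frame_interval must be a positive integer"
+     assert Path(ref_image).is_file(), f"missing reference image: {ref_image}"
+     assert Path(pose_dir).is_dir(), f"missing pose directory: {pose_dir}"
+ ```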
315
+
316
+
317
+
318
+ **<font color=red>&#10004; Some tips</font>**:
319
+
320
+ - > If you find inconsistent appearance, you can change the resolution from 768x1216 to 512x768, or change the `context_overlap` from 8 to 16.
321
+ - > In the default setting of `configs/UniAnimate_infer_long.yaml`, a sliding-window strategy with temporal overlap is used. You can also generate a satisfactory video segment first, and then input the last frame of this segment into the model to generate the next segment and continue the video (see the sketch after these tips).
322
+
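+ The sketch below shows one way to implement the "continue the video" tip above: extract the last frame of a previously generated segment and reuse it as the reference image for the next run. The paths are assumptions; adjust them to wherever your segments are saved.
+
+ ```
+ import cv2
+
+ def save_last_frame(video_path, out_image):
+     # Seek to the final frame of the generated segment and write it as an image.
+     cap = cv2.VideoCapture(video_path)
+     cap.set(cv2.CAP_PROP_POS_FRAMES, cap.get(cv2.CAP_PROP_FRAME_COUNT) - 1)
+     ok, frame = cap.read()
+     cap.release()
+     if not ok:
+         raise RuntimeError(f"Could not read the last frame of {video_path}")
+     cv2.imwrite(out_image, frame)
+
+ # Hypothetical paths: a previously generated segment and the next reference image.
+ save_last_frame("outputs/segment_001.mp4", "data/images/segment_001_last.jpg")
+ ```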
323
+
324
+
325
+ ## Citation
326
+
327
+ If you find this codebase useful for your research, please cite the following paper:
328
+
329
+ ```
330
+ @article{wang2024unianimate,
331
+ title={UniAnimate: Taming Unified Video Diffusion Models for Consistent Human Image Animation},
332
+ author={Wang, Xiang and Zhang, Shiwei and Gao, Changxin and Wang, Jiayu and Zhou, Xiaoqiang and Zhang, Yingya and Yan, Luxin and Sang, Nong},
333
+ journal={arXiv preprint arXiv:2406.01188},
334
+ year={2024}
335
+ }
336
+ ```
337
+
338
+
339
+
340
+ ## Disclaimer
341
+
342
+
343
+ This open-source model is intended for <strong>RESEARCH/NON-COMMERCIAL USE ONLY</strong>.
344
+ We explicitly disclaim any responsibility for user-generated content. Users are solely liable for their actions while using the generative model. The project contributors have no legal affiliation with, nor accountability for, users' behaviors. It is imperative to use the generative model responsibly, adhering to both ethical and legal standards.
UniAnimate/configs/UniAnimate_infer.yaml ADDED
@@ -0,0 +1,98 @@
1
+ # manual setting
2
+ max_frames: 32
3
+ resolution: [512, 768] # or resolution: [768, 1216]
4
+ # resolution: [768, 1216]
5
+ round: 1
6
+ ddim_timesteps: 30 # typically between 25 and 50
7
+ seed: 11 # 7
8
+ test_list_path: [
9
+ # Format: [frame_interval, reference image, driving pose sequence]
10
+ [2, "data/images/WOMEN-Blouses_Shirts-id_00004955-01_4_full.jpg", "data/saved_pose/WOMEN-Blouses_Shirts-id_00004955-01_4_full"],
11
+ [2, "data/images/musk.jpg", "data/saved_pose/musk"],
12
+ [2, "data/images/WOMEN-Blouses_Shirts-id_00005125-03_4_full.jpg", "data/saved_pose/WOMEN-Blouses_Shirts-id_00005125-03_4_full"],
13
+ [2, "data/images/IMG_20240514_104337.jpg", "data/saved_pose/IMG_20240514_104337"]
14
+ ]
15
+ partial_keys: [
16
+ ['image','local_image', "dwpose"], # reference image as the first frame of the generated video (optional)
17
+ ['image', 'randomref', "dwpose"],
18
+ ]
19
+
20
+
21
+
22
+ # default settings
23
+ TASK_TYPE: inference_unianimate_entrance
24
+ guide_scale: 2.5
25
+ vit_resolution: [224, 224]
26
+ use_fp16: True
27
+ batch_size: 1
28
+ latent_random_ref: True
29
+ chunk_size: 2
30
+ decoder_bs: 2
31
+ scale: 8
32
+ use_fps_condition: False
33
+ test_model: checkpoints/unianimate_16f_32f_non_ema_223000.pth
34
+ embedder: {
35
+ 'type': 'FrozenOpenCLIPTextVisualEmbedder',
36
+ 'layer': 'penultimate',
37
+ 'pretrained': 'checkpoints/open_clip_pytorch_model.bin'
38
+ }
39
+
40
+
41
+ auto_encoder: {
42
+ 'type': 'AutoencoderKL',
43
+ 'ddconfig': {
44
+ 'double_z': True,
45
+ 'z_channels': 4,
46
+ 'resolution': 256,
47
+ 'in_channels': 3,
48
+ 'out_ch': 3,
49
+ 'ch': 128,
50
+ 'ch_mult': [1, 2, 4, 4],
51
+ 'num_res_blocks': 2,
52
+ 'attn_resolutions': [],
53
+ 'dropout': 0.0,
54
+ 'video_kernel_size': [3, 1, 1]
55
+ },
56
+ 'embed_dim': 4,
57
+ 'pretrained': 'checkpoints/v2-1_512-ema-pruned.ckpt'
58
+ }
59
+
60
+ UNet: {
61
+ 'type': 'UNetSD_UniAnimate',
62
+ 'config': None,
63
+ 'in_dim': 4,
64
+ 'dim': 320,
65
+ 'y_dim': 1024,
66
+ 'context_dim': 1024,
67
+ 'out_dim': 4,
68
+ 'dim_mult': [1, 2, 4, 4],
69
+ 'num_heads': 8,
70
+ 'head_dim': 64,
71
+ 'num_res_blocks': 2,
72
+ 'dropout': 0.1,
73
+ 'temporal_attention': True,
74
+ 'num_tokens': 4,
75
+ 'temporal_attn_times': 1,
76
+ 'use_checkpoint': True,
77
+ 'use_fps_condition': False,
78
+ 'use_sim_mask': False
79
+ }
80
+ video_compositions: ['image', 'local_image', 'dwpose', 'randomref', 'randomref_pose']
81
+ Diffusion: {
82
+ 'type': 'DiffusionDDIM',
83
+ 'schedule': 'linear_sd',
84
+ 'schedule_param': {
85
+ 'num_timesteps': 1000,
86
+ "init_beta": 0.00085,
87
+ "last_beta": 0.0120,
88
+ 'zero_terminal_snr': True,
89
+ },
90
+ 'mean_type': 'v',
91
+ 'loss_type': 'mse',
92
+ 'var_type': 'fixed_small', # 'fixed_large',
93
+ 'rescale_timesteps': False,
94
+ 'noise_strength': 0.1
95
+ }
96
+ use_DiffusionDPM: False
97
+ CPU_CLIP_VAE: True
98
+ noise_prior_value: 949 # or 999, 949
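For context on the `Diffusion` block above: a `linear_sd` schedule with `zero_terminal_snr: True` is commonly constructed as sketched below. This is only an illustration under those assumptions; the repository's `tools/modules/diffusions/schedules.py` may implement it differently.

```
import numpy as np

def linear_sd_betas(num_timesteps=1000, init_beta=0.00085, last_beta=0.0120):
    # "Scaled linear" schedule: linear in sqrt(beta), as used by Stable Diffusion.
    return np.linspace(init_beta ** 0.5, last_beta ** 0.5, num_timesteps) ** 2

def rescale_zero_terminal_snr(betas):
    # Shift and scale sqrt(alpha_bar) so the final timestep has exactly zero SNR,
    # keeping the first value unchanged.
    alphas_bar = np.cumprod(1.0 - betas)
    sqrt_ab = np.sqrt(alphas_bar)
    ab_0, ab_T = sqrt_ab[0], sqrt_ab[-1]
    sqrt_ab = (sqrt_ab - ab_T) * ab_0 / (ab_0 - ab_T)
    alphas_bar = sqrt_ab ** 2
    alphas = np.concatenate([alphas_bar[:1], alphas_bar[1:] / alphas_bar[:-1]])
    return 1.0 - alphas

betas = rescale_zero_terminal_snr(linear_sd_betas())
print(betas[0], betas[-1])  # the last beta becomes 1.0, i.e. zero terminal SNR
```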
UniAnimate/configs/UniAnimate_infer_long.yaml ADDED
@@ -0,0 +1,101 @@
1
+ # manual setting
2
+ # resolution: [512, 768] # or [768, 1216]
3
+ resolution: [768, 1216]
4
+ round: 1
5
+ ddim_timesteps: 30 # typically between 25 and 50
6
+ context_size: 32
7
+ context_stride: 1
8
+ context_overlap: 8
9
+ seed: 7
10
+ max_frames: "None" # 64, 96, "None" mean the length of original pose sequence
11
+ test_list_path: [
12
+ # Format: [frame_interval, reference image, driving pose sequence]
13
+ [2, "data/images/WOMEN-Blouses_Shirts-id_00004955-01_4_full.jpg", "data/saved_pose/WOMEN-Blouses_Shirts-id_00004955-01_4_full"],
14
+ [2, "data/images/musk.jpg", "data/saved_pose/musk"],
15
+ [2, "data/images/WOMEN-Blouses_Shirts-id_00005125-03_4_full.jpg", "data/saved_pose/WOMEN-Blouses_Shirts-id_00005125-03_4_full"],
16
+ [2, "data/images/IMG_20240514_104337.jpg", "data/saved_pose/IMG_20240514_104337"],
17
+ [2, "data/images/IMG_20240514_104337.jpg", "data/saved_pose/IMG_20240514_104337_dance"],
18
+ [2, "data/images/WOMEN-Blouses_Shirts-id_00005125-03_4_full.jpg", "data/saved_pose/WOMEN-Blouses_Shirts-id_00005125-03_4_full_dance"]
19
+ ]
20
+
21
+
22
+ # default settings
23
+ TASK_TYPE: inference_unianimate_long_entrance
24
+ guide_scale: 2.5
25
+ vit_resolution: [224, 224]
26
+ use_fp16: True
27
+ batch_size: 1
28
+ latent_random_ref: True
29
+ chunk_size: 2
30
+ decoder_bs: 2
31
+ scale: 8
32
+ use_fps_condition: False
33
+ test_model: checkpoints/unianimate_16f_32f_non_ema_223000.pth
34
+ partial_keys: [
35
+ ['image', 'randomref', "dwpose"],
36
+ ]
37
+ embedder: {
38
+ 'type': 'FrozenOpenCLIPTextVisualEmbedder',
39
+ 'layer': 'penultimate',
40
+ 'pretrained': 'checkpoints/open_clip_pytorch_model.bin'
41
+ }
42
+
43
+
44
+ auto_encoder: {
45
+ 'type': 'AutoencoderKL',
46
+ 'ddconfig': {
47
+ 'double_z': True,
48
+ 'z_channels': 4,
49
+ 'resolution': 256,
50
+ 'in_channels': 3,
51
+ 'out_ch': 3,
52
+ 'ch': 128,
53
+ 'ch_mult': [1, 2, 4, 4],
54
+ 'num_res_blocks': 2,
55
+ 'attn_resolutions': [],
56
+ 'dropout': 0.0,
57
+ 'video_kernel_size': [3, 1, 1]
58
+ },
59
+ 'embed_dim': 4,
60
+ 'pretrained': 'checkpoints/v2-1_512-ema-pruned.ckpt'
61
+ }
62
+
63
+ UNet: {
64
+ 'type': 'UNetSD_UniAnimate',
65
+ 'config': None,
66
+ 'in_dim': 4,
67
+ 'dim': 320,
68
+ 'y_dim': 1024,
69
+ 'context_dim': 1024,
70
+ 'out_dim': 4,
71
+ 'dim_mult': [1, 2, 4, 4],
72
+ 'num_heads': 8,
73
+ 'head_dim': 64,
74
+ 'num_res_blocks': 2,
75
+ 'dropout': 0.1,
76
+ 'temporal_attention': True,
77
+ 'num_tokens': 4,
78
+ 'temporal_attn_times': 1,
79
+ 'use_checkpoint': True,
80
+ 'use_fps_condition': False,
81
+ 'use_sim_mask': False
82
+ }
83
+ video_compositions: ['image', 'local_image', 'dwpose', 'randomref', 'randomref_pose']
84
+ Diffusion: {
85
+ 'type': 'DiffusionDDIMLong',
86
+ 'schedule': 'linear_sd',
87
+ 'schedule_param': {
88
+ 'num_timesteps': 1000,
89
+ "init_beta": 0.00085,
90
+ "last_beta": 0.0120,
91
+ 'zero_terminal_snr': True,
92
+ },
93
+ 'mean_type': 'v',
94
+ 'loss_type': 'mse',
95
+ 'var_type': 'fixed_small',
96
+ 'rescale_timesteps': False,
97
+ 'noise_strength': 0.1
98
+ }
99
+ CPU_CLIP_VAE: True
100
+ context_batch_size: 1
101
+ noise_prior_value: 939 # or 999, 949
UniAnimate/dwpose/__init__.py ADDED
File without changes
UniAnimate/dwpose/onnxdet.py ADDED
@@ -0,0 +1,127 @@
1
+ import cv2
2
+ import numpy as np
3
+
4
+ import onnxruntime
5
+
6
+ def nms(boxes, scores, nms_thr):
7
+ """Single class NMS implemented in Numpy."""
8
+ x1 = boxes[:, 0]
9
+ y1 = boxes[:, 1]
10
+ x2 = boxes[:, 2]
11
+ y2 = boxes[:, 3]
12
+
13
+ areas = (x2 - x1 + 1) * (y2 - y1 + 1)
14
+ order = scores.argsort()[::-1]
15
+
16
+ keep = []
17
+ while order.size > 0:
18
+ i = order[0]
19
+ keep.append(i)
20
+ xx1 = np.maximum(x1[i], x1[order[1:]])
21
+ yy1 = np.maximum(y1[i], y1[order[1:]])
22
+ xx2 = np.minimum(x2[i], x2[order[1:]])
23
+ yy2 = np.minimum(y2[i], y2[order[1:]])
24
+
25
+ w = np.maximum(0.0, xx2 - xx1 + 1)
26
+ h = np.maximum(0.0, yy2 - yy1 + 1)
27
+ inter = w * h
28
+ ovr = inter / (areas[i] + areas[order[1:]] - inter)
29
+
30
+ inds = np.where(ovr <= nms_thr)[0]
31
+ order = order[inds + 1]
32
+
33
+ return keep
34
+
35
+ def multiclass_nms(boxes, scores, nms_thr, score_thr):
36
+ """Multiclass NMS implemented in Numpy. Class-aware version."""
37
+ final_dets = []
38
+ num_classes = scores.shape[1]
39
+ for cls_ind in range(num_classes):
40
+ cls_scores = scores[:, cls_ind]
41
+ valid_score_mask = cls_scores > score_thr
42
+ if valid_score_mask.sum() == 0:
43
+ continue
44
+ else:
45
+ valid_scores = cls_scores[valid_score_mask]
46
+ valid_boxes = boxes[valid_score_mask]
47
+ keep = nms(valid_boxes, valid_scores, nms_thr)
48
+ if len(keep) > 0:
49
+ cls_inds = np.ones((len(keep), 1)) * cls_ind
50
+ dets = np.concatenate(
51
+ [valid_boxes[keep], valid_scores[keep, None], cls_inds], 1
52
+ )
53
+ final_dets.append(dets)
54
+ if len(final_dets) == 0:
55
+ return None
56
+ return np.concatenate(final_dets, 0)
57
+
58
+ def demo_postprocess(outputs, img_size, p6=False):
59
+ grids = []
60
+ expanded_strides = []
61
+ strides = [8, 16, 32] if not p6 else [8, 16, 32, 64]
62
+
63
+ hsizes = [img_size[0] // stride for stride in strides]
64
+ wsizes = [img_size[1] // stride for stride in strides]
65
+
66
+ for hsize, wsize, stride in zip(hsizes, wsizes, strides):
67
+ xv, yv = np.meshgrid(np.arange(wsize), np.arange(hsize))
68
+ grid = np.stack((xv, yv), 2).reshape(1, -1, 2)
69
+ grids.append(grid)
70
+ shape = grid.shape[:2]
71
+ expanded_strides.append(np.full((*shape, 1), stride))
72
+
73
+ grids = np.concatenate(grids, 1)
74
+ expanded_strides = np.concatenate(expanded_strides, 1)
75
+ outputs[..., :2] = (outputs[..., :2] + grids) * expanded_strides
76
+ outputs[..., 2:4] = np.exp(outputs[..., 2:4]) * expanded_strides
77
+
78
+ return outputs
79
+
80
+ def preprocess(img, input_size, swap=(2, 0, 1)):
81
+ if len(img.shape) == 3:
82
+ padded_img = np.ones((input_size[0], input_size[1], 3), dtype=np.uint8) * 114
83
+ else:
84
+ padded_img = np.ones(input_size, dtype=np.uint8) * 114
85
+
86
+ r = min(input_size[0] / img.shape[0], input_size[1] / img.shape[1])
87
+ resized_img = cv2.resize(
88
+ img,
89
+ (int(img.shape[1] * r), int(img.shape[0] * r)),
90
+ interpolation=cv2.INTER_LINEAR,
91
+ ).astype(np.uint8)
92
+ padded_img[: int(img.shape[0] * r), : int(img.shape[1] * r)] = resized_img
93
+
94
+ padded_img = padded_img.transpose(swap)
95
+ padded_img = np.ascontiguousarray(padded_img, dtype=np.float32)
96
+ return padded_img, r
97
+
98
+ def inference_detector(session, oriImg):
99
+ input_shape = (640,640)
100
+ img, ratio = preprocess(oriImg, input_shape)
101
+
102
+ ort_inputs = {session.get_inputs()[0].name: img[None, :, :, :]}
103
+
104
+ output = session.run(None, ort_inputs)
105
+
106
+ predictions = demo_postprocess(output[0], input_shape)[0]
107
+
108
+ boxes = predictions[:, :4]
109
+ scores = predictions[:, 4:5] * predictions[:, 5:]
110
+
111
+ boxes_xyxy = np.ones_like(boxes)
112
+ boxes_xyxy[:, 0] = boxes[:, 0] - boxes[:, 2]/2.
113
+ boxes_xyxy[:, 1] = boxes[:, 1] - boxes[:, 3]/2.
114
+ boxes_xyxy[:, 2] = boxes[:, 0] + boxes[:, 2]/2.
115
+ boxes_xyxy[:, 3] = boxes[:, 1] + boxes[:, 3]/2.
116
+ boxes_xyxy /= ratio
117
+ dets = multiclass_nms(boxes_xyxy, scores, nms_thr=0.45, score_thr=0.1)
118
+ if dets is not None:
119
+ final_boxes, final_scores, final_cls_inds = dets[:, :4], dets[:, 4], dets[:, 5]
120
+ isscore = final_scores>0.3
121
+ iscat = final_cls_inds == 0
122
+ isbbox = [ i and j for (i, j) in zip(isscore, iscat)]
123
+ final_boxes = final_boxes[isbbox]
124
+ else:
125
+ final_boxes = np.array([])
126
+
127
+ return final_boxes
UniAnimate/dwpose/onnxpose.py ADDED
@@ -0,0 +1,360 @@
1
+ from typing import List, Tuple
2
+
3
+ import cv2
4
+ import numpy as np
5
+ import onnxruntime as ort
6
+
7
+ def preprocess(
8
+ img: np.ndarray, out_bbox, input_size: Tuple[int, int] = (192, 256)
9
+ ) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
10
+ """Do preprocessing for RTMPose model inference.
11
+
12
+ Args:
13
+ img (np.ndarray): Input image in shape.
14
+ input_size (tuple): Input image size in shape (w, h).
15
+
16
+ Returns:
17
+ tuple:
18
+ - resized_img (np.ndarray): Preprocessed image.
19
+ - center (np.ndarray): Center of image.
20
+ - scale (np.ndarray): Scale of image.
21
+ """
22
+ # get shape of image
23
+ img_shape = img.shape[:2]
24
+ out_img, out_center, out_scale = [], [], []
25
+ if len(out_bbox) == 0:
26
+ out_bbox = [[0, 0, img_shape[1], img_shape[0]]]
27
+ for i in range(len(out_bbox)):
28
+ x0 = out_bbox[i][0]
29
+ y0 = out_bbox[i][1]
30
+ x1 = out_bbox[i][2]
31
+ y1 = out_bbox[i][3]
32
+ bbox = np.array([x0, y0, x1, y1])
33
+
34
+ # get center and scale
35
+ center, scale = bbox_xyxy2cs(bbox, padding=1.25)
36
+
37
+ # do affine transformation
38
+ resized_img, scale = top_down_affine(input_size, scale, center, img)
39
+
40
+ # normalize image
41
+ mean = np.array([123.675, 116.28, 103.53])
42
+ std = np.array([58.395, 57.12, 57.375])
43
+ resized_img = (resized_img - mean) / std
44
+
45
+ out_img.append(resized_img)
46
+ out_center.append(center)
47
+ out_scale.append(scale)
48
+
49
+ return out_img, out_center, out_scale
50
+
51
+
52
+ def inference(sess: ort.InferenceSession, img: np.ndarray) -> np.ndarray:
53
+ """Inference RTMPose model.
54
+
55
+ Args:
56
+ sess (ort.InferenceSession): ONNXRuntime session.
57
+ img (np.ndarray): Input image in shape.
58
+
59
+ Returns:
60
+ outputs (np.ndarray): Output of RTMPose model.
61
+ """
62
+ all_out = []
63
+ # build input
64
+ for i in range(len(img)):
65
+ input = [img[i].transpose(2, 0, 1)]
66
+
67
+ # build output
68
+ sess_input = {sess.get_inputs()[0].name: input}
69
+ sess_output = []
70
+ for out in sess.get_outputs():
71
+ sess_output.append(out.name)
72
+
73
+ # run model
74
+ outputs = sess.run(sess_output, sess_input)
75
+ all_out.append(outputs)
76
+
77
+ return all_out
78
+
79
+
80
+ def postprocess(outputs: List[np.ndarray],
81
+ model_input_size: Tuple[int, int],
82
+ center: Tuple[int, int],
83
+ scale: Tuple[int, int],
84
+ simcc_split_ratio: float = 2.0
85
+ ) -> Tuple[np.ndarray, np.ndarray]:
86
+ """Postprocess for RTMPose model output.
87
+
88
+ Args:
89
+ outputs (np.ndarray): Output of RTMPose model.
90
+ model_input_size (tuple): RTMPose model Input image size.
91
+ center (tuple): Center of bbox in shape (x, y).
92
+ scale (tuple): Scale of bbox in shape (w, h).
93
+ simcc_split_ratio (float): Split ratio of simcc.
94
+
95
+ Returns:
96
+ tuple:
97
+ - keypoints (np.ndarray): Rescaled keypoints.
98
+ - scores (np.ndarray): Model predict scores.
99
+ """
100
+ all_key = []
101
+ all_score = []
102
+ for i in range(len(outputs)):
103
+ # use simcc to decode
104
+ simcc_x, simcc_y = outputs[i]
105
+ keypoints, scores = decode(simcc_x, simcc_y, simcc_split_ratio)
106
+
107
+ # rescale keypoints
108
+ keypoints = keypoints / model_input_size * scale[i] + center[i] - scale[i] / 2
109
+ all_key.append(keypoints[0])
110
+ all_score.append(scores[0])
111
+
112
+ return np.array(all_key), np.array(all_score)
113
+
114
+
115
+ def bbox_xyxy2cs(bbox: np.ndarray,
116
+ padding: float = 1.) -> Tuple[np.ndarray, np.ndarray]:
117
+ """Transform the bbox format from (x,y,w,h) into (center, scale)
118
+
119
+ Args:
120
+ bbox (ndarray): Bounding box(es) in shape (4,) or (n, 4), formatted
121
+ as (left, top, right, bottom)
122
+ padding (float): BBox padding factor that will be multilied to scale.
123
+ Default: 1.0
124
+
125
+ Returns:
126
+ tuple: A tuple containing center and scale.
127
+ - np.ndarray[float32]: Center (x, y) of the bbox in shape (2,) or
128
+ (n, 2)
129
+ - np.ndarray[float32]: Scale (w, h) of the bbox in shape (2,) or
130
+ (n, 2)
131
+ """
132
+ # convert single bbox from (4, ) to (1, 4)
133
+ dim = bbox.ndim
134
+ if dim == 1:
135
+ bbox = bbox[None, :]
136
+
137
+ # get bbox center and scale
138
+ x1, y1, x2, y2 = np.hsplit(bbox, [1, 2, 3])
139
+ center = np.hstack([x1 + x2, y1 + y2]) * 0.5
140
+ scale = np.hstack([x2 - x1, y2 - y1]) * padding
141
+
142
+ if dim == 1:
143
+ center = center[0]
144
+ scale = scale[0]
145
+
146
+ return center, scale
147
+
148
+
149
+ def _fix_aspect_ratio(bbox_scale: np.ndarray,
150
+ aspect_ratio: float) -> np.ndarray:
151
+ """Extend the scale to match the given aspect ratio.
152
+
153
+ Args:
154
+ scale (np.ndarray): The image scale (w, h) in shape (2, )
155
+ aspect_ratio (float): The ratio of ``w/h``
156
+
157
+ Returns:
158
+ np.ndarray: The reshaped image scale in (2, )
159
+ """
160
+ w, h = np.hsplit(bbox_scale, [1])
161
+ bbox_scale = np.where(w > h * aspect_ratio,
162
+ np.hstack([w, w / aspect_ratio]),
163
+ np.hstack([h * aspect_ratio, h]))
164
+ return bbox_scale
165
+
166
+
167
+ def _rotate_point(pt: np.ndarray, angle_rad: float) -> np.ndarray:
168
+ """Rotate a point by an angle.
169
+
170
+ Args:
171
+ pt (np.ndarray): 2D point coordinates (x, y) in shape (2, )
172
+ angle_rad (float): rotation angle in radian
173
+
174
+ Returns:
175
+ np.ndarray: Rotated point in shape (2, )
176
+ """
177
+ sn, cs = np.sin(angle_rad), np.cos(angle_rad)
178
+ rot_mat = np.array([[cs, -sn], [sn, cs]])
179
+ return rot_mat @ pt
180
+
181
+
182
+ def _get_3rd_point(a: np.ndarray, b: np.ndarray) -> np.ndarray:
183
+ """To calculate the affine matrix, three pairs of points are required. This
184
+ function is used to get the 3rd point, given 2D points a & b.
185
+
186
+ The 3rd point is defined by rotating vector `a - b` by 90 degrees
187
+ anticlockwise, using b as the rotation center.
188
+
189
+ Args:
190
+ a (np.ndarray): The 1st point (x,y) in shape (2, )
191
+ b (np.ndarray): The 2nd point (x,y) in shape (2, )
192
+
193
+ Returns:
194
+ np.ndarray: The 3rd point.
195
+ """
196
+ direction = a - b
197
+ c = b + np.r_[-direction[1], direction[0]]
198
+ return c
199
+
200
+
201
+ def get_warp_matrix(center: np.ndarray,
202
+ scale: np.ndarray,
203
+ rot: float,
204
+ output_size: Tuple[int, int],
205
+ shift: Tuple[float, float] = (0., 0.),
206
+ inv: bool = False) -> np.ndarray:
207
+ """Calculate the affine transformation matrix that can warp the bbox area
208
+ in the input image to the output size.
209
+
210
+ Args:
211
+ center (np.ndarray[2, ]): Center of the bounding box (x, y).
212
+ scale (np.ndarray[2, ]): Scale of the bounding box
213
+ wrt [width, height].
214
+ rot (float): Rotation angle (degree).
215
+ output_size (np.ndarray[2, ] | list(2,)): Size of the
216
+ destination heatmaps.
217
+ shift (0-100%): Shift translation ratio wrt the width/height.
218
+ Default (0., 0.).
219
+ inv (bool): Option to inverse the affine transform direction.
220
+ (inv=False: src->dst or inv=True: dst->src)
221
+
222
+ Returns:
223
+ np.ndarray: A 2x3 transformation matrix
224
+ """
225
+ shift = np.array(shift)
226
+ src_w = scale[0]
227
+ dst_w = output_size[0]
228
+ dst_h = output_size[1]
229
+
230
+ # compute transformation matrix
231
+ rot_rad = np.deg2rad(rot)
232
+ src_dir = _rotate_point(np.array([0., src_w * -0.5]), rot_rad)
233
+ dst_dir = np.array([0., dst_w * -0.5])
234
+
235
+ # get four corners of the src rectangle in the original image
236
+ src = np.zeros((3, 2), dtype=np.float32)
237
+ src[0, :] = center + scale * shift
238
+ src[1, :] = center + src_dir + scale * shift
239
+ src[2, :] = _get_3rd_point(src[0, :], src[1, :])
240
+
241
+ # get four corners of the dst rectangle in the input image
242
+ dst = np.zeros((3, 2), dtype=np.float32)
243
+ dst[0, :] = [dst_w * 0.5, dst_h * 0.5]
244
+ dst[1, :] = np.array([dst_w * 0.5, dst_h * 0.5]) + dst_dir
245
+ dst[2, :] = _get_3rd_point(dst[0, :], dst[1, :])
246
+
247
+ if inv:
248
+ warp_mat = cv2.getAffineTransform(np.float32(dst), np.float32(src))
249
+ else:
250
+ warp_mat = cv2.getAffineTransform(np.float32(src), np.float32(dst))
251
+
252
+ return warp_mat
253
+
254
+
255
+ def top_down_affine(input_size: dict, bbox_scale: dict, bbox_center: dict,
256
+ img: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
257
+ """Get the bbox image as the model input by affine transform.
258
+
259
+ Args:
260
+ input_size (dict): The input size of the model.
261
+ bbox_scale (dict): The bbox scale of the img.
262
+ bbox_center (dict): The bbox center of the img.
263
+ img (np.ndarray): The original image.
264
+
265
+ Returns:
266
+ tuple: A tuple containing center and scale.
267
+ - np.ndarray[float32]: img after affine transform.
268
+ - np.ndarray[float32]: bbox scale after affine transform.
269
+ """
270
+ w, h = input_size
271
+ warp_size = (int(w), int(h))
272
+
273
+ # reshape bbox to fixed aspect ratio
274
+ bbox_scale = _fix_aspect_ratio(bbox_scale, aspect_ratio=w / h)
275
+
276
+ # get the affine matrix
277
+ center = bbox_center
278
+ scale = bbox_scale
279
+ rot = 0
280
+ warp_mat = get_warp_matrix(center, scale, rot, output_size=(w, h))
281
+
282
+ # do affine transform
283
+ img = cv2.warpAffine(img, warp_mat, warp_size, flags=cv2.INTER_LINEAR)
284
+
285
+ return img, bbox_scale
286
+
287
+
288
+ def get_simcc_maximum(simcc_x: np.ndarray,
289
+ simcc_y: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
290
+ """Get maximum response location and value from simcc representations.
291
+
292
+ Note:
293
+ instance number: N
294
+ num_keypoints: K
295
+ heatmap height: H
296
+ heatmap width: W
297
+
298
+ Args:
299
+ simcc_x (np.ndarray): x-axis SimCC in shape (K, Wx) or (N, K, Wx)
300
+ simcc_y (np.ndarray): y-axis SimCC in shape (K, Wy) or (N, K, Wy)
301
+
302
+ Returns:
303
+ tuple:
304
+ - locs (np.ndarray): locations of maximum heatmap responses in shape
305
+ (K, 2) or (N, K, 2)
306
+ - vals (np.ndarray): values of maximum heatmap responses in shape
307
+ (K,) or (N, K)
308
+ """
309
+ N, K, Wx = simcc_x.shape
310
+ simcc_x = simcc_x.reshape(N * K, -1)
311
+ simcc_y = simcc_y.reshape(N * K, -1)
312
+
313
+ # get maximum value locations
314
+ x_locs = np.argmax(simcc_x, axis=1)
315
+ y_locs = np.argmax(simcc_y, axis=1)
316
+ locs = np.stack((x_locs, y_locs), axis=-1).astype(np.float32)
317
+ max_val_x = np.amax(simcc_x, axis=1)
318
+ max_val_y = np.amax(simcc_y, axis=1)
319
+
320
+ # get maximum value across x and y axis
321
+ mask = max_val_x > max_val_y
322
+ max_val_x[mask] = max_val_y[mask]
323
+ vals = max_val_x
324
+ locs[vals <= 0.] = -1
325
+
326
+ # reshape
327
+ locs = locs.reshape(N, K, 2)
328
+ vals = vals.reshape(N, K)
329
+
330
+ return locs, vals
331
+
332
+
333
+ def decode(simcc_x: np.ndarray, simcc_y: np.ndarray,
334
+ simcc_split_ratio) -> Tuple[np.ndarray, np.ndarray]:
335
+ """Modulate simcc distribution with Gaussian.
336
+
337
+ Args:
338
+ simcc_x (np.ndarray[K, Wx]): model predicted simcc in x.
339
+ simcc_y (np.ndarray[K, Wy]): model predicted simcc in y.
340
+ simcc_split_ratio (int): The split ratio of simcc.
341
+
342
+ Returns:
343
+ tuple: A tuple containing center and scale.
344
+ - np.ndarray[float32]: keypoints in shape (K, 2) or (n, K, 2)
345
+ - np.ndarray[float32]: scores in shape (K,) or (n, K)
346
+ """
347
+ keypoints, scores = get_simcc_maximum(simcc_x, simcc_y)
348
+ keypoints /= simcc_split_ratio
349
+
350
+ return keypoints, scores
351
+
352
+
353
+ def inference_pose(session, out_bbox, oriImg):
354
+ h, w = session.get_inputs()[0].shape[2:]
355
+ model_input_size = (w, h)
356
+ resized_img, center, scale = preprocess(oriImg, out_bbox, model_input_size)
357
+ outputs = inference(session, resized_img)
358
+ keypoints, scores = postprocess(outputs, model_input_size, center, scale)
359
+
360
+ return keypoints, scores
UniAnimate/dwpose/util.py ADDED
@@ -0,0 +1,336 @@
1
+ import math
2
+ import numpy as np
3
+ import matplotlib
4
+ import cv2
5
+
6
+
7
+ eps = 0.01
8
+
9
+
10
+ def smart_resize(x, s):
11
+ Ht, Wt = s
12
+ if x.ndim == 2:
13
+ Ho, Wo = x.shape
14
+ Co = 1
15
+ else:
16
+ Ho, Wo, Co = x.shape
17
+ if Co == 3 or Co == 1:
18
+ k = float(Ht + Wt) / float(Ho + Wo)
19
+ return cv2.resize(x, (int(Wt), int(Ht)), interpolation=cv2.INTER_AREA if k < 1 else cv2.INTER_LANCZOS4)
20
+ else:
21
+ return np.stack([smart_resize(x[:, :, i], s) for i in range(Co)], axis=2)
22
+
23
+
24
+ def smart_resize_k(x, fx, fy):
25
+ if x.ndim == 2:
26
+ Ho, Wo = x.shape
27
+ Co = 1
28
+ else:
29
+ Ho, Wo, Co = x.shape
30
+ Ht, Wt = Ho * fy, Wo * fx
31
+ if Co == 3 or Co == 1:
32
+ k = float(Ht + Wt) / float(Ho + Wo)
33
+ return cv2.resize(x, (int(Wt), int(Ht)), interpolation=cv2.INTER_AREA if k < 1 else cv2.INTER_LANCZOS4)
34
+ else:
35
+ return np.stack([smart_resize_k(x[:, :, i], fx, fy) for i in range(Co)], axis=2)
36
+
37
+
38
+ def padRightDownCorner(img, stride, padValue):
39
+ h = img.shape[0]
40
+ w = img.shape[1]
41
+
42
+ pad = 4 * [None]
43
+ pad[0] = 0 # up
44
+ pad[1] = 0 # left
45
+ pad[2] = 0 if (h % stride == 0) else stride - (h % stride) # down
46
+ pad[3] = 0 if (w % stride == 0) else stride - (w % stride) # right
47
+
48
+ img_padded = img
49
+ pad_up = np.tile(img_padded[0:1, :, :]*0 + padValue, (pad[0], 1, 1))
50
+ img_padded = np.concatenate((pad_up, img_padded), axis=0)
51
+ pad_left = np.tile(img_padded[:, 0:1, :]*0 + padValue, (1, pad[1], 1))
52
+ img_padded = np.concatenate((pad_left, img_padded), axis=1)
53
+ pad_down = np.tile(img_padded[-2:-1, :, :]*0 + padValue, (pad[2], 1, 1))
54
+ img_padded = np.concatenate((img_padded, pad_down), axis=0)
55
+ pad_right = np.tile(img_padded[:, -2:-1, :]*0 + padValue, (1, pad[3], 1))
56
+ img_padded = np.concatenate((img_padded, pad_right), axis=1)
57
+
58
+ return img_padded, pad
59
+
60
+
61
+ def transfer(model, model_weights):
62
+ transfered_model_weights = {}
63
+ for weights_name in model.state_dict().keys():
64
+ transfered_model_weights[weights_name] = model_weights['.'.join(weights_name.split('.')[1:])]
65
+ return transfered_model_weights
66
+
67
+
68
+ def draw_bodypose(canvas, candidate, subset):
69
+ H, W, C = canvas.shape
70
+ candidate = np.array(candidate)
71
+ subset = np.array(subset)
72
+
73
+ stickwidth = 4
74
+
75
+ limbSeq = [[2, 3], [2, 6], [3, 4], [4, 5], [6, 7], [7, 8], [2, 9], [9, 10], \
76
+ [10, 11], [2, 12], [12, 13], [13, 14], [2, 1], [1, 15], [15, 17], \
77
+ [1, 16], [16, 18], [3, 17], [6, 18]]
78
+
79
+ colors = [[255, 0, 0], [255, 85, 0], [255, 170, 0], [255, 255, 0], [170, 255, 0], [85, 255, 0], [0, 255, 0], \
80
+ [0, 255, 85], [0, 255, 170], [0, 255, 255], [0, 170, 255], [0, 85, 255], [0, 0, 255], [85, 0, 255], \
81
+ [170, 0, 255], [255, 0, 255], [255, 0, 170], [255, 0, 85]]
82
+
83
+ for i in range(17):
84
+ for n in range(len(subset)):
85
+ index = subset[n][np.array(limbSeq[i]) - 1]
86
+ if -1 in index:
87
+ continue
88
+ Y = candidate[index.astype(int), 0] * float(W)
89
+ X = candidate[index.astype(int), 1] * float(H)
90
+ mX = np.mean(X)
91
+ mY = np.mean(Y)
92
+ length = ((X[0] - X[1]) ** 2 + (Y[0] - Y[1]) ** 2) ** 0.5
93
+ angle = math.degrees(math.atan2(X[0] - X[1], Y[0] - Y[1]))
94
+ polygon = cv2.ellipse2Poly((int(mY), int(mX)), (int(length / 2), stickwidth), int(angle), 0, 360, 1)
95
+ cv2.fillConvexPoly(canvas, polygon, colors[i])
96
+
97
+ canvas = (canvas * 0.6).astype(np.uint8)
98
+
99
+ for i in range(18):
100
+ for n in range(len(subset)):
101
+ index = int(subset[n][i])
102
+ if index == -1:
103
+ continue
104
+ x, y = candidate[index][0:2]
105
+ x = int(x * W)
106
+ y = int(y * H)
107
+ cv2.circle(canvas, (int(x), int(y)), 4, colors[i], thickness=-1)
108
+
109
+ return canvas
110
+
111
+
112
+ def draw_body_and_foot(canvas, candidate, subset):
113
+ H, W, C = canvas.shape
114
+ candidate = np.array(candidate)
115
+ subset = np.array(subset)
116
+
117
+ stickwidth = 4
118
+
119
+ limbSeq = [[2, 3], [2, 6], [3, 4], [4, 5], [6, 7], [7, 8], [2, 9], [9, 10], \
120
+ [10, 11], [2, 12], [12, 13], [13, 14], [2, 1], [1, 15], [15, 17], \
121
+ [1, 16], [16, 18], [14,19], [11, 20]]
122
+
123
+ colors = [[255, 0, 0], [255, 85, 0], [255, 170, 0], [255, 255, 0], [170, 255, 0], [85, 255, 0], [0, 255, 0], \
124
+ [0, 255, 85], [0, 255, 170], [0, 255, 255], [0, 170, 255], [0, 85, 255], [0, 0, 255], [85, 0, 255], \
125
+ [170, 0, 255], [255, 0, 255], [255, 0, 170], [255, 0, 85], [170, 255, 255], [255, 255, 0]]
126
+
127
+ for i in range(19):
128
+ for n in range(len(subset)):
129
+ index = subset[n][np.array(limbSeq[i]) - 1]
130
+ if -1 in index:
131
+ continue
132
+ Y = candidate[index.astype(int), 0] * float(W)
133
+ X = candidate[index.astype(int), 1] * float(H)
134
+ mX = np.mean(X)
135
+ mY = np.mean(Y)
136
+ length = ((X[0] - X[1]) ** 2 + (Y[0] - Y[1]) ** 2) ** 0.5
137
+ angle = math.degrees(math.atan2(X[0] - X[1], Y[0] - Y[1]))
138
+ polygon = cv2.ellipse2Poly((int(mY), int(mX)), (int(length / 2), stickwidth), int(angle), 0, 360, 1)
139
+ cv2.fillConvexPoly(canvas, polygon, colors[i])
140
+
141
+ canvas = (canvas * 0.6).astype(np.uint8)
142
+
143
+ for i in range(20):
144
+ for n in range(len(subset)):
145
+ index = int(subset[n][i])
146
+ if index == -1:
147
+ continue
148
+ x, y = candidate[index][0:2]
149
+ x = int(x * W)
150
+ y = int(y * H)
151
+ cv2.circle(canvas, (int(x), int(y)), 4, colors[i], thickness=-1)
152
+
153
+ return canvas
154
+
155
+
156
+ def draw_handpose(canvas, all_hand_peaks):
157
+ H, W, C = canvas.shape
158
+
159
+ edges = [[0, 1], [1, 2], [2, 3], [3, 4], [0, 5], [5, 6], [6, 7], [7, 8], [0, 9], [9, 10], \
160
+ [10, 11], [11, 12], [0, 13], [13, 14], [14, 15], [15, 16], [0, 17], [17, 18], [18, 19], [19, 20]]
161
+
162
+ for peaks in all_hand_peaks:
163
+ peaks = np.array(peaks)
164
+
165
+ for ie, e in enumerate(edges):
166
+ x1, y1 = peaks[e[0]]
167
+ x2, y2 = peaks[e[1]]
168
+ x1 = int(x1 * W)
169
+ y1 = int(y1 * H)
170
+ x2 = int(x2 * W)
171
+ y2 = int(y2 * H)
172
+ if x1 > eps and y1 > eps and x2 > eps and y2 > eps:
173
+ cv2.line(canvas, (x1, y1), (x2, y2), matplotlib.colors.hsv_to_rgb([ie / float(len(edges)), 1.0, 1.0]) * 255, thickness=2)
174
+
175
+ for i, keyponit in enumerate(peaks):
176
+ x, y = keyponit
177
+ x = int(x * W)
178
+ y = int(y * H)
179
+ if x > eps and y > eps:
180
+ cv2.circle(canvas, (x, y), 4, (0, 0, 255), thickness=-1)
181
+ return canvas
182
+
183
+
184
+ def draw_facepose(canvas, all_lmks):
185
+ H, W, C = canvas.shape
186
+ for lmks in all_lmks:
187
+ lmks = np.array(lmks)
188
+ for lmk in lmks:
189
+ x, y = lmk
190
+ x = int(x * W)
191
+ y = int(y * H)
192
+ if x > eps and y > eps:
193
+ cv2.circle(canvas, (x, y), 3, (255, 255, 255), thickness=-1)
194
+ return canvas
195
+
196
+
197
+ # detect hand according to body pose keypoints
198
+ # please refer to https://github.com/CMU-Perceptual-Computing-Lab/openpose/blob/master/src/openpose/hand/handDetector.cpp
199
+ def handDetect(candidate, subset, oriImg):
200
+ # right hand: wrist 4, elbow 3, shoulder 2
201
+ # left hand: wrist 7, elbow 6, shoulder 5
202
+ ratioWristElbow = 0.33
203
+ detect_result = []
204
+ image_height, image_width = oriImg.shape[0:2]
205
+ for person in subset.astype(int):
206
+ # if any of three not detected
207
+ has_left = np.sum(person[[5, 6, 7]] == -1) == 0
208
+ has_right = np.sum(person[[2, 3, 4]] == -1) == 0
209
+ if not (has_left or has_right):
210
+ continue
211
+ hands = []
212
+ #left hand
213
+ if has_left:
214
+ left_shoulder_index, left_elbow_index, left_wrist_index = person[[5, 6, 7]]
215
+ x1, y1 = candidate[left_shoulder_index][:2]
216
+ x2, y2 = candidate[left_elbow_index][:2]
217
+ x3, y3 = candidate[left_wrist_index][:2]
218
+ hands.append([x1, y1, x2, y2, x3, y3, True])
219
+ # right hand
220
+ if has_right:
221
+ right_shoulder_index, right_elbow_index, right_wrist_index = person[[2, 3, 4]]
222
+ x1, y1 = candidate[right_shoulder_index][:2]
223
+ x2, y2 = candidate[right_elbow_index][:2]
224
+ x3, y3 = candidate[right_wrist_index][:2]
225
+ hands.append([x1, y1, x2, y2, x3, y3, False])
226
+
227
+ for x1, y1, x2, y2, x3, y3, is_left in hands:
228
+
229
+ x = x3 + ratioWristElbow * (x3 - x2)
230
+ y = y3 + ratioWristElbow * (y3 - y2)
231
+ distanceWristElbow = math.sqrt((x3 - x2) ** 2 + (y3 - y2) ** 2)
232
+ distanceElbowShoulder = math.sqrt((x2 - x1) ** 2 + (y2 - y1) ** 2)
233
+ width = 1.5 * max(distanceWristElbow, 0.9 * distanceElbowShoulder)
234
+ # x-y refers to the center --> offset to topLeft point
235
+ # handRectangle.x -= handRectangle.width / 2.f;
236
+ # handRectangle.y -= handRectangle.height / 2.f;
237
+ x -= width / 2
238
+ y -= width / 2 # width = height
239
+ # overflow the image
240
+ if x < 0: x = 0
241
+ if y < 0: y = 0
242
+ width1 = width
243
+ width2 = width
244
+ if x + width > image_width: width1 = image_width - x
245
+ if y + width > image_height: width2 = image_height - y
246
+ width = min(width1, width2)
247
+ # the max hand box value is 20 pixels
248
+ if width >= 20:
249
+ detect_result.append([int(x), int(y), int(width), is_left])
250
+
251
+ '''
252
+ return value: [[x, y, w, True if left hand else False]].
253
+ width=height since the network requires a square input.
254
+ x, y are the coordinates of the top-left corner
255
+ '''
256
+ return detect_result
257
+
258
+
259
+ # Written by Lvmin
260
+ def faceDetect(candidate, subset, oriImg):
261
+ # left right eye ear 14 15 16 17
262
+ detect_result = []
263
+ image_height, image_width = oriImg.shape[0:2]
264
+ for person in subset.astype(int):
265
+ has_head = person[0] > -1
266
+ if not has_head:
267
+ continue
268
+
269
+ has_left_eye = person[14] > -1
270
+ has_right_eye = person[15] > -1
271
+ has_left_ear = person[16] > -1
272
+ has_right_ear = person[17] > -1
273
+
274
+ if not (has_left_eye or has_right_eye or has_left_ear or has_right_ear):
275
+ continue
276
+
277
+ head, left_eye, right_eye, left_ear, right_ear = person[[0, 14, 15, 16, 17]]
278
+
279
+ width = 0.0
280
+ x0, y0 = candidate[head][:2]
281
+
282
+ if has_left_eye:
283
+ x1, y1 = candidate[left_eye][:2]
284
+ d = max(abs(x0 - x1), abs(y0 - y1))
285
+ width = max(width, d * 3.0)
286
+
287
+ if has_right_eye:
288
+ x1, y1 = candidate[right_eye][:2]
289
+ d = max(abs(x0 - x1), abs(y0 - y1))
290
+ width = max(width, d * 3.0)
291
+
292
+ if has_left_ear:
293
+ x1, y1 = candidate[left_ear][:2]
294
+ d = max(abs(x0 - x1), abs(y0 - y1))
295
+ width = max(width, d * 1.5)
296
+
297
+ if has_right_ear:
298
+ x1, y1 = candidate[right_ear][:2]
299
+ d = max(abs(x0 - x1), abs(y0 - y1))
300
+ width = max(width, d * 1.5)
301
+
302
+ x, y = x0, y0
303
+
304
+ x -= width
305
+ y -= width
306
+
307
+ if x < 0:
308
+ x = 0
309
+
310
+ if y < 0:
311
+ y = 0
312
+
313
+ width1 = width * 2
314
+ width2 = width * 2
315
+
316
+ if x + width > image_width:
317
+ width1 = image_width - x
318
+
319
+ if y + width > image_height:
320
+ width2 = image_height - y
321
+
322
+ width = min(width1, width2)
323
+
324
+ if width >= 20:
325
+ detect_result.append([int(x), int(y), int(width)])
326
+
327
+ return detect_result
328
+
329
+
330
+ # get max index of 2d array
331
+ def npmax(array):
332
+ arrayindex = array.argmax(1)
333
+ arrayvalue = array.max(1)
334
+ i = arrayvalue.argmax()
335
+ j = arrayindex[i]
336
+ return i, j
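
The hand box estimated by handDetect above follows the OpenPose heuristic: the box is centred on the wrist extrapolated one third of the way past the elbow, and its side is 1.5 * max(wrist-elbow distance, 0.9 * elbow-shoulder distance). A small numeric sketch of that geometry (the keypoint coordinates below are made up for illustration):

import math

# Hypothetical pixel coordinates for one arm (shoulder, elbow, wrist).
shoulder, elbow, wrist = (100.0, 100.0), (150.0, 160.0), (190.0, 210.0)

ratio_wrist_elbow = 0.33
cx = wrist[0] + ratio_wrist_elbow * (wrist[0] - elbow[0])  # box centre x
cy = wrist[1] + ratio_wrist_elbow * (wrist[1] - elbow[1])  # box centre y

d_wrist_elbow = math.dist(wrist, elbow)
d_elbow_shoulder = math.dist(elbow, shoulder)
side = 1.5 * max(d_wrist_elbow, 0.9 * d_elbow_shoulder)    # square side length

x, y = cx - side / 2, cy - side / 2                        # shift centre to top-left corner
print(int(x), int(y), int(side))                           # kept only if side >= 20
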
UniAnimate/dwpose/wholebody.py ADDED
@@ -0,0 +1,48 @@
1
+ import cv2
2
+ import numpy as np
3
+
4
+ import onnxruntime as ort
5
+ from dwpose.onnxdet import inference_detector
6
+ from dwpose.onnxpose import inference_pose
7
+
8
+ class Wholebody:
9
+ def __init__(self):
10
+ device = 'cuda'  # set to 'cpu' to run without a GPU
11
+ providers = ['CPUExecutionProvider'
12
+ ] if device == 'cpu' else ['CUDAExecutionProvider']
13
+ onnx_det = 'checkpoints/yolox_l.onnx'
14
+ onnx_pose = 'checkpoints/dw-ll_ucoco_384.onnx'
15
+
16
+ self.session_det = ort.InferenceSession(path_or_bytes=onnx_det, providers=providers)
17
+ self.session_pose = ort.InferenceSession(path_or_bytes=onnx_pose, providers=providers)
18
+
19
+ def __call__(self, oriImg):
20
+ det_result = inference_detector(self.session_det, oriImg)
21
+ keypoints, scores = inference_pose(self.session_pose, det_result, oriImg)
22
+
23
+ keypoints_info = np.concatenate(
24
+ (keypoints, scores[..., None]), axis=-1)
25
+ # compute neck joint
26
+ neck = np.mean(keypoints_info[:, [5, 6]], axis=1)
27
+ # the neck is marked visible only if both shoulders score above 0.3
28
+ neck[:, 2:4] = np.logical_and(
29
+ keypoints_info[:, 5, 2:4] > 0.3,
30
+ keypoints_info[:, 6, 2:4] > 0.3).astype(int)
31
+ new_keypoints_info = np.insert(
32
+ keypoints_info, 17, neck, axis=1)
33
+ mmpose_idx = [
34
+ 17, 6, 8, 10, 7, 9, 12, 14, 16, 13, 15, 2, 1, 4, 3
35
+ ]
36
+ openpose_idx = [
37
+ 1, 2, 3, 4, 6, 7, 8, 9, 10, 12, 13, 14, 15, 16, 17
38
+ ]
39
+ new_keypoints_info[:, openpose_idx] = \
40
+ new_keypoints_info[:, mmpose_idx]
41
+ keypoints_info = new_keypoints_info
42
+
43
+ keypoints, scores = keypoints_info[
44
+ ..., :2], keypoints_info[..., 2]
45
+
46
+ return keypoints, scores
47
+
48
+
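
A minimal usage sketch for the Wholebody wrapper above; it assumes the two ONNX checkpoints referenced in __init__ have already been downloaded to checkpoints/, and the image path is only an example:

import cv2
from dwpose.wholebody import Wholebody

estimator = Wholebody()                         # loads yolox_l.onnx and dw-ll_ucoco_384.onnx
frame = cv2.imread('data/images/example.jpg')   # hypothetical BGR input image
keypoints, scores = estimator(frame)

# keypoints: (num_person, K, 2) image coordinates, where K is the model's
# whole-body keypoint count plus the neck joint inserted at index 17, with
# the body joints remapped to OpenPose order; scores: (num_person, K).
print(keypoints.shape, scores.shape)
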
UniAnimate/environment.yaml ADDED
@@ -0,0 +1,236 @@
1
+ name: /mnt/user/miniconda3/envs/dtrans
2
+ channels:
3
+ - http://mirrors.aliyun.com/anaconda/pkgs/main
4
+ - defaults
5
+ dependencies:
6
+ - _libgcc_mutex=0.1=main
7
+ - _openmp_mutex=5.1=1_gnu
8
+ - ca-certificates=2023.12.12=h06a4308_0
9
+ - ld_impl_linux-64=2.38=h1181459_1
10
+ - libffi=3.4.4=h6a678d5_0
11
+ - libgcc-ng=11.2.0=h1234567_1
12
+ - libgomp=11.2.0=h1234567_1
13
+ - libstdcxx-ng=11.2.0=h1234567_1
14
+ - ncurses=6.4=h6a678d5_0
15
+ - openssl=3.0.12=h7f8727e_0
16
+ - pip=23.3.1=py39h06a4308_0
17
+ - python=3.9.18=h955ad1f_0
18
+ - readline=8.2=h5eee18b_0
19
+ - setuptools=68.2.2=py39h06a4308_0
20
+ - sqlite=3.41.2=h5eee18b_0
21
+ - tk=8.6.12=h1ccaba5_0
22
+ - wheel=0.41.2=py39h06a4308_0
23
+ - xz=5.4.5=h5eee18b_0
24
+ - zlib=1.2.13=h5eee18b_0
25
+ - pip:
26
+ - aiofiles==23.2.1
27
+ - aiohttp==3.9.1
28
+ - aiosignal==1.3.1
29
+ - aliyun-python-sdk-core==2.14.0
30
+ - aliyun-python-sdk-kms==2.16.2
31
+ - altair==5.2.0
32
+ - annotated-types==0.6.0
33
+ - antlr4-python3-runtime==4.9.3
34
+ - anyio==4.2.0
35
+ - argparse==1.4.0
36
+ - asttokens==2.4.1
37
+ - async-timeout==4.0.3
38
+ - attrs==23.2.0
39
+ - automat==22.10.0
40
+ - beartype==0.16.4
41
+ - blessed==1.20.0
42
+ - buildtools==1.0.6
43
+ - causal-conv1d==1.1.3.post1
44
+ - certifi==2023.11.17
45
+ - cffi==1.16.0
46
+ - chardet==5.2.0
47
+ - charset-normalizer==3.3.2
48
+ - clean-fid==0.1.35
49
+ - click==8.1.7
50
+ - clip==1.0
51
+ - cmake==3.28.1
52
+ - colorama==0.4.6
53
+ - coloredlogs==15.0.1
54
+ - constantly==23.10.4
55
+ - contourpy==1.2.0
56
+ - crcmod==1.7
57
+ - cryptography==41.0.7
58
+ - cycler==0.12.1
59
+ - decorator==5.1.1
60
+ - decord==0.6.0
61
+ - diffusers==0.26.3
62
+ - docopt==0.6.2
63
+ - easydict==1.11
64
+ - einops==0.7.0
65
+ - exceptiongroup==1.2.0
66
+ - executing==2.0.1
67
+ - fairscale==0.4.13
68
+ - fastapi==0.109.0
69
+ - ffmpeg==1.4
70
+ - ffmpy==0.3.1
71
+ - filelock==3.13.1
72
+ - flatbuffers==24.3.25
73
+ - fonttools==4.47.2
74
+ - frozenlist==1.4.1
75
+ - fsspec==2023.12.2
76
+ - ftfy==6.1.3
77
+ - furl==2.1.3
78
+ - gpustat==1.1.1
79
+ - gradio==4.14.0
80
+ - gradio-client==0.8.0
81
+ - greenlet==3.0.3
82
+ - h11==0.14.0
83
+ - httpcore==1.0.2
84
+ - httpx==0.26.0
85
+ - huggingface-hub==0.20.2
86
+ - humanfriendly==10.0
87
+ - hyperlink==21.0.0
88
+ - idna==3.6
89
+ - imageio==2.33.1
90
+ - imageio-ffmpeg==0.4.9
91
+ - importlib-metadata==7.0.1
92
+ - importlib-resources==6.1.1
93
+ - incremental==22.10.0
94
+ - ipdb==0.13.13
95
+ - ipython==8.18.1
96
+ - jedi==0.19.1
97
+ - jinja2==3.1.3
98
+ - jmespath==0.10.0
99
+ - joblib==1.3.2
100
+ - jsonschema==4.21.0
101
+ - jsonschema-specifications==2023.12.1
102
+ - kiwisolver==1.4.5
103
+ - kornia==0.7.1
104
+ - lazy-loader==0.3
105
+ - lightning-utilities==0.10.0
106
+ - lit==17.0.6
107
+ - lpips==0.1.4
108
+ - mamba-ssm==1.1.4
109
+ - markdown-it-py==3.0.0
110
+ - markupsafe==2.1.3
111
+ - matplotlib==3.8.2
112
+ - matplotlib-inline==0.1.6
113
+ - mdurl==0.1.2
114
+ - motion-vector-extractor==1.0.6
115
+ - mpmath==1.3.0
116
+ - multidict==6.0.4
117
+ - mypy-extensions==1.0.0
118
+ - networkx==3.2.1
119
+ - ninja==1.11.1.1
120
+ - numpy==1.26.3
121
+ - nvidia-cublas-cu11==11.10.3.66
122
+ - nvidia-cublas-cu12==12.1.3.1
123
+ - nvidia-cuda-cupti-cu11==11.7.101
124
+ - nvidia-cuda-cupti-cu12==12.1.105
125
+ - nvidia-cuda-nvrtc-cu11==11.7.99
126
+ - nvidia-cuda-nvrtc-cu12==12.1.105
127
+ - nvidia-cuda-runtime-cu11==11.7.99
128
+ - nvidia-cuda-runtime-cu12==12.1.105
129
+ - nvidia-cudnn-cu11==8.5.0.96
130
+ - nvidia-cudnn-cu12==8.9.2.26
131
+ - nvidia-cufft-cu11==10.9.0.58
132
+ - nvidia-cufft-cu12==11.0.2.54
133
+ - nvidia-curand-cu11==10.2.10.91
134
+ - nvidia-curand-cu12==10.3.2.106
135
+ - nvidia-cusolver-cu11==11.4.0.1
136
+ - nvidia-cusolver-cu12==11.4.5.107
137
+ - nvidia-cusparse-cu11==11.7.4.91
138
+ - nvidia-cusparse-cu12==12.1.0.106
139
+ - nvidia-ml-py==12.535.133
140
+ - nvidia-nccl-cu11==2.14.3
141
+ - nvidia-nccl-cu12==2.19.3
142
+ - nvidia-nvjitlink-cu12==12.3.101
143
+ - nvidia-nvtx-cu11==11.7.91
144
+ - nvidia-nvtx-cu12==12.1.105
145
+ - omegaconf==2.3.0
146
+ - onnxruntime==1.18.0
147
+ - open-clip-torch==2.24.0
148
+ - opencv-python==4.5.3.56
149
+ - opencv-python-headless==4.9.0.80
150
+ - orderedmultidict==1.0.1
151
+ - orjson==3.9.10
152
+ - oss2==2.18.4
153
+ - packaging==23.2
154
+ - pandas==2.1.4
155
+ - parso==0.8.3
156
+ - pexpect==4.9.0
157
+ - pillow==10.2.0
158
+ - piq==0.8.0
159
+ - pkgconfig==1.5.5
160
+ - prompt-toolkit==3.0.43
161
+ - protobuf==4.25.2
162
+ - psutil==5.9.8
163
+ - ptflops==0.7.2.2
164
+ - ptyprocess==0.7.0
165
+ - pure-eval==0.2.2
166
+ - pycparser==2.21
167
+ - pycryptodome==3.20.0
168
+ - pydantic==2.5.3
169
+ - pydantic-core==2.14.6
170
+ - pydub==0.25.1
171
+ - pygments==2.17.2
172
+ - pynvml==11.5.0
173
+ - pyparsing==3.1.1
174
+ - pyre-extensions==0.0.29
175
+ - python-dateutil==2.8.2
176
+ - python-multipart==0.0.6
177
+ - pytorch-lightning==2.1.3
178
+ - pytz==2023.3.post1
179
+ - pyyaml==6.0.1
180
+ - redo==2.0.4
181
+ - referencing==0.32.1
182
+ - regex==2023.12.25
183
+ - requests==2.31.0
184
+ - rich==13.7.0
185
+ - rotary-embedding-torch==0.5.3
186
+ - rpds-py==0.17.1
187
+ - ruff==0.2.0
188
+ - safetensors==0.4.1
189
+ - scikit-image==0.22.0
190
+ - scikit-learn==1.4.0
191
+ - scipy==1.11.4
192
+ - semantic-version==2.10.0
193
+ - sentencepiece==0.1.99
194
+ - shellingham==1.5.4
195
+ - simplejson==3.19.2
196
+ - six==1.16.0
197
+ - sk-video==1.1.10
198
+ - sniffio==1.3.0
199
+ - sqlalchemy==2.0.27
200
+ - stack-data==0.6.3
201
+ - starlette==0.35.1
202
+ - sympy==1.12
203
+ - thop==0.1.1-2209072238
204
+ - threadpoolctl==3.2.0
205
+ - tifffile==2023.12.9
206
+ - timm==0.9.12
207
+ - tokenizers==0.15.0
208
+ - tomli==2.0.1
209
+ - tomlkit==0.12.0
210
+ - toolz==0.12.0
211
+ - torch==2.0.1+cu118
212
+ - torchaudio==2.0.2+cu118
213
+ - torchdiffeq==0.2.3
214
+ - torchmetrics==1.3.0.post0
215
+ - torchsde==0.2.6
216
+ - torchvision==0.15.2+cu118
217
+ - tqdm==4.66.1
218
+ - traitlets==5.14.1
219
+ - trampoline==0.1.2
220
+ - transformers==4.36.2
221
+ - triton==2.0.0
222
+ - twisted==23.10.0
223
+ - typer==0.9.0
224
+ - typing-extensions==4.9.0
225
+ - typing-inspect==0.9.0
226
+ - tzdata==2023.4
227
+ - urllib3==2.1.0
228
+ - uvicorn==0.26.0
229
+ - wcwidth==0.2.13
230
+ - websockets==11.0.3
231
+ - xformers==0.0.20
232
+ - yarl==1.9.4
233
+ - zipp==3.17.0
234
+ - zope-interface==6.2
235
+ - onnxruntime-gpu==1.13.1
236
+ prefix: /mnt/user/miniconda3/envs/dtrans
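
The environment pins CUDA 11.8 builds of torch/torchvision alongside both onnxruntime and onnxruntime-gpu, so after installation it is worth checking that the GPU providers are actually visible. A quick, optional sanity check:

import torch
import onnxruntime as ort

print(torch.__version__, torch.version.cuda, torch.cuda.is_available())
# CUDAExecutionProvider should appear here for GPU pose extraction with DWPose.
print(ort.get_available_providers())
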
UniAnimate/inference.py ADDED
@@ -0,0 +1,18 @@
1
+ import os
2
+ import sys
3
+ import copy
4
+ import json
5
+ import math
6
+ import random
7
+ import logging
8
+ import itertools
9
+ import numpy as np
10
+
11
+ from utils.config import Config
12
+ from utils.registry_class import INFER_ENGINE
13
+
14
+ from tools import *
15
+
16
+ if __name__ == '__main__':
17
+ cfg_update = Config(load=True)
18
+ INFER_ENGINE.build(dict(type=cfg_update.TASK_TYPE), cfg_update=cfg_update.cfg_dict)
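
Config(load=True) parses the command-line arguments (including the YAML config) into cfg_update, and INFER_ENGINE.build then dispatches on the TASK_TYPE string, presumably to one of the entrance functions under tools/inferences/. A rough stand-in for that registry pattern, using simplified names (DemoRegistry, run_demo) that are not part of the repository:

class DemoRegistry:
    """Simplified illustration of a type-dispatching registry."""
    def __init__(self):
        self._fns = {}

    def register_function(self, fn):
        self._fns[fn.__name__] = fn
        return fn

    def build(self, spec, **extra):
        spec = dict(spec)
        fn = self._fns[spec.pop('type')]   # the 'type' key selects the entry point
        return fn(**spec, **extra)

INFER_ENGINE_DEMO = DemoRegistry()

@INFER_ENGINE_DEMO.register_function
def run_demo(cfg_update=None):
    print('would launch inference with config:', cfg_update)

INFER_ENGINE_DEMO.build(dict(type='run_demo'), cfg_update={'seed': 13})
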
UniAnimate/requirements.txt ADDED
@@ -0,0 +1,201 @@
1
+ # antlr4-python3-runtime==4.9.3
2
+ # anyio==4.2.0
3
+ # asttokens==2.4.1
4
+ # async-timeout==4.0.3
5
+ # attrs==23.2.0
6
+ # Automat==22.10.0
7
+ # beartype==0.16.4
8
+ # blessed==1.20.0
9
+ # buildtools==1.0.6
10
+ # # causal-conv1d==1.1.3.post1
11
+ # certifi==2023.11.17
12
+ # cffi==1.16.0
13
+ # chardet==5.2.0
14
+ # charset-normalizer==3.3.2
15
+ # clean-fid==0.1.35
16
+ # click==8.1.7
17
+ # # clip==1.0
18
+ # cmake==3.28.1
19
+ # colorama==0.4.6
20
+ # coloredlogs==15.0.1
21
+ # constantly==23.10.4
22
+ # contourpy==1.2.0
23
+ # crcmod==1.7
24
+ # cryptography==41.0.7
25
+ # cycler==0.12.1
26
+ # decorator==5.1.1
27
+ # decord==0.6.0
28
+ # diffusers==0.26.3
29
+ # docopt==0.6.2
30
+ easydict==1.11
31
+ einops==0.7.0
32
+ # exceptiongroup==1.2.0
33
+ # executing==2.0.1
34
+ fairscale==0.4.13
35
+ # fastapi==0.109.0
36
+ # ffmpeg==1.4
37
+ # ffmpy==0.3.1
38
+ # filelock==3.13.1
39
+ # flatbuffers==24.3.25
40
+ # fonttools==4.47.2
41
+ # frozenlist==1.4.1
42
+ # fsspec==2023.12.2
43
+ # ftfy==6.1.3
44
+ # furl==2.1.3
45
+ # gpustat==1.1.1
46
+ # gradio==4.14.0
47
+ # gradio_client==0.8.0
48
+ # greenlet==3.0.3
49
+ # h11==0.14.0
50
+ # httpcore==1.0.2
51
+ # httpx==0.26.0
52
+ # huggingface-hub==0.20.2
53
+ # humanfriendly==10.0
54
+ # hyperlink==21.0.0
55
+ # idna==3.6
56
+ imageio==2.33.1
57
+ imageio-ffmpeg==0.4.9
58
+ # importlib-metadata==7.0.1
59
+ # importlib-resources==6.1.1
60
+ # incremental==22.10.0
61
+ # ipdb==0.13.13
62
+ # ipython==8.18.1
63
+ # jedi==0.19.1
64
+ # Jinja2==3.1.3
65
+ # jmespath==0.10.0
66
+ # joblib==1.3.2
67
+ # jsonschema==4.21.0
68
+ # jsonschema-specifications==2023.12.1
69
+ # kiwisolver==1.4.5
70
+ # kornia==0.7.1
71
+ # lazy_loader==0.3
72
+ # lightning-utilities==0.10.0
73
+ # lit==17.0.6
74
+ # lpips==0.1.4
75
+ # markdown-it-py==3.0.0
76
+ # MarkupSafe==2.1.3
77
+ matplotlib==3.8.2
78
+ matplotlib-inline==0.1.6
79
+ # mdurl==0.1.2
80
+ # # motion-vector-extractor==1.0.6
81
+ # mpmath==1.3.0
82
+ # multidict==6.0.4
83
+ # mypy-extensions==1.0.0
84
+ # networkx==3.2.1
85
+ # ninja==1.11.1.1
86
+ # numpy==1.26.3
87
+ # nvidia-cublas-cu11==11.10.3.66
88
+ # nvidia-cublas-cu12==12.1.3.1
89
+ # nvidia-cuda-cupti-cu11==11.7.101
90
+ # nvidia-cuda-cupti-cu12==12.1.105
91
+ # nvidia-cuda-nvrtc-cu11==11.7.99
92
+ # nvidia-cuda-nvrtc-cu12==12.1.105
93
+ # nvidia-cuda-runtime-cu11==11.7.99
94
+ # nvidia-cuda-runtime-cu12==12.1.105
95
+ # nvidia-cudnn-cu11==8.5.0.96
96
+ # nvidia-cudnn-cu12==8.9.2.26
97
+ # nvidia-cufft-cu11==10.9.0.58
98
+ # nvidia-cufft-cu12==11.0.2.54
99
+ # nvidia-curand-cu11==10.2.10.91
100
+ # nvidia-curand-cu12==10.3.2.106
101
+ # nvidia-cusolver-cu11==11.4.0.1
102
+ # nvidia-cusolver-cu12==11.4.5.107
103
+ # nvidia-cusparse-cu11==11.7.4.91
104
+ # nvidia-cusparse-cu12==12.1.0.106
105
+ # nvidia-ml-py==12.535.133
106
+ # nvidia-nccl-cu11==2.14.3
107
+ # nvidia-nccl-cu12==2.19.3
108
+ # nvidia-nvjitlink-cu12==12.3.101
109
+ # nvidia-nvtx-cu11==11.7.91
110
+ # nvidia-nvtx-cu12==12.1.105
111
+ # omegaconf==2.3.0
112
+ onnxruntime==1.18.0
113
+ open-clip-torch==2.24.0
114
+ opencv-python==4.5.3.56
115
+ opencv-python-headless==4.9.0.80
116
+ # orderedmultidict==1.0.1
117
+ # orjson==3.9.10
118
+ oss2==2.18.4
119
+ # # packaging==23.2
120
+ # pandas==2.1.4
121
+ # parso==0.8.3
122
+ # pexpect==4.9.0
123
+ pillow==10.2.0
124
+ # piq==0.8.0
125
+ # pkgconfig==1.5.5
126
+ # prompt-toolkit==3.0.43
127
+ # protobuf==4.25.2
128
+ # psutil==5.9.8
129
+ ptflops==0.7.2.2
130
+ # ptyprocess==0.7.0
131
+ # pure-eval==0.2.2
132
+ # pycparser==2.21
133
+ # pycryptodome==3.20.0
134
+ # pydantic==2.5.3
135
+ # pydantic_core==2.14.6
136
+ # pydub==0.25.1
137
+ # Pygments==2.17.2
138
+ pynvml==11.5.0
139
+ # pyparsing==3.1.1
140
+ # pyre-extensions==0.0.29
141
+ # python-dateutil==2.8.2
142
+ # python-multipart==0.0.6
143
+ # pytorch-lightning==2.1.3
144
+ # pytz==2023.3.post1
145
+ PyYAML==6.0.1
146
+ # redo==2.0.4
147
+ # referencing==0.32.1
148
+ # regex==2023.12.25
149
+ requests==2.31.0
150
+ # rich==13.7.0
151
+ rotary-embedding-torch==0.5.3
152
+ # rpds-py==0.17.1
153
+ # ruff==0.2.0
154
+ # safetensors==0.4.1
155
+ # scikit-image==0.22.0
156
+ # scikit-learn==1.4.0
157
+ # scipy==1.11.4
158
+ # semantic-version==2.10.0
159
+ # sentencepiece==0.1.99
160
+ # shellingham==1.5.4
161
+ simplejson==3.19.2
162
+ # six==1.16.0
163
+ # sk-video==1.1.10
164
+ # sniffio==1.3.0
165
+ # SQLAlchemy==2.0.27
166
+ # stack-data==0.6.3
167
+ # starlette==0.35.1
168
+ # sympy==1.12
169
+ thop==0.1.1.post2209072238
170
+ # threadpoolctl==3.2.0
171
+ # tifffile==2023.12.9
172
+ # timm==0.9.12
173
+ # tokenizers==0.15.0
174
+ # tomli==2.0.1
175
+ # tomlkit==0.12.0
176
+ # toolz==0.12.0
177
+ torch==2.0.1+cu118
178
+ # torchaudio==2.0.2+cu118
179
+ # torchdiffeq==0.2.3
180
+ # torchmetrics==1.3.0.post0
181
+ torchsde==0.2.6
182
+ torchvision==0.15.2+cu118
183
+ tqdm==4.66.1
184
+ # traitlets==5.14.1
185
+ # trampoline==0.1.2
186
+ # transformers==4.36.2
187
+ # triton==2.0.0
188
+ # Twisted==23.10.0
189
+ # typer==0.9.0
190
+ typing-inspect==0.9.0
191
+ typing_extensions==4.9.0
192
+ # tzdata==2023.4
193
+ # urllib3==2.1.0
194
+ # uvicorn==0.26.0
195
+ # wcwidth==0.2.13
196
+ # websockets==11.0.3
197
+ xformers==0.0.20
198
+ # yarl==1.9.4
199
+ # zipp==3.17.0
200
+ # zope.interface==6.2
201
+ onnxruntime-gpu==1.13.1
UniAnimate/run_align_pose.py ADDED
@@ -0,0 +1,712 @@
1
+ # Openpose
2
+ # Original from CMU https://github.com/CMU-Perceptual-Computing-Lab/openpose
3
+ # 2nd Edited by https://github.com/Hzzone/pytorch-openpose
4
+ # 3rd Edited by ControlNet
5
+ # 4th Edited by ControlNet (added face and correct hands)
6
+
7
+ import os
8
+ os.environ["KMP_DUPLICATE_LIB_OK"]="TRUE"
9
+ import cv2
10
+ import torch
11
+ import numpy as np
12
+ import json
13
+ import copy
14
+ import torch
15
+ import random
16
+ import argparse
17
+ import shutil
18
+ import tempfile
19
+ import subprocess
20
+ import numpy as np
21
+ import math
22
+
23
+ import torch.multiprocessing as mp
24
+ import torch.distributed as dist
25
+ import pickle
26
+ import logging
27
+ from io import BytesIO
28
+ import oss2 as oss
29
+ import os.path as osp
30
+
31
+ import sys
32
+ import dwpose.util as util
33
+ from dwpose.wholebody import Wholebody
34
+
35
+
36
+ def smoothing_factor(t_e, cutoff):
37
+ r = 2 * math.pi * cutoff * t_e
38
+ return r / (r + 1)
39
+
40
+
41
+ def exponential_smoothing(a, x, x_prev):
42
+ return a * x + (1 - a) * x_prev
43
+
44
+
45
+ class OneEuroFilter:
46
+ def __init__(self, t0, x0, dx0=0.0, min_cutoff=1.0, beta=0.0,
47
+ d_cutoff=1.0):
48
+ """Initialize the one euro filter."""
49
+ # The parameters.
50
+ self.min_cutoff = float(min_cutoff)
51
+ self.beta = float(beta)
52
+ self.d_cutoff = float(d_cutoff)
53
+ # Previous values.
54
+ self.x_prev = x0
55
+ self.dx_prev = float(dx0)
56
+ self.t_prev = float(t0)
57
+
58
+ def __call__(self, t, x):
59
+ """Compute the filtered signal."""
60
+ t_e = t - self.t_prev
61
+
62
+ # The filtered derivative of the signal.
63
+ a_d = smoothing_factor(t_e, self.d_cutoff)
64
+ dx = (x - self.x_prev) / t_e
65
+ dx_hat = exponential_smoothing(a_d, dx, self.dx_prev)
66
+
67
+ # The filtered signal.
68
+ cutoff = self.min_cutoff + self.beta * abs(dx_hat)
69
+ a = smoothing_factor(t_e, cutoff)
70
+ x_hat = exponential_smoothing(a, x, self.x_prev)
71
+
72
+ # Memorize the previous values.
73
+ self.x_prev = x_hat
74
+ self.dx_prev = dx_hat
75
+ self.t_prev = t
76
+
77
+ return x_hat
78
+
79
+
80
+ def get_logger(name="essmc2"):
81
+ logger = logging.getLogger(name)
82
+ logger.propagate = False
83
+ if len(logger.handlers) == 0:
84
+ std_handler = logging.StreamHandler(sys.stdout)
85
+ formatter = logging.Formatter(
86
+ '%(asctime)s - %(name)s - %(levelname)s - %(message)s')
87
+ std_handler.setFormatter(formatter)
88
+ std_handler.setLevel(logging.INFO)
89
+ logger.setLevel(logging.INFO)
90
+ logger.addHandler(std_handler)
91
+ return logger
92
+
93
+ class DWposeDetector:
94
+ def __init__(self):
95
+
96
+ self.pose_estimation = Wholebody()
97
+
98
+ def __call__(self, oriImg):
99
+ oriImg = oriImg.copy()
100
+ H, W, C = oriImg.shape
101
+ with torch.no_grad():
102
+ candidate, subset = self.pose_estimation(oriImg)
103
+ candidate = candidate[0][np.newaxis, :, :]
104
+ subset = subset[0][np.newaxis, :]
105
+ nums, keys, locs = candidate.shape
106
+ candidate[..., 0] /= float(W)
107
+ candidate[..., 1] /= float(H)
108
+ body = candidate[:,:18].copy()
109
+ body = body.reshape(nums*18, locs)
110
+ score = subset[:,:18].copy()
111
+
112
+ for i in range(len(score)):
113
+ for j in range(len(score[i])):
114
+ if score[i][j] > 0.3:
115
+ score[i][j] = int(18*i+j)
116
+ else:
117
+ score[i][j] = -1
118
+
119
+ un_visible = subset<0.3
120
+ candidate[un_visible] = -1
121
+
122
+ bodyfoot_score = subset[:,:24].copy()
123
+ for i in range(len(bodyfoot_score)):
124
+ for j in range(len(bodyfoot_score[i])):
125
+ if bodyfoot_score[i][j] > 0.3:
126
+ bodyfoot_score[i][j] = int(18*i+j)
127
+ else:
128
+ bodyfoot_score[i][j] = -1
129
+ if -1 not in bodyfoot_score[:,18] and -1 not in bodyfoot_score[:,19]:
130
+ bodyfoot_score[:,18] = np.array([18.])
131
+ else:
132
+ bodyfoot_score[:,18] = np.array([-1.])
133
+ if -1 not in bodyfoot_score[:,21] and -1 not in bodyfoot_score[:,22]:
134
+ bodyfoot_score[:,19] = np.array([19.])
135
+ else:
136
+ bodyfoot_score[:,19] = np.array([-1.])
137
+ bodyfoot_score = bodyfoot_score[:, :20]
138
+
139
+ bodyfoot = candidate[:,:24].copy()
140
+
141
+ for i in range(nums):
142
+ if -1 not in bodyfoot[i][18] and -1 not in bodyfoot[i][19]:
143
+ bodyfoot[i][18] = (bodyfoot[i][18]+bodyfoot[i][19])/2
144
+ else:
145
+ bodyfoot[i][18] = np.array([-1., -1.])
146
+ if -1 not in bodyfoot[i][21] and -1 not in bodyfoot[i][22]:
147
+ bodyfoot[i][19] = (bodyfoot[i][21]+bodyfoot[i][22])/2
148
+ else:
149
+ bodyfoot[i][19] = np.array([-1., -1.])
150
+
151
+ bodyfoot = bodyfoot[:,:20,:]
152
+ bodyfoot = bodyfoot.reshape(nums*20, locs)
153
+
154
+ foot = candidate[:,18:24]
155
+
156
+ faces = candidate[:,24:92]
157
+
158
+ hands = candidate[:,92:113]
159
+ hands = np.vstack([hands, candidate[:,113:]])
160
+
161
+ # bodies = dict(candidate=body, subset=score)
162
+ bodies = dict(candidate=bodyfoot, subset=bodyfoot_score)
163
+ pose = dict(bodies=bodies, hands=hands, faces=faces)
164
+
165
+ # return draw_pose(pose, H, W)
166
+ return pose
167
+
168
+ def draw_pose(pose, H, W):
169
+ bodies = pose['bodies']
170
+ faces = pose['faces']
171
+ hands = pose['hands']
172
+ candidate = bodies['candidate']
173
+ subset = bodies['subset']
174
+ canvas = np.zeros(shape=(H, W, 3), dtype=np.uint8)
175
+
176
+ canvas = util.draw_body_and_foot(canvas, candidate, subset)
177
+
178
+ canvas = util.draw_handpose(canvas, hands)
179
+
180
+ canvas_without_face = copy.deepcopy(canvas)
181
+
182
+ canvas = util.draw_facepose(canvas, faces)
183
+
184
+ return canvas_without_face, canvas
185
+
186
+ def dw_func(_id, frame, dwpose_model, dwpose_woface_folder='tmp_dwpose_wo_face', dwpose_withface_folder='tmp_dwpose_with_face'):
187
+
188
+ # frame = cv2.imread(frame_name, cv2.IMREAD_COLOR)
189
+ pose = dwpose_model(frame)
190
+
191
+ return pose
192
+
193
+
194
+ def mp_main(args):
195
+
196
+ if args.source_video_paths.endswith('mp4'):
197
+ video_paths = [args.source_video_paths]
198
+ else:
199
+ # video list
200
+ video_paths = [os.path.join(args.source_video_paths, frame_name) for frame_name in os.listdir(args.source_video_paths)]
201
+
202
+
203
+ logger.info("There are {} videos for extracting poses".format(len(video_paths)))
204
+
205
+ logger.info('LOAD: DW Pose Model')
206
+ dwpose_model = DWposeDetector()
207
+
208
+ results_vis = []
209
+ for i, file_path in enumerate(video_paths):
210
+ logger.info(f"{i}/{len(video_paths)}, {file_path}")
211
+ videoCapture = cv2.VideoCapture(file_path)
212
+ while videoCapture.isOpened():
213
+ # get a frame
214
+ ret, frame = videoCapture.read()
215
+ if ret:
216
+ pose = dw_func(i, frame, dwpose_model)
217
+ results_vis.append(pose)
218
+ else:
219
+ break
220
+ logger.info(f'all frames in {file_path} have been read.')
221
+ videoCapture.release()
222
+
223
+ # added
224
+ # results_vis = results_vis[8:]
225
+ print(len(results_vis))
226
+
227
+ ref_name = args.ref_name
228
+ save_motion = args.saved_pose_dir
229
+ os.system(f'rm -rf {save_motion}');
230
+ os.makedirs(save_motion, exist_ok=True)
231
+ save_warp = args.saved_pose_dir
232
+ # os.makedirs(save_warp, exist_ok=True)
233
+
234
+ ref_frame = cv2.imread(ref_name, cv2.IMREAD_COLOR)
235
+ pose_ref = dw_func(i, ref_frame, dwpose_model)
236
+
237
+ bodies = results_vis[0]['bodies']
238
+ faces = results_vis[0]['faces']
239
+ hands = results_vis[0]['hands']
240
+ candidate = bodies['candidate']
241
+
242
+ ref_bodies = pose_ref['bodies']
243
+ ref_faces = pose_ref['faces']
244
+ ref_hands = pose_ref['hands']
245
+ ref_candidate = ref_bodies['candidate']
246
+
247
+
248
+ ref_2_x = ref_candidate[2][0]
249
+ ref_2_y = ref_candidate[2][1]
250
+ ref_5_x = ref_candidate[5][0]
251
+ ref_5_y = ref_candidate[5][1]
252
+ ref_8_x = ref_candidate[8][0]
253
+ ref_8_y = ref_candidate[8][1]
254
+ ref_11_x = ref_candidate[11][0]
255
+ ref_11_y = ref_candidate[11][1]
256
+ ref_center1 = 0.5*(ref_candidate[2]+ref_candidate[5])
257
+ ref_center2 = 0.5*(ref_candidate[8]+ref_candidate[11])
258
+
259
+ zero_2_x = candidate[2][0]
260
+ zero_2_y = candidate[2][1]
261
+ zero_5_x = candidate[5][0]
262
+ zero_5_y = candidate[5][1]
263
+ zero_8_x = candidate[8][0]
264
+ zero_8_y = candidate[8][1]
265
+ zero_11_x = candidate[11][0]
266
+ zero_11_y = candidate[11][1]
267
+ zero_center1 = 0.5*(candidate[2]+candidate[5])
268
+ zero_center2 = 0.5*(candidate[8]+candidate[11])
269
+
270
+ x_ratio = (ref_5_x-ref_2_x)/(zero_5_x-zero_2_x)
271
+ y_ratio = (ref_center2[1]-ref_center1[1])/(zero_center2[1]-zero_center1[1])
272
+
273
+ results_vis[0]['bodies']['candidate'][:,0] *= x_ratio
274
+ results_vis[0]['bodies']['candidate'][:,1] *= y_ratio
275
+ results_vis[0]['faces'][:,:,0] *= x_ratio
276
+ results_vis[0]['faces'][:,:,1] *= y_ratio
277
+ results_vis[0]['hands'][:,:,0] *= x_ratio
278
+ results_vis[0]['hands'][:,:,1] *= y_ratio
279
+
280
+ ########neck########
281
+ l_neck_ref = ((ref_candidate[0][0] - ref_candidate[1][0]) ** 2 + (ref_candidate[0][1] - ref_candidate[1][1]) ** 2) ** 0.5
282
+ l_neck_0 = ((candidate[0][0] - candidate[1][0]) ** 2 + (candidate[0][1] - candidate[1][1]) ** 2) ** 0.5
283
+ neck_ratio = l_neck_ref / l_neck_0
284
+
285
+ x_offset_neck = (candidate[1][0]-candidate[0][0])*(1.-neck_ratio)
286
+ y_offset_neck = (candidate[1][1]-candidate[0][1])*(1.-neck_ratio)
287
+
288
+ results_vis[0]['bodies']['candidate'][0,0] += x_offset_neck
289
+ results_vis[0]['bodies']['candidate'][0,1] += y_offset_neck
290
+ results_vis[0]['bodies']['candidate'][14,0] += x_offset_neck
291
+ results_vis[0]['bodies']['candidate'][14,1] += y_offset_neck
292
+ results_vis[0]['bodies']['candidate'][15,0] += x_offset_neck
293
+ results_vis[0]['bodies']['candidate'][15,1] += y_offset_neck
294
+ results_vis[0]['bodies']['candidate'][16,0] += x_offset_neck
295
+ results_vis[0]['bodies']['candidate'][16,1] += y_offset_neck
296
+ results_vis[0]['bodies']['candidate'][17,0] += x_offset_neck
297
+ results_vis[0]['bodies']['candidate'][17,1] += y_offset_neck
298
+
299
+ ########shoulder2########
300
+ l_shoulder2_ref = ((ref_candidate[2][0] - ref_candidate[1][0]) ** 2 + (ref_candidate[2][1] - ref_candidate[1][1]) ** 2) ** 0.5
301
+ l_shoulder2_0 = ((candidate[2][0] - candidate[1][0]) ** 2 + (candidate[2][1] - candidate[1][1]) ** 2) ** 0.5
302
+
303
+ shoulder2_ratio = l_shoulder2_ref / l_shoulder2_0
304
+
305
+ x_offset_shoulder2 = (candidate[1][0]-candidate[2][0])*(1.-shoulder2_ratio)
306
+ y_offset_shoulder2 = (candidate[1][1]-candidate[2][1])*(1.-shoulder2_ratio)
307
+
308
+ results_vis[0]['bodies']['candidate'][2,0] += x_offset_shoulder2
309
+ results_vis[0]['bodies']['candidate'][2,1] += y_offset_shoulder2
310
+ results_vis[0]['bodies']['candidate'][3,0] += x_offset_shoulder2
311
+ results_vis[0]['bodies']['candidate'][3,1] += y_offset_shoulder2
312
+ results_vis[0]['bodies']['candidate'][4,0] += x_offset_shoulder2
313
+ results_vis[0]['bodies']['candidate'][4,1] += y_offset_shoulder2
314
+ results_vis[0]['hands'][1,:,0] += x_offset_shoulder2
315
+ results_vis[0]['hands'][1,:,1] += y_offset_shoulder2
316
+
317
+ ########shoulder5########
318
+ l_shoulder5_ref = ((ref_candidate[5][0] - ref_candidate[1][0]) ** 2 + (ref_candidate[5][1] - ref_candidate[1][1]) ** 2) ** 0.5
319
+ l_shoulder5_0 = ((candidate[5][0] - candidate[1][0]) ** 2 + (candidate[5][1] - candidate[1][1]) ** 2) ** 0.5
320
+
321
+ shoulder5_ratio = l_shoulder5_ref / l_shoulder5_0
322
+
323
+ x_offset_shoulder5 = (candidate[1][0]-candidate[5][0])*(1.-shoulder5_ratio)
324
+ y_offset_shoulder5 = (candidate[1][1]-candidate[5][1])*(1.-shoulder5_ratio)
325
+
326
+ results_vis[0]['bodies']['candidate'][5,0] += x_offset_shoulder5
327
+ results_vis[0]['bodies']['candidate'][5,1] += y_offset_shoulder5
328
+ results_vis[0]['bodies']['candidate'][6,0] += x_offset_shoulder5
329
+ results_vis[0]['bodies']['candidate'][6,1] += y_offset_shoulder5
330
+ results_vis[0]['bodies']['candidate'][7,0] += x_offset_shoulder5
331
+ results_vis[0]['bodies']['candidate'][7,1] += y_offset_shoulder5
332
+ results_vis[0]['hands'][0,:,0] += x_offset_shoulder5
333
+ results_vis[0]['hands'][0,:,1] += y_offset_shoulder5
334
+
335
+ ########arm3########
336
+ l_arm3_ref = ((ref_candidate[3][0] - ref_candidate[2][0]) ** 2 + (ref_candidate[3][1] - ref_candidate[2][1]) ** 2) ** 0.5
337
+ l_arm3_0 = ((candidate[3][0] - candidate[2][0]) ** 2 + (candidate[3][1] - candidate[2][1]) ** 2) ** 0.5
338
+
339
+ arm3_ratio = l_arm3_ref / l_arm3_0
340
+
341
+ x_offset_arm3 = (candidate[2][0]-candidate[3][0])*(1.-arm3_ratio)
342
+ y_offset_arm3 = (candidate[2][1]-candidate[3][1])*(1.-arm3_ratio)
343
+
344
+ results_vis[0]['bodies']['candidate'][3,0] += x_offset_arm3
345
+ results_vis[0]['bodies']['candidate'][3,1] += y_offset_arm3
346
+ results_vis[0]['bodies']['candidate'][4,0] += x_offset_arm3
347
+ results_vis[0]['bodies']['candidate'][4,1] += y_offset_arm3
348
+ results_vis[0]['hands'][1,:,0] += x_offset_arm3
349
+ results_vis[0]['hands'][1,:,1] += y_offset_arm3
350
+
351
+ ########arm4########
352
+ l_arm4_ref = ((ref_candidate[4][0] - ref_candidate[3][0]) ** 2 + (ref_candidate[4][1] - ref_candidate[3][1]) ** 2) ** 0.5
353
+ l_arm4_0 = ((candidate[4][0] - candidate[3][0]) ** 2 + (candidate[4][1] - candidate[3][1]) ** 2) ** 0.5
354
+
355
+ arm4_ratio = l_arm4_ref / l_arm4_0
356
+
357
+ x_offset_arm4 = (candidate[3][0]-candidate[4][0])*(1.-arm4_ratio)
358
+ y_offset_arm4 = (candidate[3][1]-candidate[4][1])*(1.-arm4_ratio)
359
+
360
+ results_vis[0]['bodies']['candidate'][4,0] += x_offset_arm4
361
+ results_vis[0]['bodies']['candidate'][4,1] += y_offset_arm4
362
+ results_vis[0]['hands'][1,:,0] += x_offset_arm4
363
+ results_vis[0]['hands'][1,:,1] += y_offset_arm4
364
+
365
+ ########arm6########
366
+ l_arm6_ref = ((ref_candidate[6][0] - ref_candidate[5][0]) ** 2 + (ref_candidate[6][1] - ref_candidate[5][1]) ** 2) ** 0.5
367
+ l_arm6_0 = ((candidate[6][0] - candidate[5][0]) ** 2 + (candidate[6][1] - candidate[5][1]) ** 2) ** 0.5
368
+
369
+ arm6_ratio = l_arm6_ref / l_arm6_0
370
+
371
+ x_offset_arm6 = (candidate[5][0]-candidate[6][0])*(1.-arm6_ratio)
372
+ y_offset_arm6 = (candidate[5][1]-candidate[6][1])*(1.-arm6_ratio)
373
+
374
+ results_vis[0]['bodies']['candidate'][6,0] += x_offset_arm6
375
+ results_vis[0]['bodies']['candidate'][6,1] += y_offset_arm6
376
+ results_vis[0]['bodies']['candidate'][7,0] += x_offset_arm6
377
+ results_vis[0]['bodies']['candidate'][7,1] += y_offset_arm6
378
+ results_vis[0]['hands'][0,:,0] += x_offset_arm6
379
+ results_vis[0]['hands'][0,:,1] += y_offset_arm6
380
+
381
+ ########arm7########
382
+ l_arm7_ref = ((ref_candidate[7][0] - ref_candidate[6][0]) ** 2 + (ref_candidate[7][1] - ref_candidate[6][1]) ** 2) ** 0.5
383
+ l_arm7_0 = ((candidate[7][0] - candidate[6][0]) ** 2 + (candidate[7][1] - candidate[6][1]) ** 2) ** 0.5
384
+
385
+ arm7_ratio = l_arm7_ref / l_arm7_0
386
+
387
+ x_offset_arm7 = (candidate[6][0]-candidate[7][0])*(1.-arm7_ratio)
388
+ y_offset_arm7 = (candidate[6][1]-candidate[7][1])*(1.-arm7_ratio)
389
+
390
+ results_vis[0]['bodies']['candidate'][7,0] += x_offset_arm7
391
+ results_vis[0]['bodies']['candidate'][7,1] += y_offset_arm7
392
+ results_vis[0]['hands'][0,:,0] += x_offset_arm7
393
+ results_vis[0]['hands'][0,:,1] += y_offset_arm7
394
+
395
+ ########head14########
396
+ l_head14_ref = ((ref_candidate[14][0] - ref_candidate[0][0]) ** 2 + (ref_candidate[14][1] - ref_candidate[0][1]) ** 2) ** 0.5
397
+ l_head14_0 = ((candidate[14][0] - candidate[0][0]) ** 2 + (candidate[14][1] - candidate[0][1]) ** 2) ** 0.5
398
+
399
+ head14_ratio = l_head14_ref / l_head14_0
400
+
401
+ x_offset_head14 = (candidate[0][0]-candidate[14][0])*(1.-head14_ratio)
402
+ y_offset_head14 = (candidate[0][1]-candidate[14][1])*(1.-head14_ratio)
403
+
404
+ results_vis[0]['bodies']['candidate'][14,0] += x_offset_head14
405
+ results_vis[0]['bodies']['candidate'][14,1] += y_offset_head14
406
+ results_vis[0]['bodies']['candidate'][16,0] += x_offset_head14
407
+ results_vis[0]['bodies']['candidate'][16,1] += y_offset_head14
408
+
409
+ ########head15########
410
+ l_head15_ref = ((ref_candidate[15][0] - ref_candidate[0][0]) ** 2 + (ref_candidate[15][1] - ref_candidate[0][1]) ** 2) ** 0.5
411
+ l_head15_0 = ((candidate[15][0] - candidate[0][0]) ** 2 + (candidate[15][1] - candidate[0][1]) ** 2) ** 0.5
412
+
413
+ head15_ratio = l_head15_ref / l_head15_0
414
+
415
+ x_offset_head15 = (candidate[0][0]-candidate[15][0])*(1.-head15_ratio)
416
+ y_offset_head15 = (candidate[0][1]-candidate[15][1])*(1.-head15_ratio)
417
+
418
+ results_vis[0]['bodies']['candidate'][15,0] += x_offset_head15
419
+ results_vis[0]['bodies']['candidate'][15,1] += y_offset_head15
420
+ results_vis[0]['bodies']['candidate'][17,0] += x_offset_head15
421
+ results_vis[0]['bodies']['candidate'][17,1] += y_offset_head15
422
+
423
+ ########head16########
424
+ l_head16_ref = ((ref_candidate[16][0] - ref_candidate[14][0]) ** 2 + (ref_candidate[16][1] - ref_candidate[14][1]) ** 2) ** 0.5
425
+ l_head16_0 = ((candidate[16][0] - candidate[14][0]) ** 2 + (candidate[16][1] - candidate[14][1]) ** 2) ** 0.5
426
+
427
+ head16_ratio = l_head16_ref / l_head16_0
428
+
429
+ x_offset_head16 = (candidate[14][0]-candidate[16][0])*(1.-head16_ratio)
430
+ y_offset_head16 = (candidate[14][1]-candidate[16][1])*(1.-head16_ratio)
431
+
432
+ results_vis[0]['bodies']['candidate'][16,0] += x_offset_head16
433
+ results_vis[0]['bodies']['candidate'][16,1] += y_offset_head16
434
+
435
+ ########head17########
436
+ l_head17_ref = ((ref_candidate[17][0] - ref_candidate[15][0]) ** 2 + (ref_candidate[17][1] - ref_candidate[15][1]) ** 2) ** 0.5
437
+ l_head17_0 = ((candidate[17][0] - candidate[15][0]) ** 2 + (candidate[17][1] - candidate[15][1]) ** 2) ** 0.5
438
+
439
+ head17_ratio = l_head17_ref / l_head17_0
440
+
441
+ x_offset_head17 = (candidate[15][0]-candidate[17][0])*(1.-head17_ratio)
442
+ y_offset_head17 = (candidate[15][1]-candidate[17][1])*(1.-head17_ratio)
443
+
444
+ results_vis[0]['bodies']['candidate'][17,0] += x_offset_head17
445
+ results_vis[0]['bodies']['candidate'][17,1] += y_offset_head17
446
+
447
+ ########MovingAverage########
448
+
449
+ ########left leg########
450
+ l_ll1_ref = ((ref_candidate[8][0] - ref_candidate[9][0]) ** 2 + (ref_candidate[8][1] - ref_candidate[9][1]) ** 2) ** 0.5
451
+ l_ll1_0 = ((candidate[8][0] - candidate[9][0]) ** 2 + (candidate[8][1] - candidate[9][1]) ** 2) ** 0.5
452
+ ll1_ratio = l_ll1_ref / l_ll1_0
453
+
454
+ x_offset_ll1 = (candidate[9][0]-candidate[8][0])*(ll1_ratio-1.)
455
+ y_offset_ll1 = (candidate[9][1]-candidate[8][1])*(ll1_ratio-1.)
456
+
457
+ results_vis[0]['bodies']['candidate'][9,0] += x_offset_ll1
458
+ results_vis[0]['bodies']['candidate'][9,1] += y_offset_ll1
459
+ results_vis[0]['bodies']['candidate'][10,0] += x_offset_ll1
460
+ results_vis[0]['bodies']['candidate'][10,1] += y_offset_ll1
461
+ results_vis[0]['bodies']['candidate'][19,0] += x_offset_ll1
462
+ results_vis[0]['bodies']['candidate'][19,1] += y_offset_ll1
463
+
464
+ l_ll2_ref = ((ref_candidate[9][0] - ref_candidate[10][0]) ** 2 + (ref_candidate[9][1] - ref_candidate[10][1]) ** 2) ** 0.5
465
+ l_ll2_0 = ((candidate[9][0] - candidate[10][0]) ** 2 + (candidate[9][1] - candidate[10][1]) ** 2) ** 0.5
466
+ ll2_ratio = l_ll2_ref / l_ll2_0
467
+
468
+ x_offset_ll2 = (candidate[10][0]-candidate[9][0])*(ll2_ratio-1.)
469
+ y_offset_ll2 = (candidate[10][1]-candidate[9][1])*(ll2_ratio-1.)
470
+
471
+ results_vis[0]['bodies']['candidate'][10,0] += x_offset_ll2
472
+ results_vis[0]['bodies']['candidate'][10,1] += y_offset_ll2
473
+ results_vis[0]['bodies']['candidate'][19,0] += x_offset_ll2
474
+ results_vis[0]['bodies']['candidate'][19,1] += y_offset_ll2
475
+
476
+ ########right leg########
477
+ l_rl1_ref = ((ref_candidate[11][0] - ref_candidate[12][0]) ** 2 + (ref_candidate[11][1] - ref_candidate[12][1]) ** 2) ** 0.5
478
+ l_rl1_0 = ((candidate[11][0] - candidate[12][0]) ** 2 + (candidate[11][1] - candidate[12][1]) ** 2) ** 0.5
479
+ rl1_ratio = l_rl1_ref / l_rl1_0
480
+
481
+ x_offset_rl1 = (candidate[12][0]-candidate[11][0])*(rl1_ratio-1.)
482
+ y_offset_rl1 = (candidate[12][1]-candidate[11][1])*(rl1_ratio-1.)
483
+
484
+ results_vis[0]['bodies']['candidate'][12,0] += x_offset_rl1
485
+ results_vis[0]['bodies']['candidate'][12,1] += y_offset_rl1
486
+ results_vis[0]['bodies']['candidate'][13,0] += x_offset_rl1
487
+ results_vis[0]['bodies']['candidate'][13,1] += y_offset_rl1
488
+ results_vis[0]['bodies']['candidate'][18,0] += x_offset_rl1
489
+ results_vis[0]['bodies']['candidate'][18,1] += y_offset_rl1
490
+
491
+ l_rl2_ref = ((ref_candidate[12][0] - ref_candidate[13][0]) ** 2 + (ref_candidate[12][1] - ref_candidate[13][1]) ** 2) ** 0.5
492
+ l_rl2_0 = ((candidate[12][0] - candidate[13][0]) ** 2 + (candidate[12][1] - candidate[13][1]) ** 2) ** 0.5
493
+ rl2_ratio = l_rl2_ref / l_rl2_0
494
+
495
+ x_offset_rl2 = (candidate[13][0]-candidate[12][0])*(rl2_ratio-1.)
496
+ y_offset_rl2 = (candidate[13][1]-candidate[12][1])*(rl2_ratio-1.)
497
+
498
+ results_vis[0]['bodies']['candidate'][13,0] += x_offset_rl2
499
+ results_vis[0]['bodies']['candidate'][13,1] += y_offset_rl2
500
+ results_vis[0]['bodies']['candidate'][18,0] += x_offset_rl2
501
+ results_vis[0]['bodies']['candidate'][18,1] += y_offset_rl2
502
+
503
+ offset = ref_candidate[1] - results_vis[0]['bodies']['candidate'][1]
504
+
505
+ results_vis[0]['bodies']['candidate'] += offset[np.newaxis, :]
506
+ results_vis[0]['faces'] += offset[np.newaxis, np.newaxis, :]
507
+ results_vis[0]['hands'] += offset[np.newaxis, np.newaxis, :]
508
+
509
+ for i in range(1, len(results_vis)):
510
+ results_vis[i]['bodies']['candidate'][:,0] *= x_ratio
511
+ results_vis[i]['bodies']['candidate'][:,1] *= y_ratio
512
+ results_vis[i]['faces'][:,:,0] *= x_ratio
513
+ results_vis[i]['faces'][:,:,1] *= y_ratio
514
+ results_vis[i]['hands'][:,:,0] *= x_ratio
515
+ results_vis[i]['hands'][:,:,1] *= y_ratio
516
+
517
+ ########neck########
518
+ x_offset_neck = (results_vis[i]['bodies']['candidate'][1][0]-results_vis[i]['bodies']['candidate'][0][0])*(1.-neck_ratio)
519
+ y_offset_neck = (results_vis[i]['bodies']['candidate'][1][1]-results_vis[i]['bodies']['candidate'][0][1])*(1.-neck_ratio)
520
+
521
+ results_vis[i]['bodies']['candidate'][0,0] += x_offset_neck
522
+ results_vis[i]['bodies']['candidate'][0,1] += y_offset_neck
523
+ results_vis[i]['bodies']['candidate'][14,0] += x_offset_neck
524
+ results_vis[i]['bodies']['candidate'][14,1] += y_offset_neck
525
+ results_vis[i]['bodies']['candidate'][15,0] += x_offset_neck
526
+ results_vis[i]['bodies']['candidate'][15,1] += y_offset_neck
527
+ results_vis[i]['bodies']['candidate'][16,0] += x_offset_neck
528
+ results_vis[i]['bodies']['candidate'][16,1] += y_offset_neck
529
+ results_vis[i]['bodies']['candidate'][17,0] += x_offset_neck
530
+ results_vis[i]['bodies']['candidate'][17,1] += y_offset_neck
531
+
532
+ ########shoulder2########
533
+
534
+
535
+ x_offset_shoulder2 = (results_vis[i]['bodies']['candidate'][1][0]-results_vis[i]['bodies']['candidate'][2][0])*(1.-shoulder2_ratio)
536
+ y_offset_shoulder2 = (results_vis[i]['bodies']['candidate'][1][1]-results_vis[i]['bodies']['candidate'][2][1])*(1.-shoulder2_ratio)
537
+
538
+ results_vis[i]['bodies']['candidate'][2,0] += x_offset_shoulder2
539
+ results_vis[i]['bodies']['candidate'][2,1] += y_offset_shoulder2
540
+ results_vis[i]['bodies']['candidate'][3,0] += x_offset_shoulder2
541
+ results_vis[i]['bodies']['candidate'][3,1] += y_offset_shoulder2
542
+ results_vis[i]['bodies']['candidate'][4,0] += x_offset_shoulder2
543
+ results_vis[i]['bodies']['candidate'][4,1] += y_offset_shoulder2
544
+ results_vis[i]['hands'][1,:,0] += x_offset_shoulder2
545
+ results_vis[i]['hands'][1,:,1] += y_offset_shoulder2
546
+
547
+ ########shoulder5########
548
+
549
+ x_offset_shoulder5 = (results_vis[i]['bodies']['candidate'][1][0]-results_vis[i]['bodies']['candidate'][5][0])*(1.-shoulder5_ratio)
550
+ y_offset_shoulder5 = (results_vis[i]['bodies']['candidate'][1][1]-results_vis[i]['bodies']['candidate'][5][1])*(1.-shoulder5_ratio)
551
+
552
+ results_vis[i]['bodies']['candidate'][5,0] += x_offset_shoulder5
553
+ results_vis[i]['bodies']['candidate'][5,1] += y_offset_shoulder5
554
+ results_vis[i]['bodies']['candidate'][6,0] += x_offset_shoulder5
555
+ results_vis[i]['bodies']['candidate'][6,1] += y_offset_shoulder5
556
+ results_vis[i]['bodies']['candidate'][7,0] += x_offset_shoulder5
557
+ results_vis[i]['bodies']['candidate'][7,1] += y_offset_shoulder5
558
+ results_vis[i]['hands'][0,:,0] += x_offset_shoulder5
559
+ results_vis[i]['hands'][0,:,1] += y_offset_shoulder5
560
+
561
+ ########arm3########
562
+
563
+ x_offset_arm3 = (results_vis[i]['bodies']['candidate'][2][0]-results_vis[i]['bodies']['candidate'][3][0])*(1.-arm3_ratio)
564
+ y_offset_arm3 = (results_vis[i]['bodies']['candidate'][2][1]-results_vis[i]['bodies']['candidate'][3][1])*(1.-arm3_ratio)
565
+
566
+ results_vis[i]['bodies']['candidate'][3,0] += x_offset_arm3
567
+ results_vis[i]['bodies']['candidate'][3,1] += y_offset_arm3
568
+ results_vis[i]['bodies']['candidate'][4,0] += x_offset_arm3
569
+ results_vis[i]['bodies']['candidate'][4,1] += y_offset_arm3
570
+ results_vis[i]['hands'][1,:,0] += x_offset_arm3
571
+ results_vis[i]['hands'][1,:,1] += y_offset_arm3
572
+
573
+ ########arm4########
574
+
575
+ x_offset_arm4 = (results_vis[i]['bodies']['candidate'][3][0]-results_vis[i]['bodies']['candidate'][4][0])*(1.-arm4_ratio)
576
+ y_offset_arm4 = (results_vis[i]['bodies']['candidate'][3][1]-results_vis[i]['bodies']['candidate'][4][1])*(1.-arm4_ratio)
577
+
578
+ results_vis[i]['bodies']['candidate'][4,0] += x_offset_arm4
579
+ results_vis[i]['bodies']['candidate'][4,1] += y_offset_arm4
580
+ results_vis[i]['hands'][1,:,0] += x_offset_arm4
581
+ results_vis[i]['hands'][1,:,1] += y_offset_arm4
582
+
583
+ ########arm6########
584
+
585
+ x_offset_arm6 = (results_vis[i]['bodies']['candidate'][5][0]-results_vis[i]['bodies']['candidate'][6][0])*(1.-arm6_ratio)
586
+ y_offset_arm6 = (results_vis[i]['bodies']['candidate'][5][1]-results_vis[i]['bodies']['candidate'][6][1])*(1.-arm6_ratio)
587
+
588
+ results_vis[i]['bodies']['candidate'][6,0] += x_offset_arm6
589
+ results_vis[i]['bodies']['candidate'][6,1] += y_offset_arm6
590
+ results_vis[i]['bodies']['candidate'][7,0] += x_offset_arm6
591
+ results_vis[i]['bodies']['candidate'][7,1] += y_offset_arm6
592
+ results_vis[i]['hands'][0,:,0] += x_offset_arm6
593
+ results_vis[i]['hands'][0,:,1] += y_offset_arm6
594
+
595
+ ########arm7########
596
+
597
+ x_offset_arm7 = (results_vis[i]['bodies']['candidate'][6][0]-results_vis[i]['bodies']['candidate'][7][0])*(1.-arm7_ratio)
598
+ y_offset_arm7 = (results_vis[i]['bodies']['candidate'][6][1]-results_vis[i]['bodies']['candidate'][7][1])*(1.-arm7_ratio)
599
+
600
+ results_vis[i]['bodies']['candidate'][7,0] += x_offset_arm7
601
+ results_vis[i]['bodies']['candidate'][7,1] += y_offset_arm7
602
+ results_vis[i]['hands'][0,:,0] += x_offset_arm7
603
+ results_vis[i]['hands'][0,:,1] += y_offset_arm7
604
+
605
+ ########head14########
606
+
607
+ x_offset_head14 = (results_vis[i]['bodies']['candidate'][0][0]-results_vis[i]['bodies']['candidate'][14][0])*(1.-head14_ratio)
608
+ y_offset_head14 = (results_vis[i]['bodies']['candidate'][0][1]-results_vis[i]['bodies']['candidate'][14][1])*(1.-head14_ratio)
609
+
610
+ results_vis[i]['bodies']['candidate'][14,0] += x_offset_head14
611
+ results_vis[i]['bodies']['candidate'][14,1] += y_offset_head14
612
+ results_vis[i]['bodies']['candidate'][16,0] += x_offset_head14
613
+ results_vis[i]['bodies']['candidate'][16,1] += y_offset_head14
614
+
615
+ ########head15########
616
+
617
+ x_offset_head15 = (results_vis[i]['bodies']['candidate'][0][0]-results_vis[i]['bodies']['candidate'][15][0])*(1.-head15_ratio)
618
+ y_offset_head15 = (results_vis[i]['bodies']['candidate'][0][1]-results_vis[i]['bodies']['candidate'][15][1])*(1.-head15_ratio)
619
+
620
+ results_vis[i]['bodies']['candidate'][15,0] += x_offset_head15
621
+ results_vis[i]['bodies']['candidate'][15,1] += y_offset_head15
622
+ results_vis[i]['bodies']['candidate'][17,0] += x_offset_head15
623
+ results_vis[i]['bodies']['candidate'][17,1] += y_offset_head15
624
+
625
+ ########head16########
626
+
627
+ x_offset_head16 = (results_vis[i]['bodies']['candidate'][14][0]-results_vis[i]['bodies']['candidate'][16][0])*(1.-head16_ratio)
628
+ y_offset_head16 = (results_vis[i]['bodies']['candidate'][14][1]-results_vis[i]['bodies']['candidate'][16][1])*(1.-head16_ratio)
629
+
630
+ results_vis[i]['bodies']['candidate'][16,0] += x_offset_head16
631
+ results_vis[i]['bodies']['candidate'][16,1] += y_offset_head16
632
+
633
+ ########head17########
634
+ x_offset_head17 = (results_vis[i]['bodies']['candidate'][15][0]-results_vis[i]['bodies']['candidate'][17][0])*(1.-head17_ratio)
635
+ y_offset_head17 = (results_vis[i]['bodies']['candidate'][15][1]-results_vis[i]['bodies']['candidate'][17][1])*(1.-head17_ratio)
636
+
637
+ results_vis[i]['bodies']['candidate'][17,0] += x_offset_head17
638
+ results_vis[i]['bodies']['candidate'][17,1] += y_offset_head17
639
+
640
+ # ########MovingAverage########
641
+
642
+ ########left leg########
643
+ x_offset_ll1 = (results_vis[i]['bodies']['candidate'][9][0]-results_vis[i]['bodies']['candidate'][8][0])*(ll1_ratio-1.)
644
+ y_offset_ll1 = (results_vis[i]['bodies']['candidate'][9][1]-results_vis[i]['bodies']['candidate'][8][1])*(ll1_ratio-1.)
645
+
646
+ results_vis[i]['bodies']['candidate'][9,0] += x_offset_ll1
647
+ results_vis[i]['bodies']['candidate'][9,1] += y_offset_ll1
648
+ results_vis[i]['bodies']['candidate'][10,0] += x_offset_ll1
649
+ results_vis[i]['bodies']['candidate'][10,1] += y_offset_ll1
650
+ results_vis[i]['bodies']['candidate'][19,0] += x_offset_ll1
651
+ results_vis[i]['bodies']['candidate'][19,1] += y_offset_ll1
652
+
653
+
654
+
655
+ x_offset_ll2 = (results_vis[i]['bodies']['candidate'][10][0]-results_vis[i]['bodies']['candidate'][9][0])*(ll2_ratio-1.)
656
+ y_offset_ll2 = (results_vis[i]['bodies']['candidate'][10][1]-results_vis[i]['bodies']['candidate'][9][1])*(ll2_ratio-1.)
657
+
658
+ results_vis[i]['bodies']['candidate'][10,0] += x_offset_ll2
659
+ results_vis[i]['bodies']['candidate'][10,1] += y_offset_ll2
660
+ results_vis[i]['bodies']['candidate'][19,0] += x_offset_ll2
661
+ results_vis[i]['bodies']['candidate'][19,1] += y_offset_ll2
662
+
663
+ ########right leg########
664
+
665
+ x_offset_rl1 = (results_vis[i]['bodies']['candidate'][12][0]-results_vis[i]['bodies']['candidate'][11][0])*(rl1_ratio-1.)
666
+ y_offset_rl1 = (results_vis[i]['bodies']['candidate'][12][1]-results_vis[i]['bodies']['candidate'][11][1])*(rl1_ratio-1.)
667
+
668
+ results_vis[i]['bodies']['candidate'][12,0] += x_offset_rl1
669
+ results_vis[i]['bodies']['candidate'][12,1] += y_offset_rl1
670
+ results_vis[i]['bodies']['candidate'][13,0] += x_offset_rl1
671
+ results_vis[i]['bodies']['candidate'][13,1] += y_offset_rl1
672
+ results_vis[i]['bodies']['candidate'][18,0] += x_offset_rl1
673
+ results_vis[i]['bodies']['candidate'][18,1] += y_offset_rl1
674
+
675
+
676
+ x_offset_rl2 = (results_vis[i]['bodies']['candidate'][13][0]-results_vis[i]['bodies']['candidate'][12][0])*(rl2_ratio-1.)
677
+ y_offset_rl2 = (results_vis[i]['bodies']['candidate'][13][1]-results_vis[i]['bodies']['candidate'][12][1])*(rl2_ratio-1.)
678
+
679
+ results_vis[i]['bodies']['candidate'][13,0] += x_offset_rl2
680
+ results_vis[i]['bodies']['candidate'][13,1] += y_offset_rl2
681
+ results_vis[i]['bodies']['candidate'][18,0] += x_offset_rl2
682
+ results_vis[i]['bodies']['candidate'][18,1] += y_offset_rl2
683
+
684
+ results_vis[i]['bodies']['candidate'] += offset[np.newaxis, :]
685
+ results_vis[i]['faces'] += offset[np.newaxis, np.newaxis, :]
686
+ results_vis[i]['hands'] += offset[np.newaxis, np.newaxis, :]
687
+
688
+ for i in range(len(results_vis)):
689
+ dwpose_woface, dwpose_wface = draw_pose(results_vis[i], H=768, W=512)
690
+ img_path = save_motion+'/' + str(i).zfill(4) + '.jpg'
691
+ cv2.imwrite(img_path, dwpose_woface)
692
+
693
+ dwpose_woface, dwpose_wface = draw_pose(pose_ref, H=768, W=512)
694
+ img_path = save_warp+'/' + 'ref_pose.jpg'
695
+ cv2.imwrite(img_path, dwpose_woface)
696
+
697
+
698
+ logger = get_logger('dw pose extraction')
699
+
700
+
701
+ if __name__=='__main__':
702
+ def parse_args():
703
+ parser = argparse.ArgumentParser(description="Simple example of a training script.")
704
+ parser.add_argument("--ref_name", type=str, default="data/images/IMG_20240514_104337.jpg",)
705
+ parser.add_argument("--source_video_paths", type=str, default="data/videos/source_video.mp4",)
706
+ parser.add_argument("--saved_pose_dir", type=str, default="data/saved_pose/IMG_20240514_104337",)
707
+ args = parser.parse_args()
708
+
709
+ return args
710
+
711
+ args = parse_args()
712
+ mp_main(args)
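
The OneEuroFilter defined near the top of this file is a standard low-pass filter for jittery keypoint tracks; it does not appear to be used in mp_main above, so the following is only a sketch of how it could be applied to a noisy 1-D coordinate sequence (the import path and the synthetic signal are illustrative):

import math
import random

from run_align_pose import OneEuroFilter   # the class defined above

random.seed(0)
timestamps = list(range(100))
# Synthetic noisy track standing in for one keypoint coordinate over time.
signal = [math.sin(0.1 * t) + random.gauss(0.0, 0.05) for t in timestamps]

one_euro = OneEuroFilter(t0=timestamps[0], x0=signal[0], min_cutoff=1.0, beta=0.01)
smoothed = [signal[0]] + [one_euro(t, x) for t, x in zip(timestamps[1:], signal[1:])]
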
UniAnimate/test_func/save_targer_keys.py ADDED
@@ -0,0 +1,108 @@
1
+ import os
2
+ import sys
3
+ import json
4
+ import torch
5
+ import imageio
6
+ import numpy as np
7
+ import os.path as osp
8
+ sys.path.insert(0, '/'.join(osp.realpath(__file__).split('/')[:-2]))
9
+ from thop import profile
10
+ from ptflops import get_model_complexity_info
11
+
12
+ import artist.data as data
13
+ from tools.modules.config import cfg
14
+ from tools.modules.unet.util import *
15
+ from utils.config import Config as pConfig
16
+ from utils.registry_class import ENGINE, MODEL
17
+
18
+
19
+ def save_temporal_key():
20
+ cfg_update = pConfig(load=True)
21
+
22
+ for k, v in cfg_update.cfg_dict.items():
23
+ if isinstance(v, dict) and k in cfg:
24
+ cfg[k].update(v)
25
+ else:
26
+ cfg[k] = v
27
+
28
+ model = MODEL.build(cfg.UNet)
29
+
30
+ temp_name = ''
31
+ temp_key_list = []
32
+ spth = 'workspace/module_list/UNetSD_I2V_vs_Text_temporal_key_list.json'
33
+ for name, module in model.named_modules():
34
+ if isinstance(module, (TemporalTransformer, TemporalTransformer_attemask, TemporalAttentionBlock, TemporalAttentionMultiBlock, TemporalConvBlock_v2, TemporalConvBlock)):
35
+ temp_name = name
36
+ print(f'Model: {name}')
37
+ elif isinstance(module, (ResidualBlock, ResBlock, SpatialTransformer, Upsample, Downsample)):
38
+ temp_name = ''
39
+
40
+ if hasattr(module, 'weight'):
41
+ if temp_name != '' and (temp_name in name):
42
+ temp_key_list.append(name)
43
+ print(f'{name}')
44
+ # print(name)
45
+
46
+ save_module_list = []
47
+ for k, p in model.named_parameters():
48
+ for item in temp_key_list:
49
+ if item in k:
50
+ print(f'{item} --> {k}')
51
+ save_module_list.append(k)
52
+
53
+ print(int(sum(p.numel() for k, p in model.named_parameters()) / (1024 ** 2)), 'M parameters')
54
+
55
+ # spth = 'workspace/module_list/{}'
56
+ json.dump(save_module_list, open(spth, 'w'))
57
+ a = 0
58
+
59
+
60
+ def save_spatial_key():
61
+ cfg_update = pConfig(load=True)
62
+
63
+ for k, v in cfg_update.cfg_dict.items():
64
+ if isinstance(v, dict) and k in cfg:
65
+ cfg[k].update(v)
66
+ else:
67
+ cfg[k] = v
68
+
69
+ model = MODEL.build(cfg.UNet)
70
+ temp_name = ''
71
+ temp_key_list = []
72
+ spth = 'workspace/module_list/UNetSD_I2V_HQ_P_spatial_key_list.json'
73
+ for name, module in model.named_modules():
74
+ if isinstance(module, (ResidualBlock, ResBlock, SpatialTransformer, Upsample, Downsample)):
75
+ temp_name = name
76
+ print(f'Model: {name}')
77
+ elif isinstance(module, (TemporalTransformer, TemporalTransformer_attemask, TemporalAttentionBlock, TemporalAttentionMultiBlock, TemporalConvBlock_v2, TemporalConvBlock)):
78
+ temp_name = ''
79
+
80
+ if hasattr(module, 'weight'):
81
+ if temp_name != '' and (temp_name in name):
82
+ temp_key_list.append(name)
83
+ print(f'{name}')
84
+ # print(name)
85
+
86
+ save_module_list = []
87
+ for k, p in model.named_parameters():
88
+ for item in temp_key_list:
89
+ if item in k:
90
+ print(f'{item} --> {k}')
91
+ save_module_list.append(k)
92
+
93
+ print(int(sum(p.numel() for k, p in model.named_parameters()) / (1024 ** 2)), 'M parameters')
94
+
95
+ # spth = 'workspace/module_list/{}'
96
+ json.dump(save_module_list, open(spth, 'w'))
97
+ a = 0
98
+
99
+
100
+ if __name__ == '__main__':
101
+ # save_temporal_key()
102
+ save_spatial_key()
103
+
104
+
105
+
106
+ # print([k for (k, _) in self.input_blocks.named_parameters()])
107
+
108
+
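
Both helpers above walk model.named_modules(), remember the prefixes of temporal (or spatial) blocks, and then collect every parameter name under those prefixes. The same filtering idea on a toy model, for illustration only (the toy block classes are assumptions, not modules from this repo):

import json
import torch.nn as nn

class ToyTemporalBlock(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(4, 4)

class ToySpatialBlock(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 3, 1)

model = nn.ModuleDict({'temporal_1': ToyTemporalBlock(), 'spatial_1': ToySpatialBlock()})

# Prefixes of modules considered "temporal".
temporal_prefixes = [name for name, m in model.named_modules() if isinstance(m, ToyTemporalBlock)]
# Parameter names that live under one of those prefixes.
temporal_keys = [k for k, _ in model.named_parameters()
                 if any(k.startswith(p) for p in temporal_prefixes)]
print(json.dumps(temporal_keys, indent=2))   # ["temporal_1.proj.weight", "temporal_1.proj.bias"]
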
UniAnimate/test_func/test_EndDec.py ADDED
@@ -0,0 +1,95 @@
1
+ import os
2
+ import sys
3
+ import torch
4
+ import imageio
5
+ import numpy as np
6
+ import os.path as osp
7
+ sys.path.insert(0, '/'.join(osp.realpath(__file__).split('/')[:-2]))
8
+ from PIL import Image, ImageDraw, ImageFont
9
+
10
+ from einops import rearrange
11
+
12
+ from tools import *
13
+ import utils.transforms as data
14
+ from utils.seed import setup_seed
15
+ from tools.modules.config import cfg
16
+ from utils.config import Config as pConfig
17
+ from utils.registry_class import ENGINE, DATASETS, AUTO_ENCODER
18
+
19
+
20
+ def test_enc_dec(gpu=0):
21
+ setup_seed(0)
22
+ cfg_update = pConfig(load=True)
23
+
24
+ for k, v in cfg_update.cfg_dict.items():
25
+ if isinstance(v, dict) and k in cfg:
26
+ cfg[k].update(v)
27
+ else:
28
+ cfg[k] = v
29
+
30
+ save_dir = os.path.join('workspace/test_data/autoencoder', cfg.auto_encoder['type'])
31
+ os.system('rm -rf %s' % (save_dir))
32
+ os.makedirs(save_dir, exist_ok=True)
33
+
34
+ train_trans = data.Compose([
35
+ data.CenterCropWide(size=cfg.resolution),
36
+ data.ToTensor(),
37
+ data.Normalize(mean=cfg.mean, std=cfg.std)])
38
+
39
+ vit_trans = data.Compose([
40
+ data.CenterCropWide(size=(cfg.resolution[0], cfg.resolution[0])) if cfg.resolution[0]>cfg.vit_resolution[0] else data.CenterCropWide(size=cfg.vit_resolution),
41
+ data.Resize(cfg.vit_resolution),
42
+ data.ToTensor(),
43
+ data.Normalize(mean=cfg.vit_mean, std=cfg.vit_std)])
44
+
45
+ video_mean = torch.tensor(cfg.mean).view(1, -1, 1, 1) #n c f h w
46
+ video_std = torch.tensor(cfg.std).view(1, -1, 1, 1) #n c f h w
47
+
48
+ txt_size = cfg.resolution[1]
49
+ nc = int(38 * (txt_size / 256))
50
+ font = ImageFont.truetype('data/font/DejaVuSans.ttf', size=13)
51
+
52
+ dataset = DATASETS.build(cfg.vid_dataset, sample_fps=4, transforms=train_trans, vit_transforms=vit_trans)
53
+ print('There are %d videos' % (len(dataset)))
54
+
55
+ autoencoder = AUTO_ENCODER.build(cfg.auto_encoder)
56
+ autoencoder.eval() # freeze
57
+ for param in autoencoder.parameters():
58
+ param.requires_grad = False
59
+ autoencoder.to(gpu)
60
+ for idx, item in enumerate(dataset):
61
+ local_path = os.path.join(save_dir, '%04d.mp4' % idx)
62
+ # ref_frame, video_data, caption = item
63
+ ref_frame, vit_frame, video_data = item[:3]
64
+ video_data = video_data.to(gpu)
65
+
66
+ image_list = []
67
+ video_data_list = torch.chunk(video_data, video_data.shape[0]//cfg.chunk_size,dim=0)
68
+ with torch.no_grad():
69
+ decode_data = []
70
+ for chunk_data in video_data_list:
71
+ latent_z = autoencoder.encode_firsr_stage(chunk_data).detach()  # note: method name keeps the repository's original spelling
72
+ # latent_z = get_first_stage_encoding(encoder_posterior).detach()
73
+ kwargs = {"timesteps": chunk_data.shape[0]}
74
+ recons_data = autoencoder.decode(latent_z, **kwargs)
75
+
76
+ vis_data = torch.cat([chunk_data, recons_data], dim=2).cpu()
77
+ vis_data = vis_data.mul_(video_std).add_(video_mean) # 8x3x16x256x384
78
+ vis_data = vis_data.cpu()
79
+ vis_data.clamp_(0, 1)
80
+ vis_data = vis_data.permute(0, 2, 3, 1)
81
+ vis_data = [(image.numpy() * 255).astype('uint8') for image in vis_data]
82
+ image_list.extend(vis_data)
83
+
84
+ num_image = len(image_list)
85
+ frame_dir = os.path.join(save_dir, 'temp')
86
+ os.makedirs(frame_dir, exist_ok=True)
87
+ for idx in range(num_image):
88
+ tpth = os.path.join(frame_dir, '%04d.png' % (idx+1))
89
+ cv2.imwrite(tpth, image_list[idx][:,:,::-1], [int(cv2.IMWRITE_JPEG_QUALITY), 100])
90
+ cmd = f'ffmpeg -y -f image2 -loglevel quiet -framerate 8 -i {frame_dir}/%04d.png -vcodec libx264 -crf 17 -pix_fmt yuv420p {local_path}'
91
+ os.system(cmd); os.system(f'rm -rf {frame_dir}')
92
+
93
+
94
+ if __name__ == '__main__':
95
+ test_enc_dec()
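
Usage note: the reconstruction test above only compares frames visually; a small PSNR helper (a sketch, assuming both tensors are already de-normalized to [0, 1] and share the same shape) can make the encode/decode round trip quantitative.

import torch

def psnr(original, reconstructed, eps=1e-8):
    # Peak signal-to-noise ratio in dB for tensors with values in [0, 1].
    mse = torch.mean((original - reconstructed) ** 2)
    return 10.0 * torch.log10(1.0 / (mse + eps))
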
UniAnimate/test_func/test_dataset.py ADDED
@@ -0,0 +1,152 @@
1
+ import os
2
+ import sys
3
+ import imageio
4
+ import numpy as np
5
+ import os.path as osp
6
+ sys.path.insert(0, '/'.join(osp.realpath(__file__).split('/')[:-2]))
7
+ from PIL import Image, ImageDraw, ImageFont
8
+ import torchvision.transforms as T
9
+
10
+ import utils.transforms as data
11
+ from tools.modules.config import cfg
12
+ from utils.config import Config as pConfig
13
+ from utils.registry_class import ENGINE, DATASETS
14
+
15
+ from tools import *
16
+
17
+ def test_video_dataset():
18
+ cfg_update = pConfig(load=True)
19
+
20
+ for k, v in cfg_update.cfg_dict.items():
21
+ if isinstance(v, dict) and k in cfg:
22
+ cfg[k].update(v)
23
+ else:
24
+ cfg[k] = v
25
+
26
+ exp_name = os.path.basename(cfg.cfg_file).split('.')[0]
27
+ save_dir = os.path.join('workspace', 'test_data/datasets', cfg.vid_dataset['type'], exp_name)
28
+ os.system('rm -rf %s' % (save_dir))
29
+ os.makedirs(save_dir, exist_ok=True)
30
+
31
+ train_trans = data.Compose([
32
+ data.CenterCropWide(size=cfg.resolution),
33
+ data.ToTensor(),
34
+ data.Normalize(mean=cfg.mean, std=cfg.std)])
35
+ vit_trans = T.Compose([
36
+ data.CenterCropWide(cfg.vit_resolution),
37
+ T.ToTensor(),
38
+ T.Normalize(mean=cfg.vit_mean, std=cfg.vit_std)])
39
+
40
+ video_mean = torch.tensor(cfg.mean).view(1, -1, 1, 1) #n c f h w
41
+ video_std = torch.tensor(cfg.std).view(1, -1, 1, 1) #n c f h w
42
+
43
+ img_mean = torch.tensor(cfg.mean).view(-1, 1, 1) # c f h w
44
+ img_std = torch.tensor(cfg.std).view(-1, 1, 1) # c f h w
45
+
46
+ vit_mean = torch.tensor(cfg.vit_mean).view(-1, 1, 1) # c f h w
47
+ vit_std = torch.tensor(cfg.vit_std).view(-1, 1, 1) # c f h w
48
+
49
+ txt_size = cfg.resolution[1]
50
+ nc = int(38 * (txt_size / 256))
51
+ font = ImageFont.truetype('data/font/DejaVuSans.ttf', size=13)
52
+
53
+ dataset = DATASETS.build(cfg.vid_dataset, sample_fps=cfg.sample_fps[0], transforms=train_trans, vit_transforms=vit_trans)
54
+ print('There are %d videos' % (len(dataset)))
55
+ for idx, item in enumerate(dataset):
56
+ ref_frame, vit_frame, video_data, caption, video_key = item
57
+
58
+ video_data = video_data.mul_(video_std).add_(video_mean)
59
+ video_data.clamp_(0, 1)
60
+ video_data = video_data.permute(0, 2, 3, 1)
61
+ video_data = [(image.numpy() * 255).astype('uint8') for image in video_data]
62
+
63
+ # Single Image
64
+ ref_frame = ref_frame.mul_(img_std).add_(img_mean)
65
+ ref_frame.clamp_(0, 1)
66
+ ref_frame = ref_frame.permute(1, 2, 0)
67
+ ref_frame = (ref_frame.numpy() * 255).astype('uint8')
68
+
69
+ # Text image
70
+ txt_img = Image.new("RGB", (txt_size, txt_size), color="white")
71
+ draw = ImageDraw.Draw(txt_img)
72
+ lines = "\n".join(caption[start:start + nc] for start in range(0, len(caption), nc))
73
+ draw.text((0, 0), lines, fill="black", font=font)
74
+ txt_img = np.array(txt_img)
75
+
76
+ video_data = [np.concatenate([ref_frame, u, txt_img], axis=1) for u in video_data]
77
+ spath = os.path.join(save_dir, '%04d.gif' % (idx))
78
+ imageio.mimwrite(spath, video_data, fps=8)
79
+
80
+ # if idx > 100: break
81
+
82
+
83
+ def test_vit_image(test_video_flag=True):
84
+ cfg_update = pConfig(load=True)
85
+
86
+ for k, v in cfg_update.cfg_dict.items():
87
+ if isinstance(v, dict) and k in cfg:
88
+ cfg[k].update(v)
89
+ else:
90
+ cfg[k] = v
91
+
92
+ exp_name = os.path.basename(cfg.cfg_file).split('.')[0]
93
+ save_dir = os.path.join('workspace', 'test_data/datasets', cfg.img_dataset['type'], exp_name)
94
+ os.system('rm -rf %s' % (save_dir))
95
+ os.makedirs(save_dir, exist_ok=True)
96
+
97
+ train_trans = data.Compose([
98
+ data.CenterCropWide(size=cfg.resolution),
99
+ data.ToTensor(),
100
+ data.Normalize(mean=cfg.mean, std=cfg.std)])
101
+ vit_trans = data.Compose([
102
+ data.CenterCropWide(cfg.resolution),
103
+ data.Resize(cfg.vit_resolution),
104
+ data.ToTensor(),
105
+ data.Normalize(mean=cfg.vit_mean, std=cfg.vit_std)])
106
+
107
+ img_mean = torch.tensor(cfg.mean).view(-1, 1, 1) # c f h w
108
+ img_std = torch.tensor(cfg.std).view(-1, 1, 1) # c f h w
109
+
110
+ vit_mean = torch.tensor(cfg.vit_mean).view(-1, 1, 1) # c f h w
111
+ vit_std = torch.tensor(cfg.vit_std).view(-1, 1, 1) # c f h w
112
+
113
+ txt_size = cfg.resolution[1]
114
+ nc = int(38 * (txt_size / 256))
115
+ font = ImageFont.truetype('artist/font/DejaVuSans.ttf', size=13)
116
+
117
+ dataset = DATASETS.build(cfg.img_dataset, transforms=train_trans, vit_transforms=vit_trans)
118
+ print('There are %d videos' % (len(dataset)))
119
+ for idx, item in enumerate(dataset):
120
+ ref_frame, vit_frame, video_data, caption, video_key = item
121
+ video_data = video_data.mul_(img_std).add_(img_mean)
122
+ video_data.clamp_(0, 1)
123
+ video_data = video_data.permute(0, 2, 3, 1)
124
+ video_data = [(image.numpy() * 255).astype('uint8') for image in video_data]
125
+
126
+ # Single Image
127
+ vit_frame = vit_frame.mul_(vit_std).add_(vit_mean)
128
+ vit_frame.clamp_(0, 1)
129
+ vit_frame = vit_frame.permute(1, 2, 0)
130
+ vit_frame = (vit_frame.numpy() * 255).astype('uint8')
131
+
132
+ zero_frame = np.zeros((cfg.resolution[1], cfg.resolution[1], 3), dtype=np.uint8)
133
+ zero_frame[:vit_frame.shape[0], :vit_frame.shape[1], :] = vit_frame
134
+
135
+ # Text image
136
+ txt_img = Image.new("RGB", (txt_size, txt_size), color="white")
137
+ draw = ImageDraw.Draw(txt_img)
138
+ lines = "\n".join(caption[start:start + nc] for start in range(0, len(caption), nc))
139
+ draw.text((0, 0), lines, fill="black", font=font)
140
+ txt_img = np.array(txt_img)
141
+
142
+ video_data = [np.concatenate([zero_frame, u, txt_img], axis=1) for u in video_data]
143
+ spath = os.path.join(save_dir, '%04d.gif' % (idx))
144
+ imageio.mimwrite(spath, video_data, fps=8)
145
+
146
+ # if idx > 100: break
147
+
148
+
149
+ if __name__ == '__main__':
150
+ # test_video_dataset()
151
+ test_vit_image()
152
+
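
Usage note: both tests above undo data.Normalize inline with mul_/add_; the same step as a standalone helper, assuming per-channel mean/std lists like cfg.mean and cfg.std.

import torch

def denormalize(clip, mean, std):
    # clip: (F, C, H, W) tensor produced by data.Normalize; returns values clamped to [0, 1].
    mean = torch.tensor(mean).view(1, -1, 1, 1)
    std = torch.tensor(std).view(1, -1, 1, 1)
    return (clip * std + mean).clamp(0, 1)
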
UniAnimate/test_func/test_models.py ADDED
@@ -0,0 +1,56 @@
1
+ import os
2
+ import sys
3
+ import torch
4
+ import imageio
5
+ import numpy as np
6
+ import os.path as osp
7
+ sys.path.insert(0, '/'.join(osp.realpath(__file__).split('/')[:-2]))
8
+ from thop import profile
9
+ from ptflops import get_model_complexity_info
10
+
11
+ import artist.data as data
12
+ from tools.modules.config import cfg
13
+ from utils.config import Config as pConfig
14
+ from utils.registry_class import ENGINE, MODEL
15
+
16
+
17
+ def test_model():
18
+ cfg_update = pConfig(load=True)
19
+
20
+ for k, v in cfg_update.cfg_dict.items():
21
+ if isinstance(v, dict) and k in cfg:
22
+ cfg[k].update(v)
23
+ else:
24
+ cfg[k] = v
25
+
26
+ model = MODEL.build(cfg.UNet)
27
+ print(int(sum(p.numel() for k, p in model.named_parameters()) / (1024 ** 2)), 'M parameters')
28
+
29
+ # state_dict = torch.load('cache/pretrain_model/jiuniu_0600000.pth', map_location='cpu')
30
+ # model.load_state_dict(state_dict, strict=False)
31
+ model = model.cuda()
32
+
33
+ x = torch.Tensor(1, 4, 16, 32, 56).cuda()
34
+ t = torch.Tensor(1).cuda()
35
+ sims = torch.Tensor(1, 32).cuda()
36
+ fps = torch.Tensor([8]).cuda()
37
+ y = torch.Tensor(1, 1, 1024).cuda()
38
+ image = torch.Tensor(1, 3, 256, 448).cuda()
39
+
40
+ ret = model(x=x, t=t, y=y, ori_img=image, sims=sims, fps=fps)
41
+ print('Out shape is {}'.format(ret.shape))
42
+
43
+ # flops, params = profile(model=model, inputs=(x, t, y, image, sims, fps))
44
+ # print('Model: {:.2f} GFLOPs and {:.2f}M parameters'.format(flops/1e9, params/1e6))
45
+
46
+ def prepare_input(resolution):
47
+ return dict(x=[x, t, y, image, sims, fps])
48
+
49
+ flops, params = get_model_complexity_info(model, (1, 4, 16, 32, 56),
50
+ input_constructor = prepare_input,
51
+ as_strings=True, print_per_layer_stat=True)
52
+ print(' - Flops: ' + flops)
53
+ print(' - Params: ' + params)
54
+
55
+ if __name__ == '__main__':
56
+ test_model()
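
Usage note: the parameter count above divides by 1024 ** 2, so the printed 'M parameters' is in binary mega-units; a plain-millions variant would be the following sketch.

def count_parameters_m(model):
    # Total parameter count reported in millions (1e6), trainable and frozen alike.
    return sum(p.numel() for p in model.parameters()) / 1e6
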
UniAnimate/test_func/test_save_video.py ADDED
@@ -0,0 +1,24 @@
1
+ import numpy as np
2
+ import cv2
3
+
4
+ cap = cv2.VideoCapture('workspace/img_dir/tst.mp4')
5
+
6
+ fourcc = cv2.VideoWriter_fourcc(*'H264')
7
+
8
+ ret, frame = cap.read()
9
+ vid_size = frame.shape[:2][::-1]
10
+
11
+ out = cv2.VideoWriter('workspace/img_dir/testwrite.mp4',fourcc, 8, vid_size)
12
+ out.write(frame)
13
+
14
+ while(cap.isOpened()):
15
+ ret, frame = cap.read()
16
+ if not ret: break
17
+ out.write(frame)
18
+
19
+
20
+ cap.release()
21
+ out.release()
22
+
23
+
24
+
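
Usage note: the loop above leaks the capture/writer handles if reading fails part-way, and the 'H264' fourcc is only available in some OpenCV builds. A more defensive variant (a sketch; 'mp4v' chosen for portability) could look like this.

import cv2

def reencode(src_path, dst_path, fps=8):
    # Copy every frame of src_path into a new mp4 at the given fps, releasing handles on any error.
    cap = cv2.VideoCapture(src_path)
    out = None
    try:
        ret, frame = cap.read()
        if not ret:
            return
        height, width = frame.shape[:2]
        fourcc = cv2.VideoWriter_fourcc(*'mp4v')
        out = cv2.VideoWriter(dst_path, fourcc, fps, (width, height))
        while ret:
            out.write(frame)
            ret, frame = cap.read()
    finally:
        cap.release()
        if out is not None:
            out.release()
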
UniAnimate/tools/__init__.py ADDED
@@ -0,0 +1,3 @@
1
+ from .datasets import *
2
+ from .modules import *
3
+ from .inferences import *
UniAnimate/tools/datasets/__init__.py ADDED
@@ -0,0 +1,2 @@
1
+ from .image_dataset import *
2
+ from .video_dataset import *
UniAnimate/tools/datasets/image_dataset.py ADDED
@@ -0,0 +1,86 @@
1
+ import os
2
+ import cv2
3
+ import torch
4
+ import random
5
+ import logging
6
+ import tempfile
7
+ import numpy as np
8
+ from copy import copy
9
+ from PIL import Image
10
+ from io import BytesIO
11
+ from torch.utils.data import Dataset
12
+ from utils.registry_class import DATASETS
13
+
14
+ @DATASETS.register_class()
15
+ class ImageDataset(Dataset):
16
+ def __init__(self,
17
+ data_list,
18
+ data_dir_list,
19
+ max_words=1000,
20
+ vit_resolution=[224, 224],
21
+ resolution=(384, 256),
22
+ max_frames=1,
23
+ transforms=None,
24
+ vit_transforms=None,
25
+ **kwargs):
26
+
27
+ self.max_frames = max_frames
28
+ self.resolution = resolution
29
+ self.transforms = transforms
30
+ self.vit_resolution = vit_resolution
31
+ self.vit_transforms = vit_transforms
32
+
33
+ image_list = []
34
+ for item_path, data_dir in zip(data_list, data_dir_list):
35
+ lines = open(item_path, 'r').readlines()
36
+ lines = [[data_dir, item.strip()] for item in lines]
37
+ image_list.extend(lines)
38
+ self.image_list = image_list
39
+
40
+ def __len__(self):
41
+ return len(self.image_list)
42
+
43
+ def __getitem__(self, index):
44
+ data_dir, file_path = self.image_list[index]
45
+ img_key = file_path.split('|||')[0]
46
+ try:
47
+ ref_frame, vit_frame, video_data, caption = self._get_image_data(data_dir, file_path)
48
+ except Exception as e:
49
+ logging.info('{} get frames failed... with error: {}'.format(img_key, e))
50
+ caption = ''
51
+ img_key = ''
52
+ ref_frame = torch.zeros(3, self.resolution[1], self.resolution[0])
53
+ vit_frame = torch.zeros(3, self.vit_resolution[1], self.vit_resolution[0])
54
+ video_data = torch.zeros(self.max_frames, 3, self.resolution[1], self.resolution[0])
55
+ return ref_frame, vit_frame, video_data, caption, img_key
56
+
57
+ def _get_image_data(self, data_dir, file_path):
58
+ frame_list = []
59
+ img_key, caption = file_path.split('|||')
60
+ file_path = os.path.join(data_dir, img_key)
61
+ for _ in range(5):
62
+ try:
63
+ image = Image.open(file_path)
64
+ if image.mode != 'RGB':
65
+ image = image.convert('RGB')
66
+ frame_list.append(image)
67
+ break
68
+ except Exception as e:
69
+ logging.info('{} read video frame failed with error: {}'.format(img_key, e))
70
+ continue
71
+
72
+ video_data = torch.zeros(self.max_frames, 3, self.resolution[1], self.resolution[0])
73
+ try:
74
+ if len(frame_list) > 0:
75
+ mid_frame = frame_list[0]
76
+ vit_frame = self.vit_transforms(mid_frame)
77
+ frame_tensor = self.transforms(frame_list)
78
+ video_data[:len(frame_list), ...] = frame_tensor
79
+ else:
80
+ vit_frame = torch.zeros(3, self.vit_resolution[1], self.vit_resolution[0])
81
+ except:
82
+ vit_frame = torch.zeros(3, self.vit_resolution[1], self.vit_resolution[0])
83
+ ref_frame = copy(video_data[0])
84
+
85
+ return ref_frame, vit_frame, video_data, caption
86
+
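
Format note: the list files consumed by ImageDataset separate the image key from its caption with '|||'. A minimal illustration with a hypothetical record:

record = "images/0001.jpg|||a person standing on a beach"  # hypothetical list-file line
img_key, caption = record.strip().split('|||')
# img_key -> 'images/0001.jpg', caption -> 'a person standing on a beach'
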
UniAnimate/tools/datasets/video_dataset.py ADDED
@@ -0,0 +1,118 @@
1
+ import os
2
+ import cv2
3
+ import json
4
+ import torch
5
+ import random
6
+ import logging
7
+ import tempfile
8
+ import numpy as np
9
+ from copy import copy
10
+ from PIL import Image
11
+ from torch.utils.data import Dataset
12
+ from utils.registry_class import DATASETS
13
+
14
+
15
+ @DATASETS.register_class()
16
+ class VideoDataset(Dataset):
17
+ def __init__(self,
18
+ data_list,
19
+ data_dir_list,
20
+ max_words=1000,
21
+ resolution=(384, 256),
22
+ vit_resolution=(224, 224),
23
+ max_frames=16,
24
+ sample_fps=8,
25
+ transforms=None,
26
+ vit_transforms=None,
27
+ get_first_frame=False,
28
+ **kwargs):
29
+
30
+ self.max_words = max_words
31
+ self.max_frames = max_frames
32
+ self.resolution = resolution
33
+ self.vit_resolution = vit_resolution
34
+ self.sample_fps = sample_fps
35
+ self.transforms = transforms
36
+ self.vit_transforms = vit_transforms
37
+ self.get_first_frame = get_first_frame
38
+
39
+ image_list = []
40
+ for item_path, data_dir in zip(data_list, data_dir_list):
41
+ lines = open(item_path, 'r').readlines()
42
+ lines = [[data_dir, item] for item in lines]
43
+ image_list.extend(lines)
44
+ self.image_list = image_list
45
+
46
+
47
+ def __getitem__(self, index):
48
+ data_dir, file_path = self.image_list[index]
49
+ video_key = file_path.split('|||')[0]
50
+ try:
51
+ ref_frame, vit_frame, video_data, caption = self._get_video_data(data_dir, file_path)
52
+ except Exception as e:
53
+ logging.info('{} get frames failed... with error: {}'.format(video_key, e))
54
+ caption = ''
55
+ video_key = ''
56
+ ref_frame = torch.zeros(3, self.resolution[1], self.resolution[0])
57
+ vit_frame = torch.zeros(3, self.vit_resolution[1], self.vit_resolution[0])
58
+ video_data = torch.zeros(self.max_frames, 3, self.resolution[1], self.resolution[0])
59
+ return ref_frame, vit_frame, video_data, caption, video_key
60
+
61
+
62
+ def _get_video_data(self, data_dir, file_path):
63
+ video_key, caption = file_path.split('|||')
64
+ file_path = os.path.join(data_dir, video_key)
65
+
66
+ for _ in range(5):
67
+ try:
68
+ capture = cv2.VideoCapture(file_path)
69
+ _fps = capture.get(cv2.CAP_PROP_FPS)
70
+ _total_frame_num = capture.get(cv2.CAP_PROP_FRAME_COUNT)
71
+ stride = round(_fps / self.sample_fps)
72
+ cover_frame_num = (stride * self.max_frames)
73
+ if _total_frame_num < cover_frame_num + 5:
74
+ start_frame = 0
75
+ end_frame = _total_frame_num
76
+ else:
77
+ start_frame = random.randint(0, _total_frame_num-cover_frame_num-5)
78
+ end_frame = start_frame + cover_frame_num
79
+
80
+ pointer, frame_list = 0, []
81
+ while(True):
82
+ ret, frame = capture.read()
83
+ pointer +=1
84
+ if (not ret) or (frame is None): break
85
+ if pointer < start_frame: continue
86
+ if pointer >= end_frame - 1: break
87
+ if (pointer - start_frame) % stride == 0:
88
+ frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
89
+ frame = Image.fromarray(frame)
90
+ frame_list.append(frame)
91
+ break
92
+ except Exception as e:
93
+ logging.info('{} read video frame failed with error: {}'.format(video_key, e))
94
+ continue
95
+
96
+ video_data = torch.zeros(self.max_frames, 3, self.resolution[1], self.resolution[0])
97
+ if self.get_first_frame:
98
+ ref_idx = 0
99
+ else:
100
+ ref_idx = int(len(frame_list)/2)
101
+ try:
102
+ if len(frame_list)>0:
103
+ mid_frame = copy(frame_list[ref_idx])
104
+ vit_frame = self.vit_transforms(mid_frame)
105
+ frames = self.transforms(frame_list)
106
+ video_data[:len(frame_list), ...] = frames
107
+ else:
108
+ vit_frame = torch.zeros(3, self.vit_resolution[1], self.vit_resolution[0])
109
+ except:
110
+ vit_frame = torch.zeros(3, self.vit_resolution[1], self.vit_resolution[0])
111
+ ref_frame = copy(frames[ref_idx])
112
+
113
+ return ref_frame, vit_frame, video_data, caption
114
+
115
+ def __len__(self):
116
+ return len(self.image_list)
117
+
118
+
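
Sampling note: _get_video_data above derives a frame stride from the source fps and the target sample_fps; the same arithmetic in isolation (the random start offset used in the dataset is omitted here for clarity).

def sampled_indices(total_frames, fps, sample_fps, max_frames):
    # e.g. a 25 fps clip sampled at 8 fps gives stride 3.
    stride = max(round(fps / sample_fps), 1)
    end = min(stride * max_frames, total_frames)
    return list(range(0, end, stride))
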
UniAnimate/tools/inferences/__init__.py ADDED
@@ -0,0 +1,2 @@
1
+ from .inference_unianimate_entrance import *
2
+ from .inference_unianimate_long_entrance import *
UniAnimate/tools/inferences/inference_unianimate_entrance.py ADDED
@@ -0,0 +1,483 @@
1
+ '''
2
+ /*
3
+ *Copyright (c) 2021, Alibaba Group;
4
+ *Licensed under the Apache License, Version 2.0 (the "License");
5
+ *you may not use this file except in compliance with the License.
6
+ *You may obtain a copy of the License at
7
+
8
+ * http://www.apache.org/licenses/LICENSE-2.0
9
+
10
+ *Unless required by applicable law or agreed to in writing, software
11
+ *distributed under the License is distributed on an "AS IS" BASIS,
12
+ *WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13
+ *See the License for the specific language governing permissions and
14
+ *limitations under the License.
15
+ */
16
+ '''
17
+
18
+ import os
19
+ import re
20
+ import os.path as osp
21
+ import sys
22
+ sys.path.insert(0, '/'.join(osp.realpath(__file__).split('/')[:-4]))
23
+ import json
24
+ import math
25
+ import torch
26
+ import pynvml
27
+ import logging
28
+ import cv2
29
+ import numpy as np
30
+ from PIL import Image
31
+ from tqdm import tqdm
32
+ import torch.cuda.amp as amp
33
+ from importlib import reload
34
+ import torch.distributed as dist
35
+ import torch.multiprocessing as mp
36
+ import random
37
+ from einops import rearrange
38
+ import torchvision.transforms as T
39
+ import torchvision.transforms.functional as TF
40
+ from torch.nn.parallel import DistributedDataParallel
41
+
42
+ import utils.transforms as data
43
+ from ..modules.config import cfg
44
+ from utils.seed import setup_seed
45
+ from utils.multi_port import find_free_port
46
+ from utils.assign_cfg import assign_signle_cfg
47
+ from utils.distributed import generalized_all_gather, all_reduce
48
+ from utils.video_op import save_i2vgen_video, save_t2vhigen_video_safe, save_video_multiple_conditions_not_gif_horizontal_3col
49
+ from tools.modules.autoencoder import get_first_stage_encoding
50
+ from utils.registry_class import INFER_ENGINE, MODEL, EMBEDDER, AUTO_ENCODER, DIFFUSION
51
+ from copy import copy
52
+ import cv2
53
+
54
+
55
+ @INFER_ENGINE.register_function()
56
+ def inference_unianimate_entrance(cfg_update, **kwargs):
57
+ for k, v in cfg_update.items():
58
+ if isinstance(v, dict) and k in cfg:
59
+ cfg[k].update(v)
60
+ else:
61
+ cfg[k] = v
62
+
63
+ if not 'MASTER_ADDR' in os.environ:
64
+ os.environ['MASTER_ADDR']='localhost'
65
+ os.environ['MASTER_PORT']= find_free_port()
66
+ cfg.pmi_rank = int(os.getenv('RANK', 0))
67
+ cfg.pmi_world_size = int(os.getenv('WORLD_SIZE', 1))
68
+
69
+ if cfg.debug:
70
+ cfg.gpus_per_machine = 1
71
+ cfg.world_size = 1
72
+ else:
73
+ cfg.gpus_per_machine = torch.cuda.device_count()
74
+ cfg.world_size = cfg.pmi_world_size * cfg.gpus_per_machine
75
+
76
+ if cfg.world_size == 1:
77
+ worker(0, cfg, cfg_update)
78
+ else:
79
+ mp.spawn(worker, nprocs=cfg.gpus_per_machine, args=(cfg, cfg_update))
80
+ return cfg
81
+
82
+
83
+ def make_masked_images(imgs, masks):
84
+ masked_imgs = []
85
+ for i, mask in enumerate(masks):
86
+ # concatenation
87
+ masked_imgs.append(torch.cat([imgs[i] * (1 - mask), (1 - mask)], dim=1))
88
+ return torch.stack(masked_imgs, dim=0)
89
+
90
+ def load_video_frames(ref_image_path, pose_file_path, train_trans, vit_transforms, train_trans_pose, max_frames=32, frame_interval = 1, resolution=[512, 768], get_first_frame=True, vit_resolution=[224, 224]):
91
+ for _ in range(5):
92
+ try:
93
+ dwpose_all = {}
94
+ frames_all = {}
95
+ for ii_index in sorted(os.listdir(pose_file_path)):
96
+ if ii_index != "ref_pose.jpg":
97
+ dwpose_all[ii_index] = Image.open(os.path.join(pose_file_path, ii_index))
98
+ frames_all[ii_index] = Image.fromarray(cv2.cvtColor(cv2.imread(ref_image_path), cv2.COLOR_BGR2RGB))
99
+
100
+ pose_ref = Image.open(os.path.join(pose_file_path, "ref_pose.jpg"))
101
+
102
+ # Sample max_frames poses for video generation
103
+ stride = frame_interval
104
+ total_frame_num = len(frames_all)
105
+ cover_frame_num = (stride * (max_frames - 1) + 1)
106
+
107
+ if total_frame_num < cover_frame_num:
108
+ print(f'_total_frame_num ({total_frame_num}) is smaller than cover_frame_num ({cover_frame_num}), the sampled frame interval is changed')
109
+ start_frame = 0
110
+ end_frame = total_frame_num
111
+ stride = max((total_frame_num - 1) // (max_frames - 1), 1)
112
+ end_frame = stride * max_frames
113
+ else:
114
+ start_frame = 0
115
+ end_frame = start_frame + cover_frame_num
116
+
117
+ frame_list = []
118
+ dwpose_list = []
119
+ random_ref_frame = frames_all[list(frames_all.keys())[0]]
120
+ if random_ref_frame.mode != 'RGB':
121
+ random_ref_frame = random_ref_frame.convert('RGB')
122
+ random_ref_dwpose = pose_ref
123
+ if random_ref_dwpose.mode != 'RGB':
124
+ random_ref_dwpose = random_ref_dwpose.convert('RGB')
125
+
126
+ for i_index in range(start_frame, end_frame, stride):
127
+ if i_index < len(frames_all): # Check index within bounds
128
+ i_key = list(frames_all.keys())[i_index]
129
+ i_frame = frames_all[i_key]
130
+ if i_frame.mode != 'RGB':
131
+ i_frame = i_frame.convert('RGB')
132
+
133
+ i_dwpose = dwpose_all[i_key]
134
+ if i_dwpose.mode != 'RGB':
135
+ i_dwpose = i_dwpose.convert('RGB')
136
+ frame_list.append(i_frame)
137
+ dwpose_list.append(i_dwpose)
138
+
139
+ if frame_list:
140
+ middle_indix = 0
141
+ ref_frame = frame_list[middle_indix]
142
+ vit_frame = vit_transforms(ref_frame)
143
+ random_ref_frame_tmp = train_trans_pose(random_ref_frame)
144
+ random_ref_dwpose_tmp = train_trans_pose(random_ref_dwpose)
145
+ misc_data_tmp = torch.stack([train_trans_pose(ss) for ss in frame_list], dim=0)
146
+ video_data_tmp = torch.stack([train_trans(ss) for ss in frame_list], dim=0)
147
+ dwpose_data_tmp = torch.stack([train_trans_pose(ss) for ss in dwpose_list], dim=0)
148
+
149
+ video_data = torch.zeros(max_frames, 3, resolution[1], resolution[0])
150
+ dwpose_data = torch.zeros(max_frames, 3, resolution[1], resolution[0])
151
+ misc_data = torch.zeros(max_frames, 3, resolution[1], resolution[0])
152
+ random_ref_frame_data = torch.zeros(max_frames, 3, resolution[1], resolution[0])
153
+ random_ref_dwpose_data = torch.zeros(max_frames, 3, resolution[1], resolution[0])
154
+
155
+ video_data[:len(frame_list), ...] = video_data_tmp
156
+ misc_data[:len(frame_list), ...] = misc_data_tmp
157
+ dwpose_data[:len(frame_list), ...] = dwpose_data_tmp
158
+ random_ref_frame_data[:, ...] = random_ref_frame_tmp
159
+ random_ref_dwpose_data[:, ...] = random_ref_dwpose_tmp
160
+
161
+ return vit_frame, video_data, misc_data, dwpose_data, random_ref_frame_data, random_ref_dwpose_data
162
+
163
+ except Exception as e:
164
+ logging.info(f'Error reading video frame: {e}')
165
+ continue
166
+
167
+ return None, None, None, None, None, None
168
+
169
+ def worker(gpu, cfg, cfg_update):
170
+ '''
171
+ Inference worker for each gpu
172
+ '''
173
+ for k, v in cfg_update.items():
174
+ if isinstance(v, dict) and k in cfg:
175
+ cfg[k].update(v)
176
+ else:
177
+ cfg[k] = v
178
+
179
+ cfg.gpu = gpu
180
+ cfg.seed = int(cfg.seed)
181
+ cfg.rank = cfg.pmi_rank * cfg.gpus_per_machine + gpu
182
+ setup_seed(cfg.seed + cfg.rank)
183
+
184
+ if not cfg.debug:
185
+ torch.cuda.set_device(gpu)
186
+ torch.backends.cudnn.benchmark = True
187
+ if hasattr(cfg, "CPU_CLIP_VAE") and cfg.CPU_CLIP_VAE:
188
+ torch.backends.cudnn.benchmark = False
189
+ dist.init_process_group(backend='nccl', world_size=cfg.world_size, rank=cfg.rank)
190
+
191
+ # [Log] Save logging and make log dir
192
+ log_dir = generalized_all_gather(cfg.log_dir)[0]
193
+ inf_name = osp.basename(cfg.cfg_file).split('.')[0]
194
+ test_model = osp.basename(cfg.test_model).split('.')[0].split('_')[-1]
195
+
196
+ cfg.log_dir = osp.join(cfg.log_dir, '%s' % (inf_name))
197
+ os.makedirs(cfg.log_dir, exist_ok=True)
198
+ log_file = osp.join(cfg.log_dir, 'log_%02d.txt' % (cfg.rank))
199
+ cfg.log_file = log_file
200
+ reload(logging)
201
+ logging.basicConfig(
202
+ level=logging.INFO,
203
+ format='[%(asctime)s] %(levelname)s: %(message)s',
204
+ handlers=[
205
+ logging.FileHandler(filename=log_file),
206
+ logging.StreamHandler(stream=sys.stdout)])
207
+ logging.info(cfg)
208
+ logging.info(f"Running UniAnimate inference on gpu {gpu}")
209
+
210
+ # [Diffusion]
211
+ diffusion = DIFFUSION.build(cfg.Diffusion)
212
+
213
+ # [Data] Data Transform
214
+ train_trans = data.Compose([
215
+ data.Resize(cfg.resolution),
216
+ data.ToTensor(),
217
+ data.Normalize(mean=cfg.mean, std=cfg.std)
218
+ ])
219
+
220
+ train_trans_pose = data.Compose([
221
+ data.Resize(cfg.resolution),
222
+ data.ToTensor(),
223
+ ]
224
+ )
225
+
226
+ vit_transforms = T.Compose([
227
+ data.Resize(cfg.vit_resolution),
228
+ T.ToTensor(),
229
+ T.Normalize(mean=cfg.vit_mean, std=cfg.vit_std)])
230
+
231
+ # [Model] embedder
232
+ clip_encoder = EMBEDDER.build(cfg.embedder)
233
+ clip_encoder.model.to(gpu)
234
+ with torch.no_grad():
235
+ _, _, zero_y = clip_encoder(text="")
236
+
237
+
238
+ # [Model] autoencoder
239
+ autoencoder = AUTO_ENCODER.build(cfg.auto_encoder)
240
+ autoencoder.eval() # freeze
241
+ for param in autoencoder.parameters():
242
+ param.requires_grad = False
243
+ autoencoder.cuda()
244
+
245
+ # [Model] UNet
246
+ if "config" in cfg.UNet:
247
+ cfg.UNet["config"] = cfg
248
+ cfg.UNet["zero_y"] = zero_y
249
+ model = MODEL.build(cfg.UNet)
250
+ state_dict = torch.load(cfg.test_model, map_location='cpu')
251
+ if 'state_dict' in state_dict:
252
+ state_dict = state_dict['state_dict']
253
+ if 'step' in state_dict:
254
+ resume_step = state_dict['step']
255
+ else:
256
+ resume_step = 0
257
+ status = model.load_state_dict(state_dict, strict=True)
258
+ logging.info('Load model from {} with status {}'.format(cfg.test_model, status))
259
+ model = model.to(gpu)
260
+ model.eval()
261
+ if hasattr(cfg, "CPU_CLIP_VAE") and cfg.CPU_CLIP_VAE:
262
+ model.to(torch.float16)
263
+ else:
264
+ model = DistributedDataParallel(model, device_ids=[gpu]) if not cfg.debug else model
265
+ torch.cuda.empty_cache()
266
+
267
+
268
+
269
+ test_list = cfg.test_list_path
270
+ num_videos = len(test_list)
271
+ logging.info(f'There are {num_videos} videos, each sampled {cfg.round} times')
272
+ # test_list = [item for item in test_list for _ in range(cfg.round)]
273
+ test_list = [item for _ in range(cfg.round) for item in test_list]
274
+
275
+ for idx, file_path in enumerate(test_list):
276
+ cfg.frame_interval, ref_image_key, pose_seq_key = file_path[0], file_path[1], file_path[2]
277
+
278
+ manual_seed = int(cfg.seed + cfg.rank + idx//num_videos)
279
+ setup_seed(manual_seed)
280
+ logging.info(f"[{idx}]/[{len(test_list)}] Begin to sample {ref_image_key}, pose sequence from {pose_seq_key}, init seed {manual_seed} ...")
281
+
282
+
283
+ vit_frame, video_data, misc_data, dwpose_data, random_ref_frame_data, random_ref_dwpose_data = load_video_frames(ref_image_key, pose_seq_key, train_trans, vit_transforms, train_trans_pose, max_frames=cfg.max_frames, frame_interval =cfg.frame_interval, resolution=cfg.resolution)
284
+ misc_data = misc_data.unsqueeze(0).to(gpu)
285
+ vit_frame = vit_frame.unsqueeze(0).to(gpu)
286
+ dwpose_data = dwpose_data.unsqueeze(0).to(gpu)
287
+ random_ref_frame_data = random_ref_frame_data.unsqueeze(0).to(gpu)
288
+ random_ref_dwpose_data = random_ref_dwpose_data.unsqueeze(0).to(gpu)
289
+
290
+ ### save for visualization
291
+ misc_backups = copy(misc_data)
292
+ frames_num = misc_data.shape[1]
293
+ misc_backups = rearrange(misc_backups, 'b f c h w -> b c f h w')
294
+ mv_data_video = []
295
+
296
+
297
+ ### local image (first frame)
298
+ image_local = []
299
+ if 'local_image' in cfg.video_compositions:
300
+ frames_num = misc_data.shape[1]
301
+ bs_vd_local = misc_data.shape[0]
302
+ image_local = misc_data[:,:1].clone().repeat(1,frames_num,1,1,1)
303
+ image_local_clone = rearrange(image_local, 'b f c h w -> b c f h w', b = bs_vd_local)
304
+ image_local = rearrange(image_local, 'b f c h w -> b c f h w', b = bs_vd_local)
305
+ if hasattr(cfg, "latent_local_image") and cfg.latent_local_image:
306
+ with torch.no_grad():
307
+ temporal_length = frames_num
308
+ encoder_posterior = autoencoder.encode(video_data[:,0])
309
+ local_image_data = get_first_stage_encoding(encoder_posterior).detach()
310
+ image_local = local_image_data.unsqueeze(1).repeat(1,temporal_length,1,1,1) # [10, 16, 4, 64, 40]
311
+
312
+
313
+
314
+ ### encode the video_data
315
+ bs_vd = misc_data.shape[0]
316
+ misc_data = rearrange(misc_data, 'b f c h w -> (b f) c h w')
317
+ misc_data_list = torch.chunk(misc_data, misc_data.shape[0]//cfg.chunk_size,dim=0)
318
+
319
+
320
+ with torch.no_grad():
321
+
322
+ random_ref_frame = []
323
+ if 'randomref' in cfg.video_compositions:
324
+ random_ref_frame_clone = rearrange(random_ref_frame_data, 'b f c h w -> b c f h w')
325
+ if hasattr(cfg, "latent_random_ref") and cfg.latent_random_ref:
326
+
327
+ temporal_length = random_ref_frame_data.shape[1]
328
+ encoder_posterior = autoencoder.encode(random_ref_frame_data[:,0].sub(0.5).div_(0.5))
329
+ random_ref_frame_data = get_first_stage_encoding(encoder_posterior).detach()
330
+ random_ref_frame_data = random_ref_frame_data.unsqueeze(1).repeat(1,temporal_length,1,1,1) # [10, 16, 4, 64, 40]
331
+
332
+ random_ref_frame = rearrange(random_ref_frame_data, 'b f c h w -> b c f h w')
333
+
334
+
335
+ if 'dwpose' in cfg.video_compositions:
336
+ bs_vd_local = dwpose_data.shape[0]
337
+ dwpose_data_clone = rearrange(dwpose_data.clone(), 'b f c h w -> b c f h w', b = bs_vd_local)
338
+ if 'randomref_pose' in cfg.video_compositions:
339
+ dwpose_data = torch.cat([random_ref_dwpose_data[:,:1], dwpose_data], dim=1)
340
+ dwpose_data = rearrange(dwpose_data, 'b f c h w -> b c f h w', b = bs_vd_local)
341
+
342
+
343
+ y_visual = []
344
+ if 'image' in cfg.video_compositions:
345
+ with torch.no_grad():
346
+ vit_frame = vit_frame.squeeze(1)
347
+ y_visual = clip_encoder.encode_image(vit_frame).unsqueeze(1) # [60, 1024]
348
+ y_visual0 = y_visual.clone()
349
+
350
+
351
+ with amp.autocast(enabled=True):
352
+ pynvml.nvmlInit()
353
+ handle=pynvml.nvmlDeviceGetHandleByIndex(0)
354
+ meminfo=pynvml.nvmlDeviceGetMemoryInfo(handle)
355
+ cur_seed = torch.initial_seed()
356
+ logging.info(f"Current seed {cur_seed} ...")
357
+
358
+ noise = torch.randn([1, 4, cfg.max_frames, int(cfg.resolution[1]/cfg.scale), int(cfg.resolution[0]/cfg.scale)])
359
+ noise = noise.to(gpu)
360
+
361
+ if hasattr(cfg.Diffusion, "noise_strength"):
362
+ b, c, f, _, _= noise.shape
363
+ offset_noise = torch.randn(b, c, f, 1, 1, device=noise.device)
364
+ noise = noise + cfg.Diffusion.noise_strength * offset_noise
365
+
366
+ # add a noise prior
367
+ noise = diffusion.q_sample(random_ref_frame.clone(), getattr(cfg, "noise_prior_value", 949), noise=noise)
368
+
369
+ # construct model inputs (CFG)
370
+ full_model_kwargs=[{
371
+ 'y': None,
372
+ "local_image": None if len(image_local) == 0 else image_local[:],
373
+ 'image': None if len(y_visual) == 0 else y_visual0[:],
374
+ 'dwpose': None if len(dwpose_data) == 0 else dwpose_data[:],
375
+ 'randomref': None if len(random_ref_frame) == 0 else random_ref_frame[:],
376
+ },
377
+ {
378
+ 'y': None,
379
+ "local_image": None,
380
+ 'image': None,
381
+ 'randomref': None,
382
+ 'dwpose': None,
383
+ }]
384
+
385
+ # for visualization
386
+ full_model_kwargs_vis =[{
387
+ 'y': None,
388
+ "local_image": None if len(image_local) == 0 else image_local_clone[:],
389
+ 'image': None,
390
+ 'dwpose': None if len(dwpose_data_clone) == 0 else dwpose_data_clone[:],
391
+ 'randomref': None if len(random_ref_frame) == 0 else random_ref_frame_clone[:, :3],
392
+ },
393
+ {
394
+ 'y': None,
395
+ "local_image": None,
396
+ 'image': None,
397
+ 'randomref': None,
398
+ 'dwpose': None,
399
+ }]
400
+
401
+
402
+ partial_keys = [
403
+ ['image', 'randomref', "dwpose"],
404
+ ]
405
+ if hasattr(cfg, "partial_keys") and cfg.partial_keys:
406
+ partial_keys = cfg.partial_keys
407
+
408
+
409
+ for partial_keys_one in partial_keys:
410
+ model_kwargs_one = prepare_model_kwargs(partial_keys = partial_keys_one,
411
+ full_model_kwargs = full_model_kwargs,
412
+ use_fps_condition = cfg.use_fps_condition)
413
+ model_kwargs_one_vis = prepare_model_kwargs(partial_keys = partial_keys_one,
414
+ full_model_kwargs = full_model_kwargs_vis,
415
+ use_fps_condition = cfg.use_fps_condition)
416
+ noise_one = noise
417
+
418
+ if hasattr(cfg, "CPU_CLIP_VAE") and cfg.CPU_CLIP_VAE:
419
+ clip_encoder.cpu() # add this line
420
+ autoencoder.cpu() # add this line
421
+ torch.cuda.empty_cache() # add this line
422
+
423
+ video_data = diffusion.ddim_sample_loop(
424
+ noise=noise_one,
425
+ model=model.eval(),
426
+ model_kwargs=model_kwargs_one,
427
+ guide_scale=cfg.guide_scale,
428
+ ddim_timesteps=cfg.ddim_timesteps,
429
+ eta=0.0)
430
+
431
+ if hasattr(cfg, "CPU_CLIP_VAE") and cfg.CPU_CLIP_VAE:
432
+ # if the autoencoder or clip_encoder needs to run again later, move it back to the GPU
433
+ clip_encoder.cuda()
434
+ autoencoder.cuda()
435
+ video_data = 1. / cfg.scale_factor * video_data
436
+ video_data = rearrange(video_data, 'b c f h w -> (b f) c h w')
437
+ chunk_size = min(cfg.decoder_bs, video_data.shape[0])
438
+ video_data_list = torch.chunk(video_data, video_data.shape[0]//chunk_size, dim=0)
439
+ decode_data = []
440
+ for vd_data in video_data_list:
441
+ gen_frames = autoencoder.decode(vd_data)
442
+ decode_data.append(gen_frames)
443
+ video_data = torch.cat(decode_data, dim=0)
444
+ video_data = rearrange(video_data, '(b f) c h w -> b c f h w', b = cfg.batch_size).float()
445
+
446
+ text_size = cfg.resolution[-1]
447
+ cap_name = re.sub(r'[^\w\s]', '', ref_image_key.split("/")[-1].split('.')[0]) # .replace(' ', '_')
448
+ name = f'seed_{cur_seed}'
449
+ for ii in partial_keys_one:
450
+ name = name + "_" + ii
451
+ file_name = f'rank_{cfg.world_size:02d}_{cfg.rank:02d}_{idx:02d}_{name}_{cap_name}_{cfg.resolution[1]}x{cfg.resolution[0]}.mp4'
452
+ local_path = os.path.join(cfg.log_dir, f'{file_name}')
453
+ os.makedirs(os.path.dirname(local_path), exist_ok=True)
454
+ captions = "human"
455
+ del model_kwargs_one_vis[0][list(model_kwargs_one_vis[0].keys())[0]]
456
+ del model_kwargs_one_vis[1][list(model_kwargs_one_vis[1].keys())[0]]
457
+
458
+ save_video_multiple_conditions_not_gif_horizontal_3col(local_path, video_data.cpu(), model_kwargs_one_vis, misc_backups,
459
+ cfg.mean, cfg.std, nrow=1, save_fps=cfg.save_fps)
460
+
461
+ # try:
462
+ # save_t2vhigen_video_safe(local_path, video_data.cpu(), captions, cfg.mean, cfg.std, text_size)
463
+ # logging.info('Save video to dir %s:' % (local_path))
464
+ # except Exception as e:
465
+ # logging.info(f'Step: save text or video error with {e}')
466
+
467
+ logging.info('Congratulations! The inference is completed!')
468
+ # synchronize to finish some processes
469
+ if not cfg.debug:
470
+ torch.cuda.synchronize()
471
+ dist.barrier()
472
+
473
+ def prepare_model_kwargs(partial_keys, full_model_kwargs, use_fps_condition=False):
474
+
475
+ if use_fps_condition is True:
476
+ partial_keys.append('fps')
477
+
478
+ partial_model_kwargs = [{}, {}]
479
+ for partial_key in partial_keys:
480
+ partial_model_kwargs[0][partial_key] = full_model_kwargs[0][partial_key]
481
+ partial_model_kwargs[1][partial_key] = full_model_kwargs[1][partial_key]
482
+
483
+ return partial_model_kwargs
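
Note: prepare_model_kwargs above copies only the requested condition keys into the conditional / unconditional dict pair used for classifier-free guidance. The same filtering with dummy stand-ins (illustration only, not the real tensors):

full = [
    {'image': 'clip_feat', 'randomref': 'ref_latent', 'dwpose': 'pose_seq', 'local_image': None},
    {'image': None, 'randomref': None, 'dwpose': None, 'local_image': None},
]
partial = [{}, {}]
for key in ['image', 'randomref', 'dwpose']:
    partial[0][key] = full[0][key]   # conditional branch
    partial[1][key] = full[1][key]   # unconditional branch
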
UniAnimate/tools/inferences/inference_unianimate_long_entrance.py ADDED
@@ -0,0 +1,508 @@
1
+ '''
2
+ /*
3
+ *Copyright (c) 2021, Alibaba Group;
4
+ *Licensed under the Apache License, Version 2.0 (the "License");
5
+ *you may not use this file except in compliance with the License.
6
+ *You may obtain a copy of the License at
7
+
8
+ * http://www.apache.org/licenses/LICENSE-2.0
9
+
10
+ *Unless required by applicable law or agreed to in writing, software
11
+ *distributed under the License is distributed on an "AS IS" BASIS,
12
+ *WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13
+ *See the License for the specific language governing permissions and
14
+ *limitations under the License.
15
+ */
16
+ '''
17
+
18
+ import os
19
+ import re
20
+ import os.path as osp
21
+ import sys
22
+ sys.path.insert(0, '/'.join(osp.realpath(__file__).split('/')[:-4]))
23
+ import json
24
+ import math
25
+ import torch
26
+ import pynvml
27
+ import logging
28
+ import cv2
29
+ import numpy as np
30
+ from PIL import Image
31
+ from tqdm import tqdm
32
+ import torch.cuda.amp as amp
33
+ from importlib import reload
34
+ import torch.distributed as dist
35
+ import torch.multiprocessing as mp
36
+ import random
37
+ from einops import rearrange
38
+ import torchvision.transforms as T
39
+ import torchvision.transforms.functional as TF
40
+ from torch.nn.parallel import DistributedDataParallel
41
+
42
+ import utils.transforms as data
43
+ from ..modules.config import cfg
44
+ from utils.seed import setup_seed
45
+ from utils.multi_port import find_free_port
46
+ from utils.assign_cfg import assign_signle_cfg
47
+ from utils.distributed import generalized_all_gather, all_reduce
48
+ from utils.video_op import save_i2vgen_video, save_t2vhigen_video_safe, save_video_multiple_conditions_not_gif_horizontal_3col
49
+ from tools.modules.autoencoder import get_first_stage_encoding
50
+ from utils.registry_class import INFER_ENGINE, MODEL, EMBEDDER, AUTO_ENCODER, DIFFUSION
51
+ from copy import copy
52
+ import cv2
53
+
54
+
55
+ @INFER_ENGINE.register_function()
56
+ def inference_unianimate_long_entrance(cfg_update, **kwargs):
57
+ for k, v in cfg_update.items():
58
+ if isinstance(v, dict) and k in cfg:
59
+ cfg[k].update(v)
60
+ else:
61
+ cfg[k] = v
62
+
63
+ if not 'MASTER_ADDR' in os.environ:
64
+ os.environ['MASTER_ADDR']='localhost'
65
+ os.environ['MASTER_PORT']= find_free_port()
66
+ cfg.pmi_rank = int(os.getenv('RANK', 0))
67
+ cfg.pmi_world_size = int(os.getenv('WORLD_SIZE', 1))
68
+
69
+ if cfg.debug:
70
+ cfg.gpus_per_machine = 1
71
+ cfg.world_size = 1
72
+ else:
73
+ cfg.gpus_per_machine = torch.cuda.device_count()
74
+ cfg.world_size = cfg.pmi_world_size * cfg.gpus_per_machine
75
+
76
+ if cfg.world_size == 1:
77
+ worker(0, cfg, cfg_update)
78
+ else:
79
+ mp.spawn(worker, nprocs=cfg.gpus_per_machine, args=(cfg, cfg_update))
80
+ return cfg
81
+
82
+
83
+ def make_masked_images(imgs, masks):
84
+ masked_imgs = []
85
+ for i, mask in enumerate(masks):
86
+ # concatenation
87
+ masked_imgs.append(torch.cat([imgs[i] * (1 - mask), (1 - mask)], dim=1))
88
+ return torch.stack(masked_imgs, dim=0)
89
+
90
+ def load_video_frames(ref_image_path, pose_file_path, train_trans, vit_transforms, train_trans_pose, max_frames=32, frame_interval = 1, resolution=[512, 768], get_first_frame=True, vit_resolution=[224, 224]):
91
+
92
+ for _ in range(5):
93
+ try:
94
+ dwpose_all = {}
95
+ frames_all = {}
96
+ for ii_index in sorted(os.listdir(pose_file_path)):
97
+ if ii_index != "ref_pose.jpg":
98
+ dwpose_all[ii_index] = Image.open(pose_file_path+"/"+ii_index)
99
+ frames_all[ii_index] = Image.fromarray(cv2.cvtColor(cv2.imread(ref_image_path),cv2.COLOR_BGR2RGB))
100
+ # frames_all[ii_index] = Image.open(ref_image_path)
101
+
102
+ pose_ref = Image.open(os.path.join(pose_file_path, "ref_pose.jpg"))
103
+ first_eq_ref = False
104
+
105
+ # sample max_frames poses for video generation
106
+ stride = frame_interval
107
+ _total_frame_num = len(frames_all)
108
+ if max_frames == "None":
109
+ max_frames = (_total_frame_num-1)//frame_interval + 1
110
+ cover_frame_num = (stride * (max_frames-1)+1)
111
+ if _total_frame_num < cover_frame_num:
112
+ print(f'_total_frame_num ({_total_frame_num}) is smaller than cover_frame_num ({cover_frame_num}), the sampled frame interval is changed')
113
+ start_frame = 0 # we set start_frame = 0 because the pose alignment is performed on the first frame
114
+ end_frame = _total_frame_num
115
+ stride = max((_total_frame_num - 1) // (max_frames - 1), 1)
116
+ end_frame = stride*max_frames
117
+ else:
118
+ start_frame = 0 # we set start_frame = 0 because the pose alignment is performed on the first frame
119
+ end_frame = start_frame + cover_frame_num
120
+
121
+ frame_list = []
122
+ dwpose_list = []
123
+ random_ref_frame = frames_all[list(frames_all.keys())[0]]
124
+ if random_ref_frame.mode != 'RGB':
125
+ random_ref_frame = random_ref_frame.convert('RGB')
126
+ random_ref_dwpose = pose_ref
127
+ if random_ref_dwpose.mode != 'RGB':
128
+ random_ref_dwpose = random_ref_dwpose.convert('RGB')
129
+ for i_index in range(start_frame, end_frame, stride):
130
+ if i_index == start_frame and first_eq_ref:
131
+ i_key = list(frames_all.keys())[i_index]
132
+ i_frame = frames_all[i_key]
133
+
134
+ if i_frame.mode != 'RGB':
135
+ i_frame = i_frame.convert('RGB')
136
+ i_dwpose = frames_pose_ref
137
+ if i_dwpose.mode != 'RGB':
138
+ i_dwpose = i_dwpose.convert('RGB')
139
+ frame_list.append(i_frame)
140
+ dwpose_list.append(i_dwpose)
141
+ else:
142
+ # added
143
+ if first_eq_ref:
144
+ i_index = i_index - stride
145
+
146
+ i_key = list(frames_all.keys())[i_index]
147
+ i_frame = frames_all[i_key]
148
+ if i_frame.mode != 'RGB':
149
+ i_frame = i_frame.convert('RGB')
150
+ i_dwpose = dwpose_all[i_key]
151
+ if i_dwpose.mode != 'RGB':
152
+ i_dwpose = i_dwpose.convert('RGB')
153
+ frame_list.append(i_frame)
154
+ dwpose_list.append(i_dwpose)
155
+ have_frames = len(frame_list)>0
156
+ middle_indix = 0
157
+ if have_frames:
158
+ ref_frame = frame_list[middle_indix]
159
+ vit_frame = vit_transforms(ref_frame)
160
+ random_ref_frame_tmp = train_trans_pose(random_ref_frame)
161
+ random_ref_dwpose_tmp = train_trans_pose(random_ref_dwpose)
162
+ misc_data_tmp = torch.stack([train_trans_pose(ss) for ss in frame_list], dim=0)
163
+ video_data_tmp = torch.stack([train_trans(ss) for ss in frame_list], dim=0)
164
+ dwpose_data_tmp = torch.stack([train_trans_pose(ss) for ss in dwpose_list], dim=0)
165
+
166
+ video_data = torch.zeros(max_frames, 3, resolution[1], resolution[0])
167
+ dwpose_data = torch.zeros(max_frames, 3, resolution[1], resolution[0])
168
+ misc_data = torch.zeros(max_frames, 3, resolution[1], resolution[0])
169
+ random_ref_frame_data = torch.zeros(max_frames, 3, resolution[1], resolution[0]) # [32, 3, 512, 768]
170
+ random_ref_dwpose_data = torch.zeros(max_frames, 3, resolution[1], resolution[0])
171
+ if have_frames:
172
+ video_data[:len(frame_list), ...] = video_data_tmp
173
+ misc_data[:len(frame_list), ...] = misc_data_tmp
174
+ dwpose_data[:len(frame_list), ...] = dwpose_data_tmp
175
+ random_ref_frame_data[:,...] = random_ref_frame_tmp
176
+ random_ref_dwpose_data[:,...] = random_ref_dwpose_tmp
177
+
178
+ break
179
+
180
+ except Exception as e:
181
+ logging.info('{} read video frame failed with error: {}'.format(pose_file_path, e))
182
+ continue
183
+
184
+ return vit_frame, video_data, misc_data, dwpose_data, random_ref_frame_data, random_ref_dwpose_data, max_frames
185
+
186
+
187
+
188
+ def worker(gpu, cfg, cfg_update):
189
+ '''
190
+ Inference worker for each gpu
191
+ '''
192
+ for k, v in cfg_update.items():
193
+ if isinstance(v, dict) and k in cfg:
194
+ cfg[k].update(v)
195
+ else:
196
+ cfg[k] = v
197
+
198
+ cfg.gpu = gpu
199
+ cfg.seed = int(cfg.seed)
200
+ cfg.rank = cfg.pmi_rank * cfg.gpus_per_machine + gpu
201
+ setup_seed(cfg.seed + cfg.rank)
202
+
203
+ if not cfg.debug:
204
+ torch.cuda.set_device(gpu)
205
+ torch.backends.cudnn.benchmark = True
206
+ if hasattr(cfg, "CPU_CLIP_VAE") and cfg.CPU_CLIP_VAE:
207
+ torch.backends.cudnn.benchmark = False
208
+ dist.init_process_group(backend='nccl', world_size=cfg.world_size, rank=cfg.rank)
209
+
210
+ # [Log] Save logging and make log dir
211
+ log_dir = generalized_all_gather(cfg.log_dir)[0]
212
+ inf_name = osp.basename(cfg.cfg_file).split('.')[0]
213
+ test_model = osp.basename(cfg.test_model).split('.')[0].split('_')[-1]
214
+
215
+ cfg.log_dir = osp.join(cfg.log_dir, '%s' % (inf_name))
216
+ os.makedirs(cfg.log_dir, exist_ok=True)
217
+ log_file = osp.join(cfg.log_dir, 'log_%02d.txt' % (cfg.rank))
218
+ cfg.log_file = log_file
219
+ reload(logging)
220
+ logging.basicConfig(
221
+ level=logging.INFO,
222
+ format='[%(asctime)s] %(levelname)s: %(message)s',
223
+ handlers=[
224
+ logging.FileHandler(filename=log_file),
225
+ logging.StreamHandler(stream=sys.stdout)])
226
+ logging.info(cfg)
227
+ logging.info(f"Running UniAnimate inference on gpu {gpu}")
228
+
229
+ # [Diffusion]
230
+ diffusion = DIFFUSION.build(cfg.Diffusion)
231
+
232
+ # [Data] Data Transform
233
+ train_trans = data.Compose([
234
+ data.Resize(cfg.resolution),
235
+ data.ToTensor(),
236
+ data.Normalize(mean=cfg.mean, std=cfg.std)
237
+ ])
238
+
239
+ train_trans_pose = data.Compose([
240
+ data.Resize(cfg.resolution),
241
+ data.ToTensor(),
242
+ ]
243
+ )
244
+
245
+ vit_transforms = T.Compose([
246
+ data.Resize(cfg.vit_resolution),
247
+ T.ToTensor(),
248
+ T.Normalize(mean=cfg.vit_mean, std=cfg.vit_std)])
249
+
250
+ # [Model] embedder
251
+ clip_encoder = EMBEDDER.build(cfg.embedder)
252
+ clip_encoder.model.to(gpu)
253
+ with torch.no_grad():
254
+ _, _, zero_y = clip_encoder(text="")
255
+
256
+
257
+ # [Model] autoencoder
258
+ autoencoder = AUTO_ENCODER.build(cfg.auto_encoder)
259
+ autoencoder.eval() # freeze
260
+ for param in autoencoder.parameters():
261
+ param.requires_grad = False
262
+ autoencoder.cuda()
263
+
264
+ # [Model] UNet
265
+ if "config" in cfg.UNet:
266
+ cfg.UNet["config"] = cfg
267
+ cfg.UNet["zero_y"] = zero_y
268
+ model = MODEL.build(cfg.UNet)
269
+ state_dict = torch.load(cfg.test_model, map_location='cpu')
270
+ if 'state_dict' in state_dict:
271
+ state_dict = state_dict['state_dict']
272
+ if 'step' in state_dict:
273
+ resume_step = state_dict['step']
274
+ else:
275
+ resume_step = 0
276
+ status = model.load_state_dict(state_dict, strict=True)
277
+ logging.info('Load model from {} with status {}'.format(cfg.test_model, status))
278
+ model = model.to(gpu)
279
+ model.eval()
280
+ if hasattr(cfg, "CPU_CLIP_VAE") and cfg.CPU_CLIP_VAE:
281
+ model.to(torch.float16)
282
+ else:
283
+ model = DistributedDataParallel(model, device_ids=[gpu]) if not cfg.debug else model
284
+ torch.cuda.empty_cache()
285
+
286
+
287
+
288
+ test_list = cfg.test_list_path
289
+ num_videos = len(test_list)
290
+ logging.info(f'There are {num_videos} videos, each sampled {cfg.round} times')
291
+ test_list = [item for _ in range(cfg.round) for item in test_list]
292
+
293
+ for idx, file_path in enumerate(test_list):
294
+ cfg.frame_interval, ref_image_key, pose_seq_key = file_path[0], file_path[1], file_path[2]
295
+
296
+ manual_seed = int(cfg.seed + cfg.rank + idx//num_videos)
297
+ setup_seed(manual_seed)
298
+ logging.info(f"[{idx}]/[{len(test_list)}] Begin to sample {ref_image_key}, pose sequence from {pose_seq_key}, init seed {manual_seed} ...")
299
+
300
+
301
+ vit_frame, video_data, misc_data, dwpose_data, random_ref_frame_data, random_ref_dwpose_data, max_frames = load_video_frames(ref_image_key, pose_seq_key, train_trans, vit_transforms, train_trans_pose, max_frames=cfg.max_frames, frame_interval =cfg.frame_interval, resolution=cfg.resolution)
302
+ cfg.max_frames_new = max_frames
303
+ misc_data = misc_data.unsqueeze(0).to(gpu)
304
+ vit_frame = vit_frame.unsqueeze(0).to(gpu)
305
+ dwpose_data = dwpose_data.unsqueeze(0).to(gpu)
306
+ random_ref_frame_data = random_ref_frame_data.unsqueeze(0).to(gpu)
307
+ random_ref_dwpose_data = random_ref_dwpose_data.unsqueeze(0).to(gpu)
308
+
309
+ ### save for visualization
310
+ misc_backups = copy(misc_data)
311
+ frames_num = misc_data.shape[1]
312
+ misc_backups = rearrange(misc_backups, 'b f c h w -> b c f h w')
313
+ mv_data_video = []
314
+
315
+
316
+ ### local image (first frame)
317
+ image_local = []
318
+ if 'local_image' in cfg.video_compositions:
319
+ frames_num = misc_data.shape[1]
320
+ bs_vd_local = misc_data.shape[0]
321
+ image_local = misc_data[:,:1].clone().repeat(1,frames_num,1,1,1)
322
+ image_local_clone = rearrange(image_local, 'b f c h w -> b c f h w', b = bs_vd_local)
323
+ image_local = rearrange(image_local, 'b f c h w -> b c f h w', b = bs_vd_local)
324
+ if hasattr(cfg, "latent_local_image") and cfg.latent_local_image:
325
+ with torch.no_grad():
326
+ temporal_length = frames_num
327
+ encoder_posterior = autoencoder.encode(video_data[:,0])
328
+ local_image_data = get_first_stage_encoding(encoder_posterior).detach()
329
+ image_local = local_image_data.unsqueeze(1).repeat(1,temporal_length,1,1,1) # [10, 16, 4, 64, 40]
330
+
331
+
332
+
333
+ ### encode the video_data
334
+ bs_vd = misc_data.shape[0]
335
+ misc_data = rearrange(misc_data, 'b f c h w -> (b f) c h w')
336
+ misc_data_list = torch.chunk(misc_data, misc_data.shape[0]//cfg.chunk_size,dim=0)
337
+
338
+
339
+ with torch.no_grad():
340
+
341
+ random_ref_frame = []
342
+ if 'randomref' in cfg.video_compositions:
343
+ random_ref_frame_clone = rearrange(random_ref_frame_data, 'b f c h w -> b c f h w')
344
+ if hasattr(cfg, "latent_random_ref") and cfg.latent_random_ref:
345
+
346
+ temporal_length = random_ref_frame_data.shape[1]
347
+ encoder_posterior = autoencoder.encode(random_ref_frame_data[:,0].sub(0.5).div_(0.5))
348
+ random_ref_frame_data = get_first_stage_encoding(encoder_posterior).detach()
349
+ random_ref_frame_data = random_ref_frame_data.unsqueeze(1).repeat(1,temporal_length,1,1,1) # [10, 16, 4, 64, 40]
350
+
351
+ random_ref_frame = rearrange(random_ref_frame_data, 'b f c h w -> b c f h w')
352
+
353
+
354
+ if 'dwpose' in cfg.video_compositions:
355
+ bs_vd_local = dwpose_data.shape[0]
356
+ dwpose_data_clone = rearrange(dwpose_data.clone(), 'b f c h w -> b c f h w', b = bs_vd_local)
357
+ if 'randomref_pose' in cfg.video_compositions:
358
+ dwpose_data = torch.cat([random_ref_dwpose_data[:,:1], dwpose_data], dim=1)
359
+ dwpose_data = rearrange(dwpose_data, 'b f c h w -> b c f h w', b = bs_vd_local)
360
+
361
+
362
+ y_visual = []
363
+ if 'image' in cfg.video_compositions:
364
+ with torch.no_grad():
365
+ vit_frame = vit_frame.squeeze(1)
366
+ y_visual = clip_encoder.encode_image(vit_frame).unsqueeze(1) # [60, 1024]
367
+ y_visual0 = y_visual.clone()
368
+
369
+
370
+ with amp.autocast(enabled=True):
371
+ pynvml.nvmlInit()
372
+ handle=pynvml.nvmlDeviceGetHandleByIndex(0)
373
+ meminfo=pynvml.nvmlDeviceGetMemoryInfo(handle)
374
+ cur_seed = torch.initial_seed()
375
+ logging.info(f"Current seed {cur_seed} ..., cfg.max_frames_new: {cfg.max_frames_new} ....")
376
+
377
+ noise = torch.randn([1, 4, cfg.max_frames_new, int(cfg.resolution[1]/cfg.scale), int(cfg.resolution[0]/cfg.scale)])
378
+ noise = noise.to(gpu)
379
+
380
+ # add a noise prior
381
+ noise = diffusion.q_sample(random_ref_frame.clone(), getattr(cfg, "noise_prior_value", 939), noise=noise)
382
+
383
+ if hasattr(cfg.Diffusion, "noise_strength"):
384
+ b, c, f, _, _= noise.shape
385
+ offset_noise = torch.randn(b, c, f, 1, 1, device=noise.device)
386
+ noise = noise + cfg.Diffusion.noise_strength * offset_noise
387
+
388
+ # construct model inputs (CFG)
389
+ full_model_kwargs=[{
390
+ 'y': None,
391
+ "local_image": None if len(image_local) == 0 else image_local[:],
392
+ 'image': None if len(y_visual) == 0 else y_visual0[:],
393
+ 'dwpose': None if len(dwpose_data) == 0 else dwpose_data[:],
394
+ 'randomref': None if len(random_ref_frame) == 0 else random_ref_frame[:],
395
+ },
396
+ {
397
+ 'y': None,
398
+ "local_image": None,
399
+ 'image': None,
400
+ 'randomref': None,
401
+ 'dwpose': None,
402
+ }]
403
+
404
+ # for visualization
405
+ full_model_kwargs_vis =[{
406
+ 'y': None,
407
+ "local_image": None if len(image_local) == 0 else image_local_clone[:],
408
+ 'image': None,
409
+ 'dwpose': None if len(dwpose_data_clone) == 0 else dwpose_data_clone[:],
410
+ 'randomref': None if len(random_ref_frame) == 0 else random_ref_frame_clone[:, :3],
411
+ },
412
+ {
413
+ 'y': None,
414
+ "local_image": None,
415
+ 'image': None,
416
+ 'randomref': None,
417
+ 'dwpose': None,
418
+ }]
419
+
420
+
421
+ partial_keys = [
422
+ ['image', 'randomref', "dwpose"],
423
+ ]
424
+ if hasattr(cfg, "partial_keys") and cfg.partial_keys:
425
+ partial_keys = cfg.partial_keys
426
+
427
+ for partial_keys_one in partial_keys:
428
+ model_kwargs_one = prepare_model_kwargs(partial_keys = partial_keys_one,
429
+ full_model_kwargs = full_model_kwargs,
430
+ use_fps_condition = cfg.use_fps_condition)
431
+ model_kwargs_one_vis = prepare_model_kwargs(partial_keys = partial_keys_one,
432
+ full_model_kwargs = full_model_kwargs_vis,
433
+ use_fps_condition = cfg.use_fps_condition)
434
+ noise_one = noise
435
+
436
+ if hasattr(cfg, "CPU_CLIP_VAE") and cfg.CPU_CLIP_VAE:
437
+ clip_encoder.cpu() # offload the CLIP encoder to CPU to free GPU memory for sampling
438
+ autoencoder.cpu() # offload the VAE to CPU as well
439
+ torch.cuda.empty_cache() # release the cached GPU memory
440
+
441
+ video_data = diffusion.ddim_sample_loop(
442
+ noise=noise_one,
443
+ context_size=cfg.context_size,
444
+ context_stride=cfg.context_stride,
445
+ context_overlap=cfg.context_overlap,
446
+ model=model.eval(),
447
+ model_kwargs=model_kwargs_one,
448
+ guide_scale=cfg.guide_scale,
449
+ ddim_timesteps=cfg.ddim_timesteps,
450
+ eta=0.0,
451
+ context_batch_size=getattr(cfg, "context_batch_size", 1)
452
+ )
453
+
454
+ if hasattr(cfg, "CPU_CLIP_VAE") and cfg.CPU_CLIP_VAE:
455
+ # move the autoencoder and CLIP encoder back to the GPU before they are used again
456
+ clip_encoder.cuda()
457
+ autoencoder.cuda()
458
+
459
+
460
+ video_data = 1. / cfg.scale_factor * video_data # [1, 4, f, h, w]
461
+ video_data = rearrange(video_data, 'b c f h w -> (b f) c h w')
462
+ chunk_size = min(cfg.decoder_bs, video_data.shape[0])
463
+ video_data_list = torch.chunk(video_data, video_data.shape[0]//chunk_size, dim=0)
464
+ decode_data = []
465
+ for vd_data in video_data_list:
466
+ gen_frames = autoencoder.decode(vd_data)
467
+ decode_data.append(gen_frames)
468
+ video_data = torch.cat(decode_data, dim=0)
469
+ video_data = rearrange(video_data, '(b f) c h w -> b c f h w', b = cfg.batch_size).float()
470
+
471
+ text_size = cfg.resolution[-1]
472
+ cap_name = re.sub(r'[^\w\s]', '', ref_image_key.split("/")[-1].split('.')[0]) # .replace(' ', '_')
473
+ name = f'seed_{cur_seed}'
474
+ for ii in partial_keys_one:
475
+ name = name + "_" + ii
476
+ file_name = f'rank_{cfg.world_size:02d}_{cfg.rank:02d}_{idx:02d}_{name}_{cap_name}_{cfg.resolution[1]}x{cfg.resolution[0]}.mp4'
477
+ local_path = os.path.join(cfg.log_dir, f'{file_name}')
478
+ os.makedirs(os.path.dirname(local_path), exist_ok=True)
479
+ captions = "human"
480
+ del model_kwargs_one_vis[0][list(model_kwargs_one_vis[0].keys())[0]]
481
+ del model_kwargs_one_vis[1][list(model_kwargs_one_vis[1].keys())[0]]
482
+
483
+ save_video_multiple_conditions_not_gif_horizontal_3col(local_path, video_data.cpu(), model_kwargs_one_vis, misc_backups,
484
+ cfg.mean, cfg.std, nrow=1, save_fps=cfg.save_fps)
485
+
486
+ # try:
487
+ # save_t2vhigen_video_safe(local_path, video_data.cpu(), captions, cfg.mean, cfg.std, text_size)
488
+ # logging.info('Save video to dir %s:' % (local_path))
489
+ # except Exception as e:
490
+ # logging.info(f'Step: save text or video error with {e}')
491
+
492
+ logging.info('Congratulations! The inference is completed!')
493
+ # synchronize to finish some processes
494
+ if not cfg.debug:
495
+ torch.cuda.synchronize()
496
+ dist.barrier()
497
+
498
+ def prepare_model_kwargs(partial_keys, full_model_kwargs, use_fps_condition=False):
499
+
500
+ if use_fps_condition is True:
501
+ partial_keys.append('fps')
502
+
503
+ partial_model_kwargs = [{}, {}]
504
+ for partial_key in partial_keys:
505
+ partial_model_kwargs[0][partial_key] = full_model_kwargs[0][partial_key]
506
+ partial_model_kwargs[1][partial_key] = full_model_kwargs[1][partial_key]
507
+
508
+ return partial_model_kwargs
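
Note: `prepare_model_kwargs` simply filters the conditional/unconditional kwargs pair down to the active condition keys for classifier-free guidance. A minimal sketch of its behavior, assuming the function above is in scope and using placeholder tensors (shapes are illustrative only, not the real encoder outputs):

    import torch

    full_kwargs = [
        {'y': None, 'local_image': None,
         'image': torch.zeros(1, 1, 1024),                 # placeholder CLIP image feature
         'dwpose': torch.zeros(1, 3, 17, 256, 160),        # placeholder pose frames
         'randomref': torch.zeros(1, 4, 16, 64, 40)},      # placeholder reference latent
        {'y': None, 'local_image': None, 'image': None, 'dwpose': None, 'randomref': None},
    ]
    cond, uncond = prepare_model_kwargs(['image', 'randomref', 'dwpose'], full_kwargs,
                                        use_fps_condition=False)
    assert set(cond) == {'image', 'randomref', 'dwpose'}   # only the requested conditions survive
    assert uncond['image'] is None                         # unconditional branch keeps the same keys, emptied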
UniAnimate/tools/modules/__init__.py ADDED
@@ -0,0 +1,7 @@
1
+ from .clip_embedder import FrozenOpenCLIPEmbedder
2
+ from .autoencoder import DiagonalGaussianDistribution, AutoencoderKL
3
+ from .clip_embedder import *
4
+ from .autoencoder import *
5
+ from .unet import *
6
+ from .diffusions import *
7
+ from .embedding_manager import *
UniAnimate/tools/modules/autoencoder.py ADDED
@@ -0,0 +1,690 @@
1
+ import torch
2
+ import logging
3
+ import collections
4
+ import numpy as np
5
+ import torch.nn as nn
6
+ import torch.nn.functional as F
7
+
8
+ from utils.registry_class import AUTO_ENCODER,DISTRIBUTION
9
+
10
+
11
+ def nonlinearity(x):
12
+ # swish
13
+ return x*torch.sigmoid(x)
14
+
15
+ def Normalize(in_channels, num_groups=32):
16
+ return torch.nn.GroupNorm(num_groups=num_groups, num_channels=in_channels, eps=1e-6, affine=True)
17
+
18
+
19
+ @torch.no_grad()
20
+ def get_first_stage_encoding(encoder_posterior, scale_factor=0.18215):
21
+ if isinstance(encoder_posterior, DiagonalGaussianDistribution):
22
+ z = encoder_posterior.sample()
23
+ elif isinstance(encoder_posterior, torch.Tensor):
24
+ z = encoder_posterior
25
+ else:
26
+ raise NotImplementedError(f"encoder_posterior of type '{type(encoder_posterior)}' not yet implemented")
27
+ return scale_factor * z
28
+
29
+
30
+ @AUTO_ENCODER.register_class()
31
+ class AutoencoderKL(nn.Module):
32
+ def __init__(self,
33
+ ddconfig,
34
+ embed_dim,
35
+ pretrained=None,
36
+ ignore_keys=[],
37
+ image_key="image",
38
+ colorize_nlabels=None,
39
+ monitor=None,
40
+ ema_decay=None,
41
+ learn_logvar=False,
42
+ use_vid_decoder=False,
43
+ **kwargs):
44
+ super().__init__()
45
+ self.learn_logvar = learn_logvar
46
+ self.image_key = image_key
47
+ self.encoder = Encoder(**ddconfig)
48
+ self.decoder = Decoder(**ddconfig)
49
+ assert ddconfig["double_z"]
50
+ self.quant_conv = torch.nn.Conv2d(2*ddconfig["z_channels"], 2*embed_dim, 1)
51
+ self.post_quant_conv = torch.nn.Conv2d(embed_dim, ddconfig["z_channels"], 1)
52
+ self.embed_dim = embed_dim
53
+ if colorize_nlabels is not None:
54
+ assert type(colorize_nlabels)==int
55
+ self.register_buffer("colorize", torch.randn(3, colorize_nlabels, 1, 1))
56
+ if monitor is not None:
57
+ self.monitor = monitor
58
+
59
+ self.use_ema = ema_decay is not None
60
+
61
+ if pretrained is not None:
62
+ self.init_from_ckpt(pretrained, ignore_keys=ignore_keys)
63
+
64
+ def init_from_ckpt(self, path, ignore_keys=list()):
65
+ sd = torch.load(path, map_location="cpu")["state_dict"]
66
+ keys = list(sd.keys())
67
+ sd_new = collections.OrderedDict()
68
+ for k in keys:
69
+ if k.find('first_stage_model') >= 0:
70
+ k_new = k.split('first_stage_model.')[-1]
71
+ sd_new[k_new] = sd[k]
72
+ self.load_state_dict(sd_new, strict=True)
73
+ logging.info(f"Restored from {path}")
74
+
75
+ def on_train_batch_end(self, *args, **kwargs):
76
+ if self.use_ema:
77
+ self.model_ema(self)
78
+
79
+ def encode(self, x):
80
+ h = self.encoder(x)
81
+ moments = self.quant_conv(h)
82
+ posterior = DiagonalGaussianDistribution(moments)
83
+ return posterior
84
+
85
+ def encode_firsr_stage(self, x, scale_factor=1.0):
86
+ h = self.encoder(x)
87
+ moments = self.quant_conv(h)
88
+ posterior = DiagonalGaussianDistribution(moments)
89
+ z = get_first_stage_encoding(posterior, scale_factor)
90
+ return z
91
+
92
+ def encode_ms(self, x):
93
+ hs = self.encoder(x, True)
94
+ h = hs[-1]
95
+ moments = self.quant_conv(h)
96
+ posterior = DiagonalGaussianDistribution(moments)
97
+ hs[-1] = h
98
+ return hs
99
+
100
+ def decode(self, z, **kwargs):
101
+ z = self.post_quant_conv(z)
102
+ dec = self.decoder(z, **kwargs)
103
+ return dec
104
+
105
+
106
+ def forward(self, input, sample_posterior=True):
107
+ posterior = self.encode(input)
108
+ if sample_posterior:
109
+ z = posterior.sample()
110
+ else:
111
+ z = posterior.mode()
112
+ dec = self.decode(z)
113
+ return dec, posterior
114
+
115
+ def get_input(self, batch, k):
116
+ x = batch[k]
117
+ if len(x.shape) == 3:
118
+ x = x[..., None]
119
+ x = x.permute(0, 3, 1, 2).to(memory_format=torch.contiguous_format).float()
120
+ return x
121
+
122
+ def get_last_layer(self):
123
+ return self.decoder.conv_out.weight
124
+
125
+ @torch.no_grad()
126
+ def log_images(self, batch, only_inputs=False, log_ema=False, **kwargs):
127
+ log = dict()
128
+ x = self.get_input(batch, self.image_key)
129
+ x = x.to(self.device)
130
+ if not only_inputs:
131
+ xrec, posterior = self(x)
132
+ if x.shape[1] > 3:
133
+ # colorize with random projection
134
+ assert xrec.shape[1] > 3
135
+ x = self.to_rgb(x)
136
+ xrec = self.to_rgb(xrec)
137
+ log["samples"] = self.decode(torch.randn_like(posterior.sample()))
138
+ log["reconstructions"] = xrec
139
+ if log_ema or self.use_ema:
140
+ with self.ema_scope():
141
+ xrec_ema, posterior_ema = self(x)
142
+ if x.shape[1] > 3:
143
+ # colorize with random projection
144
+ assert xrec_ema.shape[1] > 3
145
+ xrec_ema = self.to_rgb(xrec_ema)
146
+ log["samples_ema"] = self.decode(torch.randn_like(posterior_ema.sample()))
147
+ log["reconstructions_ema"] = xrec_ema
148
+ log["inputs"] = x
149
+ return log
150
+
151
+ def to_rgb(self, x):
152
+ assert self.image_key == "segmentation"
153
+ if not hasattr(self, "colorize"):
154
+ self.register_buffer("colorize", torch.randn(3, x.shape[1], 1, 1).to(x))
155
+ x = F.conv2d(x, weight=self.colorize)
156
+ x = 2.*(x-x.min())/(x.max()-x.min()) - 1.
157
+ return x
158
+
159
+
160
+ @AUTO_ENCODER.register_class()
161
+ class AutoencoderVideo(AutoencoderKL):
162
+ def __init__(self,
163
+ ddconfig,
164
+ embed_dim,
165
+ pretrained=None,
166
+ ignore_keys=[],
167
+ image_key="image",
168
+ colorize_nlabels=None,
169
+ monitor=None,
170
+ ema_decay=None,
171
+ use_vid_decoder=True,
172
+ learn_logvar=False,
173
+ **kwargs):
174
+ use_vid_decoder = True
175
+ super().__init__(ddconfig, embed_dim, pretrained, ignore_keys, image_key, colorize_nlabels, monitor, ema_decay, learn_logvar, use_vid_decoder, **kwargs)
176
+
177
+ def decode(self, z, **kwargs):
178
+ # z = self.post_quant_conv(z)
179
+ dec = self.decoder(z, **kwargs)
180
+ return dec
181
+
182
+ def encode(self, x):
183
+ h = self.encoder(x)
184
+ # moments = self.quant_conv(h)
185
+ moments = h
186
+ posterior = DiagonalGaussianDistribution(moments)
187
+ return posterior
188
+
189
+
190
+ class IdentityFirstStage(torch.nn.Module):
191
+ def __init__(self, *args, vq_interface=False, **kwargs):
192
+ self.vq_interface = vq_interface
193
+ super().__init__()
194
+
195
+ def encode(self, x, *args, **kwargs):
196
+ return x
197
+
198
+ def decode(self, x, *args, **kwargs):
199
+ return x
200
+
201
+ def quantize(self, x, *args, **kwargs):
202
+ if self.vq_interface:
203
+ return x, None, [None, None, None]
204
+ return x
205
+
206
+ def forward(self, x, *args, **kwargs):
207
+ return x
208
+
209
+
210
+
211
+ @DISTRIBUTION.register_class()
212
+ class DiagonalGaussianDistribution(object):
213
+ def __init__(self, parameters, deterministic=False):
214
+ self.parameters = parameters
215
+ self.mean, self.logvar = torch.chunk(parameters, 2, dim=1)
216
+ self.logvar = torch.clamp(self.logvar, -30.0, 20.0)
217
+ self.deterministic = deterministic
218
+ self.std = torch.exp(0.5 * self.logvar)
219
+ self.var = torch.exp(self.logvar)
220
+ if self.deterministic:
221
+ self.var = self.std = torch.zeros_like(self.mean).to(device=self.parameters.device)
222
+
223
+ def sample(self):
224
+ x = self.mean + self.std * torch.randn(self.mean.shape).to(device=self.parameters.device)
225
+ return x
226
+
227
+ def kl(self, other=None):
228
+ if self.deterministic:
229
+ return torch.Tensor([0.])
230
+ else:
231
+ if other is None:
232
+ return 0.5 * torch.sum(torch.pow(self.mean, 2)
233
+ + self.var - 1.0 - self.logvar,
234
+ dim=[1, 2, 3])
235
+ else:
236
+ return 0.5 * torch.sum(
237
+ torch.pow(self.mean - other.mean, 2) / other.var
238
+ + self.var / other.var - 1.0 - self.logvar + other.logvar,
239
+ dim=[1, 2, 3])
240
+
241
+ def nll(self, sample, dims=[1,2,3]):
242
+ if self.deterministic:
243
+ return torch.Tensor([0.])
244
+ logtwopi = np.log(2.0 * np.pi)
245
+ return 0.5 * torch.sum(
246
+ logtwopi + self.logvar + torch.pow(sample - self.mean, 2) / self.var,
247
+ dim=dims)
248
+
249
+ def mode(self):
250
+ return self.mean
251
+
252
+
253
+ # -------------------------------modules--------------------------------
254
+
255
+ class Downsample(nn.Module):
256
+ def __init__(self, in_channels, with_conv):
257
+ super().__init__()
258
+ self.with_conv = with_conv
259
+ if self.with_conv:
260
+ # no asymmetric padding in torch conv, must do it ourselves
261
+ self.conv = torch.nn.Conv2d(in_channels,
262
+ in_channels,
263
+ kernel_size=3,
264
+ stride=2,
265
+ padding=0)
266
+
267
+ def forward(self, x):
268
+ if self.with_conv:
269
+ pad = (0,1,0,1)
270
+ x = torch.nn.functional.pad(x, pad, mode="constant", value=0)
271
+ x = self.conv(x)
272
+ else:
273
+ x = torch.nn.functional.avg_pool2d(x, kernel_size=2, stride=2)
274
+ return x
275
+
276
+ class ResnetBlock(nn.Module):
277
+ def __init__(self, *, in_channels, out_channels=None, conv_shortcut=False,
278
+ dropout, temb_channels=512):
279
+ super().__init__()
280
+ self.in_channels = in_channels
281
+ out_channels = in_channels if out_channels is None else out_channels
282
+ self.out_channels = out_channels
283
+ self.use_conv_shortcut = conv_shortcut
284
+
285
+ self.norm1 = Normalize(in_channels)
286
+ self.conv1 = torch.nn.Conv2d(in_channels,
287
+ out_channels,
288
+ kernel_size=3,
289
+ stride=1,
290
+ padding=1)
291
+ if temb_channels > 0:
292
+ self.temb_proj = torch.nn.Linear(temb_channels,
293
+ out_channels)
294
+ self.norm2 = Normalize(out_channels)
295
+ self.dropout = torch.nn.Dropout(dropout)
296
+ self.conv2 = torch.nn.Conv2d(out_channels,
297
+ out_channels,
298
+ kernel_size=3,
299
+ stride=1,
300
+ padding=1)
301
+ if self.in_channels != self.out_channels:
302
+ if self.use_conv_shortcut:
303
+ self.conv_shortcut = torch.nn.Conv2d(in_channels,
304
+ out_channels,
305
+ kernel_size=3,
306
+ stride=1,
307
+ padding=1)
308
+ else:
309
+ self.nin_shortcut = torch.nn.Conv2d(in_channels,
310
+ out_channels,
311
+ kernel_size=1,
312
+ stride=1,
313
+ padding=0)
314
+
315
+ def forward(self, x, temb):
316
+ h = x
317
+ h = self.norm1(h)
318
+ h = nonlinearity(h)
319
+ h = self.conv1(h)
320
+
321
+ if temb is not None:
322
+ h = h + self.temb_proj(nonlinearity(temb))[:,:,None,None]
323
+
324
+ h = self.norm2(h)
325
+ h = nonlinearity(h)
326
+ h = self.dropout(h)
327
+ h = self.conv2(h)
328
+
329
+ if self.in_channels != self.out_channels:
330
+ if self.use_conv_shortcut:
331
+ x = self.conv_shortcut(x)
332
+ else:
333
+ x = self.nin_shortcut(x)
334
+
335
+ return x+h
336
+
337
+
338
+ class AttnBlock(nn.Module):
339
+ def __init__(self, in_channels):
340
+ super().__init__()
341
+ self.in_channels = in_channels
342
+
343
+ self.norm = Normalize(in_channels)
344
+ self.q = torch.nn.Conv2d(in_channels,
345
+ in_channels,
346
+ kernel_size=1,
347
+ stride=1,
348
+ padding=0)
349
+ self.k = torch.nn.Conv2d(in_channels,
350
+ in_channels,
351
+ kernel_size=1,
352
+ stride=1,
353
+ padding=0)
354
+ self.v = torch.nn.Conv2d(in_channels,
355
+ in_channels,
356
+ kernel_size=1,
357
+ stride=1,
358
+ padding=0)
359
+ self.proj_out = torch.nn.Conv2d(in_channels,
360
+ in_channels,
361
+ kernel_size=1,
362
+ stride=1,
363
+ padding=0)
364
+
365
+ def forward(self, x):
366
+ h_ = x
367
+ h_ = self.norm(h_)
368
+ q = self.q(h_)
369
+ k = self.k(h_)
370
+ v = self.v(h_)
371
+
372
+ # compute attention
373
+ b,c,h,w = q.shape
374
+ q = q.reshape(b,c,h*w)
375
+ q = q.permute(0,2,1) # b,hw,c
376
+ k = k.reshape(b,c,h*w) # b,c,hw
377
+ w_ = torch.bmm(q,k) # b,hw,hw w[b,i,j]=sum_c q[b,i,c]k[b,c,j]
378
+ w_ = w_ * (int(c)**(-0.5))
379
+ w_ = torch.nn.functional.softmax(w_, dim=2)
380
+
381
+ # attend to values
382
+ v = v.reshape(b,c,h*w)
383
+ w_ = w_.permute(0,2,1) # b,hw,hw (first hw of k, second of q)
384
+ h_ = torch.bmm(v,w_) # b, c,hw (hw of q) h_[b,c,j] = sum_i v[b,c,i] w_[b,i,j]
385
+ h_ = h_.reshape(b,c,h,w)
386
+
387
+ h_ = self.proj_out(h_)
388
+
389
+ return x+h_
390
+
391
+ class AttnBlock(nn.Module):
392
+ def __init__(self, in_channels):
393
+ super().__init__()
394
+ self.in_channels = in_channels
395
+
396
+ self.norm = Normalize(in_channels)
397
+ self.q = torch.nn.Conv2d(in_channels,
398
+ in_channels,
399
+ kernel_size=1,
400
+ stride=1,
401
+ padding=0)
402
+ self.k = torch.nn.Conv2d(in_channels,
403
+ in_channels,
404
+ kernel_size=1,
405
+ stride=1,
406
+ padding=0)
407
+ self.v = torch.nn.Conv2d(in_channels,
408
+ in_channels,
409
+ kernel_size=1,
410
+ stride=1,
411
+ padding=0)
412
+ self.proj_out = torch.nn.Conv2d(in_channels,
413
+ in_channels,
414
+ kernel_size=1,
415
+ stride=1,
416
+ padding=0)
417
+
418
+ def forward(self, x):
419
+ h_ = x
420
+ h_ = self.norm(h_)
421
+ q = self.q(h_)
422
+ k = self.k(h_)
423
+ v = self.v(h_)
424
+
425
+ # compute attention
426
+ b,c,h,w = q.shape
427
+ q = q.reshape(b,c,h*w)
428
+ q = q.permute(0,2,1) # b,hw,c
429
+ k = k.reshape(b,c,h*w) # b,c,hw
430
+ w_ = torch.bmm(q,k) # b,hw,hw w[b,i,j]=sum_c q[b,i,c]k[b,c,j]
431
+ w_ = w_ * (int(c)**(-0.5))
432
+ w_ = torch.nn.functional.softmax(w_, dim=2)
433
+
434
+ # attend to values
435
+ v = v.reshape(b,c,h*w)
436
+ w_ = w_.permute(0,2,1) # b,hw,hw (first hw of k, second of q)
437
+ h_ = torch.bmm(v,w_) # b, c,hw (hw of q) h_[b,c,j] = sum_i v[b,c,i] w_[b,i,j]
438
+ h_ = h_.reshape(b,c,h,w)
439
+
440
+ h_ = self.proj_out(h_)
441
+
442
+ return x+h_
443
+
444
+ class Upsample(nn.Module):
445
+ def __init__(self, in_channels, with_conv):
446
+ super().__init__()
447
+ self.with_conv = with_conv
448
+ if self.with_conv:
449
+ self.conv = torch.nn.Conv2d(in_channels,
450
+ in_channels,
451
+ kernel_size=3,
452
+ stride=1,
453
+ padding=1)
454
+
455
+ def forward(self, x):
456
+ x = torch.nn.functional.interpolate(x, scale_factor=2.0, mode="nearest")
457
+ if self.with_conv:
458
+ x = self.conv(x)
459
+ return x
460
+
461
+
462
+ class Downsample(nn.Module):
463
+ def __init__(self, in_channels, with_conv):
464
+ super().__init__()
465
+ self.with_conv = with_conv
466
+ if self.with_conv:
467
+ # no asymmetric padding in torch conv, must do it ourselves
468
+ self.conv = torch.nn.Conv2d(in_channels,
469
+ in_channels,
470
+ kernel_size=3,
471
+ stride=2,
472
+ padding=0)
473
+
474
+ def forward(self, x):
475
+ if self.with_conv:
476
+ pad = (0,1,0,1)
477
+ x = torch.nn.functional.pad(x, pad, mode="constant", value=0)
478
+ x = self.conv(x)
479
+ else:
480
+ x = torch.nn.functional.avg_pool2d(x, kernel_size=2, stride=2)
481
+ return x
482
+
483
+ class Encoder(nn.Module):
484
+ def __init__(self, *, ch, out_ch, ch_mult=(1,2,4,8), num_res_blocks,
485
+ attn_resolutions, dropout=0.0, resamp_with_conv=True, in_channels,
486
+ resolution, z_channels, double_z=True, use_linear_attn=False, attn_type="vanilla",
487
+ **ignore_kwargs):
488
+ super().__init__()
489
+ if use_linear_attn: attn_type = "linear"
490
+ self.ch = ch
491
+ self.temb_ch = 0
492
+ self.num_resolutions = len(ch_mult)
493
+ self.num_res_blocks = num_res_blocks
494
+ self.resolution = resolution
495
+ self.in_channels = in_channels
496
+
497
+ # downsampling
498
+ self.conv_in = torch.nn.Conv2d(in_channels,
499
+ self.ch,
500
+ kernel_size=3,
501
+ stride=1,
502
+ padding=1)
503
+
504
+ curr_res = resolution
505
+ in_ch_mult = (1,)+tuple(ch_mult)
506
+ self.in_ch_mult = in_ch_mult
507
+ self.down = nn.ModuleList()
508
+ for i_level in range(self.num_resolutions):
509
+ block = nn.ModuleList()
510
+ attn = nn.ModuleList()
511
+ block_in = ch*in_ch_mult[i_level]
512
+ block_out = ch*ch_mult[i_level]
513
+ for i_block in range(self.num_res_blocks):
514
+ block.append(ResnetBlock(in_channels=block_in,
515
+ out_channels=block_out,
516
+ temb_channels=self.temb_ch,
517
+ dropout=dropout))
518
+ block_in = block_out
519
+ if curr_res in attn_resolutions:
520
+ attn.append(AttnBlock(block_in))
521
+ down = nn.Module()
522
+ down.block = block
523
+ down.attn = attn
524
+ if i_level != self.num_resolutions-1:
525
+ down.downsample = Downsample(block_in, resamp_with_conv)
526
+ curr_res = curr_res // 2
527
+ self.down.append(down)
528
+
529
+ # middle
530
+ self.mid = nn.Module()
531
+ self.mid.block_1 = ResnetBlock(in_channels=block_in,
532
+ out_channels=block_in,
533
+ temb_channels=self.temb_ch,
534
+ dropout=dropout)
535
+ self.mid.attn_1 = AttnBlock(block_in)
536
+ self.mid.block_2 = ResnetBlock(in_channels=block_in,
537
+ out_channels=block_in,
538
+ temb_channels=self.temb_ch,
539
+ dropout=dropout)
540
+
541
+ # end
542
+ self.norm_out = Normalize(block_in)
543
+ self.conv_out = torch.nn.Conv2d(block_in,
544
+ 2*z_channels if double_z else z_channels,
545
+ kernel_size=3,
546
+ stride=1,
547
+ padding=1)
548
+
549
+ def forward(self, x, return_feat=False):
550
+ # timestep embedding
551
+ temb = None
552
+
553
+ # downsampling
554
+ hs = [self.conv_in(x)]
555
+ for i_level in range(self.num_resolutions):
556
+ for i_block in range(self.num_res_blocks):
557
+ h = self.down[i_level].block[i_block](hs[-1], temb)
558
+ if len(self.down[i_level].attn) > 0:
559
+ h = self.down[i_level].attn[i_block](h)
560
+ hs.append(h)
561
+ if i_level != self.num_resolutions-1:
562
+ hs.append(self.down[i_level].downsample(hs[-1]))
563
+
564
+ # middle
565
+ h = hs[-1]
566
+ h = self.mid.block_1(h, temb)
567
+ h = self.mid.attn_1(h)
568
+ h = self.mid.block_2(h, temb)
569
+
570
+ # end
571
+ h = self.norm_out(h)
572
+ h = nonlinearity(h)
573
+ h = self.conv_out(h)
574
+ if return_feat:
575
+ hs[-1] = h
576
+ return hs
577
+ else:
578
+ return h
579
+
580
+
581
+ class Decoder(nn.Module):
582
+ def __init__(self, *, ch, out_ch, ch_mult=(1,2,4,8), num_res_blocks,
583
+ attn_resolutions, dropout=0.0, resamp_with_conv=True, in_channels,
584
+ resolution, z_channels, give_pre_end=False, tanh_out=False, use_linear_attn=False,
585
+ attn_type="vanilla", **ignorekwargs):
586
+ super().__init__()
587
+ if use_linear_attn: attn_type = "linear"
588
+ self.ch = ch
589
+ self.temb_ch = 0
590
+ self.num_resolutions = len(ch_mult)
591
+ self.num_res_blocks = num_res_blocks
592
+ self.resolution = resolution
593
+ self.in_channels = in_channels
594
+ self.give_pre_end = give_pre_end
595
+ self.tanh_out = tanh_out
596
+
597
+ # compute in_ch_mult, block_in and curr_res at lowest res
598
+ in_ch_mult = (1,)+tuple(ch_mult)
599
+ block_in = ch*ch_mult[self.num_resolutions-1]
600
+ curr_res = resolution // 2**(self.num_resolutions-1)
601
+ self.z_shape = (1,z_channels, curr_res, curr_res)
602
+ # logging.info("Working with z of shape {} = {} dimensions.".format(self.z_shape, np.prod(self.z_shape)))
603
+
604
+ # z to block_in
605
+ self.conv_in = torch.nn.Conv2d(z_channels,
606
+ block_in,
607
+ kernel_size=3,
608
+ stride=1,
609
+ padding=1)
610
+
611
+ # middle
612
+ self.mid = nn.Module()
613
+ self.mid.block_1 = ResnetBlock(in_channels=block_in,
614
+ out_channels=block_in,
615
+ temb_channels=self.temb_ch,
616
+ dropout=dropout)
617
+ self.mid.attn_1 = AttnBlock(block_in)
618
+ self.mid.block_2 = ResnetBlock(in_channels=block_in,
619
+ out_channels=block_in,
620
+ temb_channels=self.temb_ch,
621
+ dropout=dropout)
622
+
623
+ # upsampling
624
+ self.up = nn.ModuleList()
625
+ for i_level in reversed(range(self.num_resolutions)):
626
+ block = nn.ModuleList()
627
+ attn = nn.ModuleList()
628
+ block_out = ch*ch_mult[i_level]
629
+ for i_block in range(self.num_res_blocks+1):
630
+ block.append(ResnetBlock(in_channels=block_in,
631
+ out_channels=block_out,
632
+ temb_channels=self.temb_ch,
633
+ dropout=dropout))
634
+ block_in = block_out
635
+ if curr_res in attn_resolutions:
636
+ attn.append(AttnBlock(block_in))
637
+ up = nn.Module()
638
+ up.block = block
639
+ up.attn = attn
640
+ if i_level != 0:
641
+ up.upsample = Upsample(block_in, resamp_with_conv)
642
+ curr_res = curr_res * 2
643
+ self.up.insert(0, up) # prepend to get consistent order
644
+
645
+ # end
646
+ self.norm_out = Normalize(block_in)
647
+ self.conv_out = torch.nn.Conv2d(block_in,
648
+ out_ch,
649
+ kernel_size=3,
650
+ stride=1,
651
+ padding=1)
652
+
653
+ def forward(self, z, **kwargs):
654
+ #assert z.shape[1:] == self.z_shape[1:]
655
+ self.last_z_shape = z.shape
656
+
657
+ # timestep embedding
658
+ temb = None
659
+
660
+ # z to block_in
661
+ h = self.conv_in(z)
662
+
663
+ # middle
664
+ h = self.mid.block_1(h, temb)
665
+ h = self.mid.attn_1(h)
666
+ h = self.mid.block_2(h, temb)
667
+
668
+ # upsampling
669
+ for i_level in reversed(range(self.num_resolutions)):
670
+ for i_block in range(self.num_res_blocks+1):
671
+ h = self.up[i_level].block[i_block](h, temb)
672
+ if len(self.up[i_level].attn) > 0:
673
+ h = self.up[i_level].attn[i_block](h)
674
+ if i_level != 0:
675
+ h = self.up[i_level].upsample(h)
676
+
677
+ # end
678
+ if self.give_pre_end:
679
+ return h
680
+
681
+ h = self.norm_out(h)
682
+ h = nonlinearity(h)
683
+ h = self.conv_out(h)
684
+ if self.tanh_out:
685
+ h = torch.tanh(h)
686
+ return h
687
+
688
+
689
+
690
+
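
Note: a minimal round-trip sketch of the `AutoencoderKL` defined above, using the ddconfig values listed in tools/modules/config.py and random input data; the checkpoint path shown in the comment is the one from that config, so treat this as an illustration rather than a working entry point:

    import torch

    ddconfig = dict(double_z=True, z_channels=4, resolution=256, in_channels=3, out_ch=3,
                    ch=128, ch_mult=[1, 2, 4, 4], num_res_blocks=2, attn_resolutions=[], dropout=0.0)
    vae = AutoencoderKL(ddconfig, embed_dim=4)  # add pretrained='models/v2-1_512-ema-pruned.ckpt' to load weights

    x = torch.randn(1, 3, 256, 256)             # stands in for an image normalized to [-1, 1]
    with torch.no_grad():
        posterior = vae.encode(x)               # DiagonalGaussianDistribution over a 4-channel latent
        z = get_first_stage_encoding(posterior) # sampled latent scaled by 0.18215
        recon = vae.decode(z / 0.18215)         # undo the scale factor before decoding, as the sampler does
    # z: (1, 4, 32, 32), recon: (1, 3, 256, 256)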
UniAnimate/tools/modules/clip_embedder.py ADDED
@@ -0,0 +1,212 @@
1
+ import os
2
+ import torch
3
+ import logging
4
+ import open_clip
5
+ import numpy as np
6
+ import torch.nn as nn
7
+ import torchvision.transforms as T
+ from torch.utils.checkpoint import checkpoint
8
+
9
+ from utils.registry_class import EMBEDDER
10
+
11
+
12
+ @EMBEDDER.register_class()
13
+ class FrozenOpenCLIPEmbedder(nn.Module):
14
+ """
15
+ Uses the OpenCLIP transformer encoder for text
16
+ """
17
+ LAYERS = [
18
+ #"pooled",
19
+ "last",
20
+ "penultimate"
21
+ ]
22
+ def __init__(self, pretrained, arch="ViT-H-14", device="cuda", max_length=77,
23
+ freeze=True, layer="last"):
24
+ super().__init__()
25
+ assert layer in self.LAYERS
26
+ model, _, _ = open_clip.create_model_and_transforms(arch, device=torch.device('cpu'), pretrained=pretrained)
27
+ del model.visual
28
+ self.model = model
29
+
30
+ self.device = device
31
+ self.max_length = max_length
32
+ if freeze:
33
+ self.freeze()
34
+ self.layer = layer
35
+ if self.layer == "last":
36
+ self.layer_idx = 0
37
+ elif self.layer == "penultimate":
38
+ self.layer_idx = 1
39
+ else:
40
+ raise NotImplementedError()
41
+
42
+ def freeze(self):
43
+ self.model = self.model.eval()
44
+ for param in self.parameters():
45
+ param.requires_grad = False
46
+
47
+ def forward(self, text):
48
+ tokens = open_clip.tokenize(text)
49
+ z = self.encode_with_transformer(tokens.to(self.device))
50
+ return z
51
+
52
+ def encode_with_transformer(self, text):
53
+ x = self.model.token_embedding(text) # [batch_size, n_ctx, d_model]
54
+ x = x + self.model.positional_embedding
55
+ x = x.permute(1, 0, 2) # NLD -> LND
56
+ x = self.text_transformer_forward(x, attn_mask=self.model.attn_mask)
57
+ x = x.permute(1, 0, 2) # LND -> NLD
58
+ x = self.model.ln_final(x)
59
+ return x
60
+
61
+ def text_transformer_forward(self, x: torch.Tensor, attn_mask = None):
62
+ for i, r in enumerate(self.model.transformer.resblocks):
63
+ if i == len(self.model.transformer.resblocks) - self.layer_idx:
64
+ break
65
+ if self.model.transformer.grad_checkpointing and not torch.jit.is_scripting():
66
+ x = checkpoint(r, x, attn_mask)
67
+ else:
68
+ x = r(x, attn_mask=attn_mask)
69
+ return x
70
+
71
+ def encode(self, text):
72
+ return self(text)
73
+
74
+
75
+ @EMBEDDER.register_class()
76
+ class FrozenOpenCLIPVisualEmbedder(nn.Module):
77
+ """
78
+ Uses the OpenCLIP vision transformer to encode images
79
+ """
80
+ LAYERS = [
81
+ #"pooled",
82
+ "last",
83
+ "penultimate"
84
+ ]
85
+ def __init__(self, pretrained, vit_resolution=(224, 224), arch="ViT-H-14", device="cuda", max_length=77,
86
+ freeze=True, layer="last"):
87
+ super().__init__()
88
+ assert layer in self.LAYERS
89
+ model, _, preprocess = open_clip.create_model_and_transforms(
90
+ arch, device=torch.device('cpu'), pretrained=pretrained)
91
+
92
+ del model.transformer
93
+ self.model = model
94
+ data_white = np.ones((vit_resolution[0], vit_resolution[1], 3), dtype=np.uint8)*255
95
+ self.white_image = preprocess(T.ToPILImage()(data_white)).unsqueeze(0)
96
+
97
+ self.device = device
98
+ self.max_length = max_length # 77
99
+ if freeze:
100
+ self.freeze()
101
+ self.layer = layer # 'penultimate'
102
+ if self.layer == "last":
103
+ self.layer_idx = 0
104
+ elif self.layer == "penultimate":
105
+ self.layer_idx = 1
106
+ else:
107
+ raise NotImplementedError()
108
+
109
+ def freeze(self):
110
+ self.model = self.model.eval()
111
+ for param in self.parameters():
112
+ param.requires_grad = False
113
+
114
+ def forward(self, image):
115
+ # tokens = open_clip.tokenize(text)
116
+ z = self.model.encode_image(image.to(self.device))
117
+ return z
118
+
119
+ def encode_with_transformer(self, text):
120
+ x = self.model.token_embedding(text) # [batch_size, n_ctx, d_model]
121
+ x = x + self.model.positional_embedding
122
+ x = x.permute(1, 0, 2) # NLD -> LND
123
+ x = self.text_transformer_forward(x, attn_mask=self.model.attn_mask)
124
+ x = x.permute(1, 0, 2) # LND -> NLD
125
+ x = self.model.ln_final(x)
126
+
127
+ return x
128
+
129
+ def text_transformer_forward(self, x: torch.Tensor, attn_mask = None):
130
+ for i, r in enumerate(self.model.transformer.resblocks):
131
+ if i == len(self.model.transformer.resblocks) - self.layer_idx:
132
+ break
133
+ if self.model.transformer.grad_checkpointing and not torch.jit.is_scripting():
134
+ x = checkpoint(r, x, attn_mask)
135
+ else:
136
+ x = r(x, attn_mask=attn_mask)
137
+ return x
138
+
139
+ def encode(self, text):
140
+ return self(text)
141
+
142
+
143
+
144
+ @EMBEDDER.register_class()
145
+ class FrozenOpenCLIPTextVisualEmbedder(nn.Module):
146
+ """
147
+ Uses the OpenCLIP transformer encoders for both text and images
148
+ """
149
+ LAYERS = [
150
+ #"pooled",
151
+ "last",
152
+ "penultimate"
153
+ ]
154
+ def __init__(self, pretrained, arch="ViT-H-14", device="cuda", max_length=77,
155
+ freeze=True, layer="last", **kwargs):
156
+ super().__init__()
157
+ assert layer in self.LAYERS
158
+ model, _, _ = open_clip.create_model_and_transforms(arch, device=torch.device('cpu'), pretrained=pretrained)
159
+ self.model = model
160
+
161
+ self.device = device
162
+ self.max_length = max_length
163
+ if freeze:
164
+ self.freeze()
165
+ self.layer = layer
166
+ if self.layer == "last":
167
+ self.layer_idx = 0
168
+ elif self.layer == "penultimate":
169
+ self.layer_idx = 1
170
+ else:
171
+ raise NotImplementedError()
172
+
173
+ def freeze(self):
174
+ self.model = self.model.eval()
175
+ for param in self.parameters():
176
+ param.requires_grad = False
177
+
178
+
179
+ def forward(self, image=None, text=None):
180
+
181
+ xi = self.model.encode_image(image.to(self.device)) if image is not None else None
182
+ tokens = open_clip.tokenize(text)
183
+ xt, x = self.encode_with_transformer(tokens.to(self.device))
184
+ return xi, xt, x
185
+
186
+ def encode_with_transformer(self, text):
187
+ x = self.model.token_embedding(text) # [batch_size, n_ctx, d_model]
188
+ x = x + self.model.positional_embedding
189
+ x = x.permute(1, 0, 2) # NLD -> LND
190
+ x = self.text_transformer_forward(x, attn_mask=self.model.attn_mask)
191
+ x = x.permute(1, 0, 2) # LND -> NLD
192
+ x = self.model.ln_final(x)
193
+ xt = x[torch.arange(x.shape[0]), text.argmax(dim=-1)] @ self.model.text_projection
194
+ return xt, x
195
+
196
+
197
+ def encode_image(self, image):
198
+ return self.model.visual(image)
199
+
200
+ def text_transformer_forward(self, x: torch.Tensor, attn_mask = None):
201
+ for i, r in enumerate(self.model.transformer.resblocks):
202
+ if i == len(self.model.transformer.resblocks) - self.layer_idx:
203
+ break
204
+ if self.model.transformer.grad_checkpointing and not torch.jit.is_scripting():
205
+ x = checkpoint(r, x, attn_mask)
206
+ else:
207
+ x = r(x, attn_mask=attn_mask)
208
+ return x
209
+
210
+ def encode(self, text):
211
+
212
+ return self(text)
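
Note: the embedders above are registered with the EMBEDDER registry and normally built from cfg.embedder; as a standalone sketch (the weight path follows cfg.embedder in tools/modules/config.py, adjust it to your environment):

    import torch

    embedder = FrozenOpenCLIPEmbedder(pretrained='models/open_clip_pytorch_model.bin',
                                      layer='penultimate').to('cuda')
    with torch.no_grad():
        context = embedder(['a person dancing, full body'])
    # context: (1, 77, 1024) token-level text features for the ViT-H-14 text tower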
UniAnimate/tools/modules/config.py ADDED
@@ -0,0 +1,206 @@
1
+ import torch
2
+ import logging
3
+ import os.path as osp
4
+ from datetime import datetime
5
+ from easydict import EasyDict
6
+ import os
7
+
8
+ cfg = EasyDict(__name__='Config: VideoLDM Decoder')
9
+
10
+ # -------------------------------distributed training--------------------------
11
+ pmi_world_size = int(os.getenv('WORLD_SIZE', 1))
12
+ gpus_per_machine = torch.cuda.device_count()
13
+ world_size = pmi_world_size * gpus_per_machine
14
+ # -----------------------------------------------------------------------------
15
+
16
+
17
+ # ---------------------------Dataset Parameter---------------------------------
18
+ cfg.mean = [0.5, 0.5, 0.5]
19
+ cfg.std = [0.5, 0.5, 0.5]
20
+ cfg.max_words = 1000
21
+ cfg.num_workers = 8
22
+ cfg.prefetch_factor = 2
23
+
24
+ # PlaceHolder
25
+ cfg.resolution = [448, 256]
26
+ cfg.vit_out_dim = 1024
27
+ cfg.vit_resolution = 336
28
+ cfg.depth_clamp = 10.0
29
+ cfg.misc_size = 384
30
+ cfg.depth_std = 20.0
31
+
32
+ cfg.save_fps = 8
33
+
34
+ cfg.frame_lens = [32, 32, 32, 1]
35
+ cfg.sample_fps = [4, ]
36
+ cfg.vid_dataset = {
37
+ 'type': 'VideoBaseDataset',
38
+ 'data_list': [],
39
+ 'max_words': cfg.max_words,
40
+ 'resolution': cfg.resolution}
41
+ cfg.img_dataset = {
42
+ 'type': 'ImageBaseDataset',
43
+ 'data_list': ['laion_400m',],
44
+ 'max_words': cfg.max_words,
45
+ 'resolution': cfg.resolution}
46
+
47
+ cfg.batch_sizes = {
48
+ str(1):256,
49
+ str(4):4,
50
+ str(8):4,
51
+ str(16):4}
52
+ # -----------------------------------------------------------------------------
53
+
54
+
55
+ # ---------------------------Mode Parameters-----------------------------------
56
+ # Diffusion
57
+ cfg.Diffusion = {
58
+ 'type': 'DiffusionDDIM',
59
+ 'schedule': 'cosine', # cosine
60
+ 'schedule_param': {
61
+ 'num_timesteps': 1000,
62
+ 'cosine_s': 0.008,
63
+ 'zero_terminal_snr': True,
64
+ },
65
+ 'mean_type': 'v', # [v, eps]
66
+ 'loss_type': 'mse',
67
+ 'var_type': 'fixed_small',
68
+ 'rescale_timesteps': False,
69
+ 'noise_strength': 0.1,
70
+ 'ddim_timesteps': 50
71
+ }
72
+ cfg.ddim_timesteps = 50 # official: 250
73
+ cfg.use_div_loss = False
74
+ # classifier-free guidance
75
+ cfg.p_zero = 0.9
76
+ cfg.guide_scale = 3.0
77
+
78
+ # clip vision encoder
79
+ cfg.vit_mean = [0.48145466, 0.4578275, 0.40821073]
80
+ cfg.vit_std = [0.26862954, 0.26130258, 0.27577711]
81
+
82
+ # sketch
83
+ cfg.sketch_mean = [0.485, 0.456, 0.406]
84
+ cfg.sketch_std = [0.229, 0.224, 0.225]
85
+ # cfg.misc_size = 256
86
+ cfg.depth_std = 20.0
87
+ cfg.depth_clamp = 10.0
88
+ cfg.hist_sigma = 10.0
89
+
90
+ # Model
91
+ cfg.scale_factor = 0.18215
92
+ cfg.use_checkpoint = True
93
+ cfg.use_sharded_ddp = False
94
+ cfg.use_fsdp = False
95
+ cfg.use_fp16 = True
96
+ cfg.temporal_attention = True
97
+
98
+ cfg.UNet = {
99
+ 'type': 'UNetSD',
100
+ 'in_dim': 4,
101
+ 'dim': 320,
102
+ 'y_dim': cfg.vit_out_dim,
103
+ 'context_dim': 1024,
104
+ 'out_dim': 8,
105
+ 'dim_mult': [1, 2, 4, 4],
106
+ 'num_heads': 8,
107
+ 'head_dim': 64,
108
+ 'num_res_blocks': 2,
109
+ 'attn_scales': [1 / 1, 1 / 2, 1 / 4],
110
+ 'dropout': 0.1,
111
+ 'temporal_attention': cfg.temporal_attention,
112
+ 'temporal_attn_times': 1,
113
+ 'use_checkpoint': cfg.use_checkpoint,
114
+ 'use_fps_condition': False,
115
+ 'use_sim_mask': False
116
+ }
117
+
118
+ # autoencoder from Stable Diffusion
119
+ cfg.guidances = []
120
+ cfg.auto_encoder = {
121
+ 'type': 'AutoencoderKL',
122
+ 'ddconfig': {
123
+ 'double_z': True,
124
+ 'z_channels': 4,
125
+ 'resolution': 256,
126
+ 'in_channels': 3,
127
+ 'out_ch': 3,
128
+ 'ch': 128,
129
+ 'ch_mult': [1, 2, 4, 4],
130
+ 'num_res_blocks': 2,
131
+ 'attn_resolutions': [],
132
+ 'dropout': 0.0,
133
+ 'video_kernel_size': [3, 1, 1]
134
+ },
135
+ 'embed_dim': 4,
136
+ 'pretrained': 'models/v2-1_512-ema-pruned.ckpt'
137
+ }
138
+ # clip embedder
139
+ cfg.embedder = {
140
+ 'type': 'FrozenOpenCLIPEmbedder',
141
+ 'layer': 'penultimate',
142
+ 'pretrained': 'models/open_clip_pytorch_model.bin'
143
+ }
144
+ # -----------------------------------------------------------------------------
145
+
146
+ # ---------------------------Training Settings---------------------------------
147
+ # training and optimizer
148
+ cfg.ema_decay = 0.9999
149
+ cfg.num_steps = 600000
150
+ cfg.lr = 5e-5
151
+ cfg.weight_decay = 0.0
152
+ cfg.betas = (0.9, 0.999)
153
+ cfg.eps = 1.0e-8
154
+ cfg.chunk_size = 16
155
+ cfg.decoder_bs = 8
156
+ cfg.alpha = 0.7
157
+ cfg.save_ckp_interval = 1000
158
+
159
+ # scheduler
160
+ cfg.warmup_steps = 10
161
+ cfg.decay_mode = 'cosine'
162
+
163
+ # acceleration
164
+ cfg.use_ema = True
165
+ if world_size<2:
166
+ cfg.use_ema = False
167
+ cfg.load_from = None
168
+ # -----------------------------------------------------------------------------
169
+
170
+
171
+ # ----------------------------Pretrain Settings---------------------------------
172
+ cfg.Pretrain = {
173
+ 'type': 'pretrain_specific_strategies',
174
+ 'fix_weight': False,
175
+ 'grad_scale': 0.2,
176
+ 'resume_checkpoint': 'models/jiuniu_0267000.pth',
177
+ 'sd_keys_path': 'models/stable_diffusion_image_key_temporal_attention_x1.json',
178
+ }
179
+ # -----------------------------------------------------------------------------
180
+
181
+
182
+ # -----------------------------Visual-------------------------------------------
183
+ # Visual videos
184
+ cfg.viz_interval = 1000
185
+ cfg.visual_train = {
186
+ 'type': 'VisualTrainTextImageToVideo',
187
+ }
188
+ cfg.visual_inference = {
189
+ 'type': 'VisualGeneratedVideos',
190
+ }
191
+ cfg.inference_list_path = ''
192
+
193
+ # logging
194
+ cfg.log_interval = 100
195
+
196
+ ### Default log_dir
197
+ cfg.log_dir = 'outputs/'
198
+ # -----------------------------------------------------------------------------
199
+
200
+
201
+ # ---------------------------Others--------------------------------------------
202
+ # seed
203
+ cfg.seed = 8888
204
+ cfg.negative_prompt = 'Distorted, discontinuous, Ugly, blurry, low resolution, motionless, static, disfigured, disconnected limbs, Ugly faces, incomplete arms'
205
+ # -----------------------------------------------------------------------------
206
+
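
Note: `cfg` is a plain EasyDict of defaults, and the inference code above reads optional fields with `getattr(cfg, ..., default)` fallbacks, so experiment configs can simply overwrite attributes before the tools run. A minimal sketch (the field values here are examples only):

    from tools.modules.config import cfg

    cfg.ddim_timesteps = 30
    cfg.log_dir = 'outputs/demo'
    cfg.CPU_CLIP_VAE = True                                 # offload CLIP/VAE during sampling, as handled above
    noise_prior = getattr(cfg, 'noise_prior_value', 939)    # optional fields fall back to their defaults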
UniAnimate/tools/modules/diffusions/__init__.py ADDED
@@ -0,0 +1 @@
1
+ from .diffusion_ddim import *
UniAnimate/tools/modules/diffusions/diffusion_ddim.py ADDED
@@ -0,0 +1,1121 @@
1
+ import torch
2
+ import math
3
+
4
+ from utils.registry_class import DIFFUSION
5
+ from .schedules import beta_schedule, sigma_schedule
6
+ from .losses import kl_divergence, discretized_gaussian_log_likelihood
7
+ # from .dpm_solver import NoiseScheduleVP, model_wrapper_guided_diffusion, model_wrapper, DPM_Solver
8
+ from typing import Callable, List, Optional
9
+ import numpy as np
10
+
11
+ def _i(tensor, t, x):
12
+ r"""Index tensor using t and format the output according to x.
13
+ """
14
+ if tensor.device != x.device:
15
+ tensor = tensor.to(x.device)
16
+ shape = (x.size(0), ) + (1, ) * (x.ndim - 1)
17
+ return tensor[t].view(shape).to(x)
18
+
19
+ @DIFFUSION.register_class()
20
+ class DiffusionDDIMSR(object):
21
+ def __init__(self, reverse_diffusion, forward_diffusion, **kwargs):
22
+ from .diffusion_gauss import GaussianDiffusion
23
+ self.reverse_diffusion = GaussianDiffusion(sigmas=sigma_schedule(reverse_diffusion.schedule, **reverse_diffusion.schedule_param),
24
+ prediction_type=reverse_diffusion.mean_type)
25
+ self.forward_diffusion = GaussianDiffusion(sigmas=sigma_schedule(forward_diffusion.schedule, **forward_diffusion.schedule_param),
26
+ prediction_type=forward_diffusion.mean_type)
27
+
28
+
29
+ @DIFFUSION.register_class()
30
+ class DiffusionDPM(object):
31
+ def __init__(self, forward_diffusion, **kwargs):
32
+ from .diffusion_gauss import GaussianDiffusion
33
+ self.forward_diffusion = GaussianDiffusion(sigmas=sigma_schedule(forward_diffusion.schedule, **forward_diffusion.schedule_param),
34
+ prediction_type=forward_diffusion.mean_type)
35
+
36
+
37
+ @DIFFUSION.register_class()
38
+ class DiffusionDDIM(object):
39
+ def __init__(self,
40
+ schedule='linear_sd',
41
+ schedule_param={},
42
+ mean_type='eps',
43
+ var_type='learned_range',
44
+ loss_type='mse',
45
+ epsilon = 1e-12,
46
+ rescale_timesteps=False,
47
+ noise_strength=0.0,
48
+ **kwargs):
49
+
50
+ assert mean_type in ['x0', 'x_{t-1}', 'eps', 'v']
51
+ assert var_type in ['learned', 'learned_range', 'fixed_large', 'fixed_small']
52
+ assert loss_type in ['mse', 'rescaled_mse', 'kl', 'rescaled_kl', 'l1', 'rescaled_l1','charbonnier']
53
+
54
+ betas = beta_schedule(schedule, **schedule_param)
55
+ assert min(betas) > 0 and max(betas) <= 1
56
+
57
+ if not isinstance(betas, torch.DoubleTensor):
58
+ betas = torch.tensor(betas, dtype=torch.float64)
59
+
60
+ self.betas = betas
61
+ self.num_timesteps = len(betas)
62
+ self.mean_type = mean_type # eps
63
+ self.var_type = var_type # 'fixed_small'
64
+ self.loss_type = loss_type # mse
65
+ self.epsilon = epsilon # 1e-12
66
+ self.rescale_timesteps = rescale_timesteps # False
67
+ self.noise_strength = noise_strength # 0.0
68
+
69
+ # alphas
70
+ alphas = 1 - self.betas
71
+ self.alphas_cumprod = torch.cumprod(alphas, dim=0)
72
+ self.alphas_cumprod_prev = torch.cat([alphas.new_ones([1]), self.alphas_cumprod[:-1]])
73
+ self.alphas_cumprod_next = torch.cat([self.alphas_cumprod[1:], alphas.new_zeros([1])])
74
+
75
+ # q(x_t | x_{t-1})
76
+ self.sqrt_alphas_cumprod = torch.sqrt(self.alphas_cumprod)
77
+ self.sqrt_one_minus_alphas_cumprod = torch.sqrt(1.0 - self.alphas_cumprod)
78
+ self.log_one_minus_alphas_cumprod = torch.log(1.0 - self.alphas_cumprod)
79
+ self.sqrt_recip_alphas_cumprod = torch.sqrt(1.0 / self.alphas_cumprod)
80
+ self.sqrt_recipm1_alphas_cumprod = torch.sqrt(1.0 / self.alphas_cumprod - 1)
81
+
82
+ # q(x_{t-1} | x_t, x_0)
83
+ self.posterior_variance = betas * (1.0 - self.alphas_cumprod_prev) / (1.0 - self.alphas_cumprod)
84
+ self.posterior_log_variance_clipped = torch.log(self.posterior_variance.clamp(1e-20))
85
+ self.posterior_mean_coef1 = betas * torch.sqrt(self.alphas_cumprod_prev) / (1.0 - self.alphas_cumprod)
86
+ self.posterior_mean_coef2 = (1.0 - self.alphas_cumprod_prev) * torch.sqrt(alphas) / (1.0 - self.alphas_cumprod)
87
+
88
+
89
+ def sample_loss(self, x0, noise=None):
90
+ if noise is None:
91
+ noise = torch.randn_like(x0)
92
+ if self.noise_strength > 0:
93
+ b, c, f, _, _= x0.shape
94
+ offset_noise = torch.randn(b, c, f, 1, 1, device=x0.device)
95
+ noise = noise + self.noise_strength * offset_noise
96
+ return noise
97
+
98
+
99
+ def q_sample(self, x0, t, noise=None):
100
+ r"""Sample from q(x_t | x_0).
101
+ """
102
+ # noise = torch.randn_like(x0) if noise is None else noise
103
+ noise = self.sample_loss(x0, noise)
104
+ return _i(self.sqrt_alphas_cumprod, t, x0) * x0 + \
105
+ _i(self.sqrt_one_minus_alphas_cumprod, t, x0) * noise
106
+
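
Note: `q_sample` implements the closed form q(x_t | x_0) = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise. The sketch below rebuilds the cumulative-product terms for a toy linear beta schedule (not the project's `beta_schedule()`), purely to show the broadcasting that `_i` performs:

    import torch

    betas = torch.linspace(1e-4, 2e-2, 1000, dtype=torch.float64)
    alphas_cumprod = torch.cumprod(1 - betas, dim=0)

    x0 = torch.randn(2, 4, 8, 32, 32)               # (b, c, f, h, w) latent video, toy data
    t = torch.tensor([10, 500])
    noise = torch.randn_like(x0)

    shape = (x0.size(0),) + (1,) * (x0.ndim - 1)    # same broadcasting shape as _i()
    xt = alphas_cumprod[t].sqrt().view(shape).to(x0) * x0 \
       + (1 - alphas_cumprod[t]).sqrt().view(shape).to(x0) * noise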
107
+ def q_mean_variance(self, x0, t):
108
+ r"""Distribution of q(x_t | x_0).
109
+ """
110
+ mu = _i(self.sqrt_alphas_cumprod, t, x0) * x0
111
+ var = _i(1.0 - self.alphas_cumprod, t, x0)
112
+ log_var = _i(self.log_one_minus_alphas_cumprod, t, x0)
113
+ return mu, var, log_var
114
+
115
+ def q_posterior_mean_variance(self, x0, xt, t):
116
+ r"""Distribution of q(x_{t-1} | x_t, x_0).
117
+ """
118
+ mu = _i(self.posterior_mean_coef1, t, xt) * x0 + _i(self.posterior_mean_coef2, t, xt) * xt
119
+ var = _i(self.posterior_variance, t, xt)
120
+ log_var = _i(self.posterior_log_variance_clipped, t, xt)
121
+ return mu, var, log_var
122
+
123
+ @torch.no_grad()
124
+ def p_sample(self, xt, t, model, model_kwargs={}, clamp=None, percentile=None, condition_fn=None, guide_scale=None):
125
+ r"""Sample from p(x_{t-1} | x_t).
126
+ - condition_fn: for classifier-based guidance (guided-diffusion).
127
+ - guide_scale: for classifier-free guidance (glide/dalle-2).
128
+ """
129
+ # predict distribution of p(x_{t-1} | x_t)
130
+ mu, var, log_var, x0 = self.p_mean_variance(xt, t, model, model_kwargs, clamp, percentile, guide_scale)
131
+
132
+ # random sample (with optional conditional function)
133
+ noise = torch.randn_like(xt)
134
+ mask = t.ne(0).float().view(-1, *((1, ) * (xt.ndim - 1))) # no noise when t == 0
135
+ if condition_fn is not None:
136
+ grad = condition_fn(xt, self._scale_timesteps(t), **model_kwargs)
137
+ mu = mu.float() + var * grad.float()
138
+ xt_1 = mu + mask * torch.exp(0.5 * log_var) * noise
139
+ return xt_1, x0
140
+
141
+ @torch.no_grad()
142
+ def p_sample_loop(self, noise, model, model_kwargs={}, clamp=None, percentile=None, condition_fn=None, guide_scale=None):
143
+ r"""Sample from p(x_{t-1} | x_t) p(x_{t-2} | x_{t-1}) ... p(x_0 | x_1).
144
+ """
145
+ # prepare input
146
+ b = noise.size(0)
147
+ xt = noise
148
+
149
+ # diffusion process
150
+ for step in torch.arange(self.num_timesteps).flip(0):
151
+ t = torch.full((b, ), step, dtype=torch.long, device=xt.device)
152
+ xt, _ = self.p_sample(xt, t, model, model_kwargs, clamp, percentile, condition_fn, guide_scale)
153
+ return xt
154
+
155
+ def p_mean_variance(self, xt, t, model, model_kwargs={}, clamp=None, percentile=None, guide_scale=None):
156
+ r"""Distribution of p(x_{t-1} | x_t).
157
+ """
158
+ # predict distribution
159
+ if guide_scale is None:
160
+ out = model(xt, self._scale_timesteps(t), **model_kwargs)
161
+ else:
162
+ # classifier-free guidance
163
+ # (model_kwargs[0]: conditional kwargs; model_kwargs[1]: non-conditional kwargs)
164
+ assert isinstance(model_kwargs, list) and len(model_kwargs) == 2
165
+ y_out = model(xt, self._scale_timesteps(t), **model_kwargs[0])
166
+ u_out = model(xt, self._scale_timesteps(t), **model_kwargs[1])
167
+ dim = y_out.size(1) if self.var_type.startswith('fixed') else y_out.size(1) // 2
168
+ out = torch.cat([
169
+ u_out[:, :dim] + guide_scale * (y_out[:, :dim] - u_out[:, :dim]),
170
+ y_out[:, dim:]], dim=1) # guide_scale=9.0
171
+
172
+ # compute variance
173
+ if self.var_type == 'learned':
174
+ out, log_var = out.chunk(2, dim=1)
175
+ var = torch.exp(log_var)
176
+ elif self.var_type == 'learned_range':
177
+ out, fraction = out.chunk(2, dim=1)
178
+ min_log_var = _i(self.posterior_log_variance_clipped, t, xt)
179
+ max_log_var = _i(torch.log(self.betas), t, xt)
180
+ fraction = (fraction + 1) / 2.0
181
+ log_var = fraction * max_log_var + (1 - fraction) * min_log_var
182
+ var = torch.exp(log_var)
183
+ elif self.var_type == 'fixed_large':
184
+ var = _i(torch.cat([self.posterior_variance[1:2], self.betas[1:]]), t, xt)
185
+ log_var = torch.log(var)
186
+ elif self.var_type == 'fixed_small':
187
+ var = _i(self.posterior_variance, t, xt)
188
+ log_var = _i(self.posterior_log_variance_clipped, t, xt)
189
+
190
+ # compute mean and x0
191
+ if self.mean_type == 'x_{t-1}':
192
+ mu = out # x_{t-1}
193
+ x0 = _i(1.0 / self.posterior_mean_coef1, t, xt) * mu - \
194
+ _i(self.posterior_mean_coef2 / self.posterior_mean_coef1, t, xt) * xt
195
+ elif self.mean_type == 'x0':
196
+ x0 = out
197
+ mu, _, _ = self.q_posterior_mean_variance(x0, xt, t)
198
+ elif self.mean_type == 'eps':
199
+ x0 = _i(self.sqrt_recip_alphas_cumprod, t, xt) * xt - \
200
+ _i(self.sqrt_recipm1_alphas_cumprod, t, xt) * out
201
+ mu, _, _ = self.q_posterior_mean_variance(x0, xt, t)
202
+ elif self.mean_type == 'v':
203
+ x0 = _i(self.sqrt_alphas_cumprod, t, xt) * xt - \
204
+ _i(self.sqrt_one_minus_alphas_cumprod, t, xt) * out
205
+ mu, _, _ = self.q_posterior_mean_variance(x0, xt, t)
206
+
207
+ # restrict the range of x0
208
+ if percentile is not None:
209
+ assert percentile > 0 and percentile <= 1 # e.g., 0.995
210
+ s = torch.quantile(x0.flatten(1).abs(), percentile, dim=1).clamp_(1.0).view(-1, 1, 1, 1)
211
+ x0 = torch.min(s, torch.max(-s, x0)) / s
212
+ elif clamp is not None:
213
+ x0 = x0.clamp(-clamp, clamp)
214
+ return mu, var, log_var, x0
215
+
216
+ @torch.no_grad()
217
+ def ddim_sample(self, xt, t, model, model_kwargs={}, clamp=None, percentile=None, condition_fn=None, guide_scale=None, ddim_timesteps=20, eta=0.0):
218
+ r"""Sample from p(x_{t-1} | x_t) using DDIM.
219
+ - condition_fn: for classifier-based guidance (guided-diffusion).
220
+ - guide_scale: for classifier-free guidance (glide/dalle-2).
221
+ """
222
+ stride = self.num_timesteps // ddim_timesteps
223
+
224
+ # predict distribution of p(x_{t-1} | x_t)
225
+ _, _, _, x0 = self.p_mean_variance(xt, t, model, model_kwargs, clamp, percentile, guide_scale)
226
+ if condition_fn is not None:
227
+ # x0 -> eps
228
+ alpha = _i(self.alphas_cumprod, t, xt)
229
+ eps = (_i(self.sqrt_recip_alphas_cumprod, t, xt) * xt - x0) / \
230
+ _i(self.sqrt_recipm1_alphas_cumprod, t, xt)
231
+ eps = eps - (1 - alpha).sqrt() * condition_fn(xt, self._scale_timesteps(t), **model_kwargs)
232
+
233
+ # eps -> x0
234
+ x0 = _i(self.sqrt_recip_alphas_cumprod, t, xt) * xt - \
235
+ _i(self.sqrt_recipm1_alphas_cumprod, t, xt) * eps
236
+
237
+ # derive variables
238
+ eps = (_i(self.sqrt_recip_alphas_cumprod, t, xt) * xt - x0) / \
239
+ _i(self.sqrt_recipm1_alphas_cumprod, t, xt)
240
+ alphas = _i(self.alphas_cumprod, t, xt)
241
+ alphas_prev = _i(self.alphas_cumprod, (t - stride).clamp(0), xt)
242
+ sigmas = eta * torch.sqrt((1 - alphas_prev) / (1 - alphas) * (1 - alphas / alphas_prev))
243
+
244
+ # random sample
245
+ noise = torch.randn_like(xt)
246
+ direction = torch.sqrt(1 - alphas_prev - sigmas ** 2) * eps
247
+ mask = t.ne(0).float().view(-1, *((1, ) * (xt.ndim - 1)))
248
+ xt_1 = torch.sqrt(alphas_prev) * x0 + direction + mask * sigmas * noise
249
+ return xt_1, x0
250
+
251
+ @torch.no_grad()
252
+ def ddim_sample_loop(self, noise, model, model_kwargs={}, clamp=None, percentile=None, condition_fn=None, guide_scale=None, ddim_timesteps=20, eta=0.0):
253
+ # prepare input
254
+ b = noise.size(0)
255
+ xt = noise
256
+
257
+ # diffusion process (TODO: clamp is inaccurate! Consider replacing the stride by explicit prev/next steps)
258
+ steps = (1 + torch.arange(0, self.num_timesteps, self.num_timesteps // ddim_timesteps)).clamp(0, self.num_timesteps - 1).flip(0)
259
+ from tqdm import tqdm
260
+ for step in tqdm(steps):
261
+ t = torch.full((b, ), step, dtype=torch.long, device=xt.device)
262
+ xt, _ = self.ddim_sample(xt, t, model, model_kwargs, clamp, percentile, condition_fn, guide_scale, ddim_timesteps, eta)
263
+ # from ipdb import set_trace; set_trace()
264
+ return xt
265
+
266
+ @torch.no_grad()
267
+ def ddim_reverse_sample(self, xt, t, model, model_kwargs={}, clamp=None, percentile=None, guide_scale=None, ddim_timesteps=20):
268
+ r"""Sample from p(x_{t+1} | x_t) using DDIM reverse ODE (deterministic).
269
+ """
270
+ stride = self.num_timesteps // ddim_timesteps
271
+
272
+ # predict distribution of p(x_{t-1} | x_t)
273
+ _, _, _, x0 = self.p_mean_variance(xt, t, model, model_kwargs, clamp, percentile, guide_scale)
274
+
275
+ # derive variables
276
+ eps = (_i(self.sqrt_recip_alphas_cumprod, t, xt) * xt - x0) / \
277
+ _i(self.sqrt_recipm1_alphas_cumprod, t, xt)
278
+ alphas_next = _i(
279
+ torch.cat([self.alphas_cumprod, self.alphas_cumprod.new_zeros([1])]),
280
+ (t + stride).clamp(0, self.num_timesteps), xt)
281
+
282
+ # reverse sample
283
+ mu = torch.sqrt(alphas_next) * x0 + torch.sqrt(1 - alphas_next) * eps
284
+ return mu, x0
285
+
286
+ @torch.no_grad()
287
+ def ddim_reverse_sample_loop(self, x0, model, model_kwargs={}, clamp=None, percentile=None, guide_scale=None, ddim_timesteps=20):
288
+ # prepare input
289
+ b = x0.size(0)
290
+ xt = x0
291
+
292
+ # reconstruction steps
293
+ steps = torch.arange(0, self.num_timesteps, self.num_timesteps // ddim_timesteps)
294
+ for step in steps:
295
+ t = torch.full((b, ), step, dtype=torch.long, device=xt.device)
296
+ xt, _ = self.ddim_reverse_sample(xt, t, model, model_kwargs, clamp, percentile, guide_scale, ddim_timesteps)
297
+ return xt
298
+
299
+ @torch.no_grad()
300
+ def plms_sample(self, xt, t, model, model_kwargs={}, clamp=None, percentile=None, condition_fn=None, guide_scale=None, plms_timesteps=20, eps_cache=[]):
301
+ r"""Sample from p(x_{t-1} | x_t) using PLMS.
302
+ - condition_fn: for classifier-based guidance (guided-diffusion).
303
+ - guide_scale: for classifier-free guidance (glide/dalle-2).
304
+ """
305
+ stride = self.num_timesteps // plms_timesteps
306
+
307
+ # function for compute eps
308
+ def compute_eps(xt, t):
309
+ # predict distribution of p(x_{t-1} | x_t)
310
+ _, _, _, x0 = self.p_mean_variance(xt, t, model, model_kwargs, clamp, percentile, guide_scale)
311
+
312
+ # condition
313
+ if condition_fn is not None:
314
+ # x0 -> eps
315
+ alpha = _i(self.alphas_cumprod, t, xt)
316
+ eps = (_i(self.sqrt_recip_alphas_cumprod, t, xt) * xt - x0) / \
317
+ _i(self.sqrt_recipm1_alphas_cumprod, t, xt)
318
+ eps = eps - (1 - alpha).sqrt() * condition_fn(xt, self._scale_timesteps(t), **model_kwargs)
319
+
320
+ # eps -> x0
321
+ x0 = _i(self.sqrt_recip_alphas_cumprod, t, xt) * xt - \
322
+ _i(self.sqrt_recipm1_alphas_cumprod, t, xt) * eps
323
+
324
+ # derive eps
325
+ eps = (_i(self.sqrt_recip_alphas_cumprod, t, xt) * xt - x0) / \
326
+ _i(self.sqrt_recipm1_alphas_cumprod, t, xt)
327
+ return eps
328
+
329
+ # function to compute x_0 and x_{t-1}
330
+ def compute_x0(eps, t):
331
+ # eps -> x0
332
+ x0 = _i(self.sqrt_recip_alphas_cumprod, t, xt) * xt - \
333
+ _i(self.sqrt_recipm1_alphas_cumprod, t, xt) * eps
334
+
335
+ # deterministic sample
336
+ alphas_prev = _i(self.alphas_cumprod, (t - stride).clamp(0), xt)
337
+ direction = torch.sqrt(1 - alphas_prev) * eps
338
+ mask = t.ne(0).float().view(-1, *((1, ) * (xt.ndim - 1)))
339
+ xt_1 = torch.sqrt(alphas_prev) * x0 + direction
340
+ return xt_1, x0
341
+
342
+ # PLMS sample
343
+ eps_cache = [] if eps_cache is None else eps_cache
+ eps = compute_eps(xt, t)
344
+ if len(eps_cache) == 0:
345
+ # 2nd order pseudo improved Euler
346
+ xt_1, x0 = compute_x0(eps, t)
347
+ eps_next = compute_eps(xt_1, (t - stride).clamp(0))
348
+ eps_prime = (eps + eps_next) / 2.0
349
+ elif len(eps_cache) == 1:
350
+ # 2nd order pseudo linear multistep (Adams-Bashforth)
351
+ eps_prime = (3 * eps - eps_cache[-1]) / 2.0
352
+ elif len(eps_cache) == 2:
353
+ # 3rd order pseudo linear multistep (Adams-Bashforth)
354
+ eps_prime = (23 * eps - 16 * eps_cache[-1] + 5 * eps_cache[-2]) / 12.0
355
+ elif len(eps_cache) >= 3:
356
+ # 4th order pseudo linear multistep (Adams-Bashforth)
357
+ eps_prime = (55 * eps - 59 * eps_cache[-1] + 37 * eps_cache[-2] - 9 * eps_cache[-3]) / 24.0
358
+ xt_1, x0 = compute_x0(eps_prime, t)
359
+ return xt_1, x0, eps
360
+
361
+ @torch.no_grad()
362
+ def plms_sample_loop(self, noise, model, model_kwargs={}, clamp=None, percentile=None, condition_fn=None, guide_scale=None, plms_timesteps=20):
363
+ # prepare input
364
+ b = noise.size(0)
365
+ xt = noise
366
+
367
+ # diffusion process
368
+ steps = (1 + torch.arange(0, self.num_timesteps, self.num_timesteps // plms_timesteps)).clamp(0, self.num_timesteps - 1).flip(0)
369
+ eps_cache = []
370
+ for step in steps:
371
+ # PLMS sampling step
372
+ t = torch.full((b, ), step, dtype=torch.long, device=xt.device)
373
+ xt, _, eps = self.plms_sample(xt, t, model, model_kwargs, clamp, percentile, condition_fn, guide_scale, plms_timesteps, eps_cache)
374
+
375
+ # update eps cache
376
+ eps_cache.append(eps)
377
+ if len(eps_cache) >= 4:
378
+ eps_cache.pop(0)
379
+ return xt
380
+
381
+ def loss(self, x0, t, model, model_kwargs={}, noise=None, weight = None, use_div_loss= False, loss_mask=None):
382
+
383
+ # noise = torch.randn_like(x0) if noise is None else noise # [80, 4, 8, 32, 32]
384
+ noise = self.sample_loss(x0, noise)
385
+
386
+ xt = self.q_sample(x0, t, noise=noise)
387
+
388
+ # compute loss
389
+ if self.loss_type in ['kl', 'rescaled_kl']:
390
+ loss, _ = self.variational_lower_bound(x0, xt, t, model, model_kwargs)
391
+ if self.loss_type == 'rescaled_kl':
392
+ loss = loss * self.num_timesteps
393
+ elif self.loss_type in ['mse', 'rescaled_mse', 'l1', 'rescaled_l1']: # self.loss_type: mse
394
+ out = model(xt, self._scale_timesteps(t), **model_kwargs)
395
+
396
+ # VLB for variation
397
+ loss_vlb = 0.0
398
+ if self.var_type in ['learned', 'learned_range']: # self.var_type: 'fixed_small'
399
+ out, var = out.chunk(2, dim=1)
400
+ frozen = torch.cat([out.detach(), var], dim=1) # learn var without affecting the prediction of mean
401
+ loss_vlb, _ = self.variational_lower_bound(x0, xt, t, model=lambda *args, **kwargs: frozen)
402
+ if self.loss_type.startswith('rescaled_'):
403
+ loss_vlb = loss_vlb * self.num_timesteps / 1000.0
404
+
405
+ # MSE/L1 for x0/eps
406
+ # target = {'eps': noise, 'x0': x0, 'x_{t-1}': self.q_posterior_mean_variance(x0, xt, t)[0]}[self.mean_type]
407
+ target = {
408
+ 'eps': noise,
409
+ 'x0': x0,
410
+ 'x_{t-1}': self.q_posterior_mean_variance(x0, xt, t)[0],
411
+ 'v':_i(self.sqrt_alphas_cumprod, t, xt) * noise - _i(self.sqrt_one_minus_alphas_cumprod, t, xt) * x0}[self.mean_type]
412
+ if loss_mask is not None:
413
+ loss_mask = loss_mask[:, :, 0, ...].unsqueeze(2) # just use one channel (all channels are same)
414
+ loss_mask = loss_mask.permute(0, 2, 1, 3, 4) # b,c,f,h,w
415
+ # use masked diffusion
416
+ loss = (out * loss_mask - target * loss_mask).pow(1 if self.loss_type.endswith('l1') else 2).abs().flatten(1).mean(dim=1)
417
+ else:
418
+ loss = (out - target).pow(1 if self.loss_type.endswith('l1') else 2).abs().flatten(1).mean(dim=1)
419
+ if weight is not None:
420
+ loss = loss*weight
421
+
422
+ # div loss
423
+ if use_div_loss and self.mean_type == 'eps' and x0.shape[2]>1:
424
+
425
+ # derive x0
426
+ x0_ = _i(self.sqrt_recip_alphas_cumprod, t, xt) * xt - \
427
+ _i(self.sqrt_recipm1_alphas_cumprod, t, xt) * out
428
+
429
+ # # derive xt_1, set eta=0 as ddim
430
+ # alphas_prev = _i(self.alphas_cumprod, (t - 1).clamp(0), xt)
431
+ # direction = torch.sqrt(1 - alphas_prev) * out
432
+ # xt_1 = torch.sqrt(alphas_prev) * x0_ + direction
433
+
434
+ # ncfhw, std on f
435
+ div_loss = 0.001/(x0_.std(dim=2).flatten(1).mean(dim=1)+1e-4)
436
+ # print(div_loss,loss)
437
+ loss = loss+div_loss
438
+
439
+ # total loss
440
+ loss = loss + loss_vlb
441
+ elif self.loss_type in ['charbonnier']:
442
+ out = model(xt, self._scale_timesteps(t), **model_kwargs)
443
+
444
+ # VLB for variation
445
+ loss_vlb = 0.0
446
+ if self.var_type in ['learned', 'learned_range']:
447
+ out, var = out.chunk(2, dim=1)
448
+ frozen = torch.cat([out.detach(), var], dim=1) # learn var without affecting the prediction of mean
449
+ loss_vlb, _ = self.variational_lower_bound(x0, xt, t, model=lambda *args, **kwargs: frozen)
450
+ if self.loss_type.startswith('rescaled_'):
451
+ loss_vlb = loss_vlb * self.num_timesteps / 1000.0
452
+
453
+ # MSE/L1 for x0/eps
454
+ target = {'eps': noise, 'x0': x0, 'x_{t-1}': self.q_posterior_mean_variance(x0, xt, t)[0]}[self.mean_type]
455
+ loss = torch.sqrt((out - target)**2 + self.epsilon)
456
+ if weight is not None:
457
+ loss = loss*weight
458
+ loss = loss.flatten(1).mean(dim=1)
459
+
460
+ # total loss
461
+ loss = loss + loss_vlb
462
+ return loss
463
+
464
+ def variational_lower_bound(self, x0, xt, t, model, model_kwargs={}, clamp=None, percentile=None):
465
+ # compute groundtruth and predicted distributions
466
+ mu1, _, log_var1 = self.q_posterior_mean_variance(x0, xt, t)
467
+ mu2, _, log_var2, x0 = self.p_mean_variance(xt, t, model, model_kwargs, clamp, percentile)
468
+
469
+ # compute KL loss
470
+ kl = kl_divergence(mu1, log_var1, mu2, log_var2)
471
+ kl = kl.flatten(1).mean(dim=1) / math.log(2.0)
472
+
473
+ # compute discretized NLL loss (for p(x0 | x1) only)
474
+ nll = -discretized_gaussian_log_likelihood(x0, mean=mu2, log_scale=0.5 * log_var2)
475
+ nll = nll.flatten(1).mean(dim=1) / math.log(2.0)
476
+
477
+ # NLL for p(x0 | x1) and KL otherwise
478
+ vlb = torch.where(t == 0, nll, kl)
479
+ return vlb, x0
480
+
481
+ @torch.no_grad()
482
+ def variational_lower_bound_loop(self, x0, model, model_kwargs={}, clamp=None, percentile=None):
483
+ r"""Compute the entire variational lower bound, measured in bits-per-dim.
484
+ """
485
+ # prepare input and output
486
+ b = x0.size(0)
487
+ metrics = {'vlb': [], 'mse': [], 'x0_mse': []}
488
+
489
+ # loop
490
+ for step in torch.arange(self.num_timesteps).flip(0):
491
+ # compute VLB
492
+ t = torch.full((b, ), step, dtype=torch.long, device=x0.device)
493
+ # noise = torch.randn_like(x0)
494
+ noise = self.sample_loss(x0)
495
+ xt = self.q_sample(x0, t, noise)
496
+ vlb, pred_x0 = self.variational_lower_bound(x0, xt, t, model, model_kwargs, clamp, percentile)
497
+
498
+ # predict eps from x0
499
+ eps = (_i(self.sqrt_recip_alphas_cumprod, t, xt) * xt - x0) / \
500
+ _i(self.sqrt_recipm1_alphas_cumprod, t, xt)
501
+
502
+ # collect metrics
503
+ metrics['vlb'].append(vlb)
504
+ metrics['x0_mse'].append((pred_x0 - x0).square().flatten(1).mean(dim=1))
505
+ metrics['mse'].append((eps - noise).square().flatten(1).mean(dim=1))
506
+ metrics = {k: torch.stack(v, dim=1) for k, v in metrics.items()}
507
+
508
+ # compute the prior KL term for VLB, measured in bits-per-dim
509
+ mu, _, log_var = self.q_mean_variance(x0, t)
510
+ kl_prior = kl_divergence(mu, log_var, torch.zeros_like(mu), torch.zeros_like(log_var))
511
+ kl_prior = kl_prior.flatten(1).mean(dim=1) / math.log(2.0)
512
+
513
+ # update metrics
514
+ metrics['prior_bits_per_dim'] = kl_prior
515
+ metrics['total_bits_per_dim'] = metrics['vlb'].sum(dim=1) + kl_prior
516
+ return metrics
517
+
518
+ def _scale_timesteps(self, t):
519
+ if self.rescale_timesteps:
520
+ return t.float() * 1000.0 / self.num_timesteps
521
+ return t
522
+ #return t.float()
523
+
524
+
525
+
526
+
527
+
528
+
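For orientation, a minimal way to drive the sampler above might look as follows. This is an illustrative sketch only: it assumes the enclosing class is the registered DiffusionDDIM with the same constructor as DiffusionDDIMLong below, that the repository root is on PYTHONPATH, and it substitutes a zero-output stand-in for the UNet and made-up latent shapes; the actual entry points live under UniAnimate/tools/inferences/.

    import torch
    from tools.modules.diffusions.diffusion_ddim import DiffusionDDIM  # assumed import path / class name

    diffusion = DiffusionDDIM(
        schedule='linear_sd',
        schedule_param={'num_timesteps': 1000, 'init_beta': 0.00085, 'last_beta': 0.012},  # assumed values
        mean_type='v', var_type='fixed_small', loss_type='mse')

    model = lambda xt, t, **kw: torch.zeros_like(xt)   # stand-in denoiser: (b, c, f, h, w) -> same shape
    noise = torch.randn(1, 4, 16, 32, 32)              # latent video: batch, channels, frames, height, width
    model_kwargs = [{'y': None}, {'y': None}]          # conditional / unconditional branches for CFG
    latents = diffusion.ddim_sample_loop(
        noise, model, model_kwargs=model_kwargs,
        guide_scale=2.5, ddim_timesteps=20, eta=0.0)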
529
+ @DIFFUSION.register_class()
530
+ class DiffusionDDIMLong(object):
531
+ def __init__(self,
532
+ schedule='linear_sd',
533
+ schedule_param={},
534
+ mean_type='eps',
535
+ var_type='learned_range',
536
+ loss_type='mse',
537
+ epsilon = 1e-12,
538
+ rescale_timesteps=False,
539
+ noise_strength=0.0,
540
+ **kwargs):
541
+
542
+ assert mean_type in ['x0', 'x_{t-1}', 'eps', 'v']
543
+ assert var_type in ['learned', 'learned_range', 'fixed_large', 'fixed_small']
544
+ assert loss_type in ['mse', 'rescaled_mse', 'kl', 'rescaled_kl', 'l1', 'rescaled_l1','charbonnier']
545
+
546
+ betas = beta_schedule(schedule, **schedule_param)
547
+ assert min(betas) > 0 and max(betas) <= 1
548
+
549
+ if not isinstance(betas, torch.DoubleTensor):
550
+ betas = torch.tensor(betas, dtype=torch.float64)
551
+
552
+ self.betas = betas
553
+ self.num_timesteps = len(betas)
554
+ self.mean_type = mean_type # v
555
+ self.var_type = var_type # 'fixed_small'
556
+ self.loss_type = loss_type # mse
557
+ self.epsilon = epsilon # 1e-12
558
+ self.rescale_timesteps = rescale_timesteps # False
559
+ self.noise_strength = noise_strength
560
+
561
+ # alphas
562
+ alphas = 1 - self.betas
563
+ self.alphas_cumprod = torch.cumprod(alphas, dim=0)
564
+ self.alphas_cumprod_prev = torch.cat([alphas.new_ones([1]), self.alphas_cumprod[:-1]])
565
+ self.alphas_cumprod_next = torch.cat([self.alphas_cumprod[1:], alphas.new_zeros([1])])
566
+
567
+ # q(x_t | x_{t-1})
568
+ self.sqrt_alphas_cumprod = torch.sqrt(self.alphas_cumprod)
569
+ self.sqrt_one_minus_alphas_cumprod = torch.sqrt(1.0 - self.alphas_cumprod)
570
+ self.log_one_minus_alphas_cumprod = torch.log(1.0 - self.alphas_cumprod)
571
+ self.sqrt_recip_alphas_cumprod = torch.sqrt(1.0 / self.alphas_cumprod)
572
+ self.sqrt_recipm1_alphas_cumprod = torch.sqrt(1.0 / self.alphas_cumprod - 1)
573
+
574
+ # q(x_{t-1} | x_t, x_0)
575
+ self.posterior_variance = betas * (1.0 - self.alphas_cumprod_prev) / (1.0 - self.alphas_cumprod)
576
+ self.posterior_log_variance_clipped = torch.log(self.posterior_variance.clamp(1e-20))
577
+ self.posterior_mean_coef1 = betas * torch.sqrt(self.alphas_cumprod_prev) / (1.0 - self.alphas_cumprod)
578
+ self.posterior_mean_coef2 = (1.0 - self.alphas_cumprod_prev) * torch.sqrt(alphas) / (1.0 - self.alphas_cumprod)
579
+
580
+
581
+ def sample_loss(self, x0, noise=None):
582
+ if noise is None:
583
+ noise = torch.randn_like(x0)
584
+ if self.noise_strength > 0:
585
+ b, c, f, _, _= x0.shape
586
+ offset_noise = torch.randn(b, c, f, 1, 1, device=x0.device)
587
+ noise = noise + self.noise_strength * offset_noise
588
+ return noise
589
+
590
+
591
+ def q_sample(self, x0, t, noise=None):
592
+ r"""Sample from q(x_t | x_0).
593
+ """
594
+ # noise = torch.randn_like(x0) if noise is None else noise
595
+ noise = self.sample_loss(x0, noise)
596
+ return _i(self.sqrt_alphas_cumprod, t, x0) * x0 + \
597
+ _i(self.sqrt_one_minus_alphas_cumprod, t, x0) * noise
598
+
599
+ def q_mean_variance(self, x0, t):
600
+ r"""Distribution of q(x_t | x_0).
601
+ """
602
+ mu = _i(self.sqrt_alphas_cumprod, t, x0) * x0
603
+ var = _i(1.0 - self.alphas_cumprod, t, x0)
604
+ log_var = _i(self.log_one_minus_alphas_cumprod, t, x0)
605
+ return mu, var, log_var
606
+
607
+ def q_posterior_mean_variance(self, x0, xt, t):
608
+ r"""Distribution of q(x_{t-1} | x_t, x_0).
609
+ """
610
+ mu = _i(self.posterior_mean_coef1, t, xt) * x0 + _i(self.posterior_mean_coef2, t, xt) * xt
611
+ var = _i(self.posterior_variance, t, xt)
612
+ log_var = _i(self.posterior_log_variance_clipped, t, xt)
613
+ return mu, var, log_var
614
+
615
+ @torch.no_grad()
616
+ def p_sample(self, xt, t, model, model_kwargs={}, clamp=None, percentile=None, condition_fn=None, guide_scale=None):
617
+ r"""Sample from p(x_{t-1} | x_t).
618
+ - condition_fn: for classifier-based guidance (guided-diffusion).
619
+ - guide_scale: for classifier-free guidance (glide/dalle-2).
620
+ """
621
+ # predict distribution of p(x_{t-1} | x_t)
622
+ mu, var, log_var, x0 = self.p_mean_variance(xt, t, model, model_kwargs, clamp, percentile, guide_scale)
623
+
624
+ # random sample (with optional conditional function)
625
+ noise = torch.randn_like(xt)
626
+ mask = t.ne(0).float().view(-1, *((1, ) * (xt.ndim - 1))) # no noise when t == 0
627
+ if condition_fn is not None:
628
+ grad = condition_fn(xt, self._scale_timesteps(t), **model_kwargs)
629
+ mu = mu.float() + var * grad.float()
630
+ xt_1 = mu + mask * torch.exp(0.5 * log_var) * noise
631
+ return xt_1, x0
632
+
633
+ @torch.no_grad()
634
+ def p_sample_loop(self, noise, model, model_kwargs={}, clamp=None, percentile=None, condition_fn=None, guide_scale=None):
635
+ r"""Sample from p(x_{t-1} | x_t) p(x_{t-2} | x_{t-1}) ... p(x_0 | x_1).
636
+ """
637
+ # prepare input
638
+ b = noise.size(0)
639
+ xt = noise
640
+
641
+ # diffusion process
642
+ for step in torch.arange(self.num_timesteps).flip(0):
643
+ t = torch.full((b, ), step, dtype=torch.long, device=xt.device)
644
+ xt, _ = self.p_sample(xt, t, model, model_kwargs, clamp, percentile, condition_fn, guide_scale)
645
+ return xt
646
+
647
+ def p_mean_variance(self, xt, t, model, model_kwargs={}, clamp=None, percentile=None, guide_scale=None, context_size=32, context_stride=1, context_overlap=4, context_batch_size=1):
648
+ r"""Distribution of p(x_{t-1} | x_t).
649
+ """
650
+ noise = xt
651
+ context_queue = list(
652
+ context_scheduler(
653
+ 0,
654
+ 31,
655
+ noise.shape[2],
656
+ context_size=context_size,
657
+ context_stride=context_stride,
658
+ context_overlap=context_overlap,
659
+ )
660
+ )
661
+ context_step = min(
662
+ context_stride, int(np.ceil(np.log2(noise.shape[2] / context_size))) + 1
663
+ )
664
+ # replace the final segment to improve temporal consistency
665
+ num_frames = noise.shape[2]
666
+ context_queue[-1] = [
667
+ e % num_frames
668
+ for e in range(num_frames - context_size * context_step, num_frames, context_step)
669
+ ]
670
+
671
+ import math
672
+ # context_batch_size = 1
673
+ num_context_batches = math.ceil(len(context_queue) / context_batch_size)
674
+ global_context = []
675
+ for i in range(num_context_batches):
676
+ global_context.append(
677
+ context_queue[
678
+ i * context_batch_size : (i + 1) * context_batch_size
679
+ ]
680
+ )
681
+ noise_pred = torch.zeros_like(noise)
682
+ noise_pred_uncond = torch.zeros_like(noise)
683
+ counter = torch.zeros(
684
+ (1, 1, xt.shape[2], 1, 1),
685
+ device=xt.device,
686
+ dtype=xt.dtype,
687
+ )
688
+
689
+ for i_index, context in enumerate(global_context):
690
+
691
+
692
+ latent_model_input = torch.cat([xt[:, :, c] for c in context])
693
+ bs_context = len(context)
694
+
695
+ model_kwargs_new = [{
696
+ 'y': None,
697
+ "local_image": None if not model_kwargs[0].__contains__('local_image') else torch.cat([model_kwargs[0]["local_image"][:, :, c] for c in context]),
698
+ 'image': None if not model_kwargs[0].__contains__('image') else model_kwargs[0]["image"].repeat(bs_context, 1, 1),
699
+ 'dwpose': None if not model_kwargs[0].__contains__('dwpose') else torch.cat([model_kwargs[0]["dwpose"][:, :, [0]+[ii+1 for ii in c]] for c in context]),
700
+ 'randomref': None if not model_kwargs[0].__contains__('randomref') else torch.cat([model_kwargs[0]["randomref"][:, :, c] for c in context]),
701
+ },
702
+ {
703
+ 'y': None,
704
+ "local_image": None,
705
+ 'image': None,
706
+ 'randomref': None,
707
+ 'dwpose': None,
708
+ }]
709
+
710
+ if guide_scale is None:
711
+ out = model(latent_model_input, self._scale_timesteps(t), **model_kwargs)
712
+ for j, c in enumerate(context):
713
+ noise_pred[:, :, c] = noise_pred[:, :, c] + out
714
+ counter[:, :, c] = counter[:, :, c] + 1
715
+ else:
716
+ # classifier-free guidance
717
+ # (model_kwargs[0]: conditional kwargs; model_kwargs[1]: non-conditional kwargs)
718
+ # assert isinstance(model_kwargs, list) and len(model_kwargs) == 2
719
+ y_out = model(latent_model_input, self._scale_timesteps(t).repeat(bs_context), **model_kwargs_new[0])
720
+ u_out = model(latent_model_input, self._scale_timesteps(t).repeat(bs_context), **model_kwargs_new[1])
721
+ dim = y_out.size(1) if self.var_type.startswith('fixed') else y_out.size(1) // 2
722
+ for j, c in enumerate(context):
723
+ noise_pred[:, :, c] = noise_pred[:, :, c] + y_out[j:j+1]
724
+ noise_pred_uncond[:, :, c] = noise_pred_uncond[:, :, c] + u_out[j:j+1]
725
+ counter[:, :, c] = counter[:, :, c] + 1
726
+
727
+ noise_pred = noise_pred / counter
728
+ noise_pred_uncond = noise_pred_uncond / counter
729
+ out = torch.cat([
730
+ noise_pred_uncond[:, :dim] + guide_scale * (noise_pred[:, :dim] - noise_pred_uncond[:, :dim]),
731
+ noise_pred[:, dim:]], dim=1) # guide_scale=2.5
732
+
733
+
734
+ # compute variance
735
+ if self.var_type == 'learned':
736
+ out, log_var = out.chunk(2, dim=1)
737
+ var = torch.exp(log_var)
738
+ elif self.var_type == 'learned_range':
739
+ out, fraction = out.chunk(2, dim=1)
740
+ min_log_var = _i(self.posterior_log_variance_clipped, t, xt)
741
+ max_log_var = _i(torch.log(self.betas), t, xt)
742
+ fraction = (fraction + 1) / 2.0
743
+ log_var = fraction * max_log_var + (1 - fraction) * min_log_var
744
+ var = torch.exp(log_var)
745
+ elif self.var_type == 'fixed_large':
746
+ var = _i(torch.cat([self.posterior_variance[1:2], self.betas[1:]]), t, xt)
747
+ log_var = torch.log(var)
748
+ elif self.var_type == 'fixed_small':
749
+ var = _i(self.posterior_variance, t, xt)
750
+ log_var = _i(self.posterior_log_variance_clipped, t, xt)
751
+
752
+ # compute mean and x0
753
+ if self.mean_type == 'x_{t-1}':
754
+ mu = out # x_{t-1}
755
+ x0 = _i(1.0 / self.posterior_mean_coef1, t, xt) * mu - \
756
+ _i(self.posterior_mean_coef2 / self.posterior_mean_coef1, t, xt) * xt
757
+ elif self.mean_type == 'x0':
758
+ x0 = out
759
+ mu, _, _ = self.q_posterior_mean_variance(x0, xt, t)
760
+ elif self.mean_type == 'eps':
761
+ x0 = _i(self.sqrt_recip_alphas_cumprod, t, xt) * xt - \
762
+ _i(self.sqrt_recipm1_alphas_cumprod, t, xt) * out
763
+ mu, _, _ = self.q_posterior_mean_variance(x0, xt, t)
764
+ elif self.mean_type == 'v':
765
+ x0 = _i(self.sqrt_alphas_cumprod, t, xt) * xt - \
766
+ _i(self.sqrt_one_minus_alphas_cumprod, t, xt) * out
767
+ mu, _, _ = self.q_posterior_mean_variance(x0, xt, t)
768
+
769
+ # restrict the range of x0
770
+ if percentile is not None:
771
+ assert percentile > 0 and percentile <= 1 # e.g., 0.995
772
+ s = torch.quantile(x0.flatten(1).abs(), percentile, dim=1).clamp_(1.0).view(-1, 1, 1, 1)
773
+ x0 = torch.min(s, torch.max(-s, x0)) / s
774
+ elif clamp is not None:
775
+ x0 = x0.clamp(-clamp, clamp)
776
+ return mu, var, log_var, x0
777
+
778
+ @torch.no_grad()
779
+ def ddim_sample(self, xt, t, model, model_kwargs={}, clamp=None, percentile=None, condition_fn=None, guide_scale=None, ddim_timesteps=20, eta=0.0, context_size=32, context_stride=1, context_overlap=4, context_batch_size=1):
780
+ r"""Sample from p(x_{t-1} | x_t) using DDIM.
781
+ - condition_fn: for classifier-based guidance (guided-diffusion).
782
+ - guide_scale: for classifier-free guidance (glide/dalle-2).
783
+ """
784
+ stride = self.num_timesteps // ddim_timesteps
785
+
786
+ # predict distribution of p(x_{t-1} | x_t)
787
+ _, _, _, x0 = self.p_mean_variance(xt, t, model, model_kwargs, clamp, percentile, guide_scale, context_size, context_stride, context_overlap, context_batch_size)
788
+ if condition_fn is not None:
789
+ # x0 -> eps
790
+ alpha = _i(self.alphas_cumprod, t, xt)
791
+ eps = (_i(self.sqrt_recip_alphas_cumprod, t, xt) * xt - x0) / \
792
+ _i(self.sqrt_recipm1_alphas_cumprod, t, xt)
793
+ eps = eps - (1 - alpha).sqrt() * condition_fn(xt, self._scale_timesteps(t), **model_kwargs)
794
+
795
+ # eps -> x0
796
+ x0 = _i(self.sqrt_recip_alphas_cumprod, t, xt) * xt - \
797
+ _i(self.sqrt_recipm1_alphas_cumprod, t, xt) * eps
798
+
799
+ # derive variables
800
+ eps = (_i(self.sqrt_recip_alphas_cumprod, t, xt) * xt - x0) / \
801
+ _i(self.sqrt_recipm1_alphas_cumprod, t, xt)
802
+ alphas = _i(self.alphas_cumprod, t, xt)
803
+ alphas_prev = _i(self.alphas_cumprod, (t - stride).clamp(0), xt)
804
+ sigmas = eta * torch.sqrt((1 - alphas_prev) / (1 - alphas) * (1 - alphas / alphas_prev))
805
+
806
+ # random sample
807
+ noise = torch.randn_like(xt)
808
+ direction = torch.sqrt(1 - alphas_prev - sigmas ** 2) * eps
809
+ mask = t.ne(0).float().view(-1, *((1, ) * (xt.ndim - 1)))
810
+ xt_1 = torch.sqrt(alphas_prev) * x0 + direction + mask * sigmas * noise
811
+ return xt_1, x0
812
+
813
+ @torch.no_grad()
814
+ def ddim_sample_loop(self, noise, context_size, context_stride, context_overlap, model, model_kwargs={}, clamp=None, percentile=None, condition_fn=None, guide_scale=None, ddim_timesteps=20, eta=0.0, context_batch_size=1):
815
+ # prepare input
816
+ b = noise.size(0)
817
+ xt = noise
818
+
819
+ # diffusion process (TODO: clamp is inaccurate! Consider replacing the stride by explicit prev/next steps)
820
+ steps = (1 + torch.arange(0, self.num_timesteps, self.num_timesteps // ddim_timesteps)).clamp(0, self.num_timesteps - 1).flip(0)
821
+ from tqdm import tqdm
822
+
823
+ for step in tqdm(steps):
824
+ t = torch.full((b, ), step, dtype=torch.long, device=xt.device)
825
+ xt, _ = self.ddim_sample(xt, t, model, model_kwargs, clamp, percentile, condition_fn, guide_scale, ddim_timesteps, eta, context_size=context_size, context_stride=context_stride, context_overlap=context_overlap, context_batch_size=context_batch_size)
826
+ return xt
827
+
828
+ @torch.no_grad()
829
+ def ddim_reverse_sample(self, xt, t, model, model_kwargs={}, clamp=None, percentile=None, guide_scale=None, ddim_timesteps=20):
830
+ r"""Sample from p(x_{t+1} | x_t) using DDIM reverse ODE (deterministic).
831
+ """
832
+ stride = self.num_timesteps // ddim_timesteps
833
+
834
+ # predict distribution of p(x_{t-1} | x_t)
835
+ _, _, _, x0 = self.p_mean_variance(xt, t, model, model_kwargs, clamp, percentile, guide_scale)
836
+
837
+ # derive variables
838
+ eps = (_i(self.sqrt_recip_alphas_cumprod, t, xt) * xt - x0) / \
839
+ _i(self.sqrt_recipm1_alphas_cumprod, t, xt)
840
+ alphas_next = _i(
841
+ torch.cat([self.alphas_cumprod, self.alphas_cumprod.new_zeros([1])]),
842
+ (t + stride).clamp(0, self.num_timesteps), xt)
843
+
844
+ # reverse sample
845
+ mu = torch.sqrt(alphas_next) * x0 + torch.sqrt(1 - alphas_next) * eps
846
+ return mu, x0
847
+
848
+ @torch.no_grad()
849
+ def ddim_reverse_sample_loop(self, x0, model, model_kwargs={}, clamp=None, percentile=None, guide_scale=None, ddim_timesteps=20):
850
+ # prepare input
851
+ b = x0.size(0)
852
+ xt = x0
853
+
854
+ # reconstruction steps
855
+ steps = torch.arange(0, self.num_timesteps, self.num_timesteps // ddim_timesteps)
856
+ for step in steps:
857
+ t = torch.full((b, ), step, dtype=torch.long, device=xt.device)
858
+ xt, _ = self.ddim_reverse_sample(xt, t, model, model_kwargs, clamp, percentile, guide_scale, ddim_timesteps)
859
+ return xt
860
+
861
+ @torch.no_grad()
862
+ def plms_sample(self, xt, t, model, model_kwargs={}, clamp=None, percentile=None, condition_fn=None, guide_scale=None, plms_timesteps=20, eps_cache=None):
863
+ r"""Sample from p(x_{t-1} | x_t) using PLMS.
864
+ - condition_fn: for classifier-based guidance (guided-diffusion).
865
+ - guide_scale: for classifier-free guidance (glide/dalle-2).
866
+ """
867
+ stride = self.num_timesteps // plms_timesteps
868
+
869
+ # function to compute eps
870
+ def compute_eps(xt, t):
871
+ # predict distribution of p(x_{t-1} | x_t)
872
+ _, _, _, x0 = self.p_mean_variance(xt, t, model, model_kwargs, clamp, percentile, guide_scale)
873
+
874
+ # condition
875
+ if condition_fn is not None:
876
+ # x0 -> eps
877
+ alpha = _i(self.alphas_cumprod, t, xt)
878
+ eps = (_i(self.sqrt_recip_alphas_cumprod, t, xt) * xt - x0) / \
879
+ _i(self.sqrt_recipm1_alphas_cumprod, t, xt)
880
+ eps = eps - (1 - alpha).sqrt() * condition_fn(xt, self._scale_timesteps(t), **model_kwargs)
881
+
882
+ # eps -> x0
883
+ x0 = _i(self.sqrt_recip_alphas_cumprod, t, xt) * xt - \
884
+ _i(self.sqrt_recipm1_alphas_cumprod, t, xt) * eps
885
+
886
+ # derive eps
887
+ eps = (_i(self.sqrt_recip_alphas_cumprod, t, xt) * xt - x0) / \
888
+ _i(self.sqrt_recipm1_alphas_cumprod, t, xt)
889
+ return eps
890
+
891
+ # function to compute x_0 and x_{t-1}
892
+ def compute_x0(eps, t):
893
+ # eps -> x0
894
+ x0 = _i(self.sqrt_recip_alphas_cumprod, t, xt) * xt - \
895
+ _i(self.sqrt_recipm1_alphas_cumprod, t, xt) * eps
896
+
897
+ # deterministic sample
898
+ alphas_prev = _i(self.alphas_cumprod, (t - stride).clamp(0), xt)
899
+ direction = torch.sqrt(1 - alphas_prev) * eps
900
+ mask = t.ne(0).float().view(-1, *((1, ) * (xt.ndim - 1)))
901
+ xt_1 = torch.sqrt(alphas_prev) * x0 + direction
902
+ return xt_1, x0
903
+
904
+ # PLMS sample
905
+ eps_cache = [] if eps_cache is None else eps_cache
+ eps = compute_eps(xt, t)
906
+ if len(eps_cache) == 0:
907
+ # 2nd order pseudo improved Euler
908
+ xt_1, x0 = compute_x0(eps, t)
909
+ eps_next = compute_eps(xt_1, (t - stride).clamp(0))
910
+ eps_prime = (eps + eps_next) / 2.0
911
+ elif len(eps_cache) == 1:
912
+ # 2nd order pseudo linear multistep (Adams-Bashforth)
913
+ eps_prime = (3 * eps - eps_cache[-1]) / 2.0
914
+ elif len(eps_cache) == 2:
915
+ # 3rd order pseudo linear multistep (Adams-Bashforth)
916
+ eps_prime = (23 * eps - 16 * eps_cache[-1] + 5 * eps_cache[-2]) / 12.0
917
+ elif len(eps_cache) >= 3:
918
+ # 4th order pseudo linear multistep (Adams-Bashforth)
919
+ eps_prime = (55 * eps - 59 * eps_cache[-1] + 37 * eps_cache[-2] - 9 * eps_cache[-3]) / 24.0
920
+ xt_1, x0 = compute_x0(eps_prime, t)
921
+ return xt_1, x0, eps
922
+
923
+ @torch.no_grad()
924
+ def plms_sample_loop(self, noise, model, model_kwargs={}, clamp=None, percentile=None, condition_fn=None, guide_scale=None, plms_timesteps=20):
925
+ # prepare input
926
+ b = noise.size(0)
927
+ xt = noise
928
+
929
+ # diffusion process
930
+ steps = (1 + torch.arange(0, self.num_timesteps, self.num_timesteps // plms_timesteps)).clamp(0, self.num_timesteps - 1).flip(0)
931
+ eps_cache = []
932
+ for step in steps:
933
+ # PLMS sampling step
934
+ t = torch.full((b, ), step, dtype=torch.long, device=xt.device)
935
+ xt, _, eps = self.plms_sample(xt, t, model, model_kwargs, clamp, percentile, condition_fn, guide_scale, plms_timesteps, eps_cache)
936
+
937
+ # update eps cache
938
+ eps_cache.append(eps)
939
+ if len(eps_cache) >= 4:
940
+ eps_cache.pop(0)
941
+ return xt
942
+
943
+ def loss(self, x0, t, model, model_kwargs={}, noise=None, weight = None, use_div_loss= False, loss_mask=None):
944
+
945
+ # noise = torch.randn_like(x0) if noise is None else noise # [80, 4, 8, 32, 32]
946
+ noise = self.sample_loss(x0, noise)
947
+
948
+ xt = self.q_sample(x0, t, noise=noise)
949
+
950
+ # compute loss
951
+ if self.loss_type in ['kl', 'rescaled_kl']:
952
+ loss, _ = self.variational_lower_bound(x0, xt, t, model, model_kwargs)
953
+ if self.loss_type == 'rescaled_kl':
954
+ loss = loss * self.num_timesteps
955
+ elif self.loss_type in ['mse', 'rescaled_mse', 'l1', 'rescaled_l1']: # self.loss_type: mse
956
+ out = model(xt, self._scale_timesteps(t), **model_kwargs)
957
+
958
+ # VLB for variation
959
+ loss_vlb = 0.0
960
+ if self.var_type in ['learned', 'learned_range']: # self.var_type: 'fixed_small'
961
+ out, var = out.chunk(2, dim=1)
962
+ frozen = torch.cat([out.detach(), var], dim=1) # learn var without affecting the prediction of mean
963
+ loss_vlb, _ = self.variational_lower_bound(x0, xt, t, model=lambda *args, **kwargs: frozen)
964
+ if self.loss_type.startswith('rescaled_'):
965
+ loss_vlb = loss_vlb * self.num_timesteps / 1000.0
966
+
967
+ # MSE/L1 for x0/eps
968
+ # target = {'eps': noise, 'x0': x0, 'x_{t-1}': self.q_posterior_mean_variance(x0, xt, t)[0]}[self.mean_type]
969
+ target = {
970
+ 'eps': noise,
971
+ 'x0': x0,
972
+ 'x_{t-1}': self.q_posterior_mean_variance(x0, xt, t)[0],
973
+ 'v':_i(self.sqrt_alphas_cumprod, t, xt) * noise - _i(self.sqrt_one_minus_alphas_cumprod, t, xt) * x0}[self.mean_type]
974
+ if loss_mask is not None:
975
+ loss_mask = loss_mask[:, :, 0, ...].unsqueeze(2) # just use one channel (all channels are same)
976
+ loss_mask = loss_mask.permute(0, 2, 1, 3, 4) # b,c,f,h,w
977
+ # use masked diffusion
978
+ loss = (out * loss_mask - target * loss_mask).pow(1 if self.loss_type.endswith('l1') else 2).abs().flatten(1).mean(dim=1)
979
+ else:
980
+ loss = (out - target).pow(1 if self.loss_type.endswith('l1') else 2).abs().flatten(1).mean(dim=1)
981
+ if weight is not None:
982
+ loss = loss*weight
983
+
984
+ # div loss
985
+ if use_div_loss and self.mean_type == 'eps' and x0.shape[2]>1:
986
+
987
+ # derive x0
988
+ x0_ = _i(self.sqrt_recip_alphas_cumprod, t, xt) * xt - \
989
+ _i(self.sqrt_recipm1_alphas_cumprod, t, xt) * out
990
+
991
+
992
+ # ncfhw, std on f
993
+ div_loss = 0.001/(x0_.std(dim=2).flatten(1).mean(dim=1)+1e-4)
994
+ # print(div_loss,loss)
995
+ loss = loss+div_loss
996
+
997
+ # total loss
998
+ loss = loss + loss_vlb
999
+ elif self.loss_type in ['charbonnier']:
1000
+ out = model(xt, self._scale_timesteps(t), **model_kwargs)
1001
+
1002
+ # VLB for variation
1003
+ loss_vlb = 0.0
1004
+ if self.var_type in ['learned', 'learned_range']:
1005
+ out, var = out.chunk(2, dim=1)
1006
+ frozen = torch.cat([out.detach(), var], dim=1) # learn var without affecting the prediction of mean
1007
+ loss_vlb, _ = self.variational_lower_bound(x0, xt, t, model=lambda *args, **kwargs: frozen)
1008
+ if self.loss_type.startswith('rescaled_'):
1009
+ loss_vlb = loss_vlb * self.num_timesteps / 1000.0
1010
+
1011
+ # MSE/L1 for x0/eps
1012
+ target = {'eps': noise, 'x0': x0, 'x_{t-1}': self.q_posterior_mean_variance(x0, xt, t)[0]}[self.mean_type]
1013
+ loss = torch.sqrt((out - target)**2 + self.epsilon)
1014
+ if weight is not None:
1015
+ loss = loss*weight
1016
+ loss = loss.flatten(1).mean(dim=1)
1017
+
1018
+ # total loss
1019
+ loss = loss + loss_vlb
1020
+ return loss
1021
+
1022
+ def variational_lower_bound(self, x0, xt, t, model, model_kwargs={}, clamp=None, percentile=None):
1023
+ # compute groundtruth and predicted distributions
1024
+ mu1, _, log_var1 = self.q_posterior_mean_variance(x0, xt, t)
1025
+ mu2, _, log_var2, x0 = self.p_mean_variance(xt, t, model, model_kwargs, clamp, percentile)
1026
+
1027
+ # compute KL loss
1028
+ kl = kl_divergence(mu1, log_var1, mu2, log_var2)
1029
+ kl = kl.flatten(1).mean(dim=1) / math.log(2.0)
1030
+
1031
+ # compute discretized NLL loss (for p(x0 | x1) only)
1032
+ nll = -discretized_gaussian_log_likelihood(x0, mean=mu2, log_scale=0.5 * log_var2)
1033
+ nll = nll.flatten(1).mean(dim=1) / math.log(2.0)
1034
+
1035
+ # NLL for p(x0 | x1) and KL otherwise
1036
+ vlb = torch.where(t == 0, nll, kl)
1037
+ return vlb, x0
1038
+
1039
+ @torch.no_grad()
1040
+ def variational_lower_bound_loop(self, x0, model, model_kwargs={}, clamp=None, percentile=None):
1041
+ r"""Compute the entire variational lower bound, measured in bits-per-dim.
1042
+ """
1043
+ # prepare input and output
1044
+ b = x0.size(0)
1045
+ metrics = {'vlb': [], 'mse': [], 'x0_mse': []}
1046
+
1047
+ # loop
1048
+ for step in torch.arange(self.num_timesteps).flip(0):
1049
+ # compute VLB
1050
+ t = torch.full((b, ), step, dtype=torch.long, device=x0.device)
1051
+ # noise = torch.randn_like(x0)
1052
+ noise = self.sample_loss(x0)
1053
+ xt = self.q_sample(x0, t, noise)
1054
+ vlb, pred_x0 = self.variational_lower_bound(x0, xt, t, model, model_kwargs, clamp, percentile)
1055
+
1056
+ # predict eps from x0
1057
+ eps = (_i(self.sqrt_recip_alphas_cumprod, t, xt) * xt - x0) / \
1058
+ _i(self.sqrt_recipm1_alphas_cumprod, t, xt)
1059
+
1060
+ # collect metrics
1061
+ metrics['vlb'].append(vlb)
1062
+ metrics['x0_mse'].append((pred_x0 - x0).square().flatten(1).mean(dim=1))
1063
+ metrics['mse'].append((eps - noise).square().flatten(1).mean(dim=1))
1064
+ metrics = {k: torch.stack(v, dim=1) for k, v in metrics.items()}
1065
+
1066
+ # compute the prior KL term for VLB, measured in bits-per-dim
1067
+ mu, _, log_var = self.q_mean_variance(x0, t)
1068
+ kl_prior = kl_divergence(mu, log_var, torch.zeros_like(mu), torch.zeros_like(log_var))
1069
+ kl_prior = kl_prior.flatten(1).mean(dim=1) / math.log(2.0)
1070
+
1071
+ # update metrics
1072
+ metrics['prior_bits_per_dim'] = kl_prior
1073
+ metrics['total_bits_per_dim'] = metrics['vlb'].sum(dim=1) + kl_prior
1074
+ return metrics
1075
+
1076
+ def _scale_timesteps(self, t):
1077
+ if self.rescale_timesteps:
1078
+ return t.float() * 1000.0 / self.num_timesteps
1079
+ return t
1080
+ #return t.float()
1081
+
1082
+
1083
+
1084
+ def ordered_halving(val):
1085
+ bin_str = f"{val:064b}"
1086
+ bin_flip = bin_str[::-1]
1087
+ as_int = int(bin_flip, 2)
1088
+
1089
+ return as_int / (1 << 64)
1090
+
1091
+
1092
+ def context_scheduler(
1093
+ step: int = ...,
1094
+ num_steps: Optional[int] = None,
1095
+ num_frames: int = ...,
1096
+ context_size: Optional[int] = None,
1097
+ context_stride: int = 3,
1098
+ context_overlap: int = 4,
1099
+ closed_loop: bool = False,
1100
+ ):
1101
+ if num_frames <= context_size:
1102
+ yield list(range(num_frames))
1103
+ return
1104
+
1105
+ context_stride = min(
1106
+ context_stride, int(np.ceil(np.log2(num_frames / context_size))) + 1
1107
+ )
1108
+
1109
+ for context_step in 1 << np.arange(context_stride):
1110
+ pad = int(round(num_frames * ordered_halving(step)))
1111
+ for j in range(
1112
+ int(ordered_halving(step) * context_step) + pad,
1113
+ num_frames + pad + (0 if closed_loop else -context_overlap),
1114
+ (context_size * context_step - context_overlap),
1115
+ ):
1116
+
1117
+ yield [
1118
+ e % num_frames
1119
+ for e in range(j, j + context_size * context_step, context_step)
1120
+ ]
1121
+
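The two helpers above are what let DiffusionDDIMLong denoise videos longer than the training window: ordered_halving bit-reverses the step index to jitter window offsets across denoising steps, and context_scheduler yields overlapping frame-index windows whose per-window predictions are accumulated and averaged by the counter in p_mean_variance. A small, self-contained illustration (the frame counts are arbitrary assumptions chosen for readability, and the import assumes the repository root is on PYTHONPATH):

    from tools.modules.diffusions.diffusion_ddim import context_scheduler, ordered_halving

    print([round(ordered_halving(s), 3) for s in range(4)])   # 0.0, 0.5, 0.25, 0.75: bit-reversed offsets
    windows = list(context_scheduler(step=0, num_steps=None, num_frames=48,
                                     context_size=16, context_stride=1, context_overlap=4))
    for w in windows:
        print(w)   # overlapping 16-frame windows covering all 48 frames (indices wrap modulo 48)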
UniAnimate/tools/modules/diffusions/diffusion_gauss.py ADDED
@@ -0,0 +1,498 @@
1
+ """
2
+ GaussianDiffusion wraps operators for denoising diffusion models, including the
3
+ diffusion and denoising processes, as well as the loss evaluation.
4
+ """
5
+ import torch
6
+ import torchsde
7
+ import random
8
+ from tqdm.auto import trange
9
+
10
+
11
+ __all__ = ['GaussianDiffusion']
12
+
13
+
14
+ def _i(tensor, t, x):
15
+ """
16
+ Index tensor using t and format the output according to x.
17
+ """
18
+ shape = (x.size(0), ) + (1, ) * (x.ndim - 1)
19
+ return tensor[t.to(tensor.device)].view(shape).to(x.device)
20
+
21
+
22
+ class BatchedBrownianTree:
23
+ """
24
+ A wrapper around torchsde.BrownianTree that enables batches of entropy.
25
+ """
26
+ def __init__(self, x, t0, t1, seed=None, **kwargs):
27
+ t0, t1, self.sign = self.sort(t0, t1)
28
+ w0 = kwargs.get('w0', torch.zeros_like(x))
29
+ if seed is None:
30
+ seed = torch.randint(0, 2 ** 63 - 1, []).item()
31
+ self.batched = True
32
+ try:
33
+ assert len(seed) == x.shape[0]
34
+ w0 = w0[0]
35
+ except TypeError:
36
+ seed = [seed]
37
+ self.batched = False
38
+ self.trees = [torchsde.BrownianTree(
39
+ t0, w0, t1, entropy=s, **kwargs
40
+ ) for s in seed]
41
+
42
+ @staticmethod
43
+ def sort(a, b):
44
+ return (a, b, 1) if a < b else (b, a, -1)
45
+
46
+ def __call__(self, t0, t1):
47
+ t0, t1, sign = self.sort(t0, t1)
48
+ w = torch.stack([tree(t0, t1) for tree in self.trees]) * (self.sign * sign)
49
+ return w if self.batched else w[0]
50
+
51
+
52
+ class BrownianTreeNoiseSampler:
53
+ """
54
+ A noise sampler backed by a torchsde.BrownianTree.
55
+
56
+ Args:
57
+ x (Tensor): The tensor whose shape, device and dtype to use to generate
58
+ random samples.
59
+ sigma_min (float): The low end of the valid interval.
60
+ sigma_max (float): The high end of the valid interval.
61
+ seed (int or List[int]): The random seed. If a list of seeds is
62
+ supplied instead of a single integer, then the noise sampler will
63
+ use one BrownianTree per batch item, each with its own seed.
64
+ transform (callable): A function that maps sigma to the sampler's
65
+ internal timestep.
66
+ """
67
+ def __init__(self, x, sigma_min, sigma_max, seed=None, transform=lambda x: x):
68
+ self.transform = transform
69
+ t0 = self.transform(torch.as_tensor(sigma_min))
70
+ t1 = self.transform(torch.as_tensor(sigma_max))
71
+ self.tree = BatchedBrownianTree(x, t0, t1, seed)
72
+
73
+ def __call__(self, sigma, sigma_next):
74
+ t0 = self.transform(torch.as_tensor(sigma))
75
+ t1 = self.transform(torch.as_tensor(sigma_next))
76
+ return self.tree(t0, t1) / (t1 - t0).abs().sqrt()
77
+
78
+
79
+ def get_scalings(sigma):
80
+ c_out = -sigma
81
+ c_in = 1 / (sigma ** 2 + 1. ** 2) ** 0.5
82
+ return c_out, c_in
83
+
84
+
85
+ @torch.no_grad()
86
+ def sample_dpmpp_2m_sde(
87
+ noise,
88
+ model,
89
+ sigmas,
90
+ eta=1.,
91
+ s_noise=1.,
92
+ solver_type='midpoint',
93
+ show_progress=True
94
+ ):
95
+ """
96
+ DPM-Solver++ (2M) SDE.
97
+ """
98
+ assert solver_type in {'heun', 'midpoint'}
99
+
100
+ x = noise * sigmas[0]
101
+ sigma_min, sigma_max = sigmas[sigmas > 0].min(), sigmas[sigmas < float('inf')].max()
102
+ noise_sampler = BrownianTreeNoiseSampler(x, sigma_min, sigma_max)
103
+ old_denoised = None
104
+ h_last = None
105
+
106
+ for i in trange(len(sigmas) - 1, disable=not show_progress):
107
+ if sigmas[i] == float('inf'):
108
+ # Euler method
109
+ denoised = model(noise, sigmas[i])
110
+ x = denoised + sigmas[i + 1] * noise
111
+ else:
112
+ _, c_in = get_scalings(sigmas[i])
113
+ denoised = model(x * c_in, sigmas[i])
114
+ if sigmas[i + 1] == 0:
115
+ # Denoising step
116
+ x = denoised
117
+ else:
118
+ # DPM-Solver++(2M) SDE
119
+ t, s = -sigmas[i].log(), -sigmas[i + 1].log()
120
+ h = s - t
121
+ eta_h = eta * h
122
+
123
+ x = sigmas[i + 1] / sigmas[i] * (-eta_h).exp() * x + \
124
+ (-h - eta_h).expm1().neg() * denoised
125
+
126
+ if old_denoised is not None:
127
+ r = h_last / h
128
+ if solver_type == 'heun':
129
+ x = x + ((-h - eta_h).expm1().neg() / (-h - eta_h) + 1) * \
130
+ (1 / r) * (denoised - old_denoised)
131
+ elif solver_type == 'midpoint':
132
+ x = x + 0.5 * (-h - eta_h).expm1().neg() * \
133
+ (1 / r) * (denoised - old_denoised)
134
+
135
+ x = x + noise_sampler(
136
+ sigmas[i],
137
+ sigmas[i + 1]
138
+ ) * sigmas[i + 1] * (-2 * eta_h).expm1().neg().sqrt() * s_noise
139
+
140
+ old_denoised = denoised
141
+ h_last = h
142
+ return x
143
+
144
+
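# One step of the DPM-Solver++(2M) SDE update above, written out as a sketch
# (h = log(sigma_i / sigma_{i+1}), D = the denoised prediction, z = Brownian-tree noise;
# the multistep correction that reuses the previous denoised output is omitted):
#
#   x <- (sigma_{i+1} / sigma_i) * exp(-eta * h) * x
#        + (1 - exp(-(1 + eta) * h)) * D
#        + sigma_{i+1} * sqrt(1 - exp(-2 * eta * h)) * z
#
# With eta = 0 the noise term vanishes and the step reduces to the deterministic DPM-Solver++(2M) update.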
145
+ class GaussianDiffusion(object):
146
+
147
+ def __init__(self, sigmas, prediction_type='eps'):
148
+ assert prediction_type in {'x0', 'eps', 'v'}
149
+ self.sigmas = sigmas.float() # noise coefficients
150
+ self.alphas = torch.sqrt(1 - sigmas ** 2).float() # signal coefficients
151
+ self.num_timesteps = len(sigmas)
152
+ self.prediction_type = prediction_type
153
+
154
+ def diffuse(self, x0, t, noise=None):
155
+ """
156
+ Add Gaussian noise to signal x0 according to:
157
+ q(x_t | x_0) = N(x_t | alpha_t x_0, sigma_t^2 I).
158
+ """
159
+ noise = torch.randn_like(x0) if noise is None else noise
160
+ xt = _i(self.alphas, t, x0) * x0 + _i(self.sigmas, t, x0) * noise
161
+ return xt
162
+
163
+ def denoise(
164
+ self,
165
+ xt,
166
+ t,
167
+ s,
168
+ model,
169
+ model_kwargs={},
170
+ guide_scale=None,
171
+ guide_rescale=None,
172
+ clamp=None,
173
+ percentile=None
174
+ ):
175
+ """
176
+ Apply one step of denoising from the posterior distribution q(x_s | x_t, x0).
177
+ Since x0 is not available, estimate the denoising results using the learned
178
+ distribution p(x_s | x_t, \hat{x}_0 == f(x_t)).
179
+ """
180
+ s = t - 1 if s is None else s
181
+
182
+ # hyperparams
183
+ sigmas = _i(self.sigmas, t, xt)
184
+ alphas = _i(self.alphas, t, xt)
185
+ alphas_s = _i(self.alphas, s.clamp(0), xt)
186
+ alphas_s[s < 0] = 1.
187
+ sigmas_s = torch.sqrt(1 - alphas_s ** 2)
188
+
189
+ # precompute variables
190
+ betas = 1 - (alphas / alphas_s) ** 2
191
+ coef1 = betas * alphas_s / sigmas ** 2
192
+ coef2 = (alphas * sigmas_s ** 2) / (alphas_s * sigmas ** 2)
193
+ var = betas * (sigmas_s / sigmas) ** 2
194
+ log_var = torch.log(var).clamp_(-20, 20)
195
+
196
+ # prediction
197
+ if guide_scale is None:
198
+ assert isinstance(model_kwargs, dict)
199
+ out = model(xt, t=t, **model_kwargs)
200
+ else:
201
+ # classifier-free guidance (arXiv:2207.12598)
202
+ # model_kwargs[0]: conditional kwargs
203
+ # model_kwargs[1]: non-conditional kwargs
204
+ assert isinstance(model_kwargs, list) and len(model_kwargs) == 2
205
+ y_out = model(xt, t=t, **model_kwargs[0])
206
+ if guide_scale == 1.:
207
+ out = y_out
208
+ else:
209
+ u_out = model(xt, t=t, **model_kwargs[1])
210
+ out = u_out + guide_scale * (y_out - u_out)
211
+
212
+ # rescale the output according to arXiv:2305.08891
213
+ if guide_rescale is not None:
214
+ assert guide_rescale >= 0 and guide_rescale <= 1
215
+ ratio = (y_out.flatten(1).std(dim=1) / (
216
+ out.flatten(1).std(dim=1) + 1e-12
217
+ )).view((-1, ) + (1, ) * (y_out.ndim - 1))
218
+ out *= guide_rescale * ratio + (1 - guide_rescale) * 1.0
219
+
220
+ # compute x0
221
+ if self.prediction_type == 'x0':
222
+ x0 = out
223
+ elif self.prediction_type == 'eps':
224
+ x0 = (xt - sigmas * out) / alphas
225
+ elif self.prediction_type == 'v':
226
+ x0 = alphas * xt - sigmas * out
227
+ else:
228
+ raise NotImplementedError(
229
+ f'prediction_type {self.prediction_type} not implemented'
230
+ )
231
+
232
+ # restrict the range of x0
233
+ if percentile is not None:
234
+ # NOTE: percentile should only be used when data is within range [-1, 1]
235
+ assert percentile > 0 and percentile <= 1
236
+ s = torch.quantile(x0.flatten(1).abs(), percentile, dim=1)
237
+ s = s.clamp_(1.0).view((-1, ) + (1, ) * (xt.ndim - 1))
238
+ x0 = torch.min(s, torch.max(-s, x0)) / s
239
+ elif clamp is not None:
240
+ x0 = x0.clamp(-clamp, clamp)
241
+
242
+ # recompute eps using the restricted x0
243
+ eps = (xt - alphas * x0) / sigmas
244
+
245
+ # compute mu (mean of posterior distribution) using the restricted x0
246
+ mu = coef1 * x0 + coef2 * xt
247
+ return mu, var, log_var, x0, eps
248
+
249
+ @torch.no_grad()
250
+ def sample(
251
+ self,
252
+ noise,
253
+ model,
254
+ model_kwargs={},
255
+ condition_fn=None,
256
+ guide_scale=None,
257
+ guide_rescale=None,
258
+ clamp=None,
259
+ percentile=None,
260
+ solver='euler_a',
261
+ steps=20,
262
+ t_max=None,
263
+ t_min=None,
264
+ discretization=None,
265
+ discard_penultimate_step=None,
266
+ return_intermediate=None,
267
+ show_progress=False,
268
+ seed=-1,
269
+ **kwargs
270
+ ):
271
+ # sanity check
272
+ assert isinstance(steps, (int, torch.LongTensor))
273
+ assert t_max is None or (t_max > 0 and t_max <= self.num_timesteps - 1)
274
+ assert t_min is None or (t_min >= 0 and t_min < self.num_timesteps - 1)
275
+ assert discretization in (None, 'leading', 'linspace', 'trailing')
276
+ assert discard_penultimate_step in (None, True, False)
277
+ assert return_intermediate in (None, 'x0', 'xt')
278
+
279
+ # function of diffusion solver
280
+ solver_fn = {
281
+ # 'heun': sample_heun,
282
+ 'dpmpp_2m_sde': sample_dpmpp_2m_sde
283
+ }[solver]
284
+
285
+ # options
286
+ schedule = 'karras' if 'karras' in solver else None
287
+ discretization = discretization or 'linspace'
288
+ seed = seed if seed >= 0 else random.randint(0, 2 ** 31)
289
+ if isinstance(steps, torch.LongTensor):
290
+ discard_penultimate_step = False
291
+ if discard_penultimate_step is None:
292
+ discard_penultimate_step = True if solver in (
293
+ 'dpm2',
294
+ 'dpm2_ancestral',
295
+ 'dpmpp_2m_sde',
296
+ 'dpm2_karras',
297
+ 'dpm2_ancestral_karras',
298
+ 'dpmpp_2m_sde_karras'
299
+ ) else False
300
+
301
+ # function for denoising xt to get x0
302
+ intermediates = []
303
+ def model_fn(xt, sigma):
304
+ # denoising
305
+ t = self._sigma_to_t(sigma).repeat(len(xt)).round().long()
306
+ x0 = self.denoise(
307
+ xt, t, None, model, model_kwargs, guide_scale, guide_rescale, clamp,
308
+ percentile
309
+ )[-2]
310
+
311
+ # collect intermediate outputs
312
+ if return_intermediate == 'xt':
313
+ intermediates.append(xt)
314
+ elif return_intermediate == 'x0':
315
+ intermediates.append(x0)
316
+ return x0
317
+
318
+ # get timesteps
319
+ if isinstance(steps, int):
320
+ steps += 1 if discard_penultimate_step else 0
321
+ t_max = self.num_timesteps - 1 if t_max is None else t_max
322
+ t_min = 0 if t_min is None else t_min
323
+
324
+ # discretize timesteps
325
+ if discretization == 'leading':
326
+ steps = torch.arange(
327
+ t_min, t_max + 1, (t_max - t_min + 1) / steps
328
+ ).flip(0)
329
+ elif discretization == 'linspace':
330
+ steps = torch.linspace(t_max, t_min, steps)
331
+ elif discretization == 'trailing':
332
+ steps = torch.arange(t_max, t_min - 1, -((t_max - t_min + 1) / steps))
333
+ else:
334
+ raise NotImplementedError(
335
+ f'{discretization} discretization not implemented'
336
+ )
337
+ steps = steps.clamp_(t_min, t_max)
338
+ steps = torch.as_tensor(steps, dtype=torch.float32, device=noise.device)
339
+
340
+ # get sigmas
341
+ sigmas = self._t_to_sigma(steps)
342
+ sigmas = torch.cat([sigmas, sigmas.new_zeros([1])])
343
+ if schedule == 'karras':
344
+ if sigmas[0] == float('inf'):
345
+ sigmas = karras_schedule(
346
+ n=len(steps) - 1,
347
+ sigma_min=sigmas[sigmas > 0].min().item(),
348
+ sigma_max=sigmas[sigmas < float('inf')].max().item(),
349
+ rho=7.
350
+ ).to(sigmas)
351
+ sigmas = torch.cat([
352
+ sigmas.new_tensor([float('inf')]), sigmas, sigmas.new_zeros([1])
353
+ ])
354
+ else:
355
+ sigmas = karras_schedule(
356
+ n=len(steps),
357
+ sigma_min=sigmas[sigmas > 0].min().item(),
358
+ sigma_max=sigmas.max().item(),
359
+ rho=7.
360
+ ).to(sigmas)
361
+ sigmas = torch.cat([sigmas, sigmas.new_zeros([1])])
362
+ if discard_penultimate_step:
363
+ sigmas = torch.cat([sigmas[:-2], sigmas[-1:]])
364
+
365
+ # sampling
366
+ x0 = solver_fn(
367
+ noise,
368
+ model_fn,
369
+ sigmas,
370
+ show_progress=show_progress,
371
+ **kwargs
372
+ )
373
+ return (x0, intermediates) if return_intermediate is not None else x0
374
+
375
+ @torch.no_grad()
376
+ def ddim_reverse_sample(
377
+ self,
378
+ xt,
379
+ t,
380
+ model,
381
+ model_kwargs={},
382
+ clamp=None,
383
+ percentile=None,
384
+ guide_scale=None,
385
+ guide_rescale=None,
386
+ ddim_timesteps=20,
387
+ reverse_steps=600
388
+ ):
389
+ r"""Sample from p(x_{t+1} | x_t) using DDIM reverse ODE (deterministic).
390
+ """
391
+ stride = reverse_steps // ddim_timesteps
392
+
393
+ # predict distribution of p(x_{t-1} | x_t)
394
+ _, _, _, x0, eps = self.denoise(
395
+ xt, t, None, model, model_kwargs, guide_scale, guide_rescale, clamp,
396
+ percentile
397
+ )
398
+ # derive variables
399
+ s = (t + stride).clamp(0, reverse_steps-1)
400
+ # hyperparams
401
+ sigmas = _i(self.sigmas, t, xt)
402
+ alphas = _i(self.alphas, t, xt)
403
+ alphas_s = _i(self.alphas, s.clamp(0), xt)
404
+ alphas_s[s < 0] = 1.
405
+ sigmas_s = torch.sqrt(1 - alphas_s ** 2)
406
+
407
+ # reverse sample
408
+ mu = alphas_s * x0 + sigmas_s * eps
409
+ return mu, x0
410
+
411
+ @torch.no_grad()
412
+ def ddim_reverse_sample_loop(
413
+ self,
414
+ x0,
415
+ model,
416
+ model_kwargs={},
417
+ clamp=None,
418
+ percentile=None,
419
+ guide_scale=None,
420
+ guide_rescale=None,
421
+ ddim_timesteps=20,
422
+ reverse_steps=600
423
+ ):
424
+ # prepare input
425
+ b = x0.size(0)
426
+ xt = x0
427
+
428
+ # reconstruction steps
429
+ steps = torch.arange(0, reverse_steps, reverse_steps // ddim_timesteps)
430
+ for step in steps:
431
+ t = torch.full((b, ), step, dtype=torch.long, device=xt.device)
432
+ xt, _ = self.ddim_reverse_sample(xt, t, model, model_kwargs, clamp, percentile, guide_scale, guide_rescale, ddim_timesteps, reverse_steps)
433
+ return xt
434
+
435
+ def _sigma_to_t(self, sigma):
436
+ if sigma == float('inf'):
437
+ t = torch.full_like(sigma, len(self.sigmas) - 1)
438
+ else:
439
+ log_sigmas = torch.sqrt(
440
+ self.sigmas ** 2 / (1 - self.sigmas ** 2)
441
+ ).log().to(sigma)
442
+ log_sigma = sigma.log()
443
+ dists = log_sigma - log_sigmas[:, None]
444
+ low_idx = dists.ge(0).cumsum(dim=0).argmax(dim=0).clamp(
445
+ max=log_sigmas.shape[0] - 2
446
+ )
447
+ high_idx = low_idx + 1
448
+ low, high = log_sigmas[low_idx], log_sigmas[high_idx]
449
+ w = (low - log_sigma) / (low - high)
450
+ w = w.clamp(0, 1)
451
+ t = (1 - w) * low_idx + w * high_idx
452
+ t = t.view(sigma.shape)
453
+ if t.ndim == 0:
454
+ t = t.unsqueeze(0)
455
+ return t
456
+
457
+ def _t_to_sigma(self, t):
458
+ t = t.float()
459
+ low_idx, high_idx, w = t.floor().long(), t.ceil().long(), t.frac()
460
+ log_sigmas = torch.sqrt(self.sigmas ** 2 / (1 - self.sigmas ** 2)).log().to(t)
461
+ log_sigma = (1 - w) * log_sigmas[low_idx] + w * log_sigmas[high_idx]
462
+ log_sigma[torch.isnan(log_sigma) | torch.isinf(log_sigma)] = float('inf')
463
+ return log_sigma.exp()
464
+
465
+ def prev_step(self, model_out, t, xt, inference_steps=50):
466
+ prev_t = t - self.num_timesteps // inference_steps
467
+
468
+ sigmas = _i(self.sigmas, t, xt)
469
+ alphas = _i(self.alphas, t, xt)
470
+ alphas_prev = _i(self.alphas, prev_t.clamp(0), xt)
471
+ alphas_prev[prev_t < 0] = 1.
472
+ sigmas_prev = torch.sqrt(1 - alphas_prev ** 2)
473
+
474
+ x0 = alphas * xt - sigmas * model_out
475
+ eps = (xt - alphas * x0) / sigmas
476
+ prev_sample = alphas_prev * x0 + sigmas_prev * eps
477
+ return prev_sample
478
+
479
+ def next_step(self, model_out, t, xt, inference_steps=50):
480
+ t, next_t = min(t - self.num_timesteps // inference_steps, 999), t
481
+
482
+ sigmas = _i(self.sigmas, t, xt)
483
+ alphas = _i(self.alphas, t, xt)
484
+ alphas_next = _i(self.alphas, next_t.clamp(0), xt)
485
+ alphas_next[next_t < 0] = 1.
486
+ sigmas_next = torch.sqrt(1 - alphas_next ** 2)
487
+
488
+ x0 = alphas * xt - sigmas * model_out
489
+ eps = (xt - alphas * x0) / sigmas
490
+ next_sample = alphas_next * x0 + sigmas_next * eps
491
+ return next_sample
492
+
493
+ def get_noise_pred_single(self, xt, t, model, model_kwargs):
494
+ assert isinstance(model_kwargs, dict)
495
+ out = model(xt, t=t, **model_kwargs)
496
+ return out
497
+
498
+
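A minimal sketch of driving the wrapper above end to end; the latent shape and the zero-output stand-in denoiser are assumptions for illustration, and the imports assume the repository root is on PYTHONPATH (sigma_schedule comes from the schedules module added below):

    import torch
    from tools.modules.diffusions.diffusion_gauss import GaussianDiffusion
    from tools.modules.diffusions.schedules import sigma_schedule

    diffusion = GaussianDiffusion(sigmas=sigma_schedule('cosine', num_timesteps=1000),
                                  prediction_type='v')
    model = lambda xt, t, **kw: torch.zeros_like(xt)       # stand-in v-prediction network
    x0 = diffusion.sample(noise=torch.randn(1, 4, 16, 32, 32), model=model, model_kwargs={},
                          solver='dpmpp_2m_sde', steps=20, show_progress=False)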
UniAnimate/tools/modules/diffusions/losses.py ADDED
@@ -0,0 +1,28 @@
1
+ import torch
2
+ import math
3
+
4
+ __all__ = ['kl_divergence', 'discretized_gaussian_log_likelihood']
5
+
6
+ def kl_divergence(mu1, logvar1, mu2, logvar2):
7
+ return 0.5 * (-1.0 + logvar2 - logvar1 + torch.exp(logvar1 - logvar2) + ((mu1 - mu2) ** 2) * torch.exp(-logvar2))
8
+
9
+ def standard_normal_cdf(x):
10
+ r"""A fast approximation of the cumulative distribution function of the standard normal.
11
+ """
12
+ return 0.5 * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * torch.pow(x, 3))))
13
+
14
+ def discretized_gaussian_log_likelihood(x0, mean, log_scale):
15
+ assert x0.shape == mean.shape == log_scale.shape
16
+ cx = x0 - mean
17
+ inv_stdv = torch.exp(-log_scale)
18
+ cdf_plus = standard_normal_cdf(inv_stdv * (cx + 1.0 / 255.0))
19
+ cdf_min = standard_normal_cdf(inv_stdv * (cx - 1.0 / 255.0))
20
+ log_cdf_plus = torch.log(cdf_plus.clamp(min=1e-12))
21
+ log_one_minus_cdf_min = torch.log((1.0 - cdf_min).clamp(min=1e-12))
22
+ cdf_delta = cdf_plus - cdf_min
23
+ log_probs = torch.where(
24
+ x0 < -0.999,
25
+ log_cdf_plus,
26
+ torch.where(x0 > 0.999, log_one_minus_cdf_min, torch.log(cdf_delta.clamp(min=1e-12))))
27
+ assert log_probs.shape == x0.shape
28
+ return log_probs
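As a quick sanity check of the two helpers above (a sketch; the import assumes the repository root is on PYTHONPATH): the KL between identical Gaussians is zero, and the discretized log-likelihood scores how much probability mass the predicted Gaussian puts on the 2/255-wide pixel bin around x0.

    import torch
    from tools.modules.diffusions.losses import kl_divergence, discretized_gaussian_log_likelihood

    mu = torch.zeros(2, 3)
    logvar = torch.zeros(2, 3)
    print(kl_divergence(mu, logvar, mu, logvar))                                 # all zeros
    print(discretized_gaussian_log_likelihood(torch.zeros(2, 3), mean=mu, log_scale=logvar))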
UniAnimate/tools/modules/diffusions/schedules.py ADDED
@@ -0,0 +1,166 @@
1
+ import math
2
+ import torch
3
+
4
+
5
+ def beta_schedule(schedule='cosine',
6
+ num_timesteps=1000,
7
+ zero_terminal_snr=False,
8
+ **kwargs):
9
+ # compute betas
10
+ betas = {
11
+ # 'logsnr_cosine_interp': logsnr_cosine_interp_schedule,
12
+ 'linear': linear_schedule,
13
+ 'linear_sd': linear_sd_schedule,
14
+ 'quadratic': quadratic_schedule,
15
+ 'cosine': cosine_schedule
16
+ }[schedule](num_timesteps, **kwargs)
17
+
18
+ if zero_terminal_snr and abs(betas.max() - 1.0) > 0.0001:
19
+ betas = rescale_zero_terminal_snr(betas)
20
+
21
+ return betas
22
+
23
+
24
+ def sigma_schedule(schedule='cosine',
25
+ num_timesteps=1000,
26
+ zero_terminal_snr=False,
27
+ **kwargs):
28
+ # compute betas
29
+ betas = {
30
+ 'logsnr_cosine_interp': logsnr_cosine_interp_schedule,
31
+ 'linear': linear_schedule,
32
+ 'linear_sd': linear_sd_schedule,
33
+ 'quadratic': quadratic_schedule,
34
+ 'cosine': cosine_schedule
35
+ }[schedule](num_timesteps, **kwargs)
36
+ if schedule == 'logsnr_cosine_interp':
37
+ sigma = betas
38
+ else:
39
+ sigma = betas_to_sigmas(betas)
40
+ if zero_terminal_snr and abs(sigma.max() - 1.0) > 0.0001:
41
+ sigma = rescale_zero_terminal_snr(sigma)
42
+
43
+ return sigma
44
+
45
+
46
+ def linear_schedule(num_timesteps, init_beta, last_beta, **kwargs):
47
+ scale = 1000.0 / num_timesteps
48
+ init_beta = init_beta or scale * 0.0001
49
+ last_beta = last_beta or scale * 0.02
50
+ return torch.linspace(init_beta, last_beta, num_timesteps, dtype=torch.float64)
51
+
52
+ def logsnr_cosine_interp_schedule(
53
+ num_timesteps,
54
+ scale_min=2,
55
+ scale_max=4,
56
+ logsnr_min=-15,
57
+ logsnr_max=15,
58
+ **kwargs):
59
+ return logsnrs_to_sigmas(
60
+ _logsnr_cosine_interp(num_timesteps, logsnr_min, logsnr_max, scale_min, scale_max))
61
+
62
+ def linear_sd_schedule(num_timesteps, init_beta, last_beta, **kwargs):
63
+ return torch.linspace(init_beta ** 0.5, last_beta ** 0.5, num_timesteps, dtype=torch.float64) ** 2
64
+
65
+
66
+ def quadratic_schedule(num_timesteps, init_beta, last_beta, **kwargs):
67
+ init_beta = init_beta or 0.0015
68
+ last_beta = last_beta or 0.0195
69
+ return torch.linspace(init_beta ** 0.5, last_beta ** 0.5, num_timesteps, dtype=torch.float64) ** 2
70
+
71
+
72
+ def cosine_schedule(num_timesteps, cosine_s=0.008, **kwargs):
73
+ betas = []
74
+ for step in range(num_timesteps):
75
+ t1 = step / num_timesteps
76
+ t2 = (step + 1) / num_timesteps
77
+ fn = lambda u: math.cos((u + cosine_s) / (1 + cosine_s) * math.pi / 2) ** 2
78
+ betas.append(min(1.0 - fn(t2) / fn(t1), 0.999))
79
+ return torch.tensor(betas, dtype=torch.float64)
80
+
81
+
82
+ # def cosine_schedule(n, cosine_s=0.008, **kwargs):
83
+ # ramp = torch.linspace(0, 1, n + 1)
84
+ # square_alphas = torch.cos((ramp + cosine_s) / (1 + cosine_s) * torch.pi / 2) ** 2
85
+ # betas = (1 - square_alphas[1:] / square_alphas[:-1]).clamp(max=0.999)
86
+ # return betas_to_sigmas(betas)
87
+
88
+
89
+ def betas_to_sigmas(betas):
90
+ return torch.sqrt(1 - torch.cumprod(1 - betas, dim=0))
91
+
92
+
93
+ def sigmas_to_betas(sigmas):
94
+ square_alphas = 1 - sigmas**2
95
+ betas = 1 - torch.cat(
96
+ [square_alphas[:1], square_alphas[1:] / square_alphas[:-1]])
97
+ return betas
98
+
99
+
100
+
101
+ def sigmas_to_logsnrs(sigmas):
102
+ square_sigmas = sigmas**2
103
+ return torch.log(square_sigmas / (1 - square_sigmas))
104
+
105
+
106
+ def _logsnr_cosine(n, logsnr_min=-15, logsnr_max=15):
107
+ t_min = math.atan(math.exp(-0.5 * logsnr_min))
108
+ t_max = math.atan(math.exp(-0.5 * logsnr_max))
109
+ t = torch.linspace(1, 0, n)
110
+ logsnrs = -2 * torch.log(torch.tan(t_min + t * (t_max - t_min)))
111
+ return logsnrs
112
+
113
+
114
+ def _logsnr_cosine_shifted(n, logsnr_min=-15, logsnr_max=15, scale=2):
115
+ logsnrs = _logsnr_cosine(n, logsnr_min, logsnr_max)
116
+ logsnrs += 2 * math.log(1 / scale)
117
+ return logsnrs
118
+
119
+ def karras_schedule(n, sigma_min=0.002, sigma_max=80.0, rho=7.0):
120
+ ramp = torch.linspace(1, 0, n)
121
+ min_inv_rho = sigma_min**(1 / rho)
122
+ max_inv_rho = sigma_max**(1 / rho)
123
+ sigmas = (max_inv_rho + ramp * (min_inv_rho - max_inv_rho))**rho
124
+ sigmas = torch.sqrt(sigmas**2 / (1 + sigmas**2))
125
+ return sigmas
126
+
127
+ def _logsnr_cosine_interp(n,
128
+ logsnr_min=-15,
129
+ logsnr_max=15,
130
+ scale_min=2,
131
+ scale_max=4):
132
+ t = torch.linspace(1, 0, n)
133
+ logsnrs_min = _logsnr_cosine_shifted(n, logsnr_min, logsnr_max, scale_min)
134
+ logsnrs_max = _logsnr_cosine_shifted(n, logsnr_min, logsnr_max, scale_max)
135
+ logsnrs = t * logsnrs_min + (1 - t) * logsnrs_max
136
+ return logsnrs
137
+
138
+
139
+ def logsnrs_to_sigmas(logsnrs):
140
+ return torch.sqrt(torch.sigmoid(-logsnrs))
141
+
142
+
143
+ def rescale_zero_terminal_snr(betas):
144
+ """
145
+ Rescale Schedule to Zero Terminal SNR
146
+ """
147
+ # Convert betas to alphas_bar_sqrt
148
+ alphas = 1 - betas
149
+ alphas_bar = alphas.cumprod(0)
150
+ alphas_bar_sqrt = alphas_bar.sqrt()
151
+
152
+ # Store old values.
153
+ alphas_bar_sqrt_0 = alphas_bar_sqrt[0].clone()
154
+ alphas_bar_sqrt_T = alphas_bar_sqrt[-1].clone()
155
+ # Shift so last timestep is zero.
156
+ alphas_bar_sqrt -= alphas_bar_sqrt_T
157
+ # Scale so first timestep is back to old value.
158
+ alphas_bar_sqrt *= alphas_bar_sqrt_0 / (alphas_bar_sqrt_0 - alphas_bar_sqrt_T)
159
+
160
+ # Convert alphas_bar_sqrt to betas
161
+ alphas_bar = alphas_bar_sqrt ** 2
162
+ alphas = alphas_bar[1:] / alphas_bar[:-1]
163
+ alphas = torch.cat([alphas_bar[0:1], alphas])
164
+ betas = 1 - alphas
165
+ return betas
166
+
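
A minimal usage sketch for the schedule helpers above (illustrative only, not part of the commit; it assumes the functions defined in this module are in scope):

import torch

# 'linear' needs explicit init_beta/last_beta kwargs; passing None falls back to the scaled DDPM defaults.
betas = beta_schedule('linear', num_timesteps=1000, init_beta=None, last_beta=None)
sigmas = sigma_schedule('cosine', num_timesteps=1000)

# betas_to_sigmas implements sigma_t = sqrt(1 - prod_{s<=t} (1 - beta_s))
manual = torch.sqrt(1 - torch.cumprod(1 - cosine_schedule(1000), dim=0))
assert torch.allclose(sigmas, manual)
print(betas.shape, float(sigmas[-1]))  # torch.Size([1000]) and a terminal sigma close to 1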
UniAnimate/tools/modules/embedding_manager.py ADDED
@@ -0,0 +1,179 @@
1
+ import torch
2
+ from torch import nn
3
+ import torch.nn.functional as F
4
+ import open_clip
5
+
6
+ from functools import partial
7
+ from utils.registry_class import EMBEDMANAGER
8
+
9
+ DEFAULT_PLACEHOLDER_TOKEN = ["*"]
10
+
11
+ PROGRESSIVE_SCALE = 2000
12
+
13
+ per_img_token_list = [
14
+ 'א', 'ב', 'ג', 'ד', 'ה', 'ו', 'ז', 'ח', 'ט', 'י', 'כ', 'ל', 'מ', 'נ', 'ס', 'ע', 'פ', 'צ', 'ק', 'ר', 'ש', 'ת',
15
+ ]
16
+
17
+ def get_clip_token_for_string(string):
18
+ tokens = open_clip.tokenize(string)
19
+
20
+ return tokens[0, 1]
21
+
22
+ def get_embedding_for_clip_token(embedder, token):
23
+ return embedder(token.unsqueeze(0))[0]
24
+
25
+
26
+ @EMBEDMANAGER.register_class()
27
+ class EmbeddingManager(nn.Module):
28
+ def __init__(
29
+ self,
30
+ embedder,
31
+ placeholder_strings=None,
32
+ initializer_words=None,
33
+ per_image_tokens=False,
34
+ num_vectors_per_token=1,
35
+ progressive_words=False,
36
+ temporal_prompt_length=1,
37
+ token_dim=1024,
38
+ **kwargs
39
+ ):
40
+ super().__init__()
41
+
42
+ self.string_to_token_dict = {}
43
+
44
+ self.string_to_param_dict = nn.ParameterDict()
45
+
46
+ self.initial_embeddings = nn.ParameterDict() # These should not be optimized
47
+
48
+ self.progressive_words = progressive_words
49
+ self.progressive_counter = 0
50
+
51
+ self.max_vectors_per_token = num_vectors_per_token
52
+
53
+ get_embedding_for_tkn = partial(get_embedding_for_clip_token, embedder.model.token_embedding.cpu())
54
+
55
+ if per_image_tokens:
56
+ placeholder_strings.extend(per_img_token_list)
57
+
58
+ for idx, placeholder_string in enumerate(placeholder_strings):
59
+
60
+ token = get_clip_token_for_string(placeholder_string)
61
+
62
+ if initializer_words and idx < len(initializer_words):
63
+ init_word_token = get_clip_token_for_string(initializer_words[idx])
64
+
65
+ with torch.no_grad():
66
+ init_word_embedding = get_embedding_for_tkn(init_word_token)
67
+
68
+ token_params = torch.nn.Parameter(init_word_embedding.unsqueeze(0).repeat(num_vectors_per_token, 1), requires_grad=True)
69
+ self.initial_embeddings[placeholder_string] = torch.nn.Parameter(init_word_embedding.unsqueeze(0).repeat(num_vectors_per_token, 1), requires_grad=False)
70
+ else:
71
+ token_params = torch.nn.Parameter(torch.rand(size=(num_vectors_per_token, token_dim), requires_grad=True))
72
+
73
+ self.string_to_token_dict[placeholder_string] = token
74
+ self.string_to_param_dict[placeholder_string] = token_params
75
+
76
+
77
+ def forward(
78
+ self,
79
+ tokenized_text,
80
+ embedded_text,
81
+ ):
82
+ b, n, device = *tokenized_text.shape, tokenized_text.device
83
+
84
+ for placeholder_string, placeholder_token in self.string_to_token_dict.items():
85
+
86
+ placeholder_embedding = self.string_to_param_dict[placeholder_string].to(device)
87
+
88
+ if self.max_vectors_per_token == 1: # If there's only one vector per token, we can do a simple replacement
89
+ placeholder_idx = torch.where(tokenized_text == placeholder_token.to(device))
90
+ embedded_text[placeholder_idx] = placeholder_embedding
91
+ else: # otherwise, need to insert and keep track of changing indices
92
+ if self.progressive_words:
93
+ self.progressive_counter += 1
94
+ max_step_tokens = 1 + self.progressive_counter // PROGRESSIVE_SCALE
95
+ else:
96
+ max_step_tokens = self.max_vectors_per_token
97
+
98
+ num_vectors_for_token = min(placeholder_embedding.shape[0], max_step_tokens)
99
+
100
+ placeholder_rows, placeholder_cols = torch.where(tokenized_text == placeholder_token.to(device))
101
+
102
+ if placeholder_rows.nelement() == 0:
103
+ continue
104
+
105
+ sorted_cols, sort_idx = torch.sort(placeholder_cols, descending=True)
106
+ sorted_rows = placeholder_rows[sort_idx]
107
+
108
+ for idx in range(len(sorted_rows)):
109
+ row = sorted_rows[idx]
110
+ col = sorted_cols[idx]
111
+
112
+ new_token_row = torch.cat([tokenized_text[row][:col], placeholder_token.repeat(num_vectors_for_token).to(device), tokenized_text[row][col + 1:]], axis=0)[:n]
113
+ new_embed_row = torch.cat([embedded_text[row][:col], placeholder_embedding[:num_vectors_for_token], embedded_text[row][col + 1:]], axis=0)[:n]
114
+
115
+ embedded_text[row] = new_embed_row
116
+ tokenized_text[row] = new_token_row
117
+
118
+ return embedded_text
119
+
120
+ def forward_with_text_img(
121
+ self,
122
+ tokenized_text,
123
+ embedded_text,
124
+ embedded_img,
125
+ ):
126
+ device = tokenized_text.device
127
+ for placeholder_string, placeholder_token in self.string_to_token_dict.items():
128
+ placeholder_embedding = self.string_to_param_dict[placeholder_string].to(device)
129
+ placeholder_idx = torch.where(tokenized_text == placeholder_token.to(device))
130
+ embedded_text[placeholder_idx] = embedded_text[placeholder_idx] + embedded_img + placeholder_embedding
131
+ return embedded_text
132
+
133
+ def forward_with_text(
134
+ self,
135
+ tokenized_text,
136
+ embedded_text
137
+ ):
138
+ device = tokenized_text.device
139
+ for placeholder_string, placeholder_token in self.string_to_token_dict.items():
140
+ placeholder_embedding = self.string_to_param_dict[placeholder_string].to(device)
141
+ placeholder_idx = torch.where(tokenized_text == placeholder_token.to(device))
142
+ embedded_text[placeholder_idx] = embedded_text[placeholder_idx] + placeholder_embedding
143
+ return embedded_text
144
+
145
+ def save(self, ckpt_path):
146
+ torch.save({"string_to_token": self.string_to_token_dict,
147
+ "string_to_param": self.string_to_param_dict}, ckpt_path)
148
+
149
+ def load(self, ckpt_path):
150
+ ckpt = torch.load(ckpt_path, map_location='cpu')
151
+
152
+ string_to_token = ckpt["string_to_token"]
153
+ string_to_param = ckpt["string_to_param"]
154
+ for string, token in string_to_token.items():
155
+ self.string_to_token_dict[string] = token
156
+ for string, param in string_to_param.items():
157
+ self.string_to_param_dict[string] = param
158
+
159
+ def get_embedding_norms_squared(self):
160
+ all_params = torch.cat(list(self.string_to_param_dict.values()), axis=0) # num_placeholders x embedding_dim
161
+ param_norm_squared = (all_params * all_params).sum(axis=-1) # num_placeholders
162
+
163
+ return param_norm_squared
164
+
165
+ def embedding_parameters(self):
166
+ return self.string_to_param_dict.parameters()
167
+
168
+ def embedding_to_coarse_loss(self):
169
+
170
+ loss = 0.
171
+ num_embeddings = len(self.initial_embeddings)
172
+
173
+ for key in self.initial_embeddings:
174
+ optimized = self.string_to_param_dict[key]
175
+ coarse = self.initial_embeddings[key].clone().to(optimized.device)
176
+
177
+ loss = loss + (optimized - coarse) @ (optimized - coarse).T / num_embeddings
178
+
179
+ return loss
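
A hedged usage sketch for EmbeddingManager (not part of the commit). The open_clip model name, the small wrapper class, and the checkpoint path below are illustrative assumptions; the only interface the module actually relies on is an embedder exposing .model.token_embedding:

import open_clip

class ClipWrapper:
    # minimal adapter exposing .model.token_embedding, as EmbeddingManager.__init__ expects
    def __init__(self, model):
        self.model = model

# randomly initialised text tower, no weight download; ViT-H-14 matches the token_dim=1024 default
clip_model, _, _ = open_clip.create_model_and_transforms('ViT-H-14')
manager = EmbeddingManager(
    embedder=ClipWrapper(clip_model),
    placeholder_strings=['*'],
    initializer_words=['person'],
    num_vectors_per_token=4,
    token_dim=clip_model.token_embedding.embedding_dim,
)
manager.save('placeholder_embeddings.pt')  # illustrative path
manager.load('placeholder_embeddings.pt')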
UniAnimate/tools/modules/unet/__init__.py ADDED
@@ -0,0 +1,2 @@
1
+ from .unet_unianimate import *
2
+
UniAnimate/tools/modules/unet/mha_flash.py ADDED
@@ -0,0 +1,103 @@
1
+ import torch
2
+ import torch.nn as nn
3
+ import torch.cuda.amp as amp
4
+ import torch.nn.functional as F
5
+ import math
6
+ import os
7
+ import time
8
+ import numpy as np
9
+ import random
10
+
11
+ from flash_attn.flash_attention import FlashAttention  # required by FlashAttentionBlock below (flash-attn package)
12
+ class FlashAttentionBlock(nn.Module):
13
+
14
+ def __init__(self, dim, context_dim=None, num_heads=None, head_dim=None, batch_size=4):
15
+ # consider head_dim first, then num_heads
16
+ num_heads = dim // head_dim if head_dim else num_heads
17
+ head_dim = dim // num_heads
18
+ assert num_heads * head_dim == dim
19
+ super(FlashAttentionBlock, self).__init__()
20
+ self.dim = dim
21
+ self.context_dim = context_dim
22
+ self.num_heads = num_heads
23
+ self.head_dim = head_dim
24
+ self.scale = math.pow(head_dim, -0.25)
25
+
26
+ # layers
27
+ self.norm = nn.GroupNorm(32, dim)
28
+ self.to_qkv = nn.Conv2d(dim, dim * 3, 1)
29
+ if context_dim is not None:
30
+ self.context_kv = nn.Linear(context_dim, dim * 2)
31
+ self.proj = nn.Conv2d(dim, dim, 1)
32
+
33
+ if self.head_dim <= 128 and (self.head_dim % 8) == 0:
34
+ new_scale = math.pow(head_dim, -0.5)
35
+ self.flash_attn = FlashAttention(softmax_scale=None, attention_dropout=0.0)
36
+
37
+ # zero out the last layer params
38
+ nn.init.zeros_(self.proj.weight)
39
+ # self.apply(self._init_weight)
40
+
41
+
42
+ def _init_weight(self, module):
43
+ if isinstance(module, nn.Linear):
44
+ module.weight.data.normal_(mean=0.0, std=0.15)
45
+ if module.bias is not None:
46
+ module.bias.data.zero_()
47
+ elif isinstance(module, nn.Conv2d):
48
+ module.weight.data.normal_(mean=0.0, std=0.15)
49
+ if module.bias is not None:
50
+ module.bias.data.zero_()
51
+
52
+ def forward(self, x, context=None):
53
+ r"""x: [B, C, H, W].
54
+ context: [B, L, C] or None.
55
+ """
56
+ identity = x
57
+ b, c, h, w, n, d = *x.size(), self.num_heads, self.head_dim
58
+
59
+ # compute query, key, value
60
+ x = self.norm(x)
61
+ q, k, v = self.to_qkv(x).view(b, n * 3, d, h * w).chunk(3, dim=1)
62
+ if context is not None:
63
+ ck, cv = self.context_kv(context).reshape(b, -1, n * 2, d).permute(0, 2, 3, 1).chunk(2, dim=1)
64
+ k = torch.cat([ck, k], dim=-1)
65
+ v = torch.cat([cv, v], dim=-1)
66
+ cq = torch.zeros([b, n, d, 4], dtype=q.dtype, device=q.device)
67
+ q = torch.cat([q, cq], dim=-1)
68
+
69
+ qkv = torch.cat([q,k,v], dim=1)
70
+ origin_dtype = qkv.dtype
71
+ qkv = qkv.permute(0, 3, 1, 2).reshape(b, -1, 3, n, d).half().contiguous()
72
+ out, _ = self.flash_attn(qkv)
73
+ out = out.to(origin_dtype)  # Tensor.to is not in-place; reassign to restore the original dtype
74
+
75
+ if context is not None:
76
+ out = out[:, :-4, :, :]
77
+ out = out.permute(0, 2, 3, 1).reshape(b, c, h, w)
78
+
79
+ # output
80
+ x = self.proj(out)
81
+ return x + identity
82
+
83
+ if __name__ == '__main__':
84
+ batch_size = 8
85
+ flash_net = FlashAttentionBlock(dim=1280, context_dim=512, num_heads=None, head_dim=64, batch_size=batch_size).cuda()
86
+
87
+ x = torch.randn([batch_size, 1280, 32, 32], dtype=torch.float32).cuda()
88
+ context = torch.randn([batch_size, 4, 512], dtype=torch.float32).cuda()
89
+ # context = None
90
+ flash_net.eval()
91
+
92
+ with amp.autocast(enabled=True):
93
+ # warm up
94
+ for i in range(5):
95
+ y = flash_net(x, context)
96
+ torch.cuda.synchronize()
97
+ s1 = time.time()
98
+ for i in range(10):
99
+ y = flash_net(x, context)
100
+ torch.cuda.synchronize()
101
+ s2 = time.time()
102
+
103
+ print(f'Average cost time {(s2-s1)*1000/10} ms')
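
The block above depends on the external flash-attn package (see the import at the top of the file). As a hedged aside, not part of the commit: on PyTorch >= 2.0 an equivalent fused attention over the same packed qkv layout can be sketched with the built-in scaled_dot_product_attention, returning the (batch, seq, heads, head_dim) layout the surrounding forward expects:

import torch
import torch.nn.functional as F

def sdpa_from_packed_qkv(qkv: torch.Tensor) -> torch.Tensor:
    # qkv: (batch, seq_len, 3, heads, head_dim), as built in FlashAttentionBlock.forward
    q, k, v = (t.squeeze(2).permute(0, 2, 1, 3) for t in qkv.chunk(3, dim=2))
    out = F.scaled_dot_product_attention(q, k, v)   # (batch, heads, seq_len, head_dim)
    return out.permute(0, 2, 1, 3)                  # back to (batch, seq_len, heads, head_dim)

qkv = torch.randn(2, 64, 3, 8, 64)
print(sdpa_from_packed_qkv(qkv).shape)              # torch.Size([2, 64, 8, 64])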
UniAnimate/tools/modules/unet/unet_unianimate.py ADDED
@@ -0,0 +1,659 @@
1
+ import math
2
+ import torch
3
+ import xformers
4
+ import xformers.ops
5
+ import torch.nn as nn
6
+ from einops import rearrange
7
+ import torch.nn.functional as F
8
+ from rotary_embedding_torch import RotaryEmbedding
9
+ from fairscale.nn.checkpoint import checkpoint_wrapper
10
+
11
+ from .util import *
12
+ # from .mha_flash import FlashAttentionBlock
13
+ from utils.registry_class import MODEL
14
+
15
+
16
+ USE_TEMPORAL_TRANSFORMER = True
17
+
18
+
19
+
20
+ class PreNormattention(nn.Module):
21
+ def __init__(self, dim, fn):
22
+ super().__init__()
23
+ self.norm = nn.LayerNorm(dim)
24
+ self.fn = fn
25
+ def forward(self, x, **kwargs):
26
+ return self.fn(self.norm(x), **kwargs) + x
27
+
28
+ class PreNormattention_qkv(nn.Module):
29
+ def __init__(self, dim, fn):
30
+ super().__init__()
31
+ self.norm = nn.LayerNorm(dim)
32
+ self.fn = fn
33
+ def forward(self, q, k, v, **kwargs):
34
+ return self.fn(self.norm(q), self.norm(k), self.norm(v), **kwargs) + q
35
+
36
+ class Attention(nn.Module):
37
+ def __init__(self, dim, heads = 8, dim_head = 64, dropout = 0.):
38
+ super().__init__()
39
+ inner_dim = dim_head * heads
40
+ project_out = not (heads == 1 and dim_head == dim)
41
+
42
+ self.heads = heads
43
+ self.scale = dim_head ** -0.5
44
+
45
+ self.attend = nn.Softmax(dim = -1)
46
+ self.to_qkv = nn.Linear(dim, inner_dim * 3, bias = False)
47
+
48
+ self.to_out = nn.Sequential(
49
+ nn.Linear(inner_dim, dim),
50
+ nn.Dropout(dropout)
51
+ ) if project_out else nn.Identity()
52
+
53
+ def forward(self, x):
54
+ b, n, _, h = *x.shape, self.heads
55
+ qkv = self.to_qkv(x).chunk(3, dim = -1)
56
+ q, k, v = map(lambda t: rearrange(t, 'b n (h d) -> b h n d', h = h), qkv)
57
+
58
+ dots = einsum('b h i d, b h j d -> b h i j', q, k) * self.scale
59
+
60
+ attn = self.attend(dots)
61
+
62
+ out = einsum('b h i j, b h j d -> b h i d', attn, v)
63
+ out = rearrange(out, 'b h n d -> b n (h d)')
64
+ return self.to_out(out)
65
+
66
+
67
+ class Attention_qkv(nn.Module):
68
+ def __init__(self, dim, heads = 8, dim_head = 64, dropout = 0.):
69
+ super().__init__()
70
+ inner_dim = dim_head * heads
71
+ project_out = not (heads == 1 and dim_head == dim)
72
+
73
+ self.heads = heads
74
+ self.scale = dim_head ** -0.5
75
+
76
+ self.attend = nn.Softmax(dim = -1)
77
+ self.to_q = nn.Linear(dim, inner_dim, bias = False)
78
+ self.to_k = nn.Linear(dim, inner_dim, bias = False)
79
+ self.to_v = nn.Linear(dim, inner_dim, bias = False)
80
+
81
+ self.to_out = nn.Sequential(
82
+ nn.Linear(inner_dim, dim),
83
+ nn.Dropout(dropout)
84
+ ) if project_out else nn.Identity()
85
+
86
+ def forward(self, q, k, v):
87
+ b, n, _, h = *q.shape, self.heads
88
+ bk = k.shape[0]
89
+
90
+ q = self.to_q(q)
91
+ k = self.to_k(k)
92
+ v = self.to_v(v)
93
+ q = rearrange(q, 'b n (h d) -> b h n d', h = h)
94
+ k = rearrange(k, 'b n (h d) -> b h n d', b=bk, h = h)
95
+ v = rearrange(v, 'b n (h d) -> b h n d', b=bk, h = h)
96
+
97
+ dots = einsum('b h i d, b h j d -> b h i j', q, k) * self.scale
98
+
99
+ attn = self.attend(dots)
100
+
101
+ out = einsum('b h i j, b h j d -> b h i d', attn, v)
102
+ out = rearrange(out, 'b h n d -> b n (h d)')
103
+ return self.to_out(out)
104
+
105
+ class PostNormattention(nn.Module):
106
+ def __init__(self, dim, fn):
107
+ super().__init__()
108
+ self.norm = nn.LayerNorm(dim)
109
+ self.fn = fn
110
+ def forward(self, x, **kwargs):
111
+ return self.norm(self.fn(x, **kwargs) + x)
112
+
113
+
114
+
115
+
116
+ class Transformer_v2(nn.Module):
117
+ def __init__(self, heads=8, dim=2048, dim_head_k=256, dim_head_v=256, dropout_atte = 0.05, mlp_dim=2048, dropout_ffn = 0.05, depth=1):
118
+ super().__init__()
119
+ self.layers = nn.ModuleList([])
120
+ self.depth = depth
121
+ for _ in range(depth):
122
+ self.layers.append(nn.ModuleList([
123
+ PreNormattention(dim, Attention(dim, heads = heads, dim_head = dim_head_k, dropout = dropout_atte)),
124
+ FeedForward(dim, mlp_dim, dropout = dropout_ffn),
125
+ ]))
126
+ def forward(self, x):
127
+ for attn, ff in self.layers[:1]:
128
+ x = attn(x)
129
+ x = ff(x) + x
130
+ if self.depth > 1:
131
+ for attn, ff in self.layers[1:]:
132
+ x = attn(x)
133
+ x = ff(x) + x
134
+ return x
135
+
136
+
137
+ class DropPath(nn.Module):
138
+ r"""DropPath but without rescaling and supports optional all-zero and/or all-keep.
139
+ """
140
+ def __init__(self, p):
141
+ super(DropPath, self).__init__()
142
+ self.p = p
143
+
144
+ def forward(self, *args, zero=None, keep=None):
145
+ if not self.training:
146
+ return args[0] if len(args) == 1 else args
147
+
148
+ # params
149
+ x = args[0]
150
+ b = x.size(0)
151
+ n = (torch.rand(b) < self.p).sum()
152
+
153
+ # non-zero and non-keep mask
154
+ mask = x.new_ones(b, dtype=torch.bool)
155
+ if keep is not None:
156
+ mask[keep] = False
157
+ if zero is not None:
158
+ mask[zero] = False
159
+
160
+ # drop-path index
161
+ index = torch.where(mask)[0]
162
+ index = index[torch.randperm(len(index))[:n]]
163
+ if zero is not None:
164
+ index = torch.cat([index, torch.where(zero)[0]], dim=0)
165
+
166
+ # drop-path multiplier
167
+ multiplier = x.new_ones(b)
168
+ multiplier[index] = 0.0
169
+ output = tuple(u * self.broadcast(multiplier, u) for u in args)
170
+ return output[0] if len(args) == 1 else output
171
+
172
+ def broadcast(self, src, dst):
173
+ assert src.size(0) == dst.size(0)
174
+ shape = (dst.size(0), ) + (1, ) * (dst.ndim - 1)
175
+ return src.view(shape)
176
+
177
+
178
+
179
+
180
+ @MODEL.register_class()
181
+ class UNetSD_UniAnimate(nn.Module):
182
+
183
+ def __init__(self,
184
+ config=None,
185
+ in_dim=4,
186
+ dim=512,
187
+ y_dim=512,
188
+ context_dim=1024,
189
+ hist_dim = 156,
190
+ concat_dim = 8,
191
+ out_dim=6,
192
+ dim_mult=[1, 2, 3, 4],
193
+ num_heads=None,
194
+ head_dim=64,
195
+ num_res_blocks=3,
196
+ attn_scales=[1 / 2, 1 / 4, 1 / 8],
197
+ use_scale_shift_norm=True,
198
+ dropout=0.1,
199
+ temporal_attn_times=1,
200
+ temporal_attention = True,
201
+ use_checkpoint=False,
202
+ use_image_dataset=False,
203
+ use_fps_condition= False,
204
+ use_sim_mask = False,
205
+ misc_dropout = 0.5,
206
+ training=True,
207
+ inpainting=True,
208
+ p_all_zero=0.1,
209
+ p_all_keep=0.1,
210
+ zero_y = None,
211
+ black_image_feature = None,
212
+ adapter_transformer_layers = 1,
213
+ num_tokens=4,
214
+ **kwargs
215
+ ):
216
+ embed_dim = dim * 4
217
+ num_heads=num_heads if num_heads else dim//32
218
+ super(UNetSD_UniAnimate, self).__init__()
219
+ self.zero_y = zero_y
220
+ self.black_image_feature = black_image_feature
221
+ self.cfg = config
222
+ self.in_dim = in_dim
223
+ self.dim = dim
224
+ self.y_dim = y_dim
225
+ self.context_dim = context_dim
226
+ self.num_tokens = num_tokens
227
+ self.hist_dim = hist_dim
228
+ self.concat_dim = concat_dim
229
+ self.embed_dim = embed_dim
230
+ self.out_dim = out_dim
231
+ self.dim_mult = dim_mult
232
+
233
+ self.num_heads = num_heads
234
+
235
+ self.head_dim = head_dim
236
+ self.num_res_blocks = num_res_blocks
237
+ self.attn_scales = attn_scales
238
+ self.use_scale_shift_norm = use_scale_shift_norm
239
+ self.temporal_attn_times = temporal_attn_times
240
+ self.temporal_attention = temporal_attention
241
+ self.use_checkpoint = use_checkpoint
242
+ self.use_image_dataset = use_image_dataset
243
+ self.use_fps_condition = use_fps_condition
244
+ self.use_sim_mask = use_sim_mask
245
+ self.training=training
246
+ self.inpainting = inpainting
247
+ self.video_compositions = self.cfg.video_compositions
248
+ self.misc_dropout = misc_dropout
249
+ self.p_all_zero = p_all_zero
250
+ self.p_all_keep = p_all_keep
251
+
252
+ use_linear_in_temporal = False
253
+ transformer_depth = 1
254
+ disabled_sa = False
255
+ # params
256
+ enc_dims = [dim * u for u in [1] + dim_mult]
257
+ dec_dims = [dim * u for u in [dim_mult[-1]] + dim_mult[::-1]]
258
+ shortcut_dims = []
259
+ scale = 1.0
260
+ self.resolution = config.resolution
261
+
262
+
263
+ # embeddings
264
+ self.time_embed = nn.Sequential(
265
+ nn.Linear(dim, embed_dim),
266
+ nn.SiLU(),
267
+ nn.Linear(embed_dim, embed_dim))
268
+ if 'image' in self.video_compositions:
269
+ self.pre_image_condition = nn.Sequential(
270
+ nn.Linear(self.context_dim, self.context_dim),
271
+ nn.SiLU(),
272
+ nn.Linear(self.context_dim, self.context_dim*self.num_tokens))
273
+
274
+
275
+ if 'local_image' in self.video_compositions:
276
+ self.local_image_embedding = nn.Sequential(
277
+ nn.Conv2d(3, concat_dim * 4, 3, padding=1),
278
+ nn.SiLU(),
279
+ nn.AdaptiveAvgPool2d((self.resolution[1]//2, self.resolution[0]//2)),
280
+ nn.Conv2d(concat_dim * 4, concat_dim * 4, 3, stride=2, padding=1),
281
+ nn.SiLU(),
282
+ nn.Conv2d(concat_dim * 4, concat_dim, 3, stride=2, padding=1))
283
+ self.local_image_embedding_after = Transformer_v2(heads=2, dim=concat_dim, dim_head_k=concat_dim, dim_head_v=concat_dim, dropout_atte = 0.05, mlp_dim=concat_dim, dropout_ffn = 0.05, depth=adapter_transformer_layers)
284
+
285
+ if 'dwpose' in self.video_compositions:
286
+ self.dwpose_embedding = nn.Sequential(
287
+ nn.Conv2d(3, concat_dim * 4, 3, padding=1),
288
+ nn.SiLU(),
289
+ nn.AdaptiveAvgPool2d((self.resolution[1]//2, self.resolution[0]//2)),
290
+ nn.Conv2d(concat_dim * 4, concat_dim * 4, 3, stride=2, padding=1),
291
+ nn.SiLU(),
292
+ nn.Conv2d(concat_dim * 4, concat_dim, 3, stride=2, padding=1))
293
+ self.dwpose_embedding_after = Transformer_v2(heads=2, dim=concat_dim, dim_head_k=concat_dim, dim_head_v=concat_dim, dropout_atte = 0.05, mlp_dim=concat_dim, dropout_ffn = 0.05, depth=adapter_transformer_layers)
294
+
295
+ if 'randomref_pose' in self.video_compositions:
296
+ randomref_dim = 4
297
+ self.randomref_pose2_embedding = nn.Sequential(
298
+ nn.Conv2d(3, concat_dim * 4, 3, padding=1),
299
+ nn.SiLU(),
300
+ nn.AdaptiveAvgPool2d((self.resolution[1]//2, self.resolution[0]//2)),
301
+ nn.Conv2d(concat_dim * 4, concat_dim * 4, 3, stride=2, padding=1),
302
+ nn.SiLU(),
303
+ nn.Conv2d(concat_dim * 4, concat_dim+randomref_dim, 3, stride=2, padding=1))
304
+ self.randomref_pose2_embedding_after = Transformer_v2(heads=2, dim=concat_dim+randomref_dim, dim_head_k=concat_dim+randomref_dim, dim_head_v=concat_dim+randomref_dim, dropout_atte = 0.05, mlp_dim=concat_dim+randomref_dim, dropout_ffn = 0.05, depth=adapter_transformer_layers)
305
+
306
+ if 'randomref' in self.video_compositions:
307
+ randomref_dim = 4
308
+ self.randomref_embedding2 = nn.Sequential(
309
+ nn.Conv2d(randomref_dim, concat_dim * 4, 3, padding=1),
310
+ nn.SiLU(),
311
+ nn.Conv2d(concat_dim * 4, concat_dim * 4, 3, stride=1, padding=1),
312
+ nn.SiLU(),
313
+ nn.Conv2d(concat_dim * 4, concat_dim+randomref_dim, 3, stride=1, padding=1))
314
+ self.randomref_embedding_after2 = Transformer_v2(heads=2, dim=concat_dim+randomref_dim, dim_head_k=concat_dim+randomref_dim, dim_head_v=concat_dim+randomref_dim, dropout_atte = 0.05, mlp_dim=concat_dim+randomref_dim, dropout_ffn = 0.05, depth=adapter_transformer_layers)
315
+
316
+ ### Condition Dropout
317
+ self.misc_dropout = DropPath(misc_dropout)
318
+
319
+
320
+ if temporal_attention and not USE_TEMPORAL_TRANSFORMER:
321
+ self.rotary_emb = RotaryEmbedding(min(32, head_dim))
322
+ self.time_rel_pos_bias = RelativePositionBias(heads = num_heads, max_distance = 32) # realistically will not be able to generate that many frames of video... yet
323
+
324
+ if self.use_fps_condition:
325
+ self.fps_embedding = nn.Sequential(
326
+ nn.Linear(dim, embed_dim),
327
+ nn.SiLU(),
328
+ nn.Linear(embed_dim, embed_dim))
329
+ nn.init.zeros_(self.fps_embedding[-1].weight)
330
+ nn.init.zeros_(self.fps_embedding[-1].bias)
331
+
332
+ # encoder
333
+ self.input_blocks = nn.ModuleList()
334
+ self.pre_image = nn.Sequential()
335
+ init_block = nn.ModuleList([nn.Conv2d(self.in_dim + concat_dim, dim, 3, padding=1)])
336
+
337
+ #### need an initial temporal attention?
338
+ if temporal_attention:
339
+ if USE_TEMPORAL_TRANSFORMER:
340
+ init_block.append(TemporalTransformer(dim, num_heads, head_dim, depth=transformer_depth, context_dim=context_dim,
341
+ disable_self_attn=disabled_sa, use_linear=use_linear_in_temporal, multiply_zero=use_image_dataset))
342
+ else:
343
+ init_block.append(TemporalAttentionMultiBlock(dim, num_heads, head_dim, rotary_emb=self.rotary_emb, temporal_attn_times=temporal_attn_times, use_image_dataset=use_image_dataset))
344
+
345
+ self.input_blocks.append(init_block)
346
+ shortcut_dims.append(dim)
347
+ for i, (in_dim, out_dim) in enumerate(zip(enc_dims[:-1], enc_dims[1:])):
348
+ for j in range(num_res_blocks):
349
+
350
+ block = nn.ModuleList([ResBlock(in_dim, embed_dim, dropout, out_channels=out_dim, use_scale_shift_norm=False, use_image_dataset=use_image_dataset,)])
351
+
352
+ if scale in attn_scales:
353
+ block.append(
354
+ SpatialTransformer(
355
+ out_dim, out_dim // head_dim, head_dim, depth=1, context_dim=self.context_dim,
356
+ disable_self_attn=False, use_linear=True
357
+ )
358
+ )
359
+ if self.temporal_attention:
360
+ if USE_TEMPORAL_TRANSFORMER:
361
+ block.append(TemporalTransformer(out_dim, out_dim // head_dim, head_dim, depth=transformer_depth, context_dim=context_dim,
362
+ disable_self_attn=disabled_sa, use_linear=use_linear_in_temporal, multiply_zero=use_image_dataset))
363
+ else:
364
+ block.append(TemporalAttentionMultiBlock(out_dim, num_heads, head_dim, rotary_emb = self.rotary_emb, use_image_dataset=use_image_dataset, use_sim_mask=use_sim_mask, temporal_attn_times=temporal_attn_times))
365
+ in_dim = out_dim
366
+ self.input_blocks.append(block)
367
+ shortcut_dims.append(out_dim)
368
+
369
+ # downsample
370
+ if i != len(dim_mult) - 1 and j == num_res_blocks - 1:
371
+ downsample = Downsample(
372
+ out_dim, True, dims=2, out_channels=out_dim
373
+ )
374
+ shortcut_dims.append(out_dim)
375
+ scale /= 2.0
376
+ self.input_blocks.append(downsample)
377
+
378
+ # middle
379
+ self.middle_block = nn.ModuleList([
380
+ ResBlock(out_dim, embed_dim, dropout, use_scale_shift_norm=False, use_image_dataset=use_image_dataset,),
381
+ SpatialTransformer(
382
+ out_dim, out_dim // head_dim, head_dim, depth=1, context_dim=self.context_dim,
383
+ disable_self_attn=False, use_linear=True
384
+ )])
385
+
386
+ if self.temporal_attention:
387
+ if USE_TEMPORAL_TRANSFORMER:
388
+ self.middle_block.append(
389
+ TemporalTransformer(
390
+ out_dim, out_dim // head_dim, head_dim, depth=transformer_depth, context_dim=context_dim,
391
+ disable_self_attn=disabled_sa, use_linear=use_linear_in_temporal,
392
+ multiply_zero=use_image_dataset,
393
+ )
394
+ )
395
+ else:
396
+ self.middle_block.append(TemporalAttentionMultiBlock(out_dim, num_heads, head_dim, rotary_emb = self.rotary_emb, use_image_dataset=use_image_dataset, use_sim_mask=use_sim_mask, temporal_attn_times=temporal_attn_times))
397
+
398
+ self.middle_block.append(ResBlock(out_dim, embed_dim, dropout, use_scale_shift_norm=False))
399
+
400
+
401
+ # decoder
402
+ self.output_blocks = nn.ModuleList()
403
+ for i, (in_dim, out_dim) in enumerate(zip(dec_dims[:-1], dec_dims[1:])):
404
+ for j in range(num_res_blocks + 1):
405
+
406
+ block = nn.ModuleList([ResBlock(in_dim + shortcut_dims.pop(), embed_dim, dropout, out_dim, use_scale_shift_norm=False, use_image_dataset=use_image_dataset, )])
407
+ if scale in attn_scales:
408
+ block.append(
409
+ SpatialTransformer(
410
+ out_dim, out_dim // head_dim, head_dim, depth=1, context_dim=1024,
411
+ disable_self_attn=False, use_linear=True
412
+ )
413
+ )
414
+ if self.temporal_attention:
415
+ if USE_TEMPORAL_TRANSFORMER:
416
+ block.append(
417
+ TemporalTransformer(
418
+ out_dim, out_dim // head_dim, head_dim, depth=transformer_depth, context_dim=context_dim,
419
+ disable_self_attn=disabled_sa, use_linear=use_linear_in_temporal, multiply_zero=use_image_dataset
420
+ )
421
+ )
422
+ else:
423
+ block.append(TemporalAttentionMultiBlock(out_dim, num_heads, head_dim, rotary_emb =self.rotary_emb, use_image_dataset=use_image_dataset, use_sim_mask=use_sim_mask, temporal_attn_times=temporal_attn_times))
424
+ in_dim = out_dim
425
+
426
+ # upsample
427
+ if i != len(dim_mult) - 1 and j == num_res_blocks:
428
+ upsample = Upsample(out_dim, True, dims=2.0, out_channels=out_dim)
429
+ scale *= 2.0
430
+ block.append(upsample)
431
+ self.output_blocks.append(block)
432
+
433
+ # head
434
+ self.out = nn.Sequential(
435
+ nn.GroupNorm(32, out_dim),
436
+ nn.SiLU(),
437
+ nn.Conv2d(out_dim, self.out_dim, 3, padding=1))
438
+
439
+ # zero out the last layer params
440
+ nn.init.zeros_(self.out[-1].weight)
441
+
442
+ def forward(self,
443
+ x,
444
+ t,
445
+ y = None,
446
+ depth = None,
447
+ image = None,
448
+ motion = None,
449
+ local_image = None,
450
+ single_sketch = None,
451
+ masked = None,
452
+ canny = None,
453
+ sketch = None,
454
+ dwpose = None,
455
+ randomref = None,
456
+ histogram = None,
457
+ fps = None,
458
+ video_mask = None,
459
+ focus_present_mask = None,
460
+ prob_focus_present = 0., # probability at which a given batch sample will focus on the present (0. is all off, 1. is completely arrested attention across time)
461
+ mask_last_frame_num = 0 # mask last frame num
462
+ ):
463
+
464
+
465
+ assert self.inpainting or masked is None, 'inpainting is not supported'
466
+
467
+ batch, c, f, h, w = x.shape
468
+ frames = f
469
+ device = x.device
470
+ self.batch = batch
471
+
472
+ #### image and video joint training, if mask_last_frame_num is set, prob_focus_present will be ignored
473
+ if mask_last_frame_num > 0:
474
+ focus_present_mask = None
475
+ video_mask[-mask_last_frame_num:] = False
476
+ else:
477
+ focus_present_mask = default(focus_present_mask, lambda: prob_mask_like((batch,), prob_focus_present, device = device))
478
+
479
+ if self.temporal_attention and not USE_TEMPORAL_TRANSFORMER:
480
+ time_rel_pos_bias = self.time_rel_pos_bias(x.shape[2], device = x.device)
481
+ else:
482
+ time_rel_pos_bias = None
483
+
484
+
485
+ # all-zero and all-keep masks
486
+ zero = torch.zeros(batch, dtype=torch.bool).to(x.device)
487
+ keep = torch.zeros(batch, dtype=torch.bool).to(x.device)
488
+ if self.training:
489
+ nzero = (torch.rand(batch) < self.p_all_zero).sum()
490
+ nkeep = (torch.rand(batch) < self.p_all_keep).sum()
491
+ index = torch.randperm(batch)
492
+ zero[index[0:nzero]] = True
493
+ keep[index[nzero:nzero + nkeep]] = True
494
+ assert not (zero & keep).any()
495
+ misc_dropout = partial(self.misc_dropout, zero = zero, keep = keep)
496
+
497
+
498
+ concat = x.new_zeros(batch, self.concat_dim, f, h, w)
499
+
500
+
501
+ # local_image_embedding (first frame)
502
+ if local_image is not None:
503
+ local_image = rearrange(local_image, 'b c f h w -> (b f) c h w')
504
+ local_image = self.local_image_embedding(local_image)
505
+
506
+ h = local_image.shape[2]
507
+ local_image = self.local_image_embedding_after(rearrange(local_image, '(b f) c h w -> (b h w) f c', b = batch))
508
+ local_image = rearrange(local_image, '(b h w) f c -> b c f h w', b = batch, h = h)
509
+
510
+ concat = concat + misc_dropout(local_image)
511
+
512
+ if dwpose is not None:
513
+ if 'randomref_pose' in self.video_compositions:
514
+ dwpose_random_ref = dwpose[:,:,:1].clone()
515
+ dwpose = dwpose[:,:,1:]
516
+ dwpose = rearrange(dwpose, 'b c f h w -> (b f) c h w')
517
+ dwpose = self.dwpose_embedding(dwpose)
518
+
519
+ h = dwpose.shape[2]
520
+ dwpose = self.dwpose_embedding_after(rearrange(dwpose, '(b f) c h w -> (b h w) f c', b = batch))
521
+ dwpose = rearrange(dwpose, '(b h w) f c -> b c f h w', b = batch, h = h)
522
+ concat = concat + misc_dropout(dwpose)
523
+
524
+ randomref_b = x.new_zeros(batch, self.concat_dim+4, 1, h, w)
525
+ if randomref is not None:
526
+ randomref = rearrange(randomref[:,:,:1,], 'b c f h w -> (b f) c h w')
527
+ randomref = self.randomref_embedding2(randomref)
528
+
529
+ h = randomref.shape[2]
530
+ randomref = self.randomref_embedding_after2(rearrange(randomref, '(b f) c h w -> (b h w) f c', b = batch))
531
+ if 'randomref_pose' in self.video_compositions:
532
+ dwpose_random_ref = rearrange(dwpose_random_ref, 'b c f h w -> (b f) c h w')
533
+ dwpose_random_ref = self.randomref_pose2_embedding(dwpose_random_ref)
534
+ dwpose_random_ref = self.randomref_pose2_embedding_after(rearrange(dwpose_random_ref, '(b f) c h w -> (b h w) f c', b = batch))
535
+ randomref = randomref + dwpose_random_ref
536
+
537
+ randomref_a = rearrange(randomref, '(b h w) f c -> b c f h w', b = batch, h = h)
538
+ randomref_b = randomref_b + randomref_a
539
+
540
+
541
+ x = torch.cat([randomref_b, torch.cat([x, concat], dim=1)], dim=2)
542
+ x = rearrange(x, 'b c f h w -> (b f) c h w')
543
+ x = self.pre_image(x)
544
+ x = rearrange(x, '(b f) c h w -> b c f h w', b = batch)
545
+
546
+ # embeddings
547
+ if self.use_fps_condition and fps is not None:
548
+ e = self.time_embed(sinusoidal_embedding(t, self.dim)) + self.fps_embedding(sinusoidal_embedding(fps, self.dim))
549
+ else:
550
+ e = self.time_embed(sinusoidal_embedding(t, self.dim))
551
+
552
+ context = x.new_zeros(batch, 0, self.context_dim)
553
+
554
+
555
+ if image is not None:
556
+ y_context = self.zero_y.repeat(batch, 1, 1)
557
+ context = torch.cat([context, y_context], dim=1)
558
+
559
+ image_context = misc_dropout(self.pre_image_condition(image).view(-1, self.num_tokens, self.context_dim)) # torch.cat([y[:,:-1,:], self.pre_image_condition(y[:,-1:,:]) ], dim=1)
560
+ context = torch.cat([context, image_context], dim=1)
561
+ else:
562
+ y_context = self.zero_y.repeat(batch, 1, 1)
563
+ context = torch.cat([context, y_context], dim=1)
564
+ image_context = torch.zeros_like(self.zero_y.repeat(batch, 1, 1))[:,:self.num_tokens]
565
+ context = torch.cat([context, image_context], dim=1)
566
+
567
+ # repeat f times for spatial e and context
568
+ e = e.repeat_interleave(repeats=f+1, dim=0)
569
+ context = context.repeat_interleave(repeats=f+1, dim=0)
570
+
571
+
572
+
573
+ ## always in shape (b f) c h w, except for temporal layer
574
+ x = rearrange(x, 'b c f h w -> (b f) c h w')
575
+ # encoder
576
+ xs = []
577
+ for block in self.input_blocks:
578
+ x = self._forward_single(block, x, e, context, time_rel_pos_bias, focus_present_mask, video_mask)
579
+ xs.append(x)
580
+
581
+ # middle
582
+ for block in self.middle_block:
583
+ x = self._forward_single(block, x, e, context, time_rel_pos_bias,focus_present_mask, video_mask)
584
+
585
+ # decoder
586
+ for block in self.output_blocks:
587
+ x = torch.cat([x, xs.pop()], dim=1)
588
+ x = self._forward_single(block, x, e, context, time_rel_pos_bias,focus_present_mask, video_mask, reference=xs[-1] if len(xs) > 0 else None)
589
+
590
+ # head
591
+ x = self.out(x)
592
+
593
+ # reshape back to (b c f h w)
594
+ x = rearrange(x, '(b f) c h w -> b c f h w', b = batch)
595
+ return x[:,:,1:]
596
+
597
+ def _forward_single(self, module, x, e, context, time_rel_pos_bias, focus_present_mask, video_mask, reference=None):
598
+ if isinstance(module, ResidualBlock):
599
+ module = checkpoint_wrapper(module) if self.use_checkpoint else module
600
+ x = x.contiguous()
601
+ x = module(x, e, reference)
602
+ elif isinstance(module, ResBlock):
603
+ module = checkpoint_wrapper(module) if self.use_checkpoint else module
604
+ x = x.contiguous()
605
+ x = module(x, e, self.batch)
606
+ elif isinstance(module, SpatialTransformer):
607
+ module = checkpoint_wrapper(module) if self.use_checkpoint else module
608
+ x = module(x, context)
609
+ elif isinstance(module, TemporalTransformer):
610
+ module = checkpoint_wrapper(module) if self.use_checkpoint else module
611
+ x = rearrange(x, '(b f) c h w -> b c f h w', b = self.batch)
612
+ x = module(x, context)
613
+ x = rearrange(x, 'b c f h w -> (b f) c h w')
614
+ elif isinstance(module, CrossAttention):
615
+ module = checkpoint_wrapper(module) if self.use_checkpoint else module
616
+ x = module(x, context)
617
+ elif isinstance(module, MemoryEfficientCrossAttention):
618
+ module = checkpoint_wrapper(module) if self.use_checkpoint else module
619
+ x = module(x, context)
620
+ elif isinstance(module, BasicTransformerBlock):
621
+ module = checkpoint_wrapper(module) if self.use_checkpoint else module
622
+ x = module(x, context)
623
+ elif isinstance(module, FeedForward):
624
+ x = module(x, context)
625
+ elif isinstance(module, Upsample):
626
+ x = module(x)
627
+ elif isinstance(module, Downsample):
628
+ x = module(x)
629
+ elif isinstance(module, Resample):
630
+ x = module(x, reference)
631
+ elif isinstance(module, TemporalAttentionBlock):
632
+ module = checkpoint_wrapper(module) if self.use_checkpoint else module
633
+ x = rearrange(x, '(b f) c h w -> b c f h w', b = self.batch)
634
+ x = module(x, time_rel_pos_bias, focus_present_mask, video_mask)
635
+ x = rearrange(x, 'b c f h w -> (b f) c h w')
636
+ elif isinstance(module, TemporalAttentionMultiBlock):
637
+ module = checkpoint_wrapper(module) if self.use_checkpoint else module
638
+ x = rearrange(x, '(b f) c h w -> b c f h w', b = self.batch)
639
+ x = module(x, time_rel_pos_bias, focus_present_mask, video_mask)
640
+ x = rearrange(x, 'b c f h w -> (b f) c h w')
641
+ elif isinstance(module, InitTemporalConvBlock):
642
+ module = checkpoint_wrapper(module) if self.use_checkpoint else module
643
+ x = rearrange(x, '(b f) c h w -> b c f h w', b = self.batch)
644
+ x = module(x)
645
+ x = rearrange(x, 'b c f h w -> (b f) c h w')
646
+ elif isinstance(module, TemporalConvBlock):
647
+ module = checkpoint_wrapper(module) if self.use_checkpoint else module
648
+ x = rearrange(x, '(b f) c h w -> b c f h w', b = self.batch)
649
+ x = module(x)
650
+ x = rearrange(x, 'b c f h w -> (b f) c h w')
651
+ elif isinstance(module, nn.ModuleList):
652
+ for block in module:
653
+ x = self._forward_single(block, x, e, context, time_rel_pos_bias, focus_present_mask, video_mask, reference)
654
+ else:
655
+ x = module(x)
656
+ return x
657
+
658
+
659
+
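
A small illustrative sketch (not part of the commit) of the layout convention used by the forward pass and _forward_single above: spatial blocks see frames folded into the batch axis, temporal blocks see an explicit frame axis, and the prepended reference frame is dropped at the output. The shapes below are made-up example values.

import torch
from einops import rearrange

b, c, f, h, w = 2, 8, 17, 32, 16          # f = 1 reference frame + 16 video frames
x = torch.randn(b, c, f, h, w)

spatial = rearrange(x, 'b c f h w -> (b f) c h w')          # per-frame 2D processing
temporal = rearrange(spatial, '(b f) c h w -> b c f h w', b=b)

assert torch.equal(temporal, x)            # the round trip is lossless
print(temporal[:, :, 1:].shape)            # (2, 8, 16, 32, 16): mirrors `return x[:, :, 1:]`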
UniAnimate/tools/modules/unet/util.py ADDED
@@ -0,0 +1,1741 @@
1
+ import math
2
+ import torch
3
+ import xformers
4
+ import open_clip
5
+ import xformers.ops
6
+ import torch.nn as nn
7
+ from torch import einsum
8
+ from einops import rearrange
9
+ from functools import partial
10
+ import torch.nn.functional as F
11
+ import torch.nn.init as init
+ from typing import Any, Optional  # needed for the attention_op annotation in MemoryEfficientCrossAttention
12
+ from rotary_embedding_torch import RotaryEmbedding
13
+ from fairscale.nn.checkpoint import checkpoint_wrapper
14
+
15
+ # from .mha_flash import FlashAttentionBlock
16
+ from utils.registry_class import MODEL
17
+
18
+
19
+ ### load all keys started with prefix and replace them with new_prefix
20
+ def load_Block(state, prefix, new_prefix=None):
21
+ if new_prefix is None:
22
+ new_prefix = prefix
23
+
24
+ state_dict = {}
25
+ state = {key:value for key,value in state.items() if prefix in key}
26
+ for key,value in state.items():
27
+ new_key = key.replace(prefix, new_prefix)
28
+ state_dict[new_key]=value
29
+ return state_dict
30
+
31
+
32
+ def load_2d_pretrained_state_dict(state,cfg):
33
+
34
+ new_state_dict = {}
35
+
36
+ dim = cfg.unet_dim
37
+ num_res_blocks = cfg.unet_res_blocks
38
+ temporal_attention = cfg.temporal_attention
39
+ temporal_conv = cfg.temporal_conv
40
+ dim_mult = cfg.unet_dim_mult
41
+ attn_scales = cfg.unet_attn_scales
42
+
43
+ # params
44
+ enc_dims = [dim * u for u in [1] + dim_mult]
45
+ dec_dims = [dim * u for u in [dim_mult[-1]] + dim_mult[::-1]]
46
+ shortcut_dims = []
47
+ scale = 1.0
48
+
49
+ #embeddings
50
+ state_dict = load_Block(state,prefix=f'time_embedding')
51
+ new_state_dict.update(state_dict)
52
+ state_dict = load_Block(state,prefix=f'y_embedding')
53
+ new_state_dict.update(state_dict)
54
+ state_dict = load_Block(state,prefix=f'context_embedding')
55
+ new_state_dict.update(state_dict)
56
+
57
+ encoder_idx = 0
58
+ ### init block
59
+ state_dict = load_Block(state,prefix=f'encoder.{encoder_idx}',new_prefix=f'encoder.{encoder_idx}.0')
60
+ new_state_dict.update(state_dict)
61
+ encoder_idx += 1
62
+
63
+ shortcut_dims.append(dim)
64
+ for i, (in_dim, out_dim) in enumerate(zip(enc_dims[:-1], enc_dims[1:])):
65
+ for j in range(num_res_blocks):
66
+ # residual (+attention) blocks
67
+ idx = 0
68
+ idx_ = 0
69
+ # residual (+attention) blocks
70
+ state_dict = load_Block(state,prefix=f'encoder.{encoder_idx}.{idx}',new_prefix=f'encoder.{encoder_idx}.{idx_}')
71
+ new_state_dict.update(state_dict)
72
+ idx += 1
73
+ idx_ = 2
74
+
75
+ if scale in attn_scales:
76
+ # block.append(AttentionBlock(out_dim, context_dim, num_heads, head_dim))
77
+ state_dict = load_Block(state,prefix=f'encoder.{encoder_idx}.{idx}',new_prefix=f'encoder.{encoder_idx}.{idx_}')
78
+ new_state_dict.update(state_dict)
79
+ # if temporal_attention:
80
+ # block.append(TemporalAttentionBlock(out_dim, num_heads, head_dim, rotary_emb = self.rotary_emb))
81
+ in_dim = out_dim
82
+ encoder_idx += 1
83
+ shortcut_dims.append(out_dim)
84
+
85
+ # downsample
86
+ if i != len(dim_mult) - 1 and j == num_res_blocks - 1:
87
+ # downsample = ResidualBlock(out_dim, embed_dim, out_dim, use_scale_shift_norm, 0.5, dropout)
88
+ state_dict = load_Block(state,prefix=f'encoder.{encoder_idx}',new_prefix=f'encoder.{encoder_idx}.0')
89
+ new_state_dict.update(state_dict)
90
+
91
+ shortcut_dims.append(out_dim)
92
+ scale /= 2.0
93
+ encoder_idx += 1
94
+
95
+ # middle
96
+ # self.middle = nn.ModuleList([
97
+ # ResidualBlock(out_dim, embed_dim, out_dim, use_scale_shift_norm, 'none'),
98
+ # TemporalConvBlock(out_dim),
99
+ # AttentionBlock(out_dim, context_dim, num_heads, head_dim)])
100
+ # if temporal_attention:
101
+ # self.middle.append(TemporalAttentionBlock(out_dim, num_heads, head_dim, rotary_emb = self.rotary_emb))
102
+ # elif temporal_conv:
103
+ # self.middle.append(TemporalConvBlock(out_dim,dropout=dropout))
104
+ # self.middle.append(ResidualBlock(out_dim, embed_dim, out_dim, use_scale_shift_norm, 'none'))
105
+ # self.middle.append(TemporalConvBlock(out_dim))
106
+
107
+
108
+ # middle
109
+ middle_idx = 0
110
+ # self.middle = nn.ModuleList([
111
+ # ResidualBlock(out_dim, embed_dim, out_dim, use_scale_shift_norm, 1.0, dropout),
112
+ # AttentionBlock(out_dim, context_dim, num_heads, head_dim)])
113
+ state_dict = load_Block(state,prefix=f'middle.{middle_idx}')
114
+ new_state_dict.update(state_dict)
115
+ middle_idx += 2
116
+
117
+ state_dict = load_Block(state,prefix=f'middle.1',new_prefix=f'middle.{middle_idx}')
118
+ new_state_dict.update(state_dict)
119
+ middle_idx += 1
120
+
121
+ for _ in range(cfg.temporal_attn_times):
122
+ # self.middle.append(TemporalAttentionBlock(out_dim, num_heads, head_dim, rotary_emb = self.rotary_emb))
123
+ middle_idx += 1
124
+
125
+ # self.middle.append(ResidualBlock(out_dim, embed_dim, out_dim, use_scale_shift_norm, 1.0, dropout))
126
+ state_dict = load_Block(state,prefix=f'middle.2',new_prefix=f'middle.{middle_idx}')
127
+ new_state_dict.update(state_dict)
128
+ middle_idx += 2
129
+
130
+
131
+ decoder_idx = 0
132
+ for i, (in_dim, out_dim) in enumerate(zip(dec_dims[:-1], dec_dims[1:])):
133
+ for j in range(num_res_blocks + 1):
134
+ idx = 0
135
+ idx_ = 0
136
+ # residual (+attention) blocks
137
+ # block = nn.ModuleList([ResidualBlock(in_dim + shortcut_dims.pop(), embed_dim, out_dim, use_scale_shift_norm, 1.0, dropout)])
138
+ state_dict = load_Block(state,prefix=f'decoder.{decoder_idx}.{idx}',new_prefix=f'decoder.{decoder_idx}.{idx_}')
139
+ new_state_dict.update(state_dict)
140
+ idx += 1
141
+ idx_ += 2
142
+ if scale in attn_scales:
143
+ # block.append(AttentionBlock(out_dim, context_dim, num_heads, head_dim))
144
+ state_dict = load_Block(state,prefix=f'decoder.{decoder_idx}.{idx}',new_prefix=f'decoder.{decoder_idx}.{idx_}')
145
+ new_state_dict.update(state_dict)
146
+ idx += 1
147
+ idx_ += 1
148
+ for _ in range(cfg.temporal_attn_times):
149
+ # block.append(TemporalAttentionBlock(out_dim, num_heads, head_dim, rotary_emb = self.rotary_emb))
150
+ idx_ +=1
151
+
152
+ in_dim = out_dim
153
+
154
+ # upsample
155
+ if i != len(dim_mult) - 1 and j == num_res_blocks:
156
+
157
+ # upsample = ResidualBlock(out_dim, embed_dim, out_dim, use_scale_shift_norm, 2.0, dropout)
158
+ state_dict = load_Block(state,prefix=f'decoder.{decoder_idx}.{idx}',new_prefix=f'decoder.{decoder_idx}.{idx_}')
159
+ new_state_dict.update(state_dict)
160
+ idx += 1
161
+ idx_ += 2
162
+
163
+ scale *= 2.0
164
+ # block.append(upsample)
165
+ # self.decoder.append(block)
166
+ decoder_idx += 1
167
+
168
+ # head
169
+ # self.head = nn.Sequential(
170
+ # nn.GroupNorm(32, out_dim),
171
+ # nn.SiLU(),
172
+ # nn.Conv3d(out_dim, self.out_dim, (1,3,3), padding=(0,1,1)))
173
+ state_dict = load_Block(state,prefix=f'head')
174
+ new_state_dict.update(state_dict)
175
+
176
+ return new_state_dict
177
+
178
+ def sinusoidal_embedding(timesteps, dim):
179
+ # check input
180
+ half = dim // 2
181
+ timesteps = timesteps.float()
182
+
183
+ # compute sinusoidal embedding
184
+ sinusoid = torch.outer(
185
+ timesteps,
186
+ torch.pow(10000, -torch.arange(half).to(timesteps).div(half)))
187
+ x = torch.cat([torch.cos(sinusoid), torch.sin(sinusoid)], dim=1)
188
+ if dim % 2 != 0:
189
+ x = torch.cat([x, torch.zeros_like(x[:, :1])], dim=1)
190
+ return x
191
+
192
+ def exists(x):
193
+ return x is not None
194
+
195
+ def default(val, d):
196
+ if exists(val):
197
+ return val
198
+ return d() if callable(d) else d
199
+
200
+ def prob_mask_like(shape, prob, device):
201
+ if prob == 1:
202
+ return torch.ones(shape, device = device, dtype = torch.bool)
203
+ elif prob == 0:
204
+ return torch.zeros(shape, device = device, dtype = torch.bool)
205
+ else:
206
+ mask = torch.zeros(shape, device = device).float().uniform_(0, 1) < prob
207
+ ### aviod mask all, which will cause find_unused_parameters error
208
+ if mask.all():
209
+ mask[0]=False
210
+ return mask
211
+
212
+
213
+ class MemoryEfficientCrossAttention(nn.Module):
214
+ # https://github.com/MatthieuTPHR/diffusers/blob/d80b531ff8060ec1ea982b65a1b8df70f73aa67c/src/diffusers/models/attention.py#L223
215
+ def __init__(self, query_dim, max_bs=4096, context_dim=None, heads=8, dim_head=64, dropout=0.0):
216
+ super().__init__()
217
+ inner_dim = dim_head * heads
218
+ context_dim = default(context_dim, query_dim)
219
+
220
+ self.max_bs = max_bs
221
+ self.heads = heads
222
+ self.dim_head = dim_head
223
+
224
+ self.to_q = nn.Linear(query_dim, inner_dim, bias=False)
225
+ self.to_k = nn.Linear(context_dim, inner_dim, bias=False)
226
+ self.to_v = nn.Linear(context_dim, inner_dim, bias=False)
227
+
228
+ self.to_out = nn.Sequential(nn.Linear(inner_dim, query_dim), nn.Dropout(dropout))
229
+ self.attention_op: Optional[Any] = None
230
+
231
+ def forward(self, x, context=None, mask=None):
232
+ q = self.to_q(x)
233
+ context = default(context, x)
234
+ k = self.to_k(context)
235
+ v = self.to_v(context)
236
+
237
+ b, _, _ = q.shape
238
+ q, k, v = map(
239
+ lambda t: t.unsqueeze(3)
240
+ .reshape(b, t.shape[1], self.heads, self.dim_head)
241
+ .permute(0, 2, 1, 3)
242
+ .reshape(b * self.heads, t.shape[1], self.dim_head)
243
+ .contiguous(),
244
+ (q, k, v),
245
+ )
246
+
247
+ # actually compute the attention, what we cannot get enough of
248
+ if q.shape[0] > self.max_bs:
249
+ q_list = torch.chunk(q, q.shape[0] // self.max_bs, dim=0)
250
+ k_list = torch.chunk(k, k.shape[0] // self.max_bs, dim=0)
251
+ v_list = torch.chunk(v, v.shape[0] // self.max_bs, dim=0)
252
+ out_list = []
253
+ for q_1, k_1, v_1 in zip(q_list, k_list, v_list):
254
+ out = xformers.ops.memory_efficient_attention(
255
+ q_1, k_1, v_1, attn_bias=None, op=self.attention_op)
256
+ out_list.append(out)
257
+ out = torch.cat(out_list, dim=0)
258
+ else:
259
+ out = xformers.ops.memory_efficient_attention(q, k, v, attn_bias=None, op=self.attention_op)
260
+
261
+ if exists(mask):
262
+ raise NotImplementedError
263
+ out = (
264
+ out.unsqueeze(0)
265
+ .reshape(b, self.heads, out.shape[1], self.dim_head)
266
+ .permute(0, 2, 1, 3)
267
+ .reshape(b, out.shape[1], self.heads * self.dim_head)
268
+ )
269
+ return self.to_out(out)
270
+
271
+ class RelativePositionBias(nn.Module):
272
+ def __init__(
273
+ self,
274
+ heads = 8,
275
+ num_buckets = 32,
276
+ max_distance = 128
277
+ ):
278
+ super().__init__()
279
+ self.num_buckets = num_buckets
280
+ self.max_distance = max_distance
281
+ self.relative_attention_bias = nn.Embedding(num_buckets, heads)
282
+
283
+ @staticmethod
284
+ def _relative_position_bucket(relative_position, num_buckets = 32, max_distance = 128):
285
+ ret = 0
286
+ n = -relative_position
287
+
288
+ num_buckets //= 2
289
+ ret += (n < 0).long() * num_buckets
290
+ n = torch.abs(n)
291
+
292
+ max_exact = num_buckets // 2
293
+ is_small = n < max_exact
294
+
295
+ val_if_large = max_exact + (
296
+ torch.log(n.float() / max_exact) / math.log(max_distance / max_exact) * (num_buckets - max_exact)
297
+ ).long()
298
+ val_if_large = torch.min(val_if_large, torch.full_like(val_if_large, num_buckets - 1))
299
+
300
+ ret += torch.where(is_small, n, val_if_large)
301
+ return ret
302
+
303
+ def forward(self, n, device):
304
+ q_pos = torch.arange(n, dtype = torch.long, device = device)
305
+ k_pos = torch.arange(n, dtype = torch.long, device = device)
306
+ rel_pos = rearrange(k_pos, 'j -> 1 j') - rearrange(q_pos, 'i -> i 1')
307
+ rp_bucket = self._relative_position_bucket(rel_pos, num_buckets = self.num_buckets, max_distance = self.max_distance)
308
+ values = self.relative_attention_bias(rp_bucket)
309
+ return rearrange(values, 'i j h -> h i j')
310
+
311
+ class SpatialTransformer(nn.Module):
312
+ """
313
+ Transformer block for image-like data.
314
+ First, project the input (aka embedding)
315
+ and reshape to b, t, d.
316
+ Then apply standard transformer action.
317
+ Finally, reshape to image
318
+ NEW: use_linear for more efficiency instead of the 1x1 convs
319
+ """
320
+ def __init__(self, in_channels, n_heads, d_head,
321
+ depth=1, dropout=0., context_dim=None,
322
+ disable_self_attn=False, use_linear=False,
323
+ use_checkpoint=True):
324
+ super().__init__()
325
+ if exists(context_dim) and not isinstance(context_dim, list):
326
+ context_dim = [context_dim]
327
+ self.in_channels = in_channels
328
+ inner_dim = n_heads * d_head
329
+ self.norm = torch.nn.GroupNorm(num_groups=32, num_channels=in_channels, eps=1e-6, affine=True)
330
+ if not use_linear:
331
+ self.proj_in = nn.Conv2d(in_channels,
332
+ inner_dim,
333
+ kernel_size=1,
334
+ stride=1,
335
+ padding=0)
336
+ else:
337
+ self.proj_in = nn.Linear(in_channels, inner_dim)
338
+
339
+ self.transformer_blocks = nn.ModuleList(
340
+ [BasicTransformerBlock(inner_dim, n_heads, d_head, dropout=dropout, context_dim=context_dim[d],
341
+ disable_self_attn=disable_self_attn, checkpoint=use_checkpoint)
342
+ for d in range(depth)]
343
+ )
344
+ if not use_linear:
345
+ self.proj_out = zero_module(nn.Conv2d(inner_dim,
346
+ in_channels,
347
+ kernel_size=1,
348
+ stride=1,
349
+ padding=0))
350
+ else:
351
+ self.proj_out = zero_module(nn.Linear(in_channels, inner_dim))
352
+ self.use_linear = use_linear
353
+
354
+ def forward(self, x, context=None):
355
+ # note: if no context is given, cross-attention defaults to self-attention
356
+ if not isinstance(context, list):
357
+ context = [context]
358
+ b, c, h, w = x.shape
359
+ x_in = x
360
+ x = self.norm(x)
361
+ if not self.use_linear:
362
+ x = self.proj_in(x)
363
+ x = rearrange(x, 'b c h w -> b (h w) c').contiguous()
364
+ if self.use_linear:
365
+ x = self.proj_in(x)
366
+ for i, block in enumerate(self.transformer_blocks):
367
+ x = block(x, context=context[i])
368
+ if self.use_linear:
369
+ x = self.proj_out(x)
370
+ x = rearrange(x, 'b (h w) c -> b c h w', h=h, w=w).contiguous()
371
+ if not self.use_linear:
372
+ x = self.proj_out(x)
373
+ return x + x_in
374
+
375
+
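For orientation, a minimal shape-level sketch of driving SpatialTransformer (it assumes a CUDA device with xformers installed, since BasicTransformerBlock builds on MemoryEfficientCrossAttention; all sizes are illustrative, not taken from the configs):

import torch

st = SpatialTransformer(in_channels=320, n_heads=8, d_head=40, depth=1, context_dim=1024).cuda()
x = torch.randn(2, 320, 32, 32, device='cuda')       # 2 feature maps, attention runs over 32*32 tokens
context = torch.randn(2, 77, 1024, device='cuda')    # e.g. 77 conditioning tokens per sample
y = st(x, context=context)
print(y.shape)   # torch.Size([2, 320, 32, 32]); proj_out is zero-initialised, so y == x right after init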
376
+ class SpatialTransformerWithAdapter(nn.Module):
377
+ """
378
+ Transformer block for image-like data.
379
+ First, project the input (aka embedding)
380
+ and reshape to b, t, d.
381
+ Then apply standard transformer action.
382
+ Finally, reshape to image
383
+ NEW: use_linear for more efficiency instead of the 1x1 convs
384
+ """
385
+ def __init__(self, in_channels, n_heads, d_head,
386
+ depth=1, dropout=0., context_dim=None,
387
+ disable_self_attn=False, use_linear=False,
388
+ use_checkpoint=True,
389
+ adapter_list=[], adapter_position_list=['', 'parallel', ''],
390
+ adapter_hidden_dim=None):
391
+ super().__init__()
392
+ if exists(context_dim) and not isinstance(context_dim, list):
393
+ context_dim = [context_dim]
394
+ self.in_channels = in_channels
395
+ inner_dim = n_heads * d_head
396
+ self.norm = torch.nn.GroupNorm(num_groups=32, num_channels=in_channels, eps=1e-6, affine=True)
397
+ if not use_linear:
398
+ self.proj_in = nn.Conv2d(in_channels,
399
+ inner_dim,
400
+ kernel_size=1,
401
+ stride=1,
402
+ padding=0)
403
+ else:
404
+ self.proj_in = nn.Linear(in_channels, inner_dim)
405
+
406
+ self.transformer_blocks = nn.ModuleList(
407
+ [BasicTransformerBlockWithAdapter(inner_dim, n_heads, d_head, dropout=dropout, context_dim=context_dim[d],
408
+ disable_self_attn=disable_self_attn, checkpoint=use_checkpoint,
409
+ adapter_list=adapter_list, adapter_position_list=adapter_position_list,
410
+ adapter_hidden_dim=adapter_hidden_dim)
411
+ for d in range(depth)]
412
+ )
413
+ if not use_linear:
414
+ self.proj_out = zero_module(nn.Conv2d(inner_dim,
415
+ in_channels,
416
+ kernel_size=1,
417
+ stride=1,
418
+ padding=0))
419
+ else:
420
+ self.proj_out = zero_module(nn.Linear(in_channels, inner_dim))
421
+ self.use_linear = use_linear
422
+
423
+ def forward(self, x, context=None):
424
+ # note: if no context is given, cross-attention defaults to self-attention
425
+ if not isinstance(context, list):
426
+ context = [context]
427
+ b, c, h, w = x.shape
428
+ x_in = x
429
+ x = self.norm(x)
430
+ if not self.use_linear:
431
+ x = self.proj_in(x)
432
+ x = rearrange(x, 'b c h w -> b (h w) c').contiguous()
433
+ if self.use_linear:
434
+ x = self.proj_in(x)
435
+ for i, block in enumerate(self.transformer_blocks):
436
+ x = block(x, context=context[i])
437
+ if self.use_linear:
438
+ x = self.proj_out(x)
439
+ x = rearrange(x, 'b (h w) c -> b c h w', h=h, w=w).contiguous()
440
+ if not self.use_linear:
441
+ x = self.proj_out(x)
442
+ return x + x_in
443
+
444
+ import os
445
+ _ATTN_PRECISION = os.environ.get("ATTN_PRECISION", "fp32")
446
+
447
+ class CrossAttention(nn.Module):
448
+ def __init__(self, query_dim, context_dim=None, heads=8, dim_head=64, dropout=0.):
449
+ super().__init__()
450
+ inner_dim = dim_head * heads
451
+ context_dim = default(context_dim, query_dim)
452
+
453
+ self.scale = dim_head ** -0.5
454
+ self.heads = heads
455
+
456
+ self.to_q = nn.Linear(query_dim, inner_dim, bias=False)
457
+ self.to_k = nn.Linear(context_dim, inner_dim, bias=False)
458
+ self.to_v = nn.Linear(context_dim, inner_dim, bias=False)
459
+
460
+ self.to_out = nn.Sequential(
461
+ nn.Linear(inner_dim, query_dim),
462
+ nn.Dropout(dropout)
463
+ )
464
+
465
+ def forward(self, x, context=None, mask=None):
466
+ h = self.heads
467
+
468
+ q = self.to_q(x)
469
+ context = default(context, x)
470
+ k = self.to_k(context)
471
+ v = self.to_v(context)
472
+
473
+ q, k, v = map(lambda t: rearrange(t, 'b n (h d) -> (b h) n d', h=h), (q, k, v))
474
+
475
+ # force cast to fp32 to avoid overflowing
476
+ if _ATTN_PRECISION =="fp32":
477
+ with torch.autocast(enabled=False, device_type = 'cuda'):
478
+ q, k = q.float(), k.float()
479
+ sim = torch.einsum('b i d, b j d -> b i j', q, k) * self.scale
480
+ else:
481
+ sim = torch.einsum('b i d, b j d -> b i j', q, k) * self.scale
482
+
483
+ del q, k
484
+
485
+ if exists(mask):
486
+ mask = rearrange(mask, 'b ... -> b (...)')
487
+ max_neg_value = -torch.finfo(sim.dtype).max
488
+ mask = repeat(mask, 'b j -> (b h) () j', h=h)
489
+ sim.masked_fill_(~mask, max_neg_value)
490
+
491
+ # attention, what we cannot get enough of
492
+ sim = sim.softmax(dim=-1)
493
+
494
+ out = torch.einsum('b i j, b j d -> b i d', sim, v)
495
+ out = rearrange(out, '(b h) n d -> b n (h d)', h=h)
496
+ return self.to_out(out)
497
+
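CrossAttention above is the plain-PyTorch fallback that needs no xformers; a minimal sketch with illustrative sizes:

import torch

attn = CrossAttention(query_dim=320, context_dim=768, heads=8, dim_head=40)
x = torch.randn(1, 4096, 320)    # 64x64 spatial tokens, flattened
ctx = torch.randn(1, 77, 768)    # conditioning tokens
out = attn(x, context=ctx)       # omit context to get self-attention
print(out.shape)                 # torch.Size([1, 4096, 320])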
498
+
499
+ class Adapter(nn.Module):
500
+ def __init__(self, in_dim, hidden_dim, condition_dim=None):
501
+ super().__init__()
502
+ self.down_linear = nn.Linear(in_dim, hidden_dim)
503
+ self.up_linear = nn.Linear(hidden_dim, in_dim)
504
+ self.condition_dim = condition_dim
505
+ if condition_dim is not None:
506
+ self.condition_linear = nn.Linear(condition_dim, in_dim)
507
+
508
+ init.zeros_(self.up_linear.weight)
509
+ init.zeros_(self.up_linear.bias)
510
+
511
+ def forward(self, x, condition=None, condition_lam=1):
512
+ x_in = x
513
+ if self.condition_dim is not None and condition is not None:
514
+ x = x + condition_lam * self.condition_linear(condition)
515
+ x = self.down_linear(x)
516
+ x = F.gelu(x)
517
+ x = self.up_linear(x)
518
+ x += x_in
519
+ return x
520
+
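Because up_linear is zero-initialised, a freshly built Adapter is an exact identity, which is what makes it safe to attach to a pretrained block; a minimal sketch (dimensions are illustrative):

import torch

adapter = Adapter(in_dim=320, hidden_dim=160)
x = torch.randn(2, 77, 320)
assert torch.equal(adapter(x), x)   # identity right after init

cond_adapter = Adapter(in_dim=320, hidden_dim=160, condition_dim=512)
c = torch.randn(2, 77, 512)
y = cond_adapter(x, condition=c, condition_lam=0.5)   # condition is mixed in before the bottleneck
print(y.shape)   # torch.Size([2, 77, 320])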
521
+
522
+ class MemoryEfficientCrossAttention_attemask(nn.Module):
523
+ # https://github.com/MatthieuTPHR/diffusers/blob/d80b531ff8060ec1ea982b65a1b8df70f73aa67c/src/diffusers/models/attention.py#L223
524
+ def __init__(self, query_dim, context_dim=None, heads=8, dim_head=64, dropout=0.0):
525
+ super().__init__()
526
+ inner_dim = dim_head * heads
527
+ context_dim = default(context_dim, query_dim)
528
+
529
+ self.heads = heads
530
+ self.dim_head = dim_head
531
+
532
+ self.to_q = nn.Linear(query_dim, inner_dim, bias=False)
533
+ self.to_k = nn.Linear(context_dim, inner_dim, bias=False)
534
+ self.to_v = nn.Linear(context_dim, inner_dim, bias=False)
535
+
536
+ self.to_out = nn.Sequential(nn.Linear(inner_dim, query_dim), nn.Dropout(dropout))
537
+ self.attention_op: Optional[Any] = None
538
+
539
+ def forward(self, x, context=None, mask=None):
540
+ q = self.to_q(x)
541
+ context = default(context, x)
542
+ k = self.to_k(context)
543
+ v = self.to_v(context)
544
+
545
+ b, _, _ = q.shape
546
+ q, k, v = map(
547
+ lambda t: t.unsqueeze(3)
548
+ .reshape(b, t.shape[1], self.heads, self.dim_head)
549
+ .permute(0, 2, 1, 3)
550
+ .reshape(b * self.heads, t.shape[1], self.dim_head)
551
+ .contiguous(),
552
+ (q, k, v),
553
+ )
554
+
555
+ # actually compute the attention, what we cannot get enough of
556
+ out = xformers.ops.memory_efficient_attention(q, k, v, attn_bias=xformers.ops.LowerTriangularMask(), op=self.attention_op)
557
+
558
+ if exists(mask):
559
+ raise NotImplementedError
560
+ out = (
561
+ out.unsqueeze(0)
562
+ .reshape(b, self.heads, out.shape[1], self.dim_head)
563
+ .permute(0, 2, 1, 3)
564
+ .reshape(b, out.shape[1], self.heads * self.dim_head)
565
+ )
566
+ return self.to_out(out)
567
+
568
+
569
+
570
+ class BasicTransformerBlock_attemask(nn.Module):
571
+ # ATTENTION_MODES = {
572
+ # "softmax": CrossAttention, # vanilla attention
573
+ # "softmax-xformers": MemoryEfficientCrossAttention
574
+ # }
575
+ def __init__(self, dim, n_heads, d_head, dropout=0., context_dim=None, gated_ff=True, checkpoint=True,
576
+ disable_self_attn=False):
577
+ super().__init__()
578
+ # attn_mode = "softmax-xformers" if XFORMERS_IS_AVAILBLE else "softmax"
579
+ # assert attn_mode in self.ATTENTION_MODES
580
+ # attn_cls = CrossAttention
581
+ attn_cls = MemoryEfficientCrossAttention_attemask
582
+ self.disable_self_attn = disable_self_attn
583
+ self.attn1 = attn_cls(query_dim=dim, heads=n_heads, dim_head=d_head, dropout=dropout,
584
+ context_dim=context_dim if self.disable_self_attn else None) # is a self-attention if not self.disable_self_attn
585
+ self.ff = FeedForward(dim, dropout=dropout, glu=gated_ff)
586
+ self.attn2 = attn_cls(query_dim=dim, context_dim=context_dim,
587
+ heads=n_heads, dim_head=d_head, dropout=dropout) # is self-attn if context is none
588
+ self.norm1 = nn.LayerNorm(dim)
589
+ self.norm2 = nn.LayerNorm(dim)
590
+ self.norm3 = nn.LayerNorm(dim)
591
+ self.checkpoint = checkpoint
592
+
593
+ def forward_(self, x, context=None):
594
+ return checkpoint(self._forward, (x, context), self.parameters(), self.checkpoint)
595
+
596
+ def forward(self, x, context=None):
597
+ x = self.attn1(self.norm1(x), context=context if self.disable_self_attn else None) + x
598
+ x = self.attn2(self.norm2(x), context=context) + x
599
+ x = self.ff(self.norm3(x)) + x
600
+ return x
601
+
602
+
603
+ class BasicTransformerBlockWithAdapter(nn.Module):
604
+ # ATTENTION_MODES = {
605
+ # "softmax": CrossAttention, # vanilla attention
606
+ # "softmax-xformers": MemoryEfficientCrossAttention
607
+ # }
608
+ def __init__(self, dim, n_heads, d_head, dropout=0., context_dim=None, gated_ff=True, checkpoint=True, disable_self_attn=False,
609
+ adapter_list=[], adapter_position_list=['parallel', 'parallel', 'parallel'], adapter_hidden_dim=None, adapter_condition_dim=None
610
+ ):
611
+ super().__init__()
612
+ # attn_mode = "softmax-xformers" if XFORMERS_IS_AVAILBLE else "softmax"
613
+ # assert attn_mode in self.ATTENTION_MODES
614
+ # attn_cls = CrossAttention
615
+ attn_cls = MemoryEfficientCrossAttention
616
+ self.disable_self_attn = disable_self_attn
617
+ self.attn1 = attn_cls(query_dim=dim, heads=n_heads, dim_head=d_head, dropout=dropout,
618
+ context_dim=context_dim if self.disable_self_attn else None) # is a self-attention if not self.disable_self_attn
619
+ self.ff = FeedForward(dim, dropout=dropout, glu=gated_ff)
620
+ self.attn2 = attn_cls(query_dim=dim, context_dim=context_dim,
621
+ heads=n_heads, dim_head=d_head, dropout=dropout) # is self-attn if context is none
622
+ self.norm1 = nn.LayerNorm(dim)
623
+ self.norm2 = nn.LayerNorm(dim)
624
+ self.norm3 = nn.LayerNorm(dim)
625
+ self.checkpoint = checkpoint
626
+ # adapter
627
+ self.adapter_list = adapter_list
628
+ self.adapter_position_list = adapter_position_list
629
+ hidden_dim = dim//2 if not adapter_hidden_dim else adapter_hidden_dim
630
+ if "self_attention" in adapter_list:
631
+ self.attn_adapter = Adapter(dim, hidden_dim, adapter_condition_dim)
632
+ if "cross_attention" in adapter_list:
633
+ self.cross_attn_adapter = Adapter(dim, hidden_dim, adapter_condition_dim)
634
+ if "feedforward" in adapter_list:
635
+ self.ff_adapter = Adapter(dim, hidden_dim, adapter_condition_dim)
636
+
637
+
638
+ def forward_(self, x, context=None, adapter_condition=None, adapter_condition_lam=1):
639
+ return checkpoint(self._forward, (x, context, adapter_condition, adapter_condition_lam), self.parameters(), self.checkpoint)
640
+
641
+ def forward(self, x, context=None, adapter_condition=None, adapter_condition_lam=1):
642
+ if "self_attention" in self.adapter_list:
643
+ if self.adapter_position_list[0] == 'parallel':
644
+ # parallel
645
+ x = self.attn1(self.norm1(x), context=context if self.disable_self_attn else None) + self.attn_adapter(x, adapter_condition, adapter_condition_lam)
646
+ elif self.adapter_position_list[0] == 'serial':
647
+ # serial
648
+ x = self.attn1(self.norm1(x), context=context if self.disable_self_attn else None) + x
649
+ x = self.attn_adapter(x, adapter_condition, adapter_condition_lam)
650
+ else:
651
+ x = self.attn1(self.norm1(x), context=context if self.disable_self_attn else None) + x
652
+
653
+ if "cross_attention" in self.adapter_list:
654
+ if self.adapter_position_list[1] == 'parallel':
655
+ # parallel
656
+ x = self.attn2(self.norm2(x), context=context) + self.cross_attn_adapter(x, adapter_condition, adapter_condition_lam)
657
+ elif self.adapter_position_list[1] == 'serial':
658
+ x = self.attn2(self.norm2(x), context=context) + x
659
+ x = self.cross_attn_adapter(x, adapter_condition, adapter_condition_lam)
660
+ else:
661
+ x = self.attn2(self.norm2(x), context=context) + x
662
+
663
+ if "feedforward" in self.adapter_list:
664
+ if self.adapter_position_list[2] == 'parallel':
665
+ x = self.ff(self.norm3(x)) + self.ff_adapter(x, adapter_condition, adapter_condition_lam)
666
+ elif self.adapter_position_list[2] == 'serial':
667
+ x = self.ff(self.norm3(x)) + x
668
+ x = self.ff_adapter(x, adapter_condition, adapter_condition_lam)
669
+ else:
670
+ x = self.ff(self.norm3(x)) + x
671
+
672
+ return x
673
+
674
+ class BasicTransformerBlock(nn.Module):
675
+ # ATTENTION_MODES = {
676
+ # "softmax": CrossAttention, # vanilla attention
677
+ # "softmax-xformers": MemoryEfficientCrossAttention
678
+ # }
679
+ def __init__(self, dim, n_heads, d_head, dropout=0., context_dim=None, gated_ff=True, checkpoint=True,
680
+ disable_self_attn=False):
681
+ super().__init__()
682
+ # attn_mode = "softmax-xformers" if XFORMERS_IS_AVAILBLE else "softmax"
683
+ # assert attn_mode in self.ATTENTION_MODES
684
+ # attn_cls = CrossAttention
685
+ attn_cls = MemoryEfficientCrossAttention
686
+ self.disable_self_attn = disable_self_attn
687
+ self.attn1 = attn_cls(query_dim=dim, heads=n_heads, dim_head=d_head, dropout=dropout,
688
+ context_dim=context_dim if self.disable_self_attn else None) # is a self-attention if not self.disable_self_attn
689
+ self.ff = FeedForward(dim, dropout=dropout, glu=gated_ff)
690
+ self.attn2 = attn_cls(query_dim=dim, context_dim=context_dim,
691
+ heads=n_heads, dim_head=d_head, dropout=dropout) # is self-attn if context is none
692
+ self.norm1 = nn.LayerNorm(dim)
693
+ self.norm2 = nn.LayerNorm(dim)
694
+ self.norm3 = nn.LayerNorm(dim)
695
+ self.checkpoint = checkpoint
696
+
697
+ def forward_(self, x, context=None):
698
+ return checkpoint(self._forward, (x, context), self.parameters(), self.checkpoint)
699
+
700
+ def forward(self, x, context=None):
701
+ x = self.attn1(self.norm1(x), context=context if self.disable_self_attn else None) + x
702
+ x = self.attn2(self.norm2(x), context=context) + x
703
+ x = self.ff(self.norm3(x)) + x
704
+ return x
705
+
706
+ # feedforward
707
+ class GEGLU(nn.Module):
708
+ def __init__(self, dim_in, dim_out):
709
+ super().__init__()
710
+ self.proj = nn.Linear(dim_in, dim_out * 2)
711
+
712
+ def forward(self, x):
713
+ x, gate = self.proj(x).chunk(2, dim=-1)
714
+ return x * F.gelu(gate)
715
+
716
+ def zero_module(module):
717
+ """
718
+ Zero out the parameters of a module and return it.
719
+ """
720
+ for p in module.parameters():
721
+ p.detach().zero_()
722
+ return module
723
+
724
+ class FeedForward(nn.Module):
725
+ def __init__(self, dim, dim_out=None, mult=4, glu=False, dropout=0.):
726
+ super().__init__()
727
+ inner_dim = int(dim * mult)
728
+ dim_out = default(dim_out, dim)
729
+ project_in = nn.Sequential(
730
+ nn.Linear(dim, inner_dim),
731
+ nn.GELU()
732
+ ) if not glu else GEGLU(dim, inner_dim)
733
+
734
+ self.net = nn.Sequential(
735
+ project_in,
736
+ nn.Dropout(dropout),
737
+ nn.Linear(inner_dim, dim_out)
738
+ )
739
+
740
+ def forward(self, x):
741
+ return self.net(x)
742
+
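FeedForward(..., glu=True) selects the GEGLU path, which doubles the inner projection and gates one half with GELU of the other; a short sketch with illustrative sizes:

import torch

ff = FeedForward(dim=320, mult=4, glu=True, dropout=0.0)   # GEGLU: Linear(320 -> 2*1280), gate, Linear(1280 -> 320)
x = torch.randn(2, 77, 320)
print(ff(x).shape)   # torch.Size([2, 77, 320])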
743
+ class Upsample(nn.Module):
744
+ """
745
+ An upsampling layer with an optional convolution.
746
+ :param channels: channels in the inputs and outputs.
747
+ :param use_conv: a bool determining if a convolution is applied.
748
+ :param dims: determines if the signal is 1D, 2D, or 3D. If 3D, then
749
+ upsampling occurs in the inner-two dimensions.
750
+ """
751
+
752
+ def __init__(self, channels, use_conv, dims=2, out_channels=None, padding=1):
753
+ super().__init__()
754
+ self.channels = channels
755
+ self.out_channels = out_channels or channels
756
+ self.use_conv = use_conv
757
+ self.dims = dims
758
+ if use_conv:
759
+ self.conv = nn.Conv2d(self.channels, self.out_channels, 3, padding=padding)
760
+
761
+ def forward(self, x):
762
+ assert x.shape[1] == self.channels
763
+ if self.dims == 3:
764
+ x = F.interpolate(
765
+ x, (x.shape[2], x.shape[3] * 2, x.shape[4] * 2), mode="nearest"
766
+ )
767
+ else:
768
+ x = F.interpolate(x, scale_factor=2, mode="nearest")
769
+ if self.use_conv:
770
+ x = self.conv(x)
771
+ return x
772
+
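Upsample doubles only the two spatial dimensions with nearest-neighbour interpolation (the frame axis is left alone when dims=3); a one-line sketch with illustrative sizes:

import torch

up = Upsample(channels=320, use_conv=True)
print(up(torch.randn(1, 320, 32, 32)).shape)   # torch.Size([1, 320, 64, 64])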
773
+
774
+ class UpsampleSR600(nn.Module):
775
+ """
776
+ An upsampling layer with an optional convolution.
777
+ :param channels: channels in the inputs and outputs.
778
+ :param use_conv: a bool determining if a convolution is applied.
779
+ :param dims: determines if the signal is 1D, 2D, or 3D. If 3D, then
780
+ upsampling occurs in the inner-two dimensions.
781
+ """
782
+
783
+ def __init__(self, channels, use_conv, dims=2, out_channels=None, padding=1):
784
+ super().__init__()
785
+ self.channels = channels
786
+ self.out_channels = out_channels or channels
787
+ self.use_conv = use_conv
788
+ self.dims = dims
789
+ if use_conv:
790
+ self.conv = nn.Conv2d(self.channels, self.out_channels, 3, padding=padding)
791
+
792
+ def forward(self, x):
793
+ assert x.shape[1] == self.channels
794
+ if self.dims == 3:
795
+ x = F.interpolate(
796
+ x, (x.shape[2], x.shape[3] * 2, x.shape[4] * 2), mode="nearest"
797
+ )
798
+ else:
799
+ x = F.interpolate(x, scale_factor=2, mode="nearest")
800
+ # TODO: to match the spatial size of input_blocks, trim the first and last rows of the height dimension
801
+ x = x[..., 1:-1, :]
802
+ if self.use_conv:
803
+ x = self.conv(x)
804
+ return x
805
+
806
+
807
+ class ResBlock(nn.Module):
808
+ """
809
+ A residual block that can optionally change the number of channels.
810
+ :param channels: the number of input channels.
811
+ :param emb_channels: the number of timestep embedding channels.
812
+ :param dropout: the rate of dropout.
813
+ :param out_channels: if specified, the number of out channels.
814
+ :param use_conv: if True and out_channels is specified, use a spatial
815
+ convolution instead of a smaller 1x1 convolution to change the
816
+ channels in the skip connection.
817
+ :param dims: determines if the signal is 1D, 2D, or 3D.
818
+ :param use_checkpoint: if True, use gradient checkpointing on this module.
819
+ :param up: if True, use this block for upsampling.
820
+ :param down: if True, use this block for downsampling.
821
+ """
822
+ def __init__(
823
+ self,
824
+ channels,
825
+ emb_channels,
826
+ dropout,
827
+ out_channels=None,
828
+ use_conv=False,
829
+ use_scale_shift_norm=False,
830
+ dims=2,
831
+ up=False,
832
+ down=False,
833
+ use_temporal_conv=True,
834
+ use_image_dataset=False,
835
+ ):
836
+ super().__init__()
837
+ self.channels = channels
838
+ self.emb_channels = emb_channels
839
+ self.dropout = dropout
840
+ self.out_channels = out_channels or channels
841
+ self.use_conv = use_conv
842
+ self.use_scale_shift_norm = use_scale_shift_norm
843
+ self.use_temporal_conv = use_temporal_conv
844
+
845
+ self.in_layers = nn.Sequential(
846
+ nn.GroupNorm(32, channels),
847
+ nn.SiLU(),
848
+ nn.Conv2d(channels, self.out_channels, 3, padding=1),
849
+ )
850
+
851
+ self.updown = up or down
852
+
853
+ if up:
854
+ self.h_upd = Upsample(channels, False, dims)
855
+ self.x_upd = Upsample(channels, False, dims)
856
+ elif down:
857
+ self.h_upd = Downsample(channels, False, dims)
858
+ self.x_upd = Downsample(channels, False, dims)
859
+ else:
860
+ self.h_upd = self.x_upd = nn.Identity()
861
+
862
+ self.emb_layers = nn.Sequential(
863
+ nn.SiLU(),
864
+ nn.Linear(
865
+ emb_channels,
866
+ 2 * self.out_channels if use_scale_shift_norm else self.out_channels,
867
+ ),
868
+ )
869
+ self.out_layers = nn.Sequential(
870
+ nn.GroupNorm(32, self.out_channels),
871
+ nn.SiLU(),
872
+ nn.Dropout(p=dropout),
873
+ zero_module(
874
+ nn.Conv2d(self.out_channels, self.out_channels, 3, padding=1)
875
+ ),
876
+ )
877
+
878
+ if self.out_channels == channels:
879
+ self.skip_connection = nn.Identity()
880
+ elif use_conv:
881
+ self.skip_connection = conv_nd(
882
+ dims, channels, self.out_channels, 3, padding=1
883
+ )
884
+ else:
885
+ self.skip_connection = nn.Conv2d(channels, self.out_channels, 1)
886
+
887
+ if self.use_temporal_conv:
888
+ self.temopral_conv = TemporalConvBlock_v2(self.out_channels, self.out_channels, dropout=0.1, use_image_dataset=use_image_dataset)
889
+ # self.temopral_conv_2 = TemporalConvBlock(self.out_channels, self.out_channels, dropout=0.1, use_image_dataset=use_image_dataset)
890
+
891
+ def forward(self, x, emb, batch_size):
892
+ """
893
+ Apply the block to a Tensor, conditioned on a timestep embedding.
894
+ :param x: an [N x C x ...] Tensor of features.
895
+ :param emb: an [N x emb_channels] Tensor of timestep embeddings.
896
+ :return: an [N x C x ...] Tensor of outputs.
897
+ """
898
+ return self._forward(x, emb, batch_size)
899
+
900
+ def _forward(self, x, emb, batch_size):
901
+ if self.updown:
902
+ in_rest, in_conv = self.in_layers[:-1], self.in_layers[-1]
903
+ h = in_rest(x)
904
+ h = self.h_upd(h)
905
+ x = self.x_upd(x)
906
+ h = in_conv(h)
907
+ else:
908
+ h = self.in_layers(x)
909
+ emb_out = self.emb_layers(emb).type(h.dtype)
910
+ while len(emb_out.shape) < len(h.shape):
911
+ emb_out = emb_out[..., None]
912
+ if self.use_scale_shift_norm:
913
+ out_norm, out_rest = self.out_layers[0], self.out_layers[1:]
914
+ scale, shift = th.chunk(emb_out, 2, dim=1)
915
+ h = out_norm(h) * (1 + scale) + shift
916
+ h = out_rest(h)
917
+ else:
918
+ h = h + emb_out
919
+ h = self.out_layers(h)
920
+ h = self.skip_connection(x) + h
921
+
922
+ if self.use_temporal_conv:
923
+ h = rearrange(h, '(b f) c h w -> b c f h w', b=batch_size)
924
+ h = self.temopral_conv(h)
925
+ # h = self.temopral_conv_2(h)
926
+ h = rearrange(h, 'b c f h w -> (b f) c h w')
927
+ return h
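ResBlock keeps frames flattened into the batch axis and only regroups them for the temporal convolution, so the caller has to pass the true batch size separately; a minimal sketch (sizes are illustrative):

import torch

block = ResBlock(channels=320, emb_channels=1280, dropout=0.0, use_temporal_conv=True)
b, f = 2, 16                            # 2 clips of 16 frames, flattened to (b*f)
x = torch.randn(b * f, 320, 32, 32)
emb = torch.randn(b * f, 1280)          # timestep embedding, repeated per frame
h = block(x, emb, batch_size=b)
print(h.shape)   # torch.Size([32, 320, 32, 32])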
928
+
929
+ class Downsample(nn.Module):
930
+ """
931
+ A downsampling layer with an optional convolution.
932
+ :param channels: channels in the inputs and outputs.
933
+ :param use_conv: a bool determining if a convolution is applied.
934
+ :param dims: determines if the signal is 1D, 2D, or 3D. If 3D, then
935
+ downsampling occurs in the inner-two dimensions.
936
+ """
937
+
938
+ def __init__(self, channels, use_conv, dims=2, out_channels=None,padding=1):
939
+ super().__init__()
940
+ self.channels = channels
941
+ self.out_channels = out_channels or channels
942
+ self.use_conv = use_conv
943
+ self.dims = dims
944
+ stride = 2 if dims != 3 else (1, 2, 2)
945
+ if use_conv:
946
+ self.op = nn.Conv2d(self.channels, self.out_channels, 3, stride=stride, padding=padding)
947
+ else:
948
+ assert self.channels == self.out_channels
949
+ self.op = avg_pool_nd(dims, kernel_size=stride, stride=stride)
950
+
951
+ def forward(self, x):
952
+ assert x.shape[1] == self.channels
953
+ return self.op(x)
954
+
955
+ class Resample(nn.Module):
956
+
957
+ def __init__(self, in_dim, out_dim, mode):
958
+ assert mode in ['none', 'upsample', 'downsample']
959
+ super(Resample, self).__init__()
960
+ self.in_dim = in_dim
961
+ self.out_dim = out_dim
962
+ self.mode = mode
963
+
964
+ def forward(self, x, reference=None):
965
+ if self.mode == 'upsample':
966
+ assert reference is not None
967
+ x = F.interpolate(x, size=reference.shape[-2:], mode='nearest')
968
+ elif self.mode == 'downsample':
969
+ x = F.adaptive_avg_pool2d(x, output_size=tuple(u // 2 for u in x.shape[-2:]))
970
+ return x
971
+
972
+ class ResidualBlock(nn.Module):
973
+
974
+ def __init__(self, in_dim, embed_dim, out_dim, use_scale_shift_norm=True,
975
+ mode='none', dropout=0.0):
976
+ super(ResidualBlock, self).__init__()
977
+ self.in_dim = in_dim
978
+ self.embed_dim = embed_dim
979
+ self.out_dim = out_dim
980
+ self.use_scale_shift_norm = use_scale_shift_norm
981
+ self.mode = mode
982
+
983
+ # layers
984
+ self.layer1 = nn.Sequential(
985
+ nn.GroupNorm(32, in_dim),
986
+ nn.SiLU(),
987
+ nn.Conv2d(in_dim, out_dim, 3, padding=1))
988
+ self.resample = Resample(in_dim, in_dim, mode)
989
+ self.embedding = nn.Sequential(
990
+ nn.SiLU(),
991
+ nn.Linear(embed_dim, out_dim * 2 if use_scale_shift_norm else out_dim))
992
+ self.layer2 = nn.Sequential(
993
+ nn.GroupNorm(32, out_dim),
994
+ nn.SiLU(),
995
+ nn.Dropout(dropout),
996
+ nn.Conv2d(out_dim, out_dim, 3, padding=1))
997
+ self.shortcut = nn.Identity() if in_dim == out_dim else nn.Conv2d(in_dim, out_dim, 1)
998
+
999
+ # zero out the last layer params
1000
+ nn.init.zeros_(self.layer2[-1].weight)
1001
+
1002
+ def forward(self, x, e, reference=None):
1003
+ identity = self.resample(x, reference)
1004
+ x = self.layer1[-1](self.resample(self.layer1[:-1](x), reference))
1005
+ e = self.embedding(e).unsqueeze(-1).unsqueeze(-1).type(x.dtype)
1006
+ if self.use_scale_shift_norm:
1007
+ scale, shift = e.chunk(2, dim=1)
1008
+ x = self.layer2[0](x) * (1 + scale) + shift
1009
+ x = self.layer2[1:](x)
1010
+ else:
1011
+ x = x + e
1012
+ x = self.layer2(x)
1013
+ x = x + self.shortcut(identity)
1014
+ return x
1015
+
1016
+ class AttentionBlock(nn.Module):
1017
+
1018
+ def __init__(self, dim, context_dim=None, num_heads=None, head_dim=None):
1019
+ # consider head_dim first, then num_heads
1020
+ num_heads = dim // head_dim if head_dim else num_heads
1021
+ head_dim = dim // num_heads
1022
+ assert num_heads * head_dim == dim
1023
+ super(AttentionBlock, self).__init__()
1024
+ self.dim = dim
1025
+ self.context_dim = context_dim
1026
+ self.num_heads = num_heads
1027
+ self.head_dim = head_dim
1028
+ self.scale = math.pow(head_dim, -0.25)
1029
+
1030
+ # layers
1031
+ self.norm = nn.GroupNorm(32, dim)
1032
+ self.to_qkv = nn.Conv2d(dim, dim * 3, 1)
1033
+ if context_dim is not None:
1034
+ self.context_kv = nn.Linear(context_dim, dim * 2)
1035
+ self.proj = nn.Conv2d(dim, dim, 1)
1036
+
1037
+ # zero out the last layer params
1038
+ nn.init.zeros_(self.proj.weight)
1039
+
1040
+ def forward(self, x, context=None):
1041
+ r"""x: [B, C, H, W].
1042
+ context: [B, L, C] or None.
1043
+ """
1044
+ identity = x
1045
+ b, c, h, w, n, d = *x.size(), self.num_heads, self.head_dim
1046
+
1047
+ # compute query, key, value
1048
+ x = self.norm(x)
1049
+ q, k, v = self.to_qkv(x).view(b, n * 3, d, h * w).chunk(3, dim=1)
1050
+ if context is not None:
1051
+ ck, cv = self.context_kv(context).reshape(b, -1, n * 2, d).permute(0, 2, 3, 1).chunk(2, dim=1)
1052
+ k = torch.cat([ck, k], dim=-1)
1053
+ v = torch.cat([cv, v], dim=-1)
1054
+
1055
+ # compute attention
1056
+ attn = torch.matmul(q.transpose(-1, -2) * self.scale, k * self.scale)
1057
+ attn = F.softmax(attn, dim=-1)
1058
+
1059
+ # gather context
1060
+ x = torch.matmul(v, attn.transpose(-1, -2))
1061
+ x = x.reshape(b, c, h, w)
1062
+
1063
+ # output
1064
+ x = self.proj(x)
1065
+ return x + identity
1066
+
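AttentionBlock runs full self-attention over the H*W spatial tokens and can optionally prepend key/value tokens projected from context; a minimal sketch (illustrative sizes):

import torch

blk = AttentionBlock(dim=512, num_heads=8)
x = torch.randn(1, 512, 16, 16)
print(blk(x).shape)   # torch.Size([1, 512, 16, 16])

blk_ctx = AttentionBlock(dim=512, context_dim=768, num_heads=8)
ctx = torch.randn(1, 77, 768)
print(blk_ctx(x, context=ctx).shape)   # torch.Size([1, 512, 16, 16]); proj is zero-initialised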
1067
+
1068
+ class TemporalAttentionBlock(nn.Module):
1069
+ def __init__(
1070
+ self,
1071
+ dim,
1072
+ heads = 4,
1073
+ dim_head = 32,
1074
+ rotary_emb = None,
1075
+ use_image_dataset = False,
1076
+ use_sim_mask = False
1077
+ ):
1078
+ super().__init__()
1079
+ # consider num_heads first, as pos_bias needs fixed num_heads
1080
+ # heads = dim // dim_head if dim_head else heads
1081
+ dim_head = dim // heads
1082
+ assert heads * dim_head == dim
1083
+ self.use_image_dataset = use_image_dataset
1084
+ self.use_sim_mask = use_sim_mask
1085
+
1086
+ self.scale = dim_head ** -0.5
1087
+ self.heads = heads
1088
+ hidden_dim = dim_head * heads
1089
+
1090
+ self.norm = nn.GroupNorm(32, dim)
1091
+ self.rotary_emb = rotary_emb
1092
+ self.to_qkv = nn.Linear(dim, hidden_dim * 3)#, bias = False)
1093
+ self.to_out = nn.Linear(hidden_dim, dim)#, bias = False)
1094
+
1095
+ # nn.init.zeros_(self.to_out.weight)
1096
+ # nn.init.zeros_(self.to_out.bias)
1097
+
1098
+ def forward(
1099
+ self,
1100
+ x,
1101
+ pos_bias = None,
1102
+ focus_present_mask = None,
1103
+ video_mask = None
1104
+ ):
1105
+
1106
+ identity = x
1107
+ n, height, device = x.shape[2], x.shape[-2], x.device
1108
+
1109
+ x = self.norm(x)
1110
+ x = rearrange(x, 'b c f h w -> b (h w) f c')
1111
+
1112
+ qkv = self.to_qkv(x).chunk(3, dim = -1)
1113
+
1114
+ if exists(focus_present_mask) and focus_present_mask.all():
1115
+ # if all batch samples are focusing on present
1116
+ # it would be equivalent to passing that token's values (v=qkv[-1]) through to the output
1117
+ values = qkv[-1]
1118
+ out = self.to_out(values)
1119
+ out = rearrange(out, 'b (h w) f c -> b c f h w', h = height)
1120
+
1121
+ return out + identity
1122
+
1123
+ # split out heads
1124
+ # q, k, v = rearrange_many(qkv, '... n (h d) -> ... h n d', h = self.heads)
1125
+ # shape [b (hw) h n c/h], n=f
1126
+ q= rearrange(qkv[0], '... n (h d) -> ... h n d', h = self.heads)
1127
+ k= rearrange(qkv[1], '... n (h d) -> ... h n d', h = self.heads)
1128
+ v= rearrange(qkv[2], '... n (h d) -> ... h n d', h = self.heads)
1129
+
1130
+
1131
+ # scale
1132
+
1133
+ q = q * self.scale
1134
+
1135
+ # rotate positions into queries and keys for time attention
1136
+ if exists(self.rotary_emb):
1137
+ q = self.rotary_emb.rotate_queries_or_keys(q)
1138
+ k = self.rotary_emb.rotate_queries_or_keys(k)
1139
+
1140
+ # similarity
1141
+ # shape [b (hw) h n n], n=f
1142
+ sim = torch.einsum('... h i d, ... h j d -> ... h i j', q, k)
1143
+
1144
+ # relative positional bias
1145
+
1146
+ if exists(pos_bias):
1147
+ # print(sim.shape,pos_bias.shape)
1148
+ sim = sim + pos_bias
1149
+
1150
+ if (focus_present_mask is None and video_mask is not None):
1151
+ #video_mask: [B, n]
1152
+ mask = video_mask[:, None, :] * video_mask[:, :, None] # [b,n,n]
1153
+ mask = mask.unsqueeze(1).unsqueeze(1) #[b,1,1,n,n]
1154
+ sim = sim.masked_fill(~mask, -torch.finfo(sim.dtype).max)
1155
+ elif exists(focus_present_mask) and not (~focus_present_mask).all():
1156
+ attend_all_mask = torch.ones((n, n), device = device, dtype = torch.bool)
1157
+ attend_self_mask = torch.eye(n, device = device, dtype = torch.bool)
1158
+
1159
+ mask = torch.where(
1160
+ rearrange(focus_present_mask, 'b -> b 1 1 1 1'),
1161
+ rearrange(attend_self_mask, 'i j -> 1 1 1 i j'),
1162
+ rearrange(attend_all_mask, 'i j -> 1 1 1 i j'),
1163
+ )
1164
+
1165
+ sim = sim.masked_fill(~mask, -torch.finfo(sim.dtype).max)
1166
+
1167
+ if self.use_sim_mask:
1168
+ sim_mask = torch.tril(torch.ones((n, n), device = device, dtype = torch.bool), diagonal=0)
1169
+ sim = sim.masked_fill(~sim_mask, -torch.finfo(sim.dtype).max)
1170
+
1171
+ # numerical stability
1172
+ sim = sim - sim.amax(dim = -1, keepdim = True).detach()
1173
+ attn = sim.softmax(dim = -1)
1174
+
1175
+ # aggregate values
1176
+
1177
+ out = torch.einsum('... h i j, ... h j d -> ... h i d', attn, v)
1178
+ out = rearrange(out, '... h n d -> ... n (h d)')
1179
+ out = self.to_out(out)
1180
+
1181
+ out = rearrange(out, 'b (h w) f c -> b c f h w', h = height)
1182
+
1183
+ if self.use_image_dataset:
1184
+ out = identity + 0*out
1185
+ else:
1186
+ out = identity + out
1187
+ return out
1188
+
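TemporalAttentionBlock attends across the frame axis independently at every spatial location and expects the per-head bias produced by RelativePositionBias above; a minimal sketch (illustrative sizes, rotary embeddings and masks omitted):

import torch

frames, heads = 16, 8
temp_attn = TemporalAttentionBlock(dim=320, heads=heads)
pos_bias = RelativePositionBias(heads=heads, max_distance=32)(frames, device=torch.device('cpu'))
x = torch.randn(2, 320, frames, 16, 16)        # [b, c, f, h, w]
out = temp_attn(x, pos_bias=pos_bias)
print(out.shape)   # torch.Size([2, 320, 16, 16, 16])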
1189
+ class TemporalTransformer(nn.Module):
1190
+ """
1191
+ Transformer block for image-like data.
1192
+ First, project the input (aka embedding)
1193
+ and reshape to b, t, d.
1194
+ Then apply standard transformer action.
1195
+ Finally, reshape to image
1196
+ """
1197
+ def __init__(self, in_channels, n_heads, d_head,
1198
+ depth=1, dropout=0., context_dim=None,
1199
+ disable_self_attn=False, use_linear=False,
1200
+ use_checkpoint=True, only_self_att=True, multiply_zero=False):
1201
+ super().__init__()
1202
+ self.multiply_zero = multiply_zero
1203
+ self.only_self_att = only_self_att
1204
+ self.use_adaptor = False
1205
+ if self.only_self_att:
1206
+ context_dim = None
1207
+ if not isinstance(context_dim, list):
1208
+ context_dim = [context_dim]
1209
+ self.in_channels = in_channels
1210
+ inner_dim = n_heads * d_head
1211
+ self.norm = torch.nn.GroupNorm(num_groups=32, num_channels=in_channels, eps=1e-6, affine=True)
1212
+ if not use_linear:
1213
+ self.proj_in = nn.Conv1d(in_channels,
1214
+ inner_dim,
1215
+ kernel_size=1,
1216
+ stride=1,
1217
+ padding=0)
1218
+ else:
1219
+ self.proj_in = nn.Linear(in_channels, inner_dim)
1220
+ if self.use_adaptor:
1221
+ self.adaptor_in = nn.Linear(frames, frames)
1222
+
1223
+ self.transformer_blocks = nn.ModuleList(
1224
+ [BasicTransformerBlock(inner_dim, n_heads, d_head, dropout=dropout, context_dim=context_dim[d],
1225
+ checkpoint=use_checkpoint)
1226
+ for d in range(depth)]
1227
+ )
1228
+ if not use_linear:
1229
+ self.proj_out = zero_module(nn.Conv1d(inner_dim,
1230
+ in_channels,
1231
+ kernel_size=1,
1232
+ stride=1,
1233
+ padding=0))
1234
+ else:
1235
+ self.proj_out = zero_module(nn.Linear(in_channels, inner_dim))
1236
+ if self.use_adaptor:
1237
+ self.adaptor_out = nn.Linear(frames, frames)
1238
+ self.use_linear = use_linear
1239
+
1240
+ def forward(self, x, context=None):
1241
+ # note: if no context is given, cross-attention defaults to self-attention
1242
+ if self.only_self_att:
1243
+ context = None
1244
+ if not isinstance(context, list):
1245
+ context = [context]
1246
+ b, c, f, h, w = x.shape
1247
+ x_in = x
1248
+ x = self.norm(x)
1249
+
1250
+ if not self.use_linear:
1251
+ x = rearrange(x, 'b c f h w -> (b h w) c f').contiguous()
1252
+ x = self.proj_in(x)
1253
+ # [16384, 16, 320]
1254
+ if self.use_linear:
1255
+ x = rearrange(x, '(b f) c h w -> b (h w) f c', f=self.frames).contiguous()
1256
+ x = self.proj_in(x)
1257
+
1258
+ if self.only_self_att:
1259
+ x = rearrange(x, 'bhw c f -> bhw f c').contiguous()
1260
+ for i, block in enumerate(self.transformer_blocks):
1261
+ x = block(x)
1262
+ x = rearrange(x, '(b hw) f c -> b hw f c', b=b).contiguous()
1263
+ else:
1264
+ x = rearrange(x, '(b hw) c f -> b hw f c', b=b).contiguous()
1265
+ for i, block in enumerate(self.transformer_blocks):
1266
+ # context[i] = repeat(context[i], '(b f) l con -> b (f r) l con', r=(h*w)//self.frames, f=self.frames).contiguous()
1267
+ context[i] = rearrange(context[i], '(b f) l con -> b f l con', f=self.frames).contiguous()
1268
+ # process each batch element one by one (some kernels cannot handle a dimension greater than 65,535)
1269
+ for j in range(b):
1270
+ context_i_j = repeat(context[i][j], 'f l con -> (f r) l con', r=(h*w)//self.frames, f=self.frames).contiguous()
1271
+ x[j] = block(x[j], context=context_i_j)
1272
+
1273
+ if self.use_linear:
1274
+ x = self.proj_out(x)
1275
+ x = rearrange(x, 'b (h w) f c -> b f c h w', h=h, w=w).contiguous()
1276
+ if not self.use_linear:
1277
+ # x = rearrange(x, 'bhw f c -> bhw c f').contiguous()
1278
+ x = rearrange(x, 'b hw f c -> (b hw) c f').contiguous()
1279
+ x = self.proj_out(x)
1280
+ x = rearrange(x, '(b h w) c f -> b c f h w', b=b, h=h, w=w).contiguous()
1281
+
1282
+ if self.multiply_zero:
1283
+ x = 0.0 * x + x_in
1284
+ else:
1285
+ x = x + x_in
1286
+ return x
1287
+
1288
+
1289
+ class TemporalTransformerWithAdapter(nn.Module):
1290
+ """
1291
+ Transformer block for image-like data.
1292
+ First, project the input (aka embedding)
1293
+ and reshape to b, t, d.
1294
+ Then apply standard transformer action.
1295
+ Finally, reshape to image
1296
+ """
1297
+ def __init__(self, in_channels, n_heads, d_head,
1298
+ depth=1, dropout=0., context_dim=None,
1299
+ disable_self_attn=False, use_linear=False,
1300
+ use_checkpoint=True, only_self_att=True, multiply_zero=False,
1301
+ adapter_list=[], adapter_position_list=['parallel', 'parallel', 'parallel'],
1302
+ adapter_hidden_dim=None, adapter_condition_dim=None):
1303
+ super().__init__()
1304
+ self.multiply_zero = multiply_zero
1305
+ self.only_self_att = only_self_att
1306
+ self.use_adaptor = False
1307
+ if self.only_self_att:
1308
+ context_dim = None
1309
+ if not isinstance(context_dim, list):
1310
+ context_dim = [context_dim]
1311
+ self.in_channels = in_channels
1312
+ inner_dim = n_heads * d_head
1313
+ self.norm = torch.nn.GroupNorm(num_groups=32, num_channels=in_channels, eps=1e-6, affine=True)
1314
+ if not use_linear:
1315
+ self.proj_in = nn.Conv1d(in_channels,
1316
+ inner_dim,
1317
+ kernel_size=1,
1318
+ stride=1,
1319
+ padding=0)
1320
+ else:
1321
+ self.proj_in = nn.Linear(in_channels, inner_dim)
1322
+ if self.use_adaptor:
1323
+ self.adaptor_in = nn.Linear(frames, frames)
1324
+
1325
+ self.transformer_blocks = nn.ModuleList(
1326
+ [BasicTransformerBlockWithAdapter(inner_dim, n_heads, d_head, dropout=dropout, context_dim=context_dim[d],
1327
+ checkpoint=use_checkpoint, adapter_list=adapter_list, adapter_position_list=adapter_position_list,
1328
+ adapter_hidden_dim=adapter_hidden_dim, adapter_condition_dim=adapter_condition_dim)
1329
+ for d in range(depth)]
1330
+ )
1331
+ if not use_linear:
1332
+ self.proj_out = zero_module(nn.Conv1d(inner_dim,
1333
+ in_channels,
1334
+ kernel_size=1,
1335
+ stride=1,
1336
+ padding=0))
1337
+ else:
1338
+ self.proj_out = zero_module(nn.Linear(in_channels, inner_dim))
1339
+ if self.use_adaptor:
1340
+ self.adaptor_out = nn.Linear(frames, frames)
1341
+ self.use_linear = use_linear
1342
+
1343
+ def forward(self, x, context=None, adapter_condition=None, adapter_condition_lam=1):
1344
+ # note: if no context is given, cross-attention defaults to self-attention
1345
+ if self.only_self_att:
1346
+ context = None
1347
+ if not isinstance(context, list):
1348
+ context = [context]
1349
+ b, c, f, h, w = x.shape
1350
+ x_in = x
1351
+ x = self.norm(x)
1352
+
1353
+ if not self.use_linear:
1354
+ x = rearrange(x, 'b c f h w -> (b h w) c f').contiguous()
1355
+ x = self.proj_in(x)
1356
+ # [16384, 16, 320]
1357
+ if self.use_linear:
1358
+ x = rearrange(x, '(b f) c h w -> b (h w) f c', f=self.frames).contiguous()
1359
+ x = self.proj_in(x)
1360
+
1361
+ if adapter_condition is not None:
1362
+ b_cond, f_cond, c_cond = adapter_condition.shape
1363
+ adapter_condition = adapter_condition.unsqueeze(1).unsqueeze(1).repeat(1, h, w, 1, 1)
1364
+ adapter_condition = adapter_condition.reshape(b_cond*h*w, f_cond, c_cond)
1365
+
1366
+ if self.only_self_att:
1367
+ x = rearrange(x, 'bhw c f -> bhw f c').contiguous()
1368
+ for i, block in enumerate(self.transformer_blocks):
1369
+ x = block(x, adapter_condition=adapter_condition, adapter_condition_lam=adapter_condition_lam)
1370
+ x = rearrange(x, '(b hw) f c -> b hw f c', b=b).contiguous()
1371
+ else:
1372
+ x = rearrange(x, '(b hw) c f -> b hw f c', b=b).contiguous()
1373
+ for i, block in enumerate(self.transformer_blocks):
1374
+ # context[i] = repeat(context[i], '(b f) l con -> b (f r) l con', r=(h*w)//self.frames, f=self.frames).contiguous()
1375
+ context[i] = rearrange(context[i], '(b f) l con -> b f l con', f=self.frames).contiguous()
1376
+ # process each batch element one by one (some kernels cannot handle a dimension greater than 65,535)
1377
+ for j in range(b):
1378
+ context_i_j = repeat(context[i][j], 'f l con -> (f r) l con', r=(h*w)//self.frames, f=self.frames).contiguous()
1379
+ x[j] = block(x[j], context=context_i_j)
1380
+
1381
+ if self.use_linear:
1382
+ x = self.proj_out(x)
1383
+ x = rearrange(x, 'b (h w) f c -> b f c h w', h=h, w=w).contiguous()
1384
+ if not self.use_linear:
1385
+ # x = rearrange(x, 'bhw f c -> bhw c f').contiguous()
1386
+ x = rearrange(x, 'b hw f c -> (b hw) c f').contiguous()
1387
+ x = self.proj_out(x)
1388
+ x = rearrange(x, '(b h w) c f -> b c f h w', b=b, h=h, w=w).contiguous()
1389
+
1390
+ if self.multiply_zero:
1391
+ x = 0.0 * x + x_in
1392
+ else:
1393
+ x = x + x_in
1394
+ return x
1395
+
1396
+ class Attention(nn.Module):
1397
+ def __init__(self, dim, heads = 8, dim_head = 64, dropout = 0.):
1398
+ super().__init__()
1399
+ inner_dim = dim_head * heads
1400
+ project_out = not (heads == 1 and dim_head == dim)
1401
+
1402
+ self.heads = heads
1403
+ self.scale = dim_head ** -0.5
1404
+
1405
+ self.attend = nn.Softmax(dim = -1)
1406
+ self.to_qkv = nn.Linear(dim, inner_dim * 3, bias = False)
1407
+
1408
+ self.to_out = nn.Sequential(
1409
+ nn.Linear(inner_dim, dim),
1410
+ nn.Dropout(dropout)
1411
+ ) if project_out else nn.Identity()
1412
+
1413
+ def forward(self, x):
1414
+ b, n, _, h = *x.shape, self.heads
1415
+ qkv = self.to_qkv(x).chunk(3, dim = -1)
1416
+ q, k, v = map(lambda t: rearrange(t, 'b n (h d) -> b h n d', h = h), qkv)
1417
+
1418
+ dots = torch.einsum('b h i d, b h j d -> b h i j', q, k) * self.scale
1419
+
1420
+ attn = self.attend(dots)
1421
+
1422
+ out = torch.einsum('b h i j, b h j d -> b h i d', attn, v)
1423
+ out = rearrange(out, 'b h n d -> b n (h d)')
1424
+ return self.to_out(out)
1425
+
1426
+ class PreNormattention(nn.Module):
1427
+ def __init__(self, dim, fn):
1428
+ super().__init__()
1429
+ self.norm = nn.LayerNorm(dim)
1430
+ self.fn = fn
1431
+ def forward(self, x, **kwargs):
1432
+ return self.fn(self.norm(x), **kwargs) + x
1433
+
1434
+ class TransformerV2(nn.Module):
1435
+ def __init__(self, heads=8, dim=2048, dim_head_k=256, dim_head_v=256, dropout_atte = 0.05, mlp_dim=2048, dropout_ffn = 0.05, depth=1):
1436
+ super().__init__()
1437
+ self.layers = nn.ModuleList([])
1438
+ self.depth = depth
1439
+ for _ in range(depth):
1440
+ self.layers.append(nn.ModuleList([
1441
+ PreNormattention(dim, Attention(dim, heads = heads, dim_head = dim_head_k, dropout = dropout_atte)),
1442
+ FeedForward(dim, mlp_dim, dropout = dropout_ffn),
1443
+ ]))
1444
+ def forward(self, x):
1445
+ # if self.depth
1446
+ for attn, ff in self.layers[:1]:
1447
+ x = attn(x)
1448
+ x = ff(x) + x
1449
+ if self.depth > 1:
1450
+ for attn, ff in self.layers[1:]:
1451
+ x = attn(x)
1452
+ x = ff(x) + x
1453
+ return x
1454
+
1455
+ class TemporalTransformer_attemask(nn.Module):
1456
+ """
1457
+ Transformer block for image-like data.
1458
+ First, project the input (aka embedding)
1459
+ and reshape to b, t, d.
1460
+ Then apply standard transformer action.
1461
+ Finally, reshape to image
1462
+ """
1463
+ def __init__(self, in_channels, n_heads, d_head,
1464
+ depth=1, dropout=0., context_dim=None,
1465
+ disable_self_attn=False, use_linear=False,
1466
+ use_checkpoint=True, only_self_att=True, multiply_zero=False):
1467
+ super().__init__()
1468
+ self.multiply_zero = multiply_zero
1469
+ self.only_self_att = only_self_att
1470
+ self.use_adaptor = False
1471
+ if self.only_self_att:
1472
+ context_dim = None
1473
+ if not isinstance(context_dim, list):
1474
+ context_dim = [context_dim]
1475
+ self.in_channels = in_channels
1476
+ inner_dim = n_heads * d_head
1477
+ self.norm = torch.nn.GroupNorm(num_groups=32, num_channels=in_channels, eps=1e-6, affine=True)
1478
+ if not use_linear:
1479
+ self.proj_in = nn.Conv1d(in_channels,
1480
+ inner_dim,
1481
+ kernel_size=1,
1482
+ stride=1,
1483
+ padding=0)
1484
+ else:
1485
+ self.proj_in = nn.Linear(in_channels, inner_dim)
1486
+ if self.use_adaptor:
1487
+ self.adaptor_in = nn.Linear(frames, frames)
1488
+
1489
+ self.transformer_blocks = nn.ModuleList(
1490
+ [BasicTransformerBlock_attemask(inner_dim, n_heads, d_head, dropout=dropout, context_dim=context_dim[d],
1491
+ checkpoint=use_checkpoint)
1492
+ for d in range(depth)]
1493
+ )
1494
+ if not use_linear:
1495
+ self.proj_out = zero_module(nn.Conv1d(inner_dim,
1496
+ in_channels,
1497
+ kernel_size=1,
1498
+ stride=1,
1499
+ padding=0))
1500
+ else:
1501
+ self.proj_out = zero_module(nn.Linear(in_channels, inner_dim))
1502
+ if self.use_adaptor:
1503
+ self.adaptor_out = nn.Linear(frames, frames)
1504
+ self.use_linear = use_linear
1505
+
1506
+ def forward(self, x, context=None):
1507
+ # note: if no context is given, cross-attention defaults to self-attention
1508
+ if self.only_self_att:
1509
+ context = None
1510
+ if not isinstance(context, list):
1511
+ context = [context]
1512
+ b, c, f, h, w = x.shape
1513
+ x_in = x
1514
+ x = self.norm(x)
1515
+
1516
+ if not self.use_linear:
1517
+ x = rearrange(x, 'b c f h w -> (b h w) c f').contiguous()
1518
+ x = self.proj_in(x)
1519
+ # [16384, 16, 320]
1520
+ if self.use_linear:
1521
+ x = rearrange(x, '(b f) c h w -> b (h w) f c', f=self.frames).contiguous()
1522
+ x = self.proj_in(x)
1523
+
1524
+ if self.only_self_att:
1525
+ x = rearrange(x, 'bhw c f -> bhw f c').contiguous()
1526
+ for i, block in enumerate(self.transformer_blocks):
1527
+ x = block(x)
1528
+ x = rearrange(x, '(b hw) f c -> b hw f c', b=b).contiguous()
1529
+ else:
1530
+ x = rearrange(x, '(b hw) c f -> b hw f c', b=b).contiguous()
1531
+ for i, block in enumerate(self.transformer_blocks):
1532
+ # context[i] = repeat(context[i], '(b f) l con -> b (f r) l con', r=(h*w)//self.frames, f=self.frames).contiguous()
1533
+ context[i] = rearrange(context[i], '(b f) l con -> b f l con', f=self.frames).contiguous()
1534
+ # process each batch element one by one (some kernels cannot handle a dimension greater than 65,535)
1535
+ for j in range(b):
1536
+ context_i_j = repeat(context[i][j], 'f l con -> (f r) l con', r=(h*w)//self.frames, f=self.frames).contiguous()
1537
+ x[j] = block(x[j], context=context_i_j)
1538
+
1539
+ if self.use_linear:
1540
+ x = self.proj_out(x)
1541
+ x = rearrange(x, 'b (h w) f c -> b f c h w', h=h, w=w).contiguous()
1542
+ if not self.use_linear:
1543
+ # x = rearrange(x, 'bhw f c -> bhw c f').contiguous()
1544
+ x = rearrange(x, 'b hw f c -> (b hw) c f').contiguous()
1545
+ x = self.proj_out(x)
1546
+ x = rearrange(x, '(b h w) c f -> b c f h w', b=b, h=h, w=w).contiguous()
1547
+
1548
+ if self.multiply_zero:
1549
+ x = 0.0 * x + x_in
1550
+ else:
1551
+ x = x + x_in
1552
+ return x
1553
+
1554
+ class TemporalAttentionMultiBlock(nn.Module):
1555
+ def __init__(
1556
+ self,
1557
+ dim,
1558
+ heads=4,
1559
+ dim_head=32,
1560
+ rotary_emb=None,
1561
+ use_image_dataset=False,
1562
+ use_sim_mask=False,
1563
+ temporal_attn_times=1,
1564
+ ):
1565
+ super().__init__()
1566
+ self.att_layers = nn.ModuleList(
1567
+ [TemporalAttentionBlock(dim, heads, dim_head, rotary_emb, use_image_dataset, use_sim_mask)
1568
+ for _ in range(temporal_attn_times)]
1569
+ )
1570
+
1571
+ def forward(
1572
+ self,
1573
+ x,
1574
+ pos_bias = None,
1575
+ focus_present_mask = None,
1576
+ video_mask = None
1577
+ ):
1578
+ for layer in self.att_layers:
1579
+ x = layer(x, pos_bias, focus_present_mask, video_mask)
1580
+ return x
1581
+
1582
+
1583
+ class InitTemporalConvBlock(nn.Module):
1584
+
1585
+ def __init__(self, in_dim, out_dim=None, dropout=0.0,use_image_dataset=False):
1586
+ super(InitTemporalConvBlock, self).__init__()
1587
+ if out_dim is None:
1588
+ out_dim = in_dim#int(1.5*in_dim)
1589
+ self.in_dim = in_dim
1590
+ self.out_dim = out_dim
1591
+ self.use_image_dataset = use_image_dataset
1592
+
1593
+ # conv layers
1594
+ self.conv = nn.Sequential(
1595
+ nn.GroupNorm(32, out_dim),
1596
+ nn.SiLU(),
1597
+ nn.Dropout(dropout),
1598
+ nn.Conv3d(out_dim, in_dim, (3, 1, 1), padding = (1, 0, 0)))
1599
+
1600
+ # zero out the last layer params, so the conv block is an identity mapping at initialization
1601
+ # nn.init.zeros_(self.conv1[-1].weight)
1602
+ # nn.init.zeros_(self.conv1[-1].bias)
1603
+ nn.init.zeros_(self.conv[-1].weight)
1604
+ nn.init.zeros_(self.conv[-1].bias)
1605
+
1606
+ def forward(self, x):
1607
+ identity = x
1608
+ x = self.conv(x)
1609
+ if self.use_image_dataset:
1610
+ x = identity + 0*x
1611
+ else:
1612
+ x = identity + x
1613
+ return x
1614
+
1615
+ class TemporalConvBlock(nn.Module):
1616
+
1617
+ def __init__(self, in_dim, out_dim=None, dropout=0.0, use_image_dataset= False):
1618
+ super(TemporalConvBlock, self).__init__()
1619
+ if out_dim is None:
1620
+ out_dim = in_dim#int(1.5*in_dim)
1621
+ self.in_dim = in_dim
1622
+ self.out_dim = out_dim
1623
+ self.use_image_dataset = use_image_dataset
1624
+
1625
+ # conv layers
1626
+ self.conv1 = nn.Sequential(
1627
+ nn.GroupNorm(32, in_dim),
1628
+ nn.SiLU(),
1629
+ nn.Conv3d(in_dim, out_dim, (3, 1, 1), padding = (1, 0, 0)))
1630
+ self.conv2 = nn.Sequential(
1631
+ nn.GroupNorm(32, out_dim),
1632
+ nn.SiLU(),
1633
+ nn.Dropout(dropout),
1634
+ nn.Conv3d(out_dim, in_dim, (3, 1, 1), padding = (1, 0, 0)))
1635
+
1636
+ # zero out the last layer params, so the conv block is an identity mapping at initialization
1637
+ # nn.init.zeros_(self.conv1[-1].weight)
1638
+ # nn.init.zeros_(self.conv1[-1].bias)
1639
+ nn.init.zeros_(self.conv2[-1].weight)
1640
+ nn.init.zeros_(self.conv2[-1].bias)
1641
+
1642
+ def forward(self, x):
1643
+ identity = x
1644
+ x = self.conv1(x)
1645
+ x = self.conv2(x)
1646
+ if self.use_image_dataset:
1647
+ x = identity + 0*x
1648
+ else:
1649
+ x = identity + x
1650
+ return x
1651
+
1652
+ class TemporalConvBlock_v2(nn.Module):
1653
+ def __init__(self, in_dim, out_dim=None, dropout=0.0, use_image_dataset=False):
1654
+ super(TemporalConvBlock_v2, self).__init__()
1655
+ if out_dim is None:
1656
+ out_dim = in_dim # int(1.5*in_dim)
1657
+ self.in_dim = in_dim
1658
+ self.out_dim = out_dim
1659
+ self.use_image_dataset = use_image_dataset
1660
+
1661
+ # conv layers
1662
+ self.conv1 = nn.Sequential(
1663
+ nn.GroupNorm(32, in_dim),
1664
+ nn.SiLU(),
1665
+ nn.Conv3d(in_dim, out_dim, (3, 1, 1), padding = (1, 0, 0)))
1666
+ self.conv2 = nn.Sequential(
1667
+ nn.GroupNorm(32, out_dim),
1668
+ nn.SiLU(),
1669
+ nn.Dropout(dropout),
1670
+ nn.Conv3d(out_dim, in_dim, (3, 1, 1), padding = (1, 0, 0)))
1671
+ self.conv3 = nn.Sequential(
1672
+ nn.GroupNorm(32, out_dim),
1673
+ nn.SiLU(),
1674
+ nn.Dropout(dropout),
1675
+ nn.Conv3d(out_dim, in_dim, (3, 1, 1), padding = (1, 0, 0)))
1676
+ self.conv4 = nn.Sequential(
1677
+ nn.GroupNorm(32, out_dim),
1678
+ nn.SiLU(),
1679
+ nn.Dropout(dropout),
1680
+ nn.Conv3d(out_dim, in_dim, (3, 1, 1), padding = (1, 0, 0)))
1681
+
1682
+ # zero out the last layer params, so the conv block is an identity mapping at initialization
1683
+ nn.init.zeros_(self.conv4[-1].weight)
1684
+ nn.init.zeros_(self.conv4[-1].bias)
1685
+
1686
+ def forward(self, x):
1687
+ identity = x
1688
+ x = self.conv1(x)
1689
+ x = self.conv2(x)
1690
+ x = self.conv3(x)
1691
+ x = self.conv4(x)
1692
+
1693
+ if self.use_image_dataset:
1694
+ x = identity + 0.0 * x
1695
+ else:
1696
+ x = identity + x
1697
+ return x
1698
+
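As with the attention blocks, the last convolution of TemporalConvBlock_v2 is zero-initialised, so the whole stack starts out as an identity along the frame axis; a quick sketch with illustrative sizes:

import torch

tcb = TemporalConvBlock_v2(in_dim=320, dropout=0.1)
x = torch.randn(1, 320, 16, 8, 8)   # [b, c, f, h, w]
with torch.no_grad():
    y = tcb(x)
print(y.shape, torch.allclose(y, x))   # torch.Size([1, 320, 16, 8, 8]) True (conv4 is zero-initialised)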
1699
+
1700
+ class DropPath(nn.Module):
1701
+ r"""DropPath but without rescaling and supports optional all-zero and/or all-keep.
1702
+ """
1703
+ def __init__(self, p):
1704
+ super(DropPath, self).__init__()
1705
+ self.p = p
1706
+
1707
+ def forward(self, *args, zero=None, keep=None):
1708
+ if not self.training:
1709
+ return args[0] if len(args) == 1 else args
1710
+
1711
+ # params
1712
+ x = args[0]
1713
+ b = x.size(0)
1714
+ n = (torch.rand(b) < self.p).sum()
1715
+
1716
+ # non-zero and non-keep mask
1717
+ mask = x.new_ones(b, dtype=torch.bool)
1718
+ if keep is not None:
1719
+ mask[keep] = False
1720
+ if zero is not None:
1721
+ mask[zero] = False
1722
+
1723
+ # drop-path index
1724
+ index = torch.where(mask)[0]
1725
+ index = index[torch.randperm(len(index))[:n]]
1726
+ if zero is not None:
1727
+ index = torch.cat([index, torch.where(zero)[0]], dim=0)
1728
+
1729
+ # drop-path multiplier
1730
+ multiplier = x.new_ones(b)
1731
+ multiplier[index] = 0.0
1732
+ output = tuple(u * self.broadcast(multiplier, u) for u in args)
1733
+ return output[0] if len(args) == 1 else output
1734
+
1735
+ def broadcast(self, src, dst):
1736
+ assert src.size(0) == dst.size(0)
1737
+ shape = (dst.size(0), ) + (1, ) * (dst.ndim - 1)
1738
+ return src.view(shape)
1739
+
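DropPath here zeroes a random subset of samples without the usual 1/(1-p) rescaling and is a no-op in eval mode; the optional zero/keep masks force particular samples to be dropped or kept. A small sketch:

import torch

drop = DropPath(p=0.5)
x = torch.randn(4, 320)

drop.train()
y = drop(x)                                        # roughly half of the 4 samples are zeroed
print((y.abs().sum(dim=1) == 0).sum().item())

drop.eval()
print(torch.equal(drop(x), x))                     # True: identity at inference time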
1740
+
1741
+
UniAnimate/utils/__init__.py ADDED
File without changes
UniAnimate/utils/assign_cfg.py ADDED
@@ -0,0 +1,78 @@
1
+ import os, yaml
2
+ from copy import deepcopy, copy
3
+
4
+
5
+ # build the prior and vldm configs
6
+ def assign_prior_mudule_cfg(cfg):
7
+ '''
8
+ '''
9
+ #
10
+ prior_cfg = deepcopy(cfg)
11
+ vldm_cfg = deepcopy(cfg)
12
+
13
+ with open(cfg.prior_cfg, 'r') as f:
14
+ _cfg_update = yaml.load(f.read(), Loader=yaml.SafeLoader)
15
+ # _cfg_update = _cfg_update.cfg_dict
16
+ for k, v in _cfg_update.items():
17
+ if isinstance(v, dict) and k in cfg:
18
+ prior_cfg[k].update(v)
19
+ else:
20
+ prior_cfg[k] = v
21
+
22
+ with open(cfg.vldm_cfg, 'r') as f:
23
+ _cfg_update = yaml.load(f.read(), Loader=yaml.SafeLoader)
24
+ # _cfg_update = _cfg_update.cfg_dict
25
+ for k, v in _cfg_update.items():
26
+ if isinstance(v, dict) and k in cfg:
27
+ vldm_cfg[k].update(v)
28
+ else:
29
+ vldm_cfg[k] = v
30
+
31
+ return prior_cfg, vldm_cfg
32
+
33
+
34
+ # build the vldm and vsr configs
35
+ def assign_vldm_vsr_mudule_cfg(cfg):
36
+ '''
37
+ '''
38
+ #
39
+ vldm_cfg = deepcopy(cfg)
40
+ vsr_cfg = deepcopy(cfg)
41
+
42
+ with open(cfg.vldm_cfg, 'r') as f:
43
+ _cfg_update = yaml.load(f.read(), Loader=yaml.SafeLoader)
44
+ # _cfg_update = _cfg_update.cfg_dict
45
+ for k, v in _cfg_update.items():
46
+ if isinstance(v, dict) and k in cfg:
47
+ vldm_cfg[k].update(v)
48
+ else:
49
+ vldm_cfg[k] = v
50
+
51
+ with open(cfg.vsr_cfg, 'r') as f:
52
+ _cfg_update = yaml.load(f.read(), Loader=yaml.SafeLoader)
53
+ # _cfg_update = _cfg_update.cfg_dict
54
+ for k, v in _cfg_update.items():
55
+ if isinstance(v, dict) and k in cfg:
56
+ vsr_cfg[k].update(v)
57
+ else:
58
+ vsr_cfg[k] = v
59
+
60
+ return vldm_cfg, vsr_cfg
61
+
62
+
63
+ # merge a single sub-config file (cfg[tname]) into a copy of the base config
64
+ def assign_signle_cfg(cfg, _cfg_update, tname):
65
+ '''
66
+ '''
67
+ #
68
+ vldm_cfg = deepcopy(cfg)
69
+ if os.path.exists(_cfg_update[tname]):
70
+ with open(_cfg_update[tname], 'r') as f:
71
+ _cfg_update = yaml.load(f.read(), Loader=yaml.SafeLoader)
72
+ # _cfg_update = _cfg_update.cfg_dict
73
+ for k, v in _cfg_update.items():
74
+ if isinstance(v, dict) and k in cfg:
75
+ vldm_cfg[k].update(v)
76
+ else:
77
+ vldm_cfg[k] = v
78
+ return vldm_cfg
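All three helpers follow the same pattern: deep-copy the incoming config, then merge a YAML file on top (dict values are updated key by key, everything else is overwritten). A hypothetical sketch of assign_signle_cfg; the path and keys below are illustrative, not taken from the shipped configs:

# hypothetical usage; assumes the YAML path stored under cfg['vldm_cfg'] exists on disk
cfg = {'vldm_cfg': 'configs/UniAnimate_infer.yaml', 'seed': 8888, 'UNet': {'in_dim': 4}}
merged = assign_signle_cfg(cfg, cfg, 'vldm_cfg')
# 'seed' and 'UNet' keep their copied values unless the YAML overrides/updates them
print(merged['seed'])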
UniAnimate/utils/config.py ADDED
@@ -0,0 +1,230 @@
1
+ import os
2
+ import yaml
3
+ import json
4
+ import copy
5
+ import argparse
6
+
7
+ import utils.logging as logging
8
+ logger = logging.get_logger(__name__)
9
+
10
+ class Config(object):
11
+ def __init__(self, load=True, cfg_dict=None, cfg_level=None):
12
+ self._level = "cfg" + ("." + cfg_level if cfg_level is not None else "")
13
+ if load:
14
+ self.args = self._parse_args()
15
+ logger.info("Loading config from {}.".format(self.args.cfg_file))
16
+ self.need_initialization = True
17
+ cfg_base = self._load_yaml(self.args) # self._initialize_cfg()
18
+ cfg_dict = self._load_yaml(self.args)
19
+ cfg_dict = self._merge_cfg_from_base(cfg_base, cfg_dict)
20
+ cfg_dict = self._update_from_args(cfg_dict)
21
+ self.cfg_dict = cfg_dict
22
+ self._update_dict(cfg_dict)
23
+
24
+ def _parse_args(self):
25
+ parser = argparse.ArgumentParser(
26
+ description="Argparser for configuring [code base name to think of] codebase"
27
+ )
28
+ parser.add_argument(
29
+ "--cfg",
30
+ dest="cfg_file",
31
+ help="Path to the configuration file",
32
+ default='configs/UniAnimate_infer.yaml'
33
+ )
34
+ parser.add_argument(
35
+ "--init_method",
36
+ help="Initialization method, includes TCP or shared file-system",
37
+ default="tcp://localhost:9999",
38
+ type=str,
39
+ )
40
+ parser.add_argument(
41
+ '--debug',
42
+ action='store_true',
43
+ default=False,
44
+ help='Enable debug information'
45
+ )
46
+ parser.add_argument(
47
+ "opts",
48
+ help="other configurations",
49
+ default=None,
50
+ nargs=argparse.REMAINDER)
51
+ return parser.parse_args()
52
+
53
+ def _path_join(self, path_list):
54
+ path = ""
55
+ for p in path_list:
56
+ path+= p + '/'
57
+ return path[:-1]
58
+
59
+ def _update_from_args(self, cfg_dict):
60
+ args = self.args
61
+ for var in vars(args):
62
+ cfg_dict[var] = getattr(args, var)
63
+ return cfg_dict
64
+
65
+ def _initialize_cfg(self):
66
+ if self.need_initialization:
67
+ self.need_initialization = False
68
+ if os.path.exists('./configs/base.yaml'):
69
+ with open("./configs/base.yaml", 'r') as f:
70
+ cfg = yaml.load(f.read(), Loader=yaml.SafeLoader)
71
+ else:
72
+ with open(os.path.realpath(__file__).split('/')[-3] + "/configs/base.yaml", 'r') as f:
73
+ cfg = yaml.load(f.read(), Loader=yaml.SafeLoader)
74
+ return cfg
75
+
76
+ def _load_yaml(self, args, file_name=""):
77
+ assert args.cfg_file is not None
78
+ if not file_name == "": # reading from base file
79
+ with open(file_name, 'r') as f:
80
+ cfg = yaml.load(f.read(), Loader=yaml.SafeLoader)
81
+ else:
82
+ if os.getcwd().split("/")[-1] == args.cfg_file.split("/")[0]:
83
+ args.cfg_file = args.cfg_file.replace(os.getcwd().split("/")[-1], "./")
84
+ with open(args.cfg_file, 'r') as f:
85
+ cfg = yaml.load(f.read(), Loader=yaml.SafeLoader)
86
+ file_name = args.cfg_file
87
+
88
+ if "_BASE_RUN" not in cfg.keys() and "_BASE_MODEL" not in cfg.keys() and "_BASE" not in cfg.keys():
89
+ # return cfg if the base file is being accessed
90
+ cfg = self._merge_cfg_from_command_update(args, cfg)
91
+ return cfg
92
+
93
+ if "_BASE" in cfg.keys():
94
+ if cfg["_BASE"][1] == '.':
95
+ prev_count = cfg["_BASE"].count('..')
96
+ cfg_base_file = self._path_join(file_name.split('/')[:(-1-cfg["_BASE"].count('..'))] + cfg["_BASE"].split('/')[prev_count:])
97
+ else:
98
+ cfg_base_file = cfg["_BASE"].replace(
99
+ "./",
100
+ args.cfg_file.replace(args.cfg_file.split('/')[-1], "")
101
+ )
102
+ cfg_base = self._load_yaml(args, cfg_base_file)
103
+ cfg = self._merge_cfg_from_base(cfg_base, cfg)
104
+ else:
105
+ if "_BASE_RUN" in cfg.keys():
106
+ if cfg["_BASE_RUN"][1] == '.':
107
+ prev_count = cfg["_BASE_RUN"].count('..')
108
+ cfg_base_file = self._path_join(file_name.split('/')[:(-1-prev_count)] + cfg["_BASE_RUN"].split('/')[prev_count:])
109
+ else:
110
+ cfg_base_file = cfg["_BASE_RUN"].replace(
111
+ "./",
112
+ args.cfg_file.replace(args.cfg_file.split('/')[-1], "")
113
+ )
114
+ cfg_base = self._load_yaml(args, cfg_base_file)
115
+ cfg = self._merge_cfg_from_base(cfg_base, cfg, preserve_base=True)
116
+ if "_BASE_MODEL" in cfg.keys():
117
+ if cfg["_BASE_MODEL"][1] == '.':
118
+ prev_count = cfg["_BASE_MODEL"].count('..')
119
+ cfg_base_file = self._path_join(file_name.split('/')[:(-1-cfg["_BASE_MODEL"].count('..'))] + cfg["_BASE_MODEL"].split('/')[prev_count:])
120
+ else:
121
+ cfg_base_file = cfg["_BASE_MODEL"].replace(
122
+ "./",
123
+ args.cfg_file.replace(args.cfg_file.split('/')[-1], "")
124
+ )
125
+ cfg_base = self._load_yaml(args, cfg_base_file)
126
+ cfg = self._merge_cfg_from_base(cfg_base, cfg)
127
+ cfg = self._merge_cfg_from_command(args, cfg)
128
+ return cfg
129
+
130
+ def _merge_cfg_from_base(self, cfg_base, cfg_new, preserve_base=False):
131
+ for k,v in cfg_new.items():
132
+ if k in cfg_base.keys():
133
+ if isinstance(v, dict):
134
+ self._merge_cfg_from_base(cfg_base[k], v)
135
+ else:
136
+ cfg_base[k] = v
137
+ else:
138
+ if "BASE" not in k or preserve_base:
139
+ cfg_base[k] = v
140
+ return cfg_base
141
+
142
+ def _merge_cfg_from_command_update(self, args, cfg):
143
+ if len(args.opts) == 0:
144
+ return cfg
145
+
146
+ assert len(args.opts) % 2 == 0, 'Override list {} has odd length: {}.'.format(
147
+ args.opts, len(args.opts)
148
+ )
149
+ keys = args.opts[0::2]
150
+ vals = args.opts[1::2]
151
+
152
+ for key, val in zip(keys, vals):
153
+ cfg[key] = val
154
+
155
+ return cfg
156
+
157
+ def _merge_cfg_from_command(self, args, cfg):
158
+ assert len(args.opts) % 2 == 0, 'Override list {} has odd length: {}.'.format(
159
+ args.opts, len(args.opts)
160
+ )
161
+ keys = args.opts[0::2]
162
+ vals = args.opts[1::2]
163
+
164
+ # maximum supported depth 3
165
+ for idx, key in enumerate(keys):
166
+ key_split = key.split('.')
167
+ assert len(key_split) <= 4, 'Key depth error.\nMaximum depth: 3\nGot depth: {}'.format(
168
+ len(key_split)
169
+ )
170
+ assert key_split[0] in cfg.keys(), 'Non-existent key: {}.'.format(
171
+ key_split[0]
172
+ )
173
+ if len(key_split) == 2:
174
+ assert key_split[1] in cfg[key_split[0]].keys(), 'Non-existent key: {}.'.format(
175
+ key
176
+ )
177
+ elif len(key_split) == 3:
178
+ assert key_split[1] in cfg[key_split[0]].keys(), 'Non-existent key: {}.'.format(
179
+ key
180
+ )
181
+ assert key_split[2] in cfg[key_split[0]][key_split[1]].keys(), 'Non-existent key: {}.'.format(
182
+ key
183
+ )
184
+ elif len(key_split) == 4:
185
+ assert key_split[1] in cfg[key_split[0]].keys(), 'Non-existent key: {}.'.format(
186
+ key
187
+ )
188
+ assert key_split[2] in cfg[key_split[0]][key_split[1]].keys(), 'Non-existent key: {}.'.format(
189
+ key
190
+ )
191
+ assert key_split[3] in cfg[key_split[0]][key_split[1]][key_split[2]].keys(), 'Non-existent key: {}.'.format(
192
+ key
193
+ )
194
+ if len(key_split) == 1:
195
+ cfg[key_split[0]] = vals[idx]
196
+ elif len(key_split) == 2:
197
+ cfg[key_split[0]][key_split[1]] = vals[idx]
198
+ elif len(key_split) == 3:
199
+ cfg[key_split[0]][key_split[1]][key_split[2]] = vals[idx]
200
+ elif len(key_split) == 4:
201
+ cfg[key_split[0]][key_split[1]][key_split[2]][key_split[3]] = vals[idx]
202
+ return cfg
203
+
204
+ def _update_dict(self, cfg_dict):
205
+ def recur(key, elem):
206
+ if type(elem) is dict:
207
+ return key, Config(load=False, cfg_dict=elem, cfg_level=key)
208
+ else:
209
+ # cast strings written in scientific notation (e.g. "1e-4") to float
+ if type(elem) is str and elem[1:3]=="e-":
210
+ elem = float(elem)
211
+ return key, elem
212
+ dic = dict(recur(k, v) for k, v in cfg_dict.items())
213
+ self.__dict__.update(dic)
214
+
215
+ def get_args(self):
216
+ return self.args
217
+
218
+ def __repr__(self):
219
+ return "{}\n".format(self.dump())
220
+
221
+ def dump(self):
222
+ return json.dumps(self.cfg_dict, indent=2)
223
+
224
+ def deep_copy(self):
225
+ return copy.deepcopy(self)
226
+
227
+ if __name__ == '__main__':
228
+ # debug
229
+ cfg = Config(load=True)
230
+ print(cfg.DATA)
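
As a usage sketch (run from the `UniAnimate/` directory; the trailing `seed 11` override pair is illustrative, and like any `opts` pair it is stored as a string):

```python
import sys
from utils.config import Config

# simulate: python inference.py --cfg configs/UniAnimate_infer.yaml seed 11
sys.argv = ['inference.py', '--cfg', 'configs/UniAnimate_infer.yaml', 'seed', '11']

cfg = Config(load=True)
print(cfg.cfg_file)   # argparse fields are merged into the config dict
print(cfg.seed)       # '11' -- command-line overrides are kept as strings
print(cfg.dump())     # JSON view of the merged configuration
```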
UniAnimate/utils/distributed.py ADDED
@@ -0,0 +1,430 @@
1
+ #!/usr/bin/env python3
2
+ # Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.
3
+
4
+ import torch
5
+ import torch.nn.functional as F
6
+ import torch.distributed as dist
7
+ import functools
8
+ import pickle
+ import logging
9
+ import numpy as np
10
+ from collections import OrderedDict
11
+ from torch.autograd import Function
12
+
13
+ __all__ = ['is_dist_initialized',
14
+ 'get_world_size',
15
+ 'get_rank',
16
+ 'new_group',
17
+ 'destroy_process_group',
18
+ 'barrier',
19
+ 'broadcast',
20
+ 'all_reduce',
21
+ 'reduce',
22
+ 'gather',
23
+ 'all_gather',
24
+ 'reduce_dict',
25
+ 'get_global_gloo_group',
26
+ 'generalized_all_gather',
27
+ 'generalized_gather',
28
+ 'scatter',
29
+ 'reduce_scatter',
30
+ 'send',
31
+ 'recv',
32
+ 'isend',
33
+ 'irecv',
34
+ 'shared_random_seed',
35
+ 'diff_all_gather',
36
+ 'diff_all_reduce',
37
+ 'diff_scatter',
38
+ 'diff_copy',
39
+ 'spherical_kmeans',
40
+ 'sinkhorn']
41
+
42
+ #-------------------------------- Distributed operations --------------------------------#
43
+
44
+ def is_dist_initialized():
45
+ return dist.is_available() and dist.is_initialized()
46
+
47
+ def get_world_size(group=None):
48
+ return dist.get_world_size(group) if is_dist_initialized() else 1
49
+
50
+ def get_rank(group=None):
51
+ return dist.get_rank(group) if is_dist_initialized() else 0
52
+
53
+ def new_group(ranks=None, **kwargs):
54
+ if is_dist_initialized():
55
+ return dist.new_group(ranks, **kwargs)
56
+ return None
57
+
58
+ def destroy_process_group():
59
+ if is_dist_initialized():
60
+ dist.destroy_process_group()
61
+
62
+ def barrier(group=None, **kwargs):
63
+ if get_world_size(group) > 1:
64
+ dist.barrier(group, **kwargs)
65
+
66
+ def broadcast(tensor, src, group=None, **kwargs):
67
+ if get_world_size(group) > 1:
68
+ return dist.broadcast(tensor, src, group, **kwargs)
69
+
70
+ def all_reduce(tensor, op=dist.ReduceOp.SUM, group=None, **kwargs):
71
+ if get_world_size(group) > 1:
72
+ return dist.all_reduce(tensor, op, group, **kwargs)
73
+
74
+ def reduce(tensor, dst, op=dist.ReduceOp.SUM, group=None, **kwargs):
75
+ if get_world_size(group) > 1:
76
+ return dist.reduce(tensor, dst, op, group, **kwargs)
77
+
78
+ def gather(tensor, dst=0, group=None, **kwargs):
79
+ rank = get_rank() # global rank
80
+ world_size = get_world_size(group)
81
+ if world_size == 1:
82
+ return [tensor]
83
+ tensor_list = [torch.empty_like(tensor) for _ in range(world_size)] if rank == dst else None
84
+ dist.gather(tensor, tensor_list, dst, group, **kwargs)
85
+ return tensor_list
86
+
87
+ def all_gather(tensor, uniform_size=True, group=None, **kwargs):
88
+ world_size = get_world_size(group)
89
+ if world_size == 1:
90
+ return [tensor]
91
+ assert tensor.is_contiguous(), 'ops.all_gather requires the tensor to be contiguous()'
92
+
93
+ if uniform_size:
94
+ tensor_list = [torch.empty_like(tensor) for _ in range(world_size)]
95
+ dist.all_gather(tensor_list, tensor, group, **kwargs)
96
+ return tensor_list
97
+ else:
98
+ # collect tensor shapes across GPUs
99
+ shape = tuple(tensor.shape)
100
+ shape_list = generalized_all_gather(shape, group)
101
+
102
+ # flatten the tensor
103
+ tensor = tensor.reshape(-1)
104
+ size = int(np.prod(shape))
105
+ size_list = [int(np.prod(u)) for u in shape_list]
106
+ max_size = max(size_list)
107
+
108
+ # pad to maximum size
109
+ if size != max_size:
110
+ padding = tensor.new_zeros(max_size - size)
111
+ tensor = torch.cat([tensor, padding], dim=0)
112
+
113
+ # all_gather
114
+ tensor_list = [torch.empty_like(tensor) for _ in range(world_size)]
115
+ dist.all_gather(tensor_list, tensor, group, **kwargs)
116
+
117
+ # reshape tensors
118
+ tensor_list = [t[:n].view(s) for t, n, s in zip(
119
+ tensor_list, size_list, shape_list)]
120
+ return tensor_list
121
+
122
+ @torch.no_grad()
123
+ def reduce_dict(input_dict, group=None, reduction='mean', **kwargs):
124
+ assert reduction in ['mean', 'sum']
125
+ world_size = get_world_size(group)
126
+ if world_size == 1:
127
+ return input_dict
128
+
129
+ # ensure that the orders of keys are consistent across processes
130
+ if isinstance(input_dict, OrderedDict):
131
+ keys = list(input_dict.keys())
132
+ else:
133
+ keys = sorted(input_dict.keys())
134
+ vals = [input_dict[key] for key in keys]
135
+ vals = torch.stack(vals, dim=0)
136
+ dist.reduce(vals, dst=0, group=group, **kwargs)
137
+ if dist.get_rank(group) == 0 and reduction == 'mean':
138
+ vals /= world_size
139
+ dist.broadcast(vals, src=0, group=group, **kwargs)
140
+ reduced_dict = type(input_dict)([
141
+ (key, val) for key, val in zip(keys, vals)])
142
+ return reduced_dict
143
+
144
+ @functools.lru_cache()
145
+ def get_global_gloo_group():
146
+ backend = dist.get_backend()
147
+ assert backend in ['gloo', 'nccl']
148
+ if backend == 'nccl':
149
+ return dist.new_group(backend='gloo')
150
+ else:
151
+ return dist.group.WORLD
152
+
153
+ def _serialize_to_tensor(data, group):
154
+ backend = dist.get_backend(group)
155
+ assert backend in ['gloo', 'nccl']
156
+ device = torch.device('cpu' if backend == 'gloo' else 'cuda')
157
+
158
+ buffer = pickle.dumps(data)
159
+ if len(buffer) > 1024 ** 3:
160
+ logger = logging.getLogger(__name__)
161
+ logger.warning(
162
+ 'Rank {} trying to all-gather {:.2f} GB of data on device'
163
+ '{}'.format(get_rank(), len(buffer) / (1024 ** 3), device))
164
+ storage = torch.ByteStorage.from_buffer(buffer)
165
+ tensor = torch.ByteTensor(storage).to(device=device)
166
+ return tensor
167
+
168
+ def _pad_to_largest_tensor(tensor, group):
169
+ world_size = dist.get_world_size(group=group)
170
+ assert world_size >= 1, \
171
+ 'gather/all_gather must be called from ranks within' \
172
+ 'the give group!'
173
+ local_size = torch.tensor(
174
+ [tensor.numel()], dtype=torch.int64, device=tensor.device)
175
+ size_list = [torch.zeros(
176
+ [1], dtype=torch.int64, device=tensor.device)
177
+ for _ in range(world_size)]
178
+
179
+ # gather tensors and compute the maximum size
180
+ dist.all_gather(size_list, local_size, group=group)
181
+ size_list = [int(size.item()) for size in size_list]
182
+ max_size = max(size_list)
183
+
184
+ # pad tensors to the same size
185
+ if local_size != max_size:
186
+ padding = torch.zeros(
187
+ (max_size - local_size, ),
188
+ dtype=torch.uint8, device=tensor.device)
189
+ tensor = torch.cat((tensor, padding), dim=0)
190
+ return size_list, tensor
191
+
192
+ def generalized_all_gather(data, group=None):
193
+ if get_world_size(group) == 1:
194
+ return [data]
195
+ if group is None:
196
+ group = get_global_gloo_group()
197
+
198
+ tensor = _serialize_to_tensor(data, group)
199
+ size_list, tensor = _pad_to_largest_tensor(tensor, group)
200
+ max_size = max(size_list)
201
+
202
+ # receiving tensors from all ranks
203
+ tensor_list = [torch.empty(
204
+ (max_size, ), dtype=torch.uint8, device=tensor.device)
205
+ for _ in size_list]
206
+ dist.all_gather(tensor_list, tensor, group=group)
207
+
208
+ data_list = []
209
+ for size, tensor in zip(size_list, tensor_list):
210
+ buffer = tensor.cpu().numpy().tobytes()[:size]
211
+ data_list.append(pickle.loads(buffer))
212
+ return data_list
213
+
214
+ def generalized_gather(data, dst=0, group=None):
215
+ world_size = get_world_size(group)
216
+ if world_size == 1:
217
+ return [data]
218
+ if group is None:
219
+ group = get_global_gloo_group()
220
+ rank = dist.get_rank() # global rank
221
+
222
+ tensor = _serialize_to_tensor(data, group)
223
+ size_list, tensor = _pad_to_largest_tensor(tensor, group)
224
+
225
+ # receiving tensors from all ranks to dst
226
+ if rank == dst:
227
+ max_size = max(size_list)
228
+ tensor_list = [torch.empty(
229
+ (max_size, ), dtype=torch.uint8, device=tensor.device)
230
+ for _ in size_list]
231
+ dist.gather(tensor, tensor_list, dst=dst, group=group)
232
+
233
+ data_list = []
234
+ for size, tensor in zip(size_list, tensor_list):
235
+ buffer = tensor.cpu().numpy().tobytes()[:size]
236
+ data_list.append(pickle.loads(buffer))
237
+ return data_list
238
+ else:
239
+ dist.gather(tensor, [], dst=dst, group=group)
240
+ return []
241
+
242
+ def scatter(data, scatter_list=None, src=0, group=None, **kwargs):
243
+ r"""NOTE: only supports CPU tensor communication.
244
+ """
245
+ if get_world_size(group) > 1:
246
+ return dist.scatter(data, scatter_list, src, group, **kwargs)
247
+
248
+ def reduce_scatter(output, input_list, op=dist.ReduceOp.SUM, group=None, **kwargs):
249
+ if get_world_size(group) > 1:
250
+ return dist.reduce_scatter(output, input_list, op, group, **kwargs)
251
+
252
+ def send(tensor, dst, group=None, **kwargs):
253
+ if get_world_size(group) > 1:
254
+ assert tensor.is_contiguous(), 'ops.send requires the tensor to be contiguous()'
255
+ return dist.send(tensor, dst, group, **kwargs)
256
+
257
+ def recv(tensor, src=None, group=None, **kwargs):
258
+ if get_world_size(group) > 1:
259
+ assert tensor.is_contiguous(), 'ops.recv requires the tensor to be contiguous()'
260
+ return dist.recv(tensor, src, group, **kwargs)
261
+
262
+ def isend(tensor, dst, group=None, **kwargs):
263
+ if get_world_size(group) > 1:
264
+ assert tensor.is_contiguous(), 'ops.isend requires the tensor to be contiguous()'
265
+ return dist.isend(tensor, dst, group, **kwargs)
266
+
267
+ def irecv(tensor, src=None, group=None, **kwargs):
268
+ if get_world_size(group) > 1:
269
+ assert tensor.is_contiguous(), 'ops.irecv requires the tensor to be contiguous()'
270
+ return dist.irecv(tensor, src, group, **kwargs)
271
+
272
+ def shared_random_seed(group=None):
273
+ seed = np.random.randint(2 ** 31)
274
+ all_seeds = generalized_all_gather(seed, group)
275
+ return all_seeds[0]
276
+
277
+ #-------------------------------- Differentiable operations --------------------------------#
278
+
279
+ def _all_gather(x):
280
+ if not (dist.is_available() and dist.is_initialized()) or dist.get_world_size() == 1:
281
+ return x
282
+ rank = dist.get_rank()
283
+ world_size = dist.get_world_size()
284
+ tensors = [torch.empty_like(x) for _ in range(world_size)]
285
+ tensors[rank] = x
286
+ dist.all_gather(tensors, x)
287
+ return torch.cat(tensors, dim=0).contiguous()
288
+
289
+ def _all_reduce(x):
290
+ if not (dist.is_available() and dist.is_initialized()) or dist.get_world_size() == 1:
291
+ return x
292
+ dist.all_reduce(x)
293
+ return x
294
+
295
+ def _split(x):
296
+ if not (dist.is_available() and dist.is_initialized()) or dist.get_world_size() == 1:
297
+ return x
298
+ rank = dist.get_rank()
299
+ world_size = dist.get_world_size()
300
+ return x.chunk(world_size, dim=0)[rank].contiguous()
301
+
302
+ class DiffAllGather(Function):
303
+ r"""Differentiable all-gather.
304
+ """
305
+ @staticmethod
306
+ def symbolic(graph, input):
307
+ return _all_gather(input)
308
+
309
+ @staticmethod
310
+ def forward(ctx, input):
311
+ return _all_gather(input)
312
+
313
+ @staticmethod
314
+ def backward(ctx, grad_output):
315
+ return _split(grad_output)
316
+
317
+ class DiffAllReduce(Function):
318
+ r"""Differentiable all-reducd.
319
+ """
320
+ @staticmethod
321
+ def symbolic(graph, input):
322
+ return _all_reduce(input)
323
+
324
+ @staticmethod
325
+ def forward(ctx, input):
326
+ return _all_reduce(input)
327
+
328
+ @staticmethod
329
+ def backward(ctx, grad_output):
330
+ return grad_output
331
+
332
+ class DiffScatter(Function):
333
+ r"""Differentiable scatter.
334
+ """
335
+ @staticmethod
336
+ def symbolic(graph, input):
337
+ return _split(input)
338
+
339
+ @staticmethod
340
+ def forward(ctx, input):
341
+ return _split(input)
342
+
343
+ @staticmethod
344
+ def backward(ctx, grad_output):
345
+ return _all_gather(grad_output)
346
+
347
+ class DiffCopy(Function):
348
+ r"""Differentiable copy that reduces all gradients during backward.
349
+ """
350
+ @staticmethod
351
+ def symbolic(graph, input):
352
+ return input
353
+
354
+ @staticmethod
355
+ def forward(ctx, input):
356
+ return input
357
+
358
+ @staticmethod
359
+ def backward(ctx, grad_output):
360
+ return _all_reduce(grad_output)
361
+
362
+ diff_all_gather = DiffAllGather.apply
363
+ diff_all_reduce = DiffAllReduce.apply
364
+ diff_scatter = DiffScatter.apply
365
+ diff_copy = DiffCopy.apply
366
+
367
+ #-------------------------------- Distributed algorithms --------------------------------#
368
+
369
+ @torch.no_grad()
370
+ def spherical_kmeans(feats, num_clusters, num_iters=10):
371
+ k, n, c = num_clusters, *feats.size()
372
+ ones = feats.new_ones(n, dtype=torch.long)
373
+
374
+ # distributed settings
375
+ rank = get_rank()
376
+ world_size = get_world_size()
377
+
378
+ # init clusters
379
+ rand_inds = torch.randperm(n)[:int(np.ceil(k / world_size))]
380
+ clusters = torch.cat(all_gather(feats[rand_inds]), dim=0)[:k]
381
+
382
+ # variables
383
+ new_clusters = feats.new_zeros(k, c)
384
+ counts = feats.new_zeros(k, dtype=torch.long)
385
+
386
+ # iterative Expectation-Maximization
387
+ for step in range(num_iters + 1):
388
+ # Expectation step
389
+ simmat = torch.mm(feats, clusters.t())
390
+ scores, assigns = simmat.max(dim=1)
391
+ if step == num_iters:
392
+ break
393
+
394
+ # Maximization step
395
+ new_clusters.zero_().scatter_add_(0, assigns.unsqueeze(1).repeat(1, c), feats)
396
+ all_reduce(new_clusters)
397
+
398
+ counts.zero_()
399
+ counts.index_add_(0, assigns, ones)
400
+ all_reduce(counts)
401
+
402
+ mask = (counts > 0)
403
+ clusters[mask] = new_clusters[mask] / counts[mask].view(-1, 1)
404
+ clusters = F.normalize(clusters, p=2, dim=1)
405
+ return clusters, assigns, scores
406
+
407
+ @torch.no_grad()
408
+ def sinkhorn(Q, eps=0.5, num_iters=3):
409
+ # normalize Q
410
+ Q = torch.exp(Q / eps).t()
411
+ sum_Q = Q.sum()
412
+ all_reduce(sum_Q)
413
+ Q /= sum_Q
414
+
415
+ # variables
416
+ n, m = Q.size()
417
+ u = Q.new_zeros(n)
418
+ r = Q.new_ones(n) / n
419
+ c = Q.new_ones(m) / (m * get_world_size())
420
+
421
+ # iterative update
422
+ cur_sum = Q.sum(dim=1)
423
+ all_reduce(cur_sum)
424
+ for i in range(num_iters):
425
+ u = cur_sum
426
+ Q *= (r / u).unsqueeze(1)
427
+ Q *= (c / Q.sum(dim=0)).unsqueeze(0)
428
+ cur_sum = Q.sum(dim=1)
429
+ all_reduce(cur_sum)
430
+ return (Q / Q.sum(dim=0, keepdim=True)).t().float()
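
These wrappers all degrade to single-process no-ops when `torch.distributed` has not been initialized, which makes them easy to smoke-test locally. A minimal sketch (assuming the import is done from the `UniAnimate/` directory):

```python
import torch
import utils.distributed as du

x = torch.randn(4, 3)
print(du.is_dist_initialized())                 # False outside init_process_group
print(du.get_world_size(), du.get_rank())       # 1 0
print(len(du.all_gather(x.contiguous())))       # 1 -- just [x] with a single process
print(du.generalized_all_gather({'step': 1}))   # picklable objects pass straight through
du.all_reduce(x)                                # no-op when world_size == 1
```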
UniAnimate/utils/logging.py ADDED
@@ -0,0 +1,90 @@
1
+ #!/usr/bin/env python3
2
+ # Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.
3
+
4
+ """Logging."""
5
+
6
+ import builtins
7
+ import decimal
8
+ import functools
9
+ import logging
10
+ import os
11
+ import sys
12
+ import simplejson
13
+ # from fvcore.common.file_io import PathManager
14
+
15
+ import utils.distributed as du
16
+
17
+
18
+ def _suppress_print():
19
+ """
20
+ Suppresses printing from the current process.
21
+ """
22
+
23
+ def print_pass(*objects, sep=" ", end="\n", file=sys.stdout, flush=False):
24
+ pass
25
+
26
+ builtins.print = print_pass
27
+
28
+
29
+ # @functools.lru_cache(maxsize=None)
30
+ # def _cached_log_stream(filename):
31
+ # return PathManager.open(filename, "a")
32
+
33
+
34
+ def setup_logging(cfg, log_file):
35
+ """
36
+ Sets up the logging for multiple processes. Only enable the logging for the
37
+ master process, and suppress logging for the non-master processes.
38
+ """
39
+ if du.is_master_proc():
40
+ # Enable logging for the master process.
41
+ logging.root.handlers = []
42
+ else:
43
+ # Suppress logging for non-master processes.
44
+ _suppress_print()
45
+
46
+ logger = logging.getLogger()
47
+ logger.setLevel(logging.INFO)
48
+ logger.propagate = False
49
+ plain_formatter = logging.Formatter(
50
+ "[%(asctime)s][%(levelname)s] %(name)s: %(lineno)4d: %(message)s",
51
+ datefmt="%m/%d %H:%M:%S",
52
+ )
53
+
54
+ if du.is_master_proc():
55
+ ch = logging.StreamHandler(stream=sys.stdout)
56
+ ch.setLevel(logging.DEBUG)
57
+ ch.setFormatter(plain_formatter)
58
+ logger.addHandler(ch)
59
+
60
+ if log_file is not None and du.is_master_proc(du.get_world_size()):
61
+ filename = os.path.join(cfg.OUTPUT_DIR, log_file)
62
+ fh = logging.FileHandler(filename)
63
+ fh.setLevel(logging.DEBUG)
64
+ fh.setFormatter(plain_formatter)
65
+ logger.addHandler(fh)
66
+
67
+
68
+ def get_logger(name):
69
+ """
70
+ Retrieve the logger with the specified name or, if name is None, return a
71
+ logger which is the root logger of the hierarchy.
72
+ Args:
73
+ name (string): name of the logger.
74
+ """
75
+ return logging.getLogger(name)
76
+
77
+
78
+ def log_json_stats(stats):
79
+ """
80
+ Logs json stats.
81
+ Args:
82
+ stats (dict): a dictionary of statistical information to log.
83
+ """
84
+ stats = {
85
+ k: decimal.Decimal("{:.6f}".format(v)) if isinstance(v, float) else v
86
+ for k, v in stats.items()
87
+ }
88
+ json_stats = simplejson.dumps(stats, sort_keys=True, use_decimal=True)
89
+ logger = get_logger(__name__)
90
+ logger.info("{:s}".format(json_stats))
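
Outside the multi-process setup, the helpers can be exercised with a plain `logging` configuration. A minimal sketch (the format string simply mirrors the one used above):

```python
import logging
from utils.logging import get_logger, log_json_stats

logging.basicConfig(
    level=logging.INFO,
    format="[%(asctime)s][%(levelname)s] %(name)s: %(message)s",
    datefmt="%m/%d %H:%M:%S",
)
logger = get_logger(__name__)
logger.info("config loaded")
log_json_stats({"step": 10, "loss": 0.123456789})  # floats are rounded to 6 decimals
```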
UniAnimate/utils/mp4_to_gif.py ADDED
@@ -0,0 +1,16 @@
1
+ import os
2
+
3
+
4
+
5
+ # source_mp4_dir = "outputs/UniAnimate_infer"
6
+ # target_gif_dir = "outputs/UniAnimate_infer_gif"
7
+
8
+ source_mp4_dir = "outputs/UniAnimate_infer_long"
9
+ target_gif_dir = "outputs/UniAnimate_infer_long_gif"
10
+
11
+ os.makedirs(target_gif_dir, exist_ok=True)
12
+ for video in os.listdir(source_mp4_dir):
13
+ video_dir = os.path.join(source_mp4_dir, video)
14
+ gif_dir = os.path.join(target_gif_dir, video.replace(".mp4", ".gif"))
15
+ cmd = f'ffmpeg -i {video_dir} {gif_dir}'
16
+ os.system(cmd)
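
If shell quoting or GIF size becomes a problem, an alternative (not part of the repo) is to call ffmpeg through `subprocess` with explicit list arguments and a downscaling filter:

```python
import os
import subprocess

source_mp4_dir = "outputs/UniAnimate_infer_long"
target_gif_dir = "outputs/UniAnimate_infer_long_gif"
os.makedirs(target_gif_dir, exist_ok=True)

for video in os.listdir(source_mp4_dir):
    if not video.endswith(".mp4"):
        continue
    src = os.path.join(source_mp4_dir, video)
    dst = os.path.join(target_gif_dir, video.replace(".mp4", ".gif"))
    # fps/scale filters keep the GIF small; list args avoid shell quoting issues
    subprocess.run(["ffmpeg", "-y", "-i", src, "-vf", "fps=15,scale=512:-1", dst], check=True)
```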
UniAnimate/utils/multi_port.py ADDED
@@ -0,0 +1,9 @@
1
+ import socket
2
+ from contextlib import closing
3
+
4
+ def find_free_port():
5
+ """ https://stackoverflow.com/questions/1365265/on-localhost-how-do-i-pick-a-free-port-number """
6
+ with closing(socket.socket(socket.AF_INET, socket.SOCK_STREAM)) as s:
7
+ s.bind(('', 0))
8
+ s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
9
+ return str(s.getsockname()[1])
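
A typical use is building a unique TCP `init_method` when several jobs share one host; the URL format below mirrors the `--init_method` default in `utils/config.py`:

```python
from utils.multi_port import find_free_port

init_method = f"tcp://localhost:{find_free_port()}"
print(init_method)  # e.g. tcp://localhost:53211 (port differs per call)
```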
UniAnimate/utils/optim/__init__.py ADDED
@@ -0,0 +1,2 @@
1
+ from .lr_scheduler import *
2
+ from .adafactor import *
UniAnimate/utils/optim/adafactor.py ADDED
@@ -0,0 +1,230 @@
1
+ import math
2
+ import torch
3
+ from torch.optim import Optimizer
4
+ from torch.optim.lr_scheduler import LambdaLR
5
+
6
+ __all__ = ['Adafactor']
7
+
8
+ class Adafactor(Optimizer):
9
+ """
10
+ AdaFactor pytorch implementation can be used as a drop in replacement for Adam original fairseq code:
11
+ https://github.com/pytorch/fairseq/blob/master/fairseq/optim/adafactor.py
12
+ Paper: *Adafactor: Adaptive Learning Rates with Sublinear Memory Cost* https://arxiv.org/abs/1804.04235 Note that
13
+ this optimizer internally adjusts the learning rate depending on the `scale_parameter`, `relative_step` and
14
+ `warmup_init` options. To use a manual (external) learning rate schedule you should set `scale_parameter=False` and
15
+ `relative_step=False`.
16
+ Arguments:
17
+ params (`Iterable[nn.parameter.Parameter]`):
18
+ Iterable of parameters to optimize or dictionaries defining parameter groups.
19
+ lr (`float`, *optional*):
20
+ The external learning rate.
21
+ eps (`Tuple[float, float]`, *optional*, defaults to (1e-30, 1e-3)):
22
+ Regularization constants for square gradient and parameter scale respectively
23
+ clip_threshold (`float`, *optional*, defaults 1.0):
24
+ Threshold of root mean square of final gradient update
25
+ decay_rate (`float`, *optional*, defaults to -0.8):
26
+ Coefficient used to compute running averages of square
27
+ beta1 (`float`, *optional*):
28
+ Coefficient used for computing running averages of gradient
29
+ weight_decay (`float`, *optional*, defaults to 0):
30
+ Weight decay (L2 penalty)
31
+ scale_parameter (`bool`, *optional*, defaults to `True`):
32
+ If True, learning rate is scaled by root mean square
33
+ relative_step (`bool`, *optional*, defaults to `True`):
34
+ If True, time-dependent learning rate is computed instead of external learning rate
35
+ warmup_init (`bool`, *optional*, defaults to `False`):
36
+ Time-dependent learning rate computation depends on whether warm-up initialization is being used
37
+ This implementation handles low-precision (FP16, bfloat) values, but we have not thoroughly tested.
38
+ Recommended T5 finetuning settings (https://discuss.huggingface.co/t/t5-finetuning-tips/684/3):
39
+ - Training without LR warmup or clip_threshold is not recommended.
40
+ - use scheduled LR warm-up to fixed LR
41
+ - use clip_threshold=1.0 (https://arxiv.org/abs/1804.04235)
42
+ - Disable relative updates
43
+ - Use scale_parameter=False
44
+ - Additional optimizer operations like gradient clipping should not be used alongside Adafactor
45
+ Example:
46
+ ```python
47
+ Adafactor(model.parameters(), scale_parameter=False, relative_step=False, warmup_init=False, lr=1e-3)
48
+ ```
49
+ Others reported the following combination to work well:
50
+ ```python
51
+ Adafactor(model.parameters(), scale_parameter=True, relative_step=True, warmup_init=True, lr=None)
52
+ ```
53
+ When using `lr=None` with [`Trainer`] you will most likely need to use [`~optimization.AdafactorSchedule`]
54
+ scheduler as following:
55
+ ```python
56
+ from transformers.optimization import Adafactor, AdafactorSchedule
57
+ optimizer = Adafactor(model.parameters(), scale_parameter=True, relative_step=True, warmup_init=True, lr=None)
58
+ lr_scheduler = AdafactorSchedule(optimizer)
59
+ trainer = Trainer(..., optimizers=(optimizer, lr_scheduler))
60
+ ```
61
+ Usage:
62
+ ```python
63
+ # replace AdamW with Adafactor
64
+ optimizer = Adafactor(
65
+ model.parameters(),
66
+ lr=1e-3,
67
+ eps=(1e-30, 1e-3),
68
+ clip_threshold=1.0,
69
+ decay_rate=-0.8,
70
+ beta1=None,
71
+ weight_decay=0.0,
72
+ relative_step=False,
73
+ scale_parameter=False,
74
+ warmup_init=False,
75
+ )
76
+ ```"""
77
+
78
+ def __init__(
79
+ self,
80
+ params,
81
+ lr=None,
82
+ eps=(1e-30, 1e-3),
83
+ clip_threshold=1.0,
84
+ decay_rate=-0.8,
85
+ beta1=None,
86
+ weight_decay=0.0,
87
+ scale_parameter=True,
88
+ relative_step=True,
89
+ warmup_init=False,
90
+ ):
91
+ r"""require_version("torch>=1.5.0") # add_ with alpha
92
+ """
93
+ if lr is not None and relative_step:
94
+ raise ValueError("Cannot combine manual `lr` and `relative_step=True` options")
95
+ if warmup_init and not relative_step:
96
+ raise ValueError("`warmup_init=True` requires `relative_step=True`")
97
+
98
+ defaults = dict(
99
+ lr=lr,
100
+ eps=eps,
101
+ clip_threshold=clip_threshold,
102
+ decay_rate=decay_rate,
103
+ beta1=beta1,
104
+ weight_decay=weight_decay,
105
+ scale_parameter=scale_parameter,
106
+ relative_step=relative_step,
107
+ warmup_init=warmup_init,
108
+ )
109
+ super().__init__(params, defaults)
110
+
111
+ @staticmethod
112
+ def _get_lr(param_group, param_state):
113
+ rel_step_sz = param_group["lr"]
114
+ if param_group["relative_step"]:
115
+ min_step = 1e-6 * param_state["step"] if param_group["warmup_init"] else 1e-2
116
+ rel_step_sz = min(min_step, 1.0 / math.sqrt(param_state["step"]))
117
+ param_scale = 1.0
118
+ if param_group["scale_parameter"]:
119
+ param_scale = max(param_group["eps"][1], param_state["RMS"])
120
+ return param_scale * rel_step_sz
121
+
122
+ @staticmethod
123
+ def _get_options(param_group, param_shape):
124
+ factored = len(param_shape) >= 2
125
+ use_first_moment = param_group["beta1"] is not None
126
+ return factored, use_first_moment
127
+
128
+ @staticmethod
129
+ def _rms(tensor):
130
+ return tensor.norm(2) / (tensor.numel() ** 0.5)
131
+
132
+ @staticmethod
133
+ def _approx_sq_grad(exp_avg_sq_row, exp_avg_sq_col):
134
+ # copy from fairseq's adafactor implementation:
135
+ # https://github.com/huggingface/transformers/blob/8395f14de6068012787d83989c3627c3df6a252b/src/transformers/optimization.py#L505
136
+ r_factor = (exp_avg_sq_row / exp_avg_sq_row.mean(dim=-1, keepdim=True)).rsqrt_().unsqueeze(-1)
137
+ c_factor = exp_avg_sq_col.unsqueeze(-2).rsqrt()
138
+ return torch.mul(r_factor, c_factor)
139
+
140
+ def step(self, closure=None):
141
+ """
142
+ Performs a single optimization step
143
+ Arguments:
144
+ closure (callable, optional): A closure that reevaluates the model
145
+ and returns the loss.
146
+ """
147
+ loss = None
148
+ if closure is not None:
149
+ loss = closure()
150
+
151
+ for group in self.param_groups:
152
+ for p in group["params"]:
153
+ if p.grad is None:
154
+ continue
155
+ grad = p.grad.data
156
+ if grad.dtype in {torch.float16, torch.bfloat16}:
157
+ grad = grad.float()
158
+ if grad.is_sparse:
159
+ raise RuntimeError("Adafactor does not support sparse gradients.")
160
+
161
+ state = self.state[p]
162
+ grad_shape = grad.shape
163
+
164
+ factored, use_first_moment = self._get_options(group, grad_shape)
165
+ # State Initialization
166
+ if len(state) == 0:
167
+ state["step"] = 0
168
+
169
+ if use_first_moment:
170
+ # Exponential moving average of gradient values
171
+ state["exp_avg"] = torch.zeros_like(grad)
172
+ if factored:
173
+ state["exp_avg_sq_row"] = torch.zeros(grad_shape[:-1]).to(grad)
174
+ state["exp_avg_sq_col"] = torch.zeros(grad_shape[:-2] + grad_shape[-1:]).to(grad)
175
+ else:
176
+ state["exp_avg_sq"] = torch.zeros_like(grad)
177
+
178
+ state["RMS"] = 0
179
+ else:
180
+ if use_first_moment:
181
+ state["exp_avg"] = state["exp_avg"].to(grad)
182
+ if factored:
183
+ state["exp_avg_sq_row"] = state["exp_avg_sq_row"].to(grad)
184
+ state["exp_avg_sq_col"] = state["exp_avg_sq_col"].to(grad)
185
+ else:
186
+ state["exp_avg_sq"] = state["exp_avg_sq"].to(grad)
187
+
188
+ p_data_fp32 = p.data
189
+ if p.data.dtype in {torch.float16, torch.bfloat16}:
190
+ p_data_fp32 = p_data_fp32.float()
191
+
192
+ state["step"] += 1
193
+ state["RMS"] = self._rms(p_data_fp32)
194
+ lr = self._get_lr(group, state)
195
+
196
+ beta2t = 1.0 - math.pow(state["step"], group["decay_rate"])
197
+ update = (grad**2) + group["eps"][0]
198
+ if factored:
199
+ exp_avg_sq_row = state["exp_avg_sq_row"]
200
+ exp_avg_sq_col = state["exp_avg_sq_col"]
201
+
202
+ exp_avg_sq_row.mul_(beta2t).add_(update.mean(dim=-1), alpha=(1.0 - beta2t))
203
+ exp_avg_sq_col.mul_(beta2t).add_(update.mean(dim=-2), alpha=(1.0 - beta2t))
204
+
205
+ # Approximation of exponential moving average of square of gradient
206
+ update = self._approx_sq_grad(exp_avg_sq_row, exp_avg_sq_col)
207
+ update.mul_(grad)
208
+ else:
209
+ exp_avg_sq = state["exp_avg_sq"]
210
+
211
+ exp_avg_sq.mul_(beta2t).add_(update, alpha=(1.0 - beta2t))
212
+ update = exp_avg_sq.rsqrt().mul_(grad)
213
+
214
+ update.div_((self._rms(update) / group["clip_threshold"]).clamp_(min=1.0))
215
+ update.mul_(lr)
216
+
217
+ if use_first_moment:
218
+ exp_avg = state["exp_avg"]
219
+ exp_avg.mul_(group["beta1"]).add_(update, alpha=(1 - group["beta1"]))
220
+ update = exp_avg
221
+
222
+ if group["weight_decay"] != 0:
223
+ p_data_fp32.add_(p_data_fp32, alpha=(-group["weight_decay"] * lr))
224
+
225
+ p_data_fp32.add_(-update)
226
+
227
+ if p.data.dtype in {torch.float16, torch.bfloat16}:
228
+ p.data.copy_(p_data_fp32)
229
+
230
+ return loss
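
A quick smoke test of this port with an external learning rate, following the recommended settings in the docstring above (the toy parameter is illustrative):

```python
import torch
from utils.optim import Adafactor

w = torch.nn.Parameter(torch.randn(8, 4))
opt = Adafactor([w], lr=1e-3, relative_step=False,
                scale_parameter=False, warmup_init=False)

loss = (w ** 2).sum()
loss.backward()
opt.step()       # factored second-moment statistics for the 2D parameter
opt.zero_grad()
print(w.norm().item())
```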
UniAnimate/utils/optim/lr_scheduler.py ADDED
@@ -0,0 +1,58 @@
1
+ import math
2
+ from torch.optim.lr_scheduler import _LRScheduler
3
+
4
+ __all__ = ['AnnealingLR']
5
+
6
+ class AnnealingLR(_LRScheduler):
7
+
8
+ def __init__(self, optimizer, base_lr, warmup_steps, total_steps, decay_mode='cosine', min_lr=0.0, last_step=-1):
9
+ assert decay_mode in ['linear', 'cosine', 'none']
10
+ self.optimizer = optimizer
11
+ self.base_lr = base_lr
12
+ self.warmup_steps = warmup_steps
13
+ self.total_steps = total_steps
14
+ self.decay_mode = decay_mode
15
+ self.min_lr = min_lr
16
+ self.current_step = last_step + 1
17
+ self.step(self.current_step)
18
+
19
+ def get_lr(self):
20
+ if self.warmup_steps > 0 and self.current_step <= self.warmup_steps:
21
+ return self.base_lr * self.current_step / self.warmup_steps
22
+ else:
23
+ ratio = (self.current_step - self.warmup_steps) / (self.total_steps - self.warmup_steps)
24
+ ratio = min(1.0, max(0.0, ratio))
25
+ if self.decay_mode == 'linear':
26
+ return self.base_lr * (1 - ratio)
27
+ elif self.decay_mode == 'cosine':
28
+ return self.base_lr * (math.cos(math.pi * ratio) + 1.0) / 2.0
29
+ else:
30
+ return self.base_lr
31
+
32
+ def step(self, current_step=None):
33
+ if current_step is None:
34
+ current_step = self.current_step + 1
35
+ self.current_step = current_step
36
+ new_lr = max(self.min_lr, self.get_lr())
37
+ if isinstance(self.optimizer, list):
38
+ for o in self.optimizer:
39
+ for group in o.param_groups:
40
+ group['lr'] = new_lr
41
+ else:
42
+ for group in self.optimizer.param_groups:
43
+ group['lr'] = new_lr
44
+
45
+ def state_dict(self):
46
+ return {
47
+ 'base_lr': self.base_lr,
48
+ 'warmup_steps': self.warmup_steps,
49
+ 'total_steps': self.total_steps,
50
+ 'decay_mode': self.decay_mode,
51
+ 'current_step': self.current_step}
52
+
53
+ def load_state_dict(self, state_dict):
54
+ self.base_lr = state_dict['base_lr']
55
+ self.warmup_steps = state_dict['warmup_steps']
56
+ self.total_steps = state_dict['total_steps']
57
+ self.decay_mode = state_dict['decay_mode']
58
+ self.current_step = state_dict['current_step']
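
Putting the two pieces of `utils/optim` together, a sketch of a cosine schedule with warm-up (the step counts are illustrative):

```python
import torch
from utils.optim import Adafactor, AnnealingLR

params = [torch.nn.Parameter(torch.randn(4, 4))]
opt = Adafactor(params, lr=1e-3, relative_step=False, scale_parameter=False)
sched = AnnealingLR(opt, base_lr=1e-3, warmup_steps=10, total_steps=100,
                    decay_mode='cosine', min_lr=1e-6)

for step in range(1, 101):
    # forward / backward / opt.step() would go here
    sched.step(step)

print(opt.param_groups[0]['lr'])  # decays to min_lr at the end of the schedule
```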