|
|
|
[![PyPI](https://img.shields.io/pypi/v/spatial-correlation-sampler.svg)](https://pypi.org/project/spatial-correlation-sampler/) |
|
|
|
|
|
# Pytorch Correlation module |
|
|
|
this is a custom C++/Cuda implementation of Correlation module, used e.g. in [FlowNetC](https://arxiv.org/abs/1504.06852) |
|
|
|
This [tutorial](http://pytorch.org/tutorials/advanced/cpp_extension.html) was used as a basis for implementation, as well as |
|
[NVIDIA's cuda code](https://github.com/NVIDIA/flownet2-pytorch/tree/master/networks/correlation_package) |
|
|
|
- Build and Install C++ and CUDA extensions by executing `python setup.py install`, |
|
- Benchmark C++ vs. CUDA by running `python benchmark.py {cpu, cuda}`, |
|
- Run gradient checks on the code by running `python grad_check.py --backend {cpu, cuda}`. |
|
|
|
# Requirements |
|
|
|
This module is expected to compile for Pytorch `2.1.0`. |
|
|
|
Before installation please check compatibility of your GPU and CUDA (_Compute Capability_) [nvidia docs](https://developer.nvidia.com/cuda-gpus). |
|
e.g RTX 6000 is using CC=8.9 so we are setting the environment variable to |
|
|
|
`export TORCH_CUDA_ARCH_LIST="8.9+PTX"` |
|
|
|
# Installation |
|
|
|
be reminded this module requires `python3-dev` to compile C++ code, e.g. on Ubuntu run: |
|
|
|
`apt install python3-dev` |
|
|
|
this module is available on pip |
|
|
|
`pip install spatial-correlation-sampler` |
|
|
|
For a cpu-only version, you can install from source with |
|
|
|
`python setup_cpu.py install` |
|
|
|
# Known Problems |
|
|
|
This module needs compatible gcc version and CUDA to be compiled. |
|
Namely, CUDA 9.1 and below will need gcc5, while CUDA 9.2 and 10.0 will need gcc7 |
|
See [this issue](https://github.com/ClementPinard/Pytorch-Correlation-extension/issues/1) for more information |
|
|
|
# Usage |
|
|
|
API has a few difference with NVIDIA's module |
|
* output is now a 5D tensor, which reflects the shifts horizontal and vertical. |
|
``` |
|
input (B x C x H x W) -> output (B x PatchH x PatchW x oH x oW) |
|
``` |
|
* Output sizes `oH` and `oW` are no longer dependant of patch size, but only of kernel size and padding |
|
* Patch size `patch_size` is now the whole patch, and not only the radii. |
|
* `stride1` is now `stride` and`stride2` is `dilation_patch`, which behave like dilated convolutions |
|
* equivalent `max_displacement` is then `dilation_patch * (patch_size - 1) / 2`. |
|
* `dilation` is a new parameter, it acts the same way as dilated convolution regarding the correlation kernel |
|
* to get the right parameters for FlowNetC, you would have |
|
``` |
|
kernel_size=1 |
|
patch_size=21, |
|
stride=1, |
|
padding=0, |
|
dilation=1 |
|
dilation_patch=2 |
|
``` |
|
|
|
|
|
## Example |
|
```python |
|
import torch |
|
from spatial_correlation_sampler import SpatialCorrelationSampler, |
|
|
|
device = "cuda" |
|
batch_size = 1 |
|
channel = 1 |
|
H = 10 |
|
W = 10 |
|
dtype = torch.float32 |
|
|
|
input1 = torch.randint(1, 4, (batch_size, channel, H, W), dtype=dtype, device=device, requires_grad=True) |
|
input2 = torch.randint_like(input1, 1, 4).requires_grad_(True) |
|
|
|
#You can either use the function or the module. Note that the module doesn't contain any parameter tensor. |
|
|
|
#function |
|
|
|
out = spatial_correlation_sample(input1, |
|
input2, |
|
kernel_size=3, |
|
patch_size=1, |
|
stride=2, |
|
padding=0, |
|
dilation=2, |
|
dilation_patch=1) |
|
|
|
#module |
|
|
|
correlation_sampler = SpatialCorrelationSampler( |
|
kernel_size=3, |
|
patch_size=1, |
|
stride=2, |
|
padding=0, |
|
dilation=2, |
|
dilation_patch=1) |
|
out = correlation_sampler(input1, input2) |
|
|
|
``` |
|
|
|
# Benchmark |
|
|
|
* default parameters are from `benchmark.py`, FlowNetC parameters are same as use in `FlowNetC` with a batch size of 4, described in [this paper](https://arxiv.org/abs/1504.06852), implemented [here](https://github.com/lmb-freiburg/flownet2) and [here](https://github.com/NVIDIA/flownet2-pytorch/blob/master/networks/FlowNetC.py). |
|
* Feel free to file an issue to add entries to this with your hardware ! |
|
|
|
## CUDA Benchmark |
|
|
|
* See [here](https://gist.github.com/ClementPinard/270e910147119831014932f67fb1b5ea) for a benchmark script working with [NVIDIA](https://github.com/NVIDIA/flownet2-pytorch/tree/master/networks/correlation_package)'s code, and Pytorch. |
|
* Benchmark are launched with environment variable `CUDA_LAUNCH_BLOCKING` set to `1`. |
|
* Only `float32` is benchmarked. |
|
* FlowNetC correlation parameters where launched with the following command: |
|
|
|
```bash |
|
CUDA_LAUNCH_BLOCKING=1 python benchmark.py --scale ms -k1 --patch 21 -s1 -p0 --patch_dilation 2 -b4 --height 48 --width 64 -c256 cuda -d float |
|
|
|
CUDA_LAUNCH_BLOCKING=1 python NV_correlation_benchmark.py --scale ms -k1 --patch 21 -s1 -p0 --patch_dilation 2 -b4 --height 48 --width 64 -c256 |
|
``` |
|
|
|
| implementation | Correlation parameters | device | pass | min time | avg time | |
|
| -------------- | ---------------------- | ------- | -------- | ------------: | ------------: | |
|
| ours | default | 980 GTX | forward | **5.745 ms** | **5.851 ms** | |
|
| ours | default | 980 GTX | backward | 77.694 ms | 77.957 ms | |
|
| NVIDIA | default | 980 GTX | forward | 13.779 ms | 13.853 ms | |
|
| NVIDIA | default | 980 GTX | backward | **73.383 ms** | **73.708 ms** | |
|
| | | | | | | |
|
| ours | FlowNetC | 980 GTX | forward | **26.102 ms** | **26.179 ms** | |
|
| ours | FlowNetC | 980 GTX | backward | **208.091 ms** | **208.510 ms** | |
|
| NVIDIA | FlowNetC | 980 GTX | forward | 35.363 ms | 35.550 ms | |
|
| NVIDIA | FlowNetC | 980 GTX | backward | 283.748 ms | 284.346 ms | |
|
|
|
### Notes |
|
* The overhead of our implementation regarding `kernel_size` > 1 during backward needs some investigation, feel free to |
|
dive in the code to improve it ! |
|
* The backward pass of NVIDIA is not entirely correct when stride1 > 1 and kernel_size > 1, because not everything |
|
is computed, see [here](https://github.com/NVIDIA/flownet2-pytorch/blob/master/networks/correlation_package/src/correlation_cuda_kernel.cu#L120). |
|
|
|
## CPU Benchmark |
|
|
|
* No other implementation is avalaible on CPU. |
|
* It is obviously not recommended to run it on CPU if you have a GPU. |
|
|
|
| Correlation parameters | device | pass | min time | avg time | |
|
| ---------------------- | -------------------- | -------- | ----------: | ----------: | |
|
| default | E5-2630 v3 @ 2.40GHz | forward | 159.616 ms | 188.727 ms | |
|
| default | E5-2630 v3 @ 2.40GHz | backward | 282.641 ms | 294.194 ms | |
|
| FlowNetC | E5-2630 v3 @ 2.40GHz | forward | 2.138 s | 2.144 s | |
|
| FlowNetC | E5-2630 v3 @ 2.40GHz | backward | 7.006 s | 7.075 s | |
|
|