Vector-Quantized Variational Autoencoders (VQ-VAE)

Model description

Learning latent space representations of data remains an important task in machine learning. This model, the Vector-Quantized Variational Autoencoder (VQ-VAE), builds upon traditional VAEs in two ways:

The encoder network outputs discrete, rather than continuous, codes.
The prior is learned rather than static.

To learn discrete latent representations, the model uses ideas from vector quantisation (VQ). Using the VQ method allows the model to avoid issues of "posterior collapse". By pairing these representations with an autoregressive prior, VQ-VAE models can generate high-quality images, videos, and speech, and can perform high-quality speaker conversion and unsupervised learning of phonemes, providing further evidence of the utility of the learnt representations.
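
The quantisation step described above can be illustrated with a minimal sketch: each encoder output vector is replaced by its nearest entry in a codebook, and the index of that entry becomes the discrete code. The function names, the toy codebook values, and the choice of squared Euclidean distance in plain Python are illustrative assumptions, not the code of the original example; a real implementation would typically do this with batched tensor operations.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(vectors, codebook):
    """Map each vector to the index and value of its nearest codebook entry."""
    indices = []
    for v in vectors:
        nearest = min(range(len(codebook)),
                      key=lambda k: squared_distance(v, codebook[k]))
        indices.append(nearest)
    quantized = [codebook[i] for i in indices]
    return indices, quantized


# Toy codebook with 3 embeddings of dimension 2 (hypothetical values).
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
encoder_outputs = [[0.1, -0.2], [3.5, 3.9], [0.9, 1.2]]

indices, quantized = quantize(encoder_outputs, codebook)
print(indices)    # the discrete codes
print(quantized)  # the vectors that would be passed on to a decoder
```

The list of indices is the discrete representation; the quantized vectors are what a decoder would receive in place of the continuous encoder outputs.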

Full credit for this example goes to Sayak Paul.

Further learning

This model was trained using code from this example and is based on this paper.

Model

Below we have a graphic from the paper above, showing the VQ-VAE model architecture and quantization process.

Intended uses & limitations

This model is intended for educational purposes. To train your own VQ-VAE model, follow along with this example.

Training and evaluation data

This model is trained on the popular MNIST dataset, which can be loaded with the following call:

keras.datasets.mnist.load_data()

Hyperparameters

The model was trained using the following hyperparameters:

Latent Dimension = 16
Number of Embeddings = 128
Epochs = 30

The author of the example encourages toying with both the number and size of the embeddings to see how it affects the results.
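
To get a feel for what these two settings control, the short calculation below works out the size of the codebook and the amount of information carried by each code index. The variable names are illustrative, and the figures follow only from the hyperparameters listed above.

```python
import math

latent_dim = 16        # size of each embedding vector
num_embeddings = 128   # number of entries in the codebook

# The codebook is a num_embeddings x latent_dim table of learned values.
codebook_parameters = num_embeddings * latent_dim

# Each discrete code selects one of num_embeddings entries.
bits_per_code = math.log2(num_embeddings)

print(codebook_parameters)  # 2048
print(bits_per_code)        # 7.0
```

Doubling the number of embeddings adds one bit per code, while changing the latent dimension only changes the size of each codebook vector.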

Reconstruction

Below, we can see a few examples of MNIST digits being reconstructed after passing through our model.

Discrete Latent Space

Below, we can see a few examples of MNIST digits being mapped to a discrete latent space.
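
One way to inspect such a mapping is to arrange the code indices of a single image on a grid and count how often each code is used. The sketch below assumes a hypothetical 2x3 grid and made-up index values; the actual grid size depends on the encoder architecture, which is not specified here, and the helper name `to_grid` is an illustration only.

```python
from collections import Counter


def to_grid(flat_indices, height, width):
    """Reshape a flat list of code indices into a height x width grid."""
    if len(flat_indices) != height * width:
        raise ValueError("number of indices does not match grid size")
    return [flat_indices[r * width:(r + 1) * width] for r in range(height)]


# Hypothetical code indices for one image (values in the range 0..127).
flat_codes = [5, 5, 17, 5, 90, 17]

code_grid = to_grid(flat_codes, height=2, width=3)
usage = Counter(flat_codes)

for row in code_grid:
    print(row)
print(usage.most_common(1))  # the most frequently used code and its count
```

A usage count that concentrates on a handful of indices suggests that much of the codebook is going unused, which is one thing to watch when varying the number of embeddings.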

Next Steps

The Keras example for this model shows that it can be paired with a PixelCNN for novel image generation. Check out the example linked above to try it yourself.

Citation

@article{oord2017neural,
    title={Neural Discrete Representation Learning},
    author={Aaron van den Oord and Oriol Vinyals and Koray Kavukcuoglu},
    year={2017},
    journal={arXiv preprint arXiv:1711.00937}
}