Generate click coordinates from an image and an instruction
Audio Conditioned LipSync with Latent Diffusion Models