BRICS: Bi-level Feature Representation of Image CollectionS

Anonymous Authors
Concept of our method.

BRICS projects images into key codes and then uses the key codes to retrieve features from multi-scale feature grids, instead of directly encoding images into features.

Uncurated generated results from the diffusion model trained on our key codes.

Abstract

We present BRICS, a bi-level feature representation for image collections, consisting of a key code space on top of a multi-scale feature grid space. Our representation is learned by an autoencoder that encodes images into continuous key codes, which are then used to retrieve features from groups of multi-resolution feature grids. The key codes and feature grids are trained jointly and continuously with well-defined gradient flows, leading to high usage of the feature grids and improved generative modeling compared to discrete Vector Quantization (VQ). Unlike existing continuous representations such as KL-regularized latent codes, our key codes are strictly bounded in scale and variance. Overall, feature encoding by BRICS is compact, efficient to train, and enables generative modeling over the key codes with a diffusion model. Experimental results show that our method achieves reconstruction results comparable to VQ methods while using a smaller and more efficient decoder network (≈50% fewer GFlops). By applying a diffusion model over our key code space, we achieve state-of-the-art image synthesis performance on the LSUN-Church (≈29% lower CLIP-FID than LDM, ≈32% lower than StyleGAN2, and ≈44% lower than Projected GAN) and FFHQ datasets.
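To make the generative stage concrete, below is a minimal sketch of training a diffusion model directly on key codes rather than on pixels, using a standard DDPM-style epsilon-prediction loss in PyTorch. The tiny convolutional denoiser, the linear noise schedule, the omission of timestep conditioning, and the assumed code shape are simplifications for illustration only, not our actual training configuration.

```python
# Minimal DDPM-style training sketch over key codes instead of pixels.
# The tiny conv denoiser, linear beta schedule, and lack of timestep
# conditioning are simplifications for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

denoiser = nn.Sequential(                        # stand-in for a proper UNet
    nn.Conv2d(16, 64, 3, padding=1), nn.SiLU(),
    nn.Conv2d(64, 16, 3, padding=1),
)
opt = torch.optim.Adam(denoiser.parameters(), lr=1e-4)

def train_step(key_codes):
    # key_codes: (B, 16, 16, 16) continuous codes from the frozen encoder,
    # already bounded in scale, so no extra normalization is applied here.
    b = key_codes.shape[0]
    t = torch.randint(0, T, (b,))
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1)
    noise = torch.randn_like(key_codes)
    noisy = a_bar.sqrt() * key_codes + (1.0 - a_bar).sqrt() * noise
    pred = denoiser(noisy)                       # epsilon prediction
    loss = F.mse_loss(pred, noise)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

loss = train_step(torch.randn(8, 16, 16, 16))    # dummy batch of key codes
```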

Pipeline

Overall pipeline of our method in three parts: Encoding, Feature Retrieval and Decoding.
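As a concrete illustration of the Feature Retrieval step, the following minimal PyTorch sketch shows how continuous key codes can act as interpolation coordinates into learnable multi-resolution feature grids. The grid resolutions, channel counts, tanh bounding, and the use of grid_sample are illustrative assumptions rather than the exact implementation.

```python
# Minimal sketch of bi-level feature retrieval: key codes act as continuous
# coordinates into learnable multi-resolution feature grids.
# Grid resolutions, channel counts, and tanh bounding are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiLevelFeatures(nn.Module):
    def __init__(self, grid_resolutions=(16, 32, 64), grid_channels=8):
        super().__init__()
        # One learnable 2D feature grid per resolution level.
        self.grids = nn.ParameterList(
            [nn.Parameter(0.01 * torch.randn(1, grid_channels, r, r))
             for r in grid_resolutions]
        )

    def retrieve(self, key_codes):
        # key_codes: (B, H, W, 2) continuous outputs of the image encoder.
        coords = torch.tanh(key_codes)          # keep coordinates strictly bounded
        feats = []
        for grid in self.grids:
            g = grid.expand(coords.shape[0], -1, -1, -1)
            # Bilinear interpolation gives gradients to both the key codes
            # (coordinates) and the grid entries (values).
            feats.append(F.grid_sample(g, coords, align_corners=True))
        return torch.cat(feats, dim=1)          # (B, sum of grid channels, H, W)

rep = BiLevelFeatures()
keys = torch.randn(4, 16, 16, 2)                # stand-in for encoder output
features = rep.retrieve(keys)                   # would be fed to a small decoder
print(features.shape)                           # torch.Size([4, 24, 16, 16])
```

Because the retrieval is a differentiable interpolation rather than a discrete codebook lookup, both levels of the representation receive gradients, which is what enables the joint, continuous training described above.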

Results

Reconstruction

Our largest model (code size 16 × 16 × 16) has only 10M more parameters (≈24% increase) than the corresponding VQGAN and RQ-VAE models. Despite this modest increase, our model outperforms VQGAN and RQ-VAE across all metrics, showing 55% and 41% improvements in LPIPS, respectively. In terms of computational cost, our method requires ≈50% fewer GFlops thanks to its smaller decoder and the efficiency gained from the multi-scale feature grids.

Reconstruction metrics on the validation splits of the FFHQ and LSUN-Church datasets.

Trainable parameters and computational load of decoders. An * indicates the total number of parameters in the feature grids and the total computational cost of decoding and feature retrieval from the feature grids.


Generation

We achieve state-of-the-art CLIP-FID (≈44% lower than Projected GAN, ≈32% lower than StyleGAN2, and ≈29% lower than LDM) while remaining competitive on all other metrics. Moreover, the precision score of our generated images is significantly higher than that of the other methods, indicating a substantial reduction of low-quality samples in our results, while our recall is the second best on LSUN-Church and almost the same as StyleGAN2 on FFHQ.

Quantitative results of generation on the LSUN-Church dataset. Our relaxed-precision method (Config d) is adapted to two distinct noise schedulers (Configs a, b) and achieves a record-low CLIP-FID with Config b + d, while Config a + c (KL-reg with the min-SNR noise scheduler) yields much worse results. * denotes that we measure the metrics of Projected GAN using the checkpoint provided in the official Projected GAN GitHub repository. Underlined numbers are the second-best results.

Quantitative results of generation on the FFHQ dataset. * denotes results computed using the publicly released checkpoint from the LDM authors on GitHub.

Nearest Neighbour

Although our method achieves much higher precision scores than previous methods, high precision could in principle stem from memorizing the training data; we therefore perform a nearest neighbour search using LPIPS to demonstrate that our generated samples are unique and not mere retrievals from the training dataset. In the following images, the leftmost image in each row is a sample generated by our method, and the remaining images in that row are the nearest-neighbour search results from the training set.
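The sketch below shows one way to run such an LPIPS nearest-neighbour search with the off-the-shelf lpips package; the VGG backbone, batching, and tensor shapes are illustrative assumptions and do not necessarily match our evaluation code.

```python
# Minimal sketch of an LPIPS nearest-neighbour check: for each generated
# sample, find the training images with the smallest perceptual distance.
# Assumes the `lpips` package (pip install lpips); the VGG backbone and the
# data handling below are illustrative choices.
import torch
import lpips

loss_fn = lpips.LPIPS(net='vgg').eval()

@torch.no_grad()
def nearest_neighbours(generated, training_images, k=5, batch_size=32):
    # generated: (1, 3, H, W); training_images: (N, 3, H, W); both in [-1, 1].
    dists = []
    for batch in training_images.split(batch_size):
        d = loss_fn(generated.expand(batch.shape[0], -1, -1, -1), batch)
        dists.append(d.flatten())
    dists = torch.cat(dists)
    # Smallest LPIPS distances = perceptually closest training images.
    return torch.topk(dists, k, largest=False)

# Usage: distances, indices = nearest_neighbours(sample, train_images)
```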