Stable diffusion VAE fine tuning (backport AutoencoderKL and its config.yaml to taming-transformers) by rbbb · Pull Request #222 · CompVis/taming-transformers

rbbb · 2023-09-17T12:26:06Z

Can we have Stable Diffusion VAE fine-tuning directly from taming-transformers?

The code seems to work (obviously, AutoencoderKL was taken from taming-transformers).
Both the AutoencoderKL code and the config snippet were taken from stable-diffusion.
Usage is strictly identical to 'VQGAN with your own data'.

There is some safetensors loading code, but it doesn't work with torch 1.7, the version recommended for taming-transformers.
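
One way to keep a single entry point for both checkpoint formats is to import safetensors lazily, so that an old torch environment only fails when a `.safetensors` file is actually requested. This is a minimal sketch, not the loading code from this PR: `checkpoint_format` and `load_state_dict` are hypothetical helper names, and the `state_dict` unwrapping assumes a Lightning-style `.ckpt` file.

```python
from pathlib import Path


def checkpoint_format(path):
    """Return 'safetensors' or 'torch', decided only by the file extension."""
    return "safetensors" if Path(path).suffix == ".safetensors" else "torch"


def load_state_dict(path):
    """Load weights onto the CPU from a .safetensors or a torch checkpoint file."""
    if checkpoint_format(path) == "safetensors":
        try:
            # Imported lazily so environments without a working safetensors
            # install can still load regular .ckpt files.
            from safetensors.torch import load_file
        except ImportError as exc:
            raise RuntimeError(
                "safetensors could not be imported here; convert the file to "
                "a torch checkpoint in a newer environment first"
            ) from exc
        return load_file(str(path), device="cpu")

    import torch

    ckpt = torch.load(str(path), map_location="cpu")
    # Assumption: Lightning-style checkpoints nest the weights under
    # 'state_dict'; a bare state dict is returned unchanged.
    return ckpt.get("state_dict", ckpt)
```

With this layout, `load_state_dict("vae.ckpt")` never touches safetensors, so the torch 1.7 setup keeps working for regular checkpoints.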

Related discussions:
lllyasviel/ControlNet#500

Related code:
https://github.com/cccntu/fine-tune-models/blob/main/run_finetune_vae.py
(adapted from Patil Suraj's stable-diffusion-jax)


rbbb · 2023-09-19T16:43:13Z

Added a colab notebook in commit efb20eb.

I'm slightly confused about the actual objective function.

Some terms are sometimes dropped from the objective: see "[Community] Training AutoencoderKL", huggingface/diffusers#894.
The model card at https://huggingface.co/stabilityai/sd-vae-ft-mse-original lists additional objective functions that were used, as well as the training setup.
My best guess is that it boils down to aesthetic choices rather than correct math (clip skip tweaks, hand-picking LoRA checkpoints, etc.).
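
For orientation, the terms under discussion can be written as one weighted sum. This is a generic sketch with illustrative notation (the weights $\lambda$ and the symbol names are assumptions, not taken from this PR or from the linked pages); setting a weight to zero corresponds to dropping that term.

```latex
\mathcal{L}
  = \mathcal{L}_{\text{pix}}(x, \hat{x})
  + \lambda_{\text{p}}   \, \mathcal{L}_{\text{LPIPS}}(x, \hat{x})
  + \lambda_{\text{KL}}  \, D_{\text{KL}}\!\big(q(z \mid x) \,\|\, \mathcal{N}(0, I)\big)
  + \lambda_{\text{adv}} \, \mathcal{L}_{\text{adv}}(\hat{x})
```

Here $x$ is the input image, $\hat{x}$ its reconstruction, $\mathcal{L}_{\text{pix}}$ a pixel loss such as L1 or MSE, and $\mathcal{L}_{\text{adv}}$ the discriminator term.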

sgw-ite · 2024-04-18T01:29:31Z

Hi, I am using the code in https://github.com/CompVis/taming-transformers/pull/222/files. I would like to ask why you used VQLPIPS as the loss function on line 20 of configs/finetune_vae.yaml. Also, thank you very much for your code!

rbbb · 2024-04-18T13:23:09Z

Hi.

In the original pull request, I wrote 'it is an aesthetic choice'.

There are no rules about which metrics are used to fine-tune the VAE (the VAE police will not come to get you if you change the loss function). It is usual to drop the discriminator in fine-tuning, and by usual I mean 'common practice that people follow without formal verification or a peer-reviewed paper'. The same goes for dropping or including LPIPS: either choice gives you an aesthetically different result.

To take a classical example: https://en.wikipedia.org/wiki/Dither
After reducing the color space, a pixel loss alone will produce bands in the image.
A perceptual loss should achieve dithering, with the occasional bad pixel.

So you should run the two (with and without LPIPS), look very closely at your images, and see which one you prefer.
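
The dithering analogy can be checked numerically. The sketch below is not from the PR: it quantizes a 1-D gray ramp to 4 levels, once by plain rounding and once with simple error diffusion, then compares a per-pixel MSE with the MSE after a box blur. The box blur is only a crude, assumed stand-in for a perceptual metric (it is not LPIPS), and all function names are illustrative.

```python
def quantize(v, levels):
    """Round v in [0, 1] to the nearest of `levels` evenly spaced values."""
    step = 1.0 / (levels - 1)
    return round(v / step) * step


def quantize_plain(signal, levels):
    """Per-pixel rounding: minimizes pixel error but produces flat bands."""
    return [quantize(v, levels) for v in signal]


def quantize_error_diffusion(signal, levels):
    """1-D error diffusion: carry each pixel's rounding error to the next pixel."""
    out, carry = [], 0.0
    for v in signal:
        wanted = v + carry
        q = quantize(min(1.0, max(0.0, wanted)), levels)
        carry = wanted - q  # error left over for the next pixel
        out.append(q)
    return out


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def box_blur(signal, radius):
    """Local average, used here as a rough proxy for viewing from a distance."""
    n = len(signal)
    out = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        out.append(sum(signal[lo:hi]) / (hi - lo))
    return out


ramp = [i / 255 for i in range(256)]
banded = quantize_plain(ramp, levels=4)
dithered = quantize_error_diffusion(ramp, levels=4)

pixel_banded = mse(ramp, banded)
pixel_dithered = mse(ramp, dithered)
smooth_banded = mse(box_blur(ramp, 8), box_blur(banded, 8))
smooth_dithered = mse(box_blur(ramp, 8), box_blur(dithered, 8))

print(f"per-pixel MSE  banded={pixel_banded:.5f}  dithered={pixel_dithered:.5f}")
print(f"blurred MSE    banded={smooth_banded:.5f}  dithered={smooth_dithered:.5f}")
```

Plain rounding wins on per-pixel MSE (it is the per-pixel optimum) but leaves flat bands; error diffusion loses on per-pixel MSE yet tracks the ramp much more closely once neighboring pixels are averaged. Which of the two looks better is the aesthetic call described above.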

More formally, if you take a paper at random (say, the Stable Diffusion 3 paper, https://arxiv.org/pdf/2403.03206.pdf), you'll notice that model evaluation rests largely on human preference.

HTH

rbbb added 3 commits September 17, 2023 19:56

0d8a259 Stable diffusion VAE fine tuning (backport AutoencoderKL and its config.yaml to taming-transformers)
efb20eb Colab notebook
3484597 Colab notebook

rbbb wants to merge 3 commits into CompVis:master from rbbb:stable-diff-vae-finetuning
