modelzoo.vision.pytorch.dit.layers.vae.VAEModel.AutoencoderKL#

class modelzoo.vision.pytorch.dit.layers.vae.VAEModel.AutoencoderKL[source]#

Bases: torch.nn.Module

Variational Autoencoder (VAE) model with KL loss from the paper Auto-Encoding Variational Bayes by Diederik P. Kingma and Max Welling.

This model inherits from [ModelMixin]. Check the superclass documentation for the generic methods the library implements for all models (such as downloading or saving).

Parameters
  • in_channels (int, optional, defaults to 3) – Number of channels in the input image.

  • out_channels (int, optional, defaults to 3) – Number of channels in the output.

  • down_block_types (Tuple[str], optional, defaults to ("DownEncoderBlock2D",)) – Tuple of downsample block types.

  • up_block_types (Tuple[str], optional, defaults to ("UpDecoderBlock2D",)) – Tuple of upsample block types.

  • block_out_channels (Tuple[int], optional, defaults to (64,)) – Tuple of block output channels.

  • act_fn (str, optional, defaults to “silu”) – The activation function to use.

  • latent_channels (int, optional, defaults to 4) – Number of channels in the latent space.

  • sample_size (int, optional, defaults to 32) – Sample input size.

  • scaling_factor (float, optional, defaults to 0.18215) – The component-wise standard deviation of the trained latent space, computed from the first batch of the training set. It is used to scale the latent space to unit variance when training the diffusion model: latents are scaled with the formula z = z * scaling_factor before being passed to the diffusion model, and scaled back to the original scale with z = 1 / scaling_factor * z when decoding (see the sketch after this list). For more details, refer to sections 4.3.2 and D.1 of the [High-Resolution Image Synthesis with Latent Diffusion Models](https://arxiv.org/abs/2112.10752) paper.
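A minimal construction sketch, assuming the module path above is importable; the constructor keywords mirror the __init__ signature documented below, and the latent tensor is an illustrative assumption:

    import torch
    from modelzoo.vision.pytorch.dit.layers.vae.VAEModel import AutoencoderKL

    # Construct the VAE with the documented defaults made explicit.
    vae = AutoencoderKL(
        in_channels=3,
        out_channels=3,
        down_block_types=("DownEncoderBlock2D",),
        up_block_types=("UpDecoderBlock2D",),
        block_out_channels=(64,),
        latent_channels=4,
        scaling_factor=0.18215,
    )

    # Scale latents toward unit variance before the diffusion model,
    # then undo the scaling before decoding back to pixel space.
    scaling_factor = 0.18215                # value passed to the constructor
    z = torch.randn(1, 4, 32, 32)           # (batch, latent_channels, H, W)
    z_scaled = z * scaling_factor           # z = z * scaling_factor
    z_restored = z_scaled / scaling_factor  # z = 1 / scaling_factor * z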

Methods

blend_h

blend_v

decode

disable_slicing

Disable sliced VAE decoding.

disable_tiling

Disable tiled VAE decoding.

enable_slicing

Enable sliced VAE decoding.

enable_tiling

Enable tiled VAE decoding.

encode

forward

Run the VAE forward pass: encode the input sample, optionally sample from the posterior, and decode.

tiled_decode

Decode a batch of images using a tiled decoder.

tiled_encode

Encode a batch of images using a tiled encoder.

__call__(*args: Any, **kwargs: Any) Any#

Call self as a function.

__init__(in_channels: int = 3, out_channels: int = 3, down_block_types: Tuple[str] = ('DownEncoderBlock2D',), up_block_types: Tuple[str] = ('UpDecoderBlock2D',), block_out_channels: Tuple[int] = (64,), layers_per_block: int = 1, act_fn: str = 'silu', latent_channels: int = 4, norm_num_groups: int = 32, sample_size: int = 32, scaling_factor: float = 0.18215, latent_size: Tuple[int, int] = (32, 32))[source]#
static __new__(cls, *args: Any, **kwargs: Any) Any#
disable_slicing()[source]#

Disable sliced VAE decoding. If enable_slicing was previously invoked, this method will go back to computing decoding in one step.

disable_tiling()[source]#

Disable tiled VAE decoding. If enable_tiling was previously invoked, this method will go back to computing decoding in one step.

enable_slicing()[source]#

Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor into slices to compute decoding in several steps. This is useful to save some memory and allow larger batch sizes.

enable_tiling(use_tiling: bool = True)[source]#

Enable tiled VAE decoding. When this option is enabled, the VAE will split the input tensor into tiles to compute decoding and encoding in several steps. This is useful to save a large amount of memory and to allow the processing of larger images.
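A hedged usage sketch of the slicing and tiling toggles, assuming a vae instance constructed as in the sketch above:

    # Enable the memory-saving modes before a memory-heavy decode ...
    vae.enable_slicing()   # decode the batch one slice at a time
    vae.enable_tiling()    # split large images into overlapping tiles

    # ... and return to single-step decoding afterwards.
    vae.disable_slicing()
    vae.disable_tiling()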

forward(sample: torch.FloatTensor, sample_posterior: bool = False, return_dict: bool = True, generator: Optional[torch.Generator] = None) Union[modelzoo.vision.pytorch.dit.layers.vae.VAEModel.DecoderOutput, torch.FloatTensor][source]#
Parameters
  • sample (torch.FloatTensor) – Input sample.

  • sample_posterior (bool, optional, defaults to False) – Whether to sample from the posterior.

  • return_dict (bool, optional, defaults to True) – Whether or not to return a [DecoderOutput] instead of a plain tuple.
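A minimal forward-pass sketch; the input shape is an illustrative assumption, and the sample attribute on the returned [DecoderOutput] is an assumption carried over from the diffusers-style API this class mirrors:

    import torch

    sample = torch.randn(1, 3, 256, 256)  # (batch, in_channels, H, W)
    generator = torch.Generator().manual_seed(0)

    out = vae(
        sample,
        sample_posterior=True,  # draw the latent from the encoder posterior
        return_dict=True,
        generator=generator,
    )
    reconstruction = out.sample  # assumed DecoderOutput field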

tiled_decode(z: torch.FloatTensor, return_dict: bool = True) Union[modelzoo.vision.pytorch.dit.layers.vae.VAEModel.DecoderOutput, torch.FloatTensor][source]#

Decode a batch of images using a tiled decoder. When tiling is enabled, the VAE splits the input tensor into tiles and decodes them in several steps. This keeps memory use constant regardless of image size. Tiled decoding differs slightly from non-tiled decoding because each tile is decoded separately; to avoid tiling artifacts, the tiles overlap and are blended together into a smooth output. You may still see tile-sized changes in the output, but they should be much less noticeable.

Parameters
  • z (torch.FloatTensor) – Input batch of latent vectors.

  • return_dict (bool, optional, defaults to True) – Whether or not to return a [DecoderOutput] instead of a plain tuple.

tiled_encode(x: torch.FloatTensor, return_dict: bool = True) modelzoo.vision.pytorch.dit.layers.vae.VAEModel.AutoencoderKLOutput[source]#

Encode a batch of images using a tiled encoder. When tiling is enabled, the VAE splits the input tensor into tiles and encodes them in several steps. This keeps memory use constant regardless of image size. Tiled encoding differs slightly from non-tiled encoding because each tile is encoded separately; to avoid tiling artifacts, the tiles overlap and are blended together into a smooth output. You may still see tile-sized changes in the output, but they should be much less noticeable.

Parameters
  • x (torch.FloatTensor) – Input batch of images.

  • return_dict (bool, optional, defaults to True) – Whether or not to return an [AutoencoderKLOutput] instead of a plain tuple.
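A round-trip sketch of tiled encoding and decoding for a large image, assuming the same vae instance as above; the latent_dist attribute on [AutoencoderKLOutput] is an assumption carried over from the diffusers API this class mirrors:

    import torch

    vae.enable_tiling()
    image = torch.randn(1, 3, 1024, 1024)  # large input where tiling pays off

    # Encode tile by tile, sample a latent, then decode tile by tile.
    enc = vae.tiled_encode(image, return_dict=True)
    latents = enc.latent_dist.sample()  # assumed AutoencoderKLOutput field
    dec = vae.tiled_decode(latents, return_dict=True)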