Introducing Stable Diffusion 3.0: Redefining Text-to-Image Generation with a New Architecture

Today, Stability AI unveils an early preview of its next-generation flagship text-to-image generative AI model, Stable Diffusion 3.0, built on a new diffusion transformer architecture.

Over the past year, Stability AI has steadily refined and released a series of image models, each more capable than the last. Following the notable SDXL release in July 2023, which significantly enhanced the Stable Diffusion base model, the company is now poised to push boundaries further.

What's New?

Stable Diffusion 3.0 aims to improve image quality and performance on multi-subject prompts while offering markedly better typography than previous iterations. This addresses a historical weakness of Stable Diffusion models and brings it closer to competitors such as DALL-E 3, Ideogram and Midjourney, which have also improved text rendering in recent releases. Stability AI is developing Stable Diffusion 3.0 in a range of model sizes, from 800 million to 8 billion parameters.

Moreover, Stable Diffusion 3.0 isn't merely an iteration of previous models; it's built on an entirely new architecture.

"Stable Diffusion 3 is a diffusion transformer, a novel architecture akin to the one used in the recent OpenAI Sora model," explains Emad Mostaque, CEO of Stability AI. "It represents the true successor to the original Stable Diffusion."

Diffusion Transformers and Flow Matching: Pioneering Image Generation

Stability AI has explored diverse approaches to image generation, including the recent preview release of Stable Cascade, which leverages the Würstchen architecture for enhanced performance and accuracy. In contrast, Stable Diffusion 3.0 adopts diffusion transformers.

"Stable Diffusion did not incorporate a transformer previously," notes Mostaque.

Transformers have been pivotal in the advancement of generative AI, primarily in text generation models, while image generation has predominantly relied on diffusion models. The research paper detailing Diffusion Transformers (DiTs) introduces an architecture that replaces the commonly used U-Net backbone with a transformer operating on patches of the latent image, an approach its authors report uses compute more efficiently and can outperform U-Net-based diffusion models on image-generation benchmarks.
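To make the idea concrete, here is a minimal, illustrative PyTorch sketch of a DiT-style backbone: the latent image is "patchified" into tokens, run through transformer blocks, and projected back to a latent-shaped noise prediction. Everything here, including layer sizes and names, is an assumption for illustration; real DiTs also use positional embeddings and timestep conditioning, which are omitted for brevity.

```python
# Toy DiT-style backbone: a transformer over latent image patches in place of
# a U-Net. Illustrative only; not Stability AI's implementation.
import torch
import torch.nn as nn

class ToyDiT(nn.Module):
    def __init__(self, latent_channels=4, patch_size=2, dim=256, depth=4, heads=4):
        super().__init__()
        self.patch_size = patch_size
        # "Patchify": fold each patch of the latent image into one token.
        self.to_tokens = nn.Conv2d(latent_channels, dim,
                                   kernel_size=patch_size, stride=patch_size)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        # Project each token back to a patch of predicted noise.
        self.to_noise = nn.Linear(dim, latent_channels * patch_size ** 2)

    def forward(self, latents):
        b, c, h, w = latents.shape
        tokens = self.to_tokens(latents)            # (b, dim, h/p, w/p)
        tokens = tokens.flatten(2).transpose(1, 2)  # (b, num_patches, dim)
        tokens = self.blocks(tokens)                # global attention over patches
        p = self.patch_size
        patches = self.to_noise(tokens)             # (b, num_patches, c*p*p)
        patches = patches.transpose(1, 2)           # (b, c*p*p, num_patches)
        # Reassemble patch predictions into a latent-shaped noise estimate.
        return nn.functional.fold(patches, output_size=(h, w),
                                  kernel_size=p, stride=p)

noise_pred = ToyDiT()(torch.randn(1, 4, 32, 32))
print(noise_pred.shape)  # torch.Size([1, 4, 32, 32]), same shape as the input
```

Because attention is global, every patch token can influence every other one, which is one intuition for why such backbones handle compositional, multi-subject prompts well.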

Additionally, Stable Diffusion 3.0 benefits from flow matching, an innovation detailed in a separate research paper. Flow matching is a new approach for training Continuous Normalizing Flows (CNFs) to model complex data distributions. By employing Conditional Flow Matching (CFM) with optimal transport paths, the model can be trained faster, sampled more efficiently, and achieve better results than training with standard diffusion paths.
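In simplified form, the CFM training objective with straight-line (optimal transport) paths can be sketched as follows. This compresses the published formulation to the sigma_min = 0 case, and the `model(xt, t)` signature is an assumption, not Stability AI's actual trainer:

```python
# Minimal Conditional Flow Matching (CFM) loss with straight-line OT paths.
# Illustrative sketch under simplifying assumptions (sigma_min = 0).
import torch

def cfm_loss(model, x1):
    """x1: a batch of real data samples, shape (b, ...)."""
    x0 = torch.randn_like(x1)                       # noise endpoint of the path
    # One random time per sample, broadcast over the remaining dims.
    t = torch.rand(x1.shape[0], *([1] * (x1.dim() - 1)), device=x1.device)
    xt = (1 - t) * x0 + t * x1                      # point on the straight OT path
    target_velocity = x1 - x0                       # conditional velocity along the path
    # Regress the model's predicted velocity onto the conditional target.
    return torch.mean((model(xt, t) - target_velocity) ** 2)
```

At sampling time, the learned velocity field is integrated from t = 0 to t = 1 with an ODE solver (for example, a few Euler steps); the straighter the paths, the fewer steps sampling needs, which is where the efficiency gains come from.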

Enhanced Typography: The Power of Stable Diffusion 3.0

The improved typography in Stable Diffusion 3.0 is a result of several enhancements integrated into the new model.

"This is attributed to both the transformer architecture and additional text encoders," elaborates Mostaque. "Full sentences and coherent style are now achievable."

While initially showcased as a text-to-image generative AI technology, Stable Diffusion 3.0 will serve as the foundation for broader applications. Stability AI has been expanding into 3D image generation and video generation capabilities, emphasizing the versatility and adaptability of its open models.

"We offer open models adaptable to various needs," states Mostaque. "This series of models across sizes will drive the development of our next-generation visual models, encompassing video, 3D, and beyond."
