Most generative image models today fall into one of two categories: diffusion models, like Stable Diffusion, or autoregressive models, like OpenAI’s GPT-4o. But Apple just released two papers showing there might be room for a third, largely forgotten technique: Normalizing Flows. And with a dash of Transformers on top, they may be more capable than previously thought.
First things first: What are Normalizing Flows?
Normalizing Flows (NFs) are a type of AI model that learns an invertible mathematical transformation from real-world data (like images) into simple noise, and then reverses that process to generate new samples.
The big advantage is that they can calculate the exact likelihood of each image they generate, something diffusion models can’t do. This makes flows especially appealing for tasks where understanding the probability of an outcome really matters.
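To see where that exact likelihood comes from, here is a minimal sketch of the change-of-variables math behind flows, using a one-dimensional affine map in NumPy (an illustration of the technique, not Apple’s code):

```python
# A normalizing flow in miniature: an invertible map sends data x to noise z,
# and the change-of-variables formula gives the exact log-likelihood of x.
# Real flows stack many learned invertible layers; the math is the same.
import numpy as np

mu, log_sigma = 1.5, 0.7  # parameters a real flow would learn from data

def forward(x):
    """Data -> noise: z = (x - mu) / sigma (invertible by construction)."""
    return (x - mu) * np.exp(-log_sigma)

def inverse(z):
    """Noise -> data: the direction used at generation time."""
    return z * np.exp(log_sigma) + mu

def exact_log_likelihood(x):
    """log p(x) = log N(z; 0, 1) + log |dz/dx|, with dz/dx = 1/sigma."""
    z = forward(x)
    log_base = -0.5 * (z**2 + np.log(2 * np.pi))  # standard-normal log-density
    log_det = -log_sigma                          # log-Jacobian of the affine map
    return log_base + log_det

x = inverse(np.random.randn(5))  # generate samples by inverting the flow
print(exact_log_likelihood(x))   # and score each one exactly
```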
But there’s a reason most people haven’t heard much about them lately: Early flow-based models produced images that looked blurry or lacked the detail and diversity offered by diffusion and transformer-based systems.
Study #1: TarFlow
In the paper “Normalizing Flows are Capable Generative Models”, Apple introduces a new model called TarFlow, short for Transformer AutoRegressive Flow.
At its core, TarFlow replaces the old, handcrafted layers used in previous flow models with Transformer blocks. It splits images into small patches and generates them in blocks, with each block predicted based on all the ones that came before. That’s what’s called autoregressive, the same underlying method OpenAI currently uses for image generation.
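As a rough illustration of that block-by-block pattern (the image size, patch size, and stand-in predictor below are assumptions for the sketch, not TarFlow’s actual architecture):

```python
# Toy sketch of block-autoregressive generation over image patches.
# `predict_block` stands in for the real Transformer flow; here it simply
# ignores its context and returns noise of the right shape.
import numpy as np

H = W = 16  # tiny "image" for illustration
P = 4       # patch size -> a 4x4 grid of patches
num_patches = (H // P) * (W // P)
patch_dim = P * P

def predict_block(previous_blocks):
    # In TarFlow, a Transformer conditions on all earlier blocks and
    # produces the parameters of an invertible transform for this block.
    return np.random.randn(patch_dim)

blocks = []
for _ in range(num_patches):
    blocks.append(predict_block(blocks))  # each block sees all prior blocks

# Reassemble the patch sequence into an image grid.
img = (np.stack(blocks)
         .reshape(H // P, W // P, P, P)
         .transpose(0, 2, 1, 3)
         .reshape(H, W))
print(img.shape)  # (16, 16)
```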

The key difference is that while OpenAI generates discrete tokens, treating images like long sequences of text-like symbols, Apple’s TarFlow generates pixel values directly, without tokenizing the image first. It’s a small but significant difference, because it lets Apple avoid the quality loss and rigidity that often come with compressing images into a fixed vocabulary of tokens.
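Here’s a toy example of the quality cost tokenization can carry (the eight-entry codebook is an arbitrary assumption; real models use far larger vocabularies): snapping continuous values to the nearest codebook entry throws away information, which a flow over raw values avoids.

```python
# Quantizing continuous "pixel" values to a fixed vocabulary loses detail.
import numpy as np

x = np.random.rand(6)            # continuous values in [0, 1]
codebook = np.linspace(0, 1, 8)  # fixed vocabulary of 8 "tokens"

tokens = np.abs(x[:, None] - codebook).argmin(axis=1)  # nearest token id
reconstructed = codebook[tokens]  # what a token-based model works with

print(np.round(x, 3))              # original values
print(np.round(reconstructed, 3))  # quantized values: note the error
```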
Still, there were limitations, especially when it came to scaling up to larger, high-res images. And that’s where the second study comes in.
Study #2: STARFlow
In the paper “STARFlow: Scaling Latent Normalizing Flows for High-resolution Image Synthesis”, Apple builds directly on TarFlow and presents STARFlow (Scalable Transformer AutoRegressive Flow), with key upgrades.
The biggest change: STARFlow no longer generates images directly in pixel space. Instead, it works on a compressed version of the image, then hands things off to a decoder that upsamples everything back to full resolution at the final step.

This shift to what is called latent space means STARFlow doesn’t need to predict millions of pixels directly. It can focus on the broader image structure first, leaving fine texture detail to the decoder.
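Here’s a shape-level sketch of that split (the compression factor, channel count, and stand-in decoder are illustrative assumptions, not STARFlow’s published design):

```python
# The flow only has to model a small latent grid; a decoder handles the
# jump back to full resolution.
import numpy as np

full_res, latent_res, channels = 256, 32, 4  # assumed 8x spatial compression

def flow_sample():
    # Stand-in for the autoregressive flow over the latent grid.
    return np.random.randn(latent_res, latent_res, channels)

def decode(latent):
    # Stand-in decoder: nearest-neighbour upsample, then keep 3 channels as "RGB".
    scale = full_res // latent_res
    up = latent.repeat(scale, axis=0).repeat(scale, axis=1)
    return up[..., :3]

image = decode(flow_sample())
print(image.shape)  # (256, 256, 3)
print(latent_res**2 * channels, "latent values vs", full_res**2 * 3, "pixels")
```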
Apple also reworked how the model handles text prompts. Instead of building a separate text encoder, STARFlow can plug in existing language models (like Google’s small language model Gemma, which in theory could run on-device) to handle language understanding when the user prompts it. This keeps the image generation side of the model focused on refining visual details.
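In practice, that plug-in approach could look something like the sketch below, written against the Hugging Face transformers API; the checkpoint name, the frozen-model setup, and the hand-off to the flow are assumptions for illustration, not Apple’s published configuration.

```python
# Conditioning on a frozen, off-the-shelf language model instead of training
# a bespoke text encoder.
import torch
from transformers import AutoModel, AutoTokenizer

name = "google/gemma-2b"  # assumed checkpoint; any small LM would work similarly
tok = AutoTokenizer.from_pretrained(name)
lm = AutoModel.from_pretrained(name).eval()  # frozen: no gradient updates

@torch.no_grad()
def embed_prompt(prompt: str) -> torch.Tensor:
    ids = tok(prompt, return_tensors="pt")
    # (1, seq_len, d_model) hidden states for the image model to attend to
    return lm(**ids).last_hidden_state

cond = embed_prompt("a watercolor fox in the snow")
print(cond.shape)
```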
How STARFlow compares with OpenAI’s 4o image generator
While Apple is rethinking flows, OpenAI has also recently moved beyond diffusion with its GPT-4o model. But their approach is fundamentally different.
GPT-4o treats images as sequences of discrete tokens, much like words in a sentence. When you ask ChatGPT to generate an image, the model predicts one image token at a time, building the picture piece by piece. This gives OpenAI enormous flexibility: the same model can generate text, images, and audio within a single, unified token stream.
The tradeoff? Token-by-token generation can be slow, especially for large or high-resolution images. And it’s extremely computationally expensive. But since GPT-4o runs entirely in the cloud, OpenAI isn’t as constrained by latency or power use.
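A quick back-of-envelope shows why (the 16-pixel patch size is an assumed figure, not OpenAI’s actual tokenizer): the number of sequential prediction steps grows with the square of the image side.

```python
# Sequential steps needed for token-by-token image generation.
patch = 16  # assumed pixels per token side; illustrative only
for res in (256, 512, 1024):
    steps = (res // patch) ** 2
    print(f"{res}x{res}: {steps} sequential steps")
```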
In short: both Apple and OpenAI are moving beyond diffusion, but while OpenAI is building for its data centers, Apple is clearly building for our pockets.