A Caltech-bred artificial intelligence startup emerged from stealth this week with a mathematical breakthrough that could fundamentally reshape how AI systems are deployed, compressing 8-billion parameter language models to run efficiently on smartphones while maintaining performance comparable to much larger systems.

PrismML, co-founded by Caltech mathematician Babak Hassibi, announced on April 1 the open-source release of its 1-bit Bonsai series of large language models. The flagship 1-bit Bonsai 8B model has 8.2 billion parameters but occupies only 1.15 GB of memory, approximately 14 times smaller than comparable 16-bit models.

While maintaining performance close to traditional 16-bit models, Bonsai 8B compresses memory usage from 16 GB to 1.15 GB, boosts inference speed by a factor of eight, and reduces energy consumption by up to 80%. The model also achieves impressive real-world speeds: an iPhone 17 Pro Max processes approximately 44 tokens per second.

“We spent years developing the mathematical theory required to compress a neural network without losing its reasoning capabilities,” said Babak Hassibi, CEO and Founder of PrismML and Professor at Caltech. “We see 1-bit not as an endpoint, but as a starting point. We are creating a new paradigm for AI: one that adapts to diverse hardware environments and delivers maximum intelligence per unit of compute and energy.”

The breakthrough addresses a critical constraint in AI deployment: the massive computational resources required to run advanced models have increasingly confined artificial intelligence to specialized data centers and massive compute clusters. Yet many of the most important uses of AI happen elsewhere, on phones, laptops, vehicles, robots, secure enterprise environments, and edge devices. Where AI can be deployed no longer aligns with where it is needed.

Industry Investment and Validation

PrismML has raised $16.25 million in SAFE and seed funding from Khosla Ventures, Cerberus Capital, and Caltech. The funding round attracted high-profile technology investors who view the breakthrough as transformational for the industry.

“This is not a minor iteration but a major technological and mathematical breakthrough, not just another small model,” said Vinod Khosla, founder of Khosla Ventures. “AI’s future will not be defined by who can build the largest datacenters. It will be defined by who can deliver the most intelligence per unit of energy and cost,” Khosla added.

Amir Salek of Cerberus Ventures, who founded and led Google’s TPU program, commented: “Power has become the ultimate bottleneck for scaling AI datacenters, and PrismML is fundamentally transforming the power-to-compute equation. Your bandwidth requirements drop dramatically, your memory size drops dramatically, the energy consumed moving data… also drops dramatically. I’m convinced PrismML has achieved a major mathematical breakthrough that has the potential to improve the economics of AI.”

Technical Innovation

PrismML’s Bonsai model family is built on an architecture in which “each weight is represented only by its sign, {−1, +1}, while a shared scale factor is stored for each group of weights,” rather than as a 16-bit or 32-bit floating-point number. Critically, this is not post-training quantization applied to an existing full-precision model; it is a native 1-bit architecture trained from scratch on Google TPU v4 hardware.
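The quoted description can be illustrated with a minimal sketch of sign-plus-group-scale quantization. The group size and the scale rule (mean absolute value per group) are assumptions for illustration; PrismML's exact training scheme is not public.

```python
import numpy as np

def quantize_1bit(weights, group_size=64):
    """Reduce each weight to its sign plus one shared scale per group.

    Illustrative only: the group size of 64 and the mean-absolute-value
    scale rule are assumptions, not PrismML's published method.
    """
    w = weights.reshape(-1, group_size)
    scales = np.abs(w).mean(axis=1, keepdims=True)   # one full-precision scale per group
    signs = np.where(w >= 0, 1, -1).astype(np.int8)  # each weight carries 1 bit
    return signs, scales

def dequantize(signs, scales):
    """Reconstruct approximate weights from signs and group scales."""
    return (signs * scales).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(size=4096).astype(np.float32)
signs, scales = quantize_1bit(w)
w_hat = dequantize(signs, scales)
```

Only the signs need 1-bit storage; the small number of shared scales is what preserves the magnitude information that plain sign quantization would discard.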

The practical consequence for inference is significant: the matrix multiplications that dominate transformer compute collapse into simple additions and subtractions, since multiplying by ±1 amounts to a sign flip and requires no floating-point multiplier. This is why the energy and speed gains are so large.
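To make that claim concrete, here is a minimal NumPy sketch (with hypothetical shapes and a per-row scale) of how a matrix-vector product with ±1 weights reduces to sign-selected additions, checked against an ordinary dense matvec:

```python
import numpy as np

def matvec_signed(signs, scales, x):
    """Matrix-vector product where every weight is +1 or -1.

    Each dot product becomes a sum of x[j] terms with flipped signs;
    no per-weight multiplications are needed, and the scale is applied
    once per output row at the end. Per-row scaling is a simplifying
    assumption for illustration.
    """
    acc = np.where(signs > 0, x, -x).sum(axis=1)  # additions/subtractions only
    return scales * acc

rng = np.random.default_rng(1)
signs = np.where(rng.normal(size=(8, 16)) >= 0, 1, -1).astype(np.int8)
scales = np.abs(rng.normal(size=8)).astype(np.float32)
x = rng.normal(size=16).astype(np.float32)

fast = matvec_signed(signs, scales, x)
reference = (signs * scales[:, None]) @ x  # equivalent dense float matvec
```

On real hardware the win is larger than this sketch suggests: the 1-bit weights also shrink memory traffic by 16x, which is where much of the energy saving comes from.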

The energy efficiency gains are substantial. Data from PrismML shows that Bonsai 8B’s energy efficiency is 4 to 5 times higher than that of the traditional FP16 version. On an M4 Pro Mac, energy consumption is 0.074 milliwatt-hours per token; on an iPhone 17 Pro Max, it is only 0.068 milliwatt-hours per token.

The 1-bit Bonsai 8B model is an 8-billion-parameter large language model in which each parameter is stored at 1-bit precision, trained on Google TPU v4 hardware. It is designed for seamless integration with existing AI workflows and is optimized for low-latency inference on consumer-grade CPUs, NPUs, and edge GPUs. The model achieves reasoning and language understanding comparable to FP16 (16-bit floating-point) 8B models, but with a fraction of the memory footprint (roughly 1 GB versus 16 GB).
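The memory figures can be checked with back-of-the-envelope arithmetic. The gap between the raw 1-bit count and the reported 1.15 GB presumably comes from the per-group scale factors and other full-precision components; that attribution is an assumption, since the exact layout is not published.

```python
params = 8.2e9                 # Bonsai 8B parameter count
fp16_gb = params * 2 / 1e9     # 2 bytes per weight at 16-bit precision
one_bit_gb = params / 8 / 1e9  # 1 bit per weight, packed 8 to a byte

print(f"FP16:  {fp16_gb:.2f} GB")    # ~16.4 GB
print(f"1-bit: {one_bit_gb:.2f} GB") # ~1.03 GB raw; 1.15 GB reported incl. overhead
print(f"FP16 vs reported 1.15 GB: {fp16_gb / 1.15:.1f}x")  # ~14x, matching the claim
```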

Market Context