Unlocking the Black Box: Researchers Crack Apple's Neural Engine for AI Training

A new breakthrough bypasses Apple's restrictions to enable hyper-efficient local model development on Mac silicon.

For years, the Apple Neural Engine (ANE) has been a specialized, inference-only component of Mac and iPhone chips—a high-speed lane for running AI, but a dead end for building it. Because Apple provides no public documentation for training, developers have been forced to rely on power-hungry GPUs for even small fine-tuning tasks. That wall has finally been breached by an independent researcher who found a way to run training loops directly on the hardware.

Bypassing the CoreML Gatekeeper

The project, led by researcher maderix, sidesteps Apple's official CoreML framework entirely. Instead of using supported tools, the methodology leverages undocumented private APIs such as _ANEClient to compile programs in memory using Model Intermediate Language (MIL). By feeding data through shared memory buffers, the system can perform the matrix multiplications and activation functions that neural networks require.

While Apple markets the M4 ANE at 38 TOPS, a figure quoted at INT8 precision, this research measures real-world throughput of roughly 19 TFLOPS at FP16 precision. That performance is achieved by baking weights directly into the compiled programs as constants. It represents a fundamental shift in how we perceive Apple Silicon and what its hidden hardware can actually do.
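To make the "weights baked in as constants" idea concrete, here is a minimal sketch in plain Python. It is an analogy only: the real project emits MIL and compiles it through private ANE APIs, whereas here a closure plays the role of the compiled program and a plain list stands in for a shared memory buffer. The function names are illustrative inventions, not APIs from the project.

```python
def compile_program(weights):
    # "Compilation": the weights are captured as constants inside the
    # returned program; only activations flow in through the input buffer
    # at run time, mirroring how the compiled ANE program is described.
    def program(input_buffer):
        # A single matrix-vector multiply, the core op of the compiled graph.
        return [sum(w * x for w, x in zip(row, input_buffer)) for row in weights]
    return program

W = [[1.0, 2.0], [3.0, 4.0]]   # constants baked in at "compile" time
prog = compile_program(W)
print(prog([1.0, 1.0]))        # -> [3.0, 7.0]
```

The design trade-off this illustrates: baking weights into the program makes each forward pass fast, but any weight update requires recompiling the program.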

Why Efficiency Matters for Local AI

The implications for power consumption are staggering. The M4 Neural Engine delivers roughly 6.6 TFLOPS per watt, making it over 80 times more efficient than an enterprise-grade NVIDIA A100 GPU for specific tasks. This efficiency allows for always-on local learning and fine-tuning without draining a laptop's battery or requiring a dedicated server.

However, the approach remains a hybrid experiment for now. While the ANE handles the heavy lifting of the forward and backward passes, weight gradients are still calculated on the CPU using Apple's Accelerate libraries. This creates a potential bottleneck, but it marks the first time the ANE's true capabilities have been exposed beyond Apple's strict software guardrails.
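The hybrid split described above can be sketched with a toy single-layer example. This is not the project's code: NumPy stands in for both the ANE and Accelerate, and the function names (`ane_forward`, `ane_backward`, `cpu_weight_grad`) are hypothetical labels showing which device handles which step.

```python
import numpy as np

rng = np.random.default_rng(0)

# Weights in FP16, matching the precision the article reports for the ANE.
W = rng.standard_normal((4, 3)).astype(np.float16)

def ane_forward(x):
    # Forward pass (a matmul) — the kind of op the compiled ANE program runs.
    return x @ W

def ane_backward(grad_out):
    # Backward pass w.r.t. the input, also expressible as a matmul on the ANE.
    return grad_out @ W.T

def cpu_weight_grad(x, grad_out):
    # Weight gradient dL/dW = x^T @ grad_out, computed on the CPU
    # (the article notes this step uses Apple's Accelerate libraries).
    return x.astype(np.float32).T @ grad_out.astype(np.float32)

x = rng.standard_normal((8, 4)).astype(np.float16)
y = ane_forward(x)            # activations, shape (8, 3)
grad_out = np.ones_like(y)    # stand-in loss gradient
gx = ane_backward(grad_out)   # input gradient, shape (8, 4)
gW = cpu_weight_grad(x, grad_out)  # weight gradient, shape (4, 3)
print(y.shape, gx.shape, gW.shape)
```

The bottleneck the article mentions is visible here: the weight-gradient matmul runs on the slower, more power-hungry device, so it caps overall training throughput even when the ANE steps are fast.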

