Fine-Tuning LLMs on Apple Silicon: Lessons from Production

I just spent 12 hours fine-tuning a 33B parameter model on my Mac Studio. Three crashes, two NaN catastrophes, and one breakthrough later, I have a custom coding assistant that actually understands my patterns. Here's what I learned.

๐ŸŽฏ Why Bother Fine-Tuning?

Most people run pre-trained models and call it a day. But when you're building real systems (church management software, business automation, multi-tenant SaaS), you need a model that speaks your language.

I'm talking about consistent multi-tenant patterns (claims.TenantID checks on every query), a service layer architecture (no business logic in handlers), and my actual coding style (not StackOverflow's greatest hits).

Foundation models are brilliant generalists. But I needed a specialist who knows my codebase.

๐Ÿง  The Setup: MLX on Mac Studio M3 Ultra

Apple's MLX framework turns Macs into legitimate ML training machines. My setup: a Mac Studio M3 Ultra with 512GB RAM, Qwen3 Coder Next (33B parameters), and 1,500 code pairs (75% Go, the rest Docker/SQL/Bash/Svelte). The goal: teach it my patterns for Pews (church management SaaS) and CTM projects.
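Before training, those 1,500 pairs get flattened into JSONL that mlx-lm's LoRA trainer can read. Here's a minimal sketch of that prep step; the prompt/completion field names and the train.jsonl/valid.jsonl layout are how I'd lay it out, but double-check the data format your mlx-lm version actually expects.

```python
import json
import random
from pathlib import Path

# Hypothetical prep script: (prompt, completion) code pairs -> JSONL files
# in the layout mlx-lm's LoRA trainer reads. Field names can vary by
# mlx-lm version, so verify against its docs before training.
pairs = [
    (
        "Add an audit log entry when a member record is updated.",
        "// Go service-layer code, scoped by claims.TenantID ...",
    ),
    # ...roughly 1,500 hand-picked examples in practice
]

random.seed(42)
random.shuffle(pairs)
split = max(1, int(len(pairs) * 0.9))      # 90/10 train/validation split
out = Path("data")
out.mkdir(exist_ok=True)

for name, subset in (("train", pairs[:split]), ("valid", pairs[split:])):
    with open(out / f"{name}.jsonl", "w") as f:
        for prompt, completion in subset:
            f.write(json.dumps({"prompt": prompt, "completion": completion}) + "\n")
```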

The promise: train a production model on the same hardware running my IDE. No cloud GPUs, no AWS bills, no data leaving my network.

๐Ÿ’ฅ Crash #1: The Metal Buffer Wall

First run looked promising: 30 iterations in, smooth loss curve... then Metal buffer allocation failed. The M3 Ultra has 512GB of unified memory, but Metal enforces its own allocation limits below that. My mistake: using the default batch size (4) with full 2048-token sequences.

Fix: batch_size = 1, grad_accumulation_steps = 4. Gradient accumulation simulates a larger batch by summing gradients across several small batches before updating the weights. Nearly the same learning dynamics, at roughly a quarter of the per-step memory.
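If you're writing the loop yourself instead of leaning on mlx_lm's trainer, the pattern looks roughly like this. A minimal sketch, assuming an MLX model, a loss_fn(model, inputs, targets), and an iterable of tokenized batches; none of these names come from my actual training script.

```python
import mlx.core as mx
import mlx.nn as nn
import mlx.optimizers as optim
from mlx.utils import tree_map


def train_with_accumulation(model, loss_fn, batches, accum_steps=4, lr=5e-6):
    """Accumulate gradients over `accum_steps` micro-batches, then take a
    single optimizer step, simulating a larger effective batch size."""
    optimizer = optim.Adam(learning_rate=lr)
    loss_and_grad_fn = nn.value_and_grad(model, loss_fn)  # loss_fn(model, inputs, targets)

    accum_grads = None
    for step, (inputs, targets) in enumerate(batches, start=1):
        loss, grads = loss_and_grad_fn(model, inputs, targets)

        # Running sum of gradients across micro-batches.
        accum_grads = grads if accum_grads is None else tree_map(
            lambda a, g: a + g, accum_grads, grads
        )

        if step % accum_steps == 0:
            # Average, update the weights once, then reset the accumulator.
            avg_grads = tree_map(lambda g: g / accum_steps, accum_grads)
            optimizer.update(model, avg_grads)
            mx.eval(model.parameters(), optimizer.state)
            accum_grads = None
```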

๐Ÿ’ฅ Crash #2: The NaN Incident

Second run made it to iteration 75. Loss dropping nicely, then every parameter turned into NaN. Model toast. Training run toast. 4 hours gone.

Root cause: Learning rate too aggressive. Changed from 5e-5 (standard for smaller models) to 5e-6 (10x smaller for 33B params). Larger models need gentler updates.

Lesson learned: Save checkpoints every 50 iterations, not 150. When you hit NaN at iteration 140, you want iteration 100 waiting for you, not iteration 0.
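Here's the shape of the guard I'd bolt onto that loop: check the loss for NaN every iteration, and dump the LoRA weights every 50. This is a sketch using MLX's safetensors saving; the callback wiring is illustrative, not lifted from mlx_lm.

```python
import math
from pathlib import Path

import mlx.core as mx
from mlx.utils import tree_flatten

SAVE_EVERY = 50


def save_adapter(model, path):
    # Persist only the trainable (LoRA) weights -- a few hundred MB,
    # not the full 33B base model.
    Path(path).parent.mkdir(parents=True, exist_ok=True)
    weights = dict(tree_flatten(model.trainable_parameters()))
    mx.save_safetensors(path, weights)


def on_iteration(model, iteration, loss):
    if math.isnan(loss.item()):
        # Bail out immediately; resume from the last good checkpoint
        # with the gentler 5e-6 learning rate.
        raise RuntimeError(f"NaN loss at iteration {iteration}, restore last checkpoint")
    if iteration % SAVE_EVERY == 0:
        save_adapter(model, f"adapters/adapter_{iteration:04d}.safetensors")
```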

โœ… Success: Iteration 150

Third time's the charm. Training took 6 hours. Best validation loss: 0.309 at iteration 100. The final checkpoint at iteration 150 came in at 0.347 (a slight overfit, but production-acceptable). The adapter file (LoRA weights): 180MB of learned patterns.

๐Ÿš€ Real-World Results

I fused the adapter into the base model (60GB final size) and tested on actual feature requests. When prompted to add audit logging, the base model gave generic CRUD examples with no tenant awareness. The fine-tuned model? It generated proper multi-tenant Go code with structured logging, context extraction, and my actual error handling patterns.
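For reference, loading the result with mlx-lm looks something like the snippet below. The paths are placeholders, and the load/generate signatures can differ slightly between mlx-lm versions (the fusing itself went through the mlx_lm.fuse tool), so treat this as a sketch rather than gospel.

```python
from mlx_lm import load, generate

# Either load the fused 60GB model directly, or the base model plus the
# 180MB adapter. Paths below are illustrative, not my real directory layout.
model, tokenizer = load(
    "models/qwen3-coder-base",      # base model path (or the fused model)
    adapter_path="adapters",        # omit this when using the fused weights
)

prompt = "Add audit logging to the member update endpoint."
print(generate(model, tokenizer, prompt=prompt, max_tokens=512))
```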

Speed: 43 tokens/sec with adapter (vs 59 tok/sec base). Worth the 27% slowdown for code that ships.

๐Ÿ—๏ธ Three-Tier Architecture

This fine-tune became part of a larger system: Reflexes (20B fine-tuned) for heartbeat checks and Discord replies, Muscle (33Bร—6 fine-tuned) for code generation and agent factory, Brain (Claude Opus API) for strategy and orchestration.
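In code, the routing is not much more than a lookup table. A sketch of the idea; the task kinds and tier names mirror the description above, and nothing here is copied from the real orchestrator.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Tier(Enum):
    REFLEXES = auto()   # 20B local fine-tune: heartbeats, Discord replies
    MUSCLE = auto()     # 33B local fine-tunes: code generation, agent factory
    BRAIN = auto()      # Claude Opus API: strategy and orchestration


@dataclass
class Task:
    kind: str           # e.g. "heartbeat", "codegen", "planning"
    prompt: str


# Illustrative routing table based on the split described above.
ROUTES = {
    "heartbeat": Tier.REFLEXES,
    "discord_reply": Tier.REFLEXES,
    "codegen": Tier.MUSCLE,
    "agent_factory": Tier.MUSCLE,
    "planning": Tier.BRAIN,
}


def route(task: Task) -> Tier:
    # Default unknown work to the API tier rather than a local model.
    return ROUTES.get(task.kind, Tier.BRAIN)
```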

Local models handle 60-70% of daily tasks. API calls dropped from $200/mo to $100/mo. The fine-tuned models know my patterns, so I waste fewer tokens on correction loops.

๐Ÿง  Lessons Learned

1. Start conservative, then optimize. Batch size 1, low learning rate, frequent checkpoints. Get one successful run before chasing speed.

2. Metal buffer limits are real. Unified memory ≠ unlimited Metal allocation. Monitor with Activity Monitor during training, or poll MLX's memory counters (see the sketch after this list).

3. Dataset quality beats quantity. 1,500 hand-picked examples beat 10,000 scraped StackOverflow answers. Every pair should teach a pattern you actually want.

4. Save often. NaN can strike anytime. Save every 50 iterations minimum. Disk is cheap, lost training is expensive.

5. MoE models need conservative settings. Mixture-of-Experts models route inputs through expert sub-networks. They're memory-hungry and sensitive to batch sizes.

6. Validation loss plateau doesn't mean failure. My best checkpoint was iteration 100, but I kept training to 150. The extra iterations still taught useful patterns despite higher validation loss.
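The memory-monitoring sketch promised in lesson 2: MLX exposes counters you can poll from inside the training loop. The function names have shifted between mx.metal.* and top-level mx.* across MLX releases, so check what your installed version provides.

```python
import mlx.core as mx


def log_memory(iteration):
    # Active and peak Metal allocations in GB. Older MLX releases expose
    # these under mx.metal.*, newer ones also at the top level (mx.get_*).
    active_gb = mx.metal.get_active_memory() / 1e9
    peak_gb = mx.metal.get_peak_memory() / 1e9
    print(f"iter {iteration}: active {active_gb:.1f} GB, peak {peak_gb:.1f} GB")


# Optionally cap allocations so a bad batch size fails fast instead of
# grinding the whole machine to a halt:
# mx.metal.set_memory_limit(400 * 1024**3)   # e.g. 400 of the 512 GB
```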

๐Ÿ“Š Cost Analysis

Cloud alternative (AWS p4d.24xlarge): $32.77/hour ร— 6 hours ร— 3 attempts = $589.86. My setup: Mac Studio (already owned) + $2 electricity = $2 total. Amortized over multiple training runs, the Mac Studio pays for itself fast. Plus: no data upload, no egress fees, instant iteration.

๐Ÿ”ฎ Next Steps

I'm working on a second fine-tune for my Friday assistant, optimized for heartbeat checks, infrastructure monitoring, and proactive notifications. Goal: handle 80% of routine tasks locally, reserve API calls for strategic work. The fine-tuning loop is now proven. Dataset โ†’ train โ†’ validate โ†’ deploy takes one evening. That's fast enough to iterate on real user feedback.


Running 100+ Docker containers across 4 servers, building SaaS from rural Georgia, fine-tuning models in the gap between school pickup and dinner. If you're optimizing infrastructure or building weird stuff, let's connect.