Fine-Tuning a 20B Model on 512GB Apple Silicon: What I Learned
I spent the last few weeks pushing my Mac Studio M3 Ultra to its limits. Spoiler: it can fine-tune a 20-billion-parameter model without breaking a sweat.
Why Run LLMs Locally?
The cloud is great until you're burning $500/month on API calls for experimentation. I wanted full control, no rate limits, no data leaving my network, and the ability to fine-tune models on proprietary data without worrying about ToS violations.
Apple Silicon changed the game. The 512GB of unified memory on my M3 Ultra means I can load models that would cost thousands to run in the cloud. Plus, MLX (Apple's ML framework) is optimized for Metal, making inference and training blazing fast.
The Setup
Hardware: Mac Studio M3 Ultra, 512GB RAM, 2TB SSD
Software stack:
- Ollama for quick inference testing
- LM Studio for model management and benchmarking
- MLX for fine-tuning (the real MVP)
I started with the gpt-oss 20B model. It's a solid open-weight base model, but I wanted to specialize it for technical writing and infrastructure documentation.
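Before any training, the dataset has to be in shape: one JSONL file, one example per line. The record below is illustrative; the prompt/completion field names are an assumption, so match them to whatever your data loader expects:
import json

# One hypothetical training record. The field names are an illustrative
# assumption, not a fixed schema; match them to your data loader.
examples = [
    {
        "prompt": "Write a hardening checklist for this Docker Compose service:\n<compose snippet>",
        "completion": "1. Pin the image to a digest.\n2. Drop root privileges with `user:`.\n3. ...",
    },
]

with open("training_data.jsonl", "w") as f:
    for record in examples:
        f.write(json.dumps(record) + "\n")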
Fine-Tuning Process
Here's the core training loop using MLX:
import mlx.core as mx
import mlx.nn as nn
import mlx.optimizers as optim
from mlx_lm import load

# Load the base model and tokenizer
model, tokenizer = load("gpt-oss-20b")

# Prepare the dataset (5,000 examples of technical docs).
# load_dataset is a small helper (not shown) that reads the JSONL file,
# tokenizes each example, and yields (inputs, targets) batches of token arrays.
train_data = load_dataset("./training_data.jsonl")

# LoRA fine-tuning config. Wiring these adapters into the attention projections
# is handled by mlx_lm's LoRA/tuner utilities and omitted here.
lora_config = {
    "rank": 8,
    "alpha": 16,
    "dropout": 0.05,
    "target_modules": ["q_proj", "v_proj"],
}

# Next-token cross-entropy loss
def compute_loss(model, inputs, targets):
    logits = model(inputs)
    return nn.losses.cross_entropy(logits, targets).mean()

# Training loop: value_and_grad differentiates the loss with respect to the
# model's trainable parameters, and the optimizer applies the update.
optimizer = optim.AdamW(learning_rate=3e-4)
loss_and_grad = nn.value_and_grad(model, compute_loss)

for epoch in range(3):
    for inputs, targets in train_data:
        loss, grads = loss_and_grad(model, inputs, targets)
        optimizer.update(model, grads)
        mx.eval(model.parameters(), optimizer.state)  # force MLX's lazy evaluation
    print(f"Epoch {epoch}, Loss: {loss.item()}")

# Save the fine-tuned weights (format inferred from the file extension)
model.save_weights("./gpt-oss-20b-tuned.safetensors")

Results:
- Initial validation loss: 1.85
- After fine-tuning: 0.79 (57% reduction!)
- Training time: ~14 hours for 3 epochs
- Peak memory usage: 487GB (still had headroom!)
The fine-tuned model now generates Docker configs, Traefik rules, and infrastructure docs that match my actual setup. It understands my conventions without me having to spell them out every time.
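Querying the tuned model from Python looks roughly like this. It's a sketch: the prompt and paths are illustrative placeholders, and it assumes the weights were saved with save_weights as in the training script above:
from mlx_lm import load, generate

# Reload the base architecture, then pull in the fine-tuned weights.
# Paths and the prompt are illustrative placeholders.
model, tokenizer = load("gpt-oss-20b")
model.load_weights("./gpt-oss-20b-tuned.safetensors")

prompt = (
    "Here is a docker-compose.yml for a Traefik reverse proxy:\n"
    "<compose file contents>\n\n"
    "Write a setup guide covering TLS, Docker networks, and security hardening."
)

# Generate a completion from the tuned model
print(generate(model, tokenizer, prompt=prompt, max_tokens=500))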
Real-World Use Cases
1. Infrastructure documentation: I feed it a Docker Compose file, and it generates comprehensive setup guides with security best practices.
2. Code review: I run it against integration code and it catches edge cases I missed (especially around OAuth flows and rate limiting).
Comparing Ollama, LM Studio, and MLX
Ollama: Best for quick inference. Dead simple CLI. Great for testing prompts.
LM Studio: Beautiful GUI for model management. Excellent for comparing models side-by-side. Limited fine-tuning support.
MLX: The power tool. Full control over training, quantization, and deployment. Requires Python chops but worth it for custom work.
My workflow: Prototype in Ollama → Benchmark in LM Studio → Fine-tune with MLX → Deploy back to Ollama for production use.
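Production use at the end of that chain is just an HTTP call to Ollama's local API. A minimal sketch; the model name is a placeholder for whatever the tuned model was registered as in Ollama, and the prompt is illustrative:
import requests

# Ollama serves a local REST API on port 11434 by default.
# "gpt-oss-20b-tuned" is a placeholder model name; the prompt is an example.
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "gpt-oss-20b-tuned",
        "prompt": "Document the backup strategy for a Postgres container.",
        "stream": False,
    },
    timeout=300,
)
print(response.json()["response"])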
Lessons Learned
- 512GB isn't overkill. I regularly use 400GB+ when fine-tuning. You want headroom for the OS and other processes.
- Dataset quality > dataset size. My 5,000 hand-curated examples outperformed 50,000 scraped samples.
- LoRA is magic. Low-rank adaptation lets you fine-tune massive models with minimal resources; see the sketch after this list. Game-changer.
- MLX is fast. I tried PyTorch with the MPS backend first; MLX was 3x faster for the same workload.
- Local inference is addictive. No rate limits, no API keys, no internet required. Once you experience it, cloud APIs feel slow.
- Monitor your temps. The Mac Studio ran warm during training (not hot, but noticeably warmer). Good ventilation matters.
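To make the LoRA point concrete, here's a minimal sketch of the idea in MLX: a frozen linear layer plus a small trainable low-rank update scaled by alpha/rank. It's my own illustration, not the adapter implementation mlx_lm actually uses:
import mlx.core as mx
import mlx.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update (illustrative)."""

    def __init__(self, in_dims: int, out_dims: int, rank: int = 8, alpha: int = 16):
        super().__init__()
        self.linear = nn.Linear(in_dims, out_dims, bias=False)
        self.linear.freeze()  # only the low-rank factors receive gradients
        self.lora_a = 0.01 * mx.random.normal((in_dims, rank))
        self.lora_b = mx.zeros((rank, out_dims))
        self.scale = alpha / rank

    def __call__(self, x):
        # Base projection plus the scaled low-rank correction
        return self.linear(x) + self.scale * ((x @ self.lora_a) @ self.lora_b)

# Only rank * (in_dims + out_dims) extra parameters are trained per adapted layer.
layer = LoRALinear(4096, 4096)
y = layer(mx.random.normal((1, 4096)))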
If you're serious about local AI, Apple Silicon is the most cost-effective path. A comparable GPU server would cost 5-10x more, require enterprise power, and sound like a jet engine. My Mac Studio is silent, sips power, and doubles as my daily driver.
Next up: I'm exploring quantization strategies to fit even larger models into memory without sacrificing quality. Stay tuned.