I spent the last few weeks pushing my Mac Studio M3 Ultra to its limits. Spoiler: it can fine-tune a 20-billion-parameter model without breaking a sweat.
🧠 Why Run LLMs Locally?
The cloud is great until you're burning $500/month on API calls for experimentation. I wanted full control, no rate limits, no data leaving my network, and the ability to fine-tune models on proprietary data without worrying about ToS violations.
Apple Silicon changed the game. The 512GB of unified memory on my M3 Ultra means I can load models that would cost thousands to run in the cloud. Plus, MLX (Apple's ML framework) is optimized for Metal, making inference and training blazing fast.
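To see why that memory headroom matters, here's a quick back-of-envelope check. It's a sketch: the helper is mine, and it counts weights only, ignoring KV cache and runtime overhead:

```python
def model_memory_gb(params_billions: float, bytes_per_param: float) -> float:
    """Rough weight-only memory footprint in decimal gigabytes."""
    return params_billions * 1e9 * bytes_per_param / 1e9

print(model_memory_gb(20, 2))    # fp16: 40.0 GB of weights
print(model_memory_gb(20, 0.5))  # 4-bit quantized: 10.0 GB
```

At fp16, a 20B model's weights alone need about 40GB, and a 70B model about 140GB, which is exactly the territory where unified memory beats a consumer GPU's 24GB.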
⚙️ The Setup
Hardware: Mac Studio M3 Ultra, 512GB RAM, 2TB SSD
Software stack:
- Ollama for quick inference testing
- LM Studio for model management and benchmarking
- MLX for fine-tuning (the real MVP)
I started with the gpt-oss 20B model. It's a solid open-source base model, but I wanted to specialize it for technical writing and infrastructure documentation.
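For context, my training data is plain JSONL, one example per line. The schema below is my own convention for illustration (MLX doesn't mandate these field names):

```python
import json

# One hand-written training record (illustrative content and field names)
example = {
    "prompt": "Write a Traefik router rule for a service on port 8080.",
    "completion": "http:\n  routers:\n    app:\n      rule: Host(`app.example.com`)\n",
}

# Append the record as a single JSON line
with open("training_data.jsonl", "a") as f:
    f.write(json.dumps(example) + "\n")
```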
🚀 Fine-Tuning Process
Here's the core training loop using MLX:
```python
import mlx.core as mx
import mlx.nn as nn
import mlx.optimizers as optim
from mlx_lm import load

# Load base model
model, tokenizer = load("gpt-oss-20b")

# Prepare dataset (5,000 examples of technical docs)
train_data = load_dataset("./training_data.jsonl")  # my own JSONL helper

# LoRA fine-tuning config
lora_config = {
    "rank": 8,
    "alpha": 16,
    "dropout": 0.05,
    "target_modules": ["q_proj", "v_proj"],
}

# Training loop. MLX optimizers update from gradients, so wrap the loss
# function (compute_loss is my helper) with nn.value_and_grad instead of
# passing the loss itself.
optimizer = optim.AdamW(learning_rate=3e-4)
loss_and_grad_fn = nn.value_and_grad(model, compute_loss)

for epoch in range(3):
    for batch in train_data:
        loss, grads = loss_and_grad_fn(model, batch)
        optimizer.update(model, grads)
        mx.eval(model.parameters(), optimizer.state)  # force lazy evaluation
    print(f"Epoch {epoch}, Loss: {loss.item()}")

# Save fine-tuned weights
model.save_weights("./gpt-oss-20b-tuned")
```

Results:
- Initial validation loss: 1.85
- After fine-tuning: 0.79 (57% reduction!)
- Training time: ~14 hours for 3 epochs
- Peak memory usage: 487GB (still had headroom!)
The fine-tuned model now generates Docker configs, Traefik rules, and infrastructure docs that match my actual setup. It understands my conventions without me having to spell them out every time.
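For completeness: `load_dataset` and `compute_loss` in the training loop are my own helpers, not part of mlx_lm. A minimal sketch of the loader, with batching and tokenization omitted:

```python
import json

def load_dataset(path: str) -> list[dict]:
    """Read a JSONL file into a list of example dicts, skipping blank lines."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]
```

In practice you'd tokenize each example and yield fixed-size batches, but the loop above only assumes an iterable of batches.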
💡 Real-World Use Cases
1. Infrastructure documentation: I feed it a Docker Compose file, and it generates comprehensive setup guides with security best practices.
2. Code review: I run it against integration code, and it catches edge cases I missed (especially around OAuth flows and rate limiting).
📊 Comparing Ollama, LM Studio, and MLX
Ollama: Best for quick inference. Dead simple CLI. Great for testing prompts.
LM Studio: Beautiful GUI for model management. Excellent for comparing models side-by-side. Limited fine-tuning support.
MLX: The power tool. Full control over training, quantization, and deployment. Requires Python chops but worth it for custom work.
My workflow: Prototype in Ollama → Benchmark in LM Studio → Fine-tune with MLX → Deploy back to Ollama for production use.
🧠 Lessons Learned
- 512GB isn't overkill. I regularly use 400GB+ when fine-tuning. You want headroom for the OS and other processes.
- Dataset quality > dataset size. My 5,000 hand-curated examples outperformed 50,000 scraped samples.
- LoRA is magic. Low-rank adaptation lets you fine-tune massive models with minimal resources. Game-changer.
- MLX is fast. I tried PyTorch with the MPS backend first; MLX was roughly 3x faster for the same workload.
- Local inference is addictive. No rate limits, no API keys, no internet required. Once you experience it, cloud APIs feel slow.
- Monitor your temps. The Mac Studio ran warm during training (not hot, but noticeably warmer). Good ventilation matters.
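The LoRA point is easy to make concrete with arithmetic. A rank-r adapter adds two small matrices (d×r and r×d) per targeted projection, so the trainable parameter count stays tiny next to the frozen base model. The hidden size and layer count below are illustrative round numbers, not gpt-oss-20b's actual dimensions:

```python
def lora_trainable_params(hidden: int, layers: int, rank: int, targets: int) -> int:
    """Parameters LoRA adds: A (hidden x rank) + B (rank x hidden) per matrix."""
    return layers * targets * 2 * hidden * rank

# Rank 8 on q_proj and v_proj across 60 layers of hidden size 6144
added = lora_trainable_params(hidden=6144, layers=60, rank=8, targets=2)
print(f"{added:,} trainable params vs 20B frozen")  # 11,796,480 -- under 0.1%
```

That's why a 20B model fine-tunes comfortably: the optimizer state and gradients only exist for those few million adapter weights.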
If you're serious about local AI, Apple Silicon is the most cost-effective path. A comparable GPU server would cost 5-10x more, require enterprise power, and sound like a jet engine. My Mac Studio is silent, sips power, and doubles as my daily driver.
Next up: I'm exploring quantization strategies to fit even larger models into memory without sacrificing quality. Stay tuned.