Fine-Tuning a 20B Model on 512GB Apple Silicon: What I Learned

I fine-tuned a 20-billion parameter model on my Mac Studio M3 Ultra and took validation loss from 1.85 to 0.79. Here's what I learned about running serious AI workloads on Apple Silicon.

[Image: fine-tuning AI models on Apple Silicon]

I spent the last few weeks pushing my Mac Studio M3 Ultra to its limits. Spoiler: it can fine-tune a 20-billion parameter model without breaking a sweat.


🧠 Why Run LLMs Locally?

The cloud is great until you're burning $500/month on API calls for experimentation. I wanted full control, no rate limits, no data leaving my network, and the ability to fine-tune models on proprietary data without worrying about ToS violations.

Apple Silicon changed the game. The 512GB of unified memory on my M3 Ultra means I can load models that would cost thousands to run in the cloud. Plus, MLX (Apple's ML framework) is optimized for Metal, making inference and training blazing fast.

⚙️ The Setup

Hardware: Mac Studio M3 Ultra, 512GB RAM, 2TB SSD

Software stack:

  • Ollama for quick inference testing
  • LM Studio for model management and benchmarking
  • MLX for fine-tuning (the real MVP)

I started with the gpt-oss 20B model. It's a solid open-source base model, but I wanted to specialize it for technical writing and infrastructure documentation.
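
Before committing hours to training, it's worth a quick smoke test that the model loads and generates. A minimal sketch with mlx_lm (the prompt is just an illustration):

from mlx_lm import load, generate

# Sanity check: load the base model and generate a short reply
model, tokenizer = load("gpt-oss-20b")
reply = generate(model, tokenizer,
                 prompt="Explain what a Dockerfile does in one paragraph.",
                 max_tokens=200)
print(reply)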

🚀 Fine-Tuning Process

Here's the core training loop using MLX:

import json

import mlx.core as mx
import mlx.nn as nn
import mlx.optimizers as optim
from mlx_lm import load

# Load base model (a local MLX-converted copy of gpt-oss 20B)
model, tokenizer = load("gpt-oss-20b")

# Prepare dataset (5,000 examples of technical docs, stored as
# one JSON object per line, e.g. {"text": "..."})
def load_dataset(path):
    with open(path) as f:
        return [json.loads(line) for line in f]

train_data = load_dataset("./training_data.jsonl")

# LoRA fine-tuning config; in the full script this is applied to the
# model via mlx_lm's LoRA utilities before training, so the base
# weights stay frozen and only the low-rank adapters train
lora_config = {
    "rank": 8,
    "alpha": 16,
    "dropout": 0.05,
    "target_modules": ["q_proj", "v_proj"],
}

# Training loop: MLX computes gradients via nn.value_and_grad and the
# optimizer applies them; compute_loss is my own helper (sketched below)
optimizer = optim.AdamW(learning_rate=3e-4)
loss_and_grad_fn = nn.value_and_grad(model, compute_loss)

for epoch in range(3):
    for batch in train_data:
        loss, grads = loss_and_grad_fn(model, batch)
        optimizer.update(model, grads)
        mx.eval(model.parameters(), optimizer.state)  # force lazy graph eval
    print(f"Epoch {epoch}, loss: {loss.item():.4f}")

# Save fine-tuned weights (MLX picks the format from the extension)
model.save_weights("./gpt-oss-20b-tuned.safetensors")
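
One note on compute_loss: it's my own helper, not an MLX built-in. A minimal single-example sketch, assuming each JSONL record has a "text" field and reusing the model, tokenizer, and imports from the block above:

def compute_loss(model, batch):
    # Tokenize one example and train on next-token prediction
    tokens = mx.array(tokenizer.encode(batch["text"]))[None]  # add batch dim
    inputs, targets = tokens[:, :-1], tokens[:, 1:]
    logits = model(inputs)
    return nn.losses.cross_entropy(logits, targets).mean()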

Results:

  • Initial validation loss: 1.85
  • After fine-tuning: 0.79 (57% reduction!)
  • Training time: ~14 hours for 3 epochs
  • Peak memory usage: 487GB (still had headroom!); one way to track this yourself is sketched after the list
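
MLX exposes its allocator's peak usage, which is handy for checks like the one above. A quick sketch (the function has moved between MLX versions, so your install may expose it as mx.get_peak_memory instead):

import mlx.core as mx

# Peak Metal allocation in GB; newer MLX releases expose this as
# mx.get_peak_memory() at the top level
print(f"Peak memory: {mx.metal.get_peak_memory() / 1e9:.1f} GB")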

The fine-tuned model now generates Docker configs, Traefik rules, and infrastructure docs that match my actual setup. It understands my conventions without me having to spell them out every time.

💡 Real-World Use Cases

1. Infrastructure documentation: I feed it a Docker Compose file, and it generates comprehensive setup guides with security best practices. (A prompt sketch follows this list.)

2. Code review: I run it against integration code and it catches edge cases I missed (especially around OAuth flows and rate limiting).
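
For the documentation use case, the prompt construction is nothing fancy. A hedged sketch; the file path and instruction wording are my own illustration, and it assumes the tuned weights were packaged into a loadable MLX model directory:

from mlx_lm import load, generate

# Hypothetical path: assumes the tuned weights live in a loadable model dir
model, tokenizer = load("./gpt-oss-20b-tuned")

with open("docker-compose.yml") as f:
    compose = f.read()

prompt = ("Write a setup guide with security best practices "
          "for this Docker Compose stack:\n\n" + compose)
print(generate(model, tokenizer, prompt=prompt, max_tokens=800))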

📊 Comparing Ollama, LM Studio, and MLX

Ollama: Best for quick inference. Dead simple CLI. Great for testing prompts.

LM Studio: Beautiful GUI for model management. Excellent for comparing models side-by-side. Limited fine-tuning support.

MLX: The power tool. Full control over training, quantization, and deployment. Requires Python chops but worth it for custom work.

My workflow: Prototype in Ollama → Benchmark in LM Studio → Fine-tune with MLX → Deploy back to Ollama for production use.

🧠 Lessons Learned

  • 512GB isn't overkill. I regularly use 400GB+ when fine-tuning. You want headroom for the OS and other processes.
  • Dataset quality > dataset size. My 5,000 hand-curated examples outperformed 50,000 scraped samples.
  • LoRA is magic. Low-rank adaptation lets you fine-tune massive models with minimal resources; the quick math after this list shows why. Game-changer.
  • MLX is fast. I tried the PyTorch MPS backend first; MLX was 3x faster for the same workload.
  • Local inference is addictive. No rate limits, no API keys, no internet required. Once you experience it, cloud APIs feel slow.
  • Monitor your temps. The Mac Studio ran warm during training (not hot, but noticeably warmer). Good ventilation matters.
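
The LoRA "magic" is really just arithmetic: a rank-8 adapter swaps a dense d x d weight update for two thin matrices. Using a hypothetical hidden size (gpt-oss 20B's actual dimensions may differ):

# Hypothetical hidden size; the real gpt-oss 20B dims may differ
d, r = 6144, 8
full_update = d * d        # updating the full projection: ~37.7M params
lora_update = r * (d + d)  # A is (r x d), B is (d x r): ~98K params
print(f"{full_update:,} vs {lora_update:,} "
      f"({full_update / lora_update:.0f}x fewer trainable params)")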

If you're serious about local AI, Apple Silicon is the most cost-effective path. A comparable GPU server would cost 5-10x more, require enterprise power, and sound like a jet engine. My Mac Studio is silent, sips power, and doubles as my daily driver.

Next up: I'm exploring quantization strategies to fit even larger models into memory without sacrificing quality. Stay tuned.