
Illustration: a data center with multiple interconnected server racks, each labeled as a different node.
This guide provides end-to-end setup instructions for your cluster to:
✅ Set up Node9 (file server) as the training coordinator
✅ Use Node1, Node2, and Node3 as training nodes
✅ Configure DeepSpeed for multi-node CPU/GPU fine-tuning
✅ Use log_interceptor.py to log training progress
✅ Convert and load the fine-tuned model into Ollama
1️⃣ Cluster Overview
| Node | Role | Specs |
|---|---|---|
| Node9 | Training Coordinator & File Server | Dell PowerEdge 515 (handles datasets, training jobs, and logging) |
| Node1 | Trainer (GPU) | Dell PowerEdge 715 (NVIDIA 2060, 6 GB VRAM, 26-core CPU) |
| Node2 | Trainer (GPU, coming soon) | Dell PowerEdge 715 (NVIDIA 2060 coming soon, 26-core CPU) |
| Node3 | Trainer (CPU-only) | Dell PowerEdge 715 (CPU-based training, 26-core CPU) |
Workflow:
- Node9 stores datasets, coordinates training, and manages logs.
- Node1-3 handle actual model training.
- Once training is done, Node9 prepares and loads the fine-tuned model into Ollama.
📌 Part 1: Installing Required Dependencies
1️⃣ Install on ALL Nodes (Node1-3 + Node9)
Since you run everything inside virtual environments (venv), first activate your venv:
source /path/to/your/venv/bin/activate
Then install dependencies:
pip install deepseek transformers torch accelerate bitsandbytes deepspeed
If pip resolves to the wrong interpreter (for example, the system Python instead of your venv), call the venv's pip explicitly:
/path/to/your/venv/bin/pip install deepseek transformers torch accelerate bitsandbytes deepspeed
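To confirm the core packages actually landed in the venv on each node, a quick import check is enough (run it inside the activated venv; exact versions don't matter here):
python - <<'EOF'
# Sanity check: imports fail loudly if a package is missing from this venv
import torch, transformers, accelerate, deepspeed
print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)
print("accelerate:", accelerate.__version__)
print("deepspeed:", deepspeed.__version__)
EOF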
Create the required directories on all nodes:
mkdir -p /ollama/setup
mkdir -p /ollama/scripts
mkdir -p /ollama/web
✅ Now all nodes are ready to participate in training.
📌 Part 2: Configuring the Training Coordinator (Node9)
1️⃣ Set Up Multi-Node Configuration (Node9)
Run this only on Node9:
cat <<EOF > /ollama/setup/hostfile
node1 slots=26
node2 slots=26
node3 slots=26
EOF
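DeepSpeed's multi-node launcher reaches each trainer over SSH, so Node9 needs passwordless SSH to node1-3 and those hostnames must resolve (via DNS or /etc/hosts). A quick connectivity check from Node9:
# Each line should print the remote hostname and core count without a password prompt
for h in node1 node2 node3; do
  ssh -o BatchMode=yes "$h" 'hostname && nproc' || echo "FAILED: $h (run ssh-copy-id $h first)"
done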
✅ Node9 now controls training, and Node1-3 perform training tasks.
2️⃣ Configure DeepSpeed for Multi-Node Training (Node9)
Run this only on Node9:
cat <<EOF > /ollama/setup/ds_config.json
{
  "train_batch_size": 1,
  "gradient_checkpointing": true,
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu"
    },
    "offload_param": {
      "device": "cpu"
    }
  },
  "fp16": {
    "enabled": true
  }
}
EOF
✅ Ensures efficient CPU/GPU offloading across all nodes.
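Before launching anything, it's worth confirming the file is valid JSON and that DeepSpeed's environment report looks sane (ds_report ships with the deepspeed package):
python -m json.tool /ollama/setup/ds_config.json   # prints the config back if it parses cleanly
ds_report                                          # summarizes the DeepSpeed install and detected accelerators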
📌 Part 3: Creating the Fine-Tuning Script (Node9)
1️⃣ Save Fine-Tuning Script
Run this only on Node9:
cat <<'EOF' > /ollama/scripts/train.py
from deepseek import DeepSeekModel
import os
import traceback
from log_interceptor import log_event

hostname = os.uname()[1]  # Hostname of the node running this process

try:
    log_event("train.py", 200, "training", "Fine-tuning started", {"host": hostname}, ai_category="system")
    model = DeepSeekModel("deepseek-r1:8b", load_in_4bit=True)
    model.fine_tune(
        dataset_path="/mnt/shortterm/pretune/file_finetune_data.jsonl",
        output_dir=f"/mnt/shortterm/posttune/fine_tuned_deepseek_{hostname}",
        epochs=3,
        batch_size=1,
        learning_rate=2e-5,
        use_qlora=True,
        save_steps=500,
        gradient_checkpointing=True,
        deepspeed_config="/ollama/setup/ds_config.json"
    )
    log_event("train.py", 200, "training", "Fine-tuning completed successfully", {"host": hostname}, ai_category="system")
except Exception as e:
    log_event("train.py", 500, "training", "Error during fine-tuning", {"error": str(e), "traceback": traceback.format_exc()}, ai_category="error")
EOF
✅ Fine-tuning will now log its start, completion, and any errors under /mnt/shortterm/.
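train.py imports log_event from log_interceptor.py, which this guide assumes is already available at the same path on every trainer. If you still need that file, here is a minimal sketch that matches the calls above and the log-file naming used in Part 6; your real log_interceptor.py may log more fields or behave differently:
cat <<'EOF' > /ollama/scripts/log_interceptor.py
# Minimal stand-in for log_interceptor.py (assumption: the real version may differ).
# Appends one JSON object per event to /mnt/shortterm/<hostname>_training_logs_<YYYYMMDD>.jsonl.
import json
import os
import time
from datetime import datetime

LOG_DIR = "/mnt/shortterm"

def log_event(source, status_code, category, message, details=None, ai_category=None):
    hostname = os.uname()[1]
    entry = {
        "timestamp": time.time(),
        "source": source,
        "status_code": status_code,
        "category": category,
        "message": message,
        "details": details or {},
        "ai_category": ai_category,
    }
    path = os.path.join(LOG_DIR, f"{hostname}_training_logs_{datetime.now().strftime('%Y%m%d')}.jsonl")
    with open(path, "a") as f:
        f.write(json.dumps(entry) + "\n")
EOF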
📌 Part 4: Running Training Across All Nodes
1️⃣ Start Multi-Node Training from Node9
Run this on Node9:
deepspeed --num_nodes 3 --num_gpus 0 --hostfile /ollama/setup/hostfile /ollama/scripts/train.py
✅ Node9 controls training, but Node1-3 perform the training.
Once Node2 gets a GPU, run:
deepspeed --num_nodes 3 --num_gpus 1 --hostfile /ollama/setup/hostfile /ollama/scripts/train.py
✅ Now, Node1 & Node2 use GPUs, and Node3 remains CPU-only.
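While a run is in flight, you can watch each trainer's latest log entries from Node9 (this assumes the log path from Part 6 and the working SSH access set up in Part 2):
for h in node1 node2 node3; do
  echo "=== $h ==="
  ssh "$h" 'tail -n 3 /mnt/shortterm/$(hostname)_training_logs_$(date +%Y%m%d).jsonl'
done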
📌 Part 5: Converting & Loading Fine-Tuned Model into Ollama
1️⃣ Convert the Model
Run this on Node9 after training completes:
deepseek-convert --input /mnt/shortterm/posttune/fine_tuned_deepseek_$(hostname) --output /ollama/setup/deepseek-finetuned.gguf
2️⃣ Load the Model into Ollama
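Ollama imports a GGUF through a Modelfile rather than taking the .gguf path directly, so first write a one-line Modelfile that points at the converted file:
cat <<EOF > /ollama/setup/Modelfile
FROM /ollama/setup/deepseek-finetuned.gguf
EOF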
ollama create deepseek-finetuned -f /ollama/setup/Modelfile
3️⃣ Verify the Model is Available
ollama list
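A quick interactive spot-check (any prompt will do):
ollama run deepseek-finetuned "Give a one-paragraph summary of what you were fine-tuned on."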
✅ Now, the fine-tuned model is ready for use in Ollama.
📌 Part 6: Viewing Logs
1️⃣ Check Training Logs
tail -f /mnt/shortterm/$(hostname)_training_logs_$(date +%Y%m%d).jsonl
2️⃣ Search for Errors
grep '"status_code": 500' /mnt/shortterm/$(hostname)_training_logs_$(date +%Y%m%d).jsonl
✅ All training logs are saved in /mnt/shortterm/.
📌 Summary of the Entire Setup
✅ Step 1: Install dependencies inside venv on all nodes
✅ Step 2: Set up directories on all nodes
✅ Step 3: Configure Node9 as training coordinator
✅ Step 4: Deploy fine-tuning automatically from Node9
✅ Step 5: Convert & load the fine-tuned model into Ollama
✅ Step 6: View training logs in /mnt/shortterm/