Cluster Instructions for Setting Up & Training DeepSeek-R1:8B

This guide provides end-to-end setup instructions for your cluster to:

  • Set up Node9 (file server) as the training coordinator
  • Use Node1, Node2, and Node3 as training nodes
  • Configure DeepSpeed for multi-node CPU/GPU fine-tuning
  • Use log_interceptor.py to log training progress
  • Convert and load the fine-tuned model into Ollama


1️⃣ Cluster Overview

| Node  | Role                                | Specs                                                             |
|-------|-------------------------------------|-------------------------------------------------------------------|
| Node9 | Training Coordinator & File Server  | Dell PowerEdge 515 (handles datasets, training jobs, and logging)  |
| Node1 | Trainer (GPU)                       | Dell PowerEdge 715 (NVIDIA 2060, 6GB VRAM, 26-core CPU)            |
| Node2 | Trainer (GPU, coming soon)          | Dell PowerEdge 715 (NVIDIA 2060 coming soon, 26-core CPU)          |
| Node3 | Trainer (CPU-only)                  | Dell PowerEdge 715 (CPU-based training, 26-core CPU)               |

Workflow:

  • Node9 stores datasets, coordinates training, and manages logs.
  • Node1-3 handle actual model training.
  • Once training is done, Node9 prepares and loads the fine-tuned model into Ollama.

📌 Part 1: Installing Required Dependencies

1️⃣ Install on ALL Nodes (Node1-3 + Node9)

Since you run everything inside virtual environments (venv), first activate your venv:

source /path/to/your/venv/bin/activate

Then install dependencies:

pip install deepseek transformers torch accelerate bitsandbytes deepspeed

If pip resolves to the wrong interpreter, call the venv's pip explicitly:

/path/to/your/venv/bin/pip install deepseek transformers torch accelerate bitsandbytes deepspeed

Create the required working directories on every node:

mkdir -p /ollama/setup
mkdir -p /ollama/scripts
mkdir -p /ollama/web

Now all nodes are ready to participate in training.
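
To confirm each node's venv actually picked up the core packages, a quick check along these lines can be run with the venv's interpreter (check_env.py is a hypothetical helper, not part of the original setup):

# check_env.py - sanity-check that the training dependencies import cleanly.
# Run on each node with: /path/to/your/venv/bin/python check_env.py
import importlib

for pkg in ("torch", "transformers", "accelerate", "bitsandbytes", "deepspeed"):
    try:
        mod = importlib.import_module(pkg)
        print(f"{pkg}: {getattr(mod, '__version__', 'version unknown')}")
    except ImportError as exc:
        print(f"{pkg}: MISSING ({exc})")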


📌 Part 2: Configuring the Training Coordinator (Node9)

1️⃣ Set Up Multi-Node Configuration (Node9)

Run this only on Node9:

cat <<EOF > /ollama/setup/hostfile
node1 slots=26
node2 slots=26
node3 slots=26
EOF

The hostfile tells DeepSpeed which nodes perform training and how many worker slots each contributes (one per CPU core here). Node9 itself only coordinates; Node1-3 perform the training tasks.
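
DeepSpeed's multi-node launcher typically reaches each trainer over passwordless SSH, so it is worth confirming Node9 can log into every host in the hostfile before launching a job. A minimal check, assuming key-based SSH is already configured (check_hosts.py is a hypothetical helper):

# check_hosts.py - verify Node9 can reach each trainer listed in the hostfile over SSH.
# BatchMode=yes makes ssh fail fast instead of prompting for a password.
import subprocess

with open("/ollama/setup/hostfile") as fh:
    hosts = [line.split()[0] for line in fh if line.strip()]

for host in hosts:
    result = subprocess.run(
        ["ssh", "-o", "BatchMode=yes", host, "hostname"],
        capture_output=True, text=True, timeout=15,
    )
    status = "ok" if result.returncode == 0 else "FAILED: " + result.stderr.strip()
    print(f"{host}: {status}")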


2️⃣ Configure DeepSpeed for Multi-Node Training (Node9)

Run this only on Node9:

cat <<EOF > /ollama/setup/ds_config.json
{
    "train_batch_size": 1,
    "gradient_checkpointing": true,
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu"
        },
        "offload_param": {
            "device": "cpu"
        }
    },
    "fp16": {
        "enabled": true
    }
}
EOF

This config enables ZeRO stage 3 with optimizer states and parameters offloaded to CPU, keeping memory-heavy state off the GPUs so every node can participate.
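
Before committing to a long run, it can help to confirm the JSON parses and contains the settings you expect. A small sanity check (validate_ds_config.py is a hypothetical helper, not part of the original workflow):

# validate_ds_config.py - confirm the DeepSpeed config is valid JSON and print
# the offloading-related settings.
import json

with open("/ollama/setup/ds_config.json") as fh:
    cfg = json.load(fh)  # raises an error if the file is malformed

zero = cfg.get("zero_optimization", {})
print("ZeRO stage:", zero.get("stage"))
print("Optimizer offload:", zero.get("offload_optimizer", {}).get("device"))
print("Parameter offload:", zero.get("offload_param", {}).get("device"))
print("fp16 enabled:", cfg.get("fp16", {}).get("enabled"))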


📌 Part 3: Creating the Fine-Tuning Script (Node9)

1️⃣ Save Fine-Tuning Script

Run this only on Node9:

cat <<EOF > /ollama/scripts/train.py
from deepseek import DeepSeekModel
import os
import traceback
from log_interceptor import log_event

hostname = os.uname()[1]  # Hostname of the node running this script

try:
    log_event("train.py", 200, "training", "Fine-tuning started", {"host": hostname}, ai_category="system")

    model = DeepSeekModel("deepseek-r1:8b", load_in_4bit=True)

    model.fine_tune(
        dataset_path="/mnt/shortterm/pretune/file_finetune_data.jsonl",
        output_dir=f"/mnt/shortterm/posttune/fine_tuned_deepseek_{hostname}",
        epochs=3,
        batch_size=1,
        learning_rate=2e-5,
        use_qlora=True,
        save_steps=500,
        gradient_checkpointing=True,
        deepspeed_config="/ollama/setup/ds_config.json"
    )

    log_event("train.py", 200, "training", "Fine-tuning completed successfully", {"host": hostname}, ai_category="system")

except Exception as e:
    log_event("train.py", 500, "training", "Error during fine-tuning", {"error": str(e), "traceback": traceback.format_exc()}, ai_category="error")
EOF

Fine-tuning will now log start, completion, and errors in /mnt/shortterm/.
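
train.py imports log_event from log_interceptor.py, which is not reproduced in this guide. As a reference point, here is a minimal sketch of what that function could look like, assuming it appends one JSON object per event to the per-host files read in Part 6 (the real module may differ):

# log_interceptor.py - minimal sketch; the real module may differ.
# Appends one JSON line per event to
# /mnt/shortterm/<hostname>_training_logs_<YYYYMMDD>.jsonl, matching Part 6.
import json
import os
import time

def log_event(script, status_code, category, message, details=None, ai_category=None):
    entry = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "script": script,
        "status_code": status_code,
        "category": category,
        "message": message,
        "details": details or {},
        "ai_category": ai_category,
    }
    hostname = os.uname()[1]
    log_path = f"/mnt/shortterm/{hostname}_training_logs_{time.strftime('%Y%m%d')}.jsonl"
    with open(log_path, "a") as fh:
        fh.write(json.dumps(entry) + "\n")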


📌 Part 4: Running Training Across All Nodes

1️⃣ Start Multi-Node Training from Node9

Run this on Node9:

deepspeed --num_nodes 3 --num_gpus 0 --hostfile /ollama/setup/hostfile /ollama/scripts/train.py

Node9 launches and coordinates the job; Node1-3 perform the actual training.

Once Node2 gets a GPU, run:

deepspeed --num_nodes 3 --num_gpus 1 --hostfile /ollama/setup/hostfile /ollama/scripts/train.py

Now, Node1 & Node2 use GPUs, and Node3 remains CPU-only.
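
If you want Node9 to record when a run is launched and whether the launcher exited cleanly, a small wrapper can sit alongside train.py and reuse the same log_event call. This is a hypothetical convenience script, not part of the original workflow:

# launch_training.py - hypothetical wrapper: start the DeepSpeed launcher from Node9
# and record the outcome through log_event.
import subprocess
from log_interceptor import log_event

cmd = [
    "deepspeed", "--num_nodes", "3", "--num_gpus", "0",
    "--hostfile", "/ollama/setup/hostfile",
    "/ollama/scripts/train.py",
]

log_event("launch_training.py", 200, "training", "Launching multi-node job",
          {"cmd": " ".join(cmd)}, ai_category="system")
result = subprocess.run(cmd)
code = 200 if result.returncode == 0 else 500
log_event("launch_training.py", code, "training", "DeepSpeed launcher exited",
          {"returncode": result.returncode}, ai_category="system")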


📌 Part 5: Converting & Loading Fine-Tuned Model into Ollama

1️⃣ Convert the Model

Run this on Node9 after training completes:

deepseek-convert --input /mnt/shortterm/posttune/fine_tuned_deepseek_$(hostname) --output /ollama/setup/deepseek-finetuned.gguf

Note: train.py names its output directory after the hostname of the node that ran it, so if the checkpoint was written by a trainer node (for example node1), substitute that node's hostname for $(hostname) when converting on Node9.

2️⃣ Load the Model into Ollama

ollama create expects a Modelfile, so point a minimal one at the converted GGUF and register it:

echo "FROM /ollama/setup/deepseek-finetuned.gguf" > /ollama/setup/Modelfile
ollama create deepseek-finetuned -f /ollama/setup/Modelfile

3️⃣ Verify the Model is Available

ollama list

Now, the fine-tuned model is ready for use in Ollama.
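
Beyond ollama list, a quick smoke test confirms the model actually answers. The sketch below uses Ollama's local HTTP API on its default port 11434 and assumes the Ollama service is running on the same machine; the model name matches the one created above:

# smoke_test_ollama.py - ask the local Ollama API for a short completion from the
# newly created model.
import json
import urllib.request

payload = json.dumps({
    "model": "deepseek-finetuned",
    "prompt": "Reply with one short sentence confirming you are loaded.",
    "stream": False,
}).encode()

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req, timeout=120) as resp:
    print(json.loads(resp.read())["response"])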


📌 Part 6: Viewing Logs

1️⃣ Check Training Logs

tail -f /mnt/shortterm/$(hostname)_training_logs_$(date +%Y%m%d).jsonl

2️⃣ Search for Errors

grep '"status_code": 500' /mnt/shortterm/$(hostname)_training_logs_$(date +%Y%m%d).jsonl

All training logs are saved in /mnt/shortterm/.
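
Because the logs are JSONL, a few lines of Python can also summarize events across all of a node's log files (summarize_logs.py is a hypothetical helper, not part of the original setup):

# summarize_logs.py - count events per status code and print the error entries
# from every training log on this node.
import glob
import json
from collections import Counter

counts = Counter()
errors = []
for path in glob.glob("/mnt/shortterm/*_training_logs_*.jsonl"):
    with open(path) as fh:
        for line in fh:
            try:
                entry = json.loads(line)
            except json.JSONDecodeError:
                continue
            counts[entry.get("status_code")] += 1
            if entry.get("status_code") == 500:
                errors.append((path, entry.get("message"), entry.get("details", {}).get("error")))

print("Events by status code:", dict(counts))
for path, message, error in errors:
    print(f"{path}: {message} ({error})")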


📌 Summary of the Entire Setup

Step 1: Install dependencies inside venv on all nodes
Step 2: Set up directories on all nodes
Step 3: Configure Node9 as training coordinator
Step 4: Launch multi-node fine-tuning from Node9
Step 5: Convert & load the fine-tuned model into Ollama
Step 6: View training logs in /mnt/shortterm/