Troubleshooting and Optimizing the AI Cluster

[Figure: AI server cluster setup with multiple interconnected nodes, each labeled with a different role]

Part 1: Initial Setup and Observations

As we progressed with configuring the AI cluster, we encountered several challenges related to service availability, networking, storage, and logging. This article outlines the troubleshooting steps we took and how each issue was resolved.

Part 2: Ensuring AI Services Start Properly

We began by verifying that all AI services were launching correctly. The start_cluster.sh script was structured as follows:

#!/bin/bash
# Restart all AI services from the project root.
cd /srv/ollama/

echo "♻ Restarting AI Services..."
source venv/bin/activate

# Stop any uvicorn processes left over from a previous run, then give
# them a moment to exit before the ports are bound again.
pkill -f "uvicorn"
sleep 2

nohup uvicorn dispatcher:app --host 0.0.0.0 --port 8000 > /dev/null 2>&1 &
nohup uvicorn model_handler:app --host 0.0.0.0 --port 8001 > /dev/null 2>&1 &
nohup uvicorn ai_council:app --host 0.0.0.0 --port 8002 > /dev/null 2>&1 &
nohup uvicorn memory_manager:app --host 0.0.0.0 --port 8003 > /dev/null 2>&1 &

To ensure logs were being recorded, we modified the script to create a logs directory if one was missing and to redirect each service's output there:

if [ ! -d "/srv/ollama/logs" ]; then
    echo "📂 Creating logs directory..."
    mkdir -p /srv/ollama/logs
fi
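
With the directory in place, the launch lines can write to per-service log files instead of discarding output to /dev/null. A minimal sketch, assuming one log file per service under /srv/ollama/logs (the file names here are our own choice):

# Sketch: redirect each service to its own log file instead of /dev/null.
nohup uvicorn dispatcher:app --host 0.0.0.0 --port 8000 > /srv/ollama/logs/dispatcher.log 2>&1 &
nohup uvicorn model_handler:app --host 0.0.0.0 --port 8001 > /srv/ollama/logs/model_handler.log 2>&1 &
nohup uvicorn ai_council:app --host 0.0.0.0 --port 8002 > /srv/ollama/logs/ai_council.log 2>&1 &
nohup uvicorn memory_manager:app --host 0.0.0.0 --port 8003 > /srv/ollama/logs/memory_manager.log 2>&1 &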

This ensured that all services could properly log outputs.

Part 3: Resolving Network Port and Service Connectivity Issues

A major issue was a Connection refused error when trying to communicate with Ollama on port 11434. We used the following command to check whether anything was listening on that port:

ss -tulnp | grep 11434

If no output appeared, Ollama was either not running or not bound to the expected interface. Restarting the service resolved the issue:

sudo systemctl restart ollama
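
If the service is running but only listening on 127.0.0.1, other nodes will still be refused. One way to handle that is a systemd drop-in that sets OLLAMA_HOST so Ollama binds to all interfaces; the sketch below assumes a systemd-managed Ollama install, so adjust paths and the bind address to your environment:

sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf <<'EOF'
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
EOF
sudo systemctl daemon-reload
sudo systemctl restart ollama
ss -tulnp | grep 11434   # should now show ollama bound to 0.0.0.0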

Part 4: Ensuring Storage Availability for Cluster Nodes

We checked available storage using:

sudo fdisk -l

This revealed that /dev/sdb3 was a large unmounted partition, which we formatted and set up as a shared storage location (note that mkfs.ext4 destroys any existing data on the partition):

sudo mkfs.ext4 /dev/sdb3
sudo mkdir -p /mnt/shared_storage
sudo mount /dev/sdb3 /mnt/shared_storage
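
A quick check confirms the new filesystem is mounted and shows its available capacity:

df -h /mnt/shared_storage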

To ensure it mounts on boot, we added it to /etc/fstab:

/dev/sdb3  /mnt/shared_storage  ext4  defaults  0  0
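
Device names such as /dev/sdb3 can change if disks are added or reordered, so referencing the partition by UUID is more robust. A sketch of that approach (the UUID below is a placeholder; use the value blkid reports):

sudo blkid /dev/sdb3

# /etc/fstab entry using the UUID instead of the device name:
UUID=<uuid-from-blkid>  /mnt/shared_storage  ext4  defaults  0  2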

Part 5: Improving Log Visibility in the Web UI

We needed to dynamically include logs in status.html. The function get_recent_logs() was updated to retrieve logs from multiple files:

import html
import os
import subprocess

# LOG_DIR and LOG_FILES are module-level configuration defined elsewhere
# (the logs directory path and the list of log file names to display).

def get_recent_logs():
    """Retrieve the last 100 lines of each log file as an HTML fragment."""
    logs_html = "<h3>Recent Logs</h3>"

    for log_file in LOG_FILES:
        log_path = os.path.join(LOG_DIR, log_file)
        if os.path.exists(log_path):
            log_output = subprocess.getoutput(f"tail -n 100 {log_path}")
            # Escape the log text so stray < or & characters don't break the page.
            logs_html += f"<div class='large-div'><h4>{log_file}</h4><pre>{html.escape(log_output)}</pre></div>"
        else:
            logs_html += f"<div class='large-div'><h4>{log_file}</h4><pre>No log file found.</pre></div>"

    return logs_html

This ensured all logs were properly displayed in status.html.

Part 6: Finalizing the Cluster Monitoring System

With all services running, storage configured, and logs displayed dynamically, we verified that the monitoring system was working end to end. The following command provided a quick check:

python3 /srv/ollama/scripts/cluster_health_check.py

This confirmed that all services were running correctly and reporting their statuses in real time.
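
The health-check script itself isn't reproduced here; as a rough sketch of the kind of check it performs, the loop below simply confirms that each expected service answers HTTP on its port (the port list is an assumption based on the services started above, and this is not the contents of cluster_health_check.py):

#!/bin/bash
# Rough reachability sketch: ports assumed from the uvicorn services plus Ollama.
for port in 8000 8001 8002 8003 11434; do
    if curl -s -o /dev/null "http://localhost:${port}/"; then
        echo "✅ Port ${port}: responding"
    else
        echo "❌ Port ${port}: no response"
    fi
done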