Troubleshooting and Optimizing the AI Cluster

[Figure: AI server cluster setup with multiple interconnected nodes, each labeled with a different role]

Part 1: Initial Setup and Observations

As we progressed with configuring the AI cluster, we encountered several challenges related to service availability, networking, storage, and logging. This article outlines the troubleshooting steps we took and how each issue was resolved.

Part 2: Ensuring AI Services Start Properly

We began by verifying that all AI services were launching correctly. The start_cluster.sh script was structured as follows:

#!/bin/bash
# Restart all AI services from the project root.
cd /srv/ollama/

echo "♻ Restarting AI Services..."
source venv/bin/activate

# Stop any uvicorn processes left over from a previous run, then give
# them a moment to exit before the ports are bound again.
pkill -f "uvicorn"
sleep 2

nohup uvicorn dispatcher:app --host 0.0.0.0 --port 8000 > /dev/null 2>&1 &
nohup uvicorn model_handler:app --host 0.0.0.0 --port 8001 > /dev/null 2>&1 &
nohup uvicorn ai_council:app --host 0.0.0.0 --port 8002 > /dev/null 2>&1 &
nohup uvicorn memory_manager:app --host 0.0.0.0 --port 8003 > /dev/null 2>&1 &

To ensure logs were being recorded, we modified the script to create a logs directory if one was missing and to redirect each service's output there:

if [ ! -d "/srv/ollama/logs" ]; then
    echo "📂 Creating logs directory..."
    mkdir -p /srv/ollama/logs
fi
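
With the directory in place, the launch lines can write to per-service log files instead of discarding output to /dev/null. A minimal sketch, assuming one log file per service under /srv/ollama/logs (the file names here are our own choice):

# Sketch: redirect each service to its own log file instead of /dev/null.
nohup uvicorn dispatcher:app --host 0.0.0.0 --port 8000 > /srv/ollama/logs/dispatcher.log 2>&1 &
nohup uvicorn model_handler:app --host 0.0.0.0 --port 8001 > /srv/ollama/logs/model_handler.log 2>&1 &
nohup uvicorn ai_council:app --host 0.0.0.0 --port 8002 > /srv/ollama/logs/ai_council.log 2>&1 &
nohup uvicorn memory_manager:app --host 0.0.0.0 --port 8003 > /srv/ollama/logs/memory_manager.log 2>&1 &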

This ensured that all services could properly log outputs.

Part 3: Resolving Network Port and Service Connectivity Issues

A major issue was a Connection refused error when trying to communicate with Ollama on port 11434. We used the following command to check whether anything was listening on that port:

ss -tulnp | grep 11434

If no output appeared, Ollama was either not running or not bound to the expected interface. Restarting the service resolved the issue:

sudo systemctl restart ollama
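
If the service is running but only listening on 127.0.0.1, other nodes will still be refused. One way to handle that is a systemd drop-in that sets OLLAMA_HOST so Ollama binds to all interfaces; the sketch below assumes a systemd-managed Ollama install, so adjust paths and the bind address to your environment:

sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf <<'EOF'
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
EOF
sudo systemctl daemon-reload
sudo systemctl restart ollama
ss -tulnp | grep 11434   # should now show ollama bound to 0.0.0.0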

Part 4: Ensuring Storage Availability for Cluster Nodes

We checked available storage using:

sudo fdisk -l

This revealed that /dev/sdb3 was a large unmounted partition, which we formatted and set up as a shared storage location (note that mkfs.ext4 destroys any existing data on the partition):

sudo mkfs.ext4 /dev/sdb3
sudo mkdir -p /mnt/shared_storage
sudo mount /dev/sdb3 /mnt/shared_storage
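
A quick check confirms the new filesystem is mounted and shows its available capacity:

df -h /mnt/shared_storage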

To ensure it mounts on boot, we added it to /etc/fstab:

/dev/sdb3  /mnt/shared_storage  ext4  defaults  0  0
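
Device names such as /dev/sdb3 can change if disks are added or reordered, so referencing the partition by UUID is more robust. A sketch of that approach (the UUID below is a placeholder; use the value blkid reports):

sudo blkid /dev/sdb3

# /etc/fstab entry using the UUID instead of the device name:
UUID=<uuid-from-blkid>  /mnt/shared_storage  ext4  defaults  0  2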

Part 5: Improving Log Visibility in the Web UI

We needed to dynamically include logs in status.html. The function get_recent_logs() was updated to retrieve logs from multiple files:

import html
import os
import subprocess

# LOG_DIR and LOG_FILES are module-level configuration defined elsewhere
# (the logs directory path and the list of log file names to display).

def get_recent_logs():
    """Retrieve the last 100 lines of each log file as an HTML fragment."""
    logs_html = "<h3>Recent Logs</h3>"

    for log_file in LOG_FILES:
        log_path = os.path.join(LOG_DIR, log_file)
        if os.path.exists(log_path):
            log_output = subprocess.getoutput(f"tail -n 100 {log_path}")
            # Escape the log text so stray < or & characters don't break the page.
            logs_html += f"<div class='large-div'><h4>{log_file}</h4><pre>{html.escape(log_output)}</pre></div>"
        else:
            logs_html += f"<div class='large-div'><h4>{log_file}</h4><pre>No log file found.</pre></div>"

    return logs_html

This ensured all logs were properly displayed in status.html.

Part 6: Finalizing the Cluster Monitoring System

With all services running, storage configured, and logs displayed dynamically, we verified that the monitoring system was working end to end. The following command provided a quick check:

python3 /srv/ollama/scripts/cluster_health_check.py

This confirmed that all services were running correctly and reporting their statuses in real time.
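
The health-check script itself isn't reproduced here; as a rough sketch of the kind of check it performs, the loop below simply confirms that each expected service answers HTTP on its port (the port list is an assumption based on the services started above, and this is not the contents of cluster_health_check.py):

#!/bin/bash
# Rough reachability sketch: ports assumed from the uvicorn services plus Ollama.
for port in 8000 8001 8002 8003 11434; do
    if curl -s -o /dev/null "http://localhost:${port}/"; then
        echo "✅ Port ${port}: responding"
    else
        echo "❌ Port ${port}: no response"
    fi
done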