
[Image: AI server cluster setup with multiple interconnected nodes, each labeled with a different role]
Part 1: Initial Setup and Observations
While configuring the AI cluster, we ran into several challenges related to service availability, networking, storage, and logging. This article outlines the troubleshooting steps we took and how each issue was resolved.
Part 2: Ensuring AI Services Start Properly
We began by verifying that all AI services were launching correctly. The start_cluster.sh script was structured as follows:
#!/bin/bash
# Restart the AI services from the project directory
cd /srv/ollama/
echo "♻ Restarting AI Services..."

# Activate the Python virtual environment used by the services
source venv/bin/activate

# Stop any uvicorn processes that are already running
pkill -f "uvicorn"

# Relaunch each service in the background, detached from the terminal
nohup uvicorn dispatcher:app --host 0.0.0.0 --port 8000 > /dev/null 2>&1 &
nohup uvicorn model_handler:app --host 0.0.0.0 --port 8001 > /dev/null 2>&1 &
nohup uvicorn ai_council:app --host 0.0.0.0 --port 8002 > /dev/null 2>&1 &
nohup uvicorn memory_manager:app --host 0.0.0.0 --port 8003 > /dev/null 2>&1 &
To ensure logs were being recorded, we modified the script to check for a logs directory and redirect service output into it. The directory check looked like this (a sketch of the redirection follows further below):
if [ ! -d "/srv/ollama/logs" ]; then
    echo "📂 Creating logs directory..."
    mkdir -p /srv/ollama/logs
fi
This ensured that all services could properly log outputs.
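The launch lines shown earlier still send output to /dev/null. Below is a minimal sketch of how they could be rewritten to write into the new directory instead; the per-service log file names here are our own illustrative choice, not something fixed by the original script:

# Sketch: redirect each service's stdout/stderr into /srv/ollama/logs
# (log file names are illustrative; use whatever naming convention you prefer)
LOG_DIR="/srv/ollama/logs"
mkdir -p "$LOG_DIR"

nohup uvicorn dispatcher:app --host 0.0.0.0 --port 8000 > "$LOG_DIR/dispatcher.log" 2>&1 &
nohup uvicorn model_handler:app --host 0.0.0.0 --port 8001 > "$LOG_DIR/model_handler.log" 2>&1 &
nohup uvicorn ai_council:app --host 0.0.0.0 --port 8002 > "$LOG_DIR/ai_council.log" 2>&1 &
nohup uvicorn memory_manager:app --host 0.0.0.0 --port 8003 > "$LOG_DIR/memory_manager.log" 2>&1 &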
Part 3: Resolving Network Port and Service Connectivity Issues
A major issue that arose was a Connection Refused error when trying to communicate with Ollama on port 11434. We used the following command to check whether anything was listening on that port:
ss -tulnp | grep 11434
If no output appeared, this indicated Ollama was not running or not bound to the correct interface. Restarting Ollama resolved this issue:
sudo systemctl restart ollama
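To confirm the service actually came back after the restart, a quick follow-up check like the one below is useful; it is a sketch only, and the comment about OLLAMA_HOST reflects Ollama's documented environment variable for choosing the bind address:

# Sketch: verify Ollama is up and listening after the restart
sudo systemctl status ollama --no-pager   # service state
ss -tulnp | grep 11434                    # confirm the port is bound
curl -s http://127.0.0.1:11434/           # should print "Ollama is running"

# If Ollama is only bound to 127.0.0.1 and other nodes need access, its bind
# address can be changed via the OLLAMA_HOST environment variable
# (e.g. OLLAMA_HOST=0.0.0.0 in a systemd override), followed by a restart.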
Part 4: Ensuring Storage Availability for Cluster Nodes
We checked available storage using:
sudo fdisk -l
This revealed that /dev/sdb3 was a large unmounted partition, which we formatted and set up as a shared storage location (note that mkfs.ext4 erases any existing data on the partition):
sudo mkfs.ext4 /dev/sdb3
sudo mkdir -p /mnt/shared_storage
sudo mount /dev/sdb3 /mnt/shared_storage
To ensure it mounts on boot, we added an entry to /etc/fstab:
/dev/sdb3 /mnt/shared_storage ext4 defaults 0 0
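As a side note, device names like /dev/sdb3 can occasionally change between boots, so a UUID-based fstab entry is a common alternative. Either way, the entry can be verified without rebooting; the following is a sketch of the checks we find useful:

# Sketch: verify the new mount and the fstab entry without rebooting
sudo mount -a                      # apply /etc/fstab; errors here mean a bad entry
findmnt /mnt/shared_storage        # confirm the partition is mounted where expected
df -h /mnt/shared_storage          # check the available space

# Optional: look up the UUID if you prefer a UUID-based fstab entry
sudo blkid /dev/sdb3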
Part 5: Improving Log Visibility in the Web UI
We needed to dynamically include logs in status.html. The get_recent_logs() function was updated to retrieve logs from multiple files:
import os
import subprocess

# LOG_DIR and LOG_FILES are assumed to be defined elsewhere in the module:
# the logs directory path and the list of log file names to display.

def get_recent_logs():
    """Retrieve the last 100 lines of each log file."""
    logs_html = "<h3>Recent Logs</h3>"
    for log_file in LOG_FILES:
        log_path = os.path.join(LOG_DIR, log_file)
        if os.path.exists(log_path):
            # Shell out to tail for the most recent 100 lines of this file
            log_output = subprocess.getoutput(f"tail -n 100 {log_path}")
            logs_html += f"<div class='large-div'><h4>{log_file}</h4><pre>{log_output}</pre></div>"
        else:
            logs_html += f"<div class='large-div'><h4>{log_file}</h4><pre>No log file found.</pre></div>"
    return logs_html
This ensured all logs were properly displayed in status.html.
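Since the function shells out to tail, it is also worth confirming from the command line that the files in /srv/ollama/logs exist and are readable by the user running the web UI. The file pattern below is a placeholder for whatever LOG_FILES actually contains:

# Sketch: confirm the logs directory is readable and tail behaves as expected
ls -lh /srv/ollama/logs/
tail -n 5 /srv/ollama/logs/*.log   # assumes the files use a .log suffix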
Part 6: Finalizing the Cluster Monitoring System
With all services running, storage configured, and logs displayed dynamically, the final step was to verify that the monitoring system itself worked. The following command provided a quick check:
python3 /srv/ollama/scripts/cluster_health_check.py
This confirmed that all services were running correctly and reporting their statuses in real time.
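The contents of cluster_health_check.py are not reproduced here, but a minimal shell equivalent of such a check, assuming the four uvicorn services and Ollama on the ports used above, might look like this:

#!/bin/bash
# Sketch of a minimal health check: probe each service port and report status.
# Service names and ports follow the services started earlier; adjust as needed.
declare -A SERVICES=(
  [dispatcher]=8000
  [model_handler]=8001
  [ai_council]=8002
  [memory_manager]=8003
  [ollama]=11434
)

for name in "${!SERVICES[@]}"; do
  port=${SERVICES[$name]}
  # curl exits 0 if the service answers at all, non-zero if the connection fails
  if curl -s -o /dev/null --max-time 2 "http://127.0.0.1:${port}/"; then
    echo "✅ ${name} (port ${port}) is responding"
  else
    echo "❌ ${name} (port ${port}) is NOT responding"
  fi
done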