Overview
This article provides a comprehensive guide to setting up a distributed DeepSeek AI cluster on three Ubuntu machines. The goal is to enable these machines to interact with each other, distribute queries efficiently, and optimize computational performance.
Architecture Design
1. Core Setup Overview
Each machine will:
- Run an instance of DeepSeek.
- Communicate with the other machines; in this guide, clients reach the controller over WebSockets and the controller reaches workers over REST calls (a message broker is another common option).
- Share insights collaboratively and refine responses.
2. Roles of Each Machine
- Machine 1 (Controller Node): Manages coordination, distributes queries, and aggregates responses.
- Machines 2 & 3 (Worker Nodes): Process DeepSeek model inference and share results.
3. Communication Framework
- The controller uses WebSockets for real-time, bidirectional communication with clients, and forwards queries to the workers over HTTP.
- Queries will be load-balanced across worker nodes to optimize performance.
Step-by-Step Deployment
1. Install Required Software
Run these commands on all three machines:
sudo apt update
sudo apt install python3 python3-pip
pip3 install flask flask-socketio eventlet requests
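Optionally, isolate these Python packages in a virtual environment on each machine; this is standard practice rather than a requirement of this setup:
python3 -m venv ~/deepseek-cluster
source ~/deepseek-cluster/bin/activate
pip3 install flask flask-socketio eventlet requests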
2. Install DeepSeek
Clone the DeepSeek repository and install dependencies:
git clone https://github.com/deepseek-ai/DeepSeek.git
cd DeepSeek
pip3 install -r requirements.txt
3. Set Up Controller Node (Machine 1)
This machine will receive user queries and distribute them to worker nodes.
controller.py:
from flask import Flask
from flask_socketio import SocketIO, emit
import requests

app = Flask(__name__)
socketio = SocketIO(app)

# Addresses of the worker nodes
WORKERS = ["http://192.168.1.11:5000", "http://192.168.1.12:5000"]

@app.route('/')
def index():
    return "Controller Node Running."

@socketio.on('query')
def handle_query(data):
    question = data['question']
    print(f"Received question: {question}")

    # Distribute the query to every worker
    responses = []
    for worker in WORKERS:
        try:
            response = requests.post(worker + "/query", json={"question": question}, timeout=60).json()
            responses.append(response.get("answer", ""))
        except Exception as e:
            responses.append(f"Error contacting {worker}: {e}")

    # Aggregate the worker answers and return them to the client
    final_response = "\n".join(responses)
    emit('response', {"final_response": final_response})

if __name__ == '__main__':
    socketio.run(app, host='0.0.0.0', port=5000)
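To test the controller, a client can connect over Socket.IO, emit a query event, and listen for the response event. The following is a minimal sketch, assuming the python-socketio client package (pip3 install "python-socketio[client]") and a controller address of 192.168.1.10, which is an assumed value chosen to match the worker addresses above.
test_client.py:
import socketio

sio = socketio.Client()

@sio.on('response')
def on_response(data):
    # Print the aggregated answer returned by the controller
    print(data['final_response'])
    sio.disconnect()

sio.connect('http://192.168.1.10:5000')  # controller address (assumed)
sio.emit('query', {'question': 'Explain distributed inference in one sentence.'})
sio.wait()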
4. Set Up Worker Nodes (Machines 2 & 3)
Each worker processes queries from the controller.
worker.py:
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/query', methods=['POST'])
def query():
    question = request.json.get('question', '')
    # Run DeepSeek model inference here
    response = f"DeepSeek response for: {question}"  # Mock response
    return jsonify({"answer": response})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
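To swap the mock response for real inference, one option is to serve a distilled DeepSeek-R1 model locally with Ollama (see the Learn More section) and call its HTTP API from the worker. This is a minimal sketch, assuming Ollama is running on each worker and a model such as deepseek-r1:7b has already been pulled:
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def run_inference(question):
    # Non-streaming generation request against the locally served model
    payload = {"model": "deepseek-r1:7b", "prompt": question, "stream": False}
    resp = requests.post(OLLAMA_URL, json=payload, timeout=300)
    resp.raise_for_status()
    return resp.json()["response"]
With this helper in place, the mock line in worker.py becomes response = run_inference(question).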
5. Implement Load Balancing
Modify controller.py to distribute queries across worker nodes in round-robin order:
worker_index = 0

@socketio.on('query')
def handle_query(data):
    global worker_index
    question = data['question']

    # Round-robin: pick the next worker, then advance the index
    worker = WORKERS[worker_index]
    worker_index = (worker_index + 1) % len(WORKERS)

    try:
        response = requests.post(worker + "/query", json={"question": question}, timeout=60).json()
        emit('response', {"final_response": response.get("answer", "")})
    except Exception as e:
        emit('response', {"final_response": f"Error contacting {worker}: {e}"})
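Round-robin is the simplest policy and assumes the workers have similar capacity. If the machines are unevenly equipped, for example one has a GPU and the other does not, a weighted rotation or least-loaded selection would spread the work more fairly.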
6. Add Consensus Mechanism
Have each worker report a confidence score with its answer so the controller can select the best response.
Modify worker.py:
import random

@app.route('/query', methods=['POST'])
def query():
    question = request.json.get('question', '')
    response = f"DeepSeek response for: {question}"  # Mock response
    confidence = random.uniform(0.8, 1.0)  # Placeholder confidence score
    return jsonify({"answer": response, "confidence": confidence})
Modify controller.py:
@socketio.on('query')
def handle_query(data):
    question = data['question']
    responses = []
    for worker in WORKERS:
        try:
            response = requests.post(worker + "/query", json={"question": question}, timeout=60).json()
            responses.append(response)
        except Exception as e:
            responses.append({"answer": f"Error contacting {worker}: {e}", "confidence": 0.0})

    # Return the answer with the highest confidence score
    best_response = max(responses, key=lambda r: r["confidence"])
    emit('response', {"final_response": best_response["answer"]})
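The random confidence above is only a stand-in. In a real deployment, a worker could report a model-derived signal instead, such as the mean token log-probability of its generation, so that the controller's selection reflects actual answer quality.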
Optimizations for Performance
- Use GPUs: Add NVIDIA GPUs to worker nodes for faster inference.
- Enable Multi-Turn Dialogues: Store conversation history for context-aware interactions.
- Monitor System Performance: Install Prometheus and Grafana to visualize resource usage (a metrics sketch follows this list).
- Hybrid Cloud Integration: Connect the cluster to cloud-based GPU resources for scalability.
- Distributed File System: Implement NFS or GlusterFS for shared storage among machines.
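As a minimal sketch of the monitoring idea, the controller can expose its own counters for Prometheus to scrape via the prometheus_client package (an extra dependency, pip3 install prometheus-client); the metric names here are illustrative, not prescribed by this setup:
from prometheus_client import Counter, Histogram, start_http_server

QUERIES_TOTAL = Counter('controller_queries_total', 'Queries received by the controller')
QUERY_LATENCY = Histogram('controller_query_seconds', 'End-to-end query latency in seconds')

# Serve metrics at http://<controller>:8001/metrics for Prometheus to scrape
start_http_server(8001)

@QUERY_LATENCY.time()
def answer_query(question):
    QUERIES_TOTAL.inc()
    # ... distribute the query to workers as in controller.py ...
    return "aggregated answer"
Grafana can then chart these series alongside node-level CPU, memory, and GPU metrics.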
Learn More
DeepSeek-R1: An Open-Source Advanced Reasoning Model
https://huggingface.co/deepseek-ai/DeepSeek-R1
DeepSeek-R1 is an advanced reasoning model developed by DeepSeek, a Chinese AI company. It achieves performance comparable to OpenAI’s o1 model across tasks such as mathematics, coding, and reasoning. The model was trained using large-scale reinforcement learning (RL) without supervised fine-tuning (SFT), allowing it to naturally develop powerful reasoning behaviors. To support the research community, DeepSeek has open-sourced DeepSeek-R1 and its distilled versions based on Llama and Qwen architectures. These distilled models offer various parameter sizes, providing flexibility for different computational resources. The open-source nature of DeepSeek-R1 encourages further research and development in AI reasoning capabilities.
Ollama: Get up and running with large language models.
https://ollama.com/library/deepseek-r1
DeepSeek’s first-generation reasoning models, achieving performance comparable to OpenAI-o1 across math, code, and reasoning tasks.
Models
DeepSeek-R1
ollama run deepseek-r1:671b
Distilled models
The DeepSeek team has demonstrated that the reasoning patterns of larger models can be distilled into smaller models, yielding better performance than the reasoning patterns discovered through RL on small models alone.
The smaller models were created by fine-tuning several dense models widely used in the research community on reasoning data generated by DeepSeek-R1. Evaluation results show that these distilled dense models perform exceptionally well on benchmarks.
Conclusion
This setup enables three Ubuntu machines to work collaboratively, optimizing workload distribution and model inference efficiency. Future enhancements could include fault-tolerance mechanisms, such as worker health checks and automatic failover, and tighter integration with cloud-based AI services.