DeepSeek Self-Host Guide: Complete Setup and Configuration
Table of Contents
- What is DeepSeek?
- Features of DeepSeek
- Why Self-Host DeepSeek?
- System Requirements
- Installation Guide
- Configuration
- API Usage
- Performance Optimization
- Backup and Maintenance
- Troubleshooting
- FAQ
What is DeepSeek?
DeepSeek is a series of high-performance large language models designed for code generation, reasoning, and general conversation. DeepSeek models are available for self-hosting, providing enterprises and developers with on-premises AI capabilities while maintaining data privacy and control.
Features of DeepSeek
Advanced Language Understanding
- Code Generation: Excellent at programming tasks across multiple languages
- Mathematical Reasoning: Strong performance on mathematical and logical problems
- Multilingual Support: Supports Chinese, English, and other languages
- Context Understanding: Large context windows for complex conversations
Multiple Model Variants
- DeepSeek-Coder: Specialized for programming and software development
- DeepSeek-Math: Optimized for mathematical reasoning and problem-solving
- DeepSeek-V3: Latest general-purpose model with enhanced capabilities
- DeepSeek-VL: Vision-language model for multimodal tasks
Enterprise Features
- API Compatibility: OpenAI-compatible API for easy integration
- High Throughput: Optimized for production workloads
- Fine-tuning Support: Customize models for specific use cases
- Quantization Support: Reduced memory usage with minimal quality loss
Why Self-Host DeepSeek?
Self-hosting provides complete control over your AI infrastructure and data.
Benefits of Self-Hosting DeepSeek
- Data Privacy: Keep sensitive conversations and code on your infrastructure
- Cost Control: No per-token pricing for high-volume usage
- Customization: Fine-tune models for your specific domain
- Latency: Reduce response times with local deployment
- Compliance: Meet regulatory requirements for data handling
- Offline Operation: Work without internet connectivity
System Requirements
Minimum Requirements (7B Model)
- GPU: NVIDIA RTX 4090 or equivalent (24GB VRAM)
- CPU: 8 cores
- RAM: 32GB
- Storage: 100GB SSD
- OS: Linux (Ubuntu 20.04+ recommended)
Recommended Requirements (67B Model)
- GPU: 4x NVIDIA A100 (80GB each) or 8x RTX 4090
- CPU: 32+ cores
- RAM: 128GB+
- Storage: 500GB+ NVMe SSD
- Network: High-speed interconnect for multi-GPU setup
Cloud Instance Recommendations
- AWS: p4d.24xlarge; p3.16xlarge (older V100s, better suited to smaller or quantized models)
- Google Cloud: a2-ultragpu-8g
- Azure: Standard_ND96asr_v4
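Before committing to a model size or instance type, it helps to confirm how many GPUs and how much VRAM are actually visible. A quick check with PyTorch (assuming torch with CUDA support is installed, as in the manual installation below):
import torch
# List visible GPUs and their memory; compare against the requirements above
if not torch.cuda.is_available():
    print("No CUDA-capable GPU visible to PyTorch")
else:
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f} GB VRAM")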
Installation Guide
Using Docker (Recommended)
- Install the NVIDIA Container Toolkit (the legacy nvidia-docker repository shown here works on older Ubuntu releases; apt-key is deprecated on newer releases, where NVIDIA's current keyring-based instructions should be preferred):
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
# On recent toolkit versions, register the runtime before restarting Docker:
# sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
- Pull a serving image (the image name below is a placeholder; if it is not available in your registry, a prebuilt OpenAI-compatible inference image such as vllm/vllm-openai is a common substitute):
docker pull deepseek/deepseek-llm:latest
- Run DeepSeek container:
docker run -d \
--name deepseek \
--gpus all \
-p 8000:8000 \
-v /path/to/models:/models \
-e MODEL_PATH=/models/deepseek-llm-67b-chat \
deepseek/deepseek-llm:latest
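Once the container is up, you can verify that the API responds. A sketch using the requests package, assuming the serving image exposes the OpenAI-compatible /v1/models route:
import requests
# Basic smoke test against the container started above
resp = requests.get("http://localhost:8000/v1/models", timeout=10)
resp.raise_for_status()
print(resp.json())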
Manual Installation
- Install Python dependencies:
# Create virtual environment
python3 -m venv deepseek-env
source deepseek-env/bin/activate
# Install PyTorch with CUDA support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# Install transformers and other dependencies
pip install transformers==4.36.0
pip install accelerate
pip install bitsandbytes
pip install flash-attn --no-build-isolation
- Download DeepSeek models:
# Using Hugging Face Hub
pip install huggingface-hub
python -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='deepseek-ai/deepseek-llm-67b-chat', local_dir='./models/deepseek-llm-67b-chat')"
- Create inference server:
# server.py
from flask import Flask, request, jsonify
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
app = Flask(__name__)
# Load model and tokenizer once at startup
model_path = "./models/deepseek-llm-67b-chat"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)
@app.route('/v1/chat/completions', methods=['POST'])
def chat_completions():
    data = request.json
    messages = data['messages']
    # Format messages into a plain-text prompt for DeepSeek
    conversation = ""
    for msg in messages:
        conversation += f"{msg['role']}: {msg['content']}\n"
    # Generate response (move inputs to the model's device)
    inputs = tokenizer.encode(conversation, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            inputs,
            max_new_tokens=data.get('max_tokens', 512),
            temperature=data.get('temperature', 0.7),
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id
        )
    response = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)
    return jsonify({
        "choices": [{
            "message": {"role": "assistant", "content": response},
            "index": 0,
            "finish_reason": "stop"
        }]
    })
if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8000)
- Start the server:
python server.py
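With the server running, a minimal request against the endpoint defined in server.py (a sketch using the requests package; field names match the handler above):
import requests
payload = {
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64,
    "temperature": 0.7,
}
resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload, timeout=300)
print(resp.json()["choices"][0]["message"]["content"])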
Using vLLM (High Performance)
- Install vLLM:
pip install vllm
- Start vLLM server:
python -m vllm.entrypoints.openai.api_server \
--model deepseek-ai/deepseek-llm-67b-chat \
--tensor-parallel-size 4 \
--dtype bfloat16 \
--max-model-len 4096 \
--host 0.0.0.0 \
--port 8000
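To confirm the vLLM server is serving the expected model, list the models it exposes. This sketch assumes the openai Python package is installed and that no API key is enforced on the local server:
from openai import OpenAI
# Local servers accept any placeholder key unless one is configured
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")
print([m.id for m in client.models.list().data])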
Configuration
Model Configuration
# config.yaml
model:
  name: "deepseek-llm-67b-chat"
  path: "/models/deepseek-llm-67b-chat"
  dtype: "bfloat16"
  trust_remote_code: true
server:
  host: "0.0.0.0"
  port: 8000
  max_concurrent_requests: 64
generation:
  max_tokens: 2048
  temperature: 0.7
  top_p: 0.9
  top_k: 50
  repetition_penalty: 1.1
performance:
  tensor_parallel_size: 4
  pipeline_parallel_size: 1
  gpu_memory_utilization: 0.9
  swap_space: 4
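vLLM itself is configured through CLI flags and environment variables; if you keep a config.yaml like the one above for your own launch scripts, loading it is straightforward (a sketch assuming PyYAML is installed):
import yaml
# Load the example config.yaml shown above
with open("config.yaml") as f:
    cfg = yaml.safe_load(f)
print(cfg["model"]["name"], cfg["server"]["port"], cfg["performance"]["tensor_parallel_size"])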
Environment Variables
# Model Configuration
MODEL_NAME=deepseek-llm-67b-chat
MODEL_PATH=/models/deepseek-llm-67b-chat
MAX_MODEL_LEN=4096
DTYPE=bfloat16
# Server Configuration
HOST=0.0.0.0
PORT=8000
WORKERS=1
TENSOR_PARALLEL_SIZE=4
# Performance Tuning
GPU_MEMORY_UTILIZATION=0.9
MAX_NUM_BATCHED_TOKENS=8192
MAX_NUM_SEQS=64
SWAP_SPACE=4
# Logging
LOG_LEVEL=INFO
LOG_FILE=/var/log/deepseek.log
# API Configuration
API_KEY=your-secret-api-key
DISABLE_LOG_STATS=false
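These variables are read by whatever wrapper script launches the server; mapping them onto vLLM flags is up to that script. A minimal sketch of consuming them from Python, using the names defined above with fallback defaults:
import os
# Hypothetical launcher-side reads of the variables defined above
model_path = os.environ.get("MODEL_PATH", "/models/deepseek-llm-67b-chat")
port = int(os.environ.get("PORT", "8000"))
tp_size = int(os.environ.get("TENSOR_PARALLEL_SIZE", "1"))
gpu_mem = float(os.environ.get("GPU_MEMORY_UTILIZATION", "0.9"))
print(model_path, port, tp_size, gpu_mem)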
Multi-GPU Setup
# For 4-GPU setup
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m vllm.entrypoints.openai.api_server \
--model deepseek-ai/deepseek-llm-67b-chat \
--tensor-parallel-size 4 \
--dtype bfloat16 \
--host 0.0.0.0 \
--port 8000
API Usage
Chat Completions
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer your-api-key" \
-d '{
"model": "deepseek-llm-67b-chat",
"messages": [
{"role": "user", "content": "Write a Python function to calculate fibonacci numbers"}
],
"max_tokens": 500,
"temperature": 0.7
}'
Code Generation Example
import openai
client = openai.OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="your-api-key"
)
response = client.chat.completions.create(
    model="deepseek-llm-67b-chat",
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Create a REST API using FastAPI for a todo application"}
    ],
    max_tokens=1000,
    temperature=0.3
)
print(response.choices[0].message.content)
Streaming Responses
stream = client.chat.completions.create(
    model="deepseek-llm-67b-chat",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    stream=True
)
for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")
Performance Optimization
Memory Optimization
# Enable quantization to reduce memory usage
# Note: --quantization awq requires weights that were already quantized with AWQ;
# point --model at an AWQ build rather than the original bf16 repository
python -m vllm.entrypoints.openai.api_server \
--model deepseek-ai/deepseek-llm-67b-chat \
--quantization awq \
--dtype half
Batch Processing
# vLLM batches requests that are in flight at the same time, so submit
# them concurrently instead of sequentially for better throughput
from concurrent.futures import ThreadPoolExecutor
prompts = [
    {"messages": [{"role": "user", "content": f"Question {i}"}]}
    for i in range(10)
]
def send(req):
    return client.chat.completions.create(
        model="deepseek-llm-67b-chat",
        **req
    )
with ThreadPoolExecutor(max_workers=10) as pool:
    responses = list(pool.map(send, prompts))
Caching Strategy
from functools import lru_cache
@lru_cache(maxsize=1000)
def cached_completion(prompt: str) -> str:
    # Repeated identical prompts are answered from the in-process cache
    # instead of hitting the model again; most useful with temperature=0
    response = client.chat.completions.create(
        model="deepseek-llm-67b-chat",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return response.choices[0].message.content
Backup and Maintenance
Model Backup
#!/bin/bash
# Backup DeepSeek models and configuration
# Create backup directory
mkdir -p /backup/deepseek/$(date +%Y%m%d)
# Backup models (weights are large and rarely change; consider copying them
# once and rotating only configuration backups day to day)
cp -r /models /backup/deepseek/$(date +%Y%m%d)/
# Backup configuration
cp config.yaml /backup/deepseek/$(date +%Y%m%d)/
cp .env /backup/deepseek/$(date +%Y%m%d)/
# Create tar archive
tar -czf /backup/deepseek-$(date +%Y%m%d).tar.gz /backup/deepseek/$(date +%Y%m%d)
# Cleanup old backups (keep 7 days)
find /backup -name "deepseek-*.tar.gz" -mtime +7 -delete
Health Monitoring
#!/bin/bash
# Health check script
# Check API endpoint (vLLM exposes /health; adjust the path for a custom server)
if curl -sf http://localhost:8000/health > /dev/null; then
echo "DeepSeek API is healthy"
else
echo "DeepSeek API is down"
# Restart service
systemctl restart deepseek
fi
# Check GPU memory usage
nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | while read mem; do
if [ $mem -gt 75000 ]; then
echo "High GPU memory usage: ${mem}MB"
fi
done
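If you would rather run the probe from Python, for example under a scheduler that acts on exit codes, a minimal equivalent of the curl check above:
import sys
import requests
# Exit 0 when the /health endpoint answers, 1 otherwise
try:
    healthy = requests.get("http://localhost:8000/health", timeout=5).ok
except requests.RequestException:
    healthy = False
sys.exit(0 if healthy else 1)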
Log Management
# Log rotation configuration
# /etc/logrotate.d/deepseek
/var/log/deepseek.log {
daily
missingok
rotate 30
compress
delaycompress
notifempty
postrotate
systemctl reload deepseek
endscript
}
Troubleshooting
Common Issues
Out of Memory Errors
# Reduce fragmentation in the PyTorch CUDA allocator
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512
# With vLLM, lower memory pressure by shrinking the context window
# or the fraction of GPU memory the engine reserves
python -m vllm.entrypoints.openai.api_server \
--model deepseek-ai/deepseek-llm-67b-chat \
--max-model-len 2048 \
--gpu-memory-utilization 0.85
Slow Inference
# Check GPU utilization
nvidia-smi -l 1
# Optimize batch size
python -m vllm.entrypoints.openai.api_server \
--max-num-batched-tokens 4096
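To quantify throughput from the client side, time a single completion and compute tokens per second. A rough sketch; the usage fields are populated by vLLM's OpenAI-compatible server, not by the minimal Flask server above:
import time
start = time.time()
response = client.chat.completions.create(
    model="deepseek-llm-67b-chat",
    messages=[{"role": "user", "content": "Explain the GIL in one paragraph."}],
    max_tokens=256,
)
elapsed = time.time() - start
completion_tokens = response.usage.completion_tokens
print(f"{completion_tokens} tokens in {elapsed:.1f}s -> {completion_tokens / elapsed:.1f} tok/s")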
Model Loading Issues
# Clear the Hugging Face download cache (recent versions use ~/.cache/huggingface/hub,
# older versions used ~/.cache/huggingface/transformers)
rm -rf ~/.cache/huggingface/hub/
# Verify model files
python -c "from transformers import AutoTokenizer; AutoTokenizer.from_pretrained('deepseek-ai/deepseek-llm-67b-chat')"
Performance Monitoring
# monitoring.py
# Requires: pip install psutil gputil
import psutil
import GPUtil
import time
def monitor_resources():
    while True:
        # CPU usage
        cpu_percent = psutil.cpu_percent()
        # Memory usage
        memory = psutil.virtual_memory()
        # GPU usage
        gpus = GPUtil.getGPUs()
        print(f"CPU: {cpu_percent}%, RAM: {memory.percent}%")
        for gpu in gpus:
            print(f"GPU {gpu.id}: {gpu.memoryUtil*100:.1f}% memory, {gpu.load*100:.1f}% usage")
        time.sleep(10)
if __name__ == "__main__":
    monitor_resources()
FAQ
Which DeepSeek model should I use?
- DeepSeek-Coder: Best for programming and code generation tasks
- DeepSeek-Math: Optimal for mathematical reasoning
- DeepSeek-V3: Latest general-purpose model for diverse tasks
How much VRAM do I need?
- 7B model: 16-24GB VRAM
- 33B model: 64-80GB VRAM (2x A100)
- 67B model: 128-160GB VRAM (4x A100)
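These figures follow from a simple rule of thumb: at bf16 or fp16 each parameter takes two bytes, and the KV cache plus runtime overhead add a sizeable margin on top. A back-of-the-envelope sketch:
def weight_vram_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    # Weights only; KV cache and runtime overhead come on top of this
    return params_billion * 1e9 * bytes_per_param / 1024**3
for size in (7, 33, 67):
    print(f"{size}B at bf16: ~{weight_vram_gb(size):.0f} GB for weights alone")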
Can I run DeepSeek on CPU only?
Yes, but it will be very slow. GPU acceleration is highly recommended for production use.
How do I fine-tune DeepSeek models?
from transformers import Trainer, TrainingArguments
# model, tokenizer, and train_dataset are assumed to be prepared beforehand
training_args = TrainingArguments(
    output_dir="./deepseek-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=5e-5,
    warmup_steps=100,
    logging_steps=10,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)
trainer.train()
Is DeepSeek compatible with OpenAI API?
Yes, DeepSeek can be deployed with OpenAI-compatible endpoints, making it easy to integrate with existing applications.
How do I scale DeepSeek horizontally?
Use a load balancer to distribute requests across multiple DeepSeek instances:
upstream deepseek_backend {
    server 127.0.0.1:8000;
    server 127.0.0.1:8001;
    server 127.0.0.1:8002;
}
server {
    listen 80;
    location / {
        proxy_pass http://deepseek_backend;
    }
}
Can I use DeepSeek for commercial purposes?
Check the specific model license. Many DeepSeek models have permissive licenses allowing commercial use.