DeepSeek Self-Host Guide: Complete Setup and Configuration
Table of Contents
- What is DeepSeek?
- Features of DeepSeek
- Why Self-Host DeepSeek?
- System Requirements
- Installation Guide
- Configuration
- API Usage
- Performance Optimization
- Backup and Maintenance
- Troubleshooting
- FAQ
What is DeepSeek?
DeepSeek is a series of high-performance large language models designed for code generation, reasoning, and general conversation. DeepSeek models are available for self-hosting, providing enterprises and developers with on-premises AI capabilities while maintaining data privacy and control.
Features of DeepSeek
Advanced Language Understanding
- Code Generation: Excellent at programming tasks across multiple languages
- Mathematical Reasoning: Strong performance on mathematical and logical problems
- Multilingual Support: Supports Chinese, English, and other languages
- Context Understanding: Large context windows for complex conversations
Multiple Model Variants
- DeepSeek-Coder: Specialized for programming and software development
- DeepSeek-Math: Optimized for mathematical reasoning and problem-solving
- DeepSeek-V3: Latest general-purpose model with enhanced capabilities
- DeepSeek-VL: Vision-language model for multimodal tasks
Enterprise Features
- API Compatibility: OpenAI-compatible API for easy integration
- High Throughput: Optimized for production workloads
- Fine-tuning Support: Customize models for specific use cases
- Quantization Support: Reduced memory usage with minimal quality loss
Why Self-Host DeepSeek?
Self-hosting provides complete control over your AI infrastructure and data.
Benefits of Self-Hosting DeepSeek
- Data Privacy: Keep sensitive conversations and code on your infrastructure
- Cost Control: No per-token pricing for high-volume usage
- Customization: Fine-tune models for your specific domain
- Latency: Reduce response times with local deployment
- Compliance: Meet regulatory requirements for data handling
- Offline Operation: Work without internet connectivity
System Requirements
Minimum Requirements (7B Model)
- GPU: NVIDIA RTX 4090 or equivalent (24GB VRAM)
- CPU: 8 cores
- RAM: 32GB
- Storage: 100GB SSD
- OS: Linux (Ubuntu 20.04+ recommended)
Recommended Requirements (67B Model)
- GPU: 4x NVIDIA A100 (80GB each) or 8x RTX 4090
- CPU: 32+ cores
- RAM: 128GB+
- Storage: 500GB+ NVMe SSD
- Network: High-speed interconnect for multi-GPU setup
Cloud Instance Recommendations
- AWS: p4d.24xlarge; p3.16xlarge (older V100s, better suited to smaller or quantized models)
- Google Cloud: a2-ultragpu-8g
- Azure: Standard_ND96asr_v4
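Before committing to a model size or instance type, it helps to confirm how many GPUs and how much VRAM are actually visible. A quick check with PyTorch (assuming torch with CUDA support is installed, as in the manual installation below):
import torch
# List visible GPUs and their memory; compare against the requirements above
if not torch.cuda.is_available():
    print("No CUDA-capable GPU visible to PyTorch")
else:
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f} GB VRAM")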
Installation Guide
Using Docker (Recommended)
- Install the NVIDIA Container Toolkit (the legacy nvidia-docker repository shown here works on older Ubuntu releases; apt-key is deprecated on newer releases, where NVIDIA's current keyring-based instructions should be preferred):
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
# On recent toolkit versions, register the runtime before restarting Docker:
# sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
- Pull a serving image (the image name below is a placeholder; if it is not available in your registry, a prebuilt OpenAI-compatible inference image such as vllm/vllm-openai is a common substitute):
docker pull deepseek/deepseek-llm:latest
- Run DeepSeek container:
docker run -d \
--name deepseek \
--gpus all \
-p 8000:8000 \
-v /path/to/models:/models \
-e MODEL_PATH=/models/deepseek-llm-67b-chat \
deepseek/deepseek-llm:latest
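Once the container is up, you can verify that the API responds. A sketch using the requests package, assuming the serving image exposes the OpenAI-compatible /v1/models route:
import requests
# Basic smoke test against the container started above
resp = requests.get("http://localhost:8000/v1/models", timeout=10)
resp.raise_for_status()
print(resp.json())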
Manual Installation
- Install Python dependencies:
# Create virtual environment
python3 -m venv deepseek-env
source deepseek-env/bin/activate
# Install PyTorch with CUDA support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# Install transformers and other dependencies
pip install transformers==4.36.0
pip install accelerate
pip install bitsandbytes
pip install flash-attn --no-build-isolation
- Download DeepSeek models:
# Using Hugging Face Hub
pip install huggingface-hub
python -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='deepseek-ai/deepseek-llm-67b-chat', local_dir='./models/deepseek-llm-67b-chat')"
- Create inference server:
# server.py
from flask import Flask, request, jsonify
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
app = Flask(__name__)
# Load model and tokenizer once at startup
model_path = "./models/deepseek-llm-67b-chat"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)
@app.route('/v1/chat/completions', methods=['POST'])
def chat_completions():
    data = request.json
    messages = data['messages']
    # Format messages into a plain-text prompt for DeepSeek
    conversation = ""
    for msg in messages:
        conversation += f"{msg['role']}: {msg['content']}\n"
    # Generate response (move inputs to the model's device)
    inputs = tokenizer.encode(conversation, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            inputs,
            max_new_tokens=data.get('max_tokens', 512),
            temperature=data.get('temperature', 0.7),
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id
        )
    response = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)
    return jsonify({
        "choices": [{
            "message": {"role": "assistant", "content": response},
            "index": 0,
            "finish_reason": "stop"
        }]
    })
if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8000)
- Start the server:
python server.py
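With the server running, a minimal request against the endpoint defined in server.py (a sketch using the requests package; field names match the handler above):
import requests
payload = {
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64,
    "temperature": 0.7,
}
resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload, timeout=300)
print(resp.json()["choices"][0]["message"]["content"])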
Using vLLM (High Performance)
- Install vLLM:
pip install vllm
- Start vLLM server:
python -m vllm.entrypoints.openai.api_server \
--model deepseek-ai/deepseek-llm-67b-chat \
--tensor-parallel-size 4 \
--dtype bfloat16 \
--max-model-len 4096 \
--host 0.0.0.0 \
--port 8000
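To confirm the vLLM server is serving the expected model, list the models it exposes. This sketch assumes the openai Python package is installed and that no API key is enforced on the local server:
from openai import OpenAI
# Local servers accept any placeholder key unless one is configured
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")
print([m.id for m in client.models.list().data])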
Configuration
Model Configuration
# config.yaml
model:
  name: "deepseek-llm-67b-chat"
  path: "/models/deepseek-llm-67b-chat"
  dtype: "bfloat16"
  trust_remote_code: true
server:
  host: "0.0.0.0"
  port: 8000
  max_concurrent_requests: 64
generation:
  max_tokens: 2048
  temperature: 0.7
  top_p: 0.9
  top_k: 50
  repetition_penalty: 1.1
performance:
  tensor_parallel_size: 4
  pipeline_parallel_size: 1
  gpu_memory_utilization: 0.9
  swap_space: 4
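vLLM itself is configured through CLI flags and environment variables; if you keep a config.yaml like the one above for your own launch scripts, loading it is straightforward (a sketch assuming PyYAML is installed):
import yaml
# Load the example config.yaml shown above
with open("config.yaml") as f:
    cfg = yaml.safe_load(f)
print(cfg["model"]["name"], cfg["server"]["port"], cfg["performance"]["tensor_parallel_size"])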
Environment Variables
# Model Configuration
MODEL_NAME=deepseek-llm-67b-chat
MODEL_PATH=/models/deepseek-llm-67b-chat
MAX_MODEL_LEN=4096
DTYPE=bfloat16
# Server Configuration
HOST=0.0.0.0
PORT=8000
WORKERS=1
TENSOR_PARALLEL_SIZE=4
# Performance Tuning
GPU_MEMORY_UTILIZATION=0.9
MAX_NUM_BATCHED_TOKENS=8192
MAX_NUM_SEQS=64
SWAP_SPACE=4
# Logging
LOG_LEVEL=INFO
LOG_FILE=/var/log/deepseek.log
# API Configuration
API_KEY=your-secret-api-key
DISABLE_LOG_STATS=false
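These variables are read by whatever wrapper script launches the server; mapping them onto vLLM flags is up to that script. A minimal sketch of consuming them from Python, using the names defined above with fallback defaults:
import os
# Hypothetical launcher-side reads of the variables defined above
model_path = os.environ.get("MODEL_PATH", "/models/deepseek-llm-67b-chat")
port = int(os.environ.get("PORT", "8000"))
tp_size = int(os.environ.get("TENSOR_PARALLEL_SIZE", "1"))
gpu_mem = float(os.environ.get("GPU_MEMORY_UTILIZATION", "0.9"))
print(model_path, port, tp_size, gpu_mem)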
Multi-GPU Setup
# For 4-GPU setup
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m vllm.entrypoints.openai.api_server \
--model deepseek-ai/deepseek-llm-67b-chat \
--tensor-parallel-size 4 \
--dtype bfloat16 \
--host 0.0.0.0 \
--port 8000
API Usage
Chat Completions
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer your-api-key" \
-d '{
"model": "deepseek-llm-67b-chat",
"messages": [
{"role": "user", "content": "Write a Python function to calculate fibonacci numbers"}
],
"max_tokens": 500,
"temperature": 0.7
}'
Code Generation Example
import openai
client = openai.OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="your-api-key"
)
response = client.chat.completions.create(
    model="deepseek-llm-67b-chat",
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Create a REST API using FastAPI for a todo application"}
    ],
    max_tokens=1000,
    temperature=0.3
)
print(response.choices[0].message.content)
Streaming Responses
stream = client.chat.completions.create(
    model="deepseek-llm-67b-chat",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    stream=True
)
for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")
Performance Optimization
Memory Optimization
# Enable quantization to reduce memory usage
# Note: --quantization awq requires weights that were already quantized with AWQ;
# point --model at an AWQ build rather than the original bf16 repository
python -m vllm.entrypoints.openai.api_server \
--model deepseek-ai/deepseek-llm-67b-chat \
--quantization awq \
--dtype half
Batch Processing
# vLLM batches requests that are in flight at the same time, so submit
# them concurrently instead of sequentially for better throughput
from concurrent.futures import ThreadPoolExecutor
prompts = [
    {"messages": [{"role": "user", "content": f"Question {i}"}]}
    for i in range(10)
]
def send(req):
    return client.chat.completions.create(
        model="deepseek-llm-67b-chat",
        **req
    )
with ThreadPoolExecutor(max_workers=10) as pool:
    responses = list(pool.map(send, prompts))
Caching Strategy
from functools import lru_cache
@lru_cache(maxsize=1000)
def cached_completion(prompt: str) -> str:
    # Repeated identical prompts are answered from the in-process cache
    # instead of hitting the model again; most useful with temperature=0
    response = client.chat.completions.create(
        model="deepseek-llm-67b-chat",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return response.choices[0].message.content
Backup and Maintenance
Model Backup
#!/bin/bash
# Backup DeepSeek models and configuration
# Create backup directory
mkdir -p /backup/deepseek/$(date +%Y%m%d)
# Backup models (weights are large and rarely change; consider copying them
# once and rotating only configuration backups day to day)
cp -r /models /backup/deepseek/$(date +%Y%m%d)/
# Backup configuration
cp config.yaml /backup/deepseek/$(date +%Y%m%d)/
cp .env /backup/deepseek/$(date +%Y%m%d)/
# Create tar archive
tar -czf /backup/deepseek-$(date +%Y%m%d).tar.gz /backup/deepseek/$(date +%Y%m%d)
# Cleanup old backups (keep 7 days)
find /backup -name "deepseek-*.tar.gz" -mtime +7 -delete
Health Monitoring
#!/bin/bash
# Health check script
# Check API endpoint (vLLM exposes /health; adjust the path for a custom server)
if curl -sf http://localhost:8000/health > /dev/null; then
echo "DeepSeek API is healthy"
else
echo "DeepSeek API is down"
# Restart service
systemctl restart deepseek
fi
# Check GPU memory usage
nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | while read mem; do
if [ $mem -gt 75000 ]; then
echo "High GPU memory usage: ${mem}MB"
fi
done
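If you would rather run the probe from Python, for example under a scheduler that acts on exit codes, a minimal equivalent of the curl check above:
import sys
import requests
# Exit 0 when the /health endpoint answers, 1 otherwise
try:
    healthy = requests.get("http://localhost:8000/health", timeout=5).ok
except requests.RequestException:
    healthy = False
sys.exit(0 if healthy else 1)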
Log Management
# Log rotation configuration
# /etc/logrotate.d/deepseek
/var/log/deepseek.log {
daily
missingok
rotate 30
compress
delaycompress
notifempty
postrotate
systemctl reload deepseek
endscript
}
Troubleshooting
Common Issues
Out of Memory Errors
# Reduce fragmentation in the PyTorch CUDA allocator
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512
# With vLLM, lower memory pressure by shrinking the context window
# or the fraction of GPU memory the engine reserves
python -m vllm.entrypoints.openai.api_server \
--model deepseek-ai/deepseek-llm-67b-chat \
--max-model-len 2048 \
--gpu-memory-utilization 0.85
Slow Inference
# Check GPU utilization
nvidia-smi -l 1
# Optimize batch size
python -m vllm.entrypoints.openai.api_server \
--max-num-batched-tokens 4096
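To quantify throughput from the client side, time a single completion and compute tokens per second. A rough sketch; the usage fields are populated by vLLM's OpenAI-compatible server, not by the minimal Flask server above:
import time
start = time.time()
response = client.chat.completions.create(
    model="deepseek-llm-67b-chat",
    messages=[{"role": "user", "content": "Explain the GIL in one paragraph."}],
    max_tokens=256,
)
elapsed = time.time() - start
completion_tokens = response.usage.completion_tokens
print(f"{completion_tokens} tokens in {elapsed:.1f}s -> {completion_tokens / elapsed:.1f} tok/s")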
Model Loading Issues
# Clear the Hugging Face download cache (recent versions use ~/.cache/huggingface/hub,
# older versions used ~/.cache/huggingface/transformers)
rm -rf ~/.cache/huggingface/hub/
# Verify model files
python -c "from transformers import AutoTokenizer; AutoTokenizer.from_pretrained('deepseek-ai/deepseek-llm-67b-chat')"
Performance Monitoring
# monitoring.py
# Requires: pip install psutil gputil
import psutil
import GPUtil
import time
def monitor_resources():
    while True:
        # CPU usage
        cpu_percent = psutil.cpu_percent()
        # Memory usage
        memory = psutil.virtual_memory()
        # GPU usage
        gpus = GPUtil.getGPUs()
        print(f"CPU: {cpu_percent}%, RAM: {memory.percent}%")
        for gpu in gpus:
            print(f"GPU {gpu.id}: {gpu.memoryUtil*100:.1f}% memory, {gpu.load*100:.1f}% usage")
        time.sleep(10)
if __name__ == "__main__":
    monitor_resources()
FAQ
Which DeepSeek model should I use?
- DeepSeek-Coder: Best for programming and code generation tasks
- DeepSeek-Math: Optimal for mathematical reasoning
- DeepSeek-V3: Latest general-purpose model for diverse tasks
How much VRAM do I need?
- 7B model: 16-24GB VRAM
- 33B model: 64-80GB VRAM (2x A100)
- 67B model: 128-160GB VRAM (4x A100)
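These figures follow from a simple rule of thumb: at bf16 or fp16 each parameter takes two bytes, and the KV cache plus runtime overhead add a sizeable margin on top. A back-of-the-envelope sketch:
def weight_vram_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    # Weights only; KV cache and runtime overhead come on top of this
    return params_billion * 1e9 * bytes_per_param / 1024**3
for size in (7, 33, 67):
    print(f"{size}B at bf16: ~{weight_vram_gb(size):.0f} GB for weights alone")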
Can I run DeepSeek on CPU only?
Yes, but it will be very slow. GPU acceleration is highly recommended for production use.
How do I fine-tune DeepSeek models?
from transformers import Trainer, TrainingArguments
# model, tokenizer, and train_dataset are assumed to be prepared beforehand
training_args = TrainingArguments(
    output_dir="./deepseek-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=5e-5,
    warmup_steps=100,
    logging_steps=10,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)
trainer.train()
Is DeepSeek compatible with OpenAI API?
Yes, DeepSeek can be deployed with OpenAI-compatible endpoints, making it easy to integrate with existing applications.
How do I scale DeepSeek horizontally?
Use a load balancer to distribute requests across multiple DeepSeek instances:
upstream deepseek_backend {
    server 127.0.0.1:8000;
    server 127.0.0.1:8001;
    server 127.0.0.1:8002;
}
server {
    listen 80;
    location / {
        proxy_pass http://deepseek_backend;
    }
}
Can I use DeepSeek for commercial purposes?
Check the specific model license. Many DeepSeek models have permissive licenses allowing commercial use.