Nginx Operations: Monitoring and Management

清峰

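Business Scenario

When operating a large internet platform, the operations team faces the following challenges:

- 7x24 availability: the platform must run reliably all year round, and any outage can cause significant business loss
- Real-time monitoring: the team needs a live view of system state so that anomalies are detected and handled promptly
- Automated operations: with hundreds of servers, deployment, configuration management, and failure recovery must be automated
- Security and compliance: the platform must meet security standards and compliance requirements and undergo regular security audits
- Performance tuning: system performance is monitored continuously, and configuration is adjusted as the business grows
- Fast incident response: a well-defined incident response process shortens recovery time

These requirements are the core of Nginx operations management. A solid monitoring system and management process keep the Nginx service stable and reliable.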

Building the Monitoring System

Basic Monitoring Configuration

# Enable Nginx status monitoring
server {
    listen 8080;
    server_name localhost;
    
    # Basic status monitoring
    location /nginx_status {
        stub_status on;
        access_log off;
        allow 127.0.0.1;
        deny all;
    }
    
    # Health check endpoint
    # default_type sets the Content-Type of the returned body;
    # add_header Content-Type would emit a second, duplicate header
    location /health {
        access_log off;
        default_type text/plain;
        return 200 "healthy\n";
    }
    
    # Version information (best disabled in production)
    location /nginx_info {
        access_log off;
        default_type text/plain;
        return 200 "Nginx Version: $nginx_version\n";
    }
}
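
To make the /nginx_status endpoint easier to consume from scripts, the sketch below turns a stub_status response into key=value pairs. The function name `parse_stub_status` and the sample numbers are illustrative assumptions; the layout of the response (a connections line, a header line, a line of three counters, and a Reading/Writing/Waiting line) is the one the stub_status module produces.

```shell
#!/bin/bash
# Turn stub_status text (read from stdin) into one key=value pair per line.
parse_stub_status() {
    awk '
        /^Active connections:/ { print "active=" $3 }
        # The counters line holds only numbers: accepts, handled, requests.
        /^ *[0-9]+ +[0-9]+ +[0-9]+/ {
            print "accepts=" $1; print "handled=" $2; print "requests=" $3
        }
        /^Reading:/ { print "reading=" $2; print "writing=" $4; print "waiting=" $6 }
    '
}

# Sample response body (made-up numbers) in the stub_status layout.
sample='Active connections: 291
server accepts handled requests
 16630948 16630948 31070465
Reading: 6 Writing: 179 Waiting: 106'

printf '%s\n' "$sample" | parse_stub_status
```

Against a live server the same function could be fed with `curl -s http://localhost:8080/nginx_status | parse_stub_status`, run from the host that the `allow` rule admits.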

Detailed Status Monitoring (requires the nginx-module-vts module)

# Virtual host traffic status monitoring
http {
    vhost_traffic_status_zone;
    
    server {
        listen 8080;
        server_name localhost;
        
        # Detailed traffic status
        location /status {
            vhost_traffic_status_display;
            vhost_traffic_status_display_format html;
            allow 127.0.0.1;
            deny all;
        }
        
        # Status data in JSON format
        location /status.json {
            vhost_traffic_status_display;
            vhost_traffic_status_display_format json;
            allow 127.0.0.1;
            deny all;
        }
    }
}

Custom Monitoring Metrics

# Custom log format for monitoring
log_format monitor '$remote_addr - $remote_user [$time_local] "$request" '
                  '$status $body_bytes_sent "$http_referer" '
                  '"$http_user_agent" "$http_x_forwarded_for" '
                  'request_time=$request_time '
                  'upstream_response_time=$upstream_response_time '
                  'upstream_connect_time=$upstream_connect_time '
                  'upstream_header_time=$upstream_header_time '
                  'bytes_sent=$bytes_sent '
                  'gzip_ratio=$gzip_ratio';

server {
    access_log /var/log/nginx/monitor.log monitor;
    
    location / {
        proxy_pass http://backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
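
The `request_time=` field in the monitor format is what makes latency analysis possible from the log alone. As a rough sketch, the function below (the name `request_time_stats` and the sample log lines are assumptions for illustration) extracts that field and reports the count, mean, and a nearest-rank 95th percentile.

```shell
#!/bin/bash
# Print count, mean and 95th-percentile request_time from "monitor" log lines on stdin.
request_time_stats() {
    awk '{
            for (i = 1; i <= NF; i++)
                if ($i ~ /^request_time=/) { split($i, kv, "="); print kv[2] }
         }' |
    sort -n |
    awk '{ v[NR] = $1; sum += $1 }
         END {
             if (NR == 0) { print "no samples"; exit }
             idx = int(NR * 0.95); if (idx < NR * 0.95) idx++   # nearest rank: ceil(0.95 * N)
             printf "count=%d mean=%.3f p95=%.3f\n", NR, sum / NR, v[idx]
         }'
}

# Four synthetic log lines in the monitor format, differing only in request_time.
for t in 0.010 0.020 0.030 0.500; do
    printf '203.0.113.7 - - [01/Jan/2025:00:00:00 +0000] "GET / HTTP/1.1" 200 612 "-" "curl/8.0" "-" request_time=%s upstream_response_time=%s upstream_connect_time=0.001 upstream_header_time=0.005 bytes_sent=850 gzip_ratio=-\n' "$t" "$t"
done | request_time_stats
```

On a real host the input would be `/var/log/nginx/monitor.log`. The percentile is more informative than the mean here because a single slow request dominates the tail without moving the average much.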

Log Management and Analysis

Log Rotation Configuration

# /etc/logrotate.d/nginx
/var/log/nginx/*.log {
    daily
    missingok
    rotate 52
    compress
    delaycompress
    notifempty
    create 644 nginx nginx
    sharedscripts
    prerotate
        if [ -d /etc/logrotate.d/httpd-prerotate ]; then \
            run-parts /etc/logrotate.d/httpd-prerotate; \
        fi \
    endscript
    postrotate
        [ -f /var/run/nginx.pid ] && kill -USR1 `cat /var/run/nginx.pid`
    endscript
}

Structured Log Configuration

# JSON-format log
log_format json_combined escape=json '{'
    '"time_local":"$time_local",'
    '"remote_addr":"$remote_addr",'
    '"remote_user":"$remote_user",'
    '"request":"$request",'
    '"status": "$status",'
    '"body_bytes_sent":"$body_bytes_sent",'
    '"http_referer":"$http_referer",'
    '"http_user_agent":"$http_user_agent",'
    '"http_x_forwarded_for":"$http_x_forwarded_for",'
    '"request_time":"$request_time",'
    '"upstream_response_time":"$upstream_response_time",'
    '"gzip_ratio":"$gzip_ratio"'
'}';

server {
    access_log /var/log/nginx/access.json json_combined;
    error_log /var/log/nginx/error.log warn;
}

Log Analysis Script

#!/bin/bash
# Nginx log analysis script
# Reads the log written with the "monitor" format, the only one above that records request_time

LOG_FILE="/var/log/nginx/monitor.log"
DATE=$(date +%Y-%m-%d)
REPORT_DIR="/var/log/nginx/reports"
REPORT="$REPORT_DIR/daily_$DATE.txt"

# Create the report directory
mkdir -p "$REPORT_DIR"

# Start the daily access report
echo "=== Daily Access Report - $DATE ===" > "$REPORT"

# Total requests
echo "Total Requests: $(wc -l < "$LOG_FILE")" >> "$REPORT"

# Unique client IPs
echo "Unique IPs: $(awk '{print $1}' "$LOG_FILE" | sort -u | wc -l)" >> "$REPORT"

# Status code distribution
echo "Status Code Distribution:" >> "$REPORT"
awk '{print $9}' "$LOG_FILE" | sort | uniq -c | sort -nr >> "$REPORT"

# Most requested URLs
echo "Top 10 URLs:" >> "$REPORT"
awk '{print $7}' "$LOG_FILE" | sort | uniq -c | sort -nr | head -10 >> "$REPORT"

# User agent statistics
echo "Top 10 User Agents:" >> "$REPORT"
awk -F'"' '{print $6}' "$LOG_FILE" | sort | uniq -c | sort -nr | head -10 >> "$REPORT"

# Response time analysis: average the request_time=<seconds> field of each line
# (the last field of the monitor format is gzip_ratio, so $NF cannot be used here)
echo "Response Time Analysis:" >> "$REPORT"
awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^request_time=/) { split($i, kv, "="); sum += kv[2]; count++ } }
     END { if (count > 0) printf "Average Response Time: %.3fs\n", sum / count }' "$LOG_FILE" >> "$REPORT"

Automated Deployment and Configuration Management

Ansible Deployment Script

# nginx-deploy.yml
---
- name: Deploy Nginx Configuration
  hosts: webservers
  become: yes
  vars:
    nginx_user: nginx
    nginx_worker_processes: auto
    nginx_worker_connections: 65535
    
  tasks:
    - name: Install Nginx
      yum:
        name: nginx
        state: present
      when: ansible_os_family == "RedHat"
      
    - name: Install Nginx
      apt:
        name: nginx
        state: present
      when: ansible_os_family == "Debian"
      
    - name: Create SSL directory
      file:
        path: /etc/nginx/ssl
        state: directory
        mode: '0755'
        
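    - name: Copy SSL certificates
      copy:
        src: files/{{ item.name }}
        dest: /etc/nginx/ssl/{{ item.name }}
        mode: "{{ item.mode }}"
      loop:
        - { name: example.com.crt, mode: '0644' }
        - { name: example.com.key, mode: '0600' }  # private key: readable by the owner only
        - { name: dhparam.pem, mode: '0644' }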
        
    - name: Deploy main nginx.conf
      template:
        src: templates/nginx.conf.j2
        dest: /etc/nginx/nginx.conf
        mode: '0644'
      notify: reload nginx
      
    - name: Deploy site configuration
      template:
        src: templates/site.conf.j2
        dest: /etc/nginx/conf.d/{{ inventory_hostname }}.conf
        mode: '0644'
      notify: reload nginx
      
    - name: Ensure Nginx is running
      systemd:
        name: nginx
        state: started
        enabled: yes
        
  handlers:
    - name: reload nginx
      systemd:
        name: nginx
        state: reloaded
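
The `reload nginx` handler assumes the rendered templates are valid. For manual changes outside Ansible, a small guard like the following can serve the same purpose: it reloads only when `nginx -t` accepts the configuration. The function name `safe_reload` is an assumption made for this sketch; `nginx -t` and `systemctl reload nginx` are the standard commands.

```shell
#!/bin/bash
# Reload nginx only when the configuration on disk passes the syntax test.
safe_reload() {
    if nginx -t 2>/dev/null; then
        systemctl reload nginx
    else
        echo "nginx -t failed; keeping the running configuration" >&2
        return 1
    fi
}
```

Calling `safe_reload` after editing a file under /etc/nginx leaves the running workers untouched when the test fails, so a typo does not turn into an outage.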

Configuration Template Example

# templates/nginx.conf.j2
user {{ nginx_user }};
worker_processes {{ nginx_worker_processes }};
worker_rlimit_nofile 65535;
error_log /var/log/nginx/error.log warn;
pid /var/run/nginx.pid;

events {
    worker_connections {{ nginx_worker_connections }};
    use epoll;
    multi_accept on;
}

http {
    include /etc/nginx/mime.types;
    default_type application/octet-stream;
    
    log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                    '$status $body_bytes_sent "$http_referer" '
                    '"$http_user_agent" "$http_x_forwarded_for"';
    
    access_log /var/log/nginx/access.log main;
    
    sendfile on;
    tcp_nopush on;
    tcp_nodelay on;
    keepalive_timeout 65;
    
    gzip on;
    gzip_vary on;
    gzip_min_length 1024;
    gzip_comp_level 6;
    gzip_types
        text/plain
        text/css
        text/xml
        text/javascript
        application/json
        application/javascript
        application/rss+xml
        application/atom+xml
        image/svg+xml;
    
    include /etc/nginx/conf.d/*.conf;
}

Security Management

Security Configuration Baseline

# limit_req_zone is only valid in the http context, so it is declared outside the server block
limit_req_zone $binary_remote_addr zone=one:10m rate=1r/s;

server {
    listen 80;
    server_name example.com;
    
    # Security headers
    add_header X-Frame-Options "SAMEORIGIN" always;
    add_header X-Content-Type-Options "nosniff" always;
    add_header X-XSS-Protection "1; mode=block" always;
    add_header Referrer-Policy "no-referrer-when-downgrade" always;
    add_header Content-Security-Policy "default-src 'self'; script-src 'self' 'unsafe-inline' 'unsafe-eval'; style-src 'self' 'unsafe-inline';" always;
    
    # Hide version information
    server_tokens off;
    
    # Restrict request methods
    if ($request_method !~ ^(GET|HEAD|POST)$ ) {
        return 405;
    }
    
    # Block access to hidden (dot) files
    location ~ /\. {
        deny all;
        access_log off;
        log_not_found off;
    }
    
    # Rate-limit login requests
    location /login {
        limit_req zone=one burst=5;
    }
    
    location / {
        root /var/www/html;
        index index.html;
    }
}
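
The effect of `rate=1r/s` with `burst=5` is easiest to see with a small calculation. The sketch below models a single client sending n requests at the same instant to a location limited without `nodelay`: one request is served at once, up to `burst` requests are queued and released at the configured rate, and the rest are rejected. The function name `limit_req_outcome` is hypothetical, and the model is a simplification of nginx's leaky-bucket accounting rather than a reproduction of it.

```shell
#!/bin/bash
# For n requests arriving at the same instant from one client, print how many are
# served immediately, queued, and rejected under limit_req with the given burst
# (no "nodelay"), plus roughly how long the last queued request waits.
limit_req_outcome() {
    local n=$1 rate_per_sec=$2 burst=$3
    local immediate=0 queued=0 rejected=0
    if [ "$n" -gt 0 ]; then immediate=1; fi
    queued=$(( n - immediate ))
    if [ "$queued" -gt "$burst" ]; then
        rejected=$(( queued - burst ))
        queued=$burst
    fi
    echo "immediate=$immediate queued=$queued rejected=$rejected max_wait=$(( queued / rate_per_sec ))s"
}

limit_req_outcome 10 1 5   # rate=1r/s, burst=5, as in the /login location
```

For ten simultaneous login attempts this prints `immediate=1 queued=5 rejected=4 max_wait=5s`; the rejected requests receive the status set by `limit_req_status`, which is 503 unless configured otherwise.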

Access Control Configuration

# IP-based access control
server {
    listen 80;
    server_name admin.example.com;
    
    # Allow specific IP ranges
    allow 192.168.1.0/24;
    allow 10.0.0.0/8;
    deny all;
    
    location / {
        auth_basic "Admin Area";
        auth_basic_user_file /etc/nginx/.htpasswd;
        
        root /var/www/admin;
        index index.html;
    }
}

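# Country-based access control
# The geo block matches client addresses and CIDR ranges, not country codes, so the
# country lookup uses ngx_http_geoip_module instead. The database path is an example
# and depends on the installation.
geoip_country /usr/share/GeoIP/GeoIP.dat;

map $geoip_country_code $allowed_country {
    default no;
    CN yes;
    US yes;
    JP yes;
}

map $allowed_country $blocked_country {
    yes 0;
    no 1;
}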

server {
    listen 80;
    server_name example.com;
    
    if ($blocked_country) {
        return 403;
    }
    
    location / {
        root /var/www/html;
        index index.html;
    }
}

Backup and Recovery Strategy

Configuration Backup Script

#!/bin/bash
# Nginx configuration backup script

BACKUP_DIR="/backup/nginx"
DATE=$(date +%Y%m%d_%H%M%S)
HOSTNAME=$(hostname)
STAGING_DIR="$BACKUP_DIR/$DATE"
ARCHIVE="$BACKUP_DIR/nginx_backup_${HOSTNAME}_${DATE}.tar.gz"

# Create the staging directory
mkdir -p "$STAGING_DIR"

# Back up configuration files (-a preserves ownership and permissions)
cp -a /etc/nginx "$STAGING_DIR/nginx_config"
cp -a /etc/ssl "$STAGING_DIR/ssl_certs" 2>/dev/null || echo "No SSL certificates to backup"

# Write the backup info file
cat > "$STAGING_DIR/backup_info.txt" << EOF
Backup created on: $(date)
Hostname: $HOSTNAME
Nginx version: $(nginx -v 2>&1)
Configuration files: /etc/nginx
SSL certificates: /etc/ssl
EOF

# Compress the backup, then remove only this run's staging directory
tar -czf "$ARCHIVE" -C "$BACKUP_DIR" "$DATE" && rm -rf "$STAGING_DIR"

# Remove archives older than 7 days
find "$BACKUP_DIR" -name "nginx_backup_*.tar.gz" -mtime +7 -delete

echo "Backup completed: $ARCHIVE"
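
A backup is only useful if it can be read back. As a minimal sketch, the function below checks that an archive produced by the script is a readable gzip tarball and contains `nginx_config/nginx.conf` under its timestamped top-level directory. The name `verify_backup` is an assumption for this example.

```shell
#!/bin/bash
# Return 0 if the archive is a readable gzip tarball that contains an nginx.conf
# under a nginx_config/ directory, as the backup script lays it out.
verify_backup() {
    local archive=$1 listing
    listing=$(tar -tzf "$archive" 2>/dev/null) || { echo "unreadable archive: $archive" >&2; return 1; }
    if printf '%s\n' "$listing" | grep -q '/nginx_config/nginx\.conf$'; then
        echo "ok: $archive"
    else
        echo "nginx.conf missing from $archive" >&2
        return 1
    fi
}
```

Running `verify_backup` on each new archive, for example as the last step of the backup job, catches truncated or empty archives before they are needed for a restore.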

Configuration Restore Script

#!/bin/bash
# Nginx configuration restore script

BACKUP_FILE=$1

if [ -z "$BACKUP_FILE" ]; then
    echo "Usage: $0 <backup_file.tar.gz>"
    exit 1
fi

if [ ! -f "$BACKUP_FILE" ]; then
    echo "Backup file not found: $BACKUP_FILE"
    exit 1
fi

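# Stop the Nginx service
systemctl stop nginx

# Back up the current configuration
CURRENT_BACKUP="/tmp/nginx_current_$(date +%Y%m%d_%H%M%S)"
cp -a /etc/nginx "$CURRENT_BACKUP"

# Extract the backup archive
TEMP_DIR="/tmp/nginx_restore_$(date +%Y%m%d_%H%M%S)"
mkdir -p "$TEMP_DIR"
tar -xzf "$BACKUP_FILE" -C "$TEMP_DIR"

# The archive holds a single timestamped top-level directory; resolve it once
# (a glob inside quotes, as in "$TEMP_DIR/*/ssl_certs", is never expanded)
RESTORE_DIR=$(find "$TEMP_DIR" -mindepth 1 -maxdepth 1 -type d | head -n 1)

# Restore configuration files
cp -a "$RESTORE_DIR/nginx_config/." /etc/nginx/

# Restore SSL certificates (if present)
if [ -d "$RESTORE_DIR/ssl_certs" ]; then
    cp -a "$RESTORE_DIR/ssl_certs/." /etc/ssl/
fi

# Test the configuration
if nginx -t; then
    # Start the Nginx service
    systemctl start nginx
    echo "Configuration restored successfully"
else
    echo "Configuration test failed, restoring previous configuration"
    # Replace the whole directory so no files from the failed restore are left behind
    rm -rf /etc/nginx
    cp -a "$CURRENT_BACKUP" /etc/nginx
    systemctl start nginx
    rm -rf "$TEMP_DIR"
    exit 1
fi

# Clean up temporary files
rm -rf "$TEMP_DIR" "$CURRENT_BACKUP"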

Monitoring and Alerting Configuration

Prometheus Integration

# Requires the nginx-module-vts module
http {
    vhost_traffic_status_zone;
    
    server {
        listen 8080;
        server_name localhost;
        
        location /metrics {
            vhost_traffic_status_display;
            vhost_traffic_status_display_format prometheus;
            allow 127.0.0.1;
            deny all;
        }
    }
}

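Alert Rule Configuration

# Prometheus alert rules
# Metric and label names depend on the exporter in use, so adjust the expressions to match it.
# The VTS Prometheus output, for example, exposes nginx_vts_server_requests_total with a "code" label.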
groups:
- name: nginx.rules
  rules:
  - alert: NginxDown
    expr: up{job="nginx"} == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Nginx instance {{ $labels.instance }} is down"
      description: "Nginx instance {{ $labels.instance }} has been down for more than 1 minute"

  - alert: HighRequestRate
    expr: rate(nginx_http_requests_total[5m]) > 1000
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "High request rate on {{ $labels.instance }}"
      description: "Nginx instance {{ $labels.instance }} is handling more than 1000 requests per second"

  - alert: HighErrorRate
    expr: sum by (instance) (rate(nginx_http_requests_total{status=~"5.."}[5m])) / sum by (instance) (rate(nginx_http_requests_total[5m])) > 0.05
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "High error rate on {{ $labels.instance }}"
      description: "Error rate on {{ $labels.instance }} is above 5%"
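
The HighErrorRate rule divides the rate of 5xx responses by the rate of all responses over the same window. Because both rates cover the same interval, the interval cancels out and the ratio reduces to a difference of counters, which the following sketch computes by hand. The function name `error_ratio` and the sample counter values are assumptions for illustration.

```shell
#!/bin/bash
# Given two samples of cumulative counters taken some time apart, print the share
# of requests in that window that ended in a 5xx status.
error_ratio() {
    local total_prev=$1 total_now=$2 err_prev=$3 err_now=$4
    awk -v t0="$total_prev" -v t1="$total_now" -v e0="$err_prev" -v e1="$err_now" 'BEGIN {
        requests = t1 - t0
        if (requests <= 0) { print "0.0000"; exit }
        printf "%.4f\n", (e1 - e0) / requests
    }'
}

# 6000 requests in the window, 420 of them 5xx: 0.07, above the 0.05 threshold.
error_ratio 100000 106000 1500 1920
```

This is also why the rule sums both sides per instance before dividing: the numerator and denominator must describe the same set of requests for the ratio to mean anything.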

Alert Notification Script

#!/bin/bash
# Nginx alert notification script

ALERT_NAME=$1
ALERT_STATUS=$2
ALERT_DESCRIPTION=$3
HOSTNAME=$(hostname)
TIMESTAMP=$(date)

# Send an email alert
send_email_alert() {
    local subject="Nginx Alert: $ALERT_NAME on $HOSTNAME"
    local body="Time: $TIMESTAMP
Host: $HOSTNAME
Alert: $ALERT_NAME
Status: $ALERT_STATUS
Description: $ALERT_DESCRIPTION"
    
    echo "$body" | mail -s "$subject" admin@example.com
}

# Send a Slack notification
send_slack_alert() {
    local payload="{
        \"text\": \"Nginx Alert\",
        \"attachments\": [
            {
                \"color\": \"danger\",
                \"fields\": [
                    {
                        \"title\": \"Alert\",
                        \"value\": \"$ALERT_NAME\",
                        \"short\": true
                    },
                    {
                        \"title\": \"Host\",
                        \"value\": \"$HOSTNAME\",
                        \"short\": true
                    },
                    {
                        \"title\": \"Status\",
                        \"value\": \"$ALERT_STATUS\",
                        \"short\": true
                    },
                    {
                        \"title\": \"Time\",
                        \"value\": \"$TIMESTAMP\",
                        \"short\": true
                    },
                    {
                        \"title\": \"Description\",
                        \"value\": \"$ALERT_DESCRIPTION\",
                        \"short\": false
                    }
                ]
            }
        ]
    }"
    
    curl -X POST -H 'Content-type: application/json' \
         --data "$payload" \
         https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK
}

# Choose the notification channels by alert name
case "$ALERT_NAME" in
    "NginxDown")
        send_email_alert
        send_slack_alert
        ;;
    "HighRequestRate"|"HighErrorRate")
        send_slack_alert
        ;;
    *)
        send_email_alert
        ;;
esac
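
The Slack payload is assembled by interpolating shell variables into a JSON string, so a description that contains a double quote or a backslash would produce invalid JSON. One way to guard against that, sketched here under the assumption that only backslashes, double quotes, and newlines need handling, is to escape each value before it is interpolated. The function name `json_escape` is hypothetical.

```shell
#!/bin/bash
# Escape a string for use inside a JSON string literal.
# Handles backslash, double quote and newline; other control characters are not covered.
json_escape() {
    printf '%s' "$1" |
        sed -e 's/\\/\\\\/g' -e 's/"/\\"/g' |
        awk 'NR > 1 { printf "\\n" } { printf "%s", $0 }'
}

json_escape 'upstream "api" timed out'
echo
```

In the notification script this could be applied as `ALERT_DESCRIPTION=$(json_escape "$3")` before the payload is built. Where `jq` is available, building the payload with it is the more robust choice.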

Performance Monitoring Scripts

Real-Time Performance Monitoring

#!/bin/bash
# Nginx real-time performance monitoring script

NGINX_STATUS_URL="http://localhost:8080/nginx_status"

while true; do
    # Fetch the Nginx status page
    STATUS=$(curl -s "$NGINX_STATUS_URL")
    
    # Parse the status output
    # Line 3 holds three bare counters (accepts, handled, requests), so they are fields 1 to 3
    ACTIVE=$(echo "$STATUS" | awk 'NR==1 {print $3}')
    ACCEPTS=$(echo "$STATUS" | awk 'NR==3 {print $1}')
    HANDLED=$(echo "$STATUS" | awk 'NR==3 {print $2}')
    REQUESTS=$(echo "$STATUS" | awk 'NR==3 {print $3}')
    READING=$(echo "$STATUS" | awk 'NR==4 {print $2}')
    WRITING=$(echo "$STATUS" | awk 'NR==4 {print $4}')
    WAITING=$(echo "$STATUS" | awk 'NR==4 {print $6}')
    
    # Display the current status
    clear
    echo "=== Nginx Real-time Status ==="
    echo "Active connections: $ACTIVE"
    echo "Server accepts: $ACCEPTS"
    echo "Server handled: $HANDLED"
    echo "Server requests: $REQUESTS"
    echo "Reading: $READING"
    echo "Writing: $WRITING"
    echo "Waiting: $WAITING"
    echo "=============================="
    
    # Check for abnormal conditions (defaults to 0 if the status page was unreachable)
    if [ "${ACTIVE:-0}" -gt 10000 ]; then
        echo "WARNING: High active connections ($ACTIVE)"
    fi
    
    sleep 5
done

System Resource Monitoring

#!/bin/bash
# System resource monitoring script

while true; do
    # Collect system resource usage
    # Note: this reads the first value on top's CPU line, which is user-space CPU time only
    CPU_USAGE=$(top -bn1 | grep "Cpu(s)" | awk '{print $2}' | awk -F'%' '{print $1}')
    MEMORY_USAGE=$(free | grep Mem | awk '{printf("%.2f%%"), $3/$2 * 100.0}')
    DISK_USAGE=$(df -h / | awk 'NR==2 {print $5}')
    LOAD_AVERAGE=$(uptime | awk -F'load average:' '{print $2}')
    
    # Collect Nginx process information
    NGINX_PROCESSES=$(pgrep -c nginx)
    NGINX_MASTER_PID=$(pgrep -f "nginx: master")
    
    # Display system status
    clear
    echo "=== System Resource Status ==="
    echo "CPU Usage (user): $CPU_USAGE%"
    echo "Memory Usage: $MEMORY_USAGE"
    echo "Disk Usage: $DISK_USAGE"
    echo "Load Average: $LOAD_AVERAGE"
    echo "Nginx Processes: $NGINX_PROCESSES"
    echo "Nginx Master PID: $NGINX_MASTER_PID"
    echo "=============================="
    
    # Check resource usage (strip the fractional part for the integer comparison)
    CPU_INT=${CPU_USAGE%%.*}
    if [ "${CPU_INT:-0}" -gt 80 ]; then
        echo "WARNING: High CPU usage ($CPU_USAGE%)"
    fi
    
    sleep 10
done
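
Parsing `top` output is fragile because its layout varies between versions. On Linux, a sturdier way to measure overall CPU usage is to compare two samples of the `cpu` summary line in /proc/stat. The sketch below assumes the usual field order (user, nice, system, idle, iowait, irq, softirq, steal); the function name `cpu_busy_pct` and the sample values are illustrative.

```shell
#!/bin/bash
# Compute the busy-CPU percentage between two "cpu ..." summary lines from /proc/stat.
# Fields after the label: user nice system idle iowait irq softirq steal ...
cpu_busy_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        { total = 0
          for (i = 2; i <= NF && i <= 9; i++) total += $i   # guest time is already counted in user
          idle = $5 + $6                                    # idle + iowait
          if (NR == 1) { t0 = total; i0 = idle } else { t1 = total; i1 = idle } }
        END { dt = t1 - t0
              if (dt <= 0) { print "0.0"; exit }
              printf "%.1f\n", 100 * (dt - (i1 - i0)) / dt }'
}

# Synthetic samples: 1000 ticks elapsed, 600 of them idle, so 40% busy.
cpu_busy_pct "cpu 1000 0 500 8000 500 0 0 0 0 0" "cpu 1300 0 600 8600 500 0 0 0 0 0"
```

On a live system the two arguments would come from `grep '^cpu ' /proc/stat`, taken a second or more apart.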

Failure Recovery and Self-Healing

Automatic Failure Detection Script

#!/bin/bash
# Nginx automatic failure detection script

NGINX_STATUS_URL="http://localhost:8080/nginx_status"
# The /health location is defined on the monitoring server that listens on port 8080
HEALTH_CHECK_URL="http://localhost:8080/health"
RESTART_THRESHOLD=3
RESTART_COUNT=0

check_nginx_status() {
    # Check that an Nginx process exists
    if ! pgrep nginx > /dev/null; then
        echo "ERROR: Nginx process not running"
        return 1
    fi
    
    # Check the status page (-f makes curl fail on HTTP error responses)
    if ! curl -sf --max-time 5 "$NGINX_STATUS_URL" > /dev/null; then
        echo "ERROR: Nginx status page not accessible"
        return 1
    fi
    
    # Check the health check endpoint
    if ! curl -sf --max-time 5 "$HEALTH_CHECK_URL" | grep -q "healthy"; then
        echo "ERROR: Health check failed"
        return 1
    fi
    
    return 0
}

restart_nginx() {
    echo "Restarting Nginx..."
    systemctl restart nginx
    
    # Wait for Nginx to start
    sleep 5
    
    # Check whether the restart succeeded
    if check_nginx_status; then
        echo "Nginx restarted successfully"
        RESTART_COUNT=0
        # Send a recovery notification
        echo "Nginx service recovered at $(date)" | mail -s "Nginx Recovery" admin@example.com
    else
        RESTART_COUNT=$((RESTART_COUNT + 1))
        echo "Nginx restart failed (attempt $RESTART_COUNT)"
        
        # Send a critical alert once the restart count reaches the threshold
        if [ "$RESTART_COUNT" -ge "$RESTART_THRESHOLD" ]; then
            echo "CRITICAL: Nginx failed to restart after $RESTART_THRESHOLD attempts" | mail -s "Nginx Critical Failure" admin@example.com
        fi
    fi
}

# Main monitoring loop
while true; do
    if ! check_nginx_status; then
        echo "Nginx health check failed at $(date)"
        restart_nginx
    fi
    
    sleep 30
done
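
The loop above retries a failed restart every 30 seconds indefinitely. One possible refinement, which is not part of the original script, is to wait longer after each consecutive failure so that a persistently broken service is not restarted in a tight cycle. The sketch below computes such a delay; the function name `backoff_delay` and the default base and cap values are assumptions.

```shell
#!/bin/bash
# Delay in seconds before restart attempt N (1-based): base * 2^(N-1), capped at max.
backoff_delay() {
    local attempt=$1 base=${2:-5} max=${3:-300}
    if [ "$attempt" -gt 20 ]; then attempt=20; fi   # keep the shift well inside integer range
    local delay=$(( base * (1 << (attempt - 1)) ))
    if [ "$delay" -gt "$max" ]; then delay=$max; fi
    echo "$delay"
}

for attempt in 1 2 3 4 5 6 7 8; do
    printf 'attempt %d: wait %ss\n' "$attempt" "$(backoff_delay "$attempt")"
done
```

In the detection script, `sleep "$(backoff_delay "$((RESTART_COUNT + 1))")"` could replace the fixed `sleep 30` while restarts keep failing, with the counter reset on recovery as it already is.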

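Best Practices Summary

Operations Management Checklist

Monitoring
- Deploy basic status monitoring
- Collect detailed performance metrics
- Set up a complete alerting mechanism
Log management
- Apply a log rotation policy
- Use a structured log format
- Establish log analysis and reporting
Configuration management
- Use a configuration management tool (Ansible, Puppet, etc.)
- Keep configuration under version control
- Automate the deployment process
Security management
- Run regular security audits
- Enforce access control policies
- Keep software versions up to date
Backup and recovery
- Schedule regular backups
- Test the restore procedure
- Keep backup data available

Handling Common Operations Problems

Configuration errors
- Run the configuration test command (nginx -t)
- Keep configuration files under version control
- Follow a change management process
Performance problems
- Monitor system resource usage
- Analyze access logs
- Tune configuration parameters
Security incidents
- Monitor in real time
- Establish an emergency response process
- Run regular security audits

A well-built operations management system keeps the Nginx service running reliably, allows fast response to failures and anomalies, and protects business continuity.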