Nginx Operations: Monitoring and Management
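Business Scenario
When operating a large internet platform, the operations team faces the following challenges:
- 24/7 availability: the platform has to run year-round without interruption, and any outage can cause serious business losses
- Real-time monitoring: the team needs a live view of system state so anomalies are detected and handled promptly
- Automated operations: with hundreds of servers, deployment, configuration management, and failure recovery have to be automated
- Security and compliance: the platform must meet security standards and compliance requirements and undergo regular security audits
- Performance tuning: system performance is monitored continuously and the configuration is adjusted as traffic grows
- Fast incident response: a solid incident response process shortens the time to recovery
These requirements are the core of Nginx operations management. A well-built monitoring system and management process keeps the Nginx service stable and reliable.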
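Building the Monitoring System
Basic Monitoring Configuration
# Enable Nginx status monitoring
server {
    listen 8080;
    server_name localhost;
    # Basic status page
    location /nginx_status {
        stub_status;    # versions before 1.7.5 require "stub_status on;"
        access_log off;
        allow 127.0.0.1;
        deny all;
    }
    # Health check endpoint
    location /health {
        access_log off;
        # Set the type with default_type; add_header would emit a second Content-Type header
        default_type text/plain;
        return 200 "healthy\n";
    }
    # Version information (disable this in production)
    location /nginx_info {
        access_log off;
        default_type text/plain;
        return 200 "Nginx Version: $nginx_version\n";
    }
}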
Detailed Status Monitoring (requires the nginx-module-vts module)
# Virtual host traffic status monitoring
http {
    vhost_traffic_status_zone;
    server {
        listen 8080;
        server_name localhost;
        # Detailed traffic status as an HTML page
        location /status {
            vhost_traffic_status_display;
            vhost_traffic_status_display_format html;
            allow 127.0.0.1;
            deny all;
        }
        # Status data in JSON format
        location /status.json {
            vhost_traffic_status_display;
            vhost_traffic_status_display_format json;
            allow 127.0.0.1;
            deny all;
        }
    }
}
Custom Monitoring Metrics
# Custom log format for monitoring
log_format monitor '$remote_addr - $remote_user [$time_local] "$request" '
                   '$status $body_bytes_sent "$http_referer" '
                   '"$http_user_agent" "$http_x_forwarded_for" '
                   'request_time=$request_time '
                   'upstream_response_time=$upstream_response_time '
                   'upstream_connect_time=$upstream_connect_time '
                   'upstream_header_time=$upstream_header_time '
                   'bytes_sent=$bytes_sent '
                   'gzip_ratio=$gzip_ratio';
server {
    access_log /var/log/nginx/monitor.log monitor;
    location / {
        proxy_pass http://backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
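Averages hide slow outliers, so a percentile of the `request_time=` field written by the `monitor` format is often more useful for alerting. The function below is a minimal sketch using the nearest-rank method; the name `request_time_percentile` is made up for this example, and it assumes the percentile is given as an integer between 1 and 100.

```shell
# Print the nearest-rank percentile of the request_time=... values in a log
# written with the "monitor" format.
# $1 = log file, $2 = percentile as an integer from 1 to 100 (hypothetical interface)
request_time_percentile() {
    local file="$1" pct="$2"
    grep -o 'request_time=[0-9.]*' "$file" \
        | cut -d= -f2 \
        | LC_ALL=C sort -n \
        | awk -v p="$pct" '
            { v[NR] = $1 }
            END {
                if (NR == 0) exit 1              # no samples found
                idx = int((NR * p + 99) / 100)   # ceil(NR * p / 100)
                if (idx < 1) idx = 1
                print v[idx]
            }'
}
```

For example, `request_time_percentile /var/log/nginx/monitor.log 95` prints the 95th-percentile request time in seconds.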
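Log Management and Analysis
Log Rotation Configuration
# /etc/logrotate.d/nginx
/var/log/nginx/*.log {
    daily
    missingok
    # Keep 52 rotated files (52 days with daily rotation)
    rotate 52
    compress
    delaycompress
    notifempty
    create 644 nginx nginx
    sharedscripts
    prerotate
        if [ -d /etc/logrotate.d/httpd-prerotate ]; then \
            run-parts /etc/logrotate.d/httpd-prerotate; \
        fi \
    endscript
    postrotate
        # Ask Nginx to reopen its log files; an if block avoids a non-zero
        # exit status (and a logrotate error) when the pid file is missing
        if [ -f /var/run/nginx.pid ]; then
            kill -USR1 `cat /var/run/nginx.pid`
        fi
    endscript
}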
Structured Log Configuration
# JSON log format
log_format json_combined escape=json '{'
    '"time_local":"$time_local",'
    '"remote_addr":"$remote_addr",'
    '"remote_user":"$remote_user",'
    '"request":"$request",'
    '"status":"$status",'
    '"body_bytes_sent":"$body_bytes_sent",'
    '"http_referer":"$http_referer",'
    '"http_user_agent":"$http_user_agent",'
    '"http_x_forwarded_for":"$http_x_forwarded_for",'
    '"request_time":"$request_time",'
    '"upstream_response_time":"$upstream_response_time",'
    '"gzip_ratio":"$gzip_ratio"'
'}';
server {
    # The file name ends in .log so the /var/log/nginx/*.log logrotate rule also covers it
    access_log /var/log/nginx/access_json.log json_combined;
    error_log /var/log/nginx/error.log warn;
}
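Because every JSON log line carries a `"status"` key, a status code distribution can be pulled out with standard text tools even when no JSON parser is installed. The sketch below assumes one JSON object per line, as produced by the `json_combined` format; the function name `status_distribution` is hypothetical.

```shell
# Print "<status> <count>" for each HTTP status code found in a JSON access log.
# $1 = log file with one JSON object per line (hypothetical interface)
status_distribution() {
    grep -oE '"status": ?"[0-9]{3}"' "$1" \
        | grep -oE '[0-9]{3}' \
        | sort \
        | uniq -c \
        | awk '{ print $2, $1 }'
}
```

A dedicated JSON tool is more robust when one is available; this version only relies on grep, sort, uniq, and awk.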
Log Analysis Script
#!/bin/bash
# Nginx log analysis script
# The log file can be passed as the first argument. The response-time section
# only produces output for logs that contain a request_time=... field, such as
# those written with the "monitor" format.
LOG_FILE="${1:-/var/log/nginx/access.log}"
DATE=$(date +%Y-%m-%d)
REPORT_DIR="/var/log/nginx/reports"
REPORT="$REPORT_DIR/daily_$DATE.txt"
# Create the report directory
mkdir -p "$REPORT_DIR"
{
    # Daily access summary
    echo "=== Daily Access Report - $DATE ==="
    # Total number of requests
    echo "Total Requests: $(wc -l < "$LOG_FILE")"
    # Number of unique client IPs
    echo "Unique IPs: $(awk '{print $1}' "$LOG_FILE" | sort -u | wc -l)"
    # Status code distribution
    echo "Status Code Distribution:"
    awk '{print $9}' "$LOG_FILE" | sort | uniq -c | sort -nr
    # Most requested URLs
    echo "Top 10 URLs:"
    awk '{print $7}' "$LOG_FILE" | sort | uniq -c | sort -nr | head -10
    # User agent statistics
    echo "Top 10 User Agents:"
    awk -F'"' '{print $6}' "$LOG_FILE" | sort | uniq -c | sort -nr | head -10
    # Response time analysis: look up the request_time= field by name,
    # because the last field of the "monitor" format is gzip_ratio
    echo "Response Time Analysis:"
    awk '{
        for (i = 1; i <= NF; i++)
            if ($i ~ /^request_time=/) { split($i, kv, "="); sum += kv[2]; count++ }
    } END { if (count > 0) printf "Average Response Time: %.3fs\n", sum / count }' "$LOG_FILE"
} > "$REPORT"
Automated Deployment and Configuration Management
Ansible Deployment Playbook
# nginx-deploy.yml
---
- name: Deploy Nginx Configuration
  hosts: webservers
  become: true
  vars:
    nginx_user: nginx
    nginx_worker_processes: auto
    nginx_worker_connections: 65535
  tasks:
    - name: Install Nginx (RedHat family)
      yum:
        name: nginx
        state: present
      when: ansible_os_family == "RedHat"
    - name: Install Nginx (Debian family)
      apt:
        name: nginx
        state: present
      when: ansible_os_family == "Debian"
    - name: Create SSL directory
      file:
        path: /etc/nginx/ssl
        state: directory
        mode: '0755'
    # The private key is restricted to its owner; the certificate and DH parameters may be world-readable
    - name: Copy SSL certificates
      copy:
        src: "files/{{ item.name }}"
        dest: "/etc/nginx/ssl/{{ item.name }}"
        mode: "{{ item.mode }}"
      loop:
        - { name: example.com.crt, mode: '0644' }
        - { name: example.com.key, mode: '0600' }
        - { name: dhparam.pem, mode: '0644' }
    - name: Deploy main nginx.conf
      template:
        src: templates/nginx.conf.j2
        dest: /etc/nginx/nginx.conf
        mode: '0644'
      notify: reload nginx
    - name: Deploy site configuration
      template:
        src: templates/site.conf.j2
        dest: "/etc/nginx/conf.d/{{ inventory_hostname }}.conf"
        mode: '0644'
      notify: reload nginx
    - name: Ensure Nginx is running
      systemd:
        name: nginx
        state: started
        enabled: true
  handlers:
    - name: reload nginx
      systemd:
        name: nginx
        state: reloaded
Configuration Template Example
# templates/nginx.conf.j2
user {{ nginx_user }};
worker_processes {{ nginx_worker_processes }};
worker_rlimit_nofile 65535;
error_log /var/log/nginx/error.log warn;
pid /var/run/nginx.pid;
events {
    worker_connections {{ nginx_worker_connections }};
    use epoll;
    multi_accept on;
}
http {
    include /etc/nginx/mime.types;
    default_type application/octet-stream;
    log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                    '$status $body_bytes_sent "$http_referer" '
                    '"$http_user_agent" "$http_x_forwarded_for"';
    access_log /var/log/nginx/access.log main;
    sendfile on;
    tcp_nopush on;
    tcp_nodelay on;
    keepalive_timeout 65;
    gzip on;
    gzip_vary on;
    gzip_min_length 1024;
    gzip_comp_level 6;
    gzip_types
        text/plain
        text/css
        text/xml
        text/javascript
        application/json
        application/javascript
        application/xml+rss
        application/atom+xml
        image/svg+xml;
    include /etc/nginx/conf.d/*.conf;
}
Security Management
Security Configuration Baseline
# Rate limiting zone; limit_req_zone is only valid in the http context,
# so it is declared outside the server block
limit_req_zone $binary_remote_addr zone=one:10m rate=1r/s;
server {
    listen 80;
    server_name example.com;
    # Security headers
    add_header X-Frame-Options "SAMEORIGIN" always;
    add_header X-Content-Type-Options "nosniff" always;
    add_header X-XSS-Protection "1; mode=block" always;
    add_header Referrer-Policy "no-referrer-when-downgrade" always;
    add_header Content-Security-Policy "default-src 'self'; script-src 'self' 'unsafe-inline' 'unsafe-eval'; style-src 'self' 'unsafe-inline';" always;
    # Hide version information
    server_tokens off;
    # Restrict request methods
    if ($request_method !~ ^(GET|HEAD|POST)$ ) {
        return 405;
    }
    # Block access to hidden (dot) files
    location ~ /\. {
        deny all;
        access_log off;
        log_not_found off;
    }
    # Limit the request rate on the login endpoint
    location /login {
        limit_req zone=one burst=5;
    }
    location / {
        root /var/www/html;
        index index.html;
    }
}
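After deploying a baseline like this, it helps to confirm that the headers actually reach clients. The sketch below reads raw response headers on standard input and lists any of four expected security headers that are absent. The function name `missing_security_headers` and the choice of which headers count as required are assumptions for illustration.

```shell
# Read HTTP response headers on stdin and print the name of each expected
# security header that is missing. Returns non-zero if any is missing.
missing_security_headers() {
    local headers name missing=0
    # Strip carriage returns and lower-case everything for a case-insensitive match
    headers=$(tr -d '\r' | tr '[:upper:]' '[:lower:]')
    for name in x-frame-options x-content-type-options referrer-policy content-security-policy; do
        if ! printf '%s\n' "$headers" | grep -q "^${name}:"; then
            echo "$name"
            missing=1
        fi
    done
    [ "$missing" -eq 0 ]
}
```

A typical invocation would be `curl -sI http://example.com/ | missing_security_headers`, which prints nothing when all four headers are present.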
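Access Control Configuration
# IP-based access control
server {
    listen 80;
    server_name admin.example.com;
    # Allow only specific networks
    allow 192.168.1.0/24;
    allow 10.0.0.0/8;
    deny all;
    location / {
        auth_basic "Admin Area";
        auth_basic_user_file /etc/nginx/.htpasswd;
        root /var/www/admin;
        index index.html;
    }
}
# Country-based access control
# The geo directive only maps IP addresses and CIDR ranges, so matching on
# country codes needs ngx_http_geoip_module and a country database
# (the database path below is an example and depends on the installation)
geoip_country /usr/share/GeoIP/GeoIP.dat;
map $geoip_country_code $blocked_country {
    default 1;
    CN 0;
    US 0;
    JP 0;
}
server {
    listen 80;
    server_name example.com;
    if ($blocked_country) {
        return 403;
    }
    location / {
        root /var/www/html;
        index index.html;
    }
}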
Backup and Recovery Strategy
Configuration Backup Script
#!/bin/bash
# Nginx configuration backup script
BACKUP_DIR="/backup/nginx"
DATE=$(date +%Y%m%d_%H%M%S)
HOSTNAME=$(hostname)
# Create the staging directory for this backup
mkdir -p "$BACKUP_DIR/$DATE"
# Back up configuration files (-a preserves permissions and ownership)
cp -a /etc/nginx "$BACKUP_DIR/$DATE/nginx_config"
cp -a /etc/ssl "$BACKUP_DIR/$DATE/ssl_certs" 2>/dev/null || echo "No SSL certificates to backup"
# Write a backup information file
cat > "$BACKUP_DIR/$DATE/backup_info.txt" << EOF
Backup created on: $(date)
Hostname: $HOSTNAME
Nginx version: $(nginx -v 2>&1)
Configuration files: /etc/nginx
SSL certificates: /etc/ssl
EOF
# Compress the backup
tar -czf "$BACKUP_DIR/nginx_backup_${HOSTNAME}_${DATE}.tar.gz" -C "$BACKUP_DIR" "$DATE"
# Remove archives older than 7 days
find "$BACKUP_DIR" -name "nginx_backup_*.tar.gz" -mtime +7 -delete
# Remove only this run's uncompressed staging directory
rm -rf "${BACKUP_DIR:?}/$DATE"
echo "Backup completed: $BACKUP_DIR/nginx_backup_${HOSTNAME}_${DATE}.tar.gz"
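A backup is only useful if the archive can be read back. The sketch below checks that an archive produced by the script above is a readable gzip tarball and contains `nginx_config/nginx.conf` under its timestamped top-level directory. The function name `verify_backup` is hypothetical.

```shell
# Return 0 if the archive can be listed and contains nginx_config/nginx.conf.
# $1 = path to a nginx_backup_*.tar.gz archive
verify_backup() {
    local archive="$1"
    # A corrupt or non-tar file yields no listing, so grep finds nothing and fails
    tar -tzf "$archive" 2>/dev/null | grep -q '/nginx_config/nginx\.conf$'
}
```

It could be called right after the `tar -czf` step, for example `verify_backup "$archive" || echo "Backup verification failed"`, where `$archive` holds the path of the file just written.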
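Configuration Restore Script
#!/bin/bash
# Nginx configuration restore script
BACKUP_FILE=$1
if [ -z "$BACKUP_FILE" ]; then
    echo "Usage: $0 <backup_file.tar.gz>"
    exit 1
fi
if [ ! -f "$BACKUP_FILE" ]; then
    echo "Backup file not found: $BACKUP_FILE"
    exit 1
fi
# Stop the Nginx service
systemctl stop nginx
# Save the current configuration so it can be rolled back
CURRENT_BACKUP="/tmp/nginx_current_$(date +%Y%m%d_%H%M%S)"
cp -a /etc/nginx "$CURRENT_BACKUP"
# Extract the backup archive
TEMP_DIR="/tmp/nginx_restore_$(date +%Y%m%d_%H%M%S)"
mkdir -p "$TEMP_DIR"
tar -xzf "$BACKUP_FILE" -C "$TEMP_DIR"
# The archive contains a single timestamped top-level directory
RESTORE_ROOT=$(find "$TEMP_DIR" -mindepth 1 -maxdepth 1 -type d | head -n 1)
# Restore configuration files
cp -a "$RESTORE_ROOT/nginx_config/." /etc/nginx/
# Restore SSL certificates if the backup contains them
# (a glob inside a quoted [ -d "..." ] test is never expanded, so use the resolved path)
if [ -d "$RESTORE_ROOT/ssl_certs" ]; then
    cp -a "$RESTORE_ROOT/ssl_certs/." /etc/ssl/
fi
# Test the configuration
if nginx -t; then
    # Start the Nginx service
    systemctl start nginx
    echo "Configuration restored successfully"
else
    echo "Configuration test failed, restoring previous configuration"
    cp -a "$CURRENT_BACKUP/." /etc/nginx/
    systemctl start nginx
    exit 1
fi
# Clean up temporary files
rm -rf "$TEMP_DIR" "$CURRENT_BACKUP"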
Monitoring and Alerting Configuration
Prometheus Integration
# Requires the nginx-module-vts module
http {
    vhost_traffic_status_zone;
    server {
        listen 8080;
        server_name localhost;
        location /metrics {
            vhost_traffic_status_display;
            vhost_traffic_status_display_format prometheus;
            allow 127.0.0.1;
            deny all;
        }
    }
}
Alert Rule Configuration
# Prometheus alert rules
# Metric and label names depend on the exporter in use (the VTS module, for
# example, prefixes its metrics with nginx_vts_); check the /metrics output
# and adjust the expressions to match.
groups:
  - name: nginx.rules
    rules:
      - alert: NginxDown
        expr: up{job="nginx"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Nginx instance {{ $labels.instance }} is down"
          description: "Nginx instance {{ $labels.instance }} has been down for more than 1 minute"
      - alert: HighRequestRate
        expr: sum by (instance) (rate(nginx_http_requests_total[5m])) > 1000
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High request rate on {{ $labels.instance }}"
          description: "Nginx instance {{ $labels.instance }} is handling more than 1000 requests per second"
      # Aggregate both sides per instance; dividing the raw series would match
      # each 5xx series against itself and always yield 1
      - alert: HighErrorRate
        expr: sum by (instance) (rate(nginx_http_requests_total{status=~"5.."}[5m])) / sum by (instance) (rate(nginx_http_requests_total[5m])) > 0.05
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "High error rate on {{ $labels.instance }}"
          description: "Error rate on {{ $labels.instance }} is above 5%"
Alert Notification Script
#!/bin/bash
# Nginx alert notification script
ALERT_NAME=$1
ALERT_STATUS=$2
ALERT_DESCRIPTION=$3
HOSTNAME=$(hostname)
TIMESTAMP=$(date)
# Send an email alert
send_email_alert() {
    local subject="Nginx Alert: $ALERT_NAME on $HOSTNAME"
    local body="Time: $TIMESTAMP
Host: $HOSTNAME
Alert: $ALERT_NAME
Status: $ALERT_STATUS
Description: $ALERT_DESCRIPTION"
    echo "$body" | mail -s "$subject" admin@example.com
}
# Send a Slack notification
send_slack_alert() {
    local payload="{
        \"text\": \"Nginx Alert\",
        \"attachments\": [
            {
                \"color\": \"danger\",
                \"fields\": [
                    {
                        \"title\": \"Alert\",
                        \"value\": \"$ALERT_NAME\",
                        \"short\": true
                    },
                    {
                        \"title\": \"Host\",
                        \"value\": \"$HOSTNAME\",
                        \"short\": true
                    },
                    {
                        \"title\": \"Status\",
                        \"value\": \"$ALERT_STATUS\",
                        \"short\": true
                    },
                    {
                        \"title\": \"Time\",
                        \"value\": \"$TIMESTAMP\",
                        \"short\": true
                    },
                    {
                        \"title\": \"Description\",
                        \"value\": \"$ALERT_DESCRIPTION\",
                        \"short\": false
                    }
                ]
            }
        ]
    }"
    curl -X POST -H 'Content-type: application/json' \
        --data "$payload" \
        https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK
}
# Choose notification channels by alert name
case "$ALERT_NAME" in
    "NginxDown")
        send_email_alert
        send_slack_alert
        ;;
    "HighRequestRate"|"HighErrorRate")
        send_slack_alert
        ;;
    *)
        send_email_alert
        ;;
esac
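The Slack payload above is assembled by interpolating shell variables straight into a JSON string, so a description that contains a double quote, a backslash, or a line break would produce invalid JSON. A minimal sketch of an escaping helper follows; the name `json_escape` is made up for this example, and it only handles those three cases, not tabs or other control characters.

```shell
# Escape a string for use inside a JSON string literal.
# Handles backslashes, double quotes, and newlines only.
json_escape() {
    printf '%s' "$1" \
        | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g' \
        | awk 'NR > 1 { printf "\\n" } { printf "%s", $0 }'
}
```

Each value would be passed through it before the payload is built, for example `ALERT_DESCRIPTION=$(json_escape "$3")`.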
Performance Monitoring Scripts
Real-time Performance Monitoring
#!/bin/bash
# Real-time Nginx performance monitoring script
NGINX_STATUS_URL="http://localhost:8080/nginx_status"
while true; do
    # Fetch the Nginx status page
    STATUS=$(curl -s --max-time 5 "$NGINX_STATUS_URL")
    if [ -z "$STATUS" ]; then
        echo "ERROR: cannot fetch $NGINX_STATUS_URL"
        sleep 5
        continue
    fi
    # Parse the status output; line 3 holds the accepts, handled, and
    # requests counters as its first three fields
    ACTIVE=$(echo "$STATUS" | awk 'NR==1 {print $3}')
    ACCEPTS=$(echo "$STATUS" | awk 'NR==3 {print $1}')
    HANDLED=$(echo "$STATUS" | awk 'NR==3 {print $2}')
    REQUESTS=$(echo "$STATUS" | awk 'NR==3 {print $3}')
    READING=$(echo "$STATUS" | awk 'NR==4 {print $2}')
    WRITING=$(echo "$STATUS" | awk 'NR==4 {print $4}')
    WAITING=$(echo "$STATUS" | awk 'NR==4 {print $6}')
    # Display the live status
    clear
    echo "=== Nginx Real-time Status ==="
    echo "Active connections: $ACTIVE"
    echo "Server accepts: $ACCEPTS"
    echo "Server handled: $HANDLED"
    echo "Server requests: $REQUESTS"
    echo "Reading: $READING"
    echo "Writing: $WRITING"
    echo "Waiting: $WAITING"
    echo "=============================="
    # Check for abnormal conditions
    if [ "${ACTIVE:-0}" -gt 10000 ]; then
        echo "WARNING: High active connections ($ACTIVE)"
    fi
    sleep 5
done
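The accepts, handled, and requests values reported by stub_status are cumulative counters, so a request rate has to be derived from the difference between two samples. The sketch below shows that calculation on its own; the function names `parse_counters` and `request_rate` are hypothetical, and the rate uses integer division.

```shell
# Read stub_status text on stdin and print "accepts handled requests".
# The three counters are the first three fields of line 3.
parse_counters() {
    awk 'NR == 3 { print $1, $2, $3 }'
}

# Average requests per second between two readings of the requests counter.
# $1 = previous reading, $2 = current reading, $3 = seconds between them
request_rate() {
    local prev="$1" cur="$2" interval="$3"
    echo $(( ($cur - $prev) / $interval ))
}
```

In a monitoring loop, the previous requests value would be kept in a variable between iterations and passed to `request_rate` together with the sleep interval. A difference between accepts and handled indicates dropped connections, usually because a resource limit such as `worker_connections` was reached.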
System Resource Monitoring
#!/bin/bash
# System resource monitoring script
while true; do
    # Collect system resource usage
    # CPU usage is computed as 100 minus the idle percentage reported by top
    CPU_USAGE=$(LC_ALL=C top -bn1 | awk -F',' '/Cpu\(s\)/ { for (i = 1; i <= NF; i++) if ($i ~ /id/) { sub(/[^0-9.]*/, "", $i); printf "%.1f", 100 - $i } }')
    MEMORY_USAGE=$(free | awk '/^Mem:/ {printf "%.2f%%", $3/$2 * 100.0}')
    DISK_USAGE=$(df -h / | awk 'NR==2 {print $5}')
    LOAD_AVERAGE=$(uptime | awk -F'load average:' '{print $2}')
    # Collect Nginx process information
    NGINX_PROCESSES=$(pgrep -c nginx)
    NGINX_MASTER_PID=$(pgrep -f "nginx: master")
    # Display system status
    clear
    echo "=== System Resource Status ==="
    echo "CPU Usage: $CPU_USAGE%"
    echo "Memory Usage: $MEMORY_USAGE"
    echo "Disk Usage: $DISK_USAGE"
    echo "Load Average: $LOAD_AVERAGE"
    echo "Nginx Processes: $NGINX_PROCESSES"
    echo "Nginx Master PID: $NGINX_MASTER_PID"
    echo "=============================="
    # Check resource usage (integer part only; defaults to 0 if parsing failed)
    CPU_INT=${CPU_USAGE%.*}
    if [ "${CPU_INT:-0}" -gt 80 ]; then
        echo "WARNING: High CPU usage ($CPU_USAGE%)"
    fi
    sleep 10
done
Failure Recovery and Self-healing
Automatic Failure Detection Script
#!/bin/bash
# Nginx automatic failure detection script
NGINX_STATUS_URL="http://localhost:8080/nginx_status"
# The /health endpoint is defined on the same port 8080 server as the status page
HEALTH_CHECK_URL="http://localhost:8080/health"
RESTART_THRESHOLD=3
RESTART_COUNT=0
check_nginx_status() {
    # Check that the Nginx process exists
    if ! pgrep -x nginx > /dev/null; then
        echo "ERROR: Nginx process not running"
        return 1
    fi
    # Check the status page (-f makes curl fail on HTTP error responses)
    if ! curl -sf --max-time 5 "$NGINX_STATUS_URL" > /dev/null; then
        echo "ERROR: Nginx status page not accessible"
        return 1
    fi
    # Check the health endpoint
    if ! curl -sf --max-time 5 "$HEALTH_CHECK_URL" | grep -q "healthy"; then
        echo "ERROR: Health check failed"
        return 1
    fi
    return 0
}
restart_nginx() {
    echo "Restarting Nginx..."
    systemctl restart nginx
    # Wait for Nginx to start
    sleep 5
    # Check whether the restart succeeded
    if check_nginx_status; then
        echo "Nginx restarted successfully"
        RESTART_COUNT=0
        # Send a recovery notification
        echo "Nginx service recovered at $(date)" | mail -s "Nginx Recovery" admin@example.com
    else
        RESTART_COUNT=$((RESTART_COUNT + 1))
        echo "Nginx restart failed (attempt $RESTART_COUNT)"
        # Send a critical alert once the number of failed restarts reaches the threshold
        if [ "$RESTART_COUNT" -ge "$RESTART_THRESHOLD" ]; then
            echo "CRITICAL: Nginx failed to restart after $RESTART_THRESHOLD attempts" | mail -s "Nginx Critical Failure" admin@example.com
        fi
    fi
}
# Main monitoring loop
while true; do
    if ! check_nginx_status; then
        echo "Nginx health check failed at $(date)"
        restart_nginx
    fi
    sleep 30
done
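The loop above retries a restart every 30 seconds for as long as the check keeps failing, which can turn a persistent fault into a restart storm. One possible refinement, which is not part of the original script, is to lengthen the wait after each consecutive failure. The sketch below computes a capped exponential delay; the function name `backoff_delay` and its parameters are assumptions for illustration.

```shell
# Print the wait time in seconds before the next restart attempt.
# $1 = number of consecutive failed restarts so far
# $2 = base delay in seconds, $3 = maximum delay in seconds
backoff_delay() {
    local attempt="$1" base="$2" max="$3"
    local delay="$base" i=0
    # Double the delay once per failed attempt, stopping early at the cap
    while [ "$i" -lt "$attempt" ] && [ "$delay" -lt "$max" ]; do
        delay=$((delay * 2))
        i=$((i + 1))
    done
    [ "$delay" -gt "$max" ] && delay="$max"
    echo "$delay"
}
```

In the main loop, `sleep 30` could then become `sleep "$(backoff_delay "$RESTART_COUNT" 30 300)"`, so the wait grows from 30 seconds toward a 5-minute ceiling while restarts keep failing and drops back to 30 seconds once `RESTART_COUNT` is reset.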
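Best Practices Summary
Operations Checklist
Monitoring
- Deploy basic status monitoring
- Collect detailed performance metrics
- Set up a reliable alerting mechanism
Log management
- Apply a log rotation policy
- Use a structured log format
- Set up log analysis and reporting
Configuration management
- Use a configuration management tool (Ansible, Puppet, and so on)
- Keep configuration under version control
- Automate the deployment process
Security management
- Run security audits regularly
- Enforce access control policies
- Keep software versions up to date
Backup and recovery
- Schedule regular backups
- Test the restore procedure
- Make sure backup data stays usable
Handling Common Operations Problems
Configuration errors
- Run the configuration test command (nginx -t)
- Keep configuration files under version control
- Follow a change management process
Performance problems
- Monitor system resource usage
- Analyze access logs
- Tune configuration parameters
Security incidents
- Monitor in real time
- Establish an incident response process
- Run security audits regularly
With a well-built operations management system in place, Nginx can run stably, faults and anomalies can be handled quickly, and business continuity and reliability are protected.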