Part 10: Monitoring, Logging, and Best Practices
Introduction
This final part covers essential production concerns: monitoring with Prometheus and Grafana, centralized logging with the EFK stack, health checks, security hardening, resource management and autoscaling, backup and disaster recovery, troubleshooting, and a production readiness checklist.
Monitoring with Prometheus and Grafana
Installing Prometheus Stack
# Add Helm repo
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# Install kube-prometheus-stack (includes Prometheus, Grafana, AlertManager)
helm install prometheus prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--create-namespace \
--set prometheus.prometheusSpec.retention=30d \
--set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi \
--set grafana.adminPassword=admin123
# Wait for pods to be ready
kubectl wait --for=condition=Ready pods --all -n monitoring --timeout=300s
# View all monitoring components
kubectl get all -n monitoring
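For repeatable installs, the same settings can live in a values file instead of --set flags. A minimal sketch; the key paths mirror the flags above and the file name is arbitrary:
# prometheus-values.yaml
prometheus:
  prometheusSpec:
    retention: 30d
    storageSpec:
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 50Gi
grafana:
  adminPassword: admin123
# Install (or upgrade) using the values file
helm upgrade --install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace -f prometheus-values.yaml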
Accessing Prometheus and Grafana
# Port-forward Prometheus
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090 &
# Port-forward Grafana
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80 &
# Access Prometheus: http://localhost:9090
# Access Grafana: http://localhost:3000 (admin/admin123)
ServiceMonitor for Custom Applications
# app-with-metrics.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: sample-app
namespace: default
spec:
replicas: 3
selector:
matchLabels:
app: sample-app
template:
metadata:
labels:
app: sample-app
spec:
containers:
- name: app
image: nginx:alpine
ports:
- containerPort: 80
name: http
- containerPort: 9113
name: metrics
---
apiVersion: v1
kind: Service
metadata:
name: sample-app
namespace: default
labels:
app: sample-app
spec:
selector:
app: sample-app
ports:
- port: 80
targetPort: 80
name: http
- port: 9113
targetPort: 9113
name: metrics
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: sample-app-metrics
namespace: default
labels:
release: prometheus # Must match Prometheus selector
spec:
selector:
matchLabels:
app: sample-app
endpoints:
- port: metrics
interval: 30s
path: /metrics
# Apply resources
kubectl apply -f app-with-metrics.yaml
# Verify ServiceMonitor
kubectl get servicemonitor -n default
# Check if Prometheus discovered the targets
# Go to Prometheus UI -> Status -> Targets
# Should see sample-app endpoints
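Note that stock nginx:alpine does not expose Prometheus metrics itself; port 9113 in the manifest assumes a metrics exporter sidecar (such as nginx-prometheus-exporter) is added to the pod. Once a /metrics endpoint exists, target discovery can also be checked from the command line; this sketch assumes the Prometheus port-forward from earlier is still running and jq is installed:
# List active scrape targets via the Prometheus HTTP API
curl -s http://localhost:9090/api/v1/targets | \
  jq -r '.data.activeTargets[] | "\(.labels.job) \(.scrapeUrl) \(.health)"' | grep sample-app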
Custom PrometheusRule
# prometheusrule.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: app-alerts
namespace: monitoring
labels:
release: prometheus
spec:
groups:
- name: app.rules
interval: 30s
rules:
# Alert on high pod restarts
- alert: HighPodRestartRate
expr: rate(kube_pod_container_status_restarts_total[15m]) > 0.1
for: 5m
labels:
severity: warning
annotations:
summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} restarting frequently"
description: "Pod has restarted {{ $value }} times in the last 15 minutes"
# Alert on pod crash loops
- alert: PodCrashLooping
expr: rate(kube_pod_container_status_restarts_total[5m]) > 0
for: 10m
labels:
severity: critical
annotations:
summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} crash looping"
# Alert on high CPU usage
- alert: HighCPUUsage
expr: sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace, pod) > 0.8
for: 5m
labels:
severity: warning
annotations:
summary: "High CPU usage in {{ $labels.namespace }}/{{ $labels.pod }}"
description: "CPU usage is {{ $value | humanizePercentage }}"
# Alert on high memory usage
- alert: HighMemoryUsage
expr: sum(container_memory_working_set_bytes) by (namespace, pod) / sum(container_spec_memory_limit_bytes) by (namespace, pod) > 0.9
for: 5m
labels:
severity: warning
annotations:
summary: "High memory usage in {{ $labels.namespace }}/{{ $labels.pod }}"
description: "Memory usage is {{ $value | humanizePercentage }}"
# Alert on persistent volume space
- alert: PersistentVolumeSpaceLow
expr: (kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes) > 0.85
for: 10m
labels:
severity: warning
annotations:
summary: "PV {{ $labels.persistentvolumeclaim }} running out of space"
description: "Volume is {{ $value | humanizePercentage }} full"
# Apply PrometheusRule
kubectl apply -f prometheusrule.yaml
# Verify rules loaded
kubectl get prometheusrules -n monitoring
# Check in Prometheus UI -> Alerts
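Two caveats about the rules above: $value in the restart alerts is a per-second rate rather than a raw restart count, and the CPU rule compares absolute core usage, so 0.8 means 0.8 cores rather than 80% of a limit. Loaded rule groups can also be confirmed through the Prometheus HTTP API (assumes the port-forward from earlier and jq):
# Confirm the app.rules group was loaded
curl -s http://localhost:9090/api/v1/rules | jq -r '.data.groups[].name' | grep app.rules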
Grafana Dashboards
# List the dashboards provisioned by the chart (they carry the grafana_dashboard label the sidecar watches for)
kubectl get configmaps -n monitoring -l grafana_dashboard=1
# Access Grafana and import dashboards:
# 1. Kubernetes Cluster Monitoring (ID: 7249)
# 2. Node Exporter Full (ID: 1860)
# 3. Kubernetes Pod Metrics (ID: 6417)
# 4. Kubernetes Deployment Statefulset Daemonset metrics (ID: 8588)
# Or use Grafana UI: + -> Import -> Enter Dashboard ID
Custom Grafana Dashboard
# custom-dashboard-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: custom-dashboard
namespace: monitoring
labels:
grafana_dashboard: "1"
data:
custom-dashboard.json: |
{
"dashboard": {
"title": "Custom Application Dashboard",
"panels": [
{
"title": "Pod CPU Usage",
"targets": [
{
"expr": "sum(rate(container_cpu_usage_seconds_total{namespace=\"default\"}[5m])) by (pod)"
}
],
"type": "graph"
}
]
}
}
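The steps above never apply this ConfigMap, so apply it like the other manifests. Be aware that the Grafana sidecar shipped with kube-prometheus-stack typically expects the raw dashboard model (the object containing title and panels) rather than the {"dashboard": ...} wrapper used by Grafana's HTTP import API, so the wrapper may need to be removed before the dashboard appears:
# Apply the dashboard ConfigMap; the sidecar watches ConfigMaps carrying the grafana_dashboard label
kubectl apply -f custom-dashboard-configmap.yaml
# Confirm the label is present so the sidecar will pick it up
kubectl get configmap custom-dashboard -n monitoring --show-labels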
Centralized Logging with EFK Stack
Installing Elasticsearch
# Add Elastic Helm repo
helm repo add elastic https://helm.elastic.co
helm repo update
# Install Elasticsearch
helm install elasticsearch elastic/elasticsearch \
--namespace logging \
--create-namespace \
--set replicas=1 \
--set minimumMasterNodes=1 \
--set resources.requests.memory=2Gi \
--set resources.limits.memory=2Gi \
--set volumeClaimTemplate.resources.requests.storage=30Gi
# Wait for Elasticsearch to be ready
kubectl wait --for=condition=Ready pod -l app=elasticsearch-master -n logging --timeout=600s
# Verify Elasticsearch
kubectl port-forward -n logging svc/elasticsearch-master 9200:9200 &
curl http://localhost:9200/_cluster/health?pretty
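If the chart version you installed enables security by default (Elasticsearch 8.x), the plain-HTTP check above is rejected and credentials are required. A sketch; the secret name and elastic user are the chart's usual defaults and may differ in your deployment:
# Retrieve the generated password and query the cluster over HTTPS
ES_PASSWORD=$(kubectl get secret elasticsearch-master-credentials -n logging -o jsonpath='{.data.password}' | base64 -d)
curl -sk -u "elastic:${ES_PASSWORD}" https://localhost:9200/_cluster/health?pretty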
Installing Fluentd
# fluentd-daemonset.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
name: fluentd
namespace: logging
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: fluentd
rules:
- apiGroups:
- ""
resources:
- pods
- namespaces
verbs:
- get
- list
- watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: fluentd
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: fluentd
subjects:
- kind: ServiceAccount
name: fluentd
namespace: logging
---
apiVersion: v1
kind: ConfigMap
metadata:
name: fluentd-config
namespace: logging
data:
fluent.conf: |
<source>
@type tail
path /var/log/containers/*.log
pos_file /var/log/fluentd-containers.log.pos
tag kubernetes.*
read_from_head true
<parse>
@type json
time_format %Y-%m-%dT%H:%M:%S.%NZ
</parse>
</source>
<filter kubernetes.**>
@type kubernetes_metadata
@id filter_kube_metadata
</filter>
<match **>
@type elasticsearch
host elasticsearch-master.logging.svc.cluster.local
port 9200
logstash_format true
logstash_prefix kubernetes
<buffer>
@type file
path /var/log/fluentd-buffers/kubernetes.system.buffer
flush_mode interval
retry_type exponential_backoff
flush_interval 5s
retry_forever false
retry_max_interval 30
chunk_limit_size 2M
queue_limit_length 8
overflow_action block
</buffer>
</match>
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: fluentd
namespace: logging
spec:
selector:
matchLabels:
app: fluentd
template:
metadata:
labels:
app: fluentd
spec:
serviceAccountName: fluentd
tolerations:
- key: node-role.kubernetes.io/control-plane
operator: Exists
effect: NoSchedule
containers:
- name: fluentd
image: fluent/fluentd-kubernetes-daemonset:v1.16-debian-elasticsearch8-1
env:
- name: FLUENT_ELASTICSEARCH_HOST
value: "elasticsearch-master.logging.svc.cluster.local"
- name: FLUENT_ELASTICSEARCH_PORT
value: "9200"
- name: FLUENT_ELASTICSEARCH_SCHEME
value: "http"
resources:
limits:
memory: 512Mi
cpu: 200m
requests:
memory: 256Mi
cpu: 100m
volumeMounts:
- name: varlog
mountPath: /var/log
readOnly: true
- name: varlibdockercontainers
mountPath: /var/lib/docker/containers
readOnly: true
- name: fluentd-config
mountPath: /fluentd/etc/fluent.conf
subPath: fluent.conf
volumes:
- name: varlog
hostPath:
path: /var/log
- name: varlibdockercontainers
hostPath:
path: /var/lib/docker/containers
- name: fluentd-config
configMap:
name: fluentd-config
# Apply Fluentd DaemonSet
kubectl apply -f fluentd-daemonset.yaml
# Verify Fluentd pods
kubectl get pods -n logging -l app=fluentd
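Once the DaemonSet is running on every node, logs should start landing in daily kubernetes-* indices (the prefix set by logstash_prefix in the ConfigMap). A quick check, assuming the Elasticsearch port-forward from earlier is still active:
# Tail Fluentd itself for connection or parsing errors
kubectl logs -n logging daemonset/fluentd --tail=20
# Confirm indices are being created in Elasticsearch
curl -s "http://localhost:9200/_cat/indices/kubernetes-*?v"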
Installing Kibana
# Install Kibana
helm install kibana elastic/kibana \
--namespace logging \
--set service.type=LoadBalancer \
--set resources.requests.memory=1Gi \
--set resources.limits.memory=2Gi
# Wait for Kibana to be ready
kubectl wait --for=condition=Ready pod -l app=kibana -n logging --timeout=300s
# Get Kibana URL
kubectl get svc -n logging kibana-kibana
# Port-forward Kibana
kubectl port-forward -n logging svc/kibana-kibana 5601:5601 &
# Access Kibana: http://localhost:5601
Kibana Configuration
# In Kibana UI:
# 1. Go to Management -> Stack Management -> Data Views (called Index Patterns in older Kibana versions)
# 2. Create a data view with the index pattern: kubernetes-*
# 3. Select @timestamp as the time field
# 4. Go to Discover to view logs
# Search examples in Kibana:
# - kubernetes.namespace_name: "default"
# - kubernetes.pod_name: "nginx-*"
# - log: "error"
# - kubernetes.labels.app: "my-app" AND level: "error"
Health Checks Best Practices
Liveness and Readiness Probes
# health-checks.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: web-app
spec:
replicas: 3
selector:
matchLabels:
app: web
template:
metadata:
labels:
app: web
spec:
containers:
- name: app
image: nginx:alpine
ports:
- containerPort: 80
# Liveness probe - restart if unhealthy
livenessProbe:
httpGet:
path: /health
port: 80
initialDelaySeconds: 30
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3
successThreshold: 1
# Readiness probe - remove from service if not ready
readinessProbe:
httpGet:
path: /ready
port: 80
initialDelaySeconds: 5
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 3
successThreshold: 1
# Startup probe - for slow-starting containers
startupProbe:
httpGet:
path: /startup
port: 80
initialDelaySeconds: 0
periodSeconds: 10
timeoutSeconds: 3
failureThreshold: 30
successThreshold: 1
resources:
requests:
memory: "128Mi"
cpu: "100m"
limits:
memory: "256Mi"
cpu: "200m"
Different Probe Types
# probe-types.yaml
apiVersion: v1
kind: Pod
metadata:
name: probe-examples
spec:
containers:
- name: http-probe
image: nginx:alpine
livenessProbe:
httpGet:
path: /healthz
port: 8080
httpHeaders:
- name: Custom-Header
value: Awesome
initialDelaySeconds: 3
periodSeconds: 3
- name: tcp-probe
image: redis:alpine
livenessProbe:
tcpSocket:
port: 6379
initialDelaySeconds: 15
periodSeconds: 10
- name: exec-probe
image: postgres:15-alpine
env:
- name: POSTGRES_PASSWORD
value: password
livenessProbe:
exec:
command:
- sh
- -c
- pg_isready -U postgres
initialDelaySeconds: 30
periodSeconds: 10
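Kubernetes also supports gRPC probes for services that implement the standard gRPC health-checking protocol (stable in recent releases). A minimal sketch; the image and port here are hypothetical:
# grpc-probe.yaml (illustrative only)
apiVersion: v1
kind: Pod
metadata:
  name: grpc-probe-example
spec:
  containers:
  - name: grpc-app
    image: registry.example.com/grpc-app:latest  # hypothetical image serving gRPC health checks
    ports:
    - containerPort: 9090
    livenessProbe:
      grpc:
        port: 9090
      initialDelaySeconds: 10
      periodSeconds: 10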
Security Best Practices
1. Role-Based Access Control (RBAC)
# rbac-example.yaml
---
# ServiceAccount for application
apiVersion: v1
kind: ServiceAccount
metadata:
name: app-sa
namespace: production
---
# Role with limited permissions
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: pod-reader
namespace: production
rules:
- apiGroups: [""]
resources: ["pods", "pods/log"]
verbs: ["get", "list", "watch"]
- apiGroups: [""]
resources: ["configmaps"]
verbs: ["get", "list"]
resourceNames: ["app-config"]
---
# RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: read-pods
namespace: production
subjects:
- kind: ServiceAccount
name: app-sa
namespace: production
roleRef:
kind: Role
name: pod-reader
apiGroup: rbac.authorization.k8s.io
---
# ClusterRole for cluster-wide resources
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: node-viewer
rules:
- apiGroups: [""]
resources: ["nodes"]
verbs: ["get", "list", "watch"]
---
# ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: view-nodes
subjects:
- kind: ServiceAccount
name: app-sa
namespace: production
roleRef:
kind: ClusterRole
name: node-viewer
apiGroup: rbac.authorization.k8s.io
# Apply RBAC
kubectl apply -f rbac-example.yaml
# Test permissions
kubectl auth can-i get pods --namespace production --as system:serviceaccount:production:app-sa
kubectl auth can-i delete pods --namespace production --as system:serviceaccount:production:app-sa
# View roles and bindings
kubectl get roles,rolebindings -n production
kubectl get clusterroles,clusterrolebindings
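For quick tests, a Role and RoleBinding equivalent to the pod-reader rule above can also be created imperatively (the ConfigMap rule with resourceNames is easier to keep in YAML):
# Imperative equivalents of the pod-reader Role and its binding
kubectl create role pod-reader --verb=get,list,watch --resource=pods,pods/log -n production
kubectl create rolebinding read-pods --role=pod-reader \
  --serviceaccount=production:app-sa -n production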
2. Pod Security Standards
# pod-security-standards.yaml
---
# Baseline policy (namespace label)
apiVersion: v1
kind: Namespace
metadata:
name: baseline-ns
labels:
pod-security.kubernetes.io/enforce: baseline
pod-security.kubernetes.io/audit: restricted
pod-security.kubernetes.io/warn: restricted
---
# Restricted policy
apiVersion: v1
kind: Namespace
metadata:
name: restricted-ns
labels:
pod-security.kubernetes.io/enforce: restricted
pod-security.kubernetes.io/audit: restricted
pod-security.kubernetes.io/warn: restricted
---
# Secure pod example
apiVersion: v1
kind: Pod
metadata:
name: secure-pod
namespace: restricted-ns
spec:
serviceAccountName: app-sa
securityContext:
runAsNonRoot: true
runAsUser: 1000
fsGroup: 2000
seccompProfile:
type: RuntimeDefault
containers:
- name: app
image: nginx:alpine
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop:
- ALL
readOnlyRootFilesystem: true
volumeMounts:
- name: cache
mountPath: /var/cache/nginx
- name: run
mountPath: /var/run
volumes:
- name: cache
emptyDir: {}
- name: run
emptyDir: {}
# Apply pod security standards
kubectl apply -f pod-security-standards.yaml
# Try to create insecure pod (should fail in restricted namespace)
kubectl run insecure --image=nginx --namespace=restricted-ns --privileged=true
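Before tightening enforcement on an existing namespace, the Pod Security admission controller can evaluate the pods already running there with a server-side dry run, reporting would-be violations without changing anything:
# Preview which existing pods in baseline-ns would violate the restricted profile
kubectl label --dry-run=server --overwrite ns baseline-ns \
  pod-security.kubernetes.io/enforce=restricted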
3. Network Policies for Security
# security-network-policies.yaml
---
# Deny all ingress and egress by default
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: default-deny-all
namespace: production
spec:
podSelector: {}
policyTypes:
- Ingress
- Egress
---
# Allow DNS
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-dns
namespace: production
spec:
podSelector: {}
policyTypes:
- Egress
egress:
- to:
- namespaceSelector:
matchLabels:
name: kube-system
ports:
- protocol: UDP
port: 53
---
# Allow web traffic
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-web-traffic
namespace: production
spec:
podSelector:
matchLabels:
app: web
policyTypes:
- Ingress
- Egress
ingress:
- from:
- namespaceSelector:
matchLabels:
name: ingress-nginx
ports:
- protocol: TCP
port: 80
egress:
- to:
- podSelector:
matchLabels:
app: database
ports:
- protocol: TCP
port: 5432
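Note that the namespaceSelector entries above match a name label that namespaces only carry if someone has added it manually; on Kubernetes 1.21+ the built-in kubernetes.io/metadata.name label is a safer match, and DNS often needs TCP 53 as well as UDP. A revised DNS egress rule as a sketch, followed by a quick connectivity test (the Service name web is an assumption):
# allow-dns with the built-in namespace label and TCP fallback
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
# Verify the default-deny is effective: this request should time out unless a policy allows it
kubectl run np-test --image=busybox:latest -n production --rm -it --restart=Never -- \
  wget -qO- --timeout=5 http://web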
4. Secret Management
# Enable encryption at rest (on control plane)
cat << EOF | sudo tee /etc/kubernetes/encryption-config.yaml
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
- resources:
- secrets
providers:
- aescbc:
keys:
- name: key1
secret: $(head -c 32 /dev/urandom | base64)
- identity: {}
EOF
# Update kube-apiserver to use the encryption config (on kubeadm, edit /etc/kubernetes/manifests/kube-apiserver.yaml and mount the file into the pod)
# Add flag: --encryption-provider-config=/etc/kubernetes/encryption-config.yaml
# Install sealed-secrets for GitOps
kubectl apply -f https://github.com/bitnami-labs/sealed-secrets/releases/download/v0.24.0/controller.yaml
# Use external secret management
# Install External Secrets Operator
helm repo add external-secrets https://charts.external-secrets.io
helm install external-secrets \
external-secrets/external-secrets \
-n external-secrets-system \
--create-namespace
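With the sealed-secrets controller installed, plaintext Secrets never need to be committed: kubeseal encrypts them against the controller's public key, and only the sealed manifest goes into Git. A sketch with hypothetical names, assuming the kubeseal CLI is installed:
# Create a Secret manifest locally (never applied in plaintext) and seal it
kubectl create secret generic db-credentials \
  --from-literal=password=S3cr3tPassw0rd \
  --dry-run=client -o yaml | \
  kubeseal --format yaml > sealed-db-credentials.yaml
# Commit sealed-db-credentials.yaml, then apply it; the controller decrypts it into a normal Secret
kubectl apply -f sealed-db-credentials.yaml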
Resource Management and Cost Optimization
1. Vertical Pod Autoscaler (VPA)
# Install VPA
git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler
./hack/vpa-up.sh
cd ../..
# vpa-example.yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
name: web-app-vpa
spec:
targetRef:
apiVersion: apps/v1
kind: Deployment
name: web-app
updatePolicy:
updateMode: "Auto" # Auto, Recreate, Initial, Off
resourcePolicy:
containerPolicies:
- containerName: app
minAllowed:
cpu: 100m
memory: 128Mi
maxAllowed:
cpu: 2
memory: 2Gi
# Apply VPA
kubectl apply -f vpa-example.yaml
# View VPA recommendations
kubectl get vpa web-app-vpa -o yaml
kubectl describe vpa web-app-vpa
2. Horizontal Pod Autoscaler (HPA)
# hpa-example.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: web-app
spec:
replicas: 2
selector:
matchLabels:
app: web
template:
metadata:
labels:
app: web
spec:
containers:
- name: app
image: nginx:alpine
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 200m
memory: 256Mi
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: web-app-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: web-app
minReplicas: 2
maxReplicas: 10
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80
behavior:
scaleDown:
stabilizationWindowSeconds: 300
policies:
- type: Percent
value: 50
periodSeconds: 60
scaleUp:
stabilizationWindowSeconds: 0
policies:
- type: Percent
value: 100
periodSeconds: 30
# Install metrics-server (if not installed)
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
# Apply HPA
kubectl apply -f hpa-example.yaml
# Watch HPA
kubectl get hpa web-app-hpa -w
# Expose the deployment so the load generator has a Service to call, then generate load
kubectl expose deployment web-app --port=80
kubectl run load-generator --image=busybox:latest --restart=Never -- /bin/sh -c "while true; do wget -q -O- http://web-app; done"
# Watch scaling
kubectl get hpa,pods -w
3. Cluster Autoscaler
# For AWS
kubectl apply -f https://raw.githubusercontent.com/kubernetes/autoscaler/master/cluster-autoscaler/cloudprovider/aws/examples/cluster-autoscaler-autodiscover.yaml
# Edit the deployment with your cluster name
kubectl edit deployment cluster-autoscaler -n kube-system
# Prevent the autoscaler from evicting its own pod: the annotation must land on the pod template (annotating the Deployment object itself has no effect)
kubectl -n kube-system patch deployment cluster-autoscaler \
  -p '{"spec":{"template":{"metadata":{"annotations":{"cluster-autoscaler.kubernetes.io/safe-to-evict":"false"}}}}}'
# View logs
kubectl logs -f deployment/cluster-autoscaler -n kube-system
Backup and Disaster Recovery
1. Velero Installation
# Install Velero CLI on Ubuntu
wget https://github.com/vmware-tanzu/velero/releases/download/v1.12.0/velero-v1.12.0-linux-amd64.tar.gz
tar -xvf velero-v1.12.0-linux-amd64.tar.gz
sudo mv velero-v1.12.0-linux-amd64/velero /usr/local/bin/
rm -rf velero-v1.12.0-linux-amd64*
# Install Velero on cluster (AWS example)
velero install \
--provider aws \
--plugins velero/velero-plugin-for-aws:v1.8.0 \
--bucket velero-backups \
--backup-location-config region=us-east-1 \
--snapshot-location-config region=us-east-1 \
--secret-file ./credentials-velero
# Verify installation
kubectl get pods -n velero
2. Creating Backups
# Backup entire cluster
velero backup create full-backup --include-namespaces '*'
# Backup specific namespace
velero backup create prod-backup --include-namespaces production
# Backup with label selector
velero backup create app-backup --selector app=critical
# Scheduled backup
velero schedule create daily-backup --schedule="0 2 * * *" --include-namespaces production
# View backups
velero backup get
velero backup describe full-backup
# Download backup logs
velero backup logs full-backup
3. Restoring from Backups
# List available backups
velero backup get
# Restore entire backup
velero restore create --from-backup full-backup
# Restore specific namespace
velero restore create --from-backup prod-backup --include-namespaces production
# Restore to different namespace
velero restore create --from-backup prod-backup \
--namespace-mappings production:production-restored
# View restore status
velero restore get
velero restore describe <restore-name>
# View logs
velero restore logs <restore-name>
Troubleshooting Guide
Pod Issues
# Pod stuck in Pending
kubectl describe pod <pod-name>
# Check: Insufficient resources, node selector, taints, PVC binding
# Pod stuck in CrashLoopBackOff
kubectl logs <pod-name>
kubectl logs <pod-name> --previous
kubectl describe pod <pod-name>
# Check: Application errors, misconfiguration, health probes
# Pod stuck in ImagePullBackOff
kubectl describe pod <pod-name>
# Check: Image name, image pull secrets, registry access
# Pod evicted
kubectl get events --sort-by='.lastTimestamp'
# Check: Resource pressure (memory, disk)
# Debug with ephemeral container (K8s 1.25+)
kubectl debug <pod-name> -it --image=nicolaka/netshoot
Network Issues
# Test DNS
kubectl run dns-test --image=busybox:latest -it --rm --restart=Never -- nslookup kubernetes.default
# Test service connectivity
kubectl run netshoot --image=nicolaka/netshoot -it --rm --restart=Never -- /bin/bash
# Inside: curl http://service-name:port
# Check network policies
kubectl get networkpolicies -A
kubectl describe networkpolicy <policy-name>
# Verify kube-proxy
kubectl logs -n kube-system -l k8s-app=kube-proxy
# Check CoreDNS
kubectl logs -n kube-system -l k8s-app=kube-dns
Storage Issues
# PVC stuck in Pending
kubectl describe pvc <pvc-name>
# Check: No matching PV, StorageClass issues
# PV not released
kubectl patch pv <pv-name> -p '{"spec":{"claimRef": null}}'
# Check storage usage
kubectl get pvc -A
kubectl describe pvc <pvc-name>
Cluster Issues
# Check node status
kubectl get nodes
kubectl describe node <node-name>
# Check node conditions
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}'
# View cluster events
kubectl get events --sort-by='.lastTimestamp' -A
# Check control plane components (componentstatuses is deprecated since v1.19; the kube-system pods are the better signal)
kubectl get componentstatuses
kubectl get pods -n kube-system
# Check API server logs (a static pod on kubeadm clusters)
kubectl logs -n kube-system -l component=kube-apiserver
# Check etcd health (kubeadm certificate paths shown)
kubectl exec -n kube-system etcd-<node> -- etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health
Production Readiness Checklist
Security Checklist
- Enable RBAC and principle of least privilege
- Implement Pod Security Standards (restricted)
- Enable encryption at rest for secrets
- Use network policies to restrict traffic
- Scan images for vulnerabilities
- Use private container registries
- Implement admission controllers (OPA/Gatekeeper)
- Enable audit logging
- Rotate credentials regularly
- Use service accounts, not default
High Availability Checklist
- Multi-node control plane (3+ nodes)
- Multi-zone/region deployment
- Configure Pod Disruption Budgets (see the sketch after this checklist)
- Use multiple replicas for critical services
- Implement health checks (liveness, readiness)
- Configure resource requests and limits
- Set up monitoring and alerting
- Implement autoscaling (HPA, VPA, CA)
- Use anti-affinity for replica distribution
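Two of the items above translate directly into manifests; a minimal sketch of a PodDisruptionBudget and pod anti-affinity for the web-app Deployment used earlier:
# pdb-and-antiaffinity.yaml (sketch)
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web
# Fragment: add under spec.template.spec of the Deployment to spread replicas across nodes
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        topologyKey: kubernetes.io/hostname
        labelSelector:
          matchLabels:
            app: web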
Monitoring and Logging Checklist
- Deploy Prometheus for metrics
- Set up Grafana dashboards
- Configure alerting rules
- Implement centralized logging (EFK/Loki)
- Monitor node and cluster health
- Track resource utilization
- Set up SLIs and SLOs
- Configure log retention policies
Backup and DR Checklist
- Implement regular backups (Velero)
- Test restore procedures
- Backup persistent volumes
- Document disaster recovery plan
- Set up cross-region replication
- Maintain configuration in version control
- Regular disaster recovery drills
Operations Checklist
- Use GitOps for deployments (ArgoCD/Flux)
- Implement CI/CD pipelines
- Use Helm for package management
- Maintain documentation
- Set up development/staging/production environments
- Implement rollback strategies
- Use namespaces for isolation
- Tag and version all resources
Performance Checklist
- Right-size resource requests/limits
- Enable cluster autoscaling
- Use horizontal pod autoscaling
- Optimize container images (multi-stage builds)
- Implement caching strategies
- Use CDN for static assets
- Configure persistent volume performance
- Monitor and optimize database queries
Summary
Congratulations! You’ve completed the Kubernetes Mastery series. In this final part, you learned:
- Installing and configuring Prometheus and Grafana for monitoring
- Creating custom alerts and dashboards
- Setting up centralized logging with EFK stack
- Implementing health checks effectively
- Security best practices (RBAC, Pod Security, Network Policies)
- Resource management and cost optimization
- Backup and disaster recovery with Velero
- Comprehensive troubleshooting techniques
- Production readiness checklist
Key Commands Reference
# Monitoring
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80
# Logging
kubectl logs <pod-name>
kubectl logs <pod-name> -c <container-name>
kubectl logs -f <pod-name>
kubectl logs --tail=100 <pod-name>
# Debugging
kubectl describe pod <pod-name>
kubectl get events --sort-by='.lastTimestamp'
kubectl debug <pod-name> -it --image=nicolaka/netshoot
# Backups
velero backup create <name>
velero restore create --from-backup <name>
velero schedule create <name> --schedule="0 2 * * *"
# Monitoring resources
kubectl top nodes
kubectl top pods
kubectl get hpa
Next Steps
- Explore service meshes (Istio, Linkerd)
- Learn Kubernetes operators
- Implement GitOps with ArgoCD or Flux
- Study Kubernetes internals and controllers
- Contribute to Kubernetes ecosystem
- Get certified (CKA, CKAD, CKS)
Thank you for completing this comprehensive Kubernetes journey! Keep practicing, and remember: production-ready Kubernetes is a continuous learning process.