Infrastructure Cost Optimization

Overview

Reduce infrastructure costs through intelligent resource allocation, reserved instances, spot instances, and continuous optimization without sacrificing performance.

When to Use

  • Cloud cost reduction
  • Budget management and tracking
  • Resource utilization optimization
  • Multi-environment cost allocation
  • Waste identification and elimination
  • Reserved instance planning
  • Spot instance integration

Implementation Examples

1. AWS Cost Optimization Configuration

# cost-optimization-setup.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cost-optimization-scripts
  namespace: operations
data:
  analyze-costs.sh: |
    #!/bin/bash
    set -euo pipefail

    echo "=== AWS Cost Analysis ==="

    # Get daily cost trend (date -d requires GNU date).
    # No --group-by here: Total is only populated for ungrouped queries.
    echo "Daily costs for last 7 days:"
    aws ce get-cost-and-usage \
      --time-period Start=$(date -d '7 days ago' +%Y-%m-%d),End=$(date +%Y-%m-%d) \
      --granularity DAILY \
      --metrics "BlendedCost" \
      --query 'ResultsByTime[*].[TimePeriod.Start,Total.BlendedCost.Amount]' \
      --output table

    # Find unattached resources
    echo -e "\n=== Unattached EBS Volumes ==="
    aws ec2 describe-volumes \
      --filters Name=status,Values=available \
      --query 'Volumes[*].[VolumeId,Size,CreateTime]' \
      --output table

    # Unassociated addresses have no AssociationId, so filter client-side
    echo -e "\n=== Unattached Elastic IPs ==="
    aws ec2 describe-addresses \
      --query 'Addresses[?AssociationId==`null`].[PublicIp,AllocationId]' \
      --output table

    # Lists running RDS instances; deciding which are unused still requires
    # checking CloudWatch metrics such as DatabaseConnections
    echo -e "\n=== RDS Instances (review for idle ones) ==="
    aws rds describe-db-instances \
      --query 'DBInstances[?DBInstanceStatus==`available`].[DBInstanceIdentifier,DBInstanceClass,Engine,AllocatedStorage]' \
      --output table

    # Estimate savings with Reserved Instances
    echo -e "\n=== Reserved Instance Savings Potential ==="
    aws ce get-reservation-purchase-recommendation \
      --service "Amazon Elastic Compute Cloud - Compute" \
      --lookback-period THIRTY_DAYS \
      --query 'Recommendations[0].RecommendationSummary.[TotalEstimatedMonthlySavingsAmount,TotalEstimatedMonthlySavingsPercentage]' \
      --output table

  optimize-resources.sh: |
    #!/bin/bash
    set -euo pipefail

    # WARNING: this script deletes resources. Review the output of
    # analyze-costs.sh before running it.
    echo "Starting resource optimization..."

    # Remove unattached volumes
    # (--output text is tab-separated, so split it into one ID per line)
    echo "Removing unattached volumes..."
    aws ec2 describe-volumes \
      --filters Name=status,Values=available \
      --query 'Volumes[*].VolumeId' \
      --output text | tr '\t' '\n' | \
    while read -r volume_id; do
      [ -n "$volume_id" ] || continue
      echo "Deleting volume: $volume_id"
      aws ec2 delete-volume --volume-id "$volume_id" \
        || echo "Failed to delete volume: $volume_id" >&2
    done

    # Release unused Elastic IPs
    echo "Releasing unused Elastic IPs..."
    aws ec2 describe-addresses \
      --query 'Addresses[?AssociationId==`null`].AllocationId' \
      --output text | tr '\t' '\n' | \
    while read -r alloc_id; do
      [ -n "$alloc_id" ] || continue
      echo "Releasing EIP: $alloc_id"
      aws ec2 release-address --allocation-id "$alloc_id" \
        || echo "Failed to release EIP: $alloc_id" >&2
    done

    # Modify RDS to smaller instances
    echo "Analyzing RDS for downsizing..."
    # Implement logic to check CloudWatch metrics and downsize if needed

    echo "Optimization complete"
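The cleanup script above deletes every unattached volume it finds. One way to reduce the risk of removing something still needed is to filter candidates by age first. The following is a minimal sketch under stated assumptions: `volumes_safe_to_delete` and the 14-day threshold are hypothetical and not part of the original scripts, and the sample dictionaries only mimic the shape of the `Volumes` entries returned by `describe-volumes`.

```python
from datetime import datetime, timedelta, timezone


def volumes_safe_to_delete(volumes, min_age_days=14, now=None):
    """Return IDs of unattached volumes created at least min_age_days ago."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=min_age_days)
    return [
        v["VolumeId"]
        for v in volumes
        if v["State"] == "available" and v["CreateTime"] <= cutoff
    ]


# Hypothetical sample data shaped like describe-volumes output
NOW = datetime(2025, 6, 30, tzinfo=timezone.utc)
SAMPLE_VOLUMES = [
    {"VolumeId": "vol-old", "State": "available",
     "CreateTime": datetime(2025, 1, 1, tzinfo=timezone.utc)},
    {"VolumeId": "vol-new", "State": "available",
     "CreateTime": datetime(2025, 6, 29, tzinfo=timezone.utc)},
    {"VolumeId": "vol-used", "State": "in-use",
     "CreateTime": datetime(2025, 1, 1, tzinfo=timezone.utc)},
]

print(volumes_safe_to_delete(SAMPLE_VOLUMES, now=NOW))  # ['vol-old']
```

Only the old, unattached volume is selected; the recently created one is left alone in case it is about to be attached.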

# Terraform cost optimization
resource "aws_instance" "spot" {
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "t3.medium"

  # Use spot instances for non-critical workloads
  instance_market_options {
    market_type = "spot"

    spot_options {
      max_price          = "0.05" # Maximum hourly price in USD
      spot_instance_type = "persistent"
      # Persistent spot requests require "stop" or "hibernate"
      instance_interruption_behavior = "stop"
      valid_until                    = "2025-12-31T23:59:59Z"
    }
  }

  tags = {
    Name       = "spot-instance"
    CostCenter = "engineering"
  }
}

# Reserved instance for baseline capacity
resource "aws_instance" "reserved" {
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "t3.medium"

  # Reserved Instance discounts apply automatically to running instances
  # whose attributes match the reservation; this tag is for reporting only
  tags = {
    Name            = "reserved-instance"
    ReservationType = "reserved"
  }
}

# Mixed on-demand and spot capacity
# (aws_launch_template.app is assumed to be defined elsewhere)
resource "aws_ec2_fleet" "mixed" {
  type = "maintain"

  launch_template_config {
    launch_template_specification {
      launch_template_id = aws_launch_template.app.id
      version            = "$Latest"
    }

    # priority is only honored when a prioritized allocation strategy is set
    override {
      instance_type     = "t3.medium"
      weighted_capacity = 1
      priority          = 1
    }

    override {
      instance_type     = "t3.large"
      weighted_capacity = 2
      priority          = 2
    }

    override {
      instance_type     = "t3a.medium"
      weighted_capacity = 1
      priority          = 3
    }

    override {
      instance_type     = "t3a.large"
      weighted_capacity = 2
      priority          = 4
    }
  }

  target_capacity_specification {
    default_target_capacity_type = "on-demand"
    total_target_capacity        = 10
    on_demand_target_capacity    = 6
    spot_target_capacity         = 4
  }

  tags = {
    Name = "mixed-capacity"
  }
}
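The fleet above splits 10 capacity units into 6 on-demand and 4 spot. A rough way to reason about what that split buys is to compare its hourly cost against running all 10 units on-demand. The sketch below is illustrative only: the function names are hypothetical and the per-unit prices are made-up placeholders, not real AWS prices.

```python
def blended_hourly_cost(on_demand_units, spot_units, on_demand_price, spot_price):
    """Hourly cost of a fleet that mixes on-demand and spot capacity units."""
    return on_demand_units * on_demand_price + spot_units * spot_price


def savings_vs_all_on_demand(on_demand_units, spot_units, on_demand_price, spot_price):
    """Fraction saved compared with running every unit on-demand."""
    baseline = (on_demand_units + spot_units) * on_demand_price
    mixed = blended_hourly_cost(on_demand_units, spot_units, on_demand_price, spot_price)
    return (baseline - mixed) / baseline


# Assumed placeholder prices per capacity unit (USD/hour), not real quotes
ON_DEMAND_PRICE = 0.04
SPOT_PRICE = 0.012  # i.e. a 70% discount in this example

fraction = savings_vs_all_on_demand(6, 4, ON_DEMAND_PRICE, SPOT_PRICE)
print(f"Savings with a 6/4 split: {fraction:.0%}")
```

With these placeholder prices the 6/4 split saves roughly 28% per hour; shifting more units to spot raises the savings but also the exposure to interruptions.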

2. Kubernetes Cost Optimization

# k8s-cost-optimization.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cost-optimization-policies
  namespace: kube-system
data:
  policies.yaml: |
    # Resource quotas per namespace
    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: compute-quota
      namespace: production
    spec:
      hard:
        requests.cpu: "100"
        requests.memory: "200Gi"
        limits.cpu: "200"
        limits.memory: "400Gi"
        pods: "500"
      scopeSelector:
        matchExpressions:
          - operator: In
            scopeName: PriorityClass
            values: ["high", "medium"]

---
# Pod Disruption Budget for cost-effective scaling
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: cost-optimized-pdb
  namespace: production
spec:
  minAvailable: 1
  selector:
    matchLabels:
      tier: backend

---
# Prioritize spot instances with taints/tolerations
apiVersion: v1
kind: Node
metadata:
  name: spot-node-1
spec:
  taints:
    - key: cloud.google.com/gke-preemptible
      value: "true"
      effect: NoSchedule

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cost-optimized-app
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      # Tolerate spot instances
      tolerations:
        - key: cloud.google.com/gke-preemptible
          operator: Equal
          value: "true"
          effect: NoSchedule

      # Prefer nodes with lower cost
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              preference:
                matchExpressions:
                  - key: karpenter.sh/capacity-type
                    operator: In
                    values: ["spot"]

      containers:
        - name: app
          image: myapp:latest
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 512Mi
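The ResourceQuota and the container requests and limits above interact: whichever quota dimension runs out first caps how many such pods the namespace can hold. The sketch below works that out for the values shown. It is a simplified illustration: the helper names are hypothetical, the quantity parsing handles only the suffixes used here (`m`, `Ki`, `Mi`, `Gi`), and it assumes every pod counted against the quota has one container with these exact resources.

```python
MEMORY_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def cpu_millicores(quantity):
    """Convert a CPU quantity such as '100m' or '2' to integer millicores."""
    quantity = str(quantity)
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def memory_bytes(quantity):
    """Convert a memory quantity such as '128Mi' to bytes."""
    quantity = str(quantity)
    for suffix, factor in MEMORY_UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)


def pod_capacity(quota, requests, limits):
    """Return (binding quota dimension, max identical pods that fit)."""
    fits = {
        "requests.cpu": cpu_millicores(quota["requests.cpu"]) // cpu_millicores(requests["cpu"]),
        "requests.memory": memory_bytes(quota["requests.memory"]) // memory_bytes(requests["memory"]),
        "limits.cpu": cpu_millicores(quota["limits.cpu"]) // cpu_millicores(limits["cpu"]),
        "limits.memory": memory_bytes(quota["limits.memory"]) // memory_bytes(limits["memory"]),
        "pods": int(quota["pods"]),
    }
    binding = min(fits, key=fits.get)
    return binding, fits[binding]


QUOTA = {"requests.cpu": "100", "requests.memory": "200Gi",
         "limits.cpu": "200", "limits.memory": "400Gi", "pods": "500"}
REQUESTS = {"cpu": "100m", "memory": "128Mi"}
LIMITS = {"cpu": "500m", "memory": "512Mi"}

print(pod_capacity(QUOTA, REQUESTS, LIMITS))  # ('limits.cpu', 400)
```

Under these assumptions the CPU limit quota is exhausted at 400 pods, before the 500-pod count is reached, which suggests either the limits or the quota may deserve a second look.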

3. Cost Monitoring Dashboard

# cost-monitoring.py
import boto3
from datetime import datetime, timedelta, timezone


class CostOptimizer:
    def __init__(self):
        self.ce_client = boto3.client('ce')
        self.ec2_client = boto3.client('ec2')
        self.rds_client = boto3.client('rds')

    def get_daily_costs(self, days=30):
        """Get daily costs, grouped by service, for the past N days"""
        end_date = datetime.now(timezone.utc).date()
        start_date = end_date - timedelta(days=days)

        response = self.ce_client.get_cost_and_usage(
            TimePeriod={
                'Start': str(start_date),
                'End': str(end_date)
            },
            Granularity='DAILY',
            Metrics=['BlendedCost'],
            GroupBy=[
                {'Type': 'DIMENSION', 'Key': 'SERVICE'}
            ]
        )

        return response

    def find_underutilized_instances(self):
        """Find EC2 instances with low CPU usage"""
        cloudwatch = boto3.client('cloudwatch')
        instances = []
        now = datetime.now(timezone.utc)

        # Paginate so accounts with many instances are fully covered
        paginator = self.ec2_client.get_paginator('describe_instances')
        for page in paginator.paginate():
            for reservation in page['Reservations']:
                for instance in reservation['Instances']:
                    instance_id = instance['InstanceId']

                    # Check hourly average CPU utilization over the last 7 days
                    response = cloudwatch.get_metric_statistics(
                        Namespace='AWS/EC2',
                        MetricName='CPUUtilization',
                        Dimensions=[{'Name': 'InstanceId', 'Value': instance_id}],
                        StartTime=now - timedelta(days=7),
                        EndTime=now,
                        Period=3600,
                        Statistics=['Average']
                    )

                    if response['Datapoints']:
                        avg_cpu = sum(d['Average'] for d in response['Datapoints']) / len(response['Datapoints'])
                        if avg_cpu < 10:  # Less than 10% average
                            instances.append({
                                'InstanceId': instance_id,
                                'Type': instance['InstanceType'],
                                'AverageCPU': avg_cpu,
                                'Recommendation': 'Downsize or terminate'
                            })

        return instances

    def estimate_reserved_instance_savings(self):
        """Estimate potential savings from reserved instances"""
        response = self.ce_client.get_reservation_purchase_recommendation(
            Service='Amazon Elastic Compute Cloud - Compute',
            LookbackPeriod='THIRTY_DAYS',
            PageSize=100
        )

        total_savings = 0.0
        for recommendation in response.get('Recommendations', []):
            summary = recommendation.get('RecommendationSummary', {})
            total_savings += float(summary.get('TotalEstimatedMonthlySavingsAmount', 0))

        return total_savings

    def generate_report(self):
        """Generate comprehensive cost optimization report"""
        print("=== Cost Optimization Report ===\n")

        # Daily costs
        print("Daily Costs:")
        costs = self.get_daily_costs(7)
        for result in costs['ResultsByTime']:
            date = result['TimePeriod']['Start']
            # With GroupBy set, 'Total' is empty, so sum the per-service groups
            total = sum(
                float(group['Metrics']['BlendedCost']['Amount'])
                for group in result['Groups']
            )
            print(f"  {date}: ${total:.2f}")

        # Underutilized instances
        print("\nUnderutilized Instances:")
        underutilized = self.find_underutilized_instances()
        for instance in underutilized:
            print(f"  {instance['InstanceId']}: {instance['AverageCPU']:.1f}% CPU - {instance['Recommendation']}")

        # Reserved instance savings
        print("\nReserved Instance Savings Potential:")
        savings = self.estimate_reserved_instance_savings()
        print(f"  Estimated Monthly Savings: ${savings:.2f}")

# Usage
if __name__ == '__main__':
    optimizer = CostOptimizer()
    optimizer.generate_report()
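Because `get_daily_costs` groups results by service, the per-day figure has to be summed from the groups before trends can be compared. The sketch below does that offline and flags day-over-day spikes. The helper names, the 1.5x threshold, and the sample amounts are assumptions for illustration; the sample only mimics the `ResultsByTime` shape of a grouped Cost Explorer response.

```python
def daily_totals(response):
    """Sum per-service BlendedCost groups into one (day, total) pair per day."""
    totals = []
    for result in response["ResultsByTime"]:
        day = result["TimePeriod"]["Start"]
        amount = sum(
            float(group["Metrics"]["BlendedCost"]["Amount"])
            for group in result["Groups"]
        )
        totals.append((day, amount))
    return totals


def find_spikes(totals, ratio=1.5):
    """Return days whose cost exceeds `ratio` times the previous day's cost."""
    spikes = []
    for (_, previous), (day, cost) in zip(totals, totals[1:]):
        if previous > 0 and cost > previous * ratio:
            spikes.append((day, cost))
    return spikes


def _sample_day(start, costs):
    """Build one hypothetical ResultsByTime entry from {service: amount}."""
    return {
        "TimePeriod": {"Start": start},
        "Groups": [
            {"Keys": [service],
             "Metrics": {"BlendedCost": {"Amount": amount, "Unit": "USD"}}}
            for service, amount in costs.items()
        ],
    }


SAMPLE_RESPONSE = {"ResultsByTime": [
    _sample_day("2025-06-01", {"EC2": "10.0", "S3": "2.0"}),
    _sample_day("2025-06-02", {"EC2": "11.0", "S3": "2.0"}),
    _sample_day("2025-06-03", {"EC2": "25.0", "S3": "2.0"}),
]}

print(find_spikes(daily_totals(SAMPLE_RESPONSE)))  # [('2025-06-03', 27.0)]
```

A ratio against the previous day is deliberately simple; a rolling average would be less sensitive to a single quiet day.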

Cost Optimization Strategies

✅ DO

  • Use reserved instances for baseline capacity
  • Leverage spot instances for interruptible workloads
  • Right-size resources
  • Monitor cost trends
  • Implement auto-scaling
  • Compare pricing across regions
  • Tag resources consistently
  • Schedule non-essential resources to shut down when idle
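Scheduling non-essential resources usually comes down to a rule for when they should be running. A minimal sketch of such a rule follows; the function names and the weekday 08:00 to 20:00 window are assumptions chosen for illustration, not a recommendation from the original text.

```python
from datetime import datetime

WORKDAYS = (0, 1, 2, 3, 4)  # Monday=0 ... Friday=4, as in datetime.weekday()


def should_be_running(now, start_hour=8, end_hour=20, workdays=WORKDAYS):
    """True if `now` falls inside the assumed working window."""
    return now.weekday() in workdays and start_hour <= now.hour < end_hour


def weekly_hours_saved(start_hour=8, end_hour=20, workdays=WORKDAYS):
    """Hours per week the resource is stopped compared with running 24/7."""
    running_hours = (end_hour - start_hour) * len(workdays)
    return 7 * 24 - running_hours


print(should_be_running(datetime(2025, 6, 30, 9, 0)))  # Monday morning: True
print(should_be_running(datetime(2025, 6, 28, 9, 0)))  # Saturday: False
print(weekly_hours_saved())                            # 108
```

With this window a development environment runs 60 of 168 hours a week, so it is stopped for 108 hours it would otherwise be billed for.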

❌ DON'T

  • Over-provision resources
  • Ignore unused resources
  • Neglect cost monitoring
  • Run everything on-demand
  • Forget to release EIPs
  • Mix cost centers
  • Ignore savings opportunities
  • Deploy without budgets
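Consistent tagging and clean cost-center attribution can be checked mechanically. The sketch below reports resources that lack required tags. The required tag names and helper names are hypothetical, and the sample records only mimic the `Tags` list shape that EC2 returns for an instance.

```python
REQUIRED_TAGS = ("CostCenter", "Environment")  # assumed tagging policy


def missing_tags(resource, required=REQUIRED_TAGS):
    """Return the required tag keys that are absent or empty on a resource."""
    present = {t["Key"] for t in resource.get("Tags", []) if t.get("Value")}
    return [name for name in required if name not in present]


def untagged_report(resources, required=REQUIRED_TAGS):
    """Map each non-compliant InstanceId to the tags it is missing."""
    report = {}
    for resource in resources:
        gaps = missing_tags(resource, required)
        if gaps:
            report[resource["InstanceId"]] = gaps
    return report


# Hypothetical sample data
SAMPLE_INSTANCES = [
    {"InstanceId": "i-aaa", "Tags": [
        {"Key": "CostCenter", "Value": "engineering"},
        {"Key": "Environment", "Value": "prod"}]},
    {"InstanceId": "i-bbb", "Tags": [{"Key": "Name", "Value": "scratch"}]},
    {"InstanceId": "i-ccc", "Tags": [
        {"Key": "CostCenter", "Value": ""},
        {"Key": "Environment", "Value": "dev"}]},
]

print(untagged_report(SAMPLE_INSTANCES))
```

Treating an empty value as missing is a design choice here: a tag key with no value cannot be used to attribute spend.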

Cost Saving Opportunities

  • Reserved Instances: 40-70% savings versus on-demand pricing
  • Spot Instances: 70-90% savings versus on-demand pricing
  • Committed Use Discounts: 25-55% savings
  • Right-sizing: 10-30% savings
  • Resource cleanup: 5-20% savings
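To turn the percentage ranges above into a rough dollar estimate, they can be applied to the portion of spend each strategy actually covers. The sketch below does only that arithmetic; the function name and the spend figures are hypothetical, and the percentages are copied from the list rather than independently verified.

```python
# Savings ranges in percent, taken from the list above
SAVINGS_RANGES = {
    "Reserved Instances": (40, 70),
    "Spot Instances": (70, 90),
    "Committed Use Discounts": (25, 55),
    "Right-sizing": (10, 30),
    "Resource cleanup": (5, 20),
}


def savings_range(eligible_spend, strategy):
    """Return (low, high) estimated savings for the spend a strategy covers."""
    low_pct, high_pct = SAVINGS_RANGES[strategy]
    return eligible_spend * low_pct / 100, eligible_spend * high_pct / 100


# Hypothetical example: $10,000/month of steady on-demand compute
print(savings_range(10_000, "Reserved Instances"))  # (4000.0, 7000.0)
```

The ranges should not simply be added together: each applies to a different slice of the bill, and some strategies overlap on the same resources.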

Resources
