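AWS Systems Manager Incident Manager: Complete Guide v2.0

Automatic incident detection, escalation, runbook execution, and MTTR reduction

Document metadata

Last updated: 2026-04-26
Version: v2.0
Audience: SREs, DevOps engineers, on-call managers, incident response leads
Level: Intermediate to advanced
⚠️ Note: Incident Manager is scheduled to close to new customers from 2025 onward (existing customers continue to be supported)
Related services: CloudWatch, EventBridge, Systems Manager Automation, Slack, Microsoft Teams, PagerDuty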

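Overview and challenges

What it is

AWS Systems Manager Incident Manager creates incidents automatically from CloudWatch alarms and EventBridge events, and shortens MTTR (mean time to recovery) through on-call rotations, escalation plans, and automated runbook execution. It centralizes notification, escalation, runbook execution, chat channel integration, and post-incident analysis, from detection through resolution. Because incident management is AWS-native, it can reduce dependence on external tools such as PagerDuty and Opsgenie.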

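Why choose this service

Why AWS Incident Manager?

AWS-native automatic incident creation
- Creates incidents directly from CloudWatch alarms and EventBridge events
- Attaches AWS resource context (account, Region, ARN) automatically
- Avoids the extra hop of calling an external API such as PagerDuty or Opsgenie
In-house on-call management
- Provides PagerDuty-style on-call schedules, escalation, and contact management inside AWS
- Reduces spend on external tool licenses
- Manages contacts in one place with SSM Contacts
Automated runbook execution
- Runs Systems Manager Automation documents automatically when an incident starts
- Automates sequences such as "restart EC2 → collect logs → notify via SNS → trigger Lambda"
- Speeds up first response when staff are stretched thin
Automated post-incident analysis
- Records the incident timeline, actions taken, and impact automatically
- Standardizes recurrence-prevention work with post-mortem templates
Slack and Microsoft Teams integration
- Chat-based incident response and team collaboration
- Operate Incident Manager with commands issued from the chat channel (Slack / Teams)
Multi-Region support
- A replication set replicates Incident Manager data across Regions
- Keeps incident response available in DR scenarios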

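Why not choose this service

Closing to new customers → consider PagerDuty / Opsgenie / incident.io
Complex escalation logic → PagerDuty (more flexible)
Other cloud environments → PagerDuty / Opsgenie / incident.io (multi-cloud support)
Advanced reporting and analytics → PagerDuty (richer feature set)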

Architecture and design principles

Overall architecture (Mermaid 1)

graph TB
    subgraph DetectionLayer["Detection layer"]
        CloudWatch["CloudWatch Alarms<br/>(high CPU, rising error rate)"]
        EventBridge["EventBridge Rules<br/>(GuardDuty, Config changes)"]
        Manual["Manual Creation<br/>(opened by a responder)"]
    end
    
    subgraph IncidentLayer["Incident layer"]
        Create["Create Incident<br/>• title, impact, tags"]
        Template["Response Plan Template<br/>• Escalation, Runbook"]
    end
    
    subgraph ResponseLayer["Response layer"]
        Engagement["Engagement<br/>• On-Call Contact<br/>• Escalation Plan"]
        Automation["Automation Runbook<br/>• SSM document execution<br/>• Lambda/API call"]
        Chat["Chat Integration<br/>• Slack/Teams<br/>• Collaboration"]
    end
    
    subgraph ResolutionLayer["Resolution layer"]
        Timeline["Timeline<br/>• Event log<br/>• Action tracking"]
        Analysis["Post-Incident Analysis<br/>• Root Cause<br/>• Follow-up Items"]
    end
    
    CloudWatch --> Create
    EventBridge --> Create
    Manual --> Create
    
    Create --> Template
    Template --> Engagement
    Template --> Automation
    Template --> Chat
    
    Engagement --> Timeline
    Automation --> Timeline
    Chat --> Timeline
    
    Timeline --> Analysis

Incident response flow (Mermaid 2)

sequenceDiagram
    participant Alert as CloudWatch/EventBridge
    participant Incident as Incident Manager
    participant Contact as On-Call Contact
    participant Automation as SSM Automation
    participant Chat as Slack/Teams
    participant User as User
    
    Alert->>Incident: Alarm triggered
    Incident->>Incident: Create incident<br/>(title, impact)
    
    Incident->>Contact: Get on-call (escalation plan)
    Contact-->>Incident: Contact info
    
    Incident->>Chat: Post to chat channel
    Incident->>Automation: Execute runbook
    Incident->>Contact: Notify (SMS/Email/Voice)
    
    Automation->>Automation: Step 1: Diagnose
    Automation->>Automation: Step 2: Mitigate
    Automation-->>Chat: Results posted
    
    Contact->>Chat: Join channel
    Contact->>User: Collaborate
    
    User->>Automation: Manual action
    User->>Chat: Update status
    User->>Incident: Resolve incident
    
    Incident->>Incident: Post-Incident Analysis

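Core components

1. Contact

Represents an individual or an escalation group.

Personal contact:
  - Name and alias
  - Contact channels (SMS, email, voice)
  - Participation in on-call schedules
  - Can be targeted by escalation plans

Escalation contact:
  - Notifies multiple contacts in stages
  - Duration (how long a stage waits before the next one starts)
  - Retry interval (time between repeated notifications)

Example:
  Contact: tanaka-taro
  ├─ Phone: +81-90-1234-5678
  ├─ Email: tanaka@example.com
  └─ Escalation Schedule: 
      ├─ On-Call: 09:00-18:00 (weekdays)
      └─ Backup: 18:00-09:00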

2. On-Call Schedule

Manages on-call responders in rotation units.

Example: Web Team On-Call
  Week 1: tanaka-taro
    └─ Mon-Sun 09:00-18:00 (primary)
    └─ Mon-Sun 18:00-09:00 (backup)
  
  Week 2: suzuki-hanako
    └─ Mon-Sun 09:00-18:00 (primary)
    └─ Mon-Sun 18:00-09:00 (backup)
  
  Week 3: nakamura-saburo
    └─ Mon-Sun 09:00-18:00 (primary)
    └─ Mon-Sun 18:00-09:00 (backup)

Settings:
  - Rotation (rotation name)
  - Start Date (when the rotation begins)
  - Duration (length of each rotation shift)
  - Contacts (assigned responders)
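
The rotation logic above can be sketched in a few lines. This is a minimal illustration of how a weekly rotation maps a date to a responder, not how Incident Manager computes shifts internally; the function name `on_call_for` and the start date are assumptions made for the example.

```python
from datetime import date


def on_call_for(day: date, start: date, rotation_days: int, contacts: list[str]) -> str:
    """Return the contact whose rotation shift covers `day`."""
    if day < start:
        raise ValueError("day is before the rotation start date")
    # Whole shifts elapsed since the start, wrapped around the contact list
    shift_index = (day - start).days // rotation_days
    return contacts[shift_index % len(contacts)]


team = ["tanaka-taro", "suzuki-hanako", "nakamura-saburo"]
start = date(2026, 4, 6)  # hypothetical rotation start (a Monday)

print(on_call_for(date(2026, 4, 8), start, 7, team))   # week 1: tanaka-taro
print(on_call_for(date(2026, 4, 15), start, 7, team))  # week 2: suzuki-hanako
print(on_call_for(date(2026, 4, 27), start, 7, team))  # week 4 wraps to tanaka-taro
```

With three contacts and a 7-day shift, every responder is on call one week in three, which matches the 1 to 2 week rotation length recommended later in this guide.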

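3. Escalation Plan

Staged notification rules applied during an incident.

Example: Critical Incident Escalation
  
  Stage 1: Immediate notification (Duration: 5 min)
    Targets: tanaka-taro (On-Call)
    Channels: SMS → Email → Voice
    Retry: every 2 min
  
  Stage 2: Escalation (Duration: 10 min)
    Targets: suzuki-hanako (Backup Lead)
    Channels: SMS → Voice
    Retry: every 5 min
  
  Stage 3: Management notification (Duration: 60 min)
    Targets: Engineering Manager
    Channels: Email

Target types:
  - Essential: an acknowledgement from this contact stops the plan from advancing to later stages
  - Non-Essential: the contact is notified, but the plan keeps advancing even if they acknowledge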
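
As a rough sketch, the three-stage example can be expressed as the `Plan` structure that the ssm-contacts `create-contact` operation accepts, together with the minute at which each stage would begin if nobody acknowledges. The helper names and the contact ARNs are hypothetical, and the field names reflect my understanding of the API shape, so check them against the current API reference before use.

```python
from itertools import accumulate


def build_escalation_plan(stages):
    """stages: list of (duration_minutes, [(contact_arn, is_essential), ...])."""
    return {
        "Stages": [
            {
                "DurationInMinutes": duration,
                "Targets": [
                    {"ContactTargetInfo": {"ContactId": arn, "IsEssential": essential}}
                    for arn, essential in targets
                ],
            }
            for duration, targets in stages
        ]
    }


def stage_start_offsets(plan):
    """Minutes after engagement at which each stage starts, assuming no acknowledgement."""
    durations = [stage["DurationInMinutes"] for stage in plan["Stages"]]
    return [0, *accumulate(durations)][:-1]


# Hypothetical contact ARNs used only for illustration
base = "arn:aws:ssm-contacts:ap-northeast-1:123456789012:contact/"
plan = build_escalation_plan([
    (5, [(base + "tanaka-taro", True)]),
    (10, [(base + "suzuki-hanako", True)]),
    (60, [(base + "engineering-manager", False)]),
])

print(stage_start_offsets(plan))  # [0, 5, 15]
```

The offsets show that with durations of 5, 10, and 60 minutes the manager is reached 15 minutes after the incident starts, so shortening the early stages is the main lever for faster escalation.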

4. Response Plan

A template for how an incident is handled.

Structure:
  1. Incident Template
     - Title and description
     - Impact (1-5, where 1 is critical and 5 is no impact)
     - NotificationTargets (SNS topics)
  
  2. Escalation Plan
     - Who is notified, and when
  
  3. Runbook Actions
     - SSM Automation document
     - Execution parameters
  
  4. Chat Channel
     - Slack or Teams channel (configured through AWS Chatbot)

Example:
  prod-api-incident-plan:
    ├─ Incident Template
    │   ├─ Title: "Production API Incident"
    │   ├─ Impact: 1 (Critical)
    │   └─ NotificationTargets: prod-alerts-sns
    ├─ Escalations: web-team-escalation
    ├─ Automation Runbooks:
    │   ├─ DiagnoseAPI (describe deployment)
    │   ├─ RestartService (EC2/ECS restart)
    │   └─ NotifyTeam (SNS)
    └─ Chat: Slack/#prod-incidents

5. Incident

The record of an actual incident.

Attributes:
  - ID and ARN
  - Title and description
  - Impact
  - Status (Open, Resolved)
  - Created time and resolved time
  - Response plan ARN
  - Engaged contacts (contacts that have been notified)
  - Timeline (event log)

Lifecycle:
  Create → Escalate → Mitigate → Resolve → Post-Incident Analysis
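
Because each incident carries a created time and a resolved time, MTTR can be derived directly from incident records. The following is a minimal sketch over plain dictionaries; the `created` and `resolved` keys are names chosen for this example rather than fields returned by the API.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Average (resolved - created) over resolved incidents; None if there are none."""
    durations = [
        item["resolved"] - item["created"]
        for item in incidents
        if item.get("resolved") is not None
    ]
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)


incidents = [
    {"created": datetime(2026, 4, 1, 10, 0), "resolved": datetime(2026, 4, 1, 10, 30)},
    {"created": datetime(2026, 4, 2, 9, 0), "resolved": datetime(2026, 4, 2, 10, 30)},
    {"created": datetime(2026, 4, 3, 8, 0), "resolved": None},  # still open, ignored
]

print(mean_time_to_recovery(incidents))  # 1:00:00
```

Open incidents are excluded so that an ongoing outage does not distort the average; tracking them separately is usually clearer.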

6. Timeline Event

The record of what happened during incident response.

Automatically generated events:
  - Incident created
  - Engagement started
  - Runbook execution started/completed
  - Contact engaged
  - Incident resolved

Manually added events:
  - Root cause identified: "DB connection pool exhausted"
  - Action taken: "Increased pool size from 100 to 200"
  - Rollback started
  - Deployment completed
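
A manual entry like the ones above is sent as a custom timeline event whose data field is a JSON string. The sketch below only builds the keyword arguments for the ssm-incidents `create_timeline_event` call; the helper name and the incident record ARN are hypothetical, and no AWS call is made.

```python
import json
from datetime import datetime, timezone


def timeline_event_args(incident_record_arn, note, when=None):
    """Build keyword arguments for a custom timeline event."""
    return {
        "incidentRecordArn": incident_record_arn,
        "eventTime": when or datetime.now(timezone.utc),
        "eventType": "Custom Event",
        # json.dumps turns the note into a valid JSON string value
        "eventData": json.dumps(note),
    }


args = timeline_event_args(
    "arn:aws:ssm-incidents::123456789012:incident-record/example-plan/example-id",  # hypothetical
    "Root cause identified: DB connection pool exhausted",
    when=datetime(2026, 4, 26, 10, 30, tzinfo=timezone.utc),
)

print(args["eventData"])
```

Encoding the note with `json.dumps` rather than passing raw text keeps quotes and non-ASCII characters in the note from producing an invalid payload.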

7. Post-Incident Analysis

Improvement work carried out after an incident.

Structure:
  1. Summary
     - Incident duration
     - Impact
     - Root cause
  
  2. Follow-up Items
     - Action items (improvements to make)
     - Owner and target date
  
  3. Lessons Learned
     - What went well?
     - What could we improve?

Output:
  - PDF report
  - JSON for analysis tools

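Key use cases

1. Automatic incident creation from a CloudWatch alarm

Scenario: the API starts returning a burst of 5XX errors

Configuration:
  CloudWatch Alarm:
    - Metric: ALB 5XX count (HTTPCode_Target_5XX_Count)
    - Threshold: 100 errors / 60 seconds
    - Action: the response plan ARN (arn:aws:ssm-incidents::<account-id>:response-plan/prod-api)

Flow:
  1. The 5XX count exceeds 100 within a period
  2. The CloudWatch alarm fires
  3. Incident Manager creates an incident automatically
  4. Escalation runs according to the response plan
  5. The incident is posted to the response plan's Slack channel
  6. The on-call engineer is notified by SMS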
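
To make the trigger condition concrete, here is a simplified model of the alarm check: the alarm breaches when the most recent datapoints all exceed the threshold. It deliberately ignores CloudWatch features such as "M out of N" evaluation and missing-data handling, and the two-period default is an assumption for the example.

```python
def breaches_alarm(datapoints, threshold=100, evaluation_periods=2):
    """True when the last `evaluation_periods` datapoints all exceed `threshold`."""
    if len(datapoints) < evaluation_periods:
        return False  # not enough data to evaluate yet
    recent = datapoints[-evaluation_periods:]
    return all(value > threshold for value in recent)


# Per-minute 5XX counts, oldest first
print(breaches_alarm([20, 150, 90]))   # False: the latest minute recovered
print(breaches_alarm([20, 150, 180]))  # True: two consecutive minutes above 100
```

Requiring more than one consecutive breaching period trades a slightly slower trigger for fewer incidents opened by short spikes.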

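2. Security incidents from EventBridge

Scenario: GuardDuty detects a high-severity finding

Configuration:
  EventBridge Rule:
    - Source: aws.guardduty
    - Detail: severity >= 7
    - Target: the response plan ARN (arn:aws:ssm-incidents::<account-id>:response-plan/security-incident)

Flow:
  1. GuardDuty detects a finding
  2. The EventBridge rule matches
  3. A security incident is created automatically
  4. The security team's on-call responder is notified by SMS
  5. A runbook collects CloudTrail logs and creates snapshots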
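
The matching rule can be illustrated locally. This sketch mimics the rule's three conditions on a plain dictionary; it is not the EventBridge pattern engine, and the sample events are invented for the example.

```python
def matches_security_rule(event, min_severity=7):
    """Mimic the rule: GuardDuty findings with severity at or above the minimum."""
    return (
        event.get("source") == "aws.guardduty"
        and event.get("detail-type") == "GuardDuty Finding"
        and event.get("detail", {}).get("severity", 0) >= min_severity
    )


high = {"source": "aws.guardduty", "detail-type": "GuardDuty Finding", "detail": {"severity": 8.0}}
low = {"source": "aws.guardduty", "detail-type": "GuardDuty Finding", "detail": {"severity": 4.0}}
other = {"source": "aws.config", "detail-type": "Config Rules Compliance Change", "detail": {}}

print(matches_security_rule(high))   # True
print(matches_security_rule(low))    # False
print(matches_security_rule(other))  # False
```

Testing sample events against the intended conditions before deploying the rule helps avoid opening incidents for low-severity findings.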

3. Manual incident creation

Scenario: an external SaaS dependency fails in a way CloudWatch cannot detect

Response:
  1. An engineer opens the incident manually in the Incident Manager console
  2. Title: "SaaS Service Dependency Failure"
  3. Impact: 2 (High)
  4. Response Plan: external-service-incident
  5. The escalation plan runs automatically
  6. The SaaS incident response team is notified

4. Automated recovery with a runbook

Scenario: an EC2 instance fails and can be recovered by an automatic restart

Runbook steps:
  1. Step 1: Get the instance ID (from the CloudWatch metric)
  2. Step 2: Check application status (health check)
  3. Step 3: Restart the EC2 instance (if the health check fails)
  4. Step 4: Verify recovery (health check)
  5. Step 5: Notify in Slack

Result:
  - The runbook recovers the service automatically in 90% of cases
  - No manual intervention needed (lower MTTR)
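
The control flow of such a runbook can be sketched with the AWS interactions injected as callables. In a real setup these steps would be Systems Manager Automation actions; the function `run_recovery` and the stubbed callables here are hypothetical stand-ins so the decision logic can be read and tested on its own.

```python
def run_recovery(instance_id, is_healthy, restart, notify):
    """Check health, restart once if unhealthy, verify, and report the outcome."""
    if is_healthy(instance_id):
        notify(f"{instance_id}: healthy, no action taken")
        return "no_action"
    restart(instance_id)
    if is_healthy(instance_id):
        notify(f"{instance_id}: recovered after restart")
        return "recovered"
    notify(f"{instance_id}: still unhealthy, escalate to on-call")
    return "escalate"


# Stubs that simulate an instance which becomes healthy after a restart
state = {"healthy": False}
messages = []
result = run_recovery(
    "i-0123456789abcdef0",  # hypothetical instance ID
    is_healthy=lambda _id: state["healthy"],
    restart=lambda _id: state.update(healthy=True),
    notify=messages.append,
)

print(result)
print(messages)
```

Keeping an explicit "escalate" outcome matters: when the restart does not help, the runbook should hand over to a person rather than retry indefinitely.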

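5. Real-time collaboration in Slack

Scenario: the production API is down and several teams need to respond

Response in Slack:
  1. Incident Manager posts the incident to the #prod-api-incident channel configured in the response plan
  2. The on-call engineer, backend lead, DevOps, and database team join the channel
  3. Incident details and timeline updates are posted to the channel automatically
  4. Engineers coordinate with each other in Slack
  5. Runbook status is posted in real time
  6. Check the incident state with a chat command (for example, @aws ssm-incidents get-incident-record ...)
  7. Resolve the incident with a chat command (for example, @aws ssm-incidents update-incident-record ... --status RESOLVED)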

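6. Staged notification during escalation

Scenario: the primary on-call responder does not respond

Timeline:
  T+00min: Incident starts; SMS sent to the primary on-call
  T+05min: SMS unacknowledged; email sent
  T+10min: Email unacknowledged; voice call placed
  T+15min: No response from the primary; escalation begins
           SMS sent to the backup lead
  T+20min: Backup lead acknowledges and starts working
  T+30min: Work in progress but complex; manager notified
  T+60min: Manager joins and takes over coordination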

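7. Multi-Region incident response

Scenario: the API in us-east-1 is down
          and a cascading failure follows in ap-northeast-1

Replication set configuration:
  Primary: ap-northeast-1
  Replica: us-east-1, eu-west-1

Response:
  1. The incident is created in ap-northeast-1
  2. The replication set replicates it to us-east-1 and eu-west-1
  3. On-call responders for each Region are notified automatically
  4. Each Region's team coordinates in its own chat channel
  5. Global incident coordination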

8. Database connection pool exhaustion (example)

Scenario: the RDS connection pool is exhausted and the application hangs

Detection:
  CloudWatch metric: DatabaseConnections > 95 (threshold)

Automated response (runbook):
  Step 1: Get the RDS instance
  Step 2: Check active connections
  Step 3: Kill idle connections (on a schedule)
  Step 4: Increase max_connections (if the runbook role has permission)
  Step 5: Notify the database team in Slack
  Step 6: Monitor recovery

Result:
  - Automatic recovery within 15 to 30 minutes
  - No manual intervention needed in 80% of cases

9. Automatic rollback of a failed deployment

Scenario: the error rate rises right after a new version is deployed

Detection:
  CloudWatch Alarm: 5XX count > threshold (new version)

Runbook:
  Step 1: Verify the version (CodeDeploy)
  Step 2: Get the previous stable version
  Step 3: Initiate rollback
  Step 4: Verify the traffic shift (load balancer)
  Step 5: Post the incident in Slack
  Step 6: Notify the on-call developer

Result:
  - Back on the previous version in 5 to 10 minutes
  - Service continuity maintained
  - Redeploy after root cause analysis

10. Infrastructure-as-Code integration

Scenario: a CloudFormation stack update fails

Detection:
  EventBridge: CloudFormation stack events
    - Event: "CREATE_FAILED" / "UPDATE_FAILED"

Automated response:
  1. Create the incident and record the stack ARN
  2. Notify the infrastructure team by SMS
  3. Runbook:
     - Get stack logs
     - Initiate rollback
     - Notify DevOps in Slack
  4. Use the post-incident analysis to improve the change review process

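Configuration and operation examples

CLI operations (AWS CLI)

1. Creating a contact

# Create a personal contact. --plan is required; start with an empty plan
# and add stages with update-contact once the channels exist.
aws ssm-contacts create-contact \
  --alias tanaka-taro \
  --display-name "Tanaka Taro" \
  --type PERSONAL \
  --plan '{"Stages": []}'

# Add an SMS contact channel (phone number in E.164 format, no hyphens)
aws ssm-contacts create-contact-channel \
  --contact-id arn:aws:ssm-contacts:ap-northeast-1:123456789012:contact/tanaka-taro \
  --name TanakaPhone \
  --type SMS \
  --delivery-address '{"SimpleAddress": "+819012345678"}'

# Add an email contact channel
aws ssm-contacts create-contact-channel \
  --contact-id arn:aws:ssm-contacts:ap-northeast-1:123456789012:contact/tanaka-taro \
  --name TanakaEmail \
  --type EMAIL \
  --delivery-address '{"SimpleAddress": "tanaka@example.com"}'

# Each channel must be activated with the code sent to it
# (aws ssm-contacts activate-contact-channel) before it can be engaged.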

2. Creating an escalation plan

# Escalation stages target contacts. Per-channel retry intervals belong to
# each contact's own engagement plan, not to the escalation plan.
aws ssm-contacts create-contact \
  --alias web-team-escalation \
  --display-name "Web Team Escalation" \
  --type ESCALATION \
  --plan '{
    "Stages": [
      {
        "DurationInMinutes": 5,
        "Targets": [{
          "ContactTargetInfo": {
            "ContactId": "arn:aws:ssm-contacts:ap-northeast-1:123456789012:contact/tanaka-taro",
            "IsEssential": true
          }
        }]
      },
      {
        "DurationInMinutes": 10,
        "Targets": [{
          "ContactTargetInfo": {
            "ContactId": "arn:aws:ssm-contacts:ap-northeast-1:123456789012:contact/suzuki-hanako",
            "IsEssential": true
          }
        }]
      },
      {
        "DurationInMinutes": 60,
        "Targets": [{
          "ContactTargetInfo": {
            "ContactId": "arn:aws:ssm-contacts:ap-northeast-1:123456789012:contact/manager",
            "IsEssential": false
          }
        }]
      }
    ]
  }'

3. Creating a response plan

# ssm-incidents JSON uses lowerCamelCase keys.
# documentName is a placeholder; replace it with your own Automation runbook.
aws ssm-incidents create-response-plan \
  --name prod-api-incident-plan \
  --incident-template '{
    "title": "Production API Incident",
    "impact": 1,
    "summary": "Critical production API incident requiring immediate response",
    "notificationTargets": [{
      "snsTopicArn": "arn:aws:sns:ap-northeast-1:123456789012:prod-incidents"
    }]
  }' \
  --engagements '[
    "arn:aws:ssm-contacts:ap-northeast-1:123456789012:contact/web-team-escalation"
  ]' \
  --actions '[{
    "ssmAutomation": {
      "roleArn": "arn:aws:iam::123456789012:role/IncidentManagerAutomationRole",
      "documentName": "MyOrg-DiagnoseEcsService",
      "documentVersion": "$DEFAULT",
      "targetAccount": "RESPONSE_PLAN_OWNER_ACCOUNT",
      "parameters": {
        "ClusterArn": ["arn:aws:ecs:ap-northeast-1:123456789012:cluster/prod-cluster"]
      }
    }
  }]'

4. Connecting a CloudWatch alarm to Incident Manager

# The LoadBalancer dimension value is a placeholder; use your ALB's full name.
aws cloudwatch put-metric-alarm \
  --alarm-name web-api-5xx-error-alarm \
  --alarm-description "ALB target 5XX count > 100 per minute" \
  --metric-name HTTPCode_Target_5XX_Count \
  --namespace AWS/ApplicationELB \
  --dimensions Name=LoadBalancer,Value=app/my-alb/1234567890abcdef \
  --statistic Sum \
  --period 60 \
  --threshold 100 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 2 \
  --alarm-actions "arn:aws:ssm-incidents::123456789012:response-plan/prod-api-incident-plan"

5. Creating GuardDuty incidents with an EventBridge rule

aws events put-rule \
  --name security-incident-rule \
  --event-pattern '{
    "source": ["aws.guardduty"],
    "detail-type": ["GuardDuty Finding"],
    "detail": {"severity": [{"numeric": [">=", 7]}]}
  }' \
  --state ENABLED

aws events put-targets \
  --rule security-incident-rule \
  --targets '[{
    "Id": "IncidentManagerTarget",
    "Arn": "arn:aws:ssm-incidents::123456789012:response-plan/security-incident-plan",
    "RoleArn": "arn:aws:iam::123456789012:role/EventBridgeIncidentRole"
  }]'

6. Creating an incident manually

aws ssm-incidents start-incident \
  --response-plan-arn arn:aws:ssm-incidents::123456789012:response-plan/prod-api-incident-plan \
  --title "Production API Manual Incident" \
  --impact 1 \
  --client-token "manual-incident-$(date +%s)"
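
The client token exists so that a retried request does not open a second incident. A timestamp with one-second resolution changes on every retry, which weakens that protection. One possible approach, shown here as a sketch, is to derive the token from the request details plus a coarse time bucket; the helper name and the bucket format are assumptions for the example.

```python
import hashlib


def incident_client_token(response_plan_arn, title, bucket):
    """Stable token: the same plan, title, and time bucket always give the same value."""
    raw = f"{response_plan_arn}|{title}|{bucket}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()[:32]


plan_arn = "arn:aws:ssm-incidents::123456789012:response-plan/prod-api-incident-plan"
first = incident_client_token(plan_arn, "Production API Manual Incident", "2026-04-26T10")
retry = incident_client_token(plan_arn, "Production API Manual Incident", "2026-04-26T10")
later = incident_client_token(plan_arn, "Production API Manual Incident", "2026-04-26T11")

print(first == retry)  # True: a retry within the same hour reuses the token
print(first == later)  # False: a new hour produces a new token
```

The bucket size is a trade-off: an hourly bucket suppresses duplicate incidents from retries but also means a genuinely new incident with the same title in the same hour needs a different title or bucket.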

7. Adding an event to the incident timeline

aws ssm-incidents create-timeline-event \
  --incident-record-arn arn:aws:ssm-incidents::123456789012:incident-record/xxx \
  --event-time "2026-04-26T10:30:00Z" \
  --event-type "Custom Event" \
  --event-data '"Root cause identified: Database connection pool exhausted"'

8. Resolving an incident

aws ssm-incidents update-incident-record \
  --arn arn:aws:ssm-incidents::123456789012:incident-record/xxx \
  --status RESOLVED \
  --summary "Database connection pool increased. Service restored."

SDK operations (Python)

1. Creating and managing an incident

import json
from datetime import datetime, timezone

import boto3

REGION = "ap-northeast-1"


def create_incident(response_plan_arn, title, impact):
    """Start an incident from a response plan and return its record ARN."""
    client = boto3.client("ssm-incidents", region_name=REGION)

    response = client.start_incident(
        responsePlanArn=response_plan_arn,
        title=title,
        impact=impact,  # 1 (critical) through 5 (no impact)
        clientToken=f"incident-{int(datetime.now(timezone.utc).timestamp())}",
    )

    # start_incident returns the ARN directly, not a nested record object
    incident_arn = response["incidentRecordArn"]
    print(f"Incident created: {incident_arn}")

    return incident_arn


def add_timeline_event(incident_arn, event_type, event_data):
    """Add a custom event to the incident timeline."""
    client = boto3.client("ssm-incidents", region_name=REGION)

    client.create_timeline_event(
        incidentRecordArn=incident_arn,
        eventTime=datetime.now(timezone.utc),  # timezone-aware UTC timestamp
        eventType=event_type,
        eventData=json.dumps(event_data),  # eventData must be a JSON string
    )

    print(f"Timeline event added: {event_type}")


def resolve_incident(incident_arn, summary):
    """Mark the incident as resolved and record a summary."""
    client = boto3.client("ssm-incidents", region_name=REGION)

    client.update_incident_record(
        arn=incident_arn,
        status="RESOLVED",
        summary=summary,
    )

    print(f"Incident resolved: {summary}")


# Example usage
if __name__ == "__main__":
    plan_arn = "arn:aws:ssm-incidents::123456789012:response-plan/prod-api-incident-plan"

    incident_arn = create_incident(
        response_plan_arn=plan_arn,
        title="Production API Error Rate High",
        impact=1,
    )

    add_timeline_event(
        incident_arn=incident_arn,
        event_type="Custom Event",
        event_data="Root cause identified: Database timeout",
    )

    resolve_incident(
        incident_arn=incident_arn,
        summary="Database connection pool increased. Service restored.",
    )

IaC operations (CloudFormation)

CloudFormation template

AWSTemplateFormatVersion: '2010-09-09'
Description: 'Incident Manager Setup'

Resources:
  # Contact
  OnCallEngineer:
    Type: AWS::SSMContacts::Contact
    Properties:
      Alias: on-call-engineer
      DisplayName: On-Call Engineer
      Type: PERSONAL

  # Contact channel (SMS); the number must be in E.164 format
  SMSChannel:
    Type: AWS::SSMContacts::ContactChannel
    Properties:
      ContactId: !GetAtt OnCallEngineer.Arn
      ChannelName: SMSNotification
      ChannelType: SMS
      ChannelAddress: '+819012345678'

  # Engagement plan for the personal contact (which channel to use, how often to retry)
  OnCallEngineerPlan:
    Type: AWS::SSMContacts::Plan
    Properties:
      ContactId: !GetAtt OnCallEngineer.Arn
      Stages:
        - DurationInMinutes: 5
          Targets:
            - ChannelTargetInfo:
                ChannelId: !GetAtt SMSChannel.Arn
                RetryIntervalInMinutes: 2

  # Escalation plan (stages target contacts)
  EscalationPlan:
    Type: AWS::SSMContacts::Contact
    Properties:
      Alias: escalation-plan
      DisplayName: Incident Escalation
      Type: ESCALATION
      Plan:
        - DurationInMinutes: 5
          Targets:
            - ContactTargetInfo:
                ContactId: !GetAtt OnCallEngineer.Arn
                IsEssential: true

  # Response plan
  ResponsePlan:
    Type: AWS::SSMIncidents::ResponsePlan
    Properties:
      Name: prod-api-response-plan
      IncidentTemplate:
        Title: Production API Incident
        Impact: 1
        Summary: Critical API incident
        NotificationTargets:
          - SnsTopicArn: !Ref IncidentNotificationTopic
      Engagements:
        - !GetAtt EscalationPlan.Arn

  IncidentNotificationTopic:
    Type: AWS::SNS::Topic
    Properties:
      DisplayName: Incident Notifications
      TopicName: incident-notifications

  # CloudWatch alarm → Incident Manager
  APIErrorAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: api-error-rate-alarm
      AlarmDescription: ALB target 5XX count above 100 per minute
      MetricName: HTTPCode_Target_5XX_Count
      Namespace: AWS/ApplicationELB
      Dimensions:
        - Name: LoadBalancer
          Value: app/my-alb/1234567890abcdef  # placeholder; use your ALB's full name
      Statistic: Sum
      Period: 60
      Threshold: 100
      ComparisonOperator: GreaterThanThreshold
      EvaluationPeriods: 2
      AlarmActions:
        - !GetAtt ResponsePlan.Arn

Outputs:
  ResponsePlanArn:
    Value: !GetAtt ResponsePlan.Arn
    Description: Response Plan ARN

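Comparison with similar services

Aspect	Incident Manager	PagerDuty	Opsgenie	incident.io	Splunk On-Call
AWS integration	✅ Native	Integration	Integration	Integration	Integration
On-call management	✅	✅	✅	✅	✅
Escalation	✅	✅ (more advanced)	✅	✅	✅
Automated runbooks	✅ (SSM)	✅	✅	✅	✅
Chat integration	Slack/Teams	Slack/Teams	Slack/Teams	Slack/Teams	Slack
Multi-cloud	AWS only	✅	✅	✅	✅
Post-incident analysis	✅	✅	✅	✅ (detailed)	✅
AI/ML anomaly detection	Limited	✅ (Event Intelligence)	✅	✅	✅ (AIOps)
Pricing	Per response plan, plus messages	Medium	Medium	Medium	Low to medium
Support status	⚠️ Closing to new customers	Ongoing	⚠️ Being retired by Atlassian	Ongoing	Ongoing
Best suited for	AWS-only, in-house	Enterprise	Multi-cloud	Startups	Splunk users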

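Best practices

✅ Recommendations

Design response plans per resource type
- A dedicated plan for compute (EC2, ECS)
- A dedicated plan for databases (RDS, Aurora)
- A dedicated plan for networking (ALB, Route 53)
Rotate the on-call schedule every 1 to 2 weeks
- Prevents burnout
- Keeps several people's skills current
Expand runbooks in stages
- Initially: automated diagnosis only
- Stage 2: automated recovery (low risk)
- Stage 3: complex automated response
Make post-incident analysis mandatory
- For every impact 1-2 incident
- Follow-up items tracked through to completion
Collaborate in real time in Slack or Teams
- Connect a chat channel to each response plan
- Operate Incident Manager with chat commands
Use a replication set for multi-Region coverage
- Essential for workloads deployed across multiple Regions

❌ Anti-patterns

Responding manually without a response plan
- Missed escalations
- Delayed response
No on-call schedule
- Load concentrates on specific people
- Long-term burnout
Runbooks never updated after creation
- Outdated procedures run after the application changes
- Counterproductive (slows the response)
Moving on without a post-incident review
- The same incident recurs
- No organizational learning
No coordination among parallel responders
- Duplicated work
- Opaque progress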

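Troubleshooting

Symptom	Cause	Action
Notifications do not reach a contact	Wrong SMS/email address; channel not activated; missing IAM permissions	Verify the address, activate the channel, check permissions
Runbook does not run	Missing IAM role permissions; wrong document name or ARN	Check the IAM role permissions and the SSM document name/ARN
Nothing is posted to the Slack channel	AWS Chatbot not configured; Slack bot permissions	Authorize the Slack workspace in AWS Chatbot
Escalation does not run	Contact channel misconfigured; stage duration too short	Confirm duration ≥ 5 minutes and check the retry interval
CloudWatch alarm integration does not work	Wrong response plan ARN; alarm misconfigured	Check the ARN used in the alarm action
Post-incident analysis is not available	Incident has not been set to RESOLVED	Set status = RESOLVED with update-incident-record
Multi-Region replication fails	No replication set configured	Specify multiple Regions with create-replication-set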

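2025-2026 developments

Q1-Q2 2025: closing to new customers

AWS phases out availability of Incident Manager to new customers
Existing customers: continued support planned (through 2027)

Q3 2025: closer PagerDuty integration

AWS strengthens its integration with PagerDuty
A migration path to PagerDuty is planned for existing Incident Manager users

2026: AWS Network & Operations (new service)

AWS plans to release a new operations service
Integrated operations built on CloudWatch Application Signals + Q Developer Operations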

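Implementation checklist

Adoption checklist

[ ] Grant Systems Manager Incident Manager permissions through IAM roles and policies
[ ] Create contacts and contact channels that reflect the team structure
[ ] Set up the on-call schedule with a 1 to 2 week rotation
[ ] Define staged notification rules in an escalation plan
[ ] Create a response plan for each resource type
[ ] Connect CloudWatch alarms to response plans
[ ] Configure EventBridge rules for automatic incident detection
[ ] Create SSM Automation runbooks and attach them to response plans
[ ] Authorize the Slack / Teams workspace and configure the chat channel
[ ] Create a post-incident analysis template
[ ] Configure a replication set for multi-Region coverage (if needed)
[ ] Train and onboard the team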

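Summary

AWS Systems Manager Incident Manager is an incident management service that combines automatic incident detection from CloudWatch and EventBridge, on-call management, automated runbook execution, and post-incident analysis. Its AWS-native integration can reduce dependence on external tools such as PagerDuty and Opsgenie and shorten MTTR (mean time to recovery).

⚠️ However, because the service is scheduled to close to new customers (during 2025), existing users are advised to evaluate a migration to PagerDuty or incident.io.

Last updated: 2026-04-26 Version: v2.0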
