
xTribe Fusion Hub - Incident Management System Setup Guide (Public Version)

Introduction

This guide provides an overview of xTribe Fusion Hub, an automated incident management system that integrates Datadog, ServiceNow, and PagerDuty. The system uses Argo Workflows for orchestration, NATS for event messaging, and Vector for event processing. Containerized Python scripts handle the creation of incidents and the synchronization of data across all platforms.

Overview of the System Architecture

Below is a high-level architecture diagram of the xTribe Fusion Hub:

Figure 1: xTribe Fusion Hub - Incident Management System Architecture

Architecture Explanation

  1. Datadog Integration:

    • Datadog monitors metrics and Service Level Objectives (SLOs), and generates alerts when anomalies are detected.
    • The system subscribes to Datadog events via an API. Containerized Python scripts then process the received alert data and escalate the alerts into cases or incidents.

    Figure 2: Datadog Monitor Alert - xTribe Fusion Hub incident demo

    Figure 3: Datadog Incident Overview - xTribe Fusion Hub incident demo

  2. ServiceNow Integration:

    • Containerized Python scripts handle the creation or update of a corresponding incident in ServiceNow once a case or incident is created in Datadog.
    • ServiceNow retrieves the incident details through its API and synchronizes the incident data accordingly.

    Figure 4: ServiceNow Incident Overview - xTribe Fusion Hub incident demo

  3. PagerDuty Integration:

    • The same containerized Python scripts are responsible for creating or updating an incident in PagerDuty.
    • The PagerDuty integration ensures that the appropriate on-call team is notified for critical incidents (P1/P2), triggering immediate actions if needed.

    Figure 5: PagerDuty Incident Overview - xTribe Fusion Hub incident demo

  4. NATS & Vector:

    • Vector captures and processes events, then filters and enriches them for downstream services.
    • NATS serves as the messaging backbone, ensuring efficient and real-time communication between all components, especially when handling large event volumes.

  5. Argo Workflows with Containerized Python Scripts:

    • Argo Workflows orchestrates the entire process. It triggers the containerized Python scripts, which handle incident creation, data synchronization, and API communication between Datadog, ServiceNow, and PagerDuty.
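As a rough illustration of how an event might travel between Vector, NATS, and the Python scripts, the sketch below defines a normalized event and serializes it to bytes, the form a NATS message body takes. The IncidentEvent class, its fields, and the "incidents.<source>.<priority>" subject scheme are assumptions made for this example; the guide does not specify the actual message format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class IncidentEvent:
    """Hypothetical normalized event shared by all components."""
    source: str      # platform where the incident originated, e.g. "datadog"
    title: str       # short human-readable summary
    priority: str    # "P1" .. "P5"
    origin_url: str  # link back to the originating record


def subject_for(event: IncidentEvent) -> str:
    """Build a subject such as 'incidents.datadog.p1' (assumed naming scheme)."""
    return f"incidents.{event.source}.{event.priority.lower()}"


def encode(event: IncidentEvent) -> bytes:
    """Serialize the event to UTF-8 JSON so it can be used as a message body."""
    return json.dumps(asdict(event), sort_keys=True).encode("utf-8")


def decode(raw: bytes) -> IncidentEvent:
    """Rebuild the event from a message body produced by encode()."""
    return IncidentEvent(**json.loads(raw))
```

A consumer could subscribe to a wildcard subject such as incidents.*.p1 to receive only P1 events, which is one reason to carry the priority in the subject rather than only in the body.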

Key Interaction Scenarios

1. Incident Detected by Datadog (P1/P2)

  1. Alert Detection in Datadog: Datadog detects an anomaly, such as a critical service outage (P1 or P2 level).

  2. Python Script Execution for Incident Creation: Argo Workflows triggers the execution of containerized Python scripts, which handle the creation of a Datadog case and escalate it into an incident in Datadog.

  3. Propagation to ServiceNow and PagerDuty:

    • The same containerized Python scripts interact with the ServiceNow and PagerDuty APIs to create corresponding incidents in each platform.

  4. Addition of Incident URLs Across Platforms:

    • Once the incidents are created in all three platforms, the ServiceNow incident will be updated to include the URLs for the corresponding Datadog and PagerDuty incidents.
    • Similarly, the PagerDuty and Datadog incidents will include links to the corresponding incidents in the other two platforms, ensuring full visibility across systems.
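The cross-linking step above can be sketched as a pure function that takes the incident URL known for each platform and returns, per platform, the links to the other two. The function name and platform keys are hypothetical. The {"href", "text"} shape is borrowed from the links entries of the PagerDuty Events API v2, and whether the real scripts use that API is an assumption.

```python
from typing import Dict, List

# Platform keys used throughout this sketch (assumed names).
PLATFORMS = ("datadog", "servicenow", "pagerduty")


def cross_links(urls: Dict[str, str]) -> Dict[str, List[Dict[str, str]]]:
    """For each platform with a known URL, list links to the other platforms."""
    return {
        platform: [
            {"href": urls[other], "text": f"{other} incident"}
            for other in PLATFORMS
            if other != platform and other in urls
        ]
        for platform in PLATFORMS
        if platform in urls
    }
```

Each script can then write the list for its own platform into whichever field that platform offers for related links or notes.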

2. Incident Created by a Client in ServiceNow (P1/P2)

  1. Client Incident Creation in ServiceNow: A client detects a critical issue and directly logs a P1 or P2 incident in ServiceNow, which generates a unique incident URL.

  2. Python Script Execution for Synchronization: Argo Workflows triggers the containerized Python scripts, which handle the creation of corresponding incidents in PagerDuty and Datadog.

  3. Addition of Incident URLs Across Platforms:

    • After the incidents are created in PagerDuty and Datadog, their URLs are added to the ServiceNow incident. In turn, the PagerDuty incident will contain links to both the ServiceNow and Datadog incidents, and the Datadog incident will contain links to both the ServiceNow and PagerDuty incidents.
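Both scenarios follow the same pattern: create the incident wherever it does not yet exist, then link all three records. A minimal sketch of that planning step follows, assuming the workflow tracks the URLs already known for an incident. The plan_sync name and the ("create" | "link", platform) step format are illustrative only and are not taken from the actual scripts.

```python
from typing import Dict, List, Tuple

# Platform keys used throughout this sketch (assumed names).
PLATFORMS = ("datadog", "servicenow", "pagerduty")


def plan_sync(known_urls: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return ordered (action, platform) steps for one incident.

    Platforms without a known URL get a "create" step first; every
    platform then gets a "link" step so its record points at the others.
    """
    missing = [p for p in PLATFORMS if p not in known_urls]
    steps = [("create", p) for p in missing]
    steps += [("link", p) for p in PLATFORMS]
    return steps
```

For an incident logged by a client in ServiceNow, only the ServiceNow URL is known, so the plan creates the Datadog and PagerDuty incidents before linking. Running the plan again once all three URLs exist yields only link steps, so a retried workflow does not create duplicates.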

Incident Detection and Faster Response

Incidents can be detected either through Datadog monitoring, which identifies anomalies and triggers alerts, or through ServiceNow, where clients create incidents directly. Once an incident is detected, PagerDuty notifies the CTO team so that the responsible person is alerted immediately, which shortens time to resolution and minimizes downtime.
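Since only critical incidents (P1/P2) are meant to page the on-call team, the scripts need some form of priority gate. The sketch below is one possible shape for it. It assumes priorities arrive as "P1"-style labels or plain numbers from 1 to 5, and the function names are hypothetical.

```python
# Priorities that should trigger a page, per the P1/P2 rule in this guide.
PAGING_PRIORITIES = frozenset({"P1", "P2"})


def normalize_priority(value) -> str:
    """Accept 'p1', 'P1', 1 or '1' and return the canonical 'P<n>' form."""
    text = str(value).strip().upper()
    if text.startswith("P"):
        text = text[1:]
    if not text.isdecimal() or not 1 <= int(text) <= 5:
        raise ValueError(f"unrecognized priority: {value!r}")
    return f"P{int(text)}"


def should_page(value) -> bool:
    """Return True only for priorities that warrant notifying on-call."""
    return normalize_priority(value) in PAGING_PRIORITIES
```

Rejecting unknown values with an error, rather than silently treating them as low priority, keeps a malformed event from suppressing a page unnoticed.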


Benefits of Incident URLs

  • Quick Cross-Platform Navigation: Teams can easily switch between platforms by following the incident URLs provided in the incident records of Datadog, ServiceNow, and PagerDuty.
  • Unified Incident Tracking: This functionality ensures that incidents are not just synchronized but also linked after creation, providing a comprehensive view of the incident from different perspectives (monitoring, service management, and on-call notifications).

Conclusion

This automated incident management system improves operational efficiency and makes incident handling more robust. By integrating Datadog, ServiceNow, and PagerDuty under the orchestration of Argo Workflows, it ensures that critical alerts are addressed promptly and that incident response procedures are consistent across platforms.

This guide serves as a foundational blueprint for deploying a complex, responsive incident management system. As you adapt and evolve your setup, continuous improvement and iteration will be key to maintaining an effective incident response mechanism.

For the complete code, including the Python scripts used for data transformation and workflow operations, please refer to the following GitLab repositories:
