AI Agents at Work: A Deep Dive into Automating Complex IT Incident Management with CrewAI
Managing incidents in IT infrastructure — like server crashes, network slowdowns, or application failures — requires quick detection, investigation, and resolution. While some tasks still require human intervention (e.g., server restarts), we can automate significant portions of the process, such as monitoring system logs and generating detailed incident reports.
In this blog, we will focus on realistic tasks that CrewAI agents can perform:
- Monitoring logs for anomalies.
- Gathering contextual information from external sources.
- Generating reports based on the gathered information.
Let’s dive into building a CrewAI-based system that automates log monitoring and report generation.
What Can Be Automated with CrewAI
- Automated Log Monitoring: Scan server logs for specific error patterns, and raise alerts when anomalies are detected.
- Incident Documentation: Automatically compile details of the issue, such as log entries, system states, and relevant external information (e.g., known vulnerabilities).
- Alerting: Notify the IT team once the incident is documented and ready for review.
Step 1: Installation and Setup
First, install the required packages:
pip install crewai
pip install 'crewai[tools]'
Step 2: Log Monitoring Agent
We’ll create an agent that monitors logs for issues like high CPU usage, network disconnections, or critical errors. This agent uses the FileReadTool to read system logs. Instead of resolving issues directly, the agent flags anomalies and sends them on for further investigation.
Define Log Patterns:
Let’s start by defining what patterns the agent will look for in the logs. These patterns could include high-severity errors or warning messages:
log_patterns = ["ERROR", "CRITICAL", "FATAL", "Connection lost", "CPU overload"]
Monitoring Agent Code:
import os
from crewai import Agent, Task
from crewai_tools import FileReadTool
# Agent to monitor logs
monitoring_agent = Agent(
    role="Log Monitor",
    goal="Monitor system logs and detect critical errors.",
    backstory="An experienced sysadmin who watches logs for early signs of trouble.",
    tools=[FileReadTool()],
    verbose=True
)

# Task for log monitoring
monitoring_task = Task(
    description="Scan server logs for anomalies like CPU overload or connection issues.",
    expected_output="List of detected anomalies and their timestamps.",
    agent=monitoring_agent
)
# Scan the log file for lines matching any known error pattern
def scan_logs(log_file_path):
    with open(log_file_path, 'r') as log_file:
        log_data = log_file.readlines()
    anomalies = [line for line in log_data
                 if any(pattern in line for pattern in log_patterns)]
    return anomalies
Explanation:
The monitoring_agent reads the log file and searches for the defined patterns. Any lines containing critical errors are flagged for further analysis.
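To see the pattern matching in isolation, here is a self-contained sketch of the filtering logic used by scan_logs, run against a few hypothetical log entries (the timestamps and messages are made up for illustration):

```python
# Same pattern list as in the article; the sample entries are hypothetical.
log_patterns = ["ERROR", "CRITICAL", "FATAL", "Connection lost", "CPU overload"]

sample_lines = [
    "2024-05-01 10:00:01 INFO Service started",
    "2024-05-01 10:02:17 ERROR Disk write failed on /dev/sda1",
    "2024-05-01 10:03:44 WARNING High memory usage",
    "2024-05-01 10:05:02 CRITICAL CPU overload on node-3",
]

# Keep only lines containing at least one known error pattern
anomalies = [line for line in sample_lines
             if any(pattern in line for pattern in log_patterns)]

for line in anomalies:
    print(line)
```

Only the ERROR and CRITICAL lines survive the filter; ordinary INFO and WARNING entries are ignored.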
Step 3: Gathering External Context with Web Scraping
When incidents occur, it’s helpful to check whether they are part of a known issue or whether similar problems are being reported elsewhere. For this, we use the FirecrawlCrawlWebsiteTool to fetch additional data from external sources like forums or vulnerability databases. Note that this tool requires a Firecrawl API key.
Gathering External Context:
from crewai_tools import FirecrawlCrawlWebsiteTool
# Agent to gather external context for the issue
context_agent = Agent(
    role="External Data Collector",
    goal="Fetch relevant information from external sources about the anomaly.",
    backstory="A researcher who cross-references incidents against public advisories and forums.",
    tools=[FirecrawlCrawlWebsiteTool()],
    verbose=True
)

# Task for gathering external context
context_task = Task(
    description="Search external sources like forums or vulnerability databases for relevant information.",
    expected_output="Relevant external data about the anomaly.",
    agent=context_agent
)
Explanation:
This agent searches web sources to gather contextual information about the incident. For instance, if an error suggests a security breach, the agent could search for related reports or vulnerability advisories.
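In practice you will want the search to be focused on the specific anomaly, not a generic query. Here is a minimal sketch of one way to turn detected log anomalies into a short query string that could be interpolated into the context task’s description; the helper name and the truncation length are assumptions, not part of the CrewAI API:

```python
# Hypothetical helper: summarise anomalies into a short search query.
anomalies = ["2024-05-01 10:05:02 CRITICAL CPU overload on node-3"]

def build_context_query(anomalies, max_len=200):
    """Strip timestamps and join anomaly messages into one query string."""
    # Each line looks like "DATE TIME MESSAGE"; keep only the message part.
    summary = "; ".join(a.split(maxsplit=2)[-1] for a in anomalies)
    return summary[:max_len]

query = build_context_query(anomalies)
print(query)
```

The resulting string could then be embedded in the task description, so the context agent searches for the actual error rather than a generic phrase.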
Step 4: Generating an Incident Report
The next step is to automatically generate an incident report based on the findings from the logs and external data. This can be formatted in Markdown or another readable format and sent to the relevant IT staff.
Report Generation Code:
# Agent to compile the incident report
reporting_agent = Agent(
    role="Incident Reporter",
    goal="Generate a detailed incident report based on log data and external sources.",
    backstory="A meticulous engineer who documents incidents clearly for the IT team.",
    verbose=True
)

# Task to generate the report
reporting_task = Task(
    description="Compile a report including log anomalies, external context, and resolution suggestions.",
    expected_output="Markdown formatted incident report.",
    agent=reporting_agent
)

# Function to create a Markdown report
def create_report(anomalies, external_data):
    report_content = "# Incident Report\n\n## Detected Anomalies\n"
    for anomaly in anomalies:
        report_content += f"- {anomaly}\n"
    report_content += "\n## External Information\n"
    for data in external_data:
        report_content += f"- {data}\n"
    return report_content
# Example usage
log_anomalies = scan_logs("system_logs.txt")
external_data = ["Security Advisory X found", "Reported issue Y"]
report = create_report(log_anomalies, external_data)

# Save the report as Markdown
with open("incident_report.md", "w") as report_file:
    report_file.write(report)
Explanation:
This function generates a detailed report that includes the detected log anomalies and any relevant information from external sources. It then saves this as a Markdown file for easy review by IT staff.
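If the monitoring runs repeatedly, writing to a fixed filename will overwrite earlier reports. A small refinement, sketched below with standard-library tools only (the directory name and timestamp format are arbitrary choices), is to write each report to a timestamped file:

```python
from datetime import datetime, timezone
from pathlib import Path

def save_report(report_text, directory="reports"):
    """Write the report to a timestamped file so repeated runs don't overwrite each other."""
    Path(directory).mkdir(exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = Path(directory) / f"incident_report_{stamp}.md"
    path.write_text(report_text)
    return path

path = save_report("# Incident Report\n\n- example entry\n")
print(path.name)
```

Each run then leaves an audit trail of past incidents rather than a single rolling file.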
Step 5: Bringing It All Together
Now we will connect all the agents and tasks into a single workflow. The log monitoring agent will flag errors, the context agent will gather external data, and the reporting agent will compile a comprehensive incident report.
CrewAI Workflow:
from crewai import Crew

# Define the crew to run the tasks in sequence
incident_crew = Crew(
    agents=[monitoring_agent, context_agent, reporting_agent],
    tasks=[monitoring_task, context_task, reporting_task],
    verbose=True
)

# Execute the crew tasks
incident_crew.kickoff()
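The alerting step mentioned earlier (notifying the IT team once the report is ready) could hang off the end of this workflow. Below is a minimal sketch of assembling a notification payload; the field name and message format are assumptions and would need to match whatever webhook or chat integration your team actually uses:

```python
import json

def build_alert_payload(report_path, anomaly_count):
    """Assemble a JSON notification payload for a hypothetical team webhook."""
    return json.dumps({
        "text": f"Incident report ready: {report_path} "
                f"({anomaly_count} anomalies detected)"
    })

payload = build_alert_payload("incident_report.md", 2)
print(payload)
```

Posting this payload to a chat webhook (for example with the requests library) would close the loop from detection to notification.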
By automating log monitoring, external data gathering, and incident reporting using CrewAI, you can significantly reduce the time and effort needed to identify and document IT infrastructure issues. This workflow is easily extendable with custom tools and integrations to alert your team or even trigger automated responses in the future.
The goal is not to replace human intervention in critical fixes but to automate repetitive tasks, allowing your IT team to focus on high-priority issues.
For more such implementations, or to discuss AI implementations in any kind of workflow, reach out at kshitijkutumbe@gmail.com.