With the rise of cloud, distributed architectures, containers, and microservices, a rise in data overload is visible. With growing amounts of DevOps processes; alerts, repeated mundane jobs etc. have put new demands to both synthesize meaning from this influx of information and connect it to broader business objectives.
AIOps is the application of artificial intelligence for IT operations. AIOps uses machine learning and data science to give IT operations teams a real-time understanding of any issues affecting the availability or performance of the systems under their care. Rather than reacting to issues as they arise in the application environment, AIOps platforms allow IT operations teams to proactively manage performance challenges faster, and in real-time
This case study focuses on solving the following business needs:
1. With an ever-increasing rise in alerts, a large number of incidents were getting generated. There was a need to develop a framework that can generate correlations and identify correlated events, thereby reduce overall incidents volume.
2. For many incidents a reactive strategy does not work and can lead to a loss of reputation; there was a need to develop predictive capabilities that can detect anomalous events and predict critical events well in advance.
3. Given the pressures of reducing the Resolution time and short window of opportunity available to the analysts, there was a need to provide search capabilities so that the analysts can have a head start as to how similar incidents were solved in past.
Data from multiple systems sending alerts, including traditional IT monitoring, log events in text format, application and network performance data etc were made available for the PoC.
The solution framework developed had a discovery phase where the base data was visualized and explored, a NLP driven text mining layer where log data in text format was pre-processed, clustered and correlations were developed to identify related events using Machine Learning algorithms. Topic Mining was used to get a quick overview of a large number of event data. Next, a temporal mining layer explored the temporal relationship between nodes and cluster groups, necessary features were developed on top of the associations generated from temporal layers. Advanced Machine learning algorithms were then developed on these features to predict critical events almost 12 hours in advance. Last but not the least a search layer that computed the similarity of any incident with those in Service Now database was developed that provided analysts insights readily available information on similar incidents and how they were solved in past so that the analysts do not have to reinvent the wheel.