Improving Today’s Dynamic IT with Machine Learning as a Formal & Agile Approach to ITSM
by Stephen Hart
Modern IT Service Management is working pretty well, all things considered. It enables highly distributed teams, parts of which may even work for entirely different companies, to come together to support business services that rely on extremely complex software stacks – and to do it pretty effectively, at least most of the time. All of this is enabled by formal structures and processes, documented and implemented in support systems.
The problem with these formalised approaches to ITSM, however, is the perception that they are too rigid, often consuming too much time and effort for too little benefit. Also, while processes may be well defined, too many steps still require significant manual effort. This is particularly the case with the incident management process. The workflow escalates from event aggregation for anomaly detection, to event analysis for root cause determination, and ultimately to separate communication channels for remediation. Each of these steps requires the use of different domain-specific tools, each run by a specialist team. The results of each step are gathered laboriously and fed back into the main workflow in order to advance towards resolution.
These complications are further compounded by the increasing complexity and rate of change of today’s IT infrastructure. The primary culprit is the march to virtualisation, first of the compute layer and now of the network and storage layers, along with increased mobility and self-service cloud.
The results are not pretty:
- For a single incident as perceived by users, there will be a storm of alerts, generated by applications, infrastructure, and tools – all without any relative prioritisation by severity, let alone cross-domain context.
- The sheer volume of alerts can overwhelm operators, forcing them to prioritise – but without having deep visibility into what is actually affecting services.
- Attempts to reduce the alert volume to manageable levels by placing hard thresholds or filters run the risk of hiding or delaying important information, slowing operators’ reactions to unfolding issues.
- The process for relating clusters of alerts relies on deep skills and existing knowledge of the complete environment. Getting access to that expertise generally means an escalation after a lengthy manual triage process.
- Parallel investigations by different teams, lengthy communications, and more frequent escalations conspire to increase the duration and impact of incidents – not to mention the opportunity cost of all the other tasks that staff are distracted from by the constant need to fight fires.
- Because of the messy and disconnected nature of the process, it can be difficult, if not impossible, to gather a full understanding after an incident has been resolved in order to prevent its recurrence.
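To make the thresholding trade-off in the list above concrete, here is a minimal Python sketch. It is not taken from any particular monitoring tool: the function names, the window size, the 300 ms limit and the latency samples are all illustrative assumptions. It simply contrasts a fixed threshold with a check against a rolling baseline.

```python
from statistics import mean, stdev


def static_threshold_alerts(values, limit):
    """Return the indices of samples that exceed a fixed limit."""
    return [i for i, v in enumerate(values) if v > limit]


def baseline_alerts(values, window=5, k=3.0):
    """Return the indices of samples that depart from the mean of the
    previous `window` samples by more than k standard deviations."""
    alerts = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu, sigma = mean(history), stdev(history)
        # Floor sigma at 1.0 so that tiny wobbles on a flat
        # baseline are not flagged (an arbitrary, assumed floor).
        if abs(values[i] - mu) > k * max(sigma, 1.0):
            alerts.append(i)
    return alerts


# Hypothetical response times in milliseconds: steady, then degrading.
latency_ms = [100, 102, 99, 101, 100, 103, 140, 180, 230, 290, 360]

print(static_threshold_alerts(latency_ms, limit=300))
print(baseline_alerts(latency_ms))
```

On this made-up series the fixed 300 ms limit first fires on the final sample (index 10), whereas the baseline check first fires at index 6, as soon as the metric leaves its normal range. A real deployment would need a far more robust model than a rolling mean, but the difference in reaction time is the point.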
A Structured, But Agile Process Solution
The answer is not, as some may suggest, to get rid of structured approaches to ITSM. Flat hierarchies, informal collaboration, and the removal of “artificial” domain barriers may work at small scale, or temporarily, but such an unstructured approach quickly shows its limitations.
The response is then to reintroduce some sort of structured, albeit agile, process. It has been said that “those who do not understand ITIL are doomed to reinvent it – poorly” (with apologies to George Santayana). There are a lot of valuable lessons from the past codified in those formalised processes, and getting rid of them en masse simply means that the next generation will have the opportunity to make the same mistakes all over again.
A better approach is to augment the formal processes and make them easier for people to work with. In fact, why not let our machines automate the parts that are routine and therefore irritating to humans?
A combination of real-time machine learning and social collaboration technologies is a great way to get ahead of the speed, change and scale that are the hallmarks of today’s digital transformation and dynamic IT. Instead of having skilled humans watching screens for the unexpected, let agile algorithms detect departures from the ordinary. Instead of the ping-pong escalations of email and voicemail, give the right people the tools to work together faster and solve incidents earlier, as they start to unfold. This has the added benefit of not wasting the time of the many unrelated people who get copied into the support thread or dialled into the war room bridge call.
Applied to incident management, this combination can:
- Solve the problem of data overload by cleaning up large-scale, high-rate event data that is full of noise and partial warnings, on top of a CMDB that is at best 80% accurate.
- Virtually eliminate triage by placing the resulting alerts in context – grouping related alerts into situations, decorated with service impacts.
- Enable effective collaboration across technical and organisational domains, reducing incident impact and duration.
- Orchestrate push notifications to relevant experts, service restoration, ticketing, and knowledge recycling.
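As a rough illustration of what "grouping related alerts into situations" could mean, the Python sketch below clusters alerts that arrive close together in time and touch at least one common service. This is a deliberately naive heuristic and not a description of any vendor's algorithm; the `Alert` and `Situation` classes, the `max_gap` parameter, and the sample alerts are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    timestamp: float      # seconds since an arbitrary start point
    source: str           # host or tool that raised the alert
    message: str
    services: frozenset   # services mapped to the source; may be
                          # empty if the CMDB has no entry for it


@dataclass
class Situation:
    alerts: list = field(default_factory=list)
    services: set = field(default_factory=set)

    @property
    def last_seen(self):
        return self.alerts[-1].timestamp


def group_into_situations(alerts, max_gap=120.0):
    """Assign each alert to the first open situation that shares a
    service and saw an alert within `max_gap` seconds; otherwise
    start a new situation."""
    situations = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        for situation in situations:
            recent = alert.timestamp - situation.last_seen <= max_gap
            related = bool(alert.services & situation.services)
            if recent and related:
                break
        else:
            # No match (or no service mapping): open a new situation.
            situation = Situation()
            situations.append(situation)
        situation.alerts.append(alert)
        situation.services |= alert.services
    return situations


# Made-up alerts for illustration only.
alerts = [
    Alert(0, "db-01", "replication lag", frozenset({"checkout"})),
    Alert(30, "web-03", "HTTP 500 rate high", frozenset({"checkout"})),
    Alert(45, "mail-02", "queue depth high", frozenset({"email"})),
    Alert(90, "lb-01", "backend pool degraded",
          frozenset({"checkout", "email"})),
    Alert(900, "db-01", "disk usage 85%", frozenset({"checkout"})),
]

for s in group_into_situations(alerts):
    print(sorted(s.services), [a.source for a in s.alerts])
```

With these sample alerts, the three early alerts affecting the checkout service end up in one situation, the mail queue alert stands alone, and the much later disk alert starts a new situation. An operator then looks at three items labelled with service impact instead of five unrelated alerts. Alerts from sources with no service mapping fall through to their own situation, which is one way an incomplete CMDB shows up in practice.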
This is the way to achieve the best of both worlds: the standardisation and predictability of a formal (but agile!) approach to ITSM, delivered with a much-reduced need for time-consuming manual tasks and inefficient communication. Humans are still in the loop – this is not automation to replace staff, but to augment and extend their capabilities. The unpredictable and evolving nature of today’s IT means that full automation is not yet practical outside of fairly narrow applications, and the problem of ITSM, by definition, cannot be constrained in such a way.
Stephen Hart is the CTO of Moogsoft.