IT Operations Retrospectives

I was reading Quality Matters: The Benefits of QA-Focused Retros, which got me thinking about this style of activity in the context of IT operations.

It feels like I’ve been seeing a lack of ‘retrospective’ activities in this context, in organisations I have worked at (both as consultant and employee), for a very long time indeed.

Probably before I became aware of the term ‘retrospectives’ I would have characterised this activity as a ticket review: a look back through service tickets, incident reports etc. to look for recurring failures of the same type, or failures which are different but similar, i.e. issues with a shared or overlapping root cause. The idea being, let’s learn from what went wrong. Not exactly rocket science, you might say.

Without diving too much into the ‘why are retrospectives MIA’, it’s relatively easy to see how small teams who are frequently firefighting fail to see the value in looking back. It can be very time-consuming if you’ve fallen into the firefight cycle, with its often accompanying lack of detailed reporting around what you’ve fixed and why it broke.

A different take on the Agile retrospective

It’s widely held that the idea of ‘retrospectives’ comes from the Agile world. In Operations, whilst we can and do work closely alongside Agile development, we need to consider how an Ops-focused retrospective may have a different set of outcomes.

It can be hard to see how we can adopt an Agile sprint retrospective for operations use when we think about the ‘stop-start-continue’ nature of the Agile sprint, i.e. where each team member involved considers what the team should stop, start or continue doing, because our needs in operations are somewhat different.

There is a pretty straightforward answer to that point, which is to do something different. The outcome of an operations focussed retrospective should rather be aimed at steering how we operate our systems in the future, how we plan and execute preventative work, and how we create feedback from our experiences running platforms.

So what is the benefit?

Our operations retrospectives have considerable value in being used to improve the feedback cycle to development and other stakeholders. Because Operations tends to be at the sharp end of system failures (both customer visible and responsible for the break-fix activity), operations teams are in an ideal position to do some analysis and thinking about the nature of the problems which are being fixed.

Examples of what could be incorporated into feedback are things such as less human interaction with deployments, use of version checks, etc.

Ops teams should push for prioritisation of testing and have good practices around log management.

Of course, a lot of this is predicated on operations teams having enabling information and subsystems like version control (git, svn, mercurial et al), centralised (aggregated) logging etc in place.

But even if we simply have a ticket management system, a commonplace tool for operations, we should still be looking at our records and checking for patterns of failures. Making the most of what we have will enable us to make better arguments for enhancing the tooling and techniques which power retrospectives.
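As a rough illustration of checking ticket records for patterns, here is a minimal Python sketch. The ticket records and their field names (id, service, cause) are made up for the example; a real ticket system export will look different, so treat this as a starting point rather than a recipe.

```python
from collections import Counter

# Hypothetical ticket records; the field names are assumptions for illustration only.
tickets = [
    {"id": 101, "service": "web", "cause": "disk full"},
    {"id": 102, "service": "db", "cause": "failed deployment"},
    {"id": 103, "service": "web", "cause": "disk full"},
    {"id": 104, "service": "mail", "cause": "expired certificate"},
    {"id": 105, "service": "web", "cause": "failed deployment"},
]


def recurring_causes(records, minimum=2):
    """Return (cause, count) pairs seen at least `minimum` times, most frequent first."""
    counts = Counter(record["cause"] for record in records)
    return [(cause, n) for cause, n in counts.most_common() if n >= minimum]


for cause, n in recurring_causes(tickets):
    print(f"{cause}: {n} tickets")
```

Even a count this crude gives the retrospective something concrete to talk about: which causes keep coming back, and on which services.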







Work In Progress Limits

WIP Limits AKA [the] Context Switch overhead [problem]

The idea of suggesting that we do less is often a tricky one to get our heads around. From a business perspective, when we are under pressure to get things done, saying “no, this item has to go in the backlog” is not always easy, and this is especially true where we are firefighting.

So, why do I say “AKA context switch overhead”?

Digging in: before I knew about ‘WIP limits’ as a thing, I used to explain the idea that increasing the number of things we were doing at once would lead to a loss of effectiveness at delivery. My audience was largely Unix systems admins and network engineers, so it was a bit technical. It went something along these lines:

“Think of a time slice scheduler in operating system terms, in our OS it has one job; to divide fixed CPU resources amongst the processes demanding CPU time on your machine. Every time the scheduler switches the CPU to work on a different process there is a fixed small amount of time required to change ‘context’ to the new process (think registers, memory, paging etc)..”

Often the penny would drop at this point in the conversation, people would begin to see that the overall throughput would be lower because the ‘end to end’ or ‘cycle’ processing time for any given process would be longer due to more time spent context switching and less on execution.

The effect of the number of processes on the cycle time

We can show the time scheduling of our imaginary computer mathematically (I hope) using the information below to produce a number which represents this ‘execution time window for processing’ or ‘cycle time’:

Execution time window : e
Available time: T
Number of processes: n
Delay for context switch: d

Using the formula ‘e = (T / n) – d’ we can see the amount of time a process gets on the CPU for execution.

Let’s plug some numbers into this equation to provide a simple example.

e = (T / n) – d

e = (10 / 10) – 0.1 = 0.9 seconds
e = (10 / 40) – 0.1 = 0.15 seconds
e = (10 / 80) – 0.1 = 0.025 seconds
e = (10 / 100) – 0.1 = 0 seconds !

As we can see, our window of execution gets smaller and smaller until it becomes zero. The way around this in our OS would be to introduce a priority scheme which ranks processes in importance of execution (think of the Unix ‘nice’ command), such that the core system will always keep functioning. The side effect is that we will end up with some processes which *may* never get any CPU time. This happens to be a method we can readily integrate into our work planning, i.e. task and project prioritisation.
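The cycle time arithmetic above can be sketched in a few lines of Python. The function name is my own, and flooring the result at zero is an assumption I have added (a window of execution cannot be negative); the values of T and d are the ones used in the example.

```python
def execution_window(total_time, processes, switch_delay):
    """e = (T / n) - d, floored at zero (the floor is an added assumption)."""
    return max(0.0, total_time / processes - switch_delay)


T = 10.0  # available time, in seconds
d = 0.1   # delay for each context switch, in seconds

for n in (10, 40, 80, 100):
    print(f"n={n:>3}  e={execution_window(T, n, d):.3f} seconds")
```

Trying a few other values of n and d is a quick way to get a feel for how sharply the window shrinks as the process count grows.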

The effect of number of processes on throughput

Another part of the exercise is to look at the effect of increasing the number of processes we are running, n, on the total available ‘processing time’, i.e. the time in which processes are executing on the CPU.

Using the formula ‘processing time = T – (d * n)’ we can see the impact on total available CPU time for execution.

10 – ( 0.1 * 10 ) = 9 seconds
10 – ( 0.1 * 40 ) = 6 seconds
10 – ( 0.1 * 80 ) = 2 seconds
10 – ( 0.1 * 100 ) = 0 seconds !

As we see from the above, at 100 processes we are effectively bogged down in context switching only. There is no time for processes on the CPU to execute, meaning effectively zero throughput.
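The same throughput sum as a small Python sketch, again with a function name of my own choosing and a floor at zero which is my addition. It also prints the share of T left for real work, which is not in the figures above but follows directly from them.

```python
def processing_time(total_time, processes, switch_delay):
    """processing time = T - (d * n), floored at zero (the floor is an added assumption)."""
    return max(0.0, total_time - switch_delay * processes)


T = 10.0  # available time, in seconds
d = 0.1   # delay for each context switch, in seconds

for n in (10, 40, 80, 100):
    useful = processing_time(T, n, d)
    print(f"n={n:>3}  processing time={useful:.1f} seconds  ({useful / T:.0%} of T)")
```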

Again, we avoid this in both OS terms and work planning by introducing prioritisation.
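One way to picture prioritisation as a limit on concurrent work is sketched below. This is my own illustration, not a description of how any real scheduler works: the task names, the ‘nice’-style priority numbers (lower means more important) and the idea of a minimum useful window are all assumptions. Rearranging e = (T / n) – d gives the largest n that still leaves each item at least that minimum window, and everything ranked below the cut waits in the backlog.

```python
import math


def max_concurrent(total_time, switch_delay, min_window):
    """Largest n for which (T / n) - d is still at least min_window."""
    return math.floor(total_time / (switch_delay + min_window))


def plan(tasks, limit):
    """Split (name, priority) pairs into in-progress and backlog; lower number = more important."""
    ranked = sorted(tasks, key=lambda task: task[1])
    return ranked[:limit], ranked[limit:]


# Hypothetical work items with nice-style priorities.
tasks = [
    ("patch web servers", 0),
    ("tidy wiki", 15),
    ("fix backup job", -5),
    ("upgrade monitoring", 5),
    ("rename hosts", 19),
]

limit = max_concurrent(total_time=10.0, switch_delay=0.1, min_window=3.0)
in_progress, backlog = plan(tasks, limit)
print("limit:", limit)
print("in progress:", [name for name, _ in in_progress])
print("backlog:", [name for name, _ in backlog])
```

Just as with ‘nice’, the items at the bottom of the ranking may never get any time at all, which is exactly the conversation to have when planning.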

Back to WIP limits

To illuminate our understanding of why limiting WIP is necessary we need to think in terms of flow and time to complete a task or activity. It is reasonable to assert that, as humans, switching our focus from one thing to another comes with a time penalty: a period in which we start thinking about the new or next topic before we are actually doing much execution. This could be expressed as ‘it takes a short while to get going on a different piece of work’.

Recognising that facet of our own mental capabilities is key to grasping why we should limit work in progress.

To illustrate WIP limits affecting flow and throughput there is a great video by David Lowe (@bigpinots) here.

Atlassian also talk about ‘flow’ here. Having enough understanding to talk about why we want to limit concurrent work in progress, and to explain why it’s necessary, can be crucial to convincing others.

Engineering Human Comms


This short post came about as I was musing on common themes I am seeing through my own experiences across a variety of companies. I am attempting to crystallise some of the ways in which I think about what communication means, and how we begin to think about changing it.

It may come as something of a surprise to learn that a lot of people in businesses complain about poor communication.

Even more surprising, to me, is that it is often the very people with exposure to, control of, and responsibility for the mechanisms providing access to communications who complain, i.e. people in technology. I’m thinking about telephones, email, wikis, Twitter and social networks.

I suppose that this may lead one to think people in tech are not very good communicators. Now, when have I heard that before..?

It is interesting to grapple with this problem. I like to characterise it in the following way:

  • What – what is the message, fact, opinion, i.e. content?
  • Who – think about who we want to notice what we’re saying, and whether they are the right people?
  • How – does your intended audience have a means of hearing what you’ve got to say, i.e. what mechanism provides them a convenient way to consume your communications?

If we do a paper exercise and jot down some of the answers from the team around the above three points, we often find some interesting discrepancies, ranging from the technical detail to the actual interest of the audience.

Atlassian recently published a piece on using blogging via Confluence to beneficial effect. Their article is well worth a read, especially if you are already a Confluence user. The article is primarily focussed on the benefit of internal blogging, but I would argue that making time in the schedule for public facing content is highly valuable. It might even help you to attract better engineers, as it publicly demonstrates a capability for communication within the tech group.

There is nothing new or revolutionary about this approach. Sun Microsystems arguably pioneered the public sharing of engineering team information with their now defunct site (the earliest Internet Archive snapshot, all the way back from 1996, is here).

Other examples are the who have been pushing out systems and software engineering articles at since late 2012, and of course there is the requisite Netflix reference to cite as well; a variety of their articles can be seen at their site.


Talking about Operational Features

I attended the Agile Cambridge 2015 conference toward the tail end of last year, and spent some time at a few excellent talks. I also presented on the topic of ‘Demystifying Operational Features’. This talk was really aimed at scrum masters and product owners, but hopefully all of the audience found it useful.

There is a video of the presentation on Vimeo linked below, provided by the conference organisers Software Acumen, thanks guys.

Demystifying operational features for scrum masters and product owners (Matthew Skelton & Rob Thatcher) from Software Acumen on Vimeo.

Technical Debt in Teams and Infrastructure

Having recently read a little on Technical Debt ( ), I began to think about the subject, and came up with a couple of points I’m interested in:

  1. Does technical debt affect Operability?
  2. Should the metaphor be extended to infrastructure and teams?

I’ve tried to keep to some brief thoughts below, and I’d be very interested in hearing some other viewpoints in the comments.
