Whilst attending Pipeline Conference I noted down things which I found to give food for thought. To me this is the acid test of a talk – does it create a spark of thinking? Automation and Autonomy Dan North (@tastapod) gave the opening keynote “Ops and Operability“. Dan talked about both Ops and Dev in…
Google’s Ease-of-Use Email Encryption Project E2Email Goes Open Source
It looks like Google have been attempting to tackle a longstanding bug-bear of many people, the un-intuitive nature of personal email encryption.
Conceptually at least, I really like PGP and have made a few attempts to get my social circle (some of whom might have the chops to hack it, so to speak) to adopt it, but aside from spurious commentary about the why, it’s usually the lack of simple ‘just works’ tools which has let me down.
There has been a bit of noise about the E2Email project, which promises a simple Chrome application: an easy to use (hopefully) Gmail client that exchanges OpenPGP mail.
I know it’s not pure GnuPG or PGP, but all the same this is quite an interesting development from Google, and it seems that some weight may finally have been thrown into the ring to improve the usability of email encryption.
I haven’t set up a build of this repo yet, so I’ll be interested to see what happens, and how useable it is in its current form.
If you’re interested in encryption facilities inside Gmail, this could be worth following.
I was reading Quality Matters: The Benefits of QA-Focused Retros which got me thinking about this style of activity in the context of IT operations.
It feels like I’ve been seeing a lack of ‘retrospective’ activities in this context, in organisations I have worked at (both as consultant and employee), for a very long time indeed.
Probably before I became aware of the term ‘retrospectives’ I would have characterised this activity as a ticket review: a look back through service tickets, incident reports etc. to look for patterns of failures, or recurring failures of the same or a similar type, i.e. issues with a shared or overlapping root cause. The idea being, let’s learn from what went wrong. Not exactly rocket science, you might say.
Without diving too much into the ‘why are retrospectives MIA’, it’s relatively easy to see how small teams who are frequently firefighting fail to see the value in looking back. It can be very time-consuming if you’ve fallen into the firefight cycle, with its often accompanying lack of detailed reporting around what you’ve fixed and why it broke.
A different take on the Agile retrospective
It’s widely held that the idea of ‘retrospectives’ comes from the Agile world. In Operations, whilst we can and do work closely alongside Agile development, we need to consider how an Ops-focused retrospective may have a different set of outcomes.
It can be hard to see how we can adopt an Agile sprint retrospective for operations use when we think about the ‘stop-start-continue’ nature of the Agile sprint, i.e. where each team member involved considers what the team should stop, start or continue doing, because our needs in operations are somewhat different.
There is a pretty straightforward answer to that point, which is to do something different. The outcome of an operations focussed retrospective should rather be aimed at steering how we operate our systems in the future, how we plan and execute preventative work, and how we create feedback from our experiences running platforms.
So what is the benefit?
Our operations retrospectives have considerable value in being used to improve the feedback cycle to development and other stakeholders. Because Operations tend to be at the sharp end of system failures (both customer visible and responsible for the break-fix activity), operations teams are in an ideal position to do some analysis and thinking about the nature of the problems which are being fixed.
Examples of what could be incorporated into feedback are things such as lower human interaction with deployments, use of version checks, etc.
Ops teams should push for prioritisation of testing and have good practices around log management.
Of course, a lot of this is predicated on operations teams having enabling information and subsystems like version control (git, svn, mercurial et al), centralised (aggregated) logging etc in place.
But even if we simply have a ticket management system and nothing more, a commonplace tool for operations, we should still be looking at our records and checking for patterns of failures. Making the most of what we have will enable us to make better arguments for enhancing the tooling and techniques which power retrospectives.
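As a rough illustration of checking ticket records for patterns, here is a minimal Python sketch. It assumes the tickets have already been exported from the ticket system as simple records; the field names `service` and `root_cause`, the sample data and the `recurring_causes` helper are all hypothetical and would need adapting to whatever your own system actually records.

```python
from collections import Counter

# Hypothetical ticket export: the field names and values are assumptions
# for illustration, not the schema of any particular ticketing tool.
tickets = [
    {"id": 1, "service": "web", "root_cause": "disk full"},
    {"id": 2, "service": "db", "root_cause": "expired certificate"},
    {"id": 3, "service": "web", "root_cause": "disk full"},
    {"id": 4, "service": "web", "root_cause": "bad deploy"},
    {"id": 5, "service": "web", "root_cause": "disk full"},
    {"id": 6, "service": "db", "root_cause": "expired certificate"},
]


def recurring_causes(tickets, min_count=2):
    """Return (service, root_cause) pairs seen at least min_count times,
    most frequent first."""
    counts = Counter((t["service"], t["root_cause"]) for t in tickets)
    return [(key, n) for key, n in counts.most_common() if n >= min_count]


for (service, cause), n in recurring_causes(tickets):
    print(f"{service}: {cause} ({n} tickets)")
```

Even a count this crude gives a retrospective something concrete to talk about, and the pairs that keep reappearing are natural candidates for preventative work.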
Today, the day of the UK EU referendum, is marked by rolling thunder and lightning over London, both the evening before and on the day. Perhaps those are portents of a change coming to the land. Today the population apparently decides if the UK remains inside the EU or leaves to stand alone.
I have cast my vote.
Whatever the outcome of the voting, in my eyes and those of many others, I am certain the UK will be viewed as a different place.
The vitriol which has been whipped up by the media has exposed a simmering undercurrent of dissatisfaction and fear amongst many people, and undermined the moderating effect of rational thought.
I have seen family, acquaintances and strangers alike promoting and stating views which I had formerly believed were consigned to history, which saddens me greatly. Especially where those views are in direct contradiction to values of diversity, tolerance, debate and reason under which I was raised and educated.
For some of those people I have an understanding of their frustration and thinking, but what I cannot understand at all is the sheer force of belief that leaving the EU will change things in the way they think they want.
The clock will not be wound back to 1960 or some ‘Great Britain’ golden age.
Leaving the EU, our closest continental neighbours, and root-stock of many ‘British’ peoples will, I think, result in the UK becoming a culturally poorer, financially poorer and more polarised society. But hey, stopping people being united is an old and easy trick for conquering them.
If it’s done anything at all, this referendum has exposed just how illiberal, unsophisticated, and unwilling to inform themselves people can be, and how politicians will strive in any way they see fit for a grasp of power.
The greatest irony of all.. a central pillar of the argument has been around immigration, in a country built from hundreds of years’ worth of waves of immigration. From the Nordic, to the Saxon, to the Commonwealth, the UK has long been a melting pot of immigrants, and always will be; maybe that’s part of what made it Great?
WIP Limits AKA [the] Context Switch overhead [problem]
The idea of suggesting that we do less is often a tricky one to get our heads around. From a business perspective, when we are under pressure to get things done, saying “no, this item has to go in the backlog” is not always easy, and this is especially so where we are firefighting.
So, why do I say “AKA context switch overhead”?
Digging in. Before I knew about ‘WIP limits’ as a thing, I used to explain this idea that increasing the number of things we were doing at once would lead to loss of effectiveness at delivery. My audience was largely Unix systems admin and network engineers so it was a bit technical, it went something along these lines..
“Think of a time slice scheduler in operating system terms, in our OS it has one job; to divide fixed CPU resources amongst the processes demanding CPU time on your machine. Every time the scheduler switches the CPU to work on a different process there is a fixed small amount of time required to change ‘context’ to the new process (think registers, memory, paging etc)..”
Often the penny would drop at this point in the conversation, people would begin to see that the overall throughput would be lower because the ‘end to end’ or ‘cycle’ processing time for any given process would be longer due to more time spent context switching and less on execution.
The effect of the number of processes on the cycle time
We can show the time scheduling of our imaginary computer mathematically (I hope) using the information below to produce a number which represents this ‘execution time window for processing’ or ‘cycle time’:
Execution time window : e
Available time: T
Number of processes: n
Delay for context switch: d
Using the formula ‘e = (T / n) - d’ we can see the amount of time a process gets on the CPU for execution.
Let’s plug some numbers into this equation to provide a simple example, taking T = 10 seconds and d = 0.1 seconds and increasing n:
e = (T / n) - d
e = (10 / 10) - 0.1 = 0.9 seconds
e = (10 / 40) - 0.1 = 0.15 seconds
e = (10 / 80) - 0.1 = 0.025 seconds
e = (10 / 100) - 0.1 = 0 seconds !
As we can see, our window of execution gets smaller and smaller until it becomes zero. The way around this in our OS would be to introduce a priority scheme, which ranks processes in importance of execution (think of the Unix ‘nice’ command) such that the core system will always keep functioning; the side effect is that we will end up with some processes which *may* never get any CPU time. This happens to be a method we can readily integrate into our work planning, i.e. task and project prioritisation.
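For anyone who would rather play with the numbers than read them, the cycle time formula can be sketched in a few lines of Python. This is only a toy model of the imaginary scheduler described above, not how a real OS scheduler behaves; the function name `execution_window` is made up for illustration, and flooring the result at zero is my own assumption so that the window never goes negative.

```python
def execution_window(total_time, n_processes, switch_delay):
    """Toy model: e = (T / n) - d, floored at zero (the floor is an assumption)."""
    if n_processes <= 0:
        raise ValueError("n_processes must be positive")
    return max(0.0, total_time / n_processes - switch_delay)


# Same figures as the worked example: T = 10 seconds, d = 0.1 seconds.
for n in (10, 40, 80, 100):
    e = execution_window(10, n, 0.1)
    print(f"n = {n:3d}  e = {round(e, 3)} seconds")
```

Swapping ‘processes’ for ‘tasks in progress’ and ‘seconds’ for ‘hours in the week’ gives a quick way to show a team how fast the useful window shrinks.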
The effect of number of processes on throughput
Another part of the exercise is to look at the effect of increasing the number of processes we are running, n, on the total available ‘processing time’, i.e. the time in which processes are executing on the CPU.
Using the formula ‘processing time = T - (d * n)’ we can see the impact on the total CPU time available for execution:
10 - ( 0.1 * 10 ) = 9 seconds
10 - ( 0.1 * 40 ) = 6 seconds
10 - ( 0.1 * 80 ) = 2 seconds
10 - ( 0.1 * 100 ) = 0 seconds !
As we see from the above, at 100 processes we are effectively bogged down in context switching alone. There is no time left for processes on the CPU to execute, meaning effectively zero throughput.
Again, we avoid this in both OS terms and work planning by introducing prioritisation.
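The throughput side of the argument can be sketched in the same toy fashion. The names `processing_time` and `utilisation` are hypothetical, the zero floor is again an assumption, and expressing the result as a fraction of the available time is an addition of mine which simply restates the same formula.

```python
def processing_time(total_time, n_processes, switch_delay):
    """Toy model: T - (d * n), floored at zero (the floor is an assumption)."""
    return max(0.0, total_time - switch_delay * n_processes)


def utilisation(total_time, n_processes, switch_delay):
    """Fraction of the available time spent executing rather than switching."""
    return processing_time(total_time, n_processes, switch_delay) / total_time


# Same figures as above: T = 10 seconds, d = 0.1 seconds.
for n in (10, 40, 80, 100):
    t = processing_time(10, n, 0.1)
    u = utilisation(10, n, 0.1)
    print(f"n = {n:3d}  processing time = {round(t, 3)} s  ({round(u * 100)}% useful)")
```

Put as a percentage, the point is hard to miss: every extra item in flight takes a fixed bite out of the time left for actual work.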
Back to WIP limits
To illuminate our understanding of why limiting WIP is necessary, we need to think in terms of flow and the time to complete a task or activity. It is reasonable to assert that, as humans, switching our focus from one thing to another comes with a time penalty: a period in which we start thinking about the new or next topic before we are actually doing much execution. This could be expressed as ‘it takes a short while to get going on a different piece of work’.
Recognising that facet of our own mental capabilities is key to grasping why we should limit work in progress.
Atlassian also talk about ‘flow’. Having enough understanding to be able to talk about why we want to limit concurrent work in progress, and to explain why it’s necessary, can be crucial to convincing others.
This short post came about as I was musing on common themes I am seeing through my own experiences across a variety of companies. I am attempting to crystallise some of the ways in which I think about what communication means, and how we begin to think about changing it.
It may come as something of a surprise to learn that a lot of people in businesses complain about poor communication.
Even more surprising to me is that it is often the very people with exposure to, control of, and responsibility for the mechanisms providing access to communications, i.e. people in technology. I’m thinking about telephones, email, wikis, Twitter and social networks.
I suppose that this may lead one to think people in tech are not very good communicators, now when have I heard that before..?
It is interesting to grapple with this problem. I like to characterise it in the following way:
- What – what is the message, fact, opinion, i.e. content?
- Who – think about who we want to notice what we’re saying, and if they are the right people?
- How – does your intended audience have a means of hearing what you’ve got to say, i.e. what mechanism provides them a convenient way to consume your communications?
If we do a paper exercise and jot down some of the answers from the team around the above three points, we often find some interesting discrepancies, ranging from the technical detail to the actual interest of the audience.
Atlassian recently published a piece on using blogging via Confluence to beneficial effect. Their article is well worth a read, especially if you are already a Confluence user. The article is primarily focussed on the benefit of internal blogging, but I would argue that making time in the schedule for public facing content is highly valuable. It might even help you to attract better engineers, as it publicly demonstrates the tech group’s ability to communicate.
There is nothing new or revolutionary about this approach. We can still see Sun Microsystems, arguably a pioneer of sharing engineering team information publicly, with their now defunct playground.sun.com site (the earliest Internet archive snapshot all the way back from 1996 is here).
Other examples are thetrainline.com, who have been pushing out systems and software engineering articles at engineering.trainline.com since late 2012, and of course there is the requisite Netflix reference to cite as well; a variety of their articles can be seen at their techblog.netflix.com site.
Alert and escalation, it's almost a case of plan, do, check, act..
Monitoring and alerting form part of the critical operational services for any modern technology savvy business. Whether these services are derived from internal tools, or from offerings such as Logentries and Upguard, the common factor is that an IT operations group needs to put some thought into how to get the best out of them.
Ultimately, monitoring and alerting are only as good as you make them and every organisation will have some specific tuning needed in order to suit the software and systems which are being monitored, as well as for the team or individuals who will receive alerts and log messages.
If you need to get the ball rolling with your colleagues, here are some easy entry points to the discussion:
- When (i.e. what hours)?
- How many repetitions?
- How can we silence alarms?
- When does my C-suite get woken up?
- Using alerts to carry ‘informational’ stuff decreases the impact of warning classes.
- Overly broad audiences i.e. don’t alert EVERYBODY at once unless you really mean it.
- Unstoppable alerting – obtain, derive, cajole, make sure you can stop an alert when dealing with issues to prevent unnecessary escalation and ensure this functionality is not mis-used.
- Tier your alerts, i.e. “first responder”, “escalation level 1”, “escalation level 2”, “wake up the CTO”; or whatever roles and responsibilities fit for your organisation.
In short, we can’t expect to just use ‘defaults’; we need to plan how alerting will work.
Simplistic example, spot the holes!
Below I’ve created a fictitious ‘escalation map’ that shows how alerts might be handled in a company:
Level 1: Front-line support
Incident detected at time T, clock starts, email and dashboard alarm generated; 15 minutes to acknowledge and resolve, otherwise an escalation alarm is sent to level 2 and we move to the escalated incident state ‘escalated – level 2’
Level 2: Operations engineers / Application engineers + Front-line Team Manager
Clock is ticking until T + 60 minutes, when an escalation alarm is sent to level 3 if the issue is not acknowledged and resolved; if escalated, the incident state is now ‘escalated – level 3’
Level 3: Senior engineers / Product Manager(s)
Clock continues, a further 30 minutes to resolve, acknowledge and suppress etc; if escalated, we now have incident state modified to ‘escalated – level 4’.
Level 4: Developers + Dev Manager
Get the developers on the line, use text messages, email, telephone, semaphore, whatever works, our hair is on fire and we’re drawing straws to see who calls the boss.
Level 5: CTO / IT Director
We don’t ever really want to be here.
Even scribbling out a short set of escalations like this can help shape the thinking around how we deal with incidents, and also give some focus on classifying them.
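If it helps the discussion, the same fictitious map can be written down as data. The Python below is only a sketch of one reading of the map: I have assumed the thresholds are minutes measured from detection time T (15, then T + 60, then a further 30 making 90), the class and function names are invented for illustration, and levels 4 and 5 deliberately have no time trigger because the map above does not give one.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EscalationLevel:
    level: int
    who: str
    # Minutes after detection (T) at which an unresolved incident moves on
    # to the next level; None means the map does not specify a limit.
    escalate_at_minutes: Optional[int]


# One reading of the fictitious map above; the thresholds are assumptions.
ESCALATION_MAP = [
    EscalationLevel(1, "Front-line support", 15),
    EscalationLevel(2, "Operations engineers / Application engineers + Front-line Team Manager", 60),
    EscalationLevel(3, "Senior engineers / Product Manager(s)", 90),
    EscalationLevel(4, "Developers + Dev Manager", None),
    EscalationLevel(5, "CTO / IT Director", None),
]


def current_level(minutes_since_detection):
    """Return the level an unresolved incident sits at after the given time."""
    for lvl in ESCALATION_MAP:
        limit = lvl.escalate_at_minutes
        if limit is None or minutes_since_detection < limit:
            return lvl
    return ESCALATION_MAP[-1]


for minutes in (5, 20, 75, 120):
    lvl = current_level(minutes)
    print(f"T + {minutes} min: level {lvl.level} ({lvl.who})")
```

Writing it down this way exposes at least one of the holes straight away: nothing in the map ever moves an incident from level 4 to level 5 on the clock, so that call is left entirely to human judgement.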
I see many people who think the first thing to do is shove all their systems into a shiny new monitoring and alerting mechanism without thinking about how it will be used.
It’s extremely useful to ‘have a plan’ for when things do go wrong, not if, because they will go wrong. Knowing how long we’ve got to fix a problem before the CTO is called up, or before n% of customers will see its manifestation, can sharply focus the mind.