Google’s Ease-of-Use Email Encryption Project E2Email Goes Open Source
It looks like Google have been attempting to tackle a long-standing bugbear for many people: the unintuitive nature of personal email encryption.
Conceptually at least, I really like PGP, and have made a few attempts to get my social circle (some of whom might have the chops to hack it, so to speak) to adopt it; but aside from spurious commentary about the why, it’s usually the lack of simple ‘just works‘ tools which has let me down.
There has been a bit of noise about the E2Email project, which promises a simple Chrome application: an easy-to-use (hopefully) Gmail client that exchanges OpenPGP mail.
I know it’s not pure GnuPG or PGP, but all the same this is quite an interesting development from Google, and finally puts some weight behind improving the usability of email encryption.
I haven’t set up a build of this repo yet, so I’ll be interested to see what happens, and how usable it is in its current form.
If you’re interested in encryption facilities inside Gmail, this could be worth following.
I was reading Quality Matters: The Benefits of QA-Focused Retros which got me thinking about this style of activity in the context of IT operations.
It feels like I’ve been seeing a lack of ‘retrospective’ activity in this context in organisations I have worked at (both as consultant and employee) for a very long time indeed.
Probably before I became aware of the term ‘retrospectives’, I would have characterised this activity as a ticket review: a look back through service tickets, incident reports etc. to find patterns of failure, whether recurring failures of the same type, or different but similar issues with a shared or overlapping root cause. The idea being: let’s learn from what went wrong. Not exactly rocket science, you might say.
Without diving too much into the ‘why are retrospectives MIA’, it’s relatively easy to see how small teams who are frequently firefighting fail to see the value in looking back. It can be very time-consuming if you’ve fallen into the firefight cycle, with its often accompanying lack of detailed reporting on what you’ve fixed and why it broke.
A different take on the Agile retrospective
It’s widely held that the idea of ‘retrospectives’ comes from the Agile world. In Operations, whilst we can, and do, work closely alongside Agile development, we need to consider how an Ops-focused retrospective may have a different set of outcomes.
It can be hard to see how we can adopt an Agile sprint retrospective for operations use when we think about its ‘stop-start-continue’ nature (i.e. where each team member involved considers what the team should stop, start or continue doing), because our needs in operations are somewhat different.
There is a pretty straightforward answer to that point, which is to do something different. The outcome of an operations focussed retrospective should rather be aimed at steering how we operate our systems in the future, how we plan and execute preventative work, and how we create feedback from our experiences running platforms.
So what is the benefit?
Our operations retrospectives have considerable value in improving the feedback cycle to development and other stakeholders. Because operations teams are at the sharp end of system failures (both visible to customers and responsible for the break-fix activity), they are in an ideal position to do some analysis and thinking about the nature of the problems being fixed.
Examples of what could be incorporated into feedback are things such as reducing human interaction with deployments, use of version checks, etc.
Ops teams should push for prioritisation of testing and have good practices around log management.
Of course, a lot of this is predicated on operations teams having enabling information and subsystems, like version control (git, svn, mercurial et al.) and centralised (aggregated) logging, in place.
But even if we only have a ticket management system, a commonplace tool for operations, we should still be looking at our records and checking for patterns of failure; making the most of what we have will enable us to make better arguments for enhancing the tooling and techniques which power retrospectives.
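Even with nothing more than a ticket export to hand, the pattern-spotting described above can be sketched in a few lines. This is a hypothetical example: the CSV layout and the ‘root_cause’ field name are assumptions, not taken from any particular ticketing product.

```python
# Hypothetical sketch: mining a ticket export for recurring failures.
# Assumes tickets are available as a CSV with a 'root_cause' column;
# the field name is illustrative, not from any real ticketing tool.
import csv
from collections import Counter

def recurring_failures(path, threshold=3):
    """Count tickets by root cause; return causes seen >= threshold times."""
    causes = Counter()
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            causes[row.get("root_cause", "unknown").strip().lower()] += 1
    return {cause: n for cause, n in causes.items() if n >= threshold}
```

Anything this flags more than a couple of times is a candidate for the preventative work an operations retrospective should be steering.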
This short post came about as I was musing on common themes I am seeing through my own experiences across a variety of companies. I am attempting to crystallise some of the ways in which I think about what communication means, and how we begin to think about changing it.
It may come as something of a surprise to learn that a lot of people in businesses complain about poor communication.
Even more surprising, to me, is that it is often the very people with exposure to, control of, and responsibility for the mechanisms providing access to communications, i.e. people in technology. I’m thinking about telephones, email, wikis, Twitter and social networks.
I suppose this may lead one to think people in tech are not very good communicators; now, where have I heard that before..?
It is interesting to grapple with this problem; I like to characterise it in the following way:
- What – what is the message, fact, opinion, i.e. content?
- Who – who do we want to notice what we’re saying, and are they the right people?
- How – does your intended audience have a means of hearing what you’ve got to say, i.e. what mechanism provides them a convenient way to consume your communications?
If we do a paper exercise and jot down the team’s answers to the above three points, we often find some interesting discrepancies, ranging from technical detail to the actual interest of the audience.
Atlassian recently published a piece on using blogging via Confluence to beneficial effect; their article is well worth a read, especially if you are already a Confluence user. The article is primarily focussed on the benefits of internal blogging, but I would argue that making time in the schedule for public-facing content is highly valuable; it might even help you to attract better engineers, as it publicly demonstrates the tech group’s capability for communication.
There is nothing new or revolutionary about this approach: Sun Microsystems arguably pioneered the public sharing of engineering team information with their now-defunct playground.sun.com site (the earliest Internet Archive snapshot, all the way back from 1996, is here).
Other examples are thetrainline.com, who have been pushing out systems and software engineering articles at engineering.trainline.com since late 2012; and of course there is the requisite Netflix reference to cite as well: a variety of their articles can be seen at their techblog.netflix.com site.
Alerting and escalation: it's almost a case of plan, do, check, act..
Monitoring and alerting form part of the critical operational services for any modern technology-savvy business. Whether these services are derived from internal tools, or from offerings such as Logentries and Upguard, the common factor is that an IT operations group needs to put some thought into how to get the best out of them.
Ultimately, monitoring and alerting are only as good as you make them and every organisation will have some specific tuning needed in order to suit the software and systems which are being monitored, as well as for the team or individuals who will receive alerts and log messages.
If you need to get the ball rolling with your colleagues, here are some easy entry points to the discussion:
- When (i.e. what hours)?
- How many repetitions?
- How can we silence alarms?
- When does my C-suite get woken up?
- Using alerts to carry ‘informational’ stuff decreases the impact of warning classes.
- Overly broad audiences i.e. don’t alert EVERYBODY at once unless you really mean it.
- Unstoppable alerting – obtain, derive, cajole; make sure you can stop an alert when dealing with issues, to prevent unnecessary escalation, and ensure this functionality is not misused.
- Tier your alerts, i.e “first responder”, “escalation level 1”, “escalation level 2”, “wake up the CTO”; or whatever roles and responsibilities fit for your organisation.
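To make the tiering and silencing points above concrete, here is a minimal sketch of tiered alert routing with a silence window. The tier names, severities and functions are all illustrative assumptions, not from any real monitoring product.

```python
# Illustrative sketch of tiered alert routing with a silence window.
# Tier names and severities are assumptions, not from any real product.
import time

TIERS = {
    "info":     [],  # informational only: log it, page nobody
    "warning":  ["first-responder"],
    "critical": ["first-responder", "escalation-1"],
    "disaster": ["first-responder", "escalation-1", "escalation-2", "cto"],
}

silenced_until = {}  # alert name -> unix time the silence expires

def silence(alert, seconds):
    """Suppress an alert while an incident is being actively worked on."""
    silenced_until[alert] = time.time() + seconds

def route(alert, severity, now=None):
    """Return the recipients for an alert, honouring any active silence."""
    now = now if now is not None else time.time()
    if silenced_until.get(alert, 0) > now:
        return []  # silenced: don't keep escalating while the fix is in hand
    return TIERS.get(severity, TIERS["warning"])
```

Calling `silence("checkout-latency", 1800)` while working an incident then keeps `route()` from paging anyone further for half an hour, which is exactly the ‘stop an alert while dealing with issues’ behaviour argued for above.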
In short, we can’t expect to just use ‘defaults’; we need to plan how alerting will work.
Simplistic example, spot the holes!
Below I’ve created a fictitious ‘escalation map‘ that shows how alerts might be handled in a company:
Level 1: Front-line support
Incident detected at time T, clock starts; email and dashboard alarm generated. 15 minutes to acknowledge and resolve; if not, an escalation alarm goes to level 2 and we move to incident state ‘escalated – level 2’.
Level 2: Operations engineers / Application engineers + Front-line Team Manager
Clock is ticking until T + 60 minutes, when an escalation alarm is sent to level 3 if the issue is not acknowledged and resolved; if escalated, the incident state is now ‘escalated – level 3’.
Level 3: Senior engineers / Product Manager(s)
Clock continues; a further 30 minutes to resolve, acknowledge and suppress etc.; if escalated, the incident state is now ‘escalated – level 4’.
Level 4: Developers + Dev Manager
Get the developers on the line, use text messages, email, telephone, semaphore, whatever works, our hair is on fire and we’re drawing straws to see who calls the boss.
Level 5: CTO / IT Director
We don’t ever really want to be here.
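The fictitious escalation map above can be expressed as a simple lookup: given the minutes elapsed since detection at time T, which level currently owns the incident? The thresholds mirror the example (15, 60 and 90 minutes), with everything beyond that lumped together.

```python
# Sketch of the fictitious escalation map as an elapsed-time lookup.
# Thresholds (15, 60, 90 minutes) come from the example levels above.
LEVELS = [
    (15, "level 1: front-line support"),
    (60, "level 2: ops/app engineers + front-line team manager"),
    (90, "level 3: senior engineers / product managers"),
]

def current_level(minutes_since_detection):
    """Return the escalation level responsible after the given elapsed time."""
    for deadline, level in LEVELS:
        if minutes_since_detection < deadline:
            return level
    return "level 4+: developers, dev manager, then CTO"
```

Even a toy like this makes the holes obvious: there is no handling for acknowledged-but-unresolved incidents, out-of-hours rotas, or severity classes, which is rather the point of the exercise.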
Even scribbling out a short set of escalations like this can help shape the thinking around how we deal with incidents, and also give some focus on classifying them.
I see many people who think the first thing to do is shove all their systems into a shiny new monitoring and alerting mechanism without thinking about how it will be used.
It’s extremely useful to ‘have a plan’ for when things go wrong (not if, because they will go wrong). Knowing how long we’ve got to fix a problem before the CTO is called, or before n% of customers will see its manifestation, can sharply focus the mind.
I attended the Agile Cambridge 2015 conference toward the tail end of last year, and spent some time at a few excellent talks. I also presented on the topic of ‘Demystifying Operational Features’. The talk was really aimed at scrum masters and product owners, but hopefully all of the audience found it useful.
There is a video of the presentation on Vimeo linked below, provided by the conference organisers Software Acumen, thanks guys.
With the very recent revelation of code in Juniper software allowing decryption of ‘secure’ VPN traffic, I am wondering if, and how long it will be before, the sheer number of security issues being reported creates acceptance through blindness, i.e. everyone loses their ability to be outraged, offended or concerned. Will we end up with the equivalent of a shoulder shrug and ‘that’s just how it is’ with regard to our security?
It feels likely that the more security outrages that are reported, the more people will become acclimatised to the idea that nothing is secure.
I guess it would be similar to the mere-exposure effect: what I am imagining is a sort of familiarity which leads us to gloss over today’s privacy concerns. There is another way to express this: ‘security desensitisation‘.