A characteristic of virtually all great software engineering organizations that I’ve experienced is that they relentlessly focus on operational excellence. The software and hardware systems we create are remarkably complex. They were built by mere mortals, so many defects were unintentionally designed in. In addition, while software doesn’t age, hardware does, inducing even more failures. And even though software doesn’t age, over time our software systems will encounter scenarios that no one anticipated. All this means that the complex systems we all run are prone to failure. The only logical approach to combating this, and the one Spock would surely take, is to worry constantly that we are on the verge of the next failure, and to embrace a culture of continuously and relentlessly searching out vulnerabilities and remediating them. We exercise this approach at VA, constantly seeking to improve our operational excellence.

We do this through the following:

  • We conduct standups every morning in which we triage all critical system failures from the past 24 hours, ensuring they are being adequately addressed, seeking the underlying cause of each failure, and developing an approach to ensure it doesn’t recur.
  • We define OKRs (Objectives and Key Results) for the organization overall and for each of the teams within it. These OKRs are the critical improvement results we seek to accomplish in the next several months. They span from the high level (e.g., we will have a published vision document for all portfolios) to the very specific (e.g., we will reduce the incidence of change-based failures from X% to Y%).
  • We strive to build systems that are highly resilient, and we track the key aspects of that resiliency. We establish explicit uptime goals linked to the system’s criticality. Higher criticality systems have higher uptime goals, and we work to ensure the architecture of these systems will support these goals.
  • We constantly make priority-based tradeoffs, ensuring that we are focusing our efforts on the most impactful work we could be doing. We constantly have discussions like “Yes, we could fix this, but is the juice worth the squeeze?”
  • We are equally focused on our resource allocation. We stack rank the work we are doing and the work we don’t have the resources to do, so that we’re always focused on the most important work of the organization to serve our Veteran stakeholders. When more resources become available, we fund the next project off the list.
  • We create and maintain scorecards and dashboards that report the status of our systems and the work we do. Our scorecards are designed not to tell the wonderful story of the work we do, but rather to identify the places where there is more work to be done.
  • We exercise operational excellence in our approach to cybersecurity. Nowhere is the landscape more complex than in the cybersecurity of an organization like VA. We are guided by a “zero trust” north star, and we use this north star to define our priorities. We continuously improve our rigor in identifying and engineering out vulnerabilities and in monitoring and responding to threats.
  • We’ve established an Engineering Excellence community of practice, where we can identify best practices and propagate them through the organization.
  • In all we do, we practice a culture of “embrace the red.” We don’t seek to assign blame. We seek to solve problems.
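To make the uptime goals mentioned above more concrete, here is a minimal sketch of how a goal translates into a downtime budget. The tier names and percentages are hypothetical values chosen for illustration and are not VA's actual targets; the function names are likewise assumptions.

```python
# Hypothetical criticality tiers and uptime goals; illustrative values only,
# not taken from the original text.
UPTIME_GOALS = {
    "critical": 0.9999,
    "high": 0.999,
    "standard": 0.99,
}

MINUTES_PER_DAY = 24 * 60


def downtime_budget_minutes(uptime_goal: float, days: int = 30) -> float:
    """Return the minutes of downtime a goal allows over a window of `days`."""
    if not 0.0 <= uptime_goal <= 1.0:
        raise ValueError("uptime_goal must be between 0 and 1")
    return (1.0 - uptime_goal) * days * MINUTES_PER_DAY


def meets_goal(observed_downtime_minutes: float, uptime_goal: float, days: int = 30) -> bool:
    """True if observed downtime stays within the budget for the window."""
    return observed_downtime_minutes <= downtime_budget_minutes(uptime_goal, days)


if __name__ == "__main__":
    for tier, goal in UPTIME_GOALS.items():
        budget = downtime_budget_minutes(goal)
        print(f"{tier}: {goal:.2%} uptime allows {budget:.1f} min per 30 days")
```

Framing the goal as a budget shows why higher-criticality systems need a stronger architecture: each additional "nine" cuts the allowable downtime by a factor of ten.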

Operational excellence and continuous improvement are at the heart of all we do. We haven’t achieved nirvana yet. Because of this, we also constantly seek to improve our approach to operational excellence itself.
