I recently shared some thoughts on how we’re getting “back to basics” at OIT by focusing on our core mission. One of the examples I touched on was our daily standup meeting, where we review all major incidents and systems failures that VA has experienced over the past 24 hours. I’d like to expand a bit on this practice and how it helps us to deliver for Veterans.

Many readers are already familiar with the idea of a daily “standup” meeting from the world of agile. (Though in the era of Zoom and Teams, there’s surely less standing up and more sitting down.) The basic idea of the standup is to give teams a forum to align on a regular basis on projects and daily objectives.

At OIT we’ve taken this practice a step further — putting in place a cross-functional daily standup that allows us to collectively dissect incidents, identify root causes, and strategize preventive measures.

Conducting the Daily Standup

The primary purpose of the Daily Standup is to walk through all major incidents of the past 24 hours. Our goal is to ensure active incidents are being proactively resolved and make sure that the root cause of the resolved incidents is known. A typical meeting will find us exploring a range of topics — from analysis of the root cause at play in a recent incident, to how we can continue to document and share lessons learned.

This isn’t your average team meeting. For one thing, it generally includes hundreds of participants, since it’s open to anyone across the organization. It helps identify needed assistance from other parts of the organization, and colleagues routinely chime in to offer their domain expertise to solve the problem at hand. It is also not a replacement for the meetings that each team conducts as part of its own processes. The purpose of the meeting is to align across the organization on the status of all major incidents, the actions to be taken, and the learnings, both for the incident at hand and more generally.

This is not a forum for finger pointing. Rather, team members are encouraged to candidly acknowledge mistakes and work collaboratively towards solutions. This culture of embracing errors — what we call “embracing the red” — creates an environment where lessons learned from our missteps serve as catalysts for improvement. We want a forum where people are supported when they say “we screwed up” instead of being castigated for doing so.

Nothing ever gets resolved by slamming someone for making a mistake. Everyone will inevitably make mistakes, and these are learning opportunities. When we embrace our mistakes as opportunities to grow, we become stronger as an organization and improve our performance. This is not to say that mistakes have no consequences. If someone consistently makes mistakes, that’s a separate issue that needs to be dealt with through established HR processes.

The meeting host for the day introduces each major incident and passes it to the person responsible for that system. They walk attendees through the “who, what, where” — the chronology of the incident, the status, and, most importantly, what steps they are taking to make sure it doesn’t happen again.

If the incident was caused by a system that is not managed by OIT, we invite that team to present. We extend the same respect to them as we do to our own team members, driving toward a shared understanding of the impact that the outage had on our operations and reiterating how we depend on them for VA’s shared success. We ask the same questions — double clicking on the root causes and remediation plans — as we do of our own team members.

The same applies to contractors, on whom VA depends heavily to execute much of our work. They are deeply embedded within our teams. If an incident was caused by a contractor, that contractor is invited to the standup to describe what happened and how they will ensure it doesn’t happen in the future.

If we are to maintain confidence in our contractors, we must hold them accountable. That creates a challenging dynamic when they’re on the call. We want to be respectful and acknowledge that they too are only human and can make mistakes. But our business relationship with them is designed to hold them accountable for not making mistakes. In our conversations, we make sure they understand that accountability is a major part of our culture, and we hold them to the same standard as employees. Our Office of Strategic Sourcing participates in the standups, so they can take any appropriate actions on our business relationships with contractors.

The Role of the Standup Leader

A single senior leader plays the role of the Operations Standup Leader (or Standup Leader) at each daily standup. They drive the meeting via their questions and suggested course of action. It’s a bit of a Socratic process. For each incident, their goal is to understand how the incident occurred, what we are doing to remediate it, and most importantly, what the implications are for how we’re operating today and what must change in that approach moving forward.

To do this most effectively, they must hold in their mind as much as possible of the organization’s systems, plans, and efforts to improve our operational rigor. This enables them to ask more constructive questions and relate the incident to our progress on those issues. I believe it’s a hallmark of a strong engineering leader that they can see both the forest and the trees: dig deep on the details and see the patterns. This role tests those skills.

Here’s an example. If a certificate expiration brought the system down:

  • Why were those certificates not in our certificate management system?
  • If the team had their certificates in the system, but a downstream system failed to update the cert in response to the email the team sent out, what is wrong with our approach that enabled this to happen?
  • Should the team that owns the certificate have taken additional steps versus just sending out an email and considering their responsibility fulfilled? What exactly are the responsibilities of the team who owns the expiring cert?
  • Are we making enough progress in reducing the number of incidents caused by certificate expirations? If so, then the team deserves kudos. If not, then why not?
  • What should we change about our approach?

In all cases, the Standup Leader reinforces the established processes through their comments and suggestions, while also digging deeper to improve them. It’s the job of the Standup Leader to push for clear next steps — and to wring out every possible drop of learning from the team to help us get better.

There are a few specific questions, which seem to cross over many incidents, that are always in the Standup Leader’s mind:

  • If the incident is change related, what is it about the change process that caused this to happen? Does that team have a lot of change-related incidents?
  • Is this contractor causing a lot of incidents? Do we need to talk with their leadership about this? Do we need to document the issue?
  • Why was this incident caught by end users and not by monitoring? Why didn’t we have an automated alert that this was coming?
  • Why didn’t this system have a backup that was switched to seamlessly? Do we need one in this case, or is it not worth the investment?
  • Are other teams likely to be seeing a similar class of problem? Should we encourage them to reflect on whether they are resilient to this sort of issue?
  • Is this a symptom of a deeper issue that we need to follow up on offline in a longer meeting? Is this system failure prone, deserving of a deep dive? Does this team need to redouble their focus on reliability and resiliency?

Another responsibility of the Standup Leader: controlling the meeting pace. Our meeting only lasts a half hour, and we sometimes have a lot to cover. Some presenters go into too much detail and others not enough. The Standup Leader moves the discussion along, or teases out enough details so we have a full snapshot of the incident.

It’s also the Standup Leader’s responsibility to guard against change fatigue in the organization. Will pushing on a change in response to a major incident or pattern of major incidents be worth the investment? Have you already pushed the team for several things that are more important than this? Are you sure your solution is the right one? The people working on the team live in their area every day, and you don’t. Do you want to offer your solution as a mere suggestion, as a mandate, or not at all? What’s the right balance between pushing hard on improvement and letting the team accomplish what’s already on their plate?

At the end of the day, it’s the Standup Leader’s responsibility to continually reinforce our “embrace the red” culture. We actively search for important issues that will improve our execution. We support those who come forward and “fess up” that they made a mistake, especially when they also describe how they’ll change their approach to make sure it doesn’t happen in the future.

Finally, the tone of the meeting is everything. Above all, it’s a supportive learning environment. At the same time, VA is a massive place, and if we’re going to execute with precision, then every incident needs to be analyzed with that same precision and, where appropriate, turned into action. We need to express an urgency for action — emphasize directly and indirectly the urgency that we become an “execution machine.” This execution precision and intensity is essential to us delivering on our sacred mission to serve the Nation’s Veterans, their families, their supporters, and their survivors.
