Operational Excellence
Introduction
We reuse shared solutions, foster a learning culture of continuous improvement, and ensure security is built into every product by making sure teams have access to the tools and resources needed to deliver secure, reliable, and scalable solutions for Veterans.
By leveraging shared platforms and capabilities, we make it easier to access what you need to deliver efficiently. The VA Way isn’t about checking off tasks. It’s about continuously improving and finding smarter ways to meet Veterans’ needs.
Read the attributes below to access the detailed guidance, tools, and resources for each.




Operational Excellence Attributes:
2.1 We measure and continuously improve performance using transparent, common metrics across teams
Consistent performance tracking is integral not only to successful product delivery, but also to the effective distribution of product team efforts. To ensure our products are operating at the highest standards for Veterans, their families, and their caregivers, every product team will track a common suite of performance metrics.
Ultimately, VA strives for all systems to achieve “three-nines” of uptime (i.e., the application or system is functional 99.9% of the time). Product teams must rigorously monitor their systems to catch issues in real time and conduct preventative maintenance.
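To make the three-nines target concrete, the arithmetic can be sketched as below. This is an illustrative calculation only: the function name `downtime_budget` and the 30-day and 365-day windows are assumptions chosen for the example, not VA-defined measurement periods.

```python
from datetime import timedelta


def downtime_budget(availability_pct: float, period: timedelta) -> timedelta:
    """Return the maximum downtime allowed in `period` at the given availability."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    # The unavailable fraction of the period is the downtime budget.
    return period * (1 - availability_pct / 100)


# Assumed reporting windows, for illustration only.
budget_30d = downtime_budget(99.9, timedelta(days=30))
budget_365d = downtime_budget(99.9, timedelta(days=365))

print(f"30 days:  {budget_30d.total_seconds() / 60:.1f} minutes")
print(f"365 days: {budget_365d.total_seconds() / 3600:.2f} hours")
```

At 99.9%, the budget works out to roughly 43 minutes in a 30-day window and just under 9 hours over 365 days, which is why real-time monitoring and fast incident response matter.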
Best Practices
- Measure operational performance using an approved application performance monitoring tool
- Ensure you are collecting metrics that can indicate the operational health of the product, especially the four golden signals: latency, traffic, errors, and saturation
- Regularly review metrics and incorporate them into daily workflows in forums like standup meetings or staff meetings to keep staff engaged in meeting targets
- Use a Custom Development platform with out-of-the-box monitoring, or integrate approved application performance monitoring (APM) tools with applications
- Develop robust incident management protocols for systems, including communication materials and templates to be shared with end-users
- Conduct thorough root cause analyses (RCAs) when errors are identified in production environments and communicate the results with end-users
- Continuously track indicators of reliability over time (e.g., CPU usage, web performance, error rate) as well as other performance metrics to quickly identify and resolve issues as they occur
- Discuss metrics frequently and use results to drive the direction of future development work
- Transparently document where metrics are stored and how to access them
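As a rough illustration of the four golden signals named above, the sketch below summarizes one monitoring window from a list of request records. Everything here is hypothetical: the `Request` fields, the `golden_signals` function, the choice to count only 5xx responses as errors, and the use of a CPU busy fraction as the saturation measure are assumptions for the example. In practice, an approved APM tool would normally collect and compute these values.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request record; field names are assumptions."""
    duration_ms: float  # time taken to serve the request
    status: int         # HTTP status code returned


def golden_signals(requests, window_seconds, cpu_busy_fraction):
    """Summarize one monitoring window into the four golden signals."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    count = len(requests)
    if count == 0:
        p95 = None
        error_rate = 0.0
    else:
        durations = sorted(r.duration_ms for r in requests)
        # Nearest-rank 95th percentile: smallest value covering 95% of requests.
        p95 = durations[math.ceil(0.95 * count) - 1]
        # Assumption: only server-side (5xx) responses count as errors.
        error_rate = sum(1 for r in requests if r.status >= 500) / count
    return {
        "latency_p95_ms": p95,                   # latency
        "traffic_rps": count / window_seconds,   # traffic
        "error_rate": error_rate,                # errors
        "saturation": cpu_busy_fraction,         # saturation (assumed CPU proxy)
    }


# Made-up sample data for a 10-second window.
sample = [Request(d, s) for d, s in [
    (120, 200), (80, 200), (95, 200), (2400, 503), (110, 200),
    (105, 200), (90, 500), (130, 200), (100, 200), (85, 200),
]]
signals = golden_signals(sample, window_seconds=10, cpu_busy_fraction=0.72)
print(signals)
```

A percentile is used for latency rather than an average because a single slow request (2400 ms in the sample) is exactly the kind of user-facing problem an average tends to hide.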
Guiding Questions
- Is there a minimum bar for performance that you would expect within a particular metrics category?
- How are other teams measuring performance of similar products? What metrics are being tracked for similar products in the Critical 100?
- Is there a clear way for everyone on your team to access these metrics (e.g., dashboard, regular reporting cadence)?
- What does performance in a particular metric tell you about the current distribution of resources?
- Where else can you prioritize efforts to improve on metrics?
Useful Links:
Key Contacts:
2.2 We reuse solutions whenever possible, prioritizing efficiency
Reusing existing software enables OIT to deliver high-quality software faster, maintain existing solutions, and allocate resources more efficiently. Reuse can take several different approaches, like leveraging an existing SaaS or PaaS product, or adding a feature or capability to an existing Custom application where most of the code can be reused.
Teams should prioritize reuse first. If no existing product can be used or modified to support the requirement, teams should explore if there is a SaaS solution that could be purchased or if custom development using a PaaS or custom software platform is appropriate.
Best Practices
- If applicable, work with portfolio Intake to identify whether an existing solution meets the need
- Leverage OIT resources to identify existing products that can be modified to address the problem at-hand (e.g., DVA Product Marketplace, CODE VA for platforms, VA System Inventory (VASI))
- Raise new development needs to the Product Line Manager (PLM) or other lead to flag that another solution does not exist
- If reuse is not an option, ensure the function and purpose of the new solution are clearly and accurately documented in VASI for future reuse needs
Guiding Questions
- Can the need be filled by an existing solution?
- Have you considered all possible reuse options?
- What are other potential future uses for this need?
Training
- Tech Tuesday 46 | How Does IO Automate? We Have the Ansible to That Question. (September 24, 2024)
- Tech Tuesday 40 | Harness the Benefits of VASI (June 04, 2024)
- Tech Tuesday 27 | VA’s SaaS/PaaS Software Factory Advantage (Oct 10, 2023)
- Tech Tuesday 16 | VA Software Factory Model Overview (Apr 25, 2023)
Runbooks
- VASI_Handbook_20210617.pdf: This handbook is for anyone who manages or interacts with systems in the VA’s environment. It ensures that VASI is managed as the authoritative source for all VA IT systems.
Key Contacts:
- Contact 1
- Contact 2
- Contact 3
OIT’s continuously improving suite of platforms, capabilities, and utilities supports efficient, high-quality software development. Collectively, these shared resources automate processes and provide features and solutions out-of-the-box.
Teams are expected to use shared resources as the default approach to:
- Reduce cost by avoiding duplicate work, and
- Ensure they are driving towards the best outcomes for end-users across reliability, security, and performance
Best Practices
- Use CODE VA to identify potential shared product development resources
- Leverage a Software Factory approved custom software platform as first choice when pursuing a custom application
- Work with platform resource teams to understand and utilize the full extent of features and capabilities available for your development project
- When planning a new feature, consider whether any existing utilities may fulfill all or part of the functionality your team is building
- Opt for SWF Utilities when engineering high-priority components of your project, such as security, APIs, and certificate management
Guiding Questions
- Have you worked with portfolio Intake to ensure that you are pursuing the optimal software development approach?
- What existing shared resources might you be able to leverage in developing this product?
- Does it make sense to migrate your product to a shared platform? Will it make sense in the future (in sustainment)?
- Where might you be able to leverage shared utilities in place of a new custom feature?
- Does your team use a resource that isn’t listed in CODE.VA, and have you contacted CODE.VA to add it?
Training
- Tech Tuesday 38 | Build Better Custom Software with CODE VA (April 23, 2024)
- Tech Tuesday 32 | Upgrade Application Security Testing with GitHub CodeQL (January 16, 2024)
- Tech Tuesday 4 | Lighthouse API (Oct 11, 2022)
- Tech Tuesday 27 | VA’s SaaS/PaaS Software Factory Advantage (Oct 10, 2023)
- Tech Tuesday 26 | Empowering DevEx: Building Custom Software and Utilities in the VA SWF (Sep 26, 2023)
- Tech Tuesday 22 | Collaborate with SPM on Intake for VA’s Software Factory (Aug 1, 2023)
- Tech Tuesday 25 | Setting VA’s Technology Standards with Enterprise Design Patterns (EDP) (Sep 12, 2023)
- Tech Tuesday 30 | Leveraging VA’s Testing Automation Tool Interface (ATI) (December 5, 2023)
- Tech Tuesday 45 | PKI: The Key to Better Certificate Management (August 27, 2024)
Runbooks
- VA GitHub Handbook
- CODE.VA Developer Starter Guide: This guide gives developers at VA additional context to make onboarding smoother
- Utilities Starter Guide: This guide provides a list of Software Factory Utilities that product teams can use to enable capabilities and features in their application
- Become a SWF Utility: This guide is for owners of shared resources that would like their resource to be included as a Software Factory Utility
Useful Links:
Key Contacts:
- All Utility Owners can be contacted via a new e-mail address: OITSWFUtilityOwners@va.gov
2.4 We treat every mistake as a learning opportunity, fostering a blameless culture by embracing the red and collaborating across teams to prevent future problems
A culture of hiding mistakes can have a profound impact on an organization, including slower reaction times and reduced use of institutional knowledge to solve issues. OIT encourages teams to bring mistakes forward by creating a transparent culture from the top down.
Best Practices
- If you make a mistake, own it and share it with the team as soon as possible
- As a leader, if someone on a team flags a mistake they made, acknowledge their integrity in raising the error and transition to fixing the issue and preventing similar ones
- Participate in OIT’s major incident process, proactively declaring major incidents when there is a significant issue with your product
- If someone discovers a mistake made by someone else, shift the conversation to finding a solution rather than pointing blame
- Immediately flag when products or features are or may become delayed so leadership can allocate resources as needed
- Evaluate work based on final output relative to final cost, rather than mistakes made along the way
- Product teams must develop robust incident management teams and protocols to ensure that critical product errors are resolved as fast as possible
Guiding Questions
- How can you encourage a culture of “embracing the red” daily?
- Does your team know how to participate in the Major Incident Process, including how to declare ongoing or recently addressed incidents?
- Have you defined a protocol or process to make it easier for individuals to report mistakes?
- Do opportunities exist to flag delays in product development? Where can you create more opportunities to do so?
Training
- Tech Tuesday 36 | Improving Change Safety Culture with Peer Reviews (March 26, 2024)
- Tech Tuesday 35 | Tackle Tricky Engineering Issues with Code Yellow (March 12, 2024)
- Tech Tuesday 18 | Major Incident Management (MIM) (May 23, 2023)
- Tech Tuesday 8 | OIT Detective Agency: Solving Problems Fast! (Dec 6, 2022)
Use Case
In FY24, the Enterprise Command Center (ECC) began an aggressive continuous service improvement initiative to investigate 100% of Major Incidents within five business days to identify opportunities to improve monitoring instrumentation.
A Critical-100 system experienced a Major Incident caused by a change in November 2024, which left customers unable to access the system. The monitoring instrumentation was successful, and all alerting functioned as designed. Furthermore, upon learning of the change, ECC updated the components of the application’s monitoring instrumentation, which reduced the average number of alerts per day for failed connections by 45%. With less alert noise, System Owners and Event Management Eyes on Glass can quickly analyze and respond to the remaining alerts that are most critical to the business.
Useful Links:
Key Contacts:
2.5 We treat security as part of the Veteran experience and a key outcome of the products we deliver
Security is the most important functionality teams can incorporate into their development plan. At VA, a cyberattack may compromise the reliability and privacy of the critical services upon which our nation’s Veterans, their families, and their caregivers depend. We should seek to treat Veterans’ identities with the same reverence we treat Veterans themselves.
VA is implementing a Zero Trust Architecture (ZTA) to stay ahead of ever-adapting cyber threats. There are four core ZTA principles:
- Never trust, always verify
- Enforce least privileged access
- Continuously and pervasively monitor
- Assume breaches
OIT expects that all product teams are familiar with ZTA and remain compliant and vigilant across all software development efforts.
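One way to picture how these principles translate into code is a deny-by-default access check, sketched below. This is a minimal illustration under stated assumptions, not VA's ZTA implementation: the `Caller` record, the `is_allowed` function, and the permission strings such as `claims:read` are all hypothetical names invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Caller:
    """Hypothetical caller record; field names are assumptions."""
    user_id: str
    authenticated: bool                   # identity verified for this request
    permissions: frozenset = frozenset()  # explicit grants only


def is_allowed(caller, required_permission, audit_log):
    """Deny by default; allow only a verified caller with an explicit grant."""
    # Never trust, always verify: the check runs on every request.
    # Least privilege: access requires the specific permission, nothing broader.
    allowed = caller.authenticated and required_permission in caller.permissions
    # Continuous monitoring: record every decision, allowed or denied.
    audit_log.append((caller.user_id, required_permission, allowed))
    return allowed


audit_log = []
clerk = Caller("clerk-1", authenticated=True,
               permissions=frozenset({"claims:read"}))
unverified = Caller("unknown", authenticated=False,
                    permissions=frozenset({"claims:read"}))

print(is_allowed(clerk, "claims:read", audit_log))       # verified, granted
print(is_allowed(clerk, "claims:write", audit_log))      # verified, not granted
print(is_allowed(unverified, "claims:read", audit_log))  # not verified
```

Only the first call is allowed. The third is denied even though the permission is present, because an unverified caller is treated as a possible breach rather than given the benefit of the doubt.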
Best Practices
- Become familiar with the DVA Cybersecurity Strategy, ZTA principles, and the Zero Trust First Cybersecurity Strategy
- Prioritize use of shared platforms that support comprehensive security monitoring out-of-the-box
- Ensure your product follows security best practices outlined in CODE VA
- Integrate available comprehensive security monitoring / logging tools (e.g., Splunk)
- Seek opportunities to “red team” your product, in which skilled security researchers probe for vulnerabilities as an attacker might to help identify issues
- Engage cybersecurity resources (e.g., the Office of Cyber Security (OCS)) early in the process, and work with other subject-matter experts across VA to ensure compliance with VA standards
Guiding Questions
- Are you considering VA best practices across cybersecurity strategy and ZTA in your software development approach?
- Have you engaged the appropriate security offices to ensure that your product is fully compliant?
- Have you made use of the full set of cybersecurity resources and tools provided by VA?
- Are you discussing security implications for each new product or feature that you develop?
Training
- Tech Tuesday 49 | Customer, User, or Both? Learn about VA’s New APDS (November 5, 2024) (APDS demo only)
- Tech Tuesday 45 | PKI: The Key to Better Certificate Management (August 27, 2024)
- Tech Tuesday 43 | Zero Trust Awareness Campaign: Secure Software (July 30, 2024)
- Tech Tuesday 34 | Understanding FedRAMP SaaS within the VA Cloud Ecosystem (Feb 27, 2024)
- Tech Tuesday 32 | Upgrade Application Security Testing with GitHub CodeQL (January 16, 2024)
- Tech Tuesday 28 | Anatomy of a Cyber Attack (ColdFusion) (Oct 24, 2023)
- Tech Tuesday 10 | Secrets Management w/ GitHub Advanced Security Demo (Jan 31, 2023)
- Security and Privacy Awareness and Role-Based Training (People Readiness Hub – Home)
Runbooks
Product teams should define transparent and modern technical strategies to ensure all systems are ready for continuous improvement. Teams should consider opportunities to coordinate with broader enterprise, product line, or portfolio strategy while also making technical roadmaps clear and transparent.
Best Practices
- Review key Enterprise Technology Guidelines when planning technical strategy
- Prioritize modernization of critical systems where possible
- Coordinate with leadership to ensure technical strategy is consistent with broader strategies (e.g., across tooling, tech stack)
- Share technical strategy for review and regularly update as needed
Guiding Questions
- Is your broader technical strategy aligned to the overarching ambition to create more modern and modular systems? Where and why does it diverge?
- How can you incorporate regular updates to your technical strategy on a day-to-day basis?
- If you are a product line manager or portfolio lead, what does your technical strategy look like? Are your teams aware of and aligned to this strategy?
- Should the systems under your purview be using the same technology stack? Is that possible?
Key Contacts:
None at this time
2.7 We ensure our teams know how to find the tools, documentation, and information needed to deliver products successfully
Teams should leverage resources like Product VA, CODE VA, Design VA, and the Digital VA website to ensure everyone on the team has the most up-to-date guidance and tools.
The VA Way is made up of all the best work being done in OIT, and teams are encouraged to document their work thoroughly to continue to expand institutional knowledge and enable others to follow their example.
Best Practices
- Help new team members onboard by centralizing system-specific resources in an accessible location
- Create new-hire onboarding resources that identify “must reads” that will help new team members get up to speed quickly on key areas
- For custom applications, ensure there is up-to-date documentation that can help a new developer begin productively contributing code to the project quickly
- Contribute back to the OIT community by uploading and updating content or providing feedback to microsite administrators
Guiding Questions
- What learnings, tools, design, software, etc. might be helpful to share with the broader community?
- After tracking down a hard-to-find process or piece of information, what can you do to make that information easier for future team members to find?
- What tools, documentation, and information might be needed for a new individual joining your product team? How might you store this information in an accessible way?
- How can you solicit feedback around tooling / docs from your team day-to-day?
- Where are there gaps for OIT to improve on the current state tools, documentation, and / or product delivery information?
Useful Links:
Key Contacts:
None at this time