The concept of “four nines,” or 99.99% service uptime, is fast becoming a standard by which information technology (IT) departments in many large organizations measure success, and VA’s Office of Information and Technology (OIT) is striving for the same. This means that an enterprise IT system, network, infrastructure element, or application is fully functional 99.99% of the time, which allows only about 52 minutes of downtime per year. Another way to consider this is that just 1 in 10,000 transactions can fail. In context, a web server that sees one million requests per day can tolerate 100 errors per day and still reach 99.99% uptime.
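The figures above follow from simple arithmetic. The sketch below is illustrative only; the function names are ours, and it assumes a 365-day year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_budget_minutes(availability: float,
                            period_minutes: int = MINUTES_PER_YEAR) -> float:
    """Return the downtime allowed in a period at the given availability (0 to 1)."""
    return (1.0 - availability) * period_minutes


def allowed_failures(availability: float, total_requests: int) -> int:
    """Return how many requests may fail while still meeting the target."""
    # round() absorbs floating-point noise such as 99.99999999 instead of 100.
    return round((1.0 - availability) * total_requests)


for label, target in [("three nines", 0.999),
                      ("four nines", 0.9999),
                      ("five nines", 0.99999)]:
    minutes = downtime_budget_minutes(target)
    print(f"{label}: {minutes:.1f} minutes of downtime per year")

print(allowed_failures(0.9999, 1_000_000), "failed requests allowed per million")
```

At four nines the yearly budget works out to about 52.6 minutes, and a million requests leave room for 100 failures.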

The core of the four-nines concept is to prevent as many service interruptions as possible through backup plans, redundancy measures, and built-in system resiliency. While 100% uptime would be the ideal, it is unattainable in practice. A variety of issues can (and will!) arise during the year to cause an outage, regardless of the safeguards implemented. When they do, the correct team should be notified immediately so it can respond as quickly as possible and minimize downtime.

Four-nines and our OIT priorities

The four nines are crucial to OIT’s four priorities, each building on the others and connected by the heart of our mission: to deliver world-class IT solutions to Veterans, their families, and caregivers.  The Engineering Excellence priority seeks to improve the rigor of all systems within VA to create a more dependable network overall.  We accomplish this by mitigating change-based incidents and reducing avoidable major incidents to near-zero through more careful testing, release management, and adherence to change management processes. And as many of us continue to work remotely, improving the reliability of our systems is vital to the majority of work being carried out by VA employees.

There are more than 900 IT systems used across the VA enterprise. Many of these systems touch thousands of users, who in turn serve millions of Veterans. To prioritize four-nines work and focus efforts productively, OIT has identified the “critical 100”: a single list of the systems with the most reach and impact, such as infrastructure and authentication services and a few critical applications such as electronic health record systems.

These are considered the bedrock systems of VA, and most are interconnected in some way with other critical systems. Dependency mapping, a process that traces each system transaction from its origin to its destination, gives a high-level view of these interrelations and can quickly surface visual alerts for systems with developing or impending issues. A dependency map helps triage problems by identifying each system or piece of infrastructure that others depend upon. The goal is proactive awareness of systems at risk, so failovers can be poised to take over or the issue can be solved before the end user feels any impact.
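As a rough illustration of how a dependency map supports triage, the sketch below walks a small graph to find everything affected when one system fails. The system names and the map itself are hypothetical and are not drawn from any VA tool.

```python
from collections import deque

# Hypothetical map: each system lists the systems it depends on.
DEPENDS_ON = {
    "health_records": ["authentication", "network"],
    "benefits_portal": ["authentication", "database"],
    "authentication": ["network"],
    "database": ["network"],
    "network": [],
}


def impacted_by(failed: str, depends_on: dict[str, list[str]]) -> set[str]:
    """Return every system that directly or indirectly depends on `failed`."""
    # Invert the map so each system points at the systems that rely on it.
    dependents: dict[str, list[str]] = {name: [] for name in depends_on}
    for system, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(system)

    # Breadth-first walk outward from the failed system.
    impacted: set[str] = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, []):
            if dependent not in impacted:
                impacted.add(dependent)
                queue.append(dependent)
    return impacted


print(sorted(impacted_by("authentication", DEPENDS_ON)))
```

In this toy map, a failure in authentication reaches both applications that sit on top of it, while a failure in the network layer reaches everything else.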

OIT leverages multiple tools, including the VA System Inventory (VASI), the Line of Sight Tool (LOST), the Configuration Management Database (CMDB), and others, to view the dependency mappings from different perspectives. For example, the CMDB builds its view primarily through a fine-tuned combination of automated discovery and some manual input, while LOST relies on system documentation for issue identification and remediation.

The ongoing work to achieve four-nines will improve reliability across all operational divisions of OIT. Striving to improve system dependability VA-wide also supports a second OIT-wide priority: a Delightful End-User Experience. The impact of four-nines on the end user is clear: by engineering the network with reliable backup options, our Veterans and employees will be able to reach healthcare records, benefits information, and the services and tools they need for work 99.99% of the time.

Why does four-nines matter?

Network reliability is important to all the functional work we do. When a major system fails, the outage can affect thousands if not millions of end users, including employees and Veterans. For example, the Identity and Access Management System authenticates and authorizes more than 50,000 users per minute during core business hours. At that rate, a one-second disruption interrupts more than 800 user actions (50,000 ÷ 60 ≈ 833), breaking workflows and forcing those users to try again.

Achieving four-nines is especially important for OIT’s Enterprise Service Desk (ESD), whose agents provide critical IT support to more than 525,000 customers at nearly 1,300 VA healthcare facilities, 56 benefits regional offices, and 155 national cemeteries. When systems fail, ESD is inundated with reported problems and must filter through the noise to identify the central problem. The team then has to escalate and keep filtering the noise while the problem is addressed, which keeps ESD from working on other, less common end-user issues at the time.

Enterprise efforts for four-nines

VA’s OIT has implemented these key strategies to achieve four nines, and works continually to refine these processes for ongoing performance improvements:

Automate when possible. One critical strategy to avoid lengthy downtimes is automation, a hallmark of system reliability. Monitoring is key: proactively evaluating systems helps identify a potential issue before it becomes a full-blown outage. Once a process is well defined and tested, it can be automated, drastically increasing efficiency and reliability while decreasing the chance of human error. This applies to daily system operations and maintenance as well as diagnostics and triage.

OIT has identified known error conditions of certain critical systems that require an immediate response. The process used to require people on multiple teams to evaluate and escalate before declaring a High Priority Incident (HPI) or a Critical Priority Incident (CPI). Today, several automations systematically trigger an HPI or CPI without human intervention, getting an emergency response started more than 30 minutes faster.
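One way to picture this kind of automation is a table of known error conditions, each mapped to an incident priority. The sketch below is a simplified illustration, not OIT’s implementation: the systems, metrics, and thresholds are invented, and it assumes a CPI outranks an HPI.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    """A known error condition and the incident priority it triggers."""
    system: str
    metric: str
    threshold: float
    priority: str  # "HPI" or "CPI"


# Hypothetical rules; real systems, metrics, and thresholds will differ.
RULES = [
    Rule("authentication", "error_rate", 0.05, "HPI"),
    Rule("authentication", "error_rate", 0.25, "CPI"),
    Rule("vpn", "failed_logins_per_min", 500, "HPI"),
]

# Assumption for this sketch: a CPI is more severe than an HPI.
SEVERITY = {"HPI": 1, "CPI": 2}


def evaluate(system: str, metric: str, value: float) -> Optional[str]:
    """Return the most severe priority whose threshold the reading meets, else None."""
    matched = [
        rule.priority
        for rule in RULES
        if rule.system == system and rule.metric == metric and value >= rule.threshold
    ]
    return max(matched, key=SEVERITY.get, default=None)


print(evaluate("authentication", "error_rate", 0.30))
```

Because the rules are evaluated by software as soon as a reading arrives, no one has to notice the problem, judge its severity, and hand it off before the incident is declared.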

The monitoring systems use artificial intelligence (AI) to oversee system functionality and create reports of the areas that need immediate attention. When a failure occurs, the monitoring alerts the right person, who can address the issue within minutes. But minutes add up quickly over a year. The greater the automation, the faster the issue is identified and remedied.

Implement redundancy and failover mechanisms. Having redundant systems is vital to achieving four nines. If a system goes down, the backup (redundant) system carries the load, allowing for uninterrupted service to the users, and giving engineers the opportunity to quickly correct the problem with the main system.  If there is no service interruption, whether using the base system or backup solution, no downtime is associated with the incident.   
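The effect of redundancy on uptime can be estimated with a standard formula: if each copy of a system is available a fraction a of the time and the copies fail independently, at least one of n copies is available 1 - (1 - a)^n of the time. The sketch below applies that formula. It assumes independent failures and instantaneous failover, which real systems only approximate.

```python
def combined_availability(single: float, replicas: int) -> float:
    """Availability of `replicas` independent copies when any one copy can serve."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # The service is down only when every copy is down at the same time.
    return 1.0 - (1.0 - single) ** replicas


for n in (1, 2, 3):
    print(f"{n} copies at 99% each: {combined_availability(0.99, n):.6f}")
```

Under those assumptions, two copies that are each available 99% of the time together reach 99.99%.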

The teams have also implemented high availability and failover mechanisms such as load balancing, clustering, and virtualization to distribute traffic and workload across multiple servers, which prevents a single point of failure. For example, the VPN service that many remote workers use has redundant servers located in four geographically distributed data centers. If any one server fails, or even if an entire data center becomes unavailable, the VPN automatically redirects users to another location and reconnects them with very little impact. This is only possible with the clustering and geographical redundancy in place.
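The failover behavior described here can be sketched as choosing the first healthy location from an ordered list. This is a minimal illustration: the data center names and the health check are hypothetical, and a production VPN would rely on its own clustering and load-balancing features.

```python
from typing import Callable, Optional, Sequence


def pick_endpoint(endpoints: Sequence[str],
                  is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first healthy endpoint in preference order, or None if all are down."""
    for endpoint in endpoints:
        if is_healthy(endpoint):
            return endpoint
    return None


# Hypothetical names for four geographically distributed data centers.
VPN_GATEWAYS = ["dc-east", "dc-central", "dc-west", "dc-south"]

# Simulate an outage at the preferred location.
unavailable = {"dc-east"}
print(pick_endpoint(VPN_GATEWAYS, lambda name: name not in unavailable))
```

With the preferred location marked unavailable, the user is sent to the next one in the list. Service is lost only if every location fails at once.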

Perform regular updates and maintenance.  Regularly updating and maintaining our systems helps prevent downtime caused by software or hardware failures. This is extremely important given the interconnectivity and inter-dependency of most VA systems.  If a bedrock system is down, all those that build upon it will be down as well. This is again where redundancy is vital.

Establish a recovery plan. Even with the best planning and execution, unexpected events such as natural disasters, cyberattacks, or human errors can cause system downtime. Having a disaster recovery plan in place helps the team bounce back quickly from an HPI or CPI and minimizes the negative impact on system uptime. The plan includes procedures for data backup, data recovery, and system restoration.

As the team gets closer to fully achieving four-nines, support for all VA IT users, including staff, contractors, caregivers, and Veterans, will continue to become more dependable and reliable.

OIT is the first organization in federal government truly striving for this achievement, simultaneously meeting an IT industry standard while also leading the way for a new, higher standard in government. Achieving this advanced level of uptime requires ongoing attention and resources, which OIT has committed to this effort across the enterprise. 

System issues are inevitable, especially when considering the size and scope of VA’s information technology. With more than 400,000 employees using more than 900 OIT systems spanning more than a million devices, problems are unavoidable. Even with the best planning and preparation, downtime will still occur, but our expert engineers and their teams are working tirelessly to limit its impact.

To Learn More

To learn more about the four nines, see May’s Talking Tech event, where VA OIT Director of Operation Triage Group Jay Paluch spoke at length about his team’s efforts. Recordings are available on both LinkedIn and YouTube.
