Fault-Tolerant Networks: Is There Such a Thing?
14 June 2001
David Neil   Bob Hafner
 
When ensuring business continuity, often only cursory attention is paid to the robustness of the communications network. We offer guidelines and considerations to improve the resilience of an enterprise's communications facilities.

 Tutorials
Note Number:  TU-13-7964
Related Terms:  NSPs and Services; WAN

Key Issue
What WAN designs and technologies will network planners adopt to best handle disparate (and changing) traffic, connectivity and application requirements of their enterprises?

The e-business explosion and the requirement for 100 percent availability are focusing attention on one of the weak spots in availability: the network. The problem is compounded by strategic trends within communications, notably the shift toward public networks and, over the longer term, toward converged voice and data networks. Enterprises are therefore becoming increasingly dependent on network service providers (NSPs) and on the level of redundancy and resilience those providers have built into their networks (see Note 1). Most enterprises simply assume that any public network is fault-tolerant. So what must enterprises consider and examine when designing their networks?


Note 1
Typical Availability of Data Networks
  • Internet using multiple ISPs — 95 percent
  • Internet using single ISP — 99 percent
  • Basic single-carrier service (e.g., frame relay) — 99.9 percent
  • Frame relay with ISDN for backup — 99.95 percent
  • Fully duplicated network — 99.99(5) percent

It is important to note that as reliability increases, so does cost. An understanding of the business needs and the impact of network outages will help determine the level of reliability required and how much the enterprise is willing to pay for it.
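The availability figures in Note 1 translate directly into allowed downtime per year, which is often the more intuitive way to weigh cost against reliability. The following sketch hard-codes the levels from Note 1; the function and label names are ours, for illustration only:

```python
# Convert an availability percentage into implied downtime per year.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_minutes_per_year(availability_pct):
    """Annual downtime implied by an availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

# Availability levels from Note 1:
for label, pct in [
    ("Internet, multiple ISPs", 95.0),
    ("Internet, single ISP", 99.0),
    ("Single-carrier frame relay", 99.9),
    ("Frame relay with ISDN backup", 99.95),
    ("Fully duplicated network", 99.995),
]:
    print(f"{label}: {downtime_minutes_per_year(pct):,.1f} minutes/year")
```

At 99.9 percent, roughly 525 minutes of downtime a year are still permitted; moving to 99.995 percent cuts that to about 26 minutes, which is why each additional "nine" costs disproportionately more.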

Key Facts

There are some general principles enterprises should follow. The first is to plan only for the failure of a single component; planning for simultaneous failures of multiple components is too complex and expensive, as well as impractical. The second is not to worry about the core of the NSP's network: through technologies such as a SONET core, fully redundant hardware and a design that seeks to ensure resilience, the NSP's core network is about as fail-safe as it can be. There are, however, three key areas of weakness: the access line, the software that runs on the network devices in the service provider's network cloud, and problems introduced when the network is changed or modified.

The access line is highly vulnerable to being cut during construction or excavation. It lacks the redundancy of the core because it is generally used by only one customer, so providing access redundancy can double the cost of the entire network. Enterprises should always have a backup for the access line that allows operation at least in a degraded mode. For large offices, the backup should be a circuit sized to run at least the mission-critical applications concurrently.

Should the backup circuit be supplied by a different NSP? Should there be separate cable entries? Should the access lines be routed through separate central offices (COs)? Yes, if 100 percent availability is desired. Typically, the NSP will route all circuits through its closest CO or point of presence (POP), so a problem in that CO or POP could affect every circuit from that location. (This happened in Toronto, Canada, where a fire in a downtown CO left many enterprises without communications for an extended period.) Enterprises have two alternatives: work with their NSPs to ensure there are no single points of failure in the connectivity, or use multiple providers. Even when different NSPs are chosen, network designers should ensure that the NSPs do not share common facilities at any point; many network providers offer colocation facilities to other NSPs.

For smaller offices, the best method is automated dial backup. This can be regular dial, ISDN dial where available, cable modems or xDSL. ISDN is preferable to pure dial because of its higher speed, but it is still relatively slow at 128 Kbps. Cable is another alternative, but the quality of cable modem connectivity varies greatly from one supplier to another, and xDSL's availability is limited to urban areas. Enterprises may be able to justify an inverse multiplexor, which makes speeds up to T1 possible. If T1 or higher speeds are needed for backup, it can make more economic sense to have two separate T1 facilities and balance the load across the two circuits. New competitive carriers typically offer cheaper bandwidth than the incumbents, but the quality of their support and service is unproven.
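The resilience gained from duplicated access circuits can be estimated by treating them as independent parallel components: the access fails only if every circuit fails at once. This is a simplifying assumption, and it is exactly the one broken by a shared cable entry, CO or colocation facility. A minimal sketch:

```python
def parallel_availability(*component_availabilities):
    """Availability of independent components in parallel:
    the system is down only when every component is down."""
    failure_prob = 1.0
    for a in component_availabilities:
        failure_prob *= (1.0 - a)
    return 1.0 - failure_prob

# Two 99.9%-available access circuits, assumed fully independent:
combined = parallel_availability(0.999, 0.999)
print(f"{combined:.6f}")  # approximately 0.999999, i.e. six nines
```

The arithmetic shows why shared facilities matter: if both circuits run through the same CO, the CO itself becomes a single component in series, and its availability caps the whole calculation regardless of how many access lines are duplicated.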

Safeguarding against software problems in the network is difficult. A key issue is how the NSP tests changes before full-scale implementation. This is a process issue, and enterprises should ask their NSPs to explain their testing processes and fallback capabilities. Contractual service-level agreements (SLAs) for network availability are a must, and the SLAs must include penalties for failing to meet the targets. Enterprises should also be mindful of a new practice with respect to SLA penalties: the service provider offers 100 percent availability, with a penalty of 10 percent for failing to meet that number. In reality, the provider does not expect to achieve 100 percent all the time; it simply treats the penalty as a discount of up to 10 percent. SLA penalties should not be capped at 10 percent; they must be representative of the business impact, with penalties of up to 100 percent of the cost (at least) for extended outages.
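The gap between a capped SLA penalty and the real business impact of an outage is easy to quantify. The figures below are hypothetical, chosen only to illustrate why a 10 percent cap is rarely meaningful:

```python
# Compare a capped SLA penalty with the business impact of an outage.
# All figures are hypothetical, for illustration only.

def sla_penalty(monthly_fee, penalty_cap_pct):
    """Maximum payout under an SLA whose penalty is capped
    at a percentage of the monthly service fee."""
    return monthly_fee * penalty_cap_pct / 100

monthly_fee = 10_000        # monthly circuit cost (hypothetical)
outage_hours = 8            # length of an extended outage
loss_per_hour = 50_000      # business loss per hour of downtime (hypothetical)

penalty = sla_penalty(monthly_fee, 10)        # capped at 10 percent
business_loss = outage_hours * loss_per_hour

print(f"Capped SLA penalty: {penalty:,.0f}")
print(f"Business loss:      {business_loss:,.0f}")
```

Under these assumptions the capped penalty recovers a fraction of 1 percent of the actual loss, which is why the note argues penalties must be scaled to impact rather than to the service fee.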

Any network is vulnerable to problems such as broadcast storms, where the volume of interdevice-signalling traffic increases to the point where it is the only traffic being transmitted. This happened several years ago to the frame relay networks of both AT&T and WorldCom. Problems of this magnitude are rare in single-carrier networks, but occur more often on the Internet. A multivendor or multiservice approach can ensure continuity of service. Dial backup facilities also overcome this problem by dialing into a different network: either directly into the target computer's dial facilities, or into a network, such as a separate virtual private network (VPN), that uses alternate routes to reach the target systems. Problems caused by changes to the network are likewise process issues and should be addressed in the same way as software issues.

A final consideration arises when an enterprise contracts with an NSP for Web hosting, application service provision or other services. In these situations, the network and the computer systems are usually both provided by the same NSP, so a failure anywhere in the network, including the NSP's core, can result in a complete loss of connectivity. Enterprises must examine this issue with the NSP and determine what alternative communication facilities the NSP offers. This should be a key criterion during provider selection, one that is often overlooked. Note that some NSPs do not offer alternate communications facilities for their Web-hosting and other value-added services.


Related Research on Planning for Network Failures
  • Network Failures: Be Afraid, Be Very Afraid (SPA-09-1285)

Bottom Line

One of the worst mistakes a network designer can make is to assume that an NSP's networks are fault-tolerant. Network designers must work with NSPs, reviewing the network schematics and identifying the weak points and single points of failure. Failure to do this can result in the enterprise losing network facilities for an extended period and suffering a severe business loss.

This research is part of a broader article consisting of a number of contemporaneously produced pieces. See COM-13-6392 on www.gartner.com for an overview of the article.


This research is part of a set of related research pieces. See AV-14-5138 for an overview.