The October 2011 report of the WMATA Riders' Advisory Council Chair contained a note indicating that a RAC member had requested (using the RAC's investigative powers) documentation from the agency concerning, among other items, the "development of a real-time collision-avoidance system for Metrorail trains", further defined as a system "designed to serve as a continuous backup system that would provide alerts to potential safety issues, and which would supplement Metrorail’s primary electronic system to prevent crashes".

This piqued my interest, because it doesn't sound like the sort of thing usually expected of a rail signalling system. Rail signalling systems are (as I'll reiterate later) meant to be fail-safe, meaning that any failure must lead to the most restrictive signal indication being displayed. So long as the system is properly maintained (an assumption that, as we'll see, is often erroneous), it should not permit unsafe conditions to exist.

So, what sort of backup system could be implemented, and do we really need one? More importantly, are there more serious, systemic problems that will hobble any system that is implemented?

Rail signalling basics

Before answering those questions, it's important to clarify some points about rail signalling systems. Most mass transit systems, including Metrorail, use what is known as fixed-block signalling. I won't provide a detailed introduction to signalling principles here; for that I would recommend this series of articles on nycsubway.org.

However, in brief, a fixed-block signalling system ensures safe train movement by dividing the track into a series of blocks, of defined length, and permitting only one train to enter any given block. In this way, a certain spacing between trains is maintained. In order to detect whether a block is occupied by a train, various methods are available, including track circuits and axle counters, although on American transit properties (including Metrorail), track circuits are by far the most common. The basic concept of operation for a track circuit is that the presence of a train's wheels in a block shorts the two rails together; this shunt diverts current away from a track relay, and when the relay drops out, the signalling system knows that the block is occupied. There are considerably more complex variations on the scheme, like the audio-frequency track circuits used on the Metrorail system, but the basic principle remains the same.
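To make the block-occupancy idea concrete, here is a minimal sketch in Python. It is purely illustrative (the class and names are mine, not any vendor's or interlocking's): a relay is normally held energized by current through the rails, a train's wheels shunt that current away, and a dropped relay is always read as "occupied".

```python
# Minimal model of a track circuit (illustrative only). A broken rail or a
# power failure drops the relay too, which is exactly the fail-safe behaviour
# a signalling system wants.

from dataclasses import dataclass

@dataclass
class TrackCircuit:
    block_id: str
    relay_energized: bool = True        # energized = nothing shunting the rails

    def train_enters(self):
        self.relay_energized = False    # wheels shunt the current; relay drops

    def train_leaves(self):
        self.relay_energized = True

    @property
    def occupied(self):
        return not self.relay_energized

def may_admit_following_train(block):
    """Fixed-block rule: only one train per block, so admit only into a vacant block."""
    return not block.occupied

b304 = TrackCircuit("B2-304")
b304.train_enters()
print(may_admit_following_train(b304))   # False: the occupied block protects the train
```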

Of course, one of the most serious problems in rail signalling is the signal passed at danger—colloquially, when an operator 'runs a red signal'. To ensure safety in this circumstance, many systems, including Metrorail, use some form of automatic train protection. This can take the form of a train stop, a mechanical device intended to automatically apply the brakes on any train which violates a signal, or, as is the case on Metrorail, it can take the form of onboard equipment which reads the speed codes transmitted along the track, and automatically brakes the train if the operator should attempt to go faster than permitted.
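A rough sketch of the onboard-enforcement idea follows. The code names and speed values are invented for the example, not Metrorail's actual speed commands: the wayside transmits a maximum permitted speed for the block, and the onboard unit brakes the train if it exceeds that speed, treating the absence of a code as a command to stop.

```python
# Illustrative sketch of onboard speed-code enforcement (values are assumptions,
# not WMATA's actual ATP codes).

PERMITTED_MPH = {"code_15": 15, "code_35": 35, "code_55": 55}

def atp_decision(received_code, actual_speed_mph):
    # Fail-safe: no code, or an unrecognized one, is treated as a 0 mph limit.
    limit = PERMITTED_MPH.get(received_code, 0)
    return "apply brakes" if actual_speed_mph > limit else "no intervention"

print(atp_decision("code_35", 42.0))   # "apply brakes": over the commanded speed
print(atp_decision("code_55", 48.0))   # "no intervention"
print(atp_decision(None, 10.0))        # "apply brakes": loss of code is restrictive
```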

Fail-safe systems and human factors

In any discussion of rail signalling safety, it is also important to remember that one of the fundamental principles of a rail signalling system is that it must be fail-safe—that is, any failure must cause the most restrictive signal indication to be displayed, bringing trains to a stop, before any harm can be done.
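Put as a toy piece of logic (my own framing, not any real interlocking code), the fail-safe principle says that anything short of positive proof that the line ahead is clear must map to the most restrictive aspect:

```python
# Toy illustration of the fail-safe convention: only a positively proven
# "vacant ahead" condition yields a proceed aspect; failures, ambiguity, and
# occupancy all converge on "stop".

def signal_aspect(block_ahead_proven_vacant):
    # None models a failed or ambiguous detection; it must not read as vacant.
    if block_ahead_proven_vacant is True:
        return "proceed"
    return "stop"

print(signal_aspect(True))    # "proceed"
print(signal_aspect(False))   # "stop"
print(signal_aspect(None))    # "stop": a failure yields the restrictive aspect
```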

So, if signalling systems are meant to be fail-safe, how is it that they are still implicated in so many rail accidents? Aside from the June 2009 crash on the Metrorail Red Line (which we'll get to later), there have been many accidents in which nominally fail-safe signalling systems failed. In almost all of them, the root cause turned out to be not a random failure of the signalling system, but human error in its engineering, installation, maintenance, or operation.

1995 Russell Hill crash in Toronto

Consider, for example, the 1995 subway crash in Toronto. The train operator in that crash had indeed passed a red signal, but that should not by itself have led to an accident; the train should still have been stopped by the train stop.

Why wasn't it? The train stop, a part designed by Ericsson and intended to be fail-safe like the rest of the signalling system, had a serious design flaw, introduced by a change requested by the TTC, which could cause the mechanism to fail to operate correctly under certain wheel/rail interface conditions (particularly when rounding a curve at speed, with worn wheels and rail, as was the case in the accident). In brief, the fail-safe signalling system failed not only because of a mechanical failure, but also because of a human failure to anticipate that the mechanical failure could occur and could compromise the fail-safe nature of the system.

1988 Clapham Junction crash

The 1988 Clapham Junction crash, though very different in terms of root cause, bears certain similarities. At Clapham Junction, poor work practices meant that a signal was wired incorrectly, leading other signals to display false indications. Here, again, human error compromised the signalling system.

The signal maintainers engaged in the Clapham Junction re-signalling which was underway at the time of the crash were over-worked and under-supervised; the work practices in use did not respect the safety-critical nature of the work.

2009 Metrorail Red Line crash

Finally, there's the 2009 Red Line crash on the Metrorail system. As the NTSB report documents, "[on] the day of the accident, parasitic oscillation in the track circuit modules for track circuit B2-304 was creating a spurious signal that mimicked a valid track circuit signal, thus causing the track circuit to fail to detect the presence of train 214". Why did this happen? The NTSB implicates the manufacturer for failing to provide WMATA with a testing procedure that would have detected this failure condition, but further, and more significantly, notes that "WMATA failed to institutionalize and employ systemwide the enhanced track circuit verification test developed following the 2005 Rosslyn near-collisions, and this test procedure, had it been formally implemented, would have been sufficient to identify track circuits that could fail in the manner of those at Rosslyn and Fort Totten".

In short, human error compromised the system. The human error did not exist solely at the level of routine maintenance procedures; there were design deficiencies as well: "WMATA failed to recognize that the near-collisions at Rosslyn in 2005 represented an unacceptable hazard that had not been considered in the fail-safe design of the automatic train control system, and WMATA failed to communicate that hazard to the affected divisions in the organization for resolution".

Backup systems

At the time of the Red Line crash, there were two "backup systems" in place which could have given some indication that all was not well with the B2-304 track circuit. The first is a feature of AIM, the software used for automatic train supervision (that is, dispatching and routing trains, and providing train-location information to dispatchers at the Metrorail OCC and to passengers through the PIDS and other systems). In order to be able to track trains throughout the system, AIM uses information on block occupancy. It is, however, a non-vital system—while it receives information on block occupancy, it is not directly involved in setting signals to ensure the safe operation of trains; that function is delegated to wayside equipment in each train control room.

However, AIM is able to detect certain failures of track circuits based on the data it receives. This includes conditions where a track circuit has failed in the unoccupied state (a so-called "non-reporting block"), and where a track circuit has failed in the occupied state (a so-called "always-reporting block"). Both of these conditions generate alarms in the Metrorail OCC—according to the NTSB report, roughly 5000 and 3000 per week, respectively. More telling are the relevant footnotes in the report, which indicate that "about 100 track circuits (of about 3,000 track circuits systemwide) account for the majority of the ARB track-circuit-failed-occupied alarms" and that "the same track circuits that account for the majority of the ARB track-circuit-failed-occupied alarms also account for the majority of the NRB track-circuit-failed-vacant alarms".
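A sketch of how a supervisory system could flag these two failure modes from the occupancy data it already receives might look like the following. The field names and logic are assumptions for illustration, not AIM's actual implementation: the circuit's report is compared with where train tracking believes a train to be.

```python
# Illustrative classification of track circuit failures from supervision data.

def classify_circuit(reported_occupied, train_believed_present):
    """Compare what the circuit reports with where train tracking says a train is."""
    if train_believed_present and not reported_occupied:
        return "NRB"    # non-reporting block: circuit failed in the unoccupied state
    if reported_occupied and not train_believed_present:
        return "ARB"    # always-reporting block: circuit failed in the occupied state
    return None         # the circuit and the train-tracking picture agree

print(classify_circuit(reported_occupied=False, train_believed_present=True))   # NRB
print(classify_circuit(reported_occupied=True, train_believed_present=False))   # ARB
```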

Here, again, there is a hint of human error. Of course, some of the blame can probably be placed on the design of the AIM system; alarm handling in industrial control systems is notoriously poor, and has been implicated in many accidents. Inevitably, alarms pile up, and they are not analyzed to detect trends, nor to classify the problems by severity. But the sheer number of ARB and NRB alarms should have been, by itself, an indication that something was amiss—yet nothing was done to identify the problem track circuits, nor to actually resolve the problems with those track circuits.
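Given the report's observation that roughly 100 circuits produced the majority of the alarms, even a trivial tally would have singled them out. A sketch, assuming a simple log of (circuit, alarm type) records, which is my own illustrative format:

```python
# Sketch of the trend analysis the alarm stream invited: tally ARB/NRB alarms
# per track circuit and surface the chronic offenders.

from collections import Counter

alarm_log = [                      # (track circuit, alarm type) -- made-up data
    ("B2-304", "ARB"), ("B2-304", "NRB"), ("B2-304", "ARB"),
    ("C1-112", "ARB"), ("A5-021", "NRB"), ("B2-304", "NRB"),
]

per_circuit = Counter(circuit for circuit, _ in alarm_log)
print(per_circuit.most_common(3))  # [('B2-304', 4), ('C1-112', 1), ('A5-021', 1)]
```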

The other "backup system" in place was a tool known as the "loss-of-shunt tool", developed by WMATA to analyze historical data from AIM to detect situations where a track circuit underneath a train indicated that it was unoccupied—or, in other words, had become incapable of detecting the presence of a train. In the years between the 2005 Rosslyn incident (described in more detail in the NTSB report) and the 2009 Red Line crash, the loss-of-shunt tool was run periodically at first, and then discontinued—or if it was being run, there was no documentation to show that those runs were being performed and the results analyzed. Again, human factors compromised the safety of the system.

In the wake of the Red Line crash, Internet mailing lists, blogs, and message boards across the region came alive with all manner of proposed solutions for preventing a similar accident, from using radar to detect trains (unworkable for a variety of reasons), to using GPS (unworkable underground, and insufficiently precise above-ground), to other even more harebrained schemes. Some simply recommended that speeds on the system be slowed to a crawl, in essence reflecting a complete lack of faith in the signalling system.

But the reality is that, as the NTSB report showed, the Red Line crash was brought about because human error compromised the signalling system—first by failing to properly install track circuits in the manner prescribed by the manufacturer (instead, WMATA mixed and matched equipment in a manner which led to dangerous consequences), and then by failing to properly maintain the system, as evidenced by WMATA's nonchalant and inadequate response to the AIM ARB and NRB alarms.

Human factors and SUBSAFE

The solution, then, is to demand better human performance in the maintenance of vital systems. Certainly, we can develop another layer of signalling technology, designed to provide extra protection against collisions, but it, too, will eventually fall victim to human error if we do not correct the human factors.

The US Navy has a program known as SUBSAFE, which is designed to comprehensively prevent the loss of submarines due to flooding. The program, a response to the loss of the USS Thresher, rigorously controls the design and construction of submarines to ensure that they will remain watertight, and to ensure that, if flooding should occur despite those controls, it will be possible to recover from that flooding and prevent the boat from being lost. Though engineering design and review practices are a major component of the SUBSAFE program, human factors are considered in the program as well. As RADM Paul E. Sullivan, Deputy Commander for Ship Design, Integration, and Engineering, described in testimony before the House Science Committee in 2003:

In the submarine Navy, many individuals understand safety on a first-hand and personal basis. The Navy has had over one hundred thousand individuals that have been to sea in submarines. In fact, many of the submarine designers and senior managers at both the contractors and NAVSEA routinely are onboard each submarine during its sea trials. In addition, the submarine Navy conducts annual training, revisiting major mishaps and lessons learned, including THRESHER and CHALLENGER.

NAVSEA uses the THRESHER loss as the basis for annual mandatory training. During training, personnel watch a video on the THRESHER, listen to a two-minute long audiotape of a submarine’s hull collapsing, and are reminded that people were dying as this occurred. These vivid reminders, posters, and other observances throughout the submarine community help maintain the safety focus, and it continually renews our safety culture. The Navy has a traditional military discipline and culture. The NAVSEA organization that deals with submarine technology also is oriented to compliance with institutional policy requirements. In the submarine Navy there is a uniformity of training, qualification requirements, education, etc., which reflects a single mission or product line, i.e., building and operating nuclear powered submarines.

I am not aware of any rail mass transit systems which routinely challenge their signal maintainers to recall the consequences of carelessness on the job. At the same time, it almost goes without saying that WMATA lacks a safety culture; that observation has appeared in nearly every NTSB report on a Metrorail accident since the first. Adding another layer to the signalling system will do no good if it is maintained by the same maintainers who are presently bullied into covering up (or simply ignoring) safety-critical problems.

BART and backup systems

In its early years, BART experienced many failures of its train control system, including repeated failures to detect trains (a condition which could lead to a rear-end collision). Unlike on Metrorail, these failures were not due to problems with maintenance procedures, but rather the inherent design of the system:

The existing train-detection system uses the vehicles to interconnect the running rails with a low impedance to indicate train occupancy of a block. But since the track-to-track voltage is lower...than that conventionally used in such circuits, the presence of thin resistive films on the wheels or tracks can cause detection failure.
G. D. Friedlander, “A prescription for BART”

This led BART to develop the Computer Augmented Block System, "in which the central computer contributes to a safety function by maintaining a one-station separation between trains. This headway restriction, which averages about two miles in the BART system, is required because of reported losses of vehicle detection by the ATP system". Aside from limiting headways on the line, the CABS "will not permit single-track operation, [so] a vehicle train control failure slows down the entire line until the affected train can be removed at a pocket track or yard" (D. B. Turner, “Reliability improvement of BART vehicle train control”).

After CABS came SORS, the Sequential Occupancy and Release System (further described here), a more advanced system which will not permit a block to show as "unoccupied" until its track circuit detects an unoccupied condition and the next track circuit in the direction of travel detects an occupied condition. In this manner, if a train travels into a track circuit which fails to detect it, the track circuit behind will not be cleared, and will thus continue to protect the train.
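A small sketch of the release rule as described above (the names are illustrative, not BART's implementation): a block may only be released, that is, shown vacant to following trains, once its own circuit reads vacant and the next circuit in the direction of travel reads occupied.

```python
# Sketch of the SORS release rule for a single block boundary.

def may_release(own_circuit_reads_vacant, next_circuit_reads_occupied):
    return own_circuit_reads_vacant and next_circuit_reads_occupied

# A train moves from block N into block N+1, but N+1's circuit fails to detect
# it: block N is not released, so it continues to protect the train.
print(may_release(own_circuit_reads_vacant=True, next_circuit_reads_occupied=False))  # False

# Normal hand-off: N reads vacant and N+1 picks the train up, so N is released.
print(may_release(own_circuit_reads_vacant=True, next_circuit_reads_occupied=True))   # True
```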

Conclusions

WMATA could develop a backup system to protect against track circuit malfunctions, and in fact there have been repeated reports of a plan to turn the loss-of-shunt tool into a real-time system which would be able to stop trains when it detected a loss-of-shunt condition; this would in effect turn it into a reactive version of the BART SORS.

There are other options, like abandoning track circuits entirely in favor of axle counters, as is the practice on some European systems, or adopting communication-based train control systems which do not rely on conventional means of train detection at all.

But all of these systems must be maintained by humans; any of them can become unsafe when poorly maintained. WMATA could implement a version of BART's SORS, and yet there could be another crash a few months later; there would be much hand-wringing, and once again the root cause, the lack of a meaningful safety culture, would go uncorrected. The Metrorail signalling system, if properly designed, installed, maintained, and tested, is safe. Rather than demanding that a backup system be installed, one that would fall victim to the same poor maintenance practices that felled the track circuit at Fort Totten, riders should demand that WMATA correct the (very serious) human factors problems which have persisted for decades and allowed unsafe conditions to develop and persist on the system.