Jack Schofield 

Drat, it’s down again

When computer systems crash on a large scale the repercussions can cost millions. Jack Schofield reports on the latest spate of software 'disasters' and asks who is really to blame.

The Halifax has just delayed the launch of its internet bank, Intelligent Finance, for a month because it doesn't work properly. It's easy to understand why Jim Spowart, IF's chief executive, is wary: the Prudential's Egg collapsed when it was launched, and last month, Cahoot, the Abbey National's online bank, went down in less than 90 minutes, swamped by the number of would-be customers. He doesn't want to make it three in a row.

These are merely the latest in a long series of well-publicised problems with websites including Boo, the failed fashion retailer, Microsoft's Hotmail email service, and eBay, the giant auction site. America Online, the world's largest internet service provider, once went down for 19 hours, depriving net addicts of their vital "fix" of email and chat.

Computer problems stretch far beyond the internet. Last month, travellers were delayed at many British airports after air traffic control computers at West Drayton, near Heathrow, shut themselves down because data in a software upgrade did not fit the space allowed for it. In April, the London Stock Exchange was brought to a standstill on what should have been its busiest day after overnight computer processing took longer than expected, and share prices were scrambled. In both cases, minor glitches led to major problems.

The British government has had even more trouble with computer projects, this month's possible loss of millions of tax records being a relatively minor example. Recent disasters include Pathway, the Post Office's swipe-card project, which was being developed with the Benefits Agency, and problems with the national insurance recording system, the immigration and nationality directorate computer, and the Home Office's passport system.

The parliamentary public accounts committee, looking into these problems, found more than 30 government computer projects that had gone seriously wrong. The committee's chairman reportedly observed that "a number of the departments appear before you so often that they are almost like old friends".

But while a lot of big computer projects go wrong, in the sense that they come in late or over budget, so do ones that have nothing to do with computers: big projects are inherently difficult to manage. Governments may do worse than private industry because civil servants are less good at project management, or because governments attempt more big computer projects than private companies, or because government computer failures are harder to hush up.

Whatever the case, everyday life is now so dependent on computers that their failure could conceivably lead to real disasters, and the recent century date-change problem - known as the Millennium Bug - had many people contemplating the breakdown of society.

Actually, most computer systems work most of the time, and it is journalistic hyperbole to describe service failures as disasters.

Arthur Lawrence, chairman of the British Computer Society's specialist group on safety-critical systems, says: "People who don't know anything about it tend to think disasters are happening all the time, when they're not. A computer failure is an inconvenience or an embarrassment rather than a disaster. Most people in the safety-critical computing industry are working to ensure that disasters can't happen."

A spokesman for National Air Traffic Services (Nats) was also quick to stress that while the failure of the air traffic control computers "did cause some difficulties, obviously, there was no loss of safety. And, in fact, on the day in question, there were no safety-related incidents".

Air traffic controllers had to revert to "a manual system, which was slower, so the number of flights was curtailed, but there wasn't actually anything dangerous".

The spokesman also argued that the IBM mainframe-based flight data processing system had run at 99.99% availability in 1999, which was an improvement on the 99.97% availability achieved the previous year. "This was the first major failure in 15 years."

In the computer industry, 99.99% availability - which allows downtime of 53 minutes a year - is not considered particularly noteworthy. The more challenging and much more expensive target is "five nines" or 99.999% - an average annual downtime of nearer five minutes - but Nats only quotes figures to two decimal places.
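The arithmetic behind these figures is easy to check. The short Python sketch below is purely illustrative: it assumes a 365-day year and counts every minute of the year as scheduled service, and the function name is made up for the example. It converts an availability percentage into the annual downtime it permits, using the three figures quoted above.

```python
# Illustrative sketch: converts an availability percentage into the
# downtime it allows per year. Assumes a 365-day year (no leap day) and
# that the system is meant to be in service around the clock.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def annual_downtime_minutes(availability_percent: float) -> float:
    """Return the minutes of downtime per year implied by an availability figure."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * MINUTES_PER_YEAR


# The two Nats figures and the "five nines" target mentioned in the article.
for percent in (99.97, 99.99, 99.999):
    minutes = annual_downtime_minutes(percent)
    print(f"{percent}% availability allows about {minutes:.1f} minutes of downtime a year")
```

On these assumptions, 99.99% works out at just under 53 minutes a year and 99.999% at a little over five, which matches the figures above. The 99.97% achieved the previous year corresponds to more than two and a half hours.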

So far, building computer systems that provide "five nines" or 24/7 (24 hours a day, seven days a week) operation has been a specialist job. Over the past 20 years, the main suppliers have been Tandem, which was taken over by Compaq, and Stratus, which provided IBM with hardware it resold as the System/88. In a fault-tolerant system, you can remove a major component - a processor chip or hard drive - and it will keep going. But such systems are expensive, because they are based on specialised software and highly-redundant hardware.

David Chalmers, from Stratus, says things are about to change. Previously, it was mainly banks and other financial service providers that bought fault-tolerant systems, because they were handling transactions that could be valued - someone taking money from a cash machine, for example. "On the web," Chalmers says, "every transaction has a value. When eBay went down, its market capitalisation shrank by $3bn."

In effect, the web has not just brought many companies' "back office" computer systems into the "front office", it has put them on a global stage. "Today when your system goes down, everybody knows," says Chalmers. "The online economy needs what we do."

Naturally he has something to sell: ftServer computers that Stratus plans to launch at the end of the year. The new design will be based on Intel's next-generation 64-bit processor, codenamed McKinley, and Microsoft's forthcoming data centre version of Windows 2000.

Chalmers claims that, by using industry-standard parts, the ftServer will provide fault tolerance that, instead of being more expensive, will be significantly cheaper than the mainframes or clusters of minicomputer servers commonly used today. "We took a long hard look at ourselves," says Chalmers. "We'll keep doing what we've always done, but we've changed our view of price."

The ftServer depends on a deal Stratus has signed with Microsoft that will allow customers to run multiple copies of Windows 2000 in lock-step - so that if one system fails, another takes over transparently - but only pay for one. It also depends on other suppliers signing up to produce ftServer hardware: NEC of Japan has been the first to license the technology.

But, as Chalmers concedes, "there's a lot more to availability than smart tin". Computer system failures are often not the fault of the hardware or even the operating system, but of incorrect procedures or specifications. In other words, the software "works as designed", but the design was wrong.

John Higgins, director general of the Computing Services and Software Association (CSSA), says: "On the face of it, computers go wrong and people tend to blame the computer, but what does that mean? Really you have to look behind the computer, and usually you'll find somebody's trying to introduce some big change in procedures and working practices and there are bound to be hiccups. It tends not to be the technology."

Either way, it's important to find a solution, because New Labour is betting the future of the country on an IT strategy that involves making all government services available online by 2005. This is going to cost billions.

The public access network - a "national network of over 700 UK Online centres in libraries, schools, shops and even pubs", according to e-minister Patricia Hewitt - will cost £450m. The government has published Successful IT: Modernising Government In Action, a Cabinet Office guide intended to "improve the way government handles IT projects" (www.citu.gov.uk/itprojectsreview.htm).

This mandates two huge changes. First, projects have to focus on delivering "business benefits" rather than installing computer systems. The report says: "Too often we have seen an approach that looks only at part of the change programme (for example, bringing in new technology) and does not integrate this with other elements (such as culture change) or take an overall view of the whole change process. Achieving and maintaining this integration is a vital, and ongoing, management task."

Second, each project must be run by an individual, a boss, not by some buck-passing committee. Indeed, the report requires that all "IT-based change programmes and projects" have "a senior responsible owner". This is an approach that works well in private industry, but may not survive entrenched government bureaucracy.

But as Higgins says: "The stakes are getting higher: it's not as though we'd computerised all the things we're going to computerise. We've got very ambitious programmes in the public sector and in the private sector, with people looking for increased efficiency and increased exploitation of the technology, so we've got to get better at implementing it."

 
