
Why Your Digital Data Could One Day Disappear - Dark Ages II: When the Digital Data Die

Think your digital data is stored permanently? Think again, says Harvard Medical School professor Bryan Bergeron. Magnetic media such as floppy discs have a shelf life, are susceptible to technological obsolescence, and are vulnerable to attack. That's why companies should start thinking now about how best to preserve their data. And try not to think about the digital Dark Age scenario.

Managing risk is part of life. For example, the mere act of crossing the street carries a certain risk, normally outweighed by the desire for whatever lies on the other side. A man dares to cross a street frequented by fast-moving cars and trucks, risking being hit and maimed or killed, in order to deposit a letter in a mailbox. However, the risk exposure (the likelihood of being hit multiplied by the magnitude of the loss, as measured in pain or lost income to the man's family) is obviously acceptable to him. The man's hesitance in crossing the street, if any, is typically minimized if a traffic light is present and reduced even further if a police officer is directing traffic in the intersection.

The probability of the man being hit is still finite—the driver of a speeding car might be drunk, fleeing the police, or traveling in the wrong direction. In any of these cases, the man can take a proactive role in minimizing his risk of being killed by looking both ways before he crosses. He can also wait a moment after the light changes, signaling cars to stop, to verify that the drivers are aware of the light and are actually stopping.

Humans are very poor at estimating probabilities, much less probabilities multiplied by potential loss. For example, clinicians are notorious for not being able to intuitively resolve statistical probabilities in determining the best treatment options for patients, even though they may understand and work with statistical formulas and measures. For instance, if the probability of polio is very low in the general population, but a clinician sees five cases of polio in his office in a week, he's much more likely to think that his next patient may have polio as well. That is, his perception of the probability of a disease is influenced by recent events, even when the published statistical data are available. This characteristic isn't limited to clinicians; it's a general human trait. Most people are sensitized by recent events, and that sensitivity affects their perceptions from that point on.

Given the appropriate tools, however, most people can develop an appreciation for risk assessment and management. The theory of risk management involves fundamentals ranging from Bayes' theorem, chaos theory, and game theory, to probability and utility theory. For example, utility theory deals with individual differences in risk avoidance. Some people are simply more risk averse than others. They tend not to take chances, and when they do, they minimize their risk exposure by buying insurance, for example. Others are much less risk averse and may even gain pleasure from gambling and potentially beating the odds.

Although the formal approach to risk management has its place, when it comes to digital data loss, a common sense approach to managing risk seems at least equally applicable. In this regard, a reasonable risk management strategy is to first determine what data are at risk, what they're worth, and then use that information to determine what resources should be expended to reduce the risk to an acceptable level.

At one extreme, the data at risk may consist of one megabyte of downloads from a government Web site containing information that isn't likely to change in the next decade. Given that printed versions are available free for the asking, the data are worth whatever time it took to locate and download. Threats to the data include a hard drive crash, a viral infection, and theft. In this scenario, dedicating a 30-cent floppy backup for the data, labeled and stored in a shoebox in an office desk drawer with other archives, seems reasonable. Devoting a $10 Zip disk or taking the time to burn a CD-ROM of the data seems a little unreasonable.

In contrast, consider that the one-megabyte file holds the only version of Jane's Ph.D. thesis, and it exists only on her laptop's hard drive. What's more, the laptop is about a decade old, there are no replacement parts or even batteries for the unit, and it's been acting up lately. In addition, as in the scenario above, there is a chance that the hard drive could crash, a virus could infect the system and her file, and her laptop, a portable device that she carries everywhere, could be stolen. Given these facts, the data deserve at least to be archived to several floppies, one stored at her home, one at her office, and another sent to a friend or relative. Because the data could not be replaced, the cost and time involved in making several CD-ROMs seem very reasonable. What's more, at least one good-quality printout on acid-free paper should be made as well. In a worst-case scenario in which all of the digital data perish, Jane will be able either to rekey the thesis into her computer or, more reasonably, feed the printed manuscript into a flatbed scanner equipped with optical character recognition (OCR) software to recreate the digital file.

These two scenarios illustrate a practical approach to managing risk, described further below.

1. Identify and quantify the data at risk. The first step in managing the risk of digital data loss is to identify and quantify the data at risk and to specify the application used in writing the data. Record the data's identifiers: storage location or path, file size, file and folder names, date of last save, and data type such as text, image, sound, video, or animation. Here's an example: The data consist of a 1.02-MB text file, "JanesThesis.doc," saved on March 13, 2001, as a Microsoft Word 2000 document, residing on the hard drive in Jane's laptop. Note that the data identifiers themselves become valuable data that should be archived for future use.

In larger organizations, there may be tens of thousands of text documents, hundreds of electronic presentations, project management files, thousands of images, and hundreds of thousands of email messages at risk. In these situations, it often makes sense to perform regular data audits, patterned after periodic financial audits, to keep track of the data of an organization. Since data are assets, they should be subject to traditional accounting practices. Not only should the origin and current location of data be known, but also, like other assets, data should be categorized and organized in a way that supports the auditing process. That is, someone with the appropriate authority should be able to retrieve any particular data in the organization within minutes.
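Capturing the identifiers described in step 1 lends itself to automation. As a minimal sketch (not from the book; the field names are illustrative), a short Python function can record a file's path, size, last-save date, and a rough data type inferred from its extension:

```python
import os
from datetime import datetime

def audit_file(path):
    """Record the step-1 identifiers for a single file: location,
    size, last-save date, and a rough data type from the extension."""
    stat = os.stat(path)
    return {
        "path": os.path.abspath(path),
        "size_mb": round(stat.st_size / 1_048_576, 2),
        "last_saved": datetime.fromtimestamp(stat.st_mtime).strftime("%Y-%m-%d"),
        "type": os.path.splitext(path)[1].lstrip(".") or "unknown",
    }
```

Run over a directory tree, records like this form the raw material for the periodic data audits that larger organizations need.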

2. Establish the value of the data. Some data have more value than others. The value of data includes not only the production costs (time and materials) involved in data creation, but also replacement costs. Data may have sentimental value as well, based on emotional factors. For example, an image of a little boy on a beach captured with a digital camera may have very little inherent value. Perhaps the boy is unknown and the beach one of dozens of similar beaches in southern California. However, if the photographer is world-famous for his work, the image may be priceless.

Conversely, if the photograph is of a little boy who was recently killed in a hit-and-run accident, then the photo may have immense sentimental—but by no means priceless—value to his parents and grandparents. That is, his parents would not likely sell their home and car to buy the photograph. Finally, if the photograph was taken by a novice art student as part of a homework assignment and is one of dozens of similar photographs taken of the boy, then the value is relatively low unless the student eventually becomes a great photographer, or the boy a celebrity. In each case, the value of the data is a judgment call. Collectors and private citizens routinely gamble on keeping objects that may appreciate in value and attain the status of antiques. In the commercial arena, the value of data is often less fuzzy. A 1,000-image digital library may be worth $350,000 to a marketing firm, based on annual sales, cost of acquiring the images, and other easily quantified data.

The value of data is often highly time dependent. For example, if the photograph is needed for the cover of a major news magazine, then it may be worth tens of thousands of dollars to the publisher—as long as the photo makes it to press by the deadline. After the deadline has passed, the digital photograph may be worthless to the magazine. Similarly, if the photograph was taken by a true master, then the value of the data may increase sharply after the master's death, a common pattern in the art world.

3. Identify and rank the threats. Assuming that the data in question have some value, consider the possibilities of data loss and rank the threats according to the data type involved and why and how the data are maintained. For example, Jane probably ranks hardware failure, theft, accident, and operator error higher on her thesis threats list than software failure, corporate espionage, or network failure. Note that threats, like value, are time dependent. For example, the likelihood of Jane's laptop failing increases with time. At some point, her laptop will fail, even if she stores it in a safe, simply because the battery will eventually break down and corrode.

4. Determine how to best resolve each threat. At this stage, the techniques and processes best suited to resolving each threat are determined, within the constraints of available technologies. For example, the threat of natural disasters can be mitigated by creating frequent backups that are stored offsite, by establishing a disaster recovery center, and by taking out insurance on the hardware, network wiring, and other information infrastructure.

A reasonable backup scheme for a large organization might entail making a nightly tape backup that is stored offsite in the safe of a trusted and bonded employee. Similarly, a data recovery center should include backup servers and devices capable of reading backups, located a reasonable distance from the main site. What constitutes a reasonable distance depends on the natural threats to the primary center. For example, if the business is located on the bank of a river that has a history of flooding its banks every few years, then the offsite disaster recovery center should be located outside the river's flood plain.

Another approach is to resolve threats by outsourcing, as in using an external Storage Service Provider (SSP). However, outsourcing involves work of its own, such as performing a background check on the SSP. The vendor's market share, critical reviews of its products, financial status, and word-of-mouth reputation should be researched, since they are critical to determining the long-term viability of the company. Assumptions about the outsourced service should be backed by a guarantee regarding its speed and reliability. Any agreement with an SSP vendor should include an escrow account for the data and documentation on how the data can be restored to another SSP or other service provider. An insurance policy from a third-party insurer is a traditional way of managing risk as well.

5. Determine resource requirements. Once data have been identified, their value established, and the risks of loss identified and ranked, the next step is to determine what resources are required to protect the data. The goal at this stage is to identify the cost and time issues involved in archiving data, based on assumptions of how data will actually be preserved. For example, properly archiving 1,000 digital images to CD-ROM, including creating a standalone database indexed to image number with two keywords per image, might take $4 per image for digitizing, CD-ROM blanks, database software and about ten minutes per image for indexing. Thus, the project would entail $4,000 and four 40-hour weeks from someone who is familiar enough with the content to index it properly.
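The arithmetic behind this estimate can be laid out explicitly. The sketch below simply encodes the figures assumed above ($4 and ten minutes per image for a 1,000-image library):

```python
# Resource estimate for archiving a 1,000-image library to CD-ROM,
# using the article's assumed figures (step 5).
num_images = 1000
cost_per_image = 4           # dollars: digitizing, CD-ROM blanks, database software
minutes_per_image = 10       # indexing time per image

total_cost = num_images * cost_per_image             # $4,000
total_hours = num_images * minutes_per_image / 60    # about 167 hours
weeks = total_hours / 40                             # roughly four 40-hour weeks
```

Making the assumptions explicit this way also makes it easy to test how sensitive the budget is to each one, for instance, if indexing turns out to take fifteen minutes per image instead of ten.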

6. Establish return on investment (ROI). The next step in determining whether or not it's worth the time and money to archive data is to determine the return on investment, or ROI, for each piece of data or library. The formula for return on investment, usually expressed as a percentage, is:

ROI (%) = [Value ÷ Investment] × 100

In calculating the ROI, the ratio of the cost savings over the money invested, all values should be converted to monetary figures. For example, four 40-hour weeks might be equivalent to $5,000, including overhead. Referring to the example mentioned in #5, the 1,000 digital images, if the value of the image library is $350,000 and the total investment is $9,000 ($4,000 for digitizing and media supplies plus $5,000 for 160 hours of work), then the ROI is:

ROI = [$350,000 ÷ $9,000] × 100 = 38.9 × 100, or 3,890%

That is, the investment returns almost 39 times the amount of money invested. However, not all returns are this significant, or the decision to archive the data so obvious. Consider that if the value of the library is $10 per slide or $10,000, then the ROI is:

ROI = [$10,000 ÷ $9,000] × 100 = 1.1 × 100, or 110%

That is, the ROI is only 110 percent. At this level of return, it isn't obvious whether or not to archive the data. If the decision maker is highly risk averse, he may decide to invest the $9,000 in the hope of saving $1,000 later. However, he would also incur a certain cost of $9,000. A more risk-tolerant person might simply decide to keep the $9,000 and invest it elsewhere, instead of incurring a fixed lost opportunity. Similarly, if the data have no sentimental value, but only economic worth, then it may be cheaper simply to insure the data through a commercial agency, assuming that the policy cost is significantly less than the value of the data.
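The two ROI calculations above reduce to a one-line function. A minimal Python sketch using the article's formula and figures (exact division gives roughly 3,889% and 111% before the rounding used in the text):

```python
def roi_percent(value, investment):
    """The article's ROI formula: (value / investment) x 100."""
    return value / investment * 100

# The two examples from the text: each library protected by a
# $9,000 archiving investment ($4,000 materials + $5,000 labor).
high = roi_percent(350_000, 9_000)   # ~3,889%
low = roi_percent(10_000, 9_000)     # ~111%
```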

7. Identify data to archive or protect. Given the ROI figures for each file or data collection, the next step is to rank the data by ROI and then define the ROI cutoff for the archiving process. The cutoff value is in part determined by personal preference (remember, some people are more risk averse than others). The decision should also be based on simple economics. For example, if the budget will only support $5,000 worth of archiving activity, then data with the greatest ROI should be archived first, walking down the ROI ranking, until the budget is exhausted.
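The ranking-and-cutoff procedure in step 7 amounts to a simple greedy selection. A sketch, with hypothetical data collections and archiving costs (the names and numbers are illustrative, not from the book):

```python
def select_for_archiving(items, budget):
    """Rank candidate data collections by ROI (descending) and archive
    from the top down, skipping anything that would exceed the remaining
    budget. Each item is a (name, roi_percent, archiving_cost) tuple."""
    selected, spent = [], 0
    for name, roi, cost in sorted(items, key=lambda t: t[1], reverse=True):
        if spent + cost <= budget:
            selected.append(name)
            spent += cost
    return selected, spent

# Hypothetical candidates: (name, ROI %, cost to archive)
candidates = [
    ("image library", 3890, 4000),
    ("email archive", 500, 3000),
    ("old press kits", 120, 2500),
]
picked, spent = select_for_archiving(candidates, budget=5000)
# -> ["image library"], 4000: the highest-ROI collection fits,
#    and nothing else does within the remaining $1,000.
```

A decision maker's personal ROI cutoff can be layered on top by filtering out any candidate below the cutoff before calling the function.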

8. Archive the selected data. Once the data destined for archiving have been identified, the next step is to do the work. If an external SSP is involved, then this stage primarily involves moving data to their database through the Internet. If the archiving is done internally, then the appropriate backup procedures should be followed, with backup tapes or CD-ROMs moved offsite.


9. Establish proactive measures. A critical step in managing the risk of data loss is to take a proactive stance in working with data. That is, instead of thinking of data only after they have been processed, formatted in final form, and made ready for archiving, data should be considered at all stages as part of an ongoing process. Proactive measures call for purchasing name-brand hardware, making frequent backups, using antiviral software to scan documents before they are loaded onto a hard drive, and saving files frequently as they're being worked on to protect them from a software glitch or hardware failure. Something as simple as setting the autosave feature within Microsoft Word to save a file every five minutes can guard against massive data loss in the event that the operating system freezes or the power is suddenly disrupted. Similarly, obvious safety precautions, such as not loading software and files from unknown sources before scanning them for viruses, have few up-front resource requirements. Such sources include email attachments and files transferred on floppy or Zip disks.

10. Evaluate and modify the approach. Feedback from periodic evaluations is key to the success of any management project. Too often, the only feedback in a data loss management project comes too late, when it's discovered that the backup procedure doesn't actually work. There are accounts of actual cases in which backups made by companies over periods of years contained nothing at all.

Obvious questions to ask—and answer—during the evaluation stage include:

  • Do the backups work?
  • How well do the media tolerate storage?
  • Are the storage conditions optimal for the media?
  • Can the system be restored to normal from the backup tapes or CDs alone?
  • How long does it take to fully restore the system?
  • What resources, in terms of people and time, are required?
  • If the system isn't perfect, what is the success rate and how can it be improved?
  • What is the cost of the program, and what is the cost of a restoration, in terms of lost productivity and computer system downtime?

Unfortunately, surprisingly few companies—and even fewer individuals—have a data loss management plan. That is, they may have an inkling of what should be done and perform backups relatively regularly because everyone knows that backups are important. However, unless there's a written plan that can be read, examined, scrutinized, and followed, one or more of the above stages will be skipped or shortchanged. In particular, without a written plan in place, there is a much greater likelihood of an actual data loss incident. As described here, there are ways to understand and control the economic ramifications of data loss and to reduce the chances of it happening in the future.

Excerpted with permission from Dark Ages II: When the Digital Data Die, Prentice Hall PTR, 2002.


Bryan Bergeron is an assistant professor at Harvard Medical School. He also teaches for the Health Sciences and Technology program sponsored by Harvard and MIT. He teaches entrepreneurship and medical technology.


Seven Questions for Bryan Bergeron

In an e-mail interview with HBS Working Knowledge editor Sean Silverthorne, author Bryan Bergeron details how both individuals and corporations are not doing enough to protect our digital data, and suggests a doomsday scenario threatening even civilization itself.

Silverthorne: In general, how perishable is our digital data?

Bergeron: The question is really how accessible is our digital data. Data may reside, physically intact, in the orientation of iron molecules on a floppy disc. However, even before the iron oxide flakes off of the plastic foundation (in, say, five to twenty years, depending on handling and the environment), the computer that wrote the data—and can read it—will be extinct. For example, I recently purchased a Commodore 64 system, at nearly the cost of a current PC, simply to be able to read a commercial program I wrote in the mid-1980s. The discs were still readable but, save for the few Commodore 64 systems left in attics and basements, practically unusable.

Q: How actively should a CEO and other top management become involved in addressing this problem?

A: The CEO should first take the time to understand the nature of the problem. Assuming his or her company relies on digital data, then the cost of losing that data should be weighed against the cost of assigning someone to archive it. It's up to the CEO to take charge and define a vision for the long-term care of digital assets.

Q: Should there be a VP of Data Management?

A: In many corporations, the database management function is held by the Chief Knowledge Officer (CKO). Regardless of the title, someone in senior management should assume responsibility for the digital intellectual assets of a company. One problem with assigning the task to a CIO is that he or she is typically removed from the process and from the metadata that make the data valuable from an archival or reconstruction perspective. For example, a CIO might oversee a data warehouse or data mart but have no idea of how the data are used, if at all. Similarly, the most valuable data typically receive the same treatment as the least valuable data. You want someone who can differentiate between high- and low-value data so that they can be stored accordingly.

Q: Does the arrival of "pervasive computing" worsen this problem, or help solve it?

A: It makes the problem more acute. With more data out there, not only will they be more difficult to archive, but the noise level will increase as well. The more pervasive digital data become, the less valuable any one data element becomes, and it becomes a challenge to determine what data to store and what to let go. It's like digital data over the phone system. Several gigabytes of voice data are created every minute on the digital phone networks in the U.S., but what value are they? There are certainly pearls of wisdom—and often talk of drug deals and terrorist plots—in the data, but without the proper filters and other tools, the data are lost, from a practical perspective.

Q: What company or organization has addressed this situation correctly?

A: The National Archives is probably the gold standard of what can and should be done. You can also look at what Bill Gates has done with his photography collection in terms of moving it to an underground cave. Most companies can't afford caves, but they can afford to use the services of companies like IBM that provide access to secure storage environments.

Realistically, I believe that some of the Fortune 100 companies do a good job of maintaining their data, primarily through outsourcing their data archiving activities to third-party specialists. For example, several of the companies affected by the attack on the Twin Towers on September 11 had their data archived by IBM in underground, temperature- and humidity-controlled caves. Most of these companies were up and running in a matter of days.

Q: The consequences of lost data go way beyond the individual business—your book suggests even that civilization itself could collapse. What's the scenario for this—and how likely is it?

A: Just consider what would happen if a Web worm somehow erased every server in the U.S. Most companies couldn't replicate their Web sites. At best, they might have archived the individual data elements, the images and text, for example. However, the meta-knowledge is missing. Consider also what would happen in a war, or even if terrorists somehow disrupted the power grid for a year or more. Even a severe economic downturn could significantly reduce the amount of data available to the next generation. Consider that when the dot-com boom ended, companies were tossing out PCs—and the data on their hard drives—for pennies on the dollar. All of those data are gone. All of the person-years of work, never to see the light of day.

I also think that the events of September 11 demonstrate how fragile our electronic, digital civilization really is. Two airplanes disrupted communications, power, the stock market, government, and myriad other processes for weeks.

Q: Any surprises while researching the book?

A: Yes. That most people aren't concerned. I think that many people are just beginning to reach the point where the fragility of digital data is becoming obvious. CDs in their personal collections are starting to skip. The first photos taken with their old digital cameras and stored on some nonstandard memory device aren't accessible. Perhaps their Ph.D. thesis, written on an original IBM PC and stored on 5¼-inch discs, is no longer accessible because the hardware isn't available.