Maintenance of Data Integrity

To address the threat of technological obsolescence, Simons (2006) recommends that researchers create an archival master in an enduring file format and deposit the archival master in a preservation archive. A preservation archive is an established institution committed to long-term preservation of the digital object; its distinguishing characteristic is a technology migration plan on which to found its claims of long-term digital accessibility. It thus contrasts with a ‘web archive,’ which is often only a website serving information from a database or file directory. Web archives rarely serve genuinely interoperable material, and they regularly disappear in response to changes in institutional servers or in the responsibilities of the archive creator.

Enduring File Format

What is an “enduring file format”? In the acronym created by Simons, it is a file format that offers LOTS: it is Lossless, Open, Transparent, and Supported by multiple vendors. Each of these desiderata deserves some discussion.

Lossless. A lossless file format is one in which no information is lost through file compression. It is uncontroversial to say, for example, that an archival master should be uncompressed and unedited [1]. Copies may, of course, be made from the archival file, and these can be altered to serve as working or presentation copies [2]. Professional archivists usually recommend that the archival master be copied once, to make a ‘presentation master,’ and that compressed and edited copies be made from the presentation master, not the archival master. Although digital copying does not harm the original file if done correctly, use of a presentation master is good advice: some media programs compress automatically when they save a file [3], and to find this out too late is to irrevocably lose part of the information on the archival master.

Although uncompressed file formats are preferable even to those with lossless compression [4], lossless compression is an option if uncompressed files (e.g., video) are so large that their storage is impractical. Lossless compression algorithms typically remove only redundant information (e.g., runs of pixels of the same color in an image) and allow the full content to be recovered through the use of a decoding algorithm. ‘Lossy’ compression, on the other hand, discards so-called ‘irrelevant’ information that can never be recovered; thus it is to be avoided for highly valued material. Although a compressed file may be indistinguishable from an uncompressed file to human ears and eyes, in creating a scientific archive of irreplaceable material (e.g., songs and ceremonies of a vanishing culture), we should remember that the scientific instruments of the future may be able to extract more information from the ‘noise’ on an uncompressed file than we are currently able to perceive. Table 2 shows some common extensions of uncompressed file formats and formats employing lossless and lossy compression.

Type | Uncompressed | Compressed (Lossless) | Compressed (Lossy)
Audio | .wav, .aiff, .au (PCM) [5] | .ape, FLAC, TTA | .mp3, .aac [6], .wma
Images | .bmp, .tiff without LZW | .tiff (or .tif) with LZW, .png, .gif (grayscale) | .jpg
Video | Rtv | JPEG-2000 | MPEG-2, DV, MPEG-4
Text | .txt | .zip | NA

Table 2: File extensions of compressed and uncompressed formats (Aristar-Dry, 2008)
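
The difference between lossless and lossy compression described above can be illustrated with a minimal Python sketch. It is not drawn from the sources cited here: the byte sequences are made-up stand-ins for image and audio data, zlib's DEFLATE is used simply as a convenient example of a lossless algorithm, and the "lossy" step is a toy quantization rather than a real codec such as MP3 or JPEG.

```python
import zlib

# Hypothetical stand-in for raw image data: long runs of identical "pixel" bytes.
original = bytes([255] * 500 + [0] * 500 + [128] * 500)

# Lossless: DEFLATE encodes the redundancy compactly, and decoding restores every byte.
compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

# Lossy (toy illustration only): keep the high 4 bits of each sample, discard the rest.
samples = bytes(range(256))
quantized = bytes(b & 0xF0 for b in samples)

print("lossless round trip exact:", restored == original)
print("compressed smaller than original:", len(compressed) < len(original))
print("distinct values before/after lossy step:", len(set(samples)), len(set(quantized)))
```

After the lossy step, many different input values map to the same output value, so the original samples cannot be reconstructed; the lossless round trip, by contrast, returns exactly the bytes that went in.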

Openness. Some file format specifications are publicly available; for example, HTML, XML, PDF, and RTF are all ‘open standards.’ This means that any software engineer can develop programs that read these file formats. By contrast, information in proprietary file formats is at risk of being lost when the vendor ceases to support the software. “Open standard” is different from “open source,” which describes software whose source code is publicly available. Examples of open source software include Open Office and Mozilla Thunderbird. Open source software usually creates files in open standards, and proprietary software usually does not (though there are exceptions, e.g., Adobe PDF). For long-term intelligibility, however, open standards are more important than open source software. Table 3 below lists some open and proprietary file formats. Note that files created by some of the most commonly used software (e.g., Microsoft Word, Excel, and PowerPoint) are in proprietary, commercial formats and are therefore the least likely to remain readable in the future.

Development | Open standard | Proprietary standard
Open | .txt, .html, .xml, .odf, .csv | NA
Commercial | .rtf, .pdf | .doc, .xls, .ppt

Table 3: Open and proprietary standards (Aristar-Dry, 2008)

Transparency. A transparent file format requires no special knowledge or algorithm to interpret, because there is a one-to-one correspondence between the numerical values the computer reads and the information they represent. Plain text, for example, has a one-to-one correspondence between the characters and the computer-readable binary numbers used to represent them. Similarly, the PCM (pulse code modulation) codec, which is employed by .wav, .aiff, and cdda files, has a one-to-one correspondence between the numbers and the amplitudes of the sound wave. Thus plain text files (.txt) can be read by any software program that processes text, and PCM signals can be interpreted by virtually all audio programs. By contrast, .zip and .mp3 files require implementation of a complex algorithm to restore the original correspondences. Today many programs provide automatic decoding of the common encoded formats, but we cannot be certain that these programs will not become obsolete. In the distant future, some of the encoding algorithms may be lost; at that point, interpreting compressed and opaque files will become a costly scientific endeavor.
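
The one-to-one correspondence described above can be seen directly in a short Python sketch. The sample text, the amplitude values, and the choice of 16-bit little-endian samples (a layout commonly found in .wav files) are assumptions made for illustration only; zlib stands in for any encoded, non-transparent format.

```python
import struct
import zlib

# Plain text: each character corresponds to exactly one number.
text = "LOTS"
code_points = list(text.encode("ascii"))

# PCM-style audio: each amplitude is stored as one signed 16-bit integer.
amplitudes = [0, 12000, -12000, 32767]      # hypothetical sample values
pcm_bytes = struct.pack("<4h", *amplitudes)
decoded = list(struct.unpack("<4h", pcm_bytes))

# Encoded data: the stored bytes no longer map directly onto the characters,
# and the decoding algorithm is needed to get the text back.
opaque = zlib.compress(text.encode("ascii"))
recovered = zlib.decompress(opaque).decode("ascii")

print(code_points)
print(len(pcm_bytes), decoded == amplitudes)
print(opaque != text.encode("ascii"), recovered == text)
```

Reading the plain text or the PCM samples needs nothing more than knowledge of the number layout, whereas the compressed bytes are meaningful only to a program that implements the matching decoder.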

Note that transparency is not possible with some advanced visualization techniques (e.g., 3-D or CT scanning, GIS).

Support by multiple vendors. Just as lack of compression and transparency are paired in file formats, use of open standards and support by multiple vendors go together in software development. Open standards are more likely than proprietary standards to have wide vendor support, because development using open standards is typically less costly. If a file format is open, there is no inherent barrier to creating another program that handles it: it is not necessary to reverse engineer the format or purchase the specification from the developer. The more software applications that handle a file format, the less likely that format is to fall victim to hardware and software obsolescence.

Best versus good practices. Ideals or “best practices” are not always obtainable; researchers may need to consider “good practices.”

Technical recommendations are a moving target. Because technology changes rapidly, regular consultation of up-to-date websites is recommended. See some general resources worth investigating.


  1. Arts and Humanities Research Council (2009).
  2. If the working copy is the primary copy (as, for example, during the ongoing creation of a database), it is important to export the information regularly into an enduring file format. For databases (which are usually managed by proprietary software), this means exporting the data regularly into properly documented plain text. A .txt file with informative XML markup is ideal, but often the XML automatically output by a program will be only minimally helpful to someone trying to make sense of the file. In that case, a file including metadata identifying the fields and tables should be created and stored with the database output.
  3. For example, Acrobat 7.0 will automatically compress large pdf files (see: http://www.planetpdf.com/forumarchive/166948.asp). Most importantly, however, as of this writing, most video capture programs automatically compress the audio track along with the video when it is downloaded to a computer. For that reason, linguists and musicologists are advised to make a separate audio recording, using a device like a hand-clap at the beginning to aid in synchronizing the files later on. See: http://emeld.org/school/classroom/video/field.html#1006
  4. As noted by a Senior Media Specialist at the Getty Museum, “Uncompressed data is trivial to decode, compressed data often is not. This makes for easier long-term viability of the file . . . .” Furthermore, uncompressed data is less prone to loss: “Lossless compression means that a single bit in the compressed file may represent multiple bits in the uncompressed version. This magnifies potential damage caused by bit corruption. In an uncompressed file a single flipped bit will have little overall impact on the renderability of an image. In a lossless compressed file depending on whether the corruption is in the dictionary (in the header) or in image data it can have a larger effect. And in a lossy compression scheme a single bit corrupted can be extremely noticeable.” (Howard, 2003).
  5. Technically, .wav and .aiff are container formats, file structures which allow combining of audio/video data, tags, menus, subtitles and some other media elements. They could theoretically contain compressed audio formats, but in practice they usually contain PCM (pulse code modulation) data, which is an uncompressed format.
  6. Apple audio codec (.aac) and Windows media audio (.wma) both have a lossless version. Confusingly, both the lossless and the lossy compression formats use the same file extension.
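
The advice in note 2, to export a working database regularly into documented plain text, might look something like the following Python sketch. Everything specific here is an assumption for illustration: the in-memory SQLite database, the hypothetical `lexicon` table and its two made-up rows, the helper name `export_table`, and the choice of CSV plus a separate plain-text description of the fields.

```python
import csv
import io
import sqlite3

def export_table(conn, table):
    """Return (csv_text, metadata_text) for one table.

    The table name is interpolated into the SQL, so it is assumed to be trusted.
    """
    # Each row of table_info is (cid, name, type, notnull, dflt_value, pk).
    columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
    names = [col[1] for col in columns]

    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(names)  # header row naming the fields
    writer.writerows(conn.execute(f'SELECT * FROM "{table}" ORDER BY rowid'))

    # Plain-text metadata identifying the table and its fields.
    meta = [f"table: {table}"] + [f"field: {col[1]} ({col[2]})" for col in columns]
    return buf.getvalue(), "\n".join(meta) + "\n"

# Hypothetical working database with made-up sample entries.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lexicon (headword TEXT, gloss TEXT)")
conn.executemany("INSERT INTO lexicon VALUES (?, ?)",
                 [("wasi", "house"), ("yaku", "water")])

csv_text, metadata_text = export_table(conn, "lexicon")
print(csv_text)
print(metadata_text)
```

In practice the two strings would be written to .txt or .csv files and stored together, so that a future reader can interpret the exported data without the original database software.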
