Storage/Backup Issues

Storage/Backup and Long-Term Preservation Breakout Group Report

Synopsis:

The Storage/Backup and Long-Term Preservation Breakout Group was charged with exploring a series of related questions concerning storage, the bricks and mortar of any long-term digital preservation system. As noted in the OAIS standard (CCSDS 650.0-B-1), the reference model for a digital repository and its information objects, Storage is one of six interconnected components of the reference architecture; the others are Ingest, Administration, Data Management, Access, and Preservation Planning. No component stands as an isolated island. The conversation in the breakout group moved swiftly from component to component, which demonstrated how important it is to approach this subject as an interconnected web. Discussion focused on a spectrum of topics, which included (1) best practices for storage infrastructure, (2) metadata standards to represent the logical context and understanding of digital files in human form, (3) business models to sustain long-term preservation activities, (4) data models to store repository data, (5) planning models to identify, execute and validate preservation treatments and (6) domain-specific challenges to establishing a trustworthy storage/repository infrastructure for the Anthropological community. Across the sub-fields of Anthropology, the components of a storage infrastructure (hardware, media types, configurations and software to manage storage) needed for backup and preservation functions were contextualized by the drivers and requirements of the other components of a long-term preservation system. For example, it was not possible to discuss storage media options for preservation without also understanding the storage requirements, instrumentation, and tagging standards needed by the archeologist to capture, describe and ingest field data.

Summary of Breakout Group Discussion by Topics:

Topic 1: What are the best practices with regard to storage and backup?

Best practices emerge over time as a result of a deeper understanding of a problem and of outcomes from pilot projects or test beds established for experimentation. While the Anthropological community is just beginning to explore storage solutions for LTP (long-term preservation), the Digital Library community has for nearly a decade explored the principal issues and challenges that surround storage and backup of digital data. The principal problems that need to be addressed are well known and include (1) technological obsolescence, (2) media decay, (3) replication, and (4) evolving standards to manage large storage pools or networked storage grids. The worst-case scenario for storage and backup identified by the group was locally managed storage. This modality is associated with a high probability of data loss over time. In this mode the best practices followed by traditional data centers to protect data and prevent unauthorized access to it are nearly impossible to maintain. The group ruefully noted that a significant number of students and researchers still manage their own storage. Hence the challenge is to educate the community on the need to abandon this practice and adopt alternative solutions such as participation in grid storage networks. At the opposite end of the spectrum, and across the Atlantic, the European community has successfully demonstrated the efficacy of grid storage for LTP of digital data. The infrastructure for grid storage has trusted governance, which establishes best practices for dealing with the data management problems that arise from the aforementioned weaknesses of storage hardware and of the software used to manage storage. One member of the group characterized grid storage as “being alive”: continuously refreshed and secure, since access and replication were an integral part of the management functionality of the grid.
In addition, participation in the grid relieves the student or researcher of the responsibility to plan and manage his or her own media migrations. While storage grids do exist in the United States (see the NSF program on grid storage at http://www.teragrid.org/about/), the group also discussed commercial cloud storage as another option for LTP. This solution is just beginning to gain traction in the US academic community since it is a potential cost saver, a powerful motivator while the country wrangles through a deep recession. Cloud storage provides the opportunity to outsource the storage function to large commercial vendors like Amazon and Google that run their own storage grids. For this storage option trust is a significant issue. Commercial vendors are subject to the natural business cycle and no firm is completely immune to failure or takeover. How to access or recover data when a business fails is of serious concern to the academic community. Secure access to data was another problem identified with commercial cloud storage. In response to these concerns the Mellon Foundation recently sponsored a planning grant to understand how the academic community could take advantage of cloud storage without being at the mercy of the business cycle, and to explore technically how commercial cloud storage could be overlaid with a service interface that would protect data from unauthorized access and automatically replicate data if a firm went out of business. Details about this initiative are available from the DuraSpace website (http://DuraSpace.org). The breakout group also discussed storage media and configuration options for LTP. Optical disk, magnetic disk and tape have all been successfully used for data storage and backup. In most instances these media are combined to form a hierarchical storage system.
Typically these systems deploy magnetic disk for fast online access to data and tape or optical disk to store off-line data that is infrequently accessed. The goal is to build a configuration that satisfies LTP requirements at a price-performance point that is affordable and sustainable. Finally, the group unanimously recognized that storage and backup do not equate to long-term preservation of digital data. In the absence of a logical layer, such as PREMIS, to overlay storage, digital data would over time become more difficult to discover, search, access or understand as hardware, software and community standards evolved and made older storage and access systems obsolete.
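To make the replication problem concrete, the following is a minimal sketch of copying a file to several storage locations and confirming that every copy is intact. It is an illustration only: the use of SHA-256 checksums to verify copies is an assumption that was not discussed by the group, and the function names, directory names and file contents are hypothetical.

```python
from __future__ import annotations

import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def replicate(source: Path, replica_dirs: list[Path]) -> dict[Path, bool]:
    """Copy source into each replica directory and verify each copy.

    The checksum comparison is an assumed verification step, not one
    prescribed by the report.
    """
    expected = sha256_of(source)
    results: dict[Path, bool] = {}
    for directory in replica_dirs:
        directory.mkdir(parents=True, exist_ok=True)
        copy = directory / source.name
        shutil.copy2(source, copy)
        results[copy] = sha256_of(copy) == expected
    return results


def demo() -> bool:
    """Replicate a hypothetical field-notes file to two locations."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        original = root / "field_notes.txt"
        original.write_text("site 12, trench B, layer 3\n")
        outcome = replicate(original, [root / "replica_a", root / "replica_b"])
        return all(outcome.values())
```

In a managed grid this kind of copy-and-verify cycle would be run by the storage service itself on a schedule, which is what frees the individual researcher from doing it by hand.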

Topic 2: Does the PREMIS standard provide sufficient metadata to support the long-term context and access to anthropological data?

PREMIS (PREservation Metadata: Implementation Strategies) is the de facto standard for the digital library community. It specifies the metadata entities recommended to ensure the long-term preservation (discovery, access, rendering and understandability) of digital data encapsulated in a vast array of file formats. The group as a whole lacked an in-depth understanding of the PREMIS standard, which made it difficult to evaluate realistically whether PREMIS could be successfully applied to preserve anthropological data. However, in the absence of any other recognized standard, the group maintained that leveraging and extending this standard for the Anthropology community was strategically the right course of action. The breakout group leader did have expertise in this area and, in very broad strokes, introduced the PREMIS entities (Intellectual Entities, Objects, Rights, Agents and Events) to the group. There was a focus upon the Object entity, which specifies metadata about the hardware and software environment needed to create and preserve a digital object. The Object entity also identifies the software needed to access and render a digital object. Most importantly, the Object entity identifies the encoding standards for an object’s file format and characterizes a digital object as either a simple file or a complex object. A PDF file with an embedded image that cannot be rendered independently of the PDF file is a good example of a complex object. A policy question that still needs to be resolved by a standards committee is which elements of this very elaborate standard are needed by the Anthropological community to meet its preservation purposes. It is not practical or affordable to capture data for all of the sub-elements in the PREMIS standard.
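The distinction between a simple and a complex object can be sketched as a small data structure. This is an illustration under stated assumptions, not the PREMIS schema: the class name, the field names and the sample values are all hypothetical and do not correspond to the element names in the PREMIS data dictionary.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ObjectRecord:
    """Hypothetical, simplified stand-in for Object-level metadata."""

    identifier: str
    format_name: str            # encoding standard of the file format
    rendering_software: str     # software needed to access and render
    creating_environment: str   # hardware/software used to create it
    embedded: list[ObjectRecord] = field(default_factory=list)

    @property
    def is_complex(self) -> bool:
        """An object is complex when it carries embedded objects."""
        return bool(self.embedded)


# A PDF with an embedded image that cannot be rendered on its own.
image = ObjectRecord(
    identifier="obj-002",
    format_name="JPEG",
    rendering_software="PDF viewer (via the parent file)",
    creating_environment="digital camera",
)
report = ObjectRecord(
    identifier="obj-001",
    format_name="PDF",
    rendering_software="PDF viewer",
    creating_environment="desktop word processor",
    embedded=[image],
)
```

Deciding which of these fields are mandatory for anthropological data is exactly the kind of policy question the group left to a standards committee.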

Topics 3-5: Repository Functions (Ingest, Access, Preserve) and Associated Data Models

The repository software used in the cultural heritage community to ingest, store, preserve and access digital content is mostly open source. Repository software offerings that have gained significant traction in the digital library domain are (1) Fedora, (2) DSpace, (3) Greenstone, (4) EPrints, (5) Plone and (6) CONTENTdm from OCLC. It is important to note that the Fedora and DSpace communities have recently combined to form a consolidated community called DuraSpace. All of these applications have out-of-the-box client interfaces to their underlying data stores that simplify the ingest, storage and search/access of data. In addition, these repository systems have Application Programming Interfaces (APIs) that can be used to build customized web applications or web services for any of the aforementioned functions. Protocols such as OAI-PMH, OAI-ORE and SWORD, to name a few, have also been developed by the digital library community to make these systems interoperate so that data can be exchanged between them. The group recognized that these power tools, in the right hands, could create highly customized systems tailored to the special requirements of the Anthropological community. However, the group also recognized that there is a steep learning curve to understanding these technologies and that the cost of hiring developers is high. The group maintained that one way to overcome these challenges was to appeal to granting agencies for additional support to build specialized systems, based upon open source technologies, that could be leveraged by other anthropological research projects. Although the repositories offer mostly the same functionality, there are important differences in how they represent stored data, which is technically referred to as a data model.
Just as the ability to search and discover is tightly bound to the representation of data, the ability to preserve data is tightly coupled to a data model that facilitates preservation planning and preservation treatments.
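As an illustration of how lightweight the interoperability protocols can be, an OAI-PMH harvest is started by an ordinary HTTP request whose query string names a protocol verb. The sketch below only builds such a request URL. The repository address and the set name are hypothetical; the verb ListRecords, the metadataPrefix argument and the oai_dc prefix come from the OAI-PMH specification.

```python
from __future__ import annotations

from urllib.parse import urlencode


def list_records_url(
    base_url: str,
    metadata_prefix: str = "oai_dc",
    set_spec: str | None = None,
) -> str:
    """Build an OAI-PMH ListRecords request URL.

    base_url is the repository's OAI-PMH endpoint; the one used in the
    example below is a placeholder, not a real service.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec is not None:
        # Optional selective harvesting by set.
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


# Hypothetical endpoint and set name, for illustration only.
url = list_records_url("https://repository.example.org/oai", set_spec="fieldnotes")
```

Fetching that URL from a compliant repository returns an XML document of metadata records, which is how one system can harvest descriptions held in another.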

Following an introduction by the group leader there was a discussion of the PLANETS project (http://www.planets-project.eu), which has published a preservation data model and created PLATO (http://www.ifs.tuwien.ac.at/dp/plato/intro.html), a tool for preservation planning. The discussion centered on an important characteristic of the data model: its ability to provide two distinct views of stored data. One view, from the end-user perspective, facilitates search and discovery of preserved data. The other, from a preservation perspective, enables preservation treatments (media or format migrations) at the file set level that do not affect the end-user view or understanding of the data. Risk of data loss is inherent in any preservation treatment, and the planning tool PLATO was designed to attenuate that risk. “The planning tool Plato is a decision support tool that implements a solid preservation planning process and integrates services for content characterization, preservation action and automatic object comparison in a service-oriented architecture to provide maximum support for preservation planning endeavors.” Again, in the absence of other available standards, the group maintained that it was strategic for the Anthropological community to leverage this work for its own purposes.
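The idea of two views over the same stored data can be sketched in a few lines. This is not the PLANETS data model; it is a hypothetical simplification in which the end-user view exposes only descriptive fields, while the preservation view tracks the file set and can record a format migration without altering what the end user sees. All class names, field names and sample values are assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class StoredFile:
    """One physical representation of an item (hypothetical)."""

    path: str
    file_format: str


@dataclass
class PreservedItem:
    """An item with a stable end-user view and a changeable file set."""

    identifier: str
    title: str
    file_history: list[StoredFile] = field(default_factory=list)

    def user_view(self) -> dict[str, str]:
        """What search and discovery see: no file-level detail."""
        return {"identifier": self.identifier, "title": self.title}

    def current_file(self) -> StoredFile:
        """Preservation view: the most recent representation."""
        return self.file_history[-1]

    def migrate(self, new_path: str, new_format: str) -> None:
        """Record a format migration; older files are kept in history."""
        self.file_history.append(StoredFile(new_path, new_format))


item = PreservedItem(
    identifier="item-42",
    title="Excavation photograph, trench B",
    file_history=[StoredFile("store/item-42.bmp", "BMP")],
)
view_before = item.user_view()
item.migrate("store/item-42.tif", "TIFF")
view_after = item.user_view()  # unchanged by the migration
```

Keeping the earlier file in the history rather than overwriting it is one simple way to limit the risk of loss that any preservation treatment carries.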

Topics 6-7: The Trusted Digital Repository

The scale and available resources of the Anthropological community will encourage researchers to participate in community-sponsored preservation repositories. The digital library and archival communities have done significant research in this area over the past ten years. For many organizations in these two communities the OAIS model from the Consultative Committee for Space Data Systems has become the de facto standard. While this standard defines the functional components of a preservation system, it is agnostic as to how its modules are to be implemented. Nor does the OAIS standard directly address the issue of what constitutes a trusted digital repository. Without a means to verify a preservation repository’s capability to keep data alive over long periods of time as technology evolves, researchers will be wary of supporting and making deposits to preservation systems. This is a no-win situation for the researcher and the community, since it encourages preservation activities at the individual level. TRAC, or Trustworthy Repositories Audit & Certification: Criteria and Checklist, was developed to address this problem. The goal of the RLG-NARA Task Force on Digital Repository Certification has been to “develop criteria to identify digital repositories capable of reliably storing, migrating, and providing access to digital collections. The challenge has been to produce certification criteria and delineate a process for certification applicable to a range of digital repositories and archives, from academic institutional preservation repositories to large data archives and from national libraries to third-party digital archiving services.” For the Anthropological community this standard may not be appropriately scaled, and the community is pursuing alternative ways to assess the trustworthiness of a repository.

Recommended Next Steps:

The breakout group recommends that the Anthropological community take the following next steps to advance its understanding of the long-term preservation of digital data in its domain:

Create a task force to propose an entity to recommend a long-term plan and business model for funding and sustaining LTP specific to Anthropology.

Create a standards body that will review proposed standards for LTP of anthropological data across the sub-domains.

Anthropology should encourage leveraging the technical infrastructure of both commercial organizations and sister disciplines to promote LTP.

Anthropology should take the opportunity to extend open standards and open source software to promote LTP.

Anthropology curriculum should be expanded to include best practices and standards for digitization and LTP of digital data.

Appendix A

Break Out Group Membership

David Gewirtz

Georgetown University Library

Head Information Technology

Laura Welcher

(Director of Development and The Rosetta Project)

Dean Snow:

Penn State University

Professor of Archaeological Anthropology

Michael Fischer:

Professor of Anthropological Sciences in the Department of Anthropology at the University of Kent and is currently Director of the Centre for Social Anthropology and Computing, the University of Kent at Canterbury.

David R. Hunt:

Smithsonian Institution

Museum Specialist/Physical Anthropology Collections Management

Mark Mahoney:

Wenner-Gren Foundation for Anthropological Research, Inc.

Resource Coordinator at the Wenner-Gren Foundation

Toward an Integrated Plan for Digital Preservation and Access to Primary Anthropological Data (AnthroDataDPA: A Four-Field Workshop). Group participants and their affiliations are given in appendix A.

http://www.oclc.org/research/projects/pmwg Link to the PREMIS website.

http://www.ifs.tuwien.ac.at/dp/plato/intro.html From Welcome to Plato, the Planets Preservation Planning Tool.

From the TRAC foreword.


