Issues and Problems

We now turn to more specific issues regarding AnthroDataDPA.

Preservation and access

1. Data are rapidly degrading in quality and being lost on a continuing basis. Much has already been lost irretrievably. We badly need functional repositories for digital data as soon as possible. These repositories need to be open to a broad range of depositors and backed up by institutional (including funding agency, university, professional association) commitments.

2. Formal repositories are needed. Investigator- or project-oriented data silos are not, and will not be, financially or technically sustainable, nor are they likely to provide the sorts of access (and access control) that are needed.

3. A major issue is whether preservation and access should be undertaken by means of centralized or distributed repositories. However, a unified repository structure for all anthropology is unlikely to be the best solution. The scope of anthropological repositories should be based on shared needs for functionality and the nature of the data at issue. The fields of anthropology are sufficiently divergent in terms of research goals and the data used to address research questions that trying to unite them now is neither realistic nor necessarily desirable.

4. Data should be deposited in a trusted repository during data collection, or as soon afterward as possible, so that the needed metadata can be collected accurately and inexpensively and a secure copy of the data is maintained. However, the repository should allow the investigator to have exclusive access to the data (or to directly control access by others) for a reasonable period of time to permit publication. What counts as a reasonable period of investigator control may differ by subdiscipline, depending upon the dominant publication modes. Enforced mandates from funding agencies and better guidance from professional societies would be most helpful in defining appropriate limits. With public funding, perhaps 3-5 years after the termination of the grant under which the data were collected is a reasonable limit, with 5 years for dissertations. In any case, 10 years seemed an absolute maximum for restricting access to protect the investigator’s publication interests.

5. To preserve data for long-term use, researchers must ensure long-term ‘intelligibility’ in both human and computational terms. (See the technical sections on Maintenance of Data Integrity and Best Practices for Storage Infrastructure.) “Human intelligibility” refers to the ability of future researchers to understand the information; this is too often compromised by the lack of documentation accompanying the digital file. “Computational intelligibility” refers to the ability of future hardware and software to interpret the file format; this can be compromised by the pace of technological change. Since the 1996 report of the Taskforce on Digital Archiving1, it has been commonplace to remark on the ‘digital dark age.’ Preservation is threatened by the rapid obsolescence of physical recording media and the equally rapid obsolescence of operating systems and file formats. Simons noted that physical media have declined in durability over the years, contrasting the long-term legibility of inscriptions in stone with the many different types of storage media in use in the past 25 years (5.25” floppies, 3.5” floppies, Zip drives, memory sticks, CDs, DVDs, Blu-ray discs).2 The obsolescence of operating systems and file formats is even more striking: the current version of MS Word cannot read documents created in Word 1.0.
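The access limits suggested in item 4 above amount to simple date arithmetic, which a repository could apply when a deposit is made. The following is a minimal sketch only. The function names, the category labels, and the choice of 5 years (the upper end of the 3-5 year range) as the default for grant-funded work are assumptions made for illustration, not recommendations of the workshop.

```python
from datetime import date
from typing import Optional

# Illustrative limits in years, based on the ranges discussed in item 4.
# The keys and the exact defaults are assumptions for this sketch.
EMBARGO_YEARS = {"grant": 5, "dissertation": 5}
ABSOLUTE_MAX_YEARS = 10


def add_years(start: date, years: int) -> date:
    """Return the same calendar day `years` later (Feb 29 falls back to Feb 28)."""
    try:
        return start.replace(year=start.year + years)
    except ValueError:
        return start.replace(year=start.year + years, day=28)


def embargo_end(grant_end: date, kind: str = "grant",
                requested_years: Optional[int] = None) -> date:
    """Date on which exclusive investigator access would lapse."""
    years = EMBARGO_YEARS[kind] if requested_years is None else requested_years
    # No request may exceed the absolute maximum.
    years = min(years, ABSOLUTE_MAX_YEARS)
    return add_years(grant_end, years)
```

For a grant ending on 30 June 2010, embargo_end(date(2010, 6, 30)) gives 30 June 2015, and a request for 15 years would be capped at 30 June 2020. A subdiscipline with different publication modes would only need to change the table of limits.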

Decisions Regarding Depositors

While the group agreed in principle to the idea that all anthropological materials should be digitally preserved, it was recognized that prioritization of projects is unavoidable. The following criteria should be used to set priorities. The relative importance of each criterion must be determined on a case-by-case basis, considering the nature of the material, the resources available, and the goals of the project.3 They are listed here in no particular order.

1. Ease of digitization: Some records are ‘low-hanging fruit’ that may take relatively little effort to digitize because of their condition, organization or description.

2. Format of material: Certain formats (e.g. magnetic tape) are inherently unstable and are likely to deteriorate. Material in fragile formats may be prioritized in the interest of preservation.

3. Fragility of material: Records that are damaged or that have been stored in less-than-ideal conditions may be fragile and subject to deterioration.

4. Current level of access: How accessible are the records already, both to potential researchers and to the creators of the records? Will digitizing increase accessibility?

5. Frequency & intensity of anticipated use: Digitization can prevent damage from frequent handling of material. While future use can be difficult to anticipate, factors such as the identity of the creator or interest in the subject matter can be predictive.

6. Rarity or uniqueness of subject matter: If the records document a unique subject area (e.g. the only known recordings of an extinct language), they may be given priority. In most cases primary data should be given preference over derivative analyses.

7. Material in finite custody: An archive may wish to digitize material that is to be repatriated or is only in temporary custody, assuming that such digitization does not violate any agreement with the owners of the material.

8. Value of material within collections: In addition to prioritizing collections, material within collections can be prioritized. In a very large collection, the volume may preclude digitizing everything at once. In such cases, a representative sample or a select subset can be digitized first.
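One way to apply these criteria consistently while still weighing them case by case is to rate each candidate collection on each criterion and combine the ratings with project-specific weights. The sketch below is only an illustration. The criterion keys, the 0-5 rating scale, and the weighted-sum rule are assumptions, and the criteria above are not a fixed formula. Criterion 8 can be handled by running the same ranking over items within a single collection.

```python
from dataclasses import dataclass, field

# Short keys for criteria 1-7 above (names are assumptions for this sketch).
CRITERIA = (
    "ease", "format_instability", "fragility", "access_gain",
    "anticipated_use", "rarity", "finite_custody",
)


@dataclass
class Candidate:
    name: str
    # Rating per criterion on an assumed 0-5 scale; missing criteria count as 0.
    ratings: dict = field(default_factory=dict)


def priority(candidate: Candidate, weights: dict) -> float:
    """Weighted sum of ratings; weights are set per project, case by case."""
    return sum(weights.get(c, 0) * candidate.ratings.get(c, 0) for c in CRITERIA)


def rank(candidates: list, weights: dict) -> list:
    """Return candidates ordered from highest to lowest priority."""
    return sorted(candidates, key=lambda c: priority(c, weights), reverse=True)
```

For example, with weights of 2 for format_instability and rarity and 1 for ease, a set of unique recordings on magnetic tape rated 5 on both of the first two criteria scores 20 and ranks ahead of easily scanned typed notes rated 5 on ease alone, which score 5.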

Fostering Interdisciplinary Collaboration

Whether the coordinating body is a committee, a consortium of archives, a series of ongoing workshops, or an affinity group, several areas of activity would benefit from central leadership.

Preparing material to be archived: A central organization can help anthropologists prepare material to be archived. This includes recording information and describing context that could otherwise be lost or recorded inaccurately (such as the purpose of the research project and dates, places and descriptions of each item or file).4

Match material with archives: A central group can help address the problem of ‘orphan’ archival material (records with no archival home). We can increase the portion of the anthropological record that is archived through outreach and collaboration. For this purpose, it would be appropriate for teams of archivists and researchers to focus on a specific domain.5

Adapt recommendations and standards: There are many existing standards for digital archiving. It is unreasonable to expect individual anthropologists to interpret and implement these standards on their own. A central group can identify relevant standards, adapt them if necessary to make them relevant within the context of anthropology, and work to encourage their adoption among anthropologists.6

Identify challenges to digital archiving: What are the challenges or barriers to progress in digital archiving? Are these challenges mainly social (e.g. related to people’s expectations and conceptions of archives)? Are they technical (related to infrastructure, user interfaces)? What sorts of resources are necessary to undertake a major digital archiving project?

Develop portals: While it is probably impractical to propose a single digital archive for the discipline of anthropology, it is possible to create portals to data or metadata.7

Education and Outreach: There is a need for outreach to scholars and other practitioners in the discipline of anthropology to increase awareness about digital archiving. Initial steps to educate anthropologists (such as panel discussions and workshops at regional and national conferences) are within immediate reach and should begin in the next year.8 Also, materials should be prepared for incorporation into classroom curricula, such as Field Methods and Research Design courses.

As we will discuss in the section “Funding and Support,” larger-scale efforts will take some planning, including application for funding. Furthermore, if such efforts are to be successful in the long term, anthropology will have to work to develop a sustainable community model bringing together all of the stakeholders in anthropological data DPA.

What to Do About Data in the Meantime?

In the absence of a central coordinating institution, which is the current situation, the best solution is to find a trusted repository (perhaps even one’s university library) and, if possible, provide copies of data to other institutions. As already discussed, if at all possible, it is wisest to avoid going it alone. If you have not decided on a repository, you should follow the guidelines discussed in this working report. The worst solution is to store data in proprietary formats that lack publicly available file format specifications, since such files may not be readable in the future. Data may also be lost if the storage media are not periodically upgraded.
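One low-cost step while data await a repository is to keep a small documentation record beside each file. The record notes what the file contains and what format it is in, and holds a checksum against which copies at other institutions can be verified. The sketch below is a minimal illustration. The function names, the field names, the ".meta.json" suffix, and the use of SHA-256 are assumptions, not a standard endorsed in this report.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Checksum used later to confirm that a copy is identical to the original."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in chunks so large recordings do not have to fit in memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_sidecar(data_file: Path, description: str, file_format: str) -> Path:
    """Write <name>.meta.json next to the data file and return its path."""
    record = {
        "file": data_file.name,
        "description": description,  # what a future researcher needs to know
        "format": file_format,       # ideally an openly specified format
        "sha256": sha256_of(data_file),
    }
    sidecar = data_file.with_name(data_file.name + ".meta.json")
    sidecar.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return sidecar
```

An institution holding a copy can recompute sha256_of on its file and compare the result with the value in the record. A mismatch indicates that one of the copies has been altered or corrupted.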

Unresolved Issues

The two biggest areas in which the breakout groups did not arrive at a consensus were, first, copyright or, more broadly, the ownership claims and interests of professional researchers and, second, the type of metadata needed for searching across platforms. In the latter case, the metadata breakout group simply felt that the topic was too difficult to tackle within the short time of the workshop.

Regarding ownership claims and interests of professional researchers, there was more genuine disagreement over the degree to which unrestricted, anonymous access to research data should be allowed. Although all agreed on the importance of DPA, the two perspectives can be summarized as:

1. The library perspective—knowledge should be shared as widely as possible. Withholding data works against core scientific principles.

2. Concern over “free-riders”—field researchers and data collectors may suffer because of the significant amounts of time they spend to collect data. Others who “use” their data can publish faster. Any DPA efforts must seriously address credit, incentives for depositing data, and knowing who accessed the data.

The various arguments are summarized in the Copyright Working Group Report.

The copyright working group also discussed the ambiguity of copyright laws with regard to data, datasets, and metadata. For example, in the U.S. copyright does not apply to “facts” but rather to “expressions.” Certain forms of metadata, such as metadata describing the meaning, methods, and limitations of a dataset, would likely be covered by copyright. Other forms of metadata, particularly technical metadata (e.g., file formats, collection structures), would probably not be covered by copyright. Laws in other locales complicate the sharing of data. For instance, the EU has database protection laws that protect compilations of data. The desirability of some form of standardized licensing, such as Creative Commons, was mentioned.

Other questions that need to be pursued further are:

How do needs vary by subdiscipline? Disciplines vary in the ways they handle location, scale, temporal transgression, and representation in one, two, or three dimensions, not to mention in the kinds of data which are of primary interest. They also vary in the degree to which they have discussed and resolved ethical issues with regard to standards and access.

What is the proper role of universities in preserving and providing access to digital records? What are the current roles and the proper roles of individual researchers, academic departments, university libraries and university presses?

What are the cultural impediments to cyberinfrastructure development? How do we accommodate notions of ownership, senior grumpiness, lack of training, academic competition, fear of contradiction, and fear of preemption?

How do we treat sharing? Should prepublication sharing be encouraged or merely facilitated? Less controversially, how do we treat post-publication sharing? “We recommend that it becomes mandatory for scientific papers to explain where and how to access data and resources generated as part of the investigation. We are aware that some journals already have strong policy positions in this area, insisting that large data sets must be deposited in public databases, and that all reasonable requests for materials from other researchers must be fulfilled. There is however, heterogeneity with both policy and enforcement; surprisingly, many journals have no written policy on the availability of either bioresources or primary data.”9

How does replicability influence best practices? How do we accommodate differences between fields that advance by generating new databases to replicate research and fields that advance through the accumulation of shared data? Should even replicable data be preserved?

