Digital Preservation and JHOVE2

 

Next-Generation Characterization Workshop, 23-27 May 2011

 


JHOVE2 Tutorial (25-26 May 2011)

Purpose and intended outcome

 

The JHOVE2 tutorial provides an intensive introduction to the concepts, deployment, and use of the next-generation JHOVE2 characterization framework and application. Participants will be exposed to the following topics:

  • JHOVE2 concepts: source units, reportable properties, characterization strategy, assessment.
  • Demonstration of the JHOVE2 application.
  • Architectural review of the JHOVE2 framework and Java APIs.
  • Deployment and configuration. Integration of JHOVE2 technology into existing or planned systems, services, and workflows.
  • Third-party development of conformant JHOVE2 modules.
  • JHOVE2 open source community building, maintenance, development, and sustainability planning.

Digital preservation entails the proactive management of digital information over time to ensure its continuing usability. Because digital information must be mediated by technology to be useful, it is inherently fragile and at risk of obsolescence as that technology continually changes.
JHOVE2 is a Java framework and application for next-generation format-aware characterization of digital objects (http://jhove2.org/). Characterization is the process of deriving representation information about a formatted digital object that is indicative of its significant nature and useful for purposes of classification, analysis, and use. An effective and efficient means of characterization is a key component of any digital preservation program.
JHOVE2 supports four specific aspects of characterization:

  • Identification. The process of determining the presumptive format of a digital object on the basis of suggestive extrinsic hints and intrinsic signatures, both internal (e.g. magic number) and external (e.g. file extension).
  • Validation. The process of determining the level of conformance to the normative syntactic and semantic rules defined by the authoritative specification of the object's format.
  • Feature extraction. The process of reporting the intrinsic properties of a digital object significant for purposes of classification, analysis, and use.
  • Assessment. The process of determining the level of acceptability of a digital object for a specific purpose on the basis of locally defined policy rules.
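
To make two of these aspects concrete, the following is a minimal Java sketch of identification (internal signature first, file extension as a fallback) and assessment (a locally defined rule applied to extracted properties). It uses only the standard JDK. The class name, method names, the "pageCount" property, and the 500-page policy are all hypothetical illustrations, not the JHOVE2 API or its actual identification logic.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Locale;
import java.util.Map;
import java.util.function.Predicate;

// Illustrative only: every name here is hypothetical and not part of the JHOVE2 API.
class CharacterizationSketch {

    // Internal signatures (magic numbers) of two well-known formats.
    private static final byte[] PDF_MAGIC = {'%', 'P', 'D', 'F', '-'};
    private static final byte[] PNG_MAGIC = {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};

    /** Identification: prefer the internal signature, fall back to the file extension. */
    static String identify(String fileName, byte[] head) {
        if (startsWith(head, PDF_MAGIC)) return "PDF";
        if (startsWith(head, PNG_MAGIC)) return "PNG";
        String lower = fileName.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".pdf")) return "PDF (presumed from extension)";
        if (lower.endsWith(".png")) return "PNG (presumed from extension)";
        return "unknown";
    }

    /** True when the first bytes of data equal prefix. */
    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
                && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    /** Assessment: apply a locally defined policy rule to already-extracted properties. */
    static boolean assess(Map<String, Integer> properties,
                          Predicate<Map<String, Integer>> rule) {
        return rule.test(properties);
    }

    public static void main(String[] args) {
        // The extension is misleading here; the signature decides the format.
        byte[] head = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        String format = identify("report.dat", head);

        // Hypothetical local policy: accept documents of at most 500 pages.
        Map<String, Integer> properties = Map.of("pageCount", 12);
        boolean acceptable = assess(properties, p -> p.getOrDefault("pageCount", 0) <= 500);

        System.out.println(format + ", acceptable=" + acceptable);
    }
}
```

The identification result is deliberately labelled as presumed when only the extension matches, mirroring the distinction drawn above between internal and external evidence.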

The object of JHOVE2 characterization can be a file, a subset of a file, or an aggregation of an arbitrary number of files that collectively represent a single coherent digital object. JHOVE2 can automatically process objects that are arbitrarily nested in containers, such as file system directories or Zip files.
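
As a rough illustration of what processing arbitrarily nested containers involves, the sketch below walks directories and Zip files recursively using only the standard JDK and lists the leaf files it reaches. The class and method names, the "!/" separator in the labels, and the reliance on a ".zip" extension to detect containers are assumptions made for this example; they do not describe how JHOVE2 itself models or detects source units.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Illustrative only: hypothetical names, not the JHOVE2 API.
class NestedSourceWalker {

    /** Lists every leaf reachable from a path, descending into directories and Zip files. */
    static List<String> walk(Path path) {
        List<String> leaves = new ArrayList<>();
        try {
            walkPath(path, leaves);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return leaves;
    }

    /** Lists every leaf inside a Zip stream, descending into Zip files nested within it. */
    static List<String> walkZip(String label, InputStream in) {
        List<String> leaves = new ArrayList<>();
        try {
            walkZip(label, in, leaves);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return leaves;
    }

    private static void walkPath(Path path, List<String> leaves) throws IOException {
        if (Files.isDirectory(path)) {
            List<Path> children;
            try (Stream<Path> listing = Files.list(path)) {
                children = listing.sorted().collect(Collectors.toList());
            }
            for (Path child : children) {
                walkPath(child, leaves);
            }
        } else if (isZip(path.toString())) {
            try (InputStream in = Files.newInputStream(path)) {
                walkZip(path.toString(), in, leaves);
            }
        } else {
            leaves.add(path.toString());
        }
    }

    private static void walkZip(String label, InputStream in, List<String> leaves)
            throws IOException {
        // Deliberately not closed: closing would also close the enclosing stream.
        ZipInputStream zip = new ZipInputStream(in);
        for (ZipEntry entry = zip.getNextEntry(); entry != null; entry = zip.getNextEntry()) {
            if (entry.isDirectory()) continue;
            String name = label + "!/" + entry.getName();
            if (isZip(entry.getName())) {
                walkZip(name, zip, leaves); // the entry's bytes are themselves a Zip stream
            } else {
                leaves.add(name);
            }
        }
    }

    private static boolean isZip(String name) {
        return name.toLowerCase(Locale.ROOT).endsWith(".zip");
    }

    /** Demo helper: builds a one-entry Zip archive in memory. */
    static byte[] zipOf(String entryName, byte[] content) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry(entryName));
            zip.write(content);
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        // A Zip file that contains another Zip file that contains one image.
        byte[] inner = zipOf("page.tif", new byte[] {1, 2, 3});
        byte[] outer = zipOf("inner.zip", inner);
        System.out.println(walkZip("outer.zip", new ByteArrayInputStream(outer)));
    }
}
```

Detecting containers by extension keeps the sketch short; a characterization tool would more plausibly rely on the identification step to recognize a container before descending into it.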
The JHOVE2 project seeks to build on the success of the original JHOVE characterization tool (http://hul.harvard.edu/jhove) by addressing known limitations and offering significant new functions. These enhancements include:

  • Streamlined APIs incorporating increased modularization and uniform design patterns.
  • Object-focused, rather than file-focused, characterization, with support for arbitrarily nested container formats and formats instantiated across multiple files.
  • Signature-based identification using DROID (http://sourceforge.net/projects/droid).
  • Rules-based assessment to support determinations of object acceptability in addition to validation of format conformity.
  • Extensive user configuration of modules, characterization strategies, and formatted results.
  • Performance improvements using Java buffered I/O (java.nio).
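
For readers unfamiliar with java.nio, the following sketch shows the general style of buffer-oriented reading it enables: a file is read through a FileChannel into a reusable ByteBuffer and a CRC-32 checksum is computed over its contents. This is an assumption-laden illustration of the java.nio approach in general, not code taken from JHOVE2; the class name, method names, and buffer size are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.zip.CRC32;

// Illustrative only: hypothetical names, not the JHOVE2 API.
class NioChecksumSketch {

    /** Computes the CRC-32 of a file by reading it through a channel into one reusable buffer. */
    static long crc32(Path file, int bufferSize) {
        if (bufferSize <= 0) {
            throw new IllegalArgumentException("bufferSize must be positive");
        }
        CRC32 crc = new CRC32();
        ByteBuffer buffer = ByteBuffer.allocateDirect(bufferSize);
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            while (channel.read(buffer) != -1) {
                buffer.flip();      // switch the buffer from filling to draining
                crc.update(buffer); // consumes all remaining bytes in the buffer
                buffer.clear();     // make the whole buffer available for the next read
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return crc.getValue();
    }

    /** Demo helper: writes bytes to a temporary file that is deleted on JVM exit. */
    static Path writeTemp(byte[] data) {
        try {
            Path tmp = Files.createTempFile("nio-sketch", ".bin");
            tmp.toFile().deleteOnExit();
            Files.write(tmp, data);
            return tmp;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = "JHOVE2 characterization".getBytes(StandardCharsets.US_ASCII);
        Path tmp = writeTemp(data);
        System.out.println(Long.toHexString(crc32(tmp, 8192)));
    }
}
```

Reusing a single buffer avoids allocating a new array per read, which is one reason buffer-based I/O can help when many large files are processed.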

The JHOVE2 project is a collaborative undertaking of the California Digital Library, Portico, and Stanford University, with generous funding from the Library of Congress as part of its National Digital Information Infrastructure and Preservation Program (NDIIPP). JHOVE2 is made freely available under the terms of the BSD open source license.