DSpace
In the year 2000, Hewlett-Packard teamed up with MIT's library to develop a system to meet the latter's need for an institutional repository. The result was called DSpace, for “Document space.” When the repository was launched in 2002, the software was simultaneously released as open source for others to use.
Chapter 3 illustrated DSpace from the end-user's perspective for searching and browsing. We return to one of the documents viewed there, Kofi Annan's Master's thesis, and consider the steps required to enter it into the system.
Figure 7.6 shows snapshots from the document submission process. To have reached this point, the digital librarian (or end-user, since institutional repositories often allow registered users to submit documents) has first had to authenticate herself using a facility built into DSpace. Stages in the upload sequence are displayed along the top of each screenshot. The first three stages, labeled “describe,” gather metadata from the user.
Figure 7.6a shows the second step, soliciting the title, author, and type of document (article, book, thesis, etc.).
Next, the librarian selects the file to upload. In
Figure 7.6b she has pressed the browse button and located the file—in this case a PDF document—in a pop-up window. Pressing Next produces
Figure 7.6c (after an uploading delay), where she can check the uploaded data—for example, its file size (roughly 6 MB in this case) and type. Clicking on the hyperlinked filename shows the uploaded version; through this page the librarian can also rectify errors, such as uploading the wrong file or its type being incorrectly assigned. The
Show checksums button below computes the MD5 checksum of the uploaded file, which gives the librarian an opportunity to verify the integrity of the bytes that have been transferred. To do this, she computes the checksum of the original file on her own computer and checks that it matches the value displayed in DSpace.
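The local half of this check can be sketched in a few lines of Python. This is only an illustration of the idea, not part of DSpace: the function names are hypothetical, and the displayed checksum is assumed to have been copied by hand from the DSpace page.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of the file at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so that a large file never has to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_repository(path, displayed_checksum):
    """Compare the local digest with the value copied from the repository page.

    Whitespace and letter case are normalized, since a copied value may
    carry stray spaces or uppercase hexadecimal digits.
    """
    return md5_of_file(path) == displayed_checksum.strip().lower()
```

If `matches_repository` returns False, the bytes held by the repository differ from the original file and the upload should be repeated.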
Figure 7.6d, the next page, shows a summary of the information entered (and derived). Each section has a button that allows the librarian to edit the information. An additional step (not shown) displays the license that is to be used and requires the librarian to confirm that permission is granted, whereupon the item is finally submitted to DSpace and (ultimately) can be viewed as illustrated in
Chapter 3 (
Figure 3.22h). The exact procedure depends on how DSpace is configured. Often there is a further round of checking and quality control before the item is published in the repository. If the librarian leaves the submission sequence partway through, the information she has entered so far is stored under her DSpace username, and the next time she logs in she can pick up the process where she left off.
The metadata values entered in DSpace are matched to Dublin Core elements, and the software includes an OAI-PMH server. Because it's written in Java, it is Unicode compliant, and the interface has been translated into different languages. DSpace is servlet-based and makes extensive use of JavaServer Pages (JSP). Its developers recommend that it be used in conjunction with the Tomcat Web server, using PostgreSQL or Oracle as the database management system.
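To give a feel for what the OAI-PMH server offers a harvester, the following sketch builds a request URL. The six verbs and the oai_dc metadata prefix come from the OAI-PMH specification; the base URL is a made-up placeholder, and the helper function is hypothetical rather than anything shipped with DSpace.

```python
from urllib.parse import urlencode

# The six request verbs defined by OAI-PMH version 2.0.
OAI_VERBS = {
    "Identify",
    "ListMetadataFormats",
    "ListSets",
    "ListIdentifiers",
    "ListRecords",
    "GetRecord",
}


def oai_request_url(base_url, verb, **arguments):
    """Build an OAI-PMH GET request URL for the given verb and arguments."""
    if verb not in OAI_VERBS:
        raise ValueError("unknown OAI-PMH verb: %r" % verb)
    query = {"verb": verb}
    query.update(arguments)
    return base_url + "?" + urlencode(query)


# Hypothetical repository address, used for illustration only.
url = oai_request_url(
    "https://repository.example.org/oai/request",
    "ListRecords",
    metadataPrefix="oai_dc",
)
```

A harvester would fetch this URL and receive an XML response containing the repository's records expressed as unqualified Dublin Core.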
DSpace treats documents as black boxes to which metadata is attached. It is therefore important to identify each document's MIME type correctly, because this dictates which helper application a Web browser will use to view it when it is opened. However, DSpace does recognize Word, PDF, HTML, and plain text documents and can be configured to support full-text indexing using the open source indexing package Lucene. Documents can be submitted by librarians or registered users individually over the Web or may be ingested as a batch by command-line scripts that run on the DSpace server.
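The importance of the MIME type can be illustrated with Python's standard mimetypes module, which guesses a type from the filename extension alone. This is a sketch of the general idea and not a description of how DSpace itself assigns formats; the function name and the fallback value are assumptions made for the example.

```python
import mimetypes


def guess_mime_type(filename, default="application/octet-stream"):
    """Guess a MIME type from the filename extension.

    Falls back to a generic binary type when the extension is not
    recognized, which is the case in which a browser would not know
    which helper application to launch.
    """
    mime_type, _encoding = mimetypes.guess_type(filename)
    return mime_type or default
```

Because the guess relies only on the extension, a PDF file that has been misnamed would be misidentified, which is one reason the submission pages let the librarian correct the assigned type.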
Fedora
Fedora (Flexible Extensible Digital Object Repository Architecture) began in 1997 at Cornell University as a conceptual design backed up by a reference implementation. In 1999, the University of Virginia library used this reference implementation as a foundation for developing a digital library tailored to their needs. The two groups (note, incidentally, another pairing of a technology research group with a large-scale library, as for DSpace) ended up working together to produce an open source digital library toolkit for others to use.
Fedora is based on a powerful digital object model, encapsulated as a repository that is extremely flexible and configurable. Moreover, a repository stores all kinds of objects, not just the documents placed in it for presentation to the end-user (we shall see examples shortly). In contrast to DSpace, Fedora is not ready to use out of the box, a deliberate design decision by its developers that means Fedora can be used to develop a wide variety of digital libraries with radically different functionality. However, there is a price to be paid: you cannot use Fedora without programming support by IT staff.
Figure 7.7 shows the default Web interface for a repository that contains a collection of historic maps. This interface is not intended for end-users, but rather to indicate the possibilities and to allow developers to explore the system's capabilities. In
Figure 7.7a, the user has called up the search page and sought documents that include the word
Africa (searching across all indexed fields). The array of check-boxes determines which fields are displayed in the result list, in this case the document's identifier (persistent identifier, or pid) and title. In
Figure 7.7b, the user has clicked on the third item,
Africae tabula nova, and is viewing its associated information.
A
datastream in Fedora is MIME-encoded data associated with an object: for instance, a source file, such as an image or PDF file, or XML data.
Disseminators are methods that act upon an object, and they are linked to Web Services that provide dynamic capabilities that access the object's datastreams.
Figure 7.7b, called the Item Index view, is the result of running the default disseminator, which lists the object's datastreams. Datastream names are hyperlinked, and clicking them serves up the underlying MIME-encoded data source. For instance, clicking DC yields an XML datastream representing the object's Dublin Core metadata. In this example we would see that <dc:subject> is set to
Africa, which explains why the item was found despite the fact that its title (
Africae) is a Latin derivative of the stem (
Africa). Clicking on the
url datastream brings up
Figure 7.7c, which displays a high-resolution version of the digitized image.
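The reason the query matched can be made concrete with a small sketch that parses a Dublin Core datastream and reports which fields contain the search word. The XML below is a hand-written stand-in for the real datastream, with only the title and subject taken from the text; element names are written in lowercase, as they are in the Dublin Core namespace. The whole-word, case-insensitive matching rule is an assumption, not a description of Fedora's search implementation.

```python
import xml.etree.ElementTree as ET

# Illustrative stand-in for the object's DC datastream.
SAMPLE_DC = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                  xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Africae tabula nova</dc:title>
  <dc:subject>Africa</dc:subject>
</oai_dc:dc>"""


def fields_matching(xml_text, term):
    """Return the names of the metadata fields that contain `term` as a word."""
    root = ET.fromstring(xml_text)
    wanted = term.lower()
    hits = []
    for node in root:
        words = (node.text or "").lower().split()
        if wanted in words:
            # Strip the "{namespace}" prefix that ElementTree adds to tags.
            hits.append(node.tag.split("}", 1)[1])
    return hits
```

Under these assumptions, a search for Africa matches the subject field but not the title, whose first word is Africae.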
Figure 7.7d shows the disseminators associated with the object. This image is associated with the content model UVA_STD_IMAGE, which is why these disseminators appear. This content model, provided by Fedora, supports a wide range of image manipulation operations—resizing, zooming, brightening, watermarking, grayscale conversion, cropping, and so forth—through the RELS-EXT datastream (listed in
Figure 7.7b). The default Web interface lists the disseminators along with input components (text boxes, radio buttons, and the like) that match the disseminators’ parameters. To crop the image, for example, the user enters the
x,
y,
width, and
height parameters and presses the
run button to generate a cropped excerpt.
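What the four crop parameters mean can be sketched on a tiny image held as a list of pixel rows. This is a generic illustration, not Fedora's implementation: it assumes that the origin is the top-left corner and that x and y locate the top-left corner of the excerpt, conventions the actual disseminator may or may not share.

```python
def crop(pixels, x, y, width, height):
    """Return the width-by-height excerpt whose top-left corner is (x, y).

    `pixels` is a list of rows, each row a list of pixel values.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    rows = len(pixels)
    cols = len(pixels[0]) if rows else 0
    if x < 0 or y < 0 or x + width > cols or y + height > rows:
        raise ValueError("crop rectangle lies outside the image")
    # Select the rows first, then slice the requested columns from each.
    return [row[x:x + width] for row in pixels[y:y + height]]
```

Rejecting out-of-range rectangles, rather than silently clipping them, is a design choice made for this sketch.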
Ingesting, modifying, and presenting information from this rich repository of objects is accomplished through a set of Web Services. The services fall into two core groups, one for access and the other for management, and both RESTful and full SOAP-based versions of the services (
Section 7.4) are provided. A further protocol exists for expressing relationships between objects in terms of the Resource Description Framework (RDF,
Section 6.4). These protocols have allowed a variety of digital libraries to be developed, including Pergamos (
Section 1.3) and the U.S. National Science Digital Library mentioned in
Section 7.2, which contains over 4 million items.
The capabilities provided by these protocols can be investigated using the Fedora Administration application illustrated in
Figure 7.7e. The Administration application is another aspect of Fedora that is targeted toward IT developers rather than end-users or digital librarians. After they log in, users are presented with a blank virtual desktop upon which windows appear as they interact. For example, new objects can be created and existing ones can be edited or purged from the repository. The repository can be searched, and items can be selected and manipulated. Sets of files can be ingested or modified; they are streamed to the Fedora server using the protocol. Files must be in one of three formats: FOXML, a native Fedora XML format; Fedora METS, a METS extension defined by the project; or the Atom Syndication Format, an XML-based format used for Web feeds. Access control is expressed in the Extensible Access Control Markup Language (XACML). The Administration application also gives access to consoles that connect directly with Fedora's management and access facilities.
In
Figure 7.7e the user has repeated the query for
Africa: the result list is displayed near the bottom. Double-clicking the first item produces the partially obscured window at the top left, through which datastreams and disseminators can be viewed and altered. Here the DC datastream is selected, which presents XML metadata that can be edited directly within the text box or using an external XML editor. Exploring the datastreams for this object, the user views the RELS-EXT datastream (not shown) and finds that the associated content model is UVA_STD_IMAGE, which means that disseminators for cropping, watermarking, and so on are available for this image as well.
Fedora's Content Model Architecture is a way of establishing properties that are shared by a set of digital objects. It is built upon Service Definitions and Service Deployments, which, like the content model itself, are merely additional objects stored in the repository. The former provide abstract methods; the latter, concrete implementations. For example, a Service Definition tailored for images might provide a method for generating thumbnails at runtime suitable for previewing in a Web browser. Further definitions would be provided to handle different image types: perhaps one for JPEG, GIF, and PNG formats (since Java provides standard support for these) and a separate one for JPEG 2000 files, which require additional library support and may not mesh well with the built-in ones. Since the mechanism is accessed through Web Services, it might be implemented in an entirely different programming language. Regardless of which mechanism is used, the fact that images can have associated thumbnails can be included in the content model and shared by a variety of source documents.
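The split between abstract methods and concrete implementations resembles an abstract base class with several subclasses. The sketch below is only an analogy in Python: in Fedora the definitions and deployments are repository objects reached through Web Services, and every class and function name here is hypothetical.

```python
from abc import ABC, abstractmethod


class ThumbnailDefinition(ABC):
    """Plays the role of a Service Definition: a method with no implementation."""

    @abstractmethod
    def get_thumbnail(self, data):
        """Return a description of the thumbnail made from the image bytes."""


class BuiltinImageDeployment(ThumbnailDefinition):
    """Plays the role of a Service Deployment for commonly supported formats."""

    mime_types = ("image/jpeg", "image/gif", "image/png")

    def get_thumbnail(self, data):
        return "thumbnail via built-in codecs (%d bytes in)" % len(data)


class Jpeg2000Deployment(ThumbnailDefinition):
    """A separate deployment for a format that needs an extra library."""

    mime_types = ("image/jp2",)

    def get_thumbnail(self, data):
        return "thumbnail via JPEG 2000 library (%d bytes in)" % len(data)


DEPLOYMENTS = [BuiltinImageDeployment(), Jpeg2000Deployment()]


def deployment_for(mime_type):
    """Pick the concrete implementation that handles the given MIME type."""
    for deployment in DEPLOYMENTS:
        if mime_type in deployment.mime_types:
            return deployment
    raise LookupError("no deployment handles %s" % mime_type)
```

A caller needs to know only that `get_thumbnail` exists; which implementation runs depends on the type of the source image, which is the property the content model captures.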
Seeking to understand more about the content model associated with the map, our user calls up the UVA_STD_IMAGE content model object, displayed on the right of
Figure 7.7e. The RELS-EXT datastream in this object shows the RDF statements that are currently assigned, such as the
hasService predicate. Through the interface, additional statements can be added or existing ones can be edited or deleted. Alternatively, the displayed identifier information can be used as the source of new retrieval requests to continue the user's exploration of the repository.
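The RDF statements in a RELS-EXT datastream can be read as subject, predicate, object triples. The sketch below extracts them from a small hand-written RDF/XML fragment. The object identifiers are invented for illustration, and the Fedora model namespace URI is an assumption that should be checked against the Fedora documentation; only the hasService predicate comes from the text.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Hand-written fragment; the identifiers are hypothetical.
SAMPLE_RELS_EXT = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:fedora-model="info:fedora/fedora-system:def/model#">
  <rdf:Description rdf:about="info:fedora/example:UVA_STD_IMAGE">
    <fedora-model:hasService rdf:resource="info:fedora/example:imageServiceDefinition"/>
  </rdf:Description>
</rdf:RDF>"""


def statements(rdf_xml):
    """Return the (subject, predicate, object) triples found in the RDF/XML."""
    root = ET.fromstring(rdf_xml)
    triples = []
    for description in root.findall("{%s}Description" % RDF_NS):
        subject = description.get("{%s}about" % RDF_NS)
        for prop in description:
            # Drop the "{namespace}" prefix to leave the bare predicate name.
            predicate = prop.tag.split("}", 1)[1]
            # The object is either a resource reference or literal text.
            obj = prop.get("{%s}resource" % RDF_NS) or prop.text
            triples.append((subject, predicate, obj))
    return triples
```

The object of each triple is itself an identifier, which is what allows it to serve as the starting point for a further retrieval request.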
Like DSpace, Fedora is written in Java and makes use of servlets. It ships with the McKoi relational database, a lightweight Java implementation, but can be configured to use other database management systems. Full-text indexing is achieved through a third-party package (GSearch) into which the Lucene, Solr (built on top of Lucene), or Zebra indexing packages can be plugged.
Because Fedora is not a turnkey software solution, additional development is necessary to shape it into a usable end product. However, following the open source philosophy, such modifications can be packaged up and made available to others in the form of ready-to-run software tools. Fez is a Web interface to Fedora that provides an institutional repository. It uses PHP and MySQL and includes notes for users migrating from DSpace or other repository systems. Muradora provides a Web front-end to a Fedora repository that re-factors authentication and authorization into pluggable middleware components. It provides a Shibboleth authentication module (
Section 7.5) and also extends the above-mentioned Extensible Access Control Markup Language; in addition, it has rich user-oriented search and browse capabilities. For many of these systems, installation is quite complex. For example, in addition to a Fedora installation, Muradora requires software support for LDAP, DB XML, and XForms.