University repositories: An extension of the library cooperative

As libraries are asked to create super archives, OCLC provides software and support

by Tom Storey


Located at the center of campus, the Ohio State University Main Library will be the virtual hub of the university with the OSU Knowledge Bank.

In 2001, a group of senior administrators at Ohio State University approached Joseph J. Branin about establishing a digital archive to advance distance education.

While leading a task force on continuing education and distance learning, the administrators discovered that digitized objects needed to support e-learning were part of a growing body of e-scholarship that should be collected, integrated, organized and preserved for university faculty and students.

Sounds like a job for the library, they concluded.

The OSU Knowledge Bank was born!

“What is most important about our story is that a group of senior administrators recognized the need to manage the University’s digital assets and acknowledged the library’s expertise and experience to lead the effort,” says Mr. Branin, Director of Libraries. “In essence, whether we work in library administration, collection management, reference or technical services, we are now taking on new roles as knowledge managers and creating an enterprise-wide knowledge management system for the university.”


“We will manage all types of information, not just the structured, published information we have traditionally been asked to collect, organize and preserve in the past.”

Over the next year, Mr. Branin led a planning committee with representatives from the offices of information technology and academic affairs. Lorcan Dempsey, Vice President, OCLC Research, also was a member, as was Michael Dennis, an official from the Chemical Abstracts Service. The committee studied other repository efforts, inventoried digital projects underway at the university, collected faculty suggestions and input and developed an action plan.

The OSU Knowledge Bank is gathering support and funding and hired its first project in July. Several pilot projects supporting this interdisciplinary, multimedia storehouse of knowledge capital will be underway by the end of August.

Ohio State University Knowledge Bank

Scenarios similar to this have been playing out at universities around the world as libraries seek to manage the explosive growth of e-content created by faculty and researchers. The idea behind repositories is to maximize the impact of research by gathering the intellectual output of a university into a searchable online collection and linking it to other repositories. The effort is part of a wider open archives initiative that promotes interoperability among computer systems.

To support library efforts, OCLC developed OAICat and OAIHarvester, two software applications that provide an open systems framework for repositories by supporting the Open Archives Initiative protocols for data storage and harvesting.

“This is an area where the OCLC membership believes we have an opportunity to bring considerable value to the library community through leadership, a platform for collaboration and a range of supporting services,” says Jeff Young, the Consulting Software Engineer in OCLC Research who developed the software. “WorldCat could conceivably be an access point to the e-content stored in archives.”

A number of universities are using the software in their efforts. Following are reports from five of them.

Duke University and the Sheet Music Consortium

In March 2003, Perkins Library at Duke University began using OAICat software from OCLC Research as part of its involvement in the Sheet Music Consortium (SMC). SMC is building a centralized repository of music scores from the collections of its six member libraries at Duke University, Brown University, Indiana University, Johns Hopkins University, UCLA and the Library of Congress. The OAI-compliant repository is hosted by UCLA and is a gateway to the collections at each library.

Duke uses the OCLC open source software to make descriptive metadata from 19,000 music score records in a local database harvestable by UCLA. Duke’s records are part of a special collection of 19th and early 20th century American sheet music. Many of the original sheets are digitized, allowing users direct access to the music as well as covers and advertisements that offer evidence of the cultural context in which the songs were published. Duke’s participation in SMC marks its first public use of the Open Archives Initiative–Protocol for Metadata Harvesting, or OAI-PMH, the architecture on which OAICat is based.


After evaluating several OAI software packages, Perkins Library selected OCLC’s OAICat because of its highly portable, Java Web application framework and because it is being used in a growing number of digital library applications, including MIT’s DSpace initiative. The Duke project team consists of Music Cataloger Lois Schultz, Information Systems Manager Jim Matthews, Research Content Development Head Paolo Mangiafico and Metadata Architect/Programmer Will Sexton.

Of particular importance to Mr. Sexton was the open source nature of the OCLC software.

“We found that the OAICat code did not process database queries in a way that was compatible with our sheet music database,” he says. “An e-mail discussion resulted in Jeff Young at OCLC altering the OAICat source code to accommodate a user-defined processing of data. I was then able to extend the code so that it could handle results from the Perkins database, and Jeff added the extension into the OAICat base distribution, making it available for all of the project’s users. The back-and-forth between Duke and OCLC demonstrates the promise of institutional cooperation on open-source software projects.”


OAICat at Virginia Tech


As a graduate student supervised by Dr. Edward A. Fox, who heads the Digital Library Research Laboratory (DLRL) at Virginia Tech, Kunal Garach has used OAICat to build data providers for the Virginia Tech ImageBase and HCI Bibliography (HCIBib) collections. Both data providers comply with the protocols of the Open Archives Initiative. He has seen the software package grow from its nascent stages into a very useful tool.

“OAICat supports many different implementations and requires minimal customization compared to most of the other tools available,” says Mr. Garach. “Its biggest advantages are its platform independence and scalability. Since it has been implemented in Java, all it requires to run is an application server, and its scalability is undoubtedly the best I have seen.”

Mr. Garach’s first experience with OAICat was in 2002. He and two other graduate students selected OAICat as their development platform for a class assignment to build a data provider for the Virginia Tech ImageBase collection, a digital library of images. OAICat stood out because it was the only tool implemented in Java (most others at the time were implemented in Perl) and because it supported the latest version of the protocol for metadata harvesting (PMH v2.0). As the first group to officially use OAICat, they helped resolve many bugs and “real world” issues, which made the software more flexible and robust.

Earlier this year, Mr. Garach used OAICat to build a data provider for the HCIBib collection, a searchable index of more than 20,000 bibliographic records about Human-Computer Interaction resources. This time he found that, in the true spirit of open source software development, OAICat had been extended with new functionality and now supported multiple implementations, one of which was a file system implementation that was exactly what he required.

“Building the data provider for the HCIBib collection was a breeze. OAICat required minimal intervention and customization from my end.”

“I have no doubts about the utility of OAICat because it is very easy to customize and a lot is to be gained from its simple structure and ease of use.”

DSpace at MIT

Introduced in November 2002, DSpace is a university repository system designed to capture, store, index, distribute and archive the massive amounts of intellectual output created in digital form by MIT faculty and researchers. A joint project of MIT Libraries and the Hewlett-Packard Company, the system provides a flexible, open source storage and retrieval architecture that can be adapted to a range of data formats and research disciplines. Each research community uses a customized portal that matches its practices to submit items into DSpace.

The goal is to organize and share via the Web the more than 10,000 pieces of scholarly digital content produced each year by MIT researchers, most of which is hidden from search engines and not included in reference databases. The content includes books, theses, articles, images, data sets, teaching material, multimedia publications, visualizations, simulations and other models. Since November, more than 2,000 items have been deposited into DSpace and the open source code has been downloaded by more than 3,500 organizations and individuals worldwide, over ten percent of which have contacted the DSpace team with interest in deploying the system for their institutions.

“In just a short time we have begun to see the fruits of the open source process with several institutions helping us debug and improve the system,” says MacKenzie Smith, Associate Director for Technology, MIT Libraries and DSpace Project Director. “This is sound progress, and we are all excited by the interest and goodwill towards DSpace that we’re seeing.”

DSpace supports the Open Archives Initiative’s Protocol for Metadata Harvesting. OAI support was implemented using OCLC’s OAICat open source software, which makes DSpace item records available for harvesting by other OAI-compliant harvesters.

“The availability of open source code like OCLC’s OAICat is vital to the ability of the library community to take advantage of new standards like OAI,” says Smith. “If each institution had to develop this code for itself our progress would be much slower and many of these standards and protocols would flounder. That’s the whole premise of the DSpace system—we’ll get much further, faster as a community working together.”

Adds Robert Tansley, Architect and Developer, HP Labs, “The OAICat software saved us a great deal of time and effort in making the DSpace software OAI-compliant. We were greatly impressed with the responsiveness and helpfulness of Jeff Young, the developer at OCLC.”

The use of OCLC OAICat software at Université Laval Library

The Université Laval Library, Quebec, Canada, is using OAICat to build a preprint repository and an electronic theses and dissertations (ETD) collection.

The preprint system is scheduled to debut later this year and will enable campus research centers and academic departments to control which documents they index and publish, and at what level—locally or on the Internet. It also will allow researchers to enter and modify metadata for their documents.

The system will use OAICat linked with MySQL, an open source relational database management system that is queried with SQL, a standardized language for requesting information from a database. With the help of Jeff Young, OCLC Consulting Software Engineer, Université developers modified OAICat so that the software could manage multiple entries in each element of the Dublin Core metadata format.
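To illustrate what multiple entries in a Dublin Core element look like on the wire, here is a minimal Python sketch. It is not OAICat code (OAICat is written in Java), and the function name `to_oai_dc` and the sample record are hypothetical. It assumes the standard `oai_dc` container and the Dublin Core element namespace, and simply emits one XML element per value so that a document with two authors carries two `dc:creator` entries.

```python
import xml.etree.ElementTree as ET

# Namespaces used by the unqualified Dublin Core format in OAI-PMH.
OAI_DC = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("oai_dc", OAI_DC)
ET.register_namespace("dc", DC)


def to_oai_dc(record):
    """Serialize a dict of {element name: list of values} as oai_dc XML.

    Hypothetical helper: every value in a list becomes its own repeated
    Dublin Core element, in the order given.
    """
    root = ET.Element("{%s}dc" % OAI_DC)
    for element, values in record.items():
        for value in values:
            child = ET.SubElement(root, "{%s}%s" % (DC, element))
            child.text = value
    return ET.tostring(root, encoding="unicode")


# Illustrative record with two creators and two subjects (made-up values).
sample = {
    "title": ["A sample preprint"],
    "creator": ["First Author", "Second Author"],
    "subject": ["Digital libraries", "Metadata"],
}
xml_text = to_oai_dc(sample)
```

Keeping each value in its own element, rather than joining values into one delimited string, lets a harvester index every author or subject separately.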

Launched in November 2002, the ETD system supports electronic submission and dissemination of theses and dissertations published at the Université. To date, the collection contains approximately 30 documents, most of which have been entered since March.

For both projects, a librarian and a computer analyst are involved. Pierre Lasou, Librarian, and Reda Benjelloun, Digital Project Coordinator, are directing the ETD project. Reda Benjelloun and Nicolas Bélisle, Computer Analyst, are leading the preprint project.

“To help us control costs and improve the accessibility of Université resources, we want to conform to Open Archives protocols and use existing tools rather than develop our own architecture,” says Pierre Lasou, Librarian. “Our electronic thesis and dissertation collection, for example, is now being harvested by two service providers, one of which is OCLC for the Networked Digital Library of Electronic Theses and Dissertations Union Catalog, which gives wider visibility to our collection.

“For now, we have two implementations of OAICat, but we plan to integrate them into a single repository for better management.”

Laval Library chose OAICat for three reasons, according to Mr. Lasou. First, they believe that the involvement of OCLC guarantees the integrity and quality of the product. Second, OAICat was developed as an open source project and is written in Java. This technology, Mr. Lasou says, ensures the viability of the software and its scalability by making it customizable. And third, the support associated with the software is efficient, quick and relevant.

The Archaeology Data Service at the University of York

The Archaeology Data Service (ADS) at the University of York has experimented with OAI repositories and harvesters to integrate important research data on archaeological exploration and fieldwork in ancient coinage systems.

“There is potential value of an open systems environment that allows heritage agencies to share data with various categories of user,” says William Kilbride, User Services Manager. “Information about archaeological sites, monuments and objects is held by many different organizations, such as museums, county councils, national agencies, universities and archives.

“These organizations share an enthusiasm for online dissemination of data that often is offset by the issues of bringing diverse data sets together, such as metadata standards, communications protocols and semantic interoperability. For some time now, ADS and partners have been looking at how to overcome these issues.”

In one test, Keith Westcott, Curatorial Officer, ran into some difficulties getting the OCLC OAICat program to work with the Oracle database software used by ADS. Different protocols and labels for date formats and field identifiers needed to be resolved. He worked with OCLC’s Jeff Young to sort out and revise the code.

In another project, Dr. Westcott encountered some headaches in harvesting and transferring data. The targeted repository failed to recognize ‘from’ and ‘until’ dates and therefore did not support selective harvesting. To get any metadata, ADS had to harvest it all: more than 50,000 records, returned in batches of 500. A small Java program was then written to store the data in a database.
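The harvesting pattern described above can be sketched in a few lines of Python. This is not the ADS program (which was written in Java) and not OAIHarvester; the function names and the example base URL are hypothetical. It assumes the OAI-PMH `ListRecords` verb, the optional `from` and `until` arguments used for selective harvesting, and the `resumptionToken` that a repository returns when it delivers a large result set in batches.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def list_records_url(base_url, metadata_prefix="oai_dc",
                     from_date=None, until_date=None, token=None):
    """Build an OAI-PMH ListRecords request URL (hypothetical helper)."""
    if token:
        # A resumptionToken is an exclusive argument: only the verb goes with it.
        params = {"verb": "ListRecords", "resumptionToken": token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if from_date:
            params["from"] = from_date    # selective harvesting: lower bound
        if until_date:
            params["until"] = until_date  # selective harvesting: upper bound
    return base_url + "?" + urlencode(params)


def harvest(fetch, base_url, **kwargs):
    """Yield record identifiers, following resumption tokens batch by batch.

    `fetch` is any callable that takes a URL and returns the response XML
    as a string, so the loop can be exercised without a network connection.
    """
    token = None
    while True:
        url = list_records_url(base_url, token=token, **kwargs)
        root = ET.fromstring(fetch(url))
        for header in root.iter(OAI_NS + "header"):
            yield header.findtext(OAI_NS + "identifier")
        # An empty or missing resumptionToken marks the final batch.
        token = (root.findtext(".//" + OAI_NS + "resumptionToken") or "").strip()
        if not token:
            break
```

Against a live repository, `fetch` could wrap `urllib.request.urlopen`. A harvester facing a repository that ignores `from` and `until`, as ADS did, would omit the dates and let the token loop walk through every batch.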

Nonetheless, Dr. Westcott and Dr. Kilbride consider the trials a success and are optimistic about the future of open source software.

“The Open Archives Initiative supports custom metadata standards but the data provider and harvester simply need to agree on their schema,” Dr. Kilbride says. “Yet, if full advantage is to be taken of OAI, it would be better for a wider agreement, and deployment, of a richer metadata schema specifically for the cultural heritage sector.”

ADS collects, describes, catalogs, preserves and provides user support for digital resources created from archaeological research. It is working with national and local archaeological agencies and research councils to build and host an online catalog and research archive of archaeological data, such as text reports, digitized maps, aerial and site photographs and images of excavated artifacts.