Wednesday, May 18, 2022

OAI-PMH / OAI-ORE Harvester


OAI-PMH and OAI-ORE are standards for describing and exchanging metadata and digital objects between archives. DSpace supports both, which means it can import metadata and digital objects (e.g. text, images, data, and video) from other repositories. A DSpace administrator can import metadata from an e-journal, e-book, or institutional repository (e.g. arxiv.org, doabooks.org, doaj.org). Harvesting metadata and digital objects from an external source enriches the institutional repository running on DSpace and improves the user experience. The following steps show how to harvest content from an external source.

1. Verify that the external source allows OAI and ORE harvesting. 

External sources that allow OAI/ORE harvesting usually publish guidelines. Check the website for instructions. For example, the harvesting guidelines of the Directory of Open Access Books are available at the following link: https://www.doabooks.org/en/doab/metadata-harvesting-and-content-dissemination

Another way to verify whether harvesting is available is to check the OAI-PMH interface of the external source. The format of the URL is,

http://[full-URL-to-OAI-PMH]/request?verb=ListRecords&metadataPrefix=ore 

Replace the bracketed placeholder with the OAI-PMH address of the external source and paste the complete URL into the browser address bar. For example, here is the OAI-PMH URL of the Directory of Open Access Books,

https://directory.doabooks.org/oai/request?verb=ListRecords&metadataPrefix=ore
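The same check can be scripted. Below is a minimal Python sketch that builds OAI-PMH request URLs from a base address; the helper name oai_request_url is my own and not part of DSpace. ListMetadataFormats is a standard OAI-PMH verb, and a repository that supports ORE would normally be expected to list an ore prefix in its response.

```python
from urllib.parse import urlencode


def oai_request_url(base_url, verb, **params):
    """Return base_url with an OAI-PMH verb and optional arguments appended.

    Hypothetical helper for illustration only.
    """
    query = {"verb": verb}
    query.update(params)
    return base_url.rstrip("?") + "?" + urlencode(query)


# DOAB endpoint taken from the example in this post.
DOAB = "https://directory.doabooks.org/oai/request"

# Ask which metadata formats the repository offers.
formats_url = oai_request_url(DOAB, "ListMetadataFormats")

# Ask for records in the ORE format, as in the URL shown in this post.
records_url = oai_request_url(DOAB, "ListRecords", metadataPrefix="ore")

print(formats_url)
print(records_url)
```

Open either printed URL in the browser. An XML response with records indicates that harvesting is available; an OAI-PMH error element suggests the format or verb is not supported.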

2. OAI-PMH / OAI-ORE Harvester Configuration

The configuration file is located at /dspace/config/modules/oai.cfg. Open the oai.cfg file using any text editor. I am using Mousepad, the default text editor in Xubuntu. If you are working with another Debian- or Ubuntu-based OS, you can install it with the following command,

sudo apt install mousepad

Open the oai.cfg file, 

sudo mousepad /dspace/config/modules/oai.cfg

and uncomment the following line (remove the # symbol at the start of the line),

oai.ore.authoritative.source = oai

Save and close the file.
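If you prefer to script this edit, here is a minimal Python sketch of the same change applied to a piece of text. The function name uncomment_property and the sample excerpt are assumptions made for illustration; the excerpt is not the real content of oai.cfg.

```python
import re


def uncomment_property(config_text, key):
    """Remove the leading '#' from the line that sets `key`, if it is commented out.

    Hypothetical helper; lines that do not set `key` are left untouched.
    """
    pattern = re.compile(
        r"^[ \t]*#[ \t]*(" + re.escape(key) + r"[ \t]*=.*)$",
        re.MULTILINE,
    )
    return pattern.sub(r"\1", config_text)


# Made-up excerpt for demonstration, not the full oai.cfg.
sample = (
    "# Hypothetical excerpt\n"
    "# oai.ore.authoritative.source = oai\n"
    "some.other.property = value\n"
)

updated = uncomment_property(sample, "oai.ore.authoritative.source")
print(updated)
```

To apply it to the real file, read /dspace/config/modules/oai.cfg, pass its text through the function, and write the result back. Writing to that path will most likely need root privileges, and keeping a backup copy first is a sensible precaution.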

3. Harvest content from the user interface

Here I am going to show you how to harvest metadata from https://arxiv.org. arXiv is an Open Access archive for scholarly articles in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics.

Create a collection to hold the harvested content. I have created a community named Open Access Resources and, inside it, a collection named arXiv.


1. Edit the collection and open the Content Source tab.

2. Check the option, This collection harvests its content from an external source.

3. Find the OAI URI of arXiv on the page https://arxiv.org/help/oa and enter it. The URI is http://export.arxiv.org/oai2.

4. Give a set ID for selective harvesting; e.g. use the ID physics to harvest only the Physics set. To see all the sets available at arXiv, visit http://export.arxiv.org/oai2?verb=ListSets.

5. Select the default metadata format (Simple Dublin Core).

6. arXiv supports only OAI, so select Harvest metadata only. Downloading metadata together with bitstreams (images, text, documents) is possible only from repositories that support ORE.

7. Save the configuration.

8. Click the Import now button to start harvesting.

Visit the collection after importing the metadata.

Click on the article title to see the detailed view.
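The set ID step above relies on the ListSets response, which is plain XML and easy to read with a script. Here is a minimal Python sketch that extracts the set IDs (setSpec) and their names from such a response. The function name parse_sets is my own, and the sample XML is hand-written to mimic the structure of an OAI-PMH reply; it is not copied from arXiv.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the OAI-PMH 2.0 specification.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def parse_sets(xml_text):
    """Return a list of (setSpec, setName) pairs from a ListSets response.

    Hypothetical helper for illustration only.
    """
    root = ET.fromstring(xml_text)
    pairs = []
    for node in root.findall(".//oai:set", OAI_NS):
        spec = node.findtext("oai:setSpec", default="", namespaces=OAI_NS)
        name = node.findtext("oai:setName", default="", namespaces=OAI_NS)
        pairs.append((spec, name))
    return pairs


# Hand-written sample that mimics a ListSets reply (not real arXiv output).
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListSets>
    <set><setSpec>physics</setSpec><setName>Physics</setName></set>
    <set><setSpec>math</setSpec><setName>Mathematics</setName></set>
  </ListSets>
</OAI-PMH>"""

for spec, name in parse_sets(SAMPLE):
    print(spec, "-", name)
```

The first value of each pair is what goes into the set ID field of the Content Source tab. To use it on a live repository, save the ListSets page from the browser and pass its text to parse_sets.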

Reference

https://wiki.lyrasis.org/display/DSDOC7x/OAI
