JournalTOCs Blog

News and Opinions about current awareness on new research

Measuring the usefulness and effectiveness of the API: A retrospective view of prototyping the use cases

The project identified two use cases in the context of helping Institutional Repository (IR) managers ensure that their content is complete and up to date. The first Use Case addressed the need for IR managers to gather articles for the IR as they are published. The second Use Case addressed the need for IR managers to be alerted when deposited “submitted” articles have been published in scholarly journals. The project developed and prototyped a lightweight RESTful API to solve or alleviate both cases by making use of content that is already freely available, namely journal TOC RSS feeds.

The first Use Case was tested using information provided by the British Geological Survey repository NORA (NERC Open Research Archive) and by the University of Warwick repository WRAP (Warwick Research Archives Project). In the case of the WRAP repository, only data from the Department of History was used. The methodology used for testing this use case was presented in the project workshop and made available in the JournalTOCs Workshop: Presentation 3 – Testing the First Use Case blog post. Basically, the methodology involves two kinds of search: one “batch” search and one set of “search by keywords” queries (the keywords are terms extracted from the institution name). The batch process, which combines searches by author, institution and subject, needs to be configured in advance and is run offline. The search by keywords is done online and doesn’t require any previous configuration. The analysis of the results shows that only 28% of the articles returned by the batch search were positive results (articles that were really authored by researchers from the institution). On the other hand, 52% of the results produced by the best combinations of terms used by the search by keyword approach were positive results. (Interestingly, for the NORA case, the extra effort of running a batch process identified only two more authors than the quick search by keyword.)

From the results obtained for the first Use Case, we consider searching by keywords to be the most suitable option, despite it producing only about 50% positive results on average. The “batch” search does not justify the effort it demands from the IR manager and the API developer. It requires a separate setup for each repository. This setup is time consuming for the IR manager because she needs to identify the authors and the subjects that are relevant to her IR. Some IR managers have told us that they may not even be able to get a list of authors for their own institutions. However, the main reason why the “batch” approach, and in general any search by author, fails is that the API is unable to unambiguously identify authors and their affiliations from the TOC RSS feeds. This is a problem beyond JournalTOCs’ capabilities. Our project has only confirmed the emerging need for a means of uniquely and reliably identifying authors. We believe that the correct identification of authors will enhance the effectiveness of our API and, in general, enable proper discovery and reuse of research output. It is encouraging to know that the extremely difficult task of correctly associating research outputs with their legitimate authors is being carried out by the Names Project at the national level.

Based on this evidence, it is not worth running a “batch” search based on authors’ names. (The problem could also be alleviated if publishers implemented the ticTOCs recommendations and included authors’ affiliations in their journal TOC RSS feeds.) The outputs obtained from this Use Case suggest that integrating the API results directly into the repository workflow will not be possible until authors can be identified unambiguously. What the IR manager can do is use the API to set up an RSS feed tailored for her institution and based on searching by keywords taken from the institution name. In this way the API would alert the IR manager when new articles that include her institution’s name, or similar names, are published online.
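As an illustration of the “search by keywords” idea, the sketch below derives keywords from an institution name and keeps only the feed items that mention all of them. It is a minimal sketch and not the JournalTOCs implementation: the function names, the list of generic words to drop, and the item fields (`title`, `description`) are all assumptions made for this example.

```python
import re

# Assumption: generic words that add little signal to an institution name.
# The real API may use a different list, or none at all.
GENERIC_WORDS = {"the", "of", "and", "for", "university", "institute", "college"}


def institution_keywords(name):
    """Return the distinctive lower-cased words of an institution name."""
    words = re.findall(r"[a-z]+", name.lower())
    return [w for w in words if w not in GENERIC_WORDS]


def matches_institution(text, keywords):
    """True if every keyword occurs as a whole word in the given text."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return all(k in tokens for k in keywords)


def filter_items(items, institution_name):
    """Keep the feed items whose title or description mentions the institution.

    `items` is a list of dicts with a 'title' key and an optional
    'description' key (hypothetical field names for this sketch).
    """
    keywords = institution_keywords(institution_name)
    if not keywords:
        return []
    return [
        item for item in items
        if matches_institution(
            item["title"] + " " + item.get("description", ""), keywords
        )
    ]


if __name__ == "__main__":
    # Made-up feed items, for illustration only.
    sample = [
        {"title": "Groundwater mapping",
         "description": "A. Author, British Geological Survey, Keyworth"},
        {"title": "A survey of British birds", "description": "B. Author"},
    ]
    for item in filter_items(sample, "British Geological Survey"):
        print(item["title"])
```

A filter like this cannot tell an affiliation from a passing mention of the same words, which is consistent with the roughly 50% positive results reported above.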

Identification of researchers

In the second Use Case we aimed to alert IR managers when submitted articles had been published. (In this context a “submitted” article is an article that has been submitted to a scholarly journal, and in some cases accepted by the peer-review process, but not yet published.) Using sources from Sherpa/RoMEO we created a local directory of 108 repositories, most of them from the UK, including details of their OAI servers and RSS feeds. Our first approach was to set up a process to periodically collect and analyse the RSS feeds produced by the repositories. It quickly became evident to us that those RSS feeds were not suitable sources for our work. The problems found in these RSS feeds are discussed in detail in the ‘Do we need a “best practice” for generating RSS’s URLs for IR search results?’ blog post.

Our second approach to the second Use Case was to use OAI-PMH to harvest the IR OAI servers and thus identify recently deposited articles from the repositories. The first harvest uncovered interesting findings. First of all, the OAI repositories were not using a standard way to identify or categorise “submitted” articles, even among repositories using the same software platform. Therefore, there was no way to tell for sure whether an article was in fact a “submitted” one. Secondly, we ran a quick survey among 20 IR managers from a sample of harvested IRs. None of them were letting authors deposit submitted articles directly in their repositories. Most of these managers were only taking published articles, which made the distinction between submitted and published articles almost meaningless.

Having not succeeded in identifying “submitted” articles, we decided to apply the look-up tool against each article found in the repository. (This approach was only tested with two repositories and there is no evidence to suggest that it is a scalable solution, even though, at present, repositories have only a few thousand records.) Two new obstacles were identified when doing the matching against the complete content of the repositories that we harvested using OAI-PMH. The first was the low number of positive results obtained by this method, and the second was the inability to reliably identify new records from the OAI servers. The two IR managers informed us that using only the title of the article to match harvested articles with the metadata collected from the RSS feeds was not giving enough positive results. Adding the keywords, the abstract and the authors (if available) to the search query only increased the number of false positives. Automatically identifying new records in an OAI repository, in turn, was a challenging task because the repositories were inconsistent in how they catalogued the fields that were supposed to identify new records and the dates on which updates had been made.

In conclusion, the second Use Case produced relevant results only when the IR manager used the API to send search queries manually, and only if these queries included specific keywords taken from the title of the article and the results were filtered by the journal title. In these cases there is a high chance of obtaining either positive results or null results (the number of negative results is always much smaller than the number of positive results). However, the second Use Case has again highlighted the need for access to rich metadata to uniquely and unambiguously identify authors.
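The manual query that worked best (title keywords plus a journal title filter) can be sketched as follows. This is a minimal sketch under stated assumptions, not the project's look-up tool: the function names and the `journal` and `title` item fields are hypothetical, and the matching rule (every keyword must appear as a whole word in the title) is one simple choice among several.

```python
import re


def normalise(text):
    """Lower-case the text, drop punctuation and collapse whitespace."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def find_published(title_keywords, journal_title, feed_items):
    """Return feed items from the given journal whose title has all keywords.

    `feed_items` is a list of dicts with 'journal' and 'title' keys
    (hypothetical field names for this sketch).
    """
    wanted_journal = normalise(journal_title)
    wanted_words = {normalise(k) for k in title_keywords if normalise(k)}
    if not wanted_words:
        return []  # no usable keywords, so nothing can match
    hits = []
    for item in feed_items:
        # Filter by journal title first, as described in the text.
        if normalise(item["journal"]) != wanted_journal:
            continue
        title_words = set(normalise(item["title"]).split())
        if wanted_words <= title_words:
            hits.append(item)
    return hits
```

An empty result corresponds to the “null result” case mentioned above: the article has probably not yet appeared in that journal's TOC feed.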

In general, the most pressing concerns of repository managers were to get content for their repositories in the first place and then to have high-quality metadata. Even with the limitations mentioned in the previous paragraphs, the API has shown that it can still assist with both aims, as expressed in the feedback sent to the project by the majority of the IR managers who tested the prototype. The users have also appreciated the ability of the API to process heterogeneous and incomplete metadata to produce reusable, consistent and “clean” metadata on current publications.

Interestingly, new use cases for the API were identified by the users themselves. In the following paragraphs we briefly mention some of these use cases or potential spin-offs.

1. Providing relevant metadata to Research Information (RI) systems. Representatives from ATIRA, a Danish software company that commercialises the PURE RI system, approached the project to ask us to adjust some of the API’s calls to support two functions of PURE: (1) to automatically complete a journal’s metadata when the user is cataloguing a new article with PURE and (2) to provide cataloguers with an additional or alternative source of bibliographic references, alongside other data sources such as Web of Science, Scopus and BioMed Central.

2. Sherpa/RoMEO is interested in using the API to link journal titles and ISSNs to their publishers. Peter Millington, the SHERPA Technical Development Officer, found that the data returned by the API was very useful and easy to use. However, he identified the following functionality issues: (1) The API doesn’t support all the types of journal title query that RoMEO offers and needs (e.g. “contains”, “starts”, “exact phrase” queries). (2) Some keywords that the API ignores in order to support queries made by IR managers are needed for RoMEO queries. The exclusion of some stop words such as “journal” is particularly unhelpful in this respect. (3) RoMEO has also asked us to implement a new call to support queries on publisher names that returns a list of their journals.

3. Expanding the “users” call to get back a list of articles per user. The API is able to perform searches by the email address of a registered user and to return a list of the journals that the user has added to their MyTOCs folder. The call is being used by a large number of different types of users (e.g. librarians, students, researchers, etc.). Some of these users have asked us to expand this call so that users can request a list of articles in addition to the default option of returning a list of journals.

4. Using the API to provide library users with the capability of searching for the latest articles published in most of the journals for which the University has current subscriptions. That means that the user will always be able to access the full text of the articles returned in the search results. This application was requested by the institution leading the project, Heriot-Watt University. The API should be able to interoperate with A-Z journal lists, link resolvers and off-campus access control mechanisms such as EZproxy. In addition, users will be given the option to obtain their search results in RSS format. The library is keen to use the free service offered by the API because it will not need to transfer its holdings to any database external to the library or to modify its current database systems in order to use the API. Any UK university would benefit from the development of this API application. The only requirement is that the API is given restricted but sufficient HTTP access to the library database holding its current journal subscriptions.

5. Embedding search results in subject-based current awareness services. The “institution” call has also highlighted a new use case or area of application for the API. This application has already attracted a lot of attention from the community of students and academics in engineering, computing and mathematics since TechXtra launched its new service TechJournalContents, which is fully based on the API. TechXtra is a free service providing access to research, learning and teaching resources in engineering, mathematics and computing. The brand new service TechJournalContents was well received by TechXtra users and has already been mentioned in more than 50 relevant blogs. We would like to enhance the API subject classification database to support other subject-based services.
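To make the journal title query types requested by RoMEO in item 2 more concrete, here is a minimal sketch of “contains”, “starts” and “exact phrase” matching that keeps every word of the query, including “journal”. The semantics chosen for each mode are our assumption for this example (they are not taken from the RoMEO or JournalTOCs documentation), and the function names are hypothetical.

```python
def _tokens(text):
    """Split text into lower-cased words; no stop words are removed."""
    return text.lower().split()


def title_matches(journal_title, query, mode):
    """Check one journal title against a query.

    Assumed semantics:
      "contains"     - every query word appears in the title, in any order
      "starts"       - the title begins with the query words, in order
      "exact phrase" - the query words appear consecutively, in order
    """
    title = _tokens(journal_title)
    wanted = _tokens(query)
    if not wanted:
        return False  # an empty query matches nothing
    if mode == "contains":
        return all(word in title for word in wanted)
    if mode == "starts":
        return title[:len(wanted)] == wanted
    if mode == "exact phrase":
        n = len(wanted)
        return any(title[i:i + n] == wanted
                   for i in range(len(title) - n + 1))
    raise ValueError("unknown query mode: " + mode)


def search_titles(journal_titles, query, mode):
    """Return the journal titles that match the query in the given mode."""
    return [t for t in journal_titles if title_matches(t, query, mode)]
```

Because the query is never filtered against a stop word list, a “starts” query for “journal” still finds titles beginning with that word, which is the behaviour RoMEO found missing.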

A final thought from the project is that each of the above use cases, and in general any service based on reusing journal TOC RSS feeds, will greatly benefit from any effort that publishers make to implement the ticTOCs Metadata Recommendations and the project recommendation outlined in the Author Affiliation blog post. Publishers need to realise that the required effort is very small compared to the benefits brought by reusable TOC RSS feeds, in particular for their own business and for the research community in general. The question of how to “convince” publishers to produce valid, consistent and rich journal TOC RSS feeds remains unanswered.

convincing publishers

Written by Santiago Chumbe

December 11th, 2009 at 5:25 pm