Academic Search APIs and Retriever augmented language models (For SMU Libraries Hackathon)

11 Aug 2023

In the past, we talked about the latest generation of AI or Large Language Model powered search engines like Bing Chat, Elicit.org and Scite Assistant, and we briefly covered Scispace in this issue.

While the details differ, they all generally include a retriever or search engine part to find relevant information to help ground the generated results (also known as Retrieval-augmented generation).

If you are looking to create your own version of this for SMU, what retriever or search could you use? Here are our recommendations of sources with APIs:

OpenAlex API
CORE API
Scopus API
Primo API (with Alma API)

Instead of working directly with large language models, you might be using the LangChain framework, as it gives you access to many built-in search engine tools, but you could also consider creating your own adapters for academic search APIs.
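To make the adapter idea concrete, here is a minimal sketch in plain Python that does not depend on LangChain. It treats a retriever as any function that maps a query to a list of documents, and shows how retrieved records could be packed into a grounded prompt. The names `Document`, `build_grounded_prompt` and `demo_retriever` are hypothetical and chosen for illustration only; a real adapter would call one of the APIs listed above instead of returning canned data.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Document:
    """One retrieved record (hypothetical structure for this sketch)."""
    title: str
    text: str    # abstract or full-text snippet
    source: str  # which search API the record came from


# A retriever is simply a function: query -> list of documents.
Retriever = Callable[[str], List[Document]]


def build_grounded_prompt(question: str, retriever: Retriever, max_docs: int = 3) -> str:
    """Retrieve documents and place them in the prompt as numbered sources."""
    docs = retriever(question)[:max_docs]
    context = "\n".join(
        f"[{i}] {d.title} ({d.source}): {d.text}"
        for i, d in enumerate(docs, start=1)
    )
    return (
        "Answer the question using only the numbered sources below "
        "and cite them by number.\n\n"
        f"Sources:\n{context}\n\n"
        f"Question: {question}"
    )


def demo_retriever(query: str) -> List[Document]:
    """Stand-in retriever with canned data; swap in a real API adapter here."""
    return [
        Document("Example paper A", "Abstract text about " + query, "OpenAlex"),
        Document("Example paper B", "Another abstract.", "CORE"),
    ]


if __name__ == "__main__":
    print(build_grounded_prompt("open access citation advantage", demo_retriever))
```

The resulting prompt string can then be passed to whichever language model you are using. Keeping the retriever as a plain function makes it easy to swap one academic search API for another.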

	OpenAlex API	CORE API	Elsevier's Scopus API	Primo/Alma API
Description	Comprehensive source of Scholarly content but no full text	Same as OpenAlex but also includes full text of Open Access papers	Scopus is the source of citation data used for University Rankings by Times Higher Education and QS rankings.	By default, Primo API is the same as the default search on SMU Libraries home page. Alma API can be used to check individual loans and fines
Content – Metadata (e.g., Title, Abstract, Author)	242 million works	291 Million papers	82 million papers from mostly high impact prestigious journals	SMU libraries holdings (default)
Content – Full Text	None	32.8 million	None. ScienceDirect API with full-text is available.	There is some full-text
License	Open	Open	Commercial	Commercial
Access Requirement	None needed, just add your email to API call	Register for API Key	API and institutional token needed. See instructions.	Ask SMU Librarian for API key
Strengths	Comprehensive, easy to use, high rate limits – 100k requests per user per day	Similar in size to OpenAlex, has open access full text	Returns only articles from top tier journals. Citations are “official”	By default, returns only content available to SMU Libraries (including some open access)
Weaknesses	Comprehensive scope means it might return articles from dubious journals	Comprehensive scope means it might return articles from dubious journals	May miss relevant content	Only shows SMU available content
API documentation	Available	Available	Available	Available

Other Scholarly APIs to consider include

Semantic Scholar API (has open access full text like CORE)
Web of Science API (Similar to Scopus, however we have access only to basic API lite with limits)
Google Books API (Probably the best source of book data)

OpenAlex API

We first covered OpenAlex in a piece in Jan 2022. OpenAlex rose from the ashes of Microsoft Academic Graph as a replacement.

What is there to like about OpenAlex? As a source, it is one of the most comprehensive sources of scholarly metadata, with over 242 million records drawn from over 232K sources, including large datasets like Crossref, the ISSN Network and MAG, as well as preprint servers and institutional and disciplinary resources. As such, it includes not just peer-reviewed published papers but also preprints, postprints, conference papers and more.

Google Scholar is widely believed to be the largest single source of academic content, so why not use that? The reason is that it does not offer an API; scraping the data is possible but difficult.

In terms of API use, there is much to commend. The API is well-documented and designed to be easy to use, built around an entity model of works, authors, sources, institutions, concepts, publishers and funders.

You do not need to authenticate with an API key to use it. Just add your email to your API calls to be put in the polite pool, which allows a generous 100k API requests a day per user.
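As a rough sketch of what such a call could look like, the snippet below builds a search URL against the OpenAlex works endpoint and passes the email through the `mailto` parameter. The helper names `openalex_search_url` and `search_openalex` are our own, and the example email is a placeholder; check the OpenAlex documentation for the full list of parameters.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

OPENALEX_WORKS_URL = "https://api.openalex.org/works"


def openalex_search_url(query: str, email: str, per_page: int = 5) -> str:
    """Build a works search URL; `mailto` places the call in the polite pool."""
    params = {"search": query, "per-page": per_page, "mailto": email}
    return f"{OPENALEX_WORKS_URL}?{urlencode(params)}"


def search_openalex(query: str, email: str, per_page: int = 5) -> list:
    """Run the search (needs network access) and keep a few metadata fields."""
    with urlopen(openalex_search_url(query, email, per_page), timeout=30) as resp:
        data = json.load(resp)
    return [
        {
            "id": work.get("id"),
            "title": work.get("display_name"),
            "year": work.get("publication_year"),
        }
        for work in data.get("results", [])
    ]


if __name__ == "__main__":
    # Replace the placeholder email with your own before running.
    for work in search_openalex("retrieval augmented generation", "you@example.com"):
        print(work["year"], work["title"])
```

Only the standard library is used here; the third party Python packages mentioned below wrap the same endpoints with a friendlier interface.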

You can use the API directly or via third party Python or R packages.

Some limitations

It is comprehensive in scope, perhaps exceeded only by Google Scholar (which has no API), but it may include content that is dubious or published in what are sometimes called predatory journals
It does not include any full text, only metadata, which typically means title, author, journal and often abstract
Not all the metadata is complete: for example, only half of all records in OpenAlex have an abstract (though some are not article-type content)
It includes a sizable number of academic monographs but is not as comprehensive as systems like Google Books
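One practical detail related to the abstract limitation above: where an abstract exists, OpenAlex returns it in the `abstract_inverted_index` field (each word mapped to the positions where it occurs) rather than as plain text. If you want to feed abstracts to a language model, you will need to rebuild the text and handle records that have none. A minimal sketch, with `rebuild_abstract` being our own helper name:

```python
from typing import Dict, List, Optional


def rebuild_abstract(inverted_index: Optional[Dict[str, List[int]]]) -> Optional[str]:
    """Turn a word -> positions mapping back into plain text.

    Returns None when the record has no abstract, which is common.
    """
    if not inverted_index:
        return None
    words_by_position = {}
    for word, positions in inverted_index.items():
        for position in positions:
            words_by_position[position] = word
    return " ".join(words_by_position[p] for p in sorted(words_by_position))


if __name__ == "__main__":
    sample = {"Open": [0], "metadata": [1, 4], "beats": [2], "closed": [3]}
    print(rebuild_abstract(sample))
    print(rebuild_abstract(None))
```

A retriever built on OpenAlex could fall back to the title alone when this function returns `None`.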

CORE API

The main drawback of OpenAlex is that it does not have access to any full text. Is there a solution? There is no complete solution, but the CORE API comes close. CORE claims to be the “world’s largest collection of open access papers” and this is probably true. The API not only provides access to over 32 million full-text indexed documents, it also has over 291 million records of scholarly metadata, rivalling OpenAlex.

Use of the CORE API is slightly trickier, and you will need to request a free CORE API key in advance.
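Once you have a key, a call could look roughly like the sketch below. It assumes the version 3 search endpoint at `https://api.core.ac.uk/v3/search/works` and a bearer token in the `Authorization` header; treat both as assumptions to confirm against the current CORE API documentation. The helper names are our own.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed CORE API v3 search endpoint; confirm against the CORE documentation.
CORE_SEARCH_URL = "https://api.core.ac.uk/v3/search/works"


def core_search_request(query: str, api_key: str, limit: int = 5) -> Request:
    """Build (but do not send) a CORE search request carrying the API key."""
    url = f"{CORE_SEARCH_URL}?{urlencode({'q': query, 'limit': limit})}"
    return Request(url, headers={"Authorization": f"Bearer {api_key}"})


def search_core(query: str, api_key: str, limit: int = 5) -> list:
    """Send the request (needs network access) and return the raw records."""
    with urlopen(core_search_request(query, api_key, limit), timeout=30) as resp:
        return json.load(resp).get("results", [])


if __name__ == "__main__":
    # "YOUR_CORE_API_KEY" is a placeholder for the key you registered for.
    for record in search_core("library discovery systems", "YOUR_CORE_API_KEY"):
        print(record.get("title"))
```

Separating the request builder from the network call makes it easy to inspect the URL and headers before spending any of your rate limit.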

In terms of metadata records, it has similar limitations to OpenAlex, and full text is available only for the limited subset of papers that are Open Access.

BTW CORE has leveraged the power of GPT-4 together with the CORE API and announced CORE-GPT, which generates answers to questions using data from CORE, though at the time of writing there is no public demo. See the technical details at arxiv.org.

Scopus API

While the OpenAlex and CORE APIs are quite large and comprehensive in terms of metadata record coverage, some might prefer results that come from a more curated database. Scopus is one such source. It is a cross-disciplinary citation index with over 80 million records (again no full text), and its data, including citations, is the source of the ranking data used by university rankings like Times Higher Education and QS.

Though in terms of records, it is around one-third the size of OpenAlex and CORE, it provides quality control by only including articles from reputable and prestigious journals.

You will need to apply for an API key AND an institutional token in advance to use it. See instructions here.
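For orientation, the sketch below shows how the two credentials could be supplied to the Scopus Search API, using the `X-ELS-APIKey` and `X-ELS-Insttoken` request headers. The helper names and the placeholder credentials are our own; follow the instructions referred to above for the details of obtaining and using the real ones.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

SCOPUS_SEARCH_URL = "https://api.elsevier.com/content/search/scopus"


def scopus_search_parts(query: str, api_key: str, inst_token: str, count: int = 5):
    """Return the (url, headers) pair for a Scopus search without sending it."""
    url = f"{SCOPUS_SEARCH_URL}?{urlencode({'query': query, 'count': count})}"
    headers = {
        "X-ELS-APIKey": api_key,        # personal API key
        "X-ELS-Insttoken": inst_token,  # institutional token
        "Accept": "application/json",
    }
    return url, headers


def search_scopus(query: str, api_key: str, inst_token: str, count: int = 5) -> list:
    """Send the search (needs network access) and return the result entries."""
    url, headers = scopus_search_parts(query, api_key, inst_token, count)
    with urlopen(Request(url, headers=headers), timeout=30) as resp:
        data = json.load(resp)
    return data.get("search-results", {}).get("entry", [])


if __name__ == "__main__":
    # Both credentials below are placeholders.
    entries = search_scopus("TITLE-ABS-KEY(large language models)", "API_KEY", "INST_TOKEN")
    for entry in entries:
        print(entry.get("dc:title"))
```

The `TITLE-ABS-KEY(...)` wrapper is Scopus search syntax that restricts matching to titles, abstracts and keywords.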

Like OpenAlex API, Scopus API gives you access only to metadata records and not full text.

There is a related API, the ScienceDirect API, which will give you access to the full text of articles (including those behind a paywall) accessible to SMU on the ScienceDirect platform.

NEW: Scopus just announced a pilot “genAI-powered search model helps you find research papers by query and synthesizes the findings of decades of research into clear digestible summaries, in seconds.”

Primo and Alma API

Being able to discover scholarly content is great but what if you only want to restrict results to what is accessible to the SMU community?

This is where the Primo API comes into play. By default, the basic Primo Search API call will return results from our default library search at https://search.library.smu.edu.sg/discovery/search?vid=65SMU_INST:SMU_NUI.

What content does this cover? As you might expect, the default Primo search limits results to books, ebooks, articles and more that are available to the SMU community. This includes both paywalled and open access items.

On top of metadata, the Primo API allows access to CDI (Central Discovery Index), which indexes up to 65K characters of an item's full text if available (typically journal articles and online books).

The default SMU Primo search shows only the subset of CDI with content accessible to SMU. It is possible to set the API to return the full CDI results, though this will include items whose full text SMU cannot access.

Access to this API is restricted. Please contact SMU Libraries if you want access.
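Once you have a key, a basic Primo search call could be assembled along the lines of the sketch below. Only the `vid` value is taken from the library search URL above. The gateway host, the `tab` and `scope` values, and the use of the `pcAvailability` parameter to widen results beyond full-text-available CDI records are assumptions based on our reading of the Ex Libris Primo REST API and must be confirmed with SMU Libraries. The helper name is our own.

```python
from urllib.parse import urlencode

# Assumed Asia Pacific API gateway host; confirm with SMU Libraries.
PRIMO_SEARCH_URL = "https://api-ap.hosted.exlibrisgroup.com/primo/v1/search"


def primo_search_url(
    term: str,
    api_key: str,
    vid: str = "65SMU_INST:SMU_NUI",  # view ID from the library search URL
    tab: str = "Everything",          # hypothetical tab name
    scope: str = "MyInst_and_CI",     # hypothetical search scope name
    include_no_fulltext: bool = False,
    limit: int = 5,
) -> str:
    """Build a Primo search URL for a simple keyword query."""
    params = {
        "vid": vid,
        "tab": tab,
        "scope": scope,
        # Query format is field,precision,value
        "q": f"any,contains,{term}",
        # Assumed switch for including CDI records without accessible full text
        "pcAvailability": "true" if include_no_fulltext else "false",
        "limit": limit,
        "apikey": api_key,
    }
    return f"{PRIMO_SEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    # "YOUR_PRIMO_API_KEY" is a placeholder for the key issued by the library.
    print(primo_search_url("open access", "YOUR_PRIMO_API_KEY"))
```

Leaving `include_no_fulltext` at its default keeps the behaviour closest to the default library search, which is usually what you want when results must be accessible to the SMU community.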

Conclusion

We have curated this set of academic search APIs based on our understanding and familiarity with them, but many others exist. Feel free to chat with one of our library mentors during the hackathon.