Knowing where to search. Comparing the absolute and relative subject coverage of 56 databases

By Aaron Tay, Lead, Data Services

Gusenbauer, M. (2022). Search where you will find most: Comparing the disciplinary coverage of 56 bibliographic databases. Scientometrics, 1-63. https://doi.org/10.1007/s11192-022-04289-7

 

With the bewildering number of databases and academic search engines out there (I shall use "database" to refer to both from here on), which should you use? Past studies leave little doubt that Google Scholar is the biggest in terms of absolute coverage (for example, see the recent study “Google Scholar, Microsoft Academic, Scopus, Dimensions, Web of Science, and OpenCitations’ COCI: a multidisciplinary comparison of coverage via citations”), but does that mean you should always use only Google Scholar? In fact, you should consider a database's relative coverage of each subject as well as its absolute coverage. But how do you compare databases on both?

Why read this?

Search where you will find most: Comparing the disciplinary coverage of 56 bibliographic databases (Gusenbauer, 2022) is the only paper I know of so far that not only provides an objective method to compare databases by absolute and relative subject coverage but also applies it to as many as 56 databases! It includes not just traditional subject databases (e.g., PsycINFO, EconLit, ACM Digital Library), preprint servers (e.g., arXiv), publisher journal platforms (e.g., Sage, Wiley, Springer) and traditional citation indexes (e.g., Scopus, Social Sciences Citation Index) but also some of the latest up-and-coming huge free search engines that aggregate records, such as Lens.org, CORE, BASE, Semantic Scholar and the now-defunct Microsoft Academic.

Another way of thinking about it is that absolute coverage is important if you care about getting as many relevant results as possible (recall), even if it means spending time going through irrelevant results. On the other hand, if you want relevant results without going through too many irrelevant ones (precision), you want a database with high relative coverage of the subject.
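To make the distinction concrete, here is a small sketch with two made-up databases. The database names and record counts are hypothetical and are not taken from the paper. Relative coverage is computed here simply as the share of a database's records that fall within the subject, which is my reading of the term rather than the paper's exact definition.

```python
# Hypothetical record counts, for illustration only (not from Gusenbauer, 2022).
databases = {
    "BigMultidisciplinary": {"physics_records": 5_000_000, "total_records": 200_000_000},
    "SmallPhysicsIndex": {"physics_records": 1_500_000, "total_records": 2_000_000},
}


def relative_coverage(db):
    """Share of the database's records that belong to the subject."""
    return db["physics_records"] / db["total_records"]


for name, db in databases.items():
    print(f"{name}: absolute={db['physics_records']:,}, relative={relative_coverage(db):.1%}")
```

In this toy case the large database holds more physics records in absolute terms, while the small one is far more concentrated on physics, so a search there should surface fewer off-topic results.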

When should you use databases with high relative coverage by subject?

So, what should you consider when choosing a database?

Firstly, it depends on the information task you are trying to accomplish:

  • If you are doing a lookup search where you have a known item in mind, it is reasonable to use the search engine with the biggest absolute overall coverage, or the biggest absolute coverage in a subject if your search is subject specific (see Table 4 for details on the absolute coverage of 56 databases).
  • If you are doing exploratory searching in one known specific subject area, you might benefit from searching a database with the highest relative coverage in that area (see Table 5 for details on the relative coverage of 56 databases). When you are looking in more cross-disciplinary areas, choosing multidisciplinary databases with higher absolute coverage in multiple subjects would help serendipity (see Table 4). Either way, exploratory searching benefits heavily from search and browse features like powerful filters and citation searches, so databases that support these are helpful (see the paper “Which academic search systems are suitable for systematic reviews or meta-analyses? Evaluating retrieval qualities of Google Scholar, PubMed, and 26 other resources” for a detailed analysis of search and browsing features).
  • The trickiest task is if you are doing systematic reviews, where you aim to maximise recall with reasonable precision/effort. In such a case, you would use a mix of databases with high relative subject coverage (which helps with high precision results) AND large multidisciplinary databases with high absolute coverage (this will increase recall but at the price of lower precision). 
  • Other factors to consider include the record type you are looking for and whether the database includes it. Are you after just peer-reviewed journals? Or is grey literature allowed? In the latter case, for example, you wouldn’t want to use just Web of Science. (See Table 2 for characteristics of databases.)
  • Also consider the chronological coverage of databases. If you are looking for recent papers, using preprint servers like arXiv is fine for computer science. But if you want historical papers, Web of Science or the ACM Digital Library might be better. (See Table 2 for characteristics of databases.)

Why Gusenbauer’s paper is important

Given all these factors, where can you get up-to-date information on:

  • Absolute and relative coverage of databases by subject
  • Database characteristics like chronological coverage and type of records covered
  • Search and browse features available for databases?

A fairly recent paper, Which academic search systems are suitable for systematic reviews or meta-analyses? Evaluating retrieval qualities of Google Scholar, PubMed, and 26 other resources, provides a comprehensive listing and study of 28 different databases and platforms for the last two points.

But far harder to come by is data on absolute and relative coverage by subject. While some databases report absolute numbers, it is hard to ensure they are comparable, and most do not attempt to quantify their relative coverage beyond vague boasts that they are the top database for subject X.

This is where the recent paper by Gusenbauer (2022) comes in. It provides absolute coverage by subject of 56 databases (see Table 4) and relative coverage by subject (see Table 5). For the 56 databases covered and their abbreviations, see Table 2; for the subject abbreviations, see Table 1. The paper proposes an innovative sampling-based approach to estimate both absolute and relative coverage of databases.

The method is extremely sophisticated, but this is a rough sketch of the idea:

  • They specially test and select 14 unigram keywords to sample for each of the 26 subject areas (Scopus All Science Journal Classification, or ASJC). The key idea is this. Assume one of the 14 keywords for the area of “Physics & Astronomy” is “nanotechnology”. Because Scopus articles are already classified via ASJC subjects, they can filter to all articles classified under “Physics & Astronomy” and find that, say, 0.02% of such articles have the word “nanotechnology” in the title. This is the recall of the keyword.
  • To ensure better accuracy in the model, they do the same for the other 13 keyword samplings for the same subject.
  • Now, say they want to determine the absolute coverage of “Physics & Astronomy” records in Google Scholar. They will run the 14 keywords assigned for the subject and look at the total query counts. They then project this to an estimated absolute coverage of the subject by using the recall rates established in Scopus.
  • The main complication here is that the selected keywords are not 100% representative. For example, while records with “nanotechnology” in the title are mostly in Physics areas, some may not be. And some keywords would be worse than others, i.e., they might have low precision. In fact, some subjects like “Decision Science” have less discriminating or unique language, so the keywords used have lower precision and pick up irrelevant results. How do they adjust for this? Perhaps by dropping or decreasing the weight of lower-precision keywords?
  • The answer is that they can map Web of Science journal categories with Scopus ASJC subjects and use the WOS results as a control to guide decisions on precision cutoffs to adjust their results.
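The projection step described above can be sketched roughly as follows. This is a simplified illustration under my own assumptions, not the paper's actual model: all recall, precision and hit-count values are hypothetical, only three keywords are shown instead of 14, and combining the per-keyword estimates with a median and a single `min_precision` cutoff is a stand-in for the paper's more sophisticated adjustments.

```python
from statistics import median

# Hypothetical calibration values, for illustration only.
# recall: share of Scopus records in the subject whose title contains the keyword.
# precision: share of records matching the keyword that belong to the subject.
keywords = {
    "nanotechnology": {"recall": 0.0002, "precision": 0.80},
    "photon": {"recall": 0.0010, "precision": 0.90},
    "superconductivity": {"recall": 0.0005, "precision": 0.95},
}

# Hypothetical title-search hit counts returned by the database under study.
hit_counts = {"nanotechnology": 2_500, "photon": 10_000, "superconductivity": 5_000}


def estimate_subject_records(keywords, hit_counts, min_precision=0.0):
    """Project keyword hit counts to an estimated number of subject records."""
    estimates = []
    for word, stats in keywords.items():
        if stats["precision"] < min_precision:
            continue  # drop keywords that pick up too many off-subject records
        in_subject_hits = hit_counts[word] * stats["precision"]
        # If the keyword appears in `recall` of all subject records,
        # the subject holds roughly in_subject_hits / recall records.
        estimates.append(in_subject_hits / stats["recall"])
    return median(estimates)  # raises StatisticsError if every keyword is dropped


absolute = estimate_subject_records(keywords, hit_counts, min_precision=0.85)
total_records = 100_000_000  # hypothetical size of the whole database
print(f"estimated subject records: {absolute:,.0f}")
print(f"estimated relative coverage: {absolute / total_records:.1%}")
```

Dividing the estimated subject records by the total size of the database gives a rough relative coverage figure, which is why a good estimate of total database size matters for one measure and not the other.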

There are further adjustments, such as adjusting for multi-assignment of subjects in Scopus and normalising for the absolute size of the database, which improve estimates of absolute coverage of a subject (but not relative coverage). For more details, read the paper “Search where you will find most: Comparing the disciplinary coverage of 56 bibliographic databases”.

Librarian’s take: This is one of the most impressive papers I have read that studies the coverage of databases in depth. The statistical method the paper uses is extremely sophisticated and far beyond my statistical skill to evaluate. What is reassuring is that the method is validated in many ways, including comparing the same database across different platforms (e.g., Medline over Ovid, EBSCO, ProQuest) and running the sampling keyword search not just on titles but also on abstracts and all fields. The results are consistent, which bodes well for internal validity.

In terms of external validity, besides using WOS as a control set, the results that emerge from this method for specialised databases are very plausible. For example, it identifies CINAHL as having the highest relative coverage in Medicine and Nursing subjects, which aligns with how the content provider, EBSCO, describes the database and supports what many experienced librarians believe. Still, some results are less intuitive, at least at first glance.

Also, as the paper describes in its limitations section, this method does not address the question of quality but merely assigns content to the 26 subject areas. People who accept that Google Scholar may have the highest absolute coverage of physics but still refuse to use it because they see its content as low quality (e.g., predatory articles, preprints) can still do so.

References

Gusenbauer, M. (2022). Search where you will find most: Comparing the disciplinary coverage of 56 bibliographic databases. Scientometrics, 1-63. https://doi.org/10.1007/s11192-022-04289-7

Gusenbauer, M., & Haddaway, N. R. (2020). Which academic search systems are suitable for systematic reviews or meta‐analyses? Evaluating retrieval qualities of Google Scholar, PubMed, and 26 other resources. Research Synthesis Methods, 11(2), 181-217. https://doi.org/10.1002/jrsm.1378