This item contains both metadata and fulltext PDF content (from the public web) related to research on COVID-19 and past influenza pandemics. This content backs the https://covid19.fatcat.wiki search interface.
Rough numbers:
- over 51,000 metadata records from the 2020-04-10 release of the CORD19 corpus
- over 79,000 metadata records total (union of the above plus fatcat.wiki keyword matches)
- over 45,000 fulltext PDF files, with derived PNG thumbnails and pdftotext text files
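The total is a union, so a work that appears in both CORD19 and the keyword matches is counted once. A minimal sketch of that counting, using made-up identifiers and a hypothetical `merge_records` helper (neither comes from the actual pipeline):

```python
def merge_records(cord19_ids, keyword_ids):
    """Return the union of two identifier collections, counting overlaps once."""
    return set(cord19_ids) | set(keyword_ids)


# Hypothetical identifiers, for illustration only.
cord19_ids = ["work-a", "work-b", "work-c"]
keyword_ids = ["work-b", "work-c", "work-d", "work-e"]

merged = merge_records(cord19_ids, keyword_ids)
print(len(merged))  # 5 rather than 7: "work-b" and "work-c" appear in both lists
```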
The upstream "CORD19" dataset from AI2 / Semantic Scholar is used only to identify works for inclusion. Primary metadata comes from fatcat (which aggregates other open sources), while fulltext PDF content comes from historical and ongoing public web crawling into the Wayback Machine (https://web.archive.org).
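Wayback Machine captures are addressed by a capture timestamp followed by the original URL. A minimal sketch of building such a replay URL, assuming a 14-digit `YYYYMMDDhhmmss` timestamp; the `wayback_url` helper and the example values are hypothetical and are not taken from this corpus:

```python
WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_url(original_url: str, timestamp: str) -> str:
    """Build a Wayback Machine replay URL for one capture.

    `timestamp` is assumed to be 14 digits in YYYYMMDDhhmmss form.
    """
    if len(timestamp) != 14 or not timestamp.isdigit():
        raise ValueError("timestamp must be 14 digits (YYYYMMDDhhmmss)")
    return f"{WAYBACK_PREFIX}/{timestamp}/{original_url}"


# Hypothetical capture, for illustration only.
example = wayback_url("http://example.com/paper.pdf", "20200410000000")
print(example)
```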
This corpus is provided as a convenience to librarians, archivists, researchers, and others building automated tools to understand the state of research. All users and re-distributors of this content must verify licensing terms on the fulltext content, particularly around non-commercial limitations or special time-limited exceptions by publishers for COVID-19 research purposes.