11.6M
Aug 4, 2017
by Internet Archive Web Group
Microsoft Academic Graph public corpus (Feb 2016) PDF URLs, filtered to remove large sites (pubmed, citeseerx, arxiv) and already-crawled URLs.
Topics: papers, journals
3M
Sep 21, 2017
by Internet Archive Web Group
Internet Archive crawl of PDF URLs provided by Semantic Scholar.
Topic: pdf
'dat' is a distributed web data archiving and transfer tool, originally developed by Code for Science, a grant-funded US non-profit. This collection preserves a selection of early and experimental dat archives. Note that important dat metadata is stored in a '.dat/' subdirectory, which is not displayed in the "download" file listings by default but can be browsed and downloaded from archive.org over HTTP(S) as expected.
Topics: dat, distributed web
A targeted crawl to fetch research publications from the public web which have been crawled by CiteSeerX but have not previously been crawled by the Internet Archive.
Topics: scholarly, papers, journal