
5
UPLOADS


MSAG-PDF-CRAWL-2017
collection
1,855
ITEMS
11.6M
VIEWS
by Internet Archive Web Group

Microsoft Academic Graph public corpus (Feb 2016) PDF URLs, filtered to remove large sites (pubmed, citeseerx, arxiv) and already-crawled URLs.
Topics: papers, journals
Wide Web Targeted PDF Crawling (2017)
collection
922
ITEMS
3M
VIEWS
by Internet Archive Web Group

collection

1.9M
VIEWS

IA crawl of PDF URLs provided by Semantic Scholar.
Topic: pdf
Dat Early Days Collection
collection
4
ITEMS
6,360
VIEWS

'dat' is a distributed web data archiving and transfer tool, originally developed by Code for Science, a grant-funded US non-profit. This collection preserves a selection of early and experimental dat archives. Note that important dat metadata is contained in a '.dat/' subdirectory, which is not displayed under "download" file listings by default, but can be browsed and downloaded from archive.org over HTTP(S) as expected.
Topics: dat, distributed web
CiteSeerX URL Crawl 2017
collection
207
ITEMS
1.1M
VIEWS

A targeted crawl to fetch research publications from the public web which have been crawled by CiteSeerX but have not previously been crawled by the Internet Archive.
Topics: scholarly, papers, journal