OAI-PMH-CRAWL-2020-06
by Internet Archive Web Group
data
0 views

Web PDF Training Sets
by Internet Archive Web Group
collection
6 items, 196 views

UNPAYWALL-PDF-CRAWL-2018-07
by Internet Archive Web Group
data
22 views

DIRECT-OA-CRAWL-2019
by Internet Archive Web Group
collection
2,566 items, 6.5M views

Open Access Journal Test Crawl (2018)
by Internet Archive Web Group
data
8 views

DIRECT-OA-CRAWL-2019
by Internet Archive Web Group
data
4 views

CORE-UPSTREAM-CRAWL-2018-11
by Internet Archive Web Group
data
4 views

"Full" crawl logs (one entry for every hit) from the CORE-UPSTREAM-CRAWL-2018-11 crawl. See also the 'CORE-UPSTREAM-CRAWL-2018-11-CRL' item for reports etc.
Scholarly TDM Corpora
by Internet Archive Web Group
collection
44 items, 31 views

Access-restricted text and data-mining corpora. If you are interested in getting access to work with this content, contact info@archive.org.
UNPAYWALL-PDF-CRAWL-2019-04
by Internet Archive Web Group
data
2 views

Web PDF GROBID Corpus (July 2019)
by Internet Archive Web Group
collection
10 items, 17 views

OAI-PMH-CRAWL-2020-06
by Internet Archive Web Group
data
2 views

arXiv Content Crawl (2019-10)
by Internet Archive Web Group
collection
37 items, 101,719 views

UNPAYWALL-PDF-CRAWL-2019-04
by Internet Archive Web Group
collection
641 items, 7.1M views

OA-JOURNAL-CRAWL-2019-08
by Internet Archive Web Group
collection
201 items, 3.2M views

PubMed Central Crawl (2019-10)
by Internet Archive Web Group
collection
216 items, 523,609 views

DIRECT-OA-CRAWL-2019
by Internet Archive Web Group
data
6 views

UNPAYWALL-PDF-CRAWL-2018-07
by Internet Archive Web Group
data
21 views

UNPAYWALL-PDF-CRAWL-2020-11
by Internet Archive Web Group
data
1 view

UNPAYWALL-PDF-CRAWL-2018-07
by Internet Archive Web Group
data
3 views

CORE-UPSTREAM-CRAWL-2018-11
by Internet Archive Web Group
data
7 views

Crawl reports and logs for the CORE-UPSTREAM-CRAWL-2018-11 crawl. See also the 'CORE-UPSTREAM-CRAWL-2018-11-full_crawl_logs' item.
UNPAYWALL-PDF-CRAWL-2019-04
by Internet Archive Web Group
data
1 view

Web PDF GROBID Corpus (June 2019)
by Internet Archive Web Group
collection
10 items, 54 views

OAI-PMH-CRAWL-2020-06
by Internet Archive Web Group
collection
2,946 items, 7.2M views

UNPAYWALL-PDF-CRAWL-2020-11
by Internet Archive Web Group
data
0 views

OMICS-DOI-LANDING-CRAWL-2019-04
by Internet Archive Web Group
data
1 view

Internet Archive Research Publication Crawls
by Internet Archive Web Group
collection
21,257 items, 134M views

A series of open web crawls targeting journal articles, technical memos, essays, datasets, and other research publications. This collection contains WARC and CDX files that end up in the Wayback Machine (https://web.archive.org). See also the bibliographic metadata corpora at https://archive.org/details/ia_biblio_metadata
OMICS-DOI-LANDING-CRAWL-2019-04
by Internet Archive Web Group
data
5 views

SEMSCHOLAR-DIRECT-PDF-CRAWL-2020-02
by Internet Archive Web Group
data
1 view

UNPAYWALL-PDF-CRAWL-2020-05
by Internet Archive Web Group
collection
282 items, 2.2M views

OMICS-DOI-LANDING-CRAWL-2019-04
by Internet Archive Web Group
collection
4 items, 14,750 views

This crawl started in April 2019 as an informal collaboration with Crossref. It crawls a smallish number (100k) of DOI redirects and landing pages (plus PDF outlinks, and maybe a couple of other hops) for a single large publisher (OMICS, which has multiple subsidiaries). The intent is to get a reasonably good capture that can be used as canonical preservation copies of the landing pages. A secondary goal is to get decent fulltext capture coverage.
DOI-CRAWL-2022-02
data
0 views

SCIELO-CRAWL-2020-07
data
1 view

OAI-PMH-CRAWL-2022-10
data
0 views

This item contains checksums and file-level metadata for most (if not all) files collected in this crawl. The tab-separated-value (.tsv) file is similar to a CDX file but contains additional hashes.
CiteSeerX URL Crawl 2017
data
5 views

Configuration, reports, and logs for the CITESEERX-CRAWL-2017 crawl.
OAI-PMH-PATCH-CRAWL-2021-12
collection
75 items, 535,208 views

DATASET-CRAWL-2022-01
collection
5 items, 6,069 views

OA-DOI-CRAWL-2020-12
data
0 views

OAI-PMH-PATCH-CRAWL-2021-12
data
1 view

UNPAYWALL-PDF-CRAWL-2022-04
data
3 views

DATASET-CRAWL-2022-01
data
2 views

DATASET-CRAWL-2022-01
data
6 views

PUBMEDCENTRAL-CRAWL-2020-02
data
1 view

OAI-PMH-CRAWL-2022-10
collection
72 items, 62,745 views

UNPAYWALL-PDF-CRAWL-2021-07
collection
174 items, 1.3M views

MAG-PDF-CRAWL-2021-08
collection
189 items, 1.2M views

DOI-CRAWL-2022-02
collection
25 items, 403,743 views

JOURNALS-PATCH-CRAWL-2022-01
collection
104 items, 1.4M views

arXiv Content Crawl (2019-10)
data
3 views

OA-DOI-CRAWL-2020-02
data
1 view

MSAG-PDF-CRAWL-2017
data
11 views

This item contains checksums and file-level metadata for most (if not all) files collected in this crawl. The tab-separated-value (.tsv) file is similar to a CDX file but contains additional hashes.
TARGETED-ARTICLE-CRAWL-2022-03
data
0 views

TARGETED-ARTICLE-CRAWL-2022-03
data
1 view

TARGETED-ARTICLE-CRAWL-2022-04
collection
220 items, 650,705 views

UNPAYWALL-PDF-CRAWL-2022-04
collection
41 items, 189,812 views

DOAJ-CRAWL-2020-11
data
3 views

ARXIV-PUBMEDCENTRAL-CRAWL-2020-04
data
2 views

UNPAYWALL-PDF-CRAWL-2020-05
data
1 view

JOURNAL-HOMEPAGE-CRAWL-2022-03
data
3 views

PubMed Central Crawl (2019-10)
data
3 views

UNPAYWALL-PDF-CRAWL-2020-03
data
0 views

TARGETED-ARTICLE-CRAWL-2022-04
data
3 views

UNPAYWALL-PDF-CRAWL-2021-05
data
10 views

This item contains checksums and file-level metadata for most (if not all) files collected in this crawl. The tab-separated-value (.tsv) file is similar to a CDX file but contains additional hashes.
OAI-PMH-PATCH-CRAWL-2021-12
data
0 views

UNPAYWALL-PDF-CRAWL-2020-05
data
1 view

arXiv Content Crawl (2019-10)
data
3 views

PUBMEDCENTRAL-CRAWL-2020-02
data
2 views

TARGETED-ARTICLE-CRAWL-2022-03
collection
9 items, 83,499 views

JOURNAL-HOMEPAGE-CRAWL-2022-03
data
1 view

JOURNAL-HOMEPAGE-CRAWL-2022-03
collection
47 items, 443,681 views

DOI-CRAWL-2022-02
data
0 views

TARGETED-ARTICLE-CRAWL-2022-07
data
0 views

PubMed Central Crawl (2019-10)
data
4 views

TARGETED-ARTICLE-CRAWL-2022-04
data
0 views

OAI-PMH-CRAWL-2022-10
data
0 views

TARGETED-ARTICLE-CRAWL-2022-07
data
0 views

OA-DOI-CRAWL-2020-02
data
1 view

TARGETED-ARTICLE-CRAWL-2022-07
collection
43 items, 217,124 views

SCIELO-CRAWL-2020-07
data
3 views

CiteSeerX URL Crawl 2017
data
12 views

This item contains checksums and file-level metadata for most (if not all) files collected in this crawl. The tab-separated-value (.tsv) file is similar to a CDX file but contains additional hashes.