DOI-LANDING-CRAWL-2018-06
by Internet Archive Web Group
data

9 views

This item contains output files related to the DOI-LANDING-CRAWL-2018-06 crawl of Crossref DOI redirect landing pages:
- list of Crossref DOI numbers attempted
- an index of DOI, URL, and final HTTP status codes
UNPAYWALL-PDF-CRAWL-2018-07
by Internet Archive Web Group
data

2 views

See also the crawl logs item for this crawl.
UNPAYWALL-PDF-CRAWL-2018-07
by Internet Archive Web Group
data

1 view

Open Access Journal Test Crawl (2018)
by Internet Archive Web Group
collection
794 items
12.4M views

OAI-PMH-CRAWL-2020-06
by Internet Archive Web Group
data

2 views

DOI-CRAWL-2022-02
collection
25 items
311,914 views

JOURNALS-PATCH-CRAWL-2022-01
collection
104 items
1.1M views

DOI-LANDING-CRAWL-2018-06
by Internet Archive Web Group
data

6 views

DOI-LANDING-CRAWL-2018-06
by Internet Archive Web Group
collection
279 items
3.6M views

CORE-UPSTREAM-CRAWL-2018-11
by Internet Archive Web Group
collection
741 items
2.2M views

Crawl of "upstream" URLs from the CORE (core.ac.uk) metadata dump. Only a partial seedlist of files was crawled.
DOI-LANDING-CRAWL-2018-06
by Internet Archive Web Group
data

6 views

by Internet Archive Web Group
collection

6,874 views

This collection contains web crawl data for a random selection of 500k (0.5 million) Crossref DOI redirects, including the doi.org redirect requests. The intent of this crawl is to gather loose statistics on the number of failing redirects and the number of host websites that block automated crawling, and to build a corpus of HTML landing pages for metadata extraction (e.g., "signposting" HTTP headers, linked data HTML metadata, semantic markup). Total size of (uncompressed) WARC data is 50 GB,...
CORE-UPSTREAM-CRAWL-2018-11
by Internet Archive Web Group
data

3 views

"Full" crawl logs (for every hit) from CORE-UPSTREAM-CRAWL-2018-11 crawl. See also 'CORE-UPSTREAM-CRAWL-2018-11-CRL' item for reports etc.
DIRECT-OA-CRAWL-2019
by Internet Archive Web Group
data

4 views

PUBMEDCENTRAL-CRAWL-2020-02
data

0 views

OA-DOI-CRAWL-2020-12
data

0 views

DATASET-CRAWL-2022-01
data

6 views

UNPAYWALL-PDF-CRAWL-2021-07
collection
174 items
1.2M views

MAG-PDF-CRAWL-2021-08
collection
189 items
1M views

OAI-PMH-PATCH-CRAWL-2021-12
data

1 view

DATASET-CRAWL-2022-01
data

2 views

UNPAYWALL-PDF-CRAWL-2022-04
data

3 views

SEMSCHOLAR-DIRECT-PDF-CRAWL-2020-02
by Internet Archive Web Group
data

1 view

PubMed Central Crawl (2019-10)
data

3 views

OMICS-DOI-LANDING-CRAWL-2019-04
by Internet Archive Web Group
data

4 views

DOI-CRAWL-2022-02
data

0 views

JOURNAL-HOMEPAGE-CRAWL-2022-03
data

1 view

TARGETED-ARTICLE-CRAWL-2022-07
data

0 views

JOURNAL-HOMEPAGE-CRAWL-2022-03
collection
44 items
370,303 views

UNPAYWALL-PDF-CRAWL-2020-05
by Internet Archive Web Group
collection
282 items
2M views

OA-DOI-CRAWL-2020-02
data

1 view

TARGETED-ARTICLE-CRAWL-2022-04
data

0 views

SCIELO-CRAWL-2020-07
data

2 views

TARGETED-ARTICLE-CRAWL-2022-07
data

0 views

TARGETED-ARTICLE-CRAWL-2022-07
collection
0 items
90,617 views

UNPAYWALL-PDF-CRAWL-2020-05
data

1 view

CORE-UPSTREAM-CRAWL-2018-11
by Internet Archive Web Group
data

6 views

Crawl reports and logs for CORE-UPSTREAM-CRAWL-2018-11 crawl. See also 'CORE-UPSTREAM-CRAWL-2018-11-full_crawl_logs' item.
UNPAYWALL-PDF-CRAWL-2020-03
data

0 views

PubMed Central Crawl (2019-10)
data

3 views

JOURNAL-HOMEPAGE-CRAWL-2022-03
data

3 views

UNPAYWALL-PDF-CRAWL-2018-07
by Internet Archive Web Group
data

3 views

TARGETED-ARTICLE-CRAWL-2022-04
data

3 views

OAI-PMH-CRAWL-2020-06
by Internet Archive Web Group
collection
2,946 items
6.8M views

SCIELO-CRAWL-2020-07
data

0 views

OAI-PMH-CRAWL-2020-06
by Internet Archive Web Group
data

0 views

DOI-CRAWL-2022-02
data

0 views

UNPAYWALL-PDF-CRAWL-2018-07
by Internet Archive Web Group
data

21 views

DATASET-CRAWL-2022-01
collection
2 items
5,425 views

OAI-PMH-PATCH-CRAWL-2021-12
collection
75 items
465,100 views

UNPAYWALL-PDF-CRAWL-2018-07
by Internet Archive Web Group
collection
1,241 items
16.8M views

Web archive data from a crawl of open access PDF URLs provided by Unpaywall.
DIRECT-OA-CRAWL-2019
by Internet Archive Web Group
data

5 views

UNPAYWALL-PDF-CRAWL-2020-11
by Internet Archive Web Group
data

0 views

ARXIV-PUBMEDCENTRAL-CRAWL-2020-04
data

1 view

DOAJ-CRAWL-2020-11
data

2 views

OA-DOI-CRAWL-2020-02
data

1 view

arXiv Content Crawl (2019-10)
data

3 views

TARGETED-ARTICLE-CRAWL-2022-03
data

0 views

TARGETED-ARTICLE-CRAWL-2022-03
data

1 view

UNPAYWALL-PDF-CRAWL-2018-07
by Internet Archive Web Group
data

20 views

OA-JOURNAL-CRAWL-2019-08
by Internet Archive Web Group
collection
201 items
3M views

TARGETED-ARTICLE-CRAWL-2022-04
collection
219 items
474,917 views

UNPAYWALL-PDF-CRAWL-2020-11
by Internet Archive Web Group
data

0 views

PUBMEDCENTRAL-CRAWL-2020-02
data

1 view

OMICS-DOI-LANDING-CRAWL-2019-04
by Internet Archive Web Group
data

0 views

UNPAYWALL-PDF-CRAWL-2020-05
data

1 view

OAI-PMH-PATCH-CRAWL-2021-12
data

0 views

arXiv Content Crawl (2019-10)
data

3 views

TARGETED-ARTICLE-CRAWL-2022-03
collection
9 items
71,956 views

CiteSeerX URL Crawl 2017
data

10 views

This item contains checksums and file-level metadata for most (if not all) files collected in this crawl. The tab-separated-value (.tsv) file is similar to a CDX file but contains additional hashes.
Wide Web Targeted PDF Crawling (2017)
data

10 views

This item contains checksums and file-level metadata for most (if not all) files collected in this crawl. The tab-separated-value (.tsv) file is similar to a CDX file but contains additional hashes.
UNPAYWALL-PDF-CRAWL-2021-05
data

9 views

CiteSeerX URL Crawl 2017
data

4 views

Configuration, Reports, and Logs for CITESEERX-CRAWL-2017 crawl.
This item contains checksums and file-level metadata for most (if not all) files collected in this crawl. The tab-separated-value (.tsv) file is similar to a CDX file but contains additional hashes.
MSAG-PDF-CRAWL-2017
data

10 views

This item contains checksums and file-level metadata for most (if not all) files collected in this crawl. The tab-separated-value (.tsv) file is similar to a CDX file but contains additional hashes.
UNPAYWALL-PDF-CRAWL-2022-04
collection
38 items
112,771 views