UNPAYWALL-PDF-CRAWL-2021-05
data

eye 9

favorite 0

comment 0

Internet Archive Research Publication Crawls
by CNKI
data

eye 0

favorite 0

comment 0

Metadata about COVID-19 papers downloaded from: http://en.gzbd.cnki.net/GZBT/brief/Default.aspx
Internet Archive Research Publication Crawls
by Wanfang Data
data

eye 4

favorite 0

comment 0

Metadata and some fulltext PDFs from Wanfang Data, downloaded 2020-03-29 from http://subject.med.wanfangdata.com.cn/Channel/7
Internet Archive Research Publication Crawls
by Wanfang Data
data

eye 6

favorite 0

comment 0

Metadata and some fulltext PDFs from Wanfang Data, downloaded 2020-03-29 from http://subject.med.wanfangdata.com.cn/Channel/7
UNPAYWALL-PDF-CRAWL-2019-04
by Internet Archive Web Group
data

eye 2

favorite 0

comment 0

OAI-PMH-CRAWL-2020-06
by Internet Archive Web Group
data

eye 2

favorite 0

comment 0

arXiv Content Crawl (2019-10)
collection
37
ITEMS
96,737
VIEWS
by Internet Archive Web Group
collection

eye 96,737

JOURNALS-PATCH-CRAWL-2022-01
collection
104
ITEMS
1.1M
VIEWS
collection

eye 1.1M

UNPAYWALL-PDF-CRAWL-2019-04
collection
641
ITEMS
6.3M
VIEWS
by Internet Archive Web Group
collection

eye 6.3M

UNPAYWALL-PDF-CRAWL-2020-03
collection
344
ITEMS
2.2M
VIEWS
by Internet Archive Web Group
collection

eye 2.2M

SEMSCHOLAR-DIRECT-PDF-CRAWL-2020-02
collection
1,011
ITEMS
1.8M
VIEWS
by Internet Archive Web Group
collection

eye 1.8M

collection

eye 2.1M

Internet Archive crawl of PDF URLs provided by Semantic Scholar.
Topic: pdf
DOAJ-CRAWL-2020-11
collection
102
ITEMS
1M
VIEWS
by Internet Archive Web Group
collection

eye 1M

DOI-CRAWL-2022-02
collection
25
ITEMS
309,045
VIEWS
collection

eye 309,045

PubMed Central Crawl (2019-10)
data

eye 3

favorite 0

comment 0

OMICS-DOI-LANDING-CRAWL-2019-04
by Internet Archive Web Group
data

eye 4

favorite 0

comment 0

MAG-PDF-CRAWL-2020-07
by Internet Archive Web Group
data

eye 0

favorite 0

comment 0

SEMSCHOLAR-DIRECT-PDF-CRAWL-2020-02
by Internet Archive Web Group
data

eye 1

favorite 0

comment 0

OA-JOURNAL-CRAWL-2020-07
by Internet Archive Web Group
data

eye 4

favorite 0

comment 0

JOURNAL-HOMEPAGE-CRAWL-2022-03
data

eye 1

favorite 0

comment 0

TARGETED-ARTICLE-CRAWL-2022-07
data

eye 0

favorite 0

comment 0

DOI-CRAWL-2022-02
data

eye 0

favorite 0

comment 0

UNPAYWALL-PDF-CRAWL-2018-07
by Internet Archive Web Group
data

eye 2

favorite 0

comment 0

See also the crawl logs item for this crawl.
JOURNAL-HOMEPAGE-CRAWL-2022-03
collection
44
ITEMS
367,737
VIEWS
collection

eye 367,737

Wide Web Targeted PDF Crawling (2017)
collection
922
ITEMS
3.4M
VIEWS
by Internet Archive Web Group
collection

eye 3.4M

UNPAYWALL-PDF-CRAWL-2020-05
collection
282
ITEMS
2M
VIEWS
by Internet Archive Web Group
collection

eye 2M

MAG-PDF-CRAWL-2020-03
collection
489
ITEMS
4.7M
VIEWS
by Internet Archive Web Group
collection

eye 4.7M

OA-DOI-CRAWL-2020-12
collection
191
ITEMS
1.7M
VIEWS
by Internet Archive Web Group
collection

eye 1.7M

UNPAYWALL-PDF-CRAWL-2020-05
data

eye 1

favorite 0

comment 0

JOURNAL-HOMEPAGE-CRAWL-2022-03
data

eye 3

favorite 0

comment 0

DOI-LANDING-CRAWL-2018-06
by Internet Archive Web Group
data

eye 6

favorite 0

comment 0

CORE-UPSTREAM-CRAWL-2018-11
by Internet Archive Web Group
data

eye 6

favorite 0

comment 0

Crawl reports and logs for CORE-UPSTREAM-CRAWL-2018-11 crawl. See also 'CORE-UPSTREAM-CRAWL-2018-11-full_crawl_logs' item.
PubMed Central Crawl (2019-10)
data

eye 3

favorite 0

comment 0

UNPAYWALL-PDF-CRAWL-2020-03
data

eye 0

favorite 0

comment 0

UNPAYWALL-PDF-CRAWL-2019-04
by Internet Archive Web Group
data

eye 0

favorite 0

comment 0

TARGETED-ARTICLE-CRAWL-2022-04
data

eye 3

favorite 0

comment 0

UNPAYWALL-PDF-CRAWL-2018-07
by Internet Archive Web Group
data

eye 3

favorite 0

comment 0

OA-DOI-CRAWL-2020-02
collection
278
ITEMS
3.8M
VIEWS
by Internet Archive Web Group
collection

eye 3.8M

OAI-PMH-CRAWL-2020-06
collection
2,946
ITEMS
6.8M
VIEWS
by Internet Archive Web Group
collection

eye 6.8M

UNPAYWALL-PDF-CRAWL-2020-11
by Internet Archive Web Group
data

eye 0

favorite 0

comment 0

DIRECT-OA-CRAWL-2019
by Internet Archive Web Group
data

eye 5

favorite 0

comment 0

arXiv Content Crawl (2019-10)
data

eye 3

favorite 0

comment 0

DOAJ-CRAWL-2020-11
data

eye 2

favorite 0

comment 0

MAG-PDF-CRAWL-2020-07
by Internet Archive Web Group
data

eye 0

favorite 0

comment 0

OA-DOI-CRAWL-2020-02
data

eye 1

favorite 0

comment 0

ARXIV-PUBMEDCENTRAL-CRAWL-2020-04
data

eye 1

favorite 0

comment 0

TARGETED-ARTICLE-CRAWL-2022-03
data

eye 0

favorite 0

comment 0

TARGETED-ARTICLE-CRAWL-2022-03
data

eye 1

favorite 0

comment 0

UNPAYWALL-PDF-CRAWL-2018-07
by Internet Archive Web Group
data

eye 20

favorite 0

comment 0

OA-JOURNAL-CRAWL-2019-08
collection
201
ITEMS
3M
VIEWS
by Internet Archive Web Group
collection

eye 3M

TARGETED-ARTICLE-CRAWL-2022-04
collection
219
ITEMS
468,730
VIEWS
collection

eye 468,730

PubMed Central Crawl (2019-10)
collection
216
ITEMS
517,355
VIEWS
by Internet Archive Web Group
collection

eye 517,355

DOI-LANDING-CRAWL-2018-06
by Internet Archive Web Group
data

eye 9

favorite 0

comment 0

This item contains output files related to the DOI-LANDING-CRAWL-2018-06 crawl of Crossref DOI redirect landing pages:
- list of Crossref DOI numbers attempted
- an index of DOI, URL, and final HTTP status codes
Open Access Journal Test Crawl (2018)
by Internet Archive Web Group
data

eye 8

favorite 0

comment 0

OA-DOI-CRAWL-2020-12
data

eye 0

favorite 0

comment 0

DIRECT-OA-CRAWL-2019
by Internet Archive Web Group
data

eye 4

favorite 0

comment 0

PUBMEDCENTRAL-CRAWL-2020-02
data

eye 0

favorite 0

comment 0

CORE-UPSTREAM-CRAWL-2018-11
by Internet Archive Web Group
data

eye 3

favorite 0

comment 0

"Full" crawl logs (for every hit) from CORE-UPSTREAM-CRAWL-2018-11 crawl. See also 'CORE-UPSTREAM-CRAWL-2018-11-CRL' item for reports etc.
OAI-PMH-PATCH-CRAWL-2021-12
data

eye 1

favorite 0

comment 0

UNPAYWALL-PDF-CRAWL-2022-04
data

eye 3

favorite 0

comment 0

DATASET-CRAWL-2022-01
data

eye 2

favorite 0

comment 0

DATASET-CRAWL-2022-01
data

eye 6

favorite 0

comment 0

UNPAYWALL-PDF-CRAWL-2021-07
collection
174
ITEMS
1.2M
VIEWS
collection

eye 1.2M

MAG-PDF-CRAWL-2021-08
collection
189
ITEMS
1M
VIEWS
collection

eye 1M

OA-JOURNAL-CRAWL-2020-07
collection
1,923
ITEMS
11.7M
VIEWS
by Internet Archive Web Group
collection

eye 11.7M

SCIELO-CRAWL-2020-07
data

eye 0

favorite 0

comment 0

OAI-PMH-CRAWL-2020-06
by Internet Archive Web Group
data

eye 0

favorite 0

comment 0

DOI-CRAWL-2022-02
data

eye 0

favorite 0

comment 0

OAI-PMH-PATCH-CRAWL-2021-12
collection
75
ITEMS
463,039
VIEWS
collection

eye 463,039

PUBMEDCENTRAL-CRAWL-2020-02
collection
108
ITEMS
296,336
VIEWS
by Internet Archive Web Group
collection

eye 296,336

UNPAYWALL-PDF-CRAWL-2018-07
by Internet Archive Web Group
data

eye 21

favorite 0

comment 0

DATASET-CRAWL-2022-01
collection
2
ITEMS
5,412
VIEWS
collection

eye 5,412

DIRECT-OA-CRAWL-2019
collection
2,566
ITEMS
6.1M
VIEWS
by Internet Archive Web Group
collection

eye 6.1M

TARGETED-ARTICLE-CRAWL-2022-04
data

eye 0

favorite 0

comment 0

UNPAYWALL-PDF-CRAWL-2018-07
by Internet Archive Web Group
data

eye 1

favorite 0

comment 0

TARGETED-ARTICLE-CRAWL-2022-07
data

eye 0

favorite 0

comment 0

OA-DOI-CRAWL-2020-02
data

eye 1

favorite 0

comment 0

SCIELO-CRAWL-2020-07
data

eye 2

favorite 0

comment 0

MAG-PDF-CRAWL-2020-07
collection
196
ITEMS
2M
VIEWS
by Internet Archive Web Group
collection

eye 2M

TARGETED-ARTICLE-CRAWL-2022-07
collection
0
ITEMS
85,950
VIEWS
collection

eye 85,950

OMICS-DOI-LANDING-CRAWL-2019-04
collection
4
ITEMS
14,368
VIEWS
by Internet Archive Web Group
collection

eye 14,368

This crawl started in April 2019 as an informal collaboration with Crossref. It covers a smallish number (100k) of DOI redirects and landing pages (plus PDF outlinks, and maybe a couple of other hops) for a single large publisher (OMICS, which has multiple subsidiaries). The intent is to get a reasonably good capture that can be used as canonical preservation copies of the landing pages. A secondary goal is to get decent fulltext capture coverage.
Open Access Journal Test Crawl (2018)
collection
794
ITEMS
12.4M
VIEWS
by Internet Archive Web Group
collection

eye 12.4M

SCIELO-CRAWL-2020-07
collection
41
ITEMS
219,673
VIEWS
by Internet Archive Web Group
collection

eye 219,673

CORE-UPSTREAM-CRAWL-2018-11
collection
741
ITEMS
2.2M
VIEWS
by Internet Archive Web Group
collection

eye 2.2M

Crawl of "upstream" URLs from the CORE (core.ac.uk) metadata dump. Only a partial seedlist of files was crawled.
CiteSeerX URL Crawl 2017
web

eye 9,741

favorite 0

comment 0

Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 07:54:24 PDT 2017 to Wed Jul 5 01:08:02 PDT 2017.
Topic: crawldata
CiteSeerX URL Crawl 2017
web

eye 6,524

favorite 0

comment 0

Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 06:58:20 PDT 2017 to Wed Jul 5 00:11:16 PDT 2017.
Topic: crawldata
CiteSeerX URL Crawl 2017
web

eye 5,827

favorite 0

comment 0

Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 09:03:33 PDT 2017 to Wed Jul 5 02:16:39 PDT 2017.
Topic: crawldata
CiteSeerX URL Crawl 2017
web

eye 7,386

favorite 0

comment 0

Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 10:48:22 PDT 2017 to Wed Jul 5 04:00:37 PDT 2017.
Topic: crawldata
CiteSeerX URL Crawl 2017
web

eye 4,399

favorite 0

comment 0

Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 17:47:56 PDT 2017 to Wed Jul 5 11:02:06 PDT 2017.
Topic: crawldata
CiteSeerX URL Crawl 2017
web

eye 4,736

favorite 0

comment 0

Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 18:19:45 PDT 2017 to Wed Jul 5 11:33:54 PDT 2017.
Topic: crawldata
CiteSeerX URL Crawl 2017
web

eye 6,058

favorite 0

comment 0

Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 18:52:51 PDT 2017 to Wed Jul 5 12:06:48 PDT 2017.
Topic: crawldata
CiteSeerX URL Crawl 2017
web

eye 4,974

favorite 0

comment 0

Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 22:05:23 PDT 2017 to Wed Jul 5 15:42:16 PDT 2017.
Topic: crawldata
CiteSeerX URL Crawl 2017
web

eye 5,109

favorite 0

comment 0

Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 18:41:03 PDT 2017 to Wed Jul 5 11:56:32 PDT 2017.
Topic: crawldata
CiteSeerX URL Crawl 2017
web

eye 6,545

favorite 0

comment 0

Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 13:27:22 PDT 2017 to Wed Jul 5 06:40:32 PDT 2017.
Topic: crawldata
CiteSeerX URL Crawl 2017
web

eye 4,541

favorite 0

comment 0

Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 16:49:18 PDT 2017 to Wed Jul 5 10:04:13 PDT 2017.
Topic: crawldata
CiteSeerX URL Crawl 2017
web

eye 4,254

favorite 0

comment 0

Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 19:41:55 PDT 2017 to Wed Jul 5 12:59:15 PDT 2017.
Topic: crawldata
CiteSeerX URL Crawl 2017
web

eye 7,660

favorite 0

comment 0

Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 06:56:02 PDT 2017 to Thu Jul 6 00:08:40 PDT 2017.
Topic: crawldata
CiteSeerX URL Crawl 2017
web

eye 5,842

favorite 0

comment 0

Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 02:09:00 PDT 2017 to Wed Jul 5 19:24:01 PDT 2017.
Topic: crawldata
CiteSeerX URL Crawl 2017
web

eye 7,424

favorite 0

comment 0

Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 02:31:51 PDT 2017 to Wed Jul 5 19:45:00 PDT 2017.
Topic: crawldata