
7,086
UPLOADS


Title
Date Archived
Creator
Custom Crawl Services
by Internet Archive Web Group
data

eye 0

favorite 0

comment 0

This item contains a copy of log files found on the Internet Archive (Web Group) machine `wbgrp-svc263.us.archive.org` on 2018-05-29, under the `/3` directory. These are logs of file transfer status between various crawler machines; they are not known to contain any sensitive metadata (e.g., personal information, IPs, or other security-sensitive information), but are being kept `access-restricted` anyway. This data is almost certainly unimportant and could be deleted; it is being preserved out...
Internet Archive Research Publication Crawls
by Wanfang Data
data

eye 4

favorite 0

comment 0

Metadata and some fulltext PDFs from Wanfang Data, downloaded 2020-03-29 from http://subject.med.wanfangdata.com.cn/Channel/7
Internet Archive Research Publication Crawls
by CNKI
data

eye 0

favorite 0

comment 0

Metadata about COVID-19 papers downloaded from: http://en.gzbd.cnki.net/GZBT/brief/Default.aspx
Internet Archive Research Publication Crawls
by Wanfang Data
data

eye 6

favorite 0

comment 0

Metadata and some fulltext PDFs from Wanfang Data, downloaded 2020-03-29 from http://subject.med.wanfangdata.com.cn/Channel/7
MAG-PDF-CRAWL-2020-07
by Internet Archive Web Group
data

eye 0

favorite 0

comment 0

SEMSCHOLAR-DIRECT-PDF-CRAWL-2020-02
by Internet Archive Web Group
data

eye 1

favorite 0

comment 0

PubMed Central Crawl (2019-10)
data

eye 3

favorite 0

comment 0

UNPAYWALL-PDF-CRAWL-2018-07
by Internet Archive Web Group
data

eye 1

favorite 0

comment 0

See also the crawl logs item for this crawl.
OA-JOURNAL-CRAWL-2020-07
by Internet Archive Web Group
data

eye 2

favorite 0

comment 0

OMICS-DOI-LANDING-CRAWL-2019-04
by Internet Archive Web Group
data

eye 4

favorite 0

comment 0

DOI-CRAWL-2022-02
data

eye 0

favorite 0

comment 0

Wide Web Targeted PDF Crawling (2017)
collection
922
ITEMS
3M
VIEWS
by Internet Archive Web Group
collection

eye 3M

UNPAYWALL-PDF-CRAWL-2020-05
collection
282
ITEMS
1.6M
VIEWS
by Internet Archive Web Group
collection

eye 1.6M

JOURNAL-HOMEPAGE-CRAWL-2022-03
collection
44
ITEMS
230,418
VIEWS
collection

eye 230,418

JOURNAL-HOMEPAGE-CRAWL-2022-03
data

eye 1

favorite 0

comment 0

MAG-PDF-CRAWL-2020-03
collection
489
ITEMS
3.7M
VIEWS
by Internet Archive Web Group
collection

eye 3.7M

OA-DOI-CRAWL-2020-12
collection
191
ITEMS
1.4M
VIEWS
by Internet Archive Web Group
collection

eye 1.4M

Open Access Journal Test Crawl (2018)
by Internet Archive Web Group
data

eye 8

favorite 0

comment 0

DOI-LANDING-CRAWL-2018-06
by Internet Archive Web Group
data

eye 9

favorite 0

comment 0

This item contains output files related to the DOI-LANDING-CRAWL-2018-06 crawl of Crossref DOI redirect landing pages:
- list of Crossref DOI numbers attempted
- an index of DOI, URL, and final HTTP status codes
CORE-UPSTREAM-CRAWL-2018-11
by Internet Archive Web Group
data

eye 3

favorite 0

comment 0

"Full" crawl logs (for every hit) from CORE-UPSTREAM-CRAWL-2018-11 crawl. See also 'CORE-UPSTREAM-CRAWL-2018-11-CRL' item for reports etc.
DIRECT-OA-CRAWL-2019
by Internet Archive Web Group
data

eye 4

favorite 0

comment 0

PUBMEDCENTRAL-CRAWL-2020-02
data

eye 0

favorite 0

comment 0

OA-DOI-CRAWL-2020-12
data

eye 0

favorite 0

comment 0

DATASET-CRAWL-2022-01
data

eye 5

favorite 0

comment 0

OAI-PMH-PATCH-CRAWL-2021-12
data

eye 1

favorite 0

comment 0

DATASET-CRAWL-2022-01
data

eye 2

favorite 0

comment 0

MAG-PDF-CRAWL-2021-08
collection
189
ITEMS
695,718
VIEWS
collection

eye 695,718

OA-JOURNAL-CRAWL-2020-07
collection
1,923
ITEMS
9.7M
VIEWS
by Internet Archive Web Group
collection

eye 9.7M

UNPAYWALL-PDF-CRAWL-2021-07
collection
174
ITEMS
943,493
VIEWS
collection

eye 943,493

DIRECT-OA-CRAWL-2019
by Internet Archive Web Group
data

eye 5

favorite 0

comment 0

MAG-PDF-CRAWL-2020-07
by Internet Archive Web Group
data

eye 0

favorite 0

comment 0

UNPAYWALL-PDF-CRAWL-2020-11
by Internet Archive Web Group
data

eye 0

favorite 0

comment 0

ARXIV-PUBMEDCENTRAL-CRAWL-2020-04
data

eye 1

favorite 0

comment 0

DOAJ-CRAWL-2020-11
data

eye 2

favorite 0

comment 0

OA-DOI-CRAWL-2020-02
data

eye 1

favorite 0

comment 0

arXiv Content Crawl (2019-10)
data

eye 3

favorite 0

comment 0

UNPAYWALL-PDF-CRAWL-2018-07
by Internet Archive Web Group
data

eye 12

favorite 0

comment 0

PubMed Central Crawl (2019-10)
collection
216
ITEMS
409,763
VIEWS
by Internet Archive Web Group
collection

eye 409,763

OA-JOURNAL-CRAWL-2019-08
collection
201
ITEMS
2.7M
VIEWS
by Internet Archive Web Group
collection

eye 2.7M

TARGETED-ARTICLE-CRAWL-2022-04
collection
219
ITEMS
226,602
VIEWS
collection

eye 226,602

TARGETED-ARTICLE-CRAWL-2022-03
data

eye 0

favorite 0

comment 0

TARGETED-ARTICLE-CRAWL-2022-03
data

eye 1

favorite 0

comment 0

UNPAYWALL-PDF-CRAWL-2018-07
by Internet Archive Web Group
data

eye 1

favorite 0

comment 0

OA-DOI-CRAWL-2020-02
data

eye 1

favorite 0

comment 0

TARGETED-ARTICLE-CRAWL-2022-04
data

eye 0

favorite 0

comment 0

SCIELO-CRAWL-2020-07
data

eye 2

favorite 0

comment 0

OMICS-DOI-LANDING-CRAWL-2019-04
collection
4
ITEMS
13,855
VIEWS
by Internet Archive Web Group
collection

eye 13,855

This crawl started in April 2019 as an informal collaboration with Crossref. It covers a smallish number (100k) of DOI redirects and landing pages (plus PDF outlinks, and maybe a couple of other hops) for a single large publisher (OMICS, which has multiple subsidiaries). The intent is to get reasonably good captures that can be used as canonical preservation copies of the landing pages. A secondary goal is to get decent fulltext capture coverage.
SCIELO-CRAWL-2020-07
collection
41
ITEMS
190,307
VIEWS
by Internet Archive Web Group
collection

eye 190,307

MAG-PDF-CRAWL-2020-07
collection
196
ITEMS
1.6M
VIEWS
by Internet Archive Web Group
collection

eye 1.6M

Open Access Journal Test Crawl (2018)
collection
794
ITEMS
10.8M
VIEWS
by Internet Archive Web Group
collection

eye 10.8M

UNPAYWALL-PDF-CRAWL-2020-11
by Internet Archive Web Group
data

eye 0

favorite 0

comment 0

PUBMEDCENTRAL-CRAWL-2020-02
data

eye 1

favorite 0

comment 0

OMICS-DOI-LANDING-CRAWL-2019-04
by Internet Archive Web Group
data

eye 0

favorite 0

comment 0

UNPAYWALL-PDF-CRAWL-2020-05
data

eye 1

favorite 0

comment 0

OA-JOURNAL-CRAWL-2020-07
by Internet Archive Web Group
data

eye 1

favorite 0

comment 0

DOI-LANDING-CRAWL-2018-06
by Internet Archive Web Group
data

eye 6

favorite 0

comment 0

OAI-PMH-PATCH-CRAWL-2021-12
data

eye 0

favorite 0

comment 0

arXiv Content Crawl (2019-10)
data

eye 3

favorite 0

comment 0

TARGETED-ARTICLE-CRAWL-2022-03
collection
9
ITEMS
46,809
VIEWS
collection

eye 46,809

by Internet Archive Web Group
collection

eye 6,505

This collection contains web crawl data for a random selection of 500k (0.5 million) Crossref DOI redirects, including the doi.org redirect requests. The intent of this crawl is to gather loose statistics on the number of failing redirects and the number of host websites that block automated crawling, and to build a corpus of HTML landing pages for metadata extraction (e.g., "signposting" HTTP headers, linked data HTML metadata, semantic markup). Total size of (uncompressed) WARC data is 50 GB,...
UNPAYWALL-PDF-CRAWL-2020-11
collection
199
ITEMS
1.6M
VIEWS
by Internet Archive Web Group
collection

eye 1.6M

UNPAYWALL-PDF-CRAWL-2020-05
data

eye 1

favorite 0

comment 0

CORE-UPSTREAM-CRAWL-2018-11
by Internet Archive Web Group
data

eye 6

favorite 0

comment 0

Crawl reports and logs for CORE-UPSTREAM-CRAWL-2018-11 crawl. See also 'CORE-UPSTREAM-CRAWL-2018-11-full_crawl_logs' item.
UNPAYWALL-PDF-CRAWL-2020-03
data

eye 0

favorite 0

comment 0

UNPAYWALL-PDF-CRAWL-2019-04
by Internet Archive Web Group
data

eye 0

favorite 0

comment 0

PubMed Central Crawl (2019-10)
data

eye 3

favorite 0

comment 0

DOI-LANDING-CRAWL-2018-06
by Internet Archive Web Group
data

eye 6

favorite 0

comment 0

UNPAYWALL-PDF-CRAWL-2018-07
by Internet Archive Web Group
data

eye 2

favorite 0

comment 0

JOURNAL-HOMEPAGE-CRAWL-2022-03
data

eye 3

favorite 0

comment 0

TARGETED-ARTICLE-CRAWL-2022-04
data

eye 1

favorite 0

comment 0

OA-DOI-CRAWL-2020-02
collection
278
ITEMS
3.2M
VIEWS
by Internet Archive Web Group
collection

eye 3.2M

OAI-PMH-CRAWL-2020-06
collection
2,946
ITEMS
4.9M
VIEWS
by Internet Archive Web Group
collection

eye 4.9M

Journals
data

eye 21

favorite 0

comment 0

This is a September 2017 snapshot of the CEUR-WS (https://ceur-ws.org) scholarly conference proceedings repository, as mirrored from the FTP archive.
SCIELO-CRAWL-2020-07
data

eye 0

favorite 0

comment 0

OAI-PMH-CRAWL-2020-06
by Internet Archive Web Group
data

eye 0

favorite 0

comment 0

UNPAYWALL-PDF-CRAWL-2018-07
by Internet Archive Web Group
data

eye 14

favorite 0

comment 0

DOI-CRAWL-2022-02
data

eye 0

favorite 0

comment 0

DATASET-CRAWL-2022-01
DATASET-CRAWL-2022-01
collection
2
ITEMS
4,486
VIEWS
collection

eye 4,486

DIRECT-OA-CRAWL-2019
DIRECT-OA-CRAWL-2019
collection
2,566
ITEMS
5.2M
VIEWS
by Internet Archive Web Group
collection

eye 5.2M

PUBMEDCENTRAL-CRAWL-2020-02
collection
108
ITEMS
237,656
VIEWS
by Internet Archive Web Group
collection

eye 237,656

OAI-PMH-PATCH-CRAWL-2021-12
collection
75
ITEMS
301,349
VIEWS
collection

eye 301,349

CiteSeerX URL Crawl 2017
web

eye 3,862

favorite 0

comment 0

Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 16:02:30 PDT 2017 to Wed Jul 5 09:16:52 PDT 2017.
Topic: crawldata
CiteSeerX URL Crawl 2017
web

eye 5,143

favorite 0

comment 0

Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 15:29:50 PDT 2017 to Wed Jul 5 08:40:44 PDT 2017.
Topic: crawldata
CiteSeerX URL Crawl 2017
web

eye 4,551

favorite 0

comment 0

Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 17:01:13 PDT 2017 to Wed Jul 5 10:15:10 PDT 2017.
Topic: crawldata
CiteSeerX URL Crawl 2017
web

eye 4,755

favorite 0

comment 0

Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 16:13:58 PDT 2017 to Wed Jul 5 09:29:32 PDT 2017.
Topic: crawldata
CiteSeerX URL Crawl 2017
web

eye 4,408

favorite 0

comment 0

Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 09:23:13 PDT 2017 to Wed Jul 5 02:37:37 PDT 2017.
Topic: crawldata
CiteSeerX URL Crawl 2017
web

eye 6,638

favorite 0

comment 0

Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 14:59:25 PDT 2017 to Wed Jul 5 08:13:10 PDT 2017.
Topic: crawldata
CiteSeerX URL Crawl 2017
web

eye 4,821

favorite 0

comment 0

Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 15:19:45 PDT 2017 to Wed Jul 5 08:32:51 PDT 2017.
Topic: crawldata
CiteSeerX URL Crawl 2017
web

eye 4,776

favorite 0

comment 0

Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 22:33:37 PDT 2017 to Wed Jul 5 16:22:38 PDT 2017.
Topic: crawldata
CiteSeerX URL Crawl 2017
web

eye 4,315

favorite 0

comment 0

Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 19:05:46 PDT 2017 to Wed Jul 5 12:18:30 PDT 2017.
Topic: crawldata
CiteSeerX URL Crawl 2017
web

eye 4,698

favorite 0

comment 0

Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 23:14:09 PDT 2017 to Wed Jul 5 17:01:18 PDT 2017.
Topic: crawldata
CiteSeerX URL Crawl 2017
web

eye 5,048

favorite 0

comment 0

Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 02:41:55 PDT 2017 to Wed Jul 5 19:54:59 PDT 2017.
Topic: crawldata
CiteSeerX URL Crawl 2017
web

eye 6,575

favorite 0

comment 0

Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 03:28:50 PDT 2017 to Wed Jul 5 20:41:42 PDT 2017.
Topic: crawldata
CiteSeerX URL Crawl 2017
web

eye 4,542

favorite 0

comment 0

Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 02:26:07 PDT 2017 to Wed Jul 5 20:39:19 PDT 2017.
Topic: crawldata
CiteSeerX URL Crawl 2017
web

eye 5,950

favorite 0

comment 0

Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 01:56:59 PDT 2017 to Wed Jul 5 19:14:04 PDT 2017.
Topic: crawldata
CiteSeerX URL Crawl 2017
web

eye 4,597

favorite 0

comment 0

Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 06:15:53 PDT 2017 to Wed Jul 5 23:27:06 PDT 2017.
Topic: crawldata
CiteSeerX URL Crawl 2017
web

eye 5,714

favorite 0

comment 0

Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 05:37:05 PDT 2017 to Wed Jul 5 22:50:05 PDT 2017.
Topic: crawldata
CiteSeerX URL Crawl 2017
web

eye 5,253

favorite 0

comment 0

Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 11:29:12 PDT 2017 to Thu Jul 6 04:42:47 PDT 2017.
Topic: crawldata
CiteSeerX URL Crawl 2017
web

eye 5,960

favorite 0

comment 0

Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 09:09:06 PDT 2017 to Thu Jul 6 02:22:13 PDT 2017.
Topic: crawldata