313 uploads


DOI-LANDING-CRAWL-2018-06
- Internet Archive Web Group
data

9 views

This item contains output files related to the DOI-LANDING-CRAWL-2018-06 crawl of Crossref DOI redirect landing pages:
- list of Crossref DOI numbers attempted
- an index of DOI, URL, and final HTTP status codes
Open Access Journal Test Crawl (2018)
- Internet Archive Web Group
data

8 views

OA-DOI-CRAWL-2020-12
data

0 views

Metadata about COVID-19 papers downloaded from: http://en.gzbd.cnki.net/GZBT/brief/Default.aspx
DIRECT-OA-CRAWL-2019
- Internet Archive Web Group
data

4 views

PUBMEDCENTRAL-CRAWL-2020-02
data

0 views

CORE-UPSTREAM-CRAWL-2018-11
- Internet Archive Web Group
data

3 views

"Full" crawl logs (for every hit) from CORE-UPSTREAM-CRAWL-2018-11 crawl. See also 'CORE-UPSTREAM-CRAWL-2018-11-CRL' item for reports etc.
UNPAYWALL-PDF-CRAWL-2021-07
collection
174 items
926,132 views

OAI-PMH-PATCH-CRAWL-2021-12
data

1 view

Internet Archive Research Publication Crawls
data

4 views

Metadata and some fulltext PDFs from Wanfang Data, downloaded 2020-03-29 from http://subject.med.wanfangdata.com.cn/Channel/7
DATASET-CRAWL-2022-01
data

2 views

CiteSeerX URL Crawl 2017
collection
207 items
1.1M views

A targeted crawl to fetch research publications from the public web which have been crawled by CiteSeerX but have not previously been crawled by the Internet Archive.
Topics: scholarly, papers, journal
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 08:23:44 PDT 2017 to Wed Jul 5 01:37:05 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 07:17:51 PDT 2017 to Wed Jul 5 00:28:29 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 08:43:10 PDT 2017 to Wed Jul 5 01:56:51 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 05:56:50 PDT 2017 to Tue Jul 4 23:09:37 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 13:57:08 PDT 2017 to Wed Jul 5 07:09:41 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 13:46:59 PDT 2017 to Wed Jul 5 06:59:12 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 11:27:04 PDT 2017 to Wed Jul 5 04:42:02 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 12:45:37 PDT 2017 to Wed Jul 5 05:59:07 PDT 2017.
Topic: crawldata
CiteSeerX URL Crawl 2017
web

11,062 views

Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 13:06:40 PDT 2017 to Wed Jul 5 06:20:59 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 19:28:55 PDT 2017 to Wed Jul 5 12:44:56 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 00:41:00 PDT 2017 to Wed Jul 5 18:35:58 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 20:21:08 PDT 2017 to Wed Jul 5 13:39:32 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 20:36:43 PDT 2017 to Wed Jul 5 14:00:40 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 03:39:30 PDT 2017 to Wed Jul 5 22:34:42 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 03:11:26 PDT 2017 to Wed Jul 5 20:24:18 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 05:06:06 PDT 2017 to Wed Jul 5 22:19:48 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 08:00:34 PDT 2017 to Thu Jul 6 01:11:41 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 10:59:03 PDT 2017 to Thu Jul 6 04:11:38 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 13:04:32 PDT 2017 to Thu Jul 6 06:17:57 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 09:19:33 PDT 2017 to Thu Jul 6 02:30:40 PDT 2017.
Topic: crawldata
CiteSeerX URL Crawl 2017
web

10,157 views

Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 08:45:15 PDT 2017 to Thu Jul 6 01:55:13 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 08:18:25 PDT 2017 to Thu Jul 6 01:29:26 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Fri Jul 7 02:29:51 PDT 2017 to Thu Jul 6 20:40:52 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 13:25:18 PDT 2017 to Thu Jul 6 06:39:00 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Fri Jul 7 01:31:00 PDT 2017 to Thu Jul 6 19:44:19 PDT 2017.
Topic: crawldata
OA-JOURNAL-CRAWL-2020-07
- Internet Archive Web Group
collection
1,923 items
9.6M views

PLATFORM-CRAWL-2020
- Internet Archive Web Group
collection
649 items
393,977 views

DATASET-CRAWL-2022-01
data

4 views

MAG-PDF-CRAWL-2021-08
collection
189 items
678,979 views

UNPAYWALL-PDF-CRAWL-2018-07
- Internet Archive Web Group
data

1 view

See also the crawl logs item for this crawl.
MAG-PDF-CRAWL-2020-07
- Internet Archive Web Group
data

0 views

DOI-CRAWL-2022-02
data

0 views

OA-JOURNAL-CRAWL-2020-07
- Internet Archive Web Group
data

2 views

OMICS-DOI-LANDING-CRAWL-2019-04
- Internet Archive Web Group
data

4 views

SEMSCHOLAR-DIRECT-PDF-CRAWL-2020-02
- Internet Archive Web Group
data

1 view

Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 16:02:30 PDT 2017 to Wed Jul 5 09:16:52 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 15:29:50 PDT 2017 to Wed Jul 5 08:40:44 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 17:01:13 PDT 2017 to Wed Jul 5 10:15:10 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 16:13:58 PDT 2017 to Wed Jul 5 09:29:32 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 09:23:13 PDT 2017 to Wed Jul 5 02:37:37 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 14:59:25 PDT 2017 to Wed Jul 5 08:13:10 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 15:19:45 PDT 2017 to Wed Jul 5 08:32:51 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 22:33:37 PDT 2017 to Wed Jul 5 16:22:38 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 19:05:46 PDT 2017 to Wed Jul 5 12:18:30 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 23:14:09 PDT 2017 to Wed Jul 5 17:01:18 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 02:41:55 PDT 2017 to Wed Jul 5 19:54:59 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 03:28:50 PDT 2017 to Wed Jul 5 20:41:42 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 02:26:07 PDT 2017 to Wed Jul 5 20:39:19 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 01:56:59 PDT 2017 to Wed Jul 5 19:14:04 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 06:15:53 PDT 2017 to Wed Jul 5 23:27:06 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 05:37:05 PDT 2017 to Wed Jul 5 22:50:05 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 11:29:12 PDT 2017 to Thu Jul 6 04:42:47 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 09:09:06 PDT 2017 to Thu Jul 6 02:22:13 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 06:04:47 PDT 2017 to Wed Jul 5 23:17:34 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 11:49:51 PDT 2017 to Thu Jul 6 05:02:19 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 07:14:32 PDT 2017 to Thu Jul 6 00:28:30 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Fri Jul 7 00:14:00 PDT 2017 to Thu Jul 6 18:31:17 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 16:43:44 PDT 2017 to Thu Jul 6 10:08:52 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 17:09:19 PDT 2017 to Thu Jul 6 10:32:42 PDT 2017.
Topic: crawldata
JOURNAL-HOMEPAGE-CRAWL-2022-03
collection
44 items
224,598 views

OA-DOI-CRAWL-2020-12
- Internet Archive Web Group
collection
191 items
1.4M views

UNPAYWALL-PDF-CRAWL-2020-05
- Internet Archive Web Group
collection
282 items
1.6M views

UNPAYWALL-PDF-CRAWL-2021-05
- Internet Archive Web Group
collection
123 items
852,268 views

Wide Web Targeted PDF Crawling (2017)
- Internet Archive Web Group
collection
922 items
3M views

MAG-PDF-CRAWL-2020-03
- Internet Archive Web Group
collection
489 items
3.7M views

CiteSeerX URL Crawl 2017
data

4 views

Configuration, Reports, and Logs for CITESEERX-CRAWL-2017 crawl.
SCIELO-CRAWL-2020-07
data

0 views

OAI-PMH-CRAWL-2020-06
- Internet Archive Web Group
data

0 views

DOI-CRAWL-2022-02
data

0 views

This item contains checksums and file-level metadata for most (if not all) files collected in this crawl. The tab-separated-value (.tsv) file is similar to a CDX file but contains additional hashes.
UNPAYWALL-PDF-CRAWL-2018-07
- Internet Archive Web Group
data

14 views

DATASET-CRAWL-2022-01
collection
2 items
4,431 views

DIRECT-OA-CRAWL-2019
- Internet Archive Web Group
collection
2,566 items
5.1M views

Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 06:16:47 PDT 2017 to Tue Jul 4 23:30:24 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 06:07:30 PDT 2017 to Tue Jul 4 23:19:25 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 08:12:55 PDT 2017 to Wed Jul 5 01:26:34 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 17:35:44 PDT 2017 to Wed Jul 5 10:49:38 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 14:17:16 PDT 2017 to Wed Jul 5 07:33:59 PDT 2017.
Topic: crawldata
CiteSeerX URL Crawl 2017
web

11,633 views

Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 12:27:34 PDT 2017 to Wed Jul 5 05:39:37 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 14:06:22 PDT 2017 to Wed Jul 5 07:19:59 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 12:36:41 PDT 2017 to Wed Jul 5 05:48:54 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 17:58:07 PDT 2017 to Wed Jul 5 11:10:57 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 11:38:04 PDT 2017 to Wed Jul 5 04:54:20 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 03:19:51 PDT 2017 to Wed Jul 5 20:32:06 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 04:24:00 PDT 2017 to Wed Jul 5 21:41:41 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 01:37:24 PDT 2017 to Wed Jul 5 19:42:27 PDT 2017.
Topic: crawldata