
313
UPLOADS


UNPAYWALL-PDF-CRAWL-2021-05
data

eye 4

favorite 0

comment 0

DOI-LANDING-CRAWL-2018-06
- Internet Archive Web Group
data

eye 6

favorite 0

comment 0

CORE-UPSTREAM-CRAWL-2018-11
- Internet Archive Web Group
data

eye 6

favorite 0

comment 0

Crawl reports and logs for the CORE-UPSTREAM-CRAWL-2018-11 crawl. See also the 'CORE-UPSTREAM-CRAWL-2018-11-full_crawl_logs' item.
UNPAYWALL-PDF-CRAWL-2018-07
- Internet Archive Web Group
data

eye 2

favorite 0

comment 0

UNPAYWALL-PDF-CRAWL-2019-04
- Internet Archive Web Group
data

eye 0

favorite 0

comment 0

This item contains checksums and file-level metadata for most (if not all) files collected in this crawl. The tab-separated-value (.tsv) file is similar to a CDX file but contains additional hashes.
UNPAYWALL-PDF-CRAWL-2020-03
data

eye 0

favorite 0

comment 0

OA-DOI-CRAWL-2020-02
collection
278
ITEMS
3.2M
VIEWS
- Internet Archive Web Group
collection

eye 3.2M

OAI-PMH-CRAWL-2020-06
collection
2,946
ITEMS
4.7M
VIEWS
- Internet Archive Web Group
collection

eye 4.7M

Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 06:48:01 PDT 2017 to Wed Jul 5 00:02:08 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 09:34:21 PDT 2017 to Wed Jul 5 02:48:47 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 16:39:00 PDT 2017 to Wed Jul 5 09:49:18 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 10:58:23 PDT 2017 to Wed Jul 5 04:10:54 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 10:37:22 PDT 2017 to Wed Jul 5 03:49:16 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 15:40:25 PDT 2017 to Wed Jul 5 08:53:07 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 04:35:15 PDT 2017 to Wed Jul 5 21:47:15 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 19:54:02 PDT 2017 to Wed Jul 5 13:11:56 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 20:08:34 PDT 2017 to Wed Jul 5 13:25:24 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 02:51:41 PDT 2017 to Wed Jul 5 20:04:28 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 21:11:06 PDT 2017 to Wed Jul 5 14:42:47 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 05:17:44 PDT 2017 to Wed Jul 5 22:29:44 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 01:48:00 PDT 2017 to Wed Jul 5 19:01:22 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 20:55:20 PDT 2017 to Wed Jul 5 14:21:01 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 04:15:17 PDT 2017 to Wed Jul 5 21:26:47 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 06:46:05 PDT 2017 to Wed Jul 5 23:58:46 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 05:46:51 PDT 2017 to Wed Jul 5 22:58:17 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 12:31:31 PDT 2017 to Thu Jul 6 05:45:27 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 10:09:11 PDT 2017 to Thu Jul 6 03:22:04 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 07:51:37 PDT 2017 to Thu Jul 6 01:02:46 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 10:29:25 PDT 2017 to Thu Jul 6 03:42:06 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 09:29:23 PDT 2017 to Thu Jul 6 02:41:56 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 20:33:32 PDT 2017 to Thu Jul 6 14:21:11 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 14:06:44 PDT 2017 to Thu Jul 6 07:21:33 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Fri Jul 7 04:25:04 PDT 2017 to Thu Jul 6 23:46:26 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 14:18:04 PDT 2017 to Thu Jul 6 07:32:15 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 15:04:04 PDT 2017 to Thu Jul 6 17:45:16 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 13:55:52 PDT 2017 to Thu Jul 6 07:09:25 PDT 2017.
Topic: crawldata
ARXIV-PUBMEDCENTRAL-CRAWL-2020-04
collection
60
ITEMS
101,172
VIEWS
- Internet Archive Web Group
collection

eye 101,172

Custom Crawl Services
- Internet Archive Web Group
data

eye 0

favorite 0

comment 0

This item contains a copy of log files found on the Internet Archive (Web Group) machine `wbgrp-svc263.us.archive.org` on 2018-05-29, under the `/3` directory. These are logs of file transfer status between various crawler machines; they are not known to contain any sensitive metadata (e.g., personal information, IPs, or other security-sensitive information), but are being kept `access-restricted` anyway. This data is almost certainly unimportant and could be deleted; it is being preserved out...
UNPAYWALL-PDF-CRAWL-2019-04
- Internet Archive Web Group
data

eye 2

favorite 0

comment 0

OAI-PMH-CRAWL-2020-06
- Internet Archive Web Group
data

eye 2

favorite 0

comment 0

UNPAYWALL-PDF-CRAWL-2019-04
collection
641
ITEMS
5.2M
VIEWS
- Internet Archive Web Group
collection

eye 5.2M

UNPAYWALL-PDF-CRAWL-2020-03
collection
344
ITEMS
1.7M
VIEWS
- Internet Archive Web Group
collection

eye 1.7M

DOAJ-CRAWL-2020-11
collection
102
ITEMS
854,580
VIEWS
- Internet Archive Web Group
collection

eye 854,580

SEMSCHOLAR-DIRECT-PDF-CRAWL-2020-02
collection
1,011
ITEMS
1.4M
VIEWS
- Internet Archive Web Group
collection

eye 1.4M

-
collection

eye 1.9M

IA crawl of PDF URLs provided by Semantic Scholar.
Topic: pdf
DOI-LANDING-CRAWL-2018-06
collection
279
ITEMS
3.2M
VIEWS
- Internet Archive Web Group
collection

eye 3.2M

DOI-CRAWL-2022-02
collection
25
ITEMS
166,413
VIEWS
-
collection

eye 166,413

Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 07:54:24 PDT 2017 to Wed Jul 5 01:08:02 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 06:58:20 PDT 2017 to Wed Jul 5 00:11:16 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 09:03:33 PDT 2017 to Wed Jul 5 02:16:39 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 17:47:56 PDT 2017 to Wed Jul 5 11:02:06 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 13:27:22 PDT 2017 to Wed Jul 5 06:40:32 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 16:49:18 PDT 2017 to Wed Jul 5 10:04:13 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 10:48:22 PDT 2017 to Wed Jul 5 04:00:37 PDT 2017.
Topic: crawldata
CORE-UPSTREAM-CRAWL-2018-11
collection
741
ITEMS
1.5M
VIEWS
- Internet Archive Web Group
collection

eye 1.5M

Crawl of "upstream" URLs from a CORE (core.ac.uk) metadata dump. Only a partial seedlist of files was crawled.
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 18:19:45 PDT 2017 to Wed Jul 5 11:33:54 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 18:52:51 PDT 2017 to Wed Jul 5 12:06:48 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 22:05:23 PDT 2017 to Wed Jul 5 15:42:16 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 02:31:51 PDT 2017 to Wed Jul 5 19:45:00 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 18:41:03 PDT 2017 to Wed Jul 5 11:56:32 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 02:09:00 PDT 2017 to Wed Jul 5 19:24:01 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 19:41:55 PDT 2017 to Wed Jul 5 12:59:15 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 11:19:38 PDT 2017 to Thu Jul 6 04:33:05 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 06:56:02 PDT 2017 to Thu Jul 6 00:08:40 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 12:53:52 PDT 2017 to Thu Jul 6 06:07:14 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 08:35:43 PDT 2017 to Thu Jul 6 01:46:47 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 11:31:07 PDT 2017 to Thu Jul 6 08:54:41 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 15:07:36 PDT 2017 to Thu Jul 6 08:28:04 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 13:45:57 PDT 2017 to Thu Jul 6 07:00:09 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Fri Jul 7 06:46:26 PDT 2017 to Fri Jul 14 15:21:22 PDT 2017.
Topic: crawldata
JOURNALS-PATCH-CRAWL-2022-01
collection
104
ITEMS
590,763
VIEWS
-
collection

eye 590,763

arXiv Content Crawl (2019-10)
collection
37
ITEMS
64,150
VIEWS
- Internet Archive Web Group
collection

eye 64,150

DIRECT-OA-CRAWL-2019
- Internet Archive Web Group
data

eye 5

favorite 0

comment 0

UNPAYWALL-PDF-CRAWL-2020-11
- Internet Archive Web Group
data

eye 0

favorite 0

comment 0

arXiv Content Crawl (2019-10)
data

eye 3

favorite 0

comment 0

DOAJ-CRAWL-2020-11
data

eye 2

favorite 0

comment 0

UNPAYWALL-PDF-CRAWL-2018-07
- Internet Archive Web Group
data

eye 12

favorite 0

comment 0

OA-DOI-CRAWL-2020-02
data

eye 1

favorite 0

comment 0

MAG-PDF-CRAWL-2020-07
- Internet Archive Web Group
data

eye 0

favorite 0

comment 0

MSAG-PDF-CRAWL-2017
data

eye 10

favorite 0

comment 0

This item contains checksums and file-level metadata for most (if not all) files collected in this crawl. The tab-separated-value (.tsv) file is similar to a CDX file but contains additional hashes.
OA-JOURNAL-CRAWL-2019-08
collection
201
ITEMS
2.7M
VIEWS
- Internet Archive Web Group
collection

eye 2.7M

DATACITE-DOI-CRAWL-2020-01
collection
1,417
ITEMS
3.6M
VIEWS
- Internet Archive Web Group
collection

eye 3.6M

UNPAYWALL-PDF-CRAWL-2018-07
collection
1,241
ITEMS
14.3M
VIEWS
- Internet Archive Web Group
collection

eye 14.3M

Web archive data from a crawl of open access PDF URLs provided by Unpaywall.
UNPAYWALL-PDF-CRAWL-2022-04
collection
38
ITEMS
11,972
VIEWS
-
collection

eye 11,972

TARGETED-ARTICLE-CRAWL-2022-04
collection
219
ITEMS
201,633
VIEWS
-
collection

eye 201,633

PubMed Central Crawl (2019-10)
collection
216
ITEMS
400,669
VIEWS
- Internet Archive Web Group
collection

eye 400,669

Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 07:26:10 PDT 2017 to Wed Jul 5 00:38:51 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 14:27:18 PDT 2017 to Wed Jul 5 07:42:31 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 13:16:31 PDT 2017 to Wed Jul 5 06:29:05 PDT 2017.
Topic: crawldata
CiteSeerX URL Crawl 2017
web

eye 11,110

favorite 0

comment 0

Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 12:57:06 PDT 2017 to Wed Jul 5 06:10:16 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 14:37:19 PDT 2017 to Wed Jul 5 07:53:37 PDT 2017.
Topic: crawldata