Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 05:01:45 PDT 2017 to Tue Jul 4 22:50:03 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 07:35:29 PDT 2017 to Wed Jul 5 00:48:05 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 05:45:37 PDT 2017 to Tue Jul 4 23:00:36 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 07:07:49 PDT 2017 to Wed Jul 5 00:18:34 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 07:44:33 PDT 2017 to Wed Jul 5 00:57:33 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 06:37:37 PDT 2017 to Tue Jul 4 23:54:39 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 06:27:07 PDT 2017 to Tue Jul 4 23:40:46 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 08:03:45 PDT 2017 to Wed Jul 5 01:16:31 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 09:55:35 PDT 2017 to Wed Jul 5 03:08:11 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 08:33:05 PDT 2017 to Wed Jul 5 01:46:32 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 10:26:12 PDT 2017 to Wed Jul 5 03:41:41 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 12:17:48 PDT 2017 to Wed Jul 5 05:29:23 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 18:09:15 PDT 2017 to Wed Jul 5 11:23:47 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 11:48:17 PDT 2017 to Wed Jul 5 05:01:28 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 11:59:15 PDT 2017 to Wed Jul 5 05:11:17 PDT 2017.
Topic: crawldata
CiteSeerX URL Crawl 2017
web: 10,878 views
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 12:08:27 PDT 2017 to Wed Jul 5 05:22:22 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 02:20:32 PDT 2017 to Wed Jul 5 19:33:31 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 09:48:37 PDT 2017 to Thu Jul 6 03:05:10 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 11:58:36 PDT 2017 to Thu Jul 6 05:13:01 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 01:36:31 PDT 2017 to Wed Jul 5 18:48:11 PDT 2017.
Topic: crawldata
CiteSeerX URL Crawl 2017
web: 11,059 views
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 08:53:13 PDT 2017 to Thu Jul 6 02:05:47 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 00:01:18 PDT 2017 to Wed Jul 5 17:59:10 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 06:57:33 PDT 2017 to Thu Jul 6 02:08:23 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 07:32:04 PDT 2017 to Thu Jul 6 00:46:06 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 03:01:58 PDT 2017 to Wed Jul 5 20:14:53 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 05:56:39 PDT 2017 to Wed Jul 5 23:08:54 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 11:08:03 PDT 2017 to Thu Jul 6 04:22:39 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 14:52:50 PDT 2017 to Thu Jul 6 08:15:06 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 09:38:29 PDT 2017 to Thu Jul 6 02:51:41 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 12:19:36 PDT 2017 to Thu Jul 6 05:34:14 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 07:42:06 PDT 2017 to Thu Jul 6 00:54:28 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 22:56:35 PDT 2017 to Thu Jul 6 17:34:26 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Fri Jul 7 03:24:06 PDT 2017 to Thu Jul 6 21:44:17 PDT 2017.
Topic: crawldata
MSAG-PDF-CRAWL-2017
collection: 1,855 items, 14M views
- Internet Archive Web Group
Microsoft Academic Graph public corpus (Feb 2016) PDF URLs, filtered to remove large sites (pubmed, citeseerx, arxiv) and already-crawled URLs.
Topics: papers, journals
Open Access Journal Test Crawl (2018)
collection: 794 items, 12.3M views
- Internet Archive Web Group
OMICS-DOI-LANDING-CRAWL-2019-04
collection: 4 items, 14,313 views
- Internet Archive Web Group
This crawl started in April 2019 as an informal collaboration with Crossref. It covers a smallish number (100k) of DOI redirects and landing pages (plus PDF outlinks, and maybe a couple of other hops) for a single large publisher (OMICS, which has multiple subsidiaries). The intent is to get reasonably good captures that can be used as canonical preservation copies of the landing pages. A secondary goal is to get decent fulltext capture coverage.
MAG-PDF-CRAWL-2020-07
collection: 196 items, 2M views
- Internet Archive Web Group
SCIELO-CRAWL-2020-07
collection: 41 items, 218,859 views
- Internet Archive Web Group
UNPAYWALL-PDF-CRAWL-2018-07
collection: 1,241 items, 16.7M views
- Internet Archive Web Group
Web archive data from a crawl of open access PDF URLs provided by Unpaywall.
OA-JOURNAL-CRAWL-2019-08
collection: 201 items, 3M views
- Internet Archive Web Group
DATACITE-DOI-CRAWL-2020-01
collection: 1,417 items, 4.3M views
- Internet Archive Web Group
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 14:27:18 PDT 2017 to Wed Jul 5 07:42:31 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 16:23:41 PDT 2017 to Thu Jul 6 09:49:34 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 07:26:10 PDT 2017 to Wed Jul 5 00:38:51 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 09:13:11 PDT 2017 to Wed Jul 5 02:27:43 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 11:06:40 PDT 2017 to Wed Jul 5 04:21:44 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 13:16:31 PDT 2017 to Wed Jul 5 06:29:05 PDT 2017.
Topic: crawldata
CiteSeerX URL Crawl 2017
web: 12,826 views
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 12:57:06 PDT 2017 to Wed Jul 5 06:10:16 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 14:37:19 PDT 2017 to Wed Jul 5 07:53:37 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 09:57:54 PDT 2017 to Thu Jul 6 03:09:56 PDT 2017.
Topic: crawldata
CiteSeerX URL Crawl 2017
web: 10,162 views
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 09:01:08 PDT 2017 to Thu Jul 6 02:12:56 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 13:15:48 PDT 2017 to Thu Jul 6 06:27:35 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 10:39:46 PDT 2017 to Thu Jul 6 03:52:14 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 15:25:17 PDT 2017 to Thu Jul 6 08:45:39 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 08:08:25 PDT 2017 to Thu Jul 6 01:20:22 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 05:26:51 PDT 2017 to Wed Jul 5 22:39:49 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 03:47:41 PDT 2017 to Wed Jul 5 20:59:49 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 06:25:00 PDT 2017 to Wed Jul 5 23:39:28 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 07:05:25 PDT 2017 to Thu Jul 6 00:16:46 PDT 2017.
Topic: crawldata
CiteSeerX URL Crawl 2017
web: 11,501 views
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 01:25:08 PDT 2017 to Wed Jul 5 18:40:27 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 04:45:52 PDT 2017 to Wed Jul 5 21:59:01 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 06:35:45 PDT 2017 to Wed Jul 5 23:47:46 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 21:42:48 PDT 2017 to Wed Jul 5 15:07:34 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 16:02:02 PDT 2017 to Thu Jul 6 09:30:48 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 15:40:27 PDT 2017 to Thu Jul 6 09:06:51 PDT 2017.
Topic: crawldata
PubMed Central Crawl (2019-10)
collection: 216 items, 513,745 views
- Internet Archive Web Group
Internet Archive Research Publication Crawls
collection: 21,177 items, 122.3M views
- Internet Archive Web Group
A series of open web crawls targeting journal articles, technical memos, essays, datasets, and other research publications. This collection contains the WARC and CDX files that end up in the Wayback Machine (https://web.archive.org). See also the bibliographic metadata corpora at https://archive.org/details/ia_biblio_metadata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 04:56:20 PDT 2017 to Wed Jul 5 22:10:21 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 03:56:45 PDT 2017 to Wed Jul 5 21:07:29 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 04:05:06 PDT 2017 to Wed Jul 5 21:19:08 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 03:38:19 PDT 2017 to Wed Jul 5 20:50:10 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 09:09:11 PDT 2017 to Thu Jul 6 04:53:41 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 11:39:06 PDT 2017 to Thu Jul 6 04:51:08 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 14:40:48 PDT 2017 to Thu Jul 6 07:58:16 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 10:48:58 PDT 2017 to Thu Jul 6 04:03:15 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 12:09:58 PDT 2017 to Thu Jul 6 05:23:51 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Fri Jul 7 00:40:24 PDT 2017 to Mon Jul 10 16:07:51 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 17:26:32 PDT 2017 to Thu Jul 6 10:55:12 PDT 2017.
Topic: crawldata
- Internet Archive Web Group
collection: 6,866 views
This collection contains web crawl data for a random selection of 500k (0.5 million) Crossref DOI redirects, including the doi.org redirect requests. The intent of this crawl is to gather loose statistics on the number of failing redirects and the number of host websites that block automated crawling, and to build a corpus of HTML landing pages for metadata extraction (e.g., "signposting" HTTP headers, linked data HTML metadata, semantic markup). Total size of (uncompressed) WARC data is 50 GB,...
UNPAYWALL-PDF-CRAWL-2020-11
collection: 199 items, 2M views
- Internet Archive Web Group
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 10:16:05 PDT 2017 to Wed Jul 5 03:29:32 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 08:52:59 PDT 2017 to Wed Jul 5 02:06:36 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 10:06:01 PDT 2017 to Wed Jul 5 03:18:50 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 11:18:04 PDT 2017 to Wed Jul 5 04:29:44 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 09:44:39 PDT 2017 to Wed Jul 5 02:59:13 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 15:09:12 PDT 2017 to Wed Jul 5 08:21:11 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 17:23:45 PDT 2017 to Wed Jul 5 10:38:21 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 16:26:39 PDT 2017 to Wed Jul 5 09:45:09 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 15:51:42 PDT 2017 to Wed Jul 5 09:04:28 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 14:49:16 PDT 2017 to Wed Jul 5 08:01:47 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 17:12:42 PDT 2017 to Wed Jul 5 10:27:30 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 13:35:58 PDT 2017 to Wed Jul 5 06:48:02 PDT 2017.
Topic: crawldata
Web PDF GROBID Corpus (July 2019)
collection: 10 items, 17 views
- Internet Archive Web Group
arXiv Content Crawl (2019-10)
collection: 37 items, 95,778 views
- Internet Archive Web Group
CORE-UPSTREAM-CRAWL-2018-11
collection: 741 items, 2.2M views
- Internet Archive Web Group
Crawl of "upstream" URLs from the CORE (core.ac.uk) metadata dump. Only a partial seedlist of files was crawled.
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 17:47:56 PDT 2017 to Wed Jul 5 11:02:06 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc284.us.archive.org:CITESEERX-CRAWL-2017 from Wed Jul 5 22:05:23 PDT 2017 to Wed Jul 5 15:42:16 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 12:53:52 PDT 2017 to Thu Jul 6 06:07:14 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 15:07:36 PDT 2017 to Thu Jul 6 08:28:04 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled CiteseerX PDF URLs captured by wbgrp-svc285.us.archive.org:CITESEERX-CRAWL-2017 from Thu Jul 6 02:09:00 PDT 2017 to Wed Jul 5 19:24:01 PDT 2017.
Topic: crawldata