
Web Data Services

Datasets, special collections, and other derived and extracted subsets of web data culled from IA's web archive. Many of these datasets were created in relation to specific partnerships and collaborative projects supporting computational research and data mining using web archives.




794
RESULTS


ParaCrawl
collection
22 ITEMS
6,663 VIEWS

Broader/Continued Web-Scale Provision of Parallel Corpora for European Languages ( ParaCrawl.eu )
Web Data Services Meetings
collection
9 ITEMS
386 VIEWS

Presentations for the Internet Archive's Web Archiving & Data Services international partners meeting held on Friday September 20, 2019 in Amsterdam alongside the iPRES 2019 conference.
The main goal of the ParaCrawl project is to create the largest publicly available parallel corpora by crawling hundreds of thousands of websites, using open source tools. We develop corpora for all 24 official EU languages plus Icelandic, Norwegian, Basque, Catalan/Valencian, and Galician. As part of this effort, several open source components have been developed and integrated into the open-source tool Bitextor , a highly modular pipeline that allows harvesting parallel corpora from...
Topics: machine translation, MT, MT datasets, parallel corpora, paired corpora, corpora, paired sentences,...
Corporation Websites Collection
collection
659 ITEMS
586,135 VIEWS

This collection contains a web archive corpus of 0.8+ million corporate websites (from an original list of ~0.98 million websites) extracted from the archive.org web archive, covering the period 1996 to early 2017. This corpus was originally created as a collaboration between the Internet Archive and a group at Dartmouth College, but it may be useful to other researchers. Updated or more detailed information may exist at:...
Topics: websites, corporations, homepages
Web PDF Training Sets
collection
6 ITEMS
189 VIEWS
by Internet Archive Web Group

Web Data Services Meetings
texts

eye 50

favorite 0

comment 0

Coinciding with iPRES 2019 in Amsterdam, Internet Archive held a half-day partner meeting of presentations on the latest web and data services for preservation and access of born-digital knowledge. These slides are from the presentation given by Martin Klein, of Los Alamos National Laboratory.
Topics: Web & Data Services, Born-Digital, Digital Preservation
collection

eye 175

This collection of WARC files was originally extracted for the University of Edinburgh for their project "Broader Provision of Web-Scale Parallel Corpora for Official European Languages". It consists of "parallel" web archive records from 2018/2019, extracted from Global Wayback (GWB) snapshot `20191109192916`, for the languages English (en), Icelandic (is), Croatian (hr), Norwegian (no), Irish (ga), i.e., multiple records for the same URL that exist for at least two of the...
Web Data Services Meetings
texts

eye 42

favorite 1

comment 0

Coinciding with iPRES 2019 in Amsterdam, Internet Archive held a half-day partner meeting of discussions and presentations on the latest web and data services for preservation and access of born-digital knowledge. These slides are from the presentation given by Kyrie Whitsett, Program Officer of Web Archiving & Data Services at Internet Archive.
Topics: Web & Data Services, Born-Digital, Digital Preservation
ParaCrawl
data

eye 384

favorite 0

comment 0

The main goal of the ParaCrawl project is to create the largest publicly available parallel corpora by crawling hundreds of thousands of websites, using open source tools. We develop corpora for all 24 official EU languages plus Icelandic, Norwegian, Basque, Catalan/Valencian, and Galician. As part of this effort, several open source components have been developed and integrated into the open-source tool Bitextor , a highly modular pipeline that allows harvesting parallel corpora from...
Topics: machine translation, MT, MT datasets, parallel corpora, paired corpora, corpora, paired sentences,...
Web Data Services Meetings
texts

eye 40

favorite 0

comment 0

Coinciding with iPRES 2019 in Amsterdam, Internet Archive held a half-day partner meeting of discussions and presentations on the latest web and data services for preservation and access of born-digital knowledge. These slides are from the presentation given by Jefferson Bailey, the Director of Web Archiving & Data Services at Internet Archive.
Topics: Digital Preservation, Born-Digital, Web & Data Services
Scholarly TDM Corpora
collection
44 ITEMS
27 VIEWS
by Internet Archive Web Group

Access-restricted text and data-mining corpora. If you are interested in getting access to work with this content, contact info@archive.org
Web Data Services Meetings
texts

eye 24

favorite 0

comment 0

Coinciding with iPRES 2019 in Amsterdam, Internet Archive held a half-day partner meeting of discussions and presentations on the latest web and data services for preservation and access of born-digital knowledge. These slides are from the presentation given by Nicholas Taylor, of Stanford Libraries.
Topics: Web & Data Services, Born-Digital, Digital Preservation
Web Data Services Meetings
texts

eye 31

favorite 0

comment 0

Coinciding with iPRES 2019 in Amsterdam, Internet Archive held a half-day partner meeting of discussions and presentations on the latest web and data services for preservation and access of born-digital knowledge. These slides are from the presentation given by Kees Teszelszky, of Koninklijke Bibliotheek (The Royal Library of the Netherlands).
Topics: Web & Data Services, Born-Digital, Digital Preservation, Climate Change
Web PDF GROBID Corpus (June 2019)
collection
10 ITEMS
52 VIEWS
by Internet Archive Web Group

Corporation Websites Collection
data

eye 2,212

favorite 0

comment 0

Corporation Websites Collection
data

eye 3,462

favorite 0

comment 0

ParaCrawl
data

eye 1

favorite 0

comment 0

Web PDF GROBID Corpus (June 2019)
data

eye 2

favorite 0

comment 0

Scholarly TDM Corpora
data

eye 0

favorite 0

comment 0

Corporation Websites Collection
data

eye 788

favorite 0

comment 0

Corporation Websites Collection
data

eye 5,548

favorite 0

comment 0

Corporation Websites Collection
data

eye 2,259

favorite 0

comment 0

Corporation Websites Collection
data

eye 989

favorite 0

comment 0

Corporation Websites Collection
data

eye 786

favorite 0

comment 0

Corporation Websites Collection
data

eye 697

favorite 0

comment 0

Corporation Websites Collection
data

eye 1,877

favorite 0

comment 0

Corporation Websites Collection
data

eye 1,305

favorite 0

comment 0

Corporation Websites Collection
data

eye 1,956

favorite 0

comment 0

Corporation Websites Collection
data

eye 2,160

favorite 0

comment 0

Corporation Websites Collection
data

eye 1,323

favorite 0

comment 0

Scholarly TDM Corpora
data

eye 0

favorite 0

comment 0

Scholarly TDM Corpora
data

eye 0

favorite 0

comment 0

Web PDF GROBID Corpus (June 2019)
data

eye 3

favorite 0

comment 0

Corporation Websites Collection
data

eye 1,765

favorite 0

comment 0

Scholarly TDM Corpora
data

eye 0

favorite 0

comment 0

Corporation Websites Collection
data

eye 1,239

favorite 0

comment 0