Wikilinks: A Large-scale Cross-Document Coreference Corpus Labeled via Links to Wikipedia (Extended Dataset)
Sameer Singh and Amarnag Subramanya and Fernando Pereira and Andrew McCallum

wiki-link (109 files)
full-content/part2/part2/108.gz 1.50GB
full-content/part2/part2/109.gz 1.40GB
full-content/part2/part2/105.gz 1.96GB
full-content/part2/part2/106.gz 1.61GB
full-content/part2/part2/107.gz 1.30GB
full-content/part2/part2/103.gz 2.02GB
full-content/part2/part2/104.gz 1.89GB
full-content/part2/part2/100.gz 1.69GB
full-content/part2/part2/101.gz 2.01GB
full-content/part2/part2/102.gz 2.09GB
full-content/part2/part2/097.gz 1.30GB
full-content/part2/part2/098.gz 1.63GB
full-content/part2/part2/099.gz 1.96GB
full-content/part2/part2/094.gz 2.01GB
full-content/part2/part2/095.gz 1.62GB
full-content/part2/part2/096.gz 1.45GB
full-content/part2/part2/092.gz 2.03GB
full-content/part2/part2/093.gz 1.89GB
full-content/part2/part2/089.gz 1.72GB
full-content/part2/part2/090.gz 2.03GB
full-content/part2/part2/091.gz 2.04GB
full-content/part2/part2/086.gz 1.06GB
full-content/part2/part2/087.gz 1.68GB
full-content/part2/part2/088.gz 1.97GB
full-content/part2/part2/083.gz 2.09GB
full-content/part2/part2/084.gz 1.60GB
full-content/part2/part2/085.gz 1.67GB
full-content/part2/part2/081.gz 2.02GB
full-content/part2/part2/082.gz 1.87GB
full-content/part2/part2/078.gz 1.72GB
full-content/part2/part2/079.gz 2.00GB
full-content/part2/part2/080.gz 2.05GB
full-content/part2/part2/076.gz 1.79GB
full-content/part2/part2/077.gz 1.93GB
full-content/part2/part2/073.gz 1.60GB
full-content/part2/part2/074.gz 1.69GB
full-content/part2/part2/075.gz 923.11MB
full-content/part2/part2/070.gz 2.04GB
full-content/part2/part2/071.gz 1.87GB
full-content/part2/part2/072.gz 2.11GB
full-content/part2/part2/067.gz 1.69GB
full-content/part2/part2/068.gz 1.97GB
full-content/part2/part2/069.gz 2.07GB
full-content/part2/part2/065.gz 1.79GB
full-content/part2/part2/066.gz 1.93GB
full-content/part2/part2/062.gz 1.57GB
full-content/part2/part2/063.gz 1.73GB
full-content/part2/part2/064.gz 909.80MB
full-content/part2/part2/059.gz 2.05GB
(listing truncated; the wiki-link folder contains 109 files in total)
Type: Dataset
Tags:

Bibtex:
@article{,
title= {Wikilinks: A Large-scale Cross-Document Coreference Corpus Labeled via Links to Wikipedia (Extended Dataset)},
author= {Sameer Singh and Amarnag Subramanya and Fernando Pereira and Andrew McCallum},
abstract= {Cross-document coreference resolution is the task of grouping the entity mentions in a collection of documents into sets that each represent a distinct entity. It is central to knowledge base construction and also useful for joint inference with other NLP components. Obtaining large, organic labeled datasets for training and testing cross-document coreference has previously been difficult. We use a method for automatically gathering massive amounts of naturally-occurring cross-document reference data to create the Wikilinks dataset, comprising 40 million mentions over 3 million entities. Our method is based on finding hyperlinks to Wikipedia from a web crawl and using anchor text as mentions. In addition to providing large-scale labeled data without human effort, we are able to include many styles of text beyond newswire and many entity types beyond people.

### Introduction


 The Wikipedia links (WikiLinks) data consists of web pages that
 satisfy the following two constraints:

a. contain at least one hyperlink that points to Wikipedia, and
b. the anchor text of that hyperlink closely matches the title of the target Wikipedia page.

  We treat each page on Wikipedia as representing an entity
  (or concept or idea), and the anchor text as a mention of the
  entity. The WikiLinks dataset was obtained by iterating
  over Google's web index. 

####  Content


  This dataset is accompanied by the following tech report:

  https://web.cs.umass.edu/publication/docs/2012/UM-CS-2012-015.pdf

  Please cite the above report if you use this data.

  The dataset is divided into 10 gzipped text files,
  data-0000[0-9]-of-00010.gz. Each of these files can be viewed
  without uncompressing it by using zcat. For example:

  zcat data-00001-of-00010.gz | head 

  gives:


  URL	ftp://217.219.170.14/Computer%20Group/Faani/vaset%20fani/second/sattari/word/2007/source/s%20crt.docx
  MENTION	vacuum tube	421	http://en.wikipedia.org/wiki/Vacuum_tube
  MENTION	vacuum tubes	10838	http://en.wikipedia.org/wiki/Vacuum_tube
  MENTION	electron gun	598	http://en.wikipedia.org/wiki/Electron_gun
  MENTION	fluorescent	790	http://en.wikipedia.org/wiki/Fluorescent
  MENTION	oscilloscope	1307	http://en.wikipedia.org/wiki/Oscilloscope
  MENTION	computer monitor	1503	http://en.wikipedia.org/wiki/Computer_monitor
  MENTION	computer monitors	3066	http://en.wikipedia.org/wiki/Computer_monitor
  MENTION	radar	1657	http://en.wikipedia.org/wiki/Radar
  MENTION	plasma screens	2162	http://en.wikipedia.org/wiki/Plasma_screen

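  If zcat is not available, the same kind of peek can be done
  programmatically. The snippet below is only an illustrative sketch
  (assuming Python 3; the shard name repeats the example above):

  import gzip
  from itertools import islice

  # Stream the first 10 lines of one shard without writing a
  # decompressed copy, mirroring `zcat data-00001-of-00010.gz | head`.
  with gzip.open("data-00001-of-00010.gz", "rt",
                 encoding="utf-8", errors="replace") as f:
      for line in islice(f, 10):
          print(line.rstrip("\n"))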

  Each file is in the following format:

  -------
  URL\t<url>\n
  MENTION\t<mention>\t<byte_offset>\t<target_url>\n
  MENTION\t<mention>\t<byte_offset>\t<target_url>\n
  MENTION\t<mention>\t<byte_offset>\t<target_url>\n
  ...
  TOKEN\t<token>\t<byte_offset>\n
  TOKEN\t<token>\t<byte_offset>\n
  TOKEN\t<token>\t<byte_offset>\n
  ...
  \n\n
  URL\t<url>\n
  ...



  where each web page is identified by its URL (on the line annotated
  with "URL"). For every mention (denoted by "MENTION"), we provide the
  actual mention string, the byte offset of the mention from the start
  of the page, and the target URL, all separated by tabs. It is
  possible (and in many cases very likely) that the contents of a
  web page may change over time. The dataset therefore also contains
  the 10 least frequent tokens on that page at the time it was
  crawled. These lines start with "TOKEN" and contain the token
  string and its byte offset from the start of the page. The token
  strings can be used as fingerprints to verify whether the page used
  to generate the data has changed. Finally, pages are separated from
  each other by two blank lines.
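  For programmatic access, the records above can be parsed directly
  from a gzipped shard. The reader below is an unofficial, minimal
  sketch (assuming Python 3; the function name and the example shard
  name are ours, not part of the dataset):

  import gzip
  from collections import Counter

  def read_pages(path):
      """Yield one dict per web page from a Wikilinks shard.

      Each dict holds the page URL, its MENTION records
      (mention string, byte offset, target Wikipedia URL), and its
      TOKEN fingerprint records (token string, byte offset).
      """
      page = None
      with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
          for raw in f:
              line = raw.rstrip("\n")
              if not line:
                  # Blank lines mark the end of a page.
                  if page is not None:
                      yield page
                      page = None
                  continue
              fields = line.split("\t")
              if fields[0] == "URL":
                  page = {"url": fields[1], "mentions": [], "tokens": []}
              elif fields[0] == "MENTION" and page is not None:
                  page["mentions"].append(
                      (fields[1], int(fields[2]), fields[3]))
              elif fields[0] == "TOKEN" and page is not None:
                  page["tokens"].append((fields[1], int(fields[2])))
      if page is not None:
          yield page

  # Example: count how often each Wikipedia page is the target of a mention.
  counts = Counter()
  for page in read_pages("data-00001-of-00010.gz"):
      counts.update(target for _, _, target in page["mentions"])
  print(counts.most_common(5))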

####  Basic Statistics


  Number of documents: 11 million
  Number of entities:   3 million
  Number of mentions:  40 million


  Finally, please note that this dataset was created automatically
  from the web and therefore contains some noise.

  Enjoy!

  Amar Subramanya (asubram@google.com)

  Sameer Singh (sameer@cs.umass.edu)

  Fernando Pereira (pereira@google.com)

  Andrew McCallum (mccallum@cs.umass.edu)
},
keywords= {},
terms= {},
license= {Attribution 3.0 Unported (CC BY 3.0)}
}

