ClueWeb12_Anchors (anchor text derived from CMU's ClueWeb12 web crawl)
Djoerd Hiemstra

ClueWeb12_Anchors (132 files)
part-r-00000.gz 232.32MB
part-r-00001.gz 228.96MB
part-r-00002.gz 229.32MB
part-r-00003.gz 229.12MB
part-r-00004.gz 230.08MB
part-r-00005.gz 229.55MB
part-r-00006.gz 231.03MB
part-r-00007.gz 231.00MB
part-r-00008.gz 230.60MB
part-r-00009.gz 229.81MB
part-r-00010.gz 230.21MB
part-r-00011.gz 229.90MB
part-r-00012.gz 229.51MB
part-r-00013.gz 229.14MB
part-r-00014.gz 229.67MB
part-r-00015.gz 229.19MB
part-r-00016.gz 229.72MB
part-r-00017.gz 228.71MB
part-r-00018.gz 229.17MB
part-r-00019.gz 231.08MB
part-r-00020.gz 230.97MB
part-r-00021.gz 230.49MB
part-r-00022.gz 230.20MB
part-r-00023.gz 230.78MB
part-r-00024.gz 231.85MB
part-r-00025.gz 229.59MB
part-r-00026.gz 229.38MB
part-r-00027.gz 229.75MB
part-r-00028.gz 229.08MB
part-r-00029.gz 230.29MB
part-r-00030.gz 229.26MB
part-r-00031.gz 228.92MB
part-r-00032.gz 229.87MB
part-r-00033.gz 230.00MB
part-r-00034.gz 229.30MB
part-r-00035.gz 229.37MB
part-r-00036.gz 230.07MB
part-r-00037.gz 230.63MB
part-r-00038.gz 229.92MB
part-r-00039.gz 230.11MB
part-r-00040.gz 229.33MB
part-r-00041.gz 230.12MB
part-r-00042.gz 229.70MB
part-r-00043.gz 229.31MB
part-r-00044.gz 229.43MB
part-r-00045.gz 228.63MB
part-r-00046.gz 229.87MB
part-r-00047.gz 229.79MB
part-r-00048.gz 230.37MB
Type: Dataset
Tags: TREC, ClueWeb, HTML, web data, anchor texts, web search, Text Retrieval Conference, University of Twente

Bibtex:
@article{,
title= {ClueWeb12_Anchors (anchor text derived from CMU's ClueWeb12 web crawl) },
journal= {},
author= {Djoerd Hiemstra},
year= {2013},
url= {http://www.cs.utwente.nl/~hiemstra/2013/anchor-text-for-clueweb12.html},
license= {http://creativecommons.org/licenses/by/4.0/l},
abstract= {Anchor texts extracted from ClueWeb12

https://djoerdhiemstra.com/2013/anchor-text-for-clueweb12/},
keywords= {TREC, ClueWeb, HTML, web data, anchor texts, web search, Text Retrieval Conference, University of Twente},
terms= {},
superseded= {}
}
