ClueWeb12_Anchors (anchor text derived from CMU's ClueWeb12 web crawl)
Djoerd Hiemstra

ClueWeb12_Anchors (132 files)
part-r-00131.gz 229.08MB
part-r-00130.gz 228.56MB
part-r-00129.gz 228.60MB
part-r-00128.gz 229.41MB
part-r-00127.gz 229.25MB
part-r-00126.gz 229.03MB
part-r-00125.gz 230.89MB
part-r-00124.gz 230.38MB
part-r-00123.gz 229.58MB
part-r-00122.gz 229.63MB
part-r-00121.gz 230.13MB
part-r-00120.gz 229.38MB
part-r-00119.gz 230.19MB
part-r-00118.gz 230.53MB
part-r-00117.gz 229.79MB
part-r-00116.gz 229.49MB
part-r-00115.gz 230.18MB
part-r-00114.gz 230.39MB
part-r-00113.gz 230.62MB
part-r-00112.gz 230.23MB
part-r-00111.gz 233.36MB
part-r-00110.gz 230.27MB
part-r-00109.gz 228.99MB
part-r-00108.gz 229.00MB
part-r-00107.gz 229.11MB
part-r-00106.gz 229.06MB
part-r-00105.gz 230.12MB
part-r-00104.gz 229.46MB
part-r-00103.gz 229.89MB
part-r-00102.gz 229.80MB
part-r-00101.gz 230.18MB
part-r-00100.gz 230.08MB
part-r-00099.gz 229.97MB
part-r-00098.gz 229.81MB
part-r-00097.gz 230.39MB
part-r-00096.gz 230.97MB
part-r-00095.gz 229.91MB
part-r-00094.gz 229.71MB
part-r-00093.gz 228.87MB
part-r-00092.gz 229.47MB
part-r-00091.gz 229.07MB
part-r-00090.gz 230.91MB
part-r-00089.gz 230.78MB
part-r-00088.gz 230.69MB
part-r-00087.gz 229.19MB
part-r-00086.gz 228.93MB
part-r-00085.gz 229.70MB
part-r-00084.gz 229.16MB
part-r-00083.gz 229.78MB
Type: Dataset
Tags: TREC, ClueWeb, HTML, web data, anchor texts, web search, Text Retrieval Conference, University of Twente

Bibtex:
@article{,
title= {ClueWeb12_Anchors (anchor text derived from CMU's ClueWeb12 web crawl)},
journal= {},
author= {Djoerd Hiemstra},
year= {2013},
url= {http://www.cs.utwente.nl/~hiemstra/2013/anchor-text-for-clueweb12.html},
license= {http://creativecommons.org/licenses/by/4.0/},
abstract= {Anchor texts extracted from ClueWeb12

https://djoerdhiemstra.com/2013/anchor-text-for-clueweb12/},
keywords= {TREC, ClueWeb, HTML, web data, anchor texts, web search, Text Retrieval Conference, University of Twente},
terms= {},
superseded= {}
}
