| Yale YouTube Video Text | 1 | 2014-10-20 | 434.77MB | 7,917 | 5+ | 0 |
| Enwiki Word2vec model 1000 Dimensions | 1 | 2015-04-09 | 8.63GB | 3,476 | 8 | 0 |
| Structured Web Data Extraction Dataset (SWDE) | 1 | 2015-11-29 | 207.31MB | 2,775 | 3 | 0 |
| Online News Popularity Data Set | 1 | 2016-02-11 | 7.48MB | 3,073 | 3+ | 0 |
| Sentiment Labelled Sentences Data Set | 1 | 2016-08-26 | 512.21kB | 516 | 5+ | 0 |
| MovieLens 20M Dataset | 1 | 2016-12-16 | 198.70MB | 2,050 | 4+ | 0 |
| Microsoft Academic Graph - 2016/02/05 | 1 | 2016-12-25 | 28.94GB | 253 | 2+ | 0 |
| IMDb Large Movie Review Dataset | 1 | 2018-10-16 | 26.40MB | 896 | 3+ | 0 |
| Wikitext-103 | 1 | 2018-10-16 | 190.20MB | 716 | 4+ | 0 |
| Wikitext-2 | 1 | 2018-10-16 | 4.07MB | 248 | 2+ | 0 |
| WMT 2015 French/English parallel texts | 1 | 2018-10-16 | 2.60GB | 1,931 | 3+ | 0 |
| AG News | 1 | 2018-10-16 | 11.78MB | 222 | 2+ | 0 |
| Amazon reviews - Full | 1 | 2018-10-16 | 643.70MB | 1,113 | 4+ | 0 |
| Amazon reviews - Polarity | 1 | 2018-10-16 | 688.34MB | 1,101 | 2+ | 0 |
| DBPedia ontology | 1 | 2018-10-16 | 68.34MB | 128 | 2+ | 1 |
| Sogou news | 1 | 2018-10-16 | 384.27MB | 258 | 2+ | 0 |
| Yelp reviews - Full | 1 | 2018-10-16 | 196.15MB | 384 | 2+ | 0 |
| Yelp reviews - Polarity | 1 | 2018-10-16 | 166.37MB | 441 | 2+ | 0 |
| Indiana University - Chest X-Rays (XML Reports) | 1 | 2018-11-22 | 1.11MB | 42,209 | 30+ | 0 |
| Europarl v7 - training-parallel-europarl-v7.tgz (CS-EN, DE-EN, ES-EN, FR-EN) | 1 | 2019-02-04 | 657.63MB | 48 | 2+ | 0 |
| UN corpus - training-parallel-un.tgz (ES-EN, FR-EN) | 1 | 2019-02-04 | 2.37GB | 57 | 2+ | 0 |
| Common Crawl corpus - training-parallel-commoncrawl.tgz (CS-EN, DE-EN, ES-EN, FR-EN, RU-EN) | 1 | 2019-02-04 | 918.31MB | 113 | 2+ | 0 |
| r/WritingPrompts, Text (2018) | 1 | 2019-06-19 | 87.47MB | 400 | 3 | 0 |
| Reading Text in the Wild with Convolutional Neural Networks | 1 | 2021-11-12 | 10.68GB | 40,572 | 27 | 0 |
| SMS Spam Collection Data Set | 2 | 2015-11-28 | 695.38kB | 814 | 2+ | 0 |
| 30M Factoid Question-Answer Corpus (30MQA) | 2 | 2018-11-29 | 529.34MB | 3,972 | 7+ | 0 |
| Flickr8k Dataset | 2 | 2019-03-09 | 1.12GB | 13,019 | 26+ | 0 |
| Lerman Twitter 2010 Dataset | 3 | 2014-08-15 | 292.17MB | 3,438 | 11+ | 0 |
| Synthetic Data for Text Localisation in Natural Images | 15 | 2021-11-15 | 73.50GB | 3,716 | 7 | 5 |
| PMC Open Access Subset | 16 | 2020-05-24 | 84.14GB | 240 | 6+ | 0 |
| OpenWebText (Gokaslan's distribution, 2019), GPT-2 Tokenized | 395 | 2019-06-01 | 16.02GB | 207 | 5 | 0 |
| Phishing corpus | 4555 | 2019-01-02 | 37.48MB | 983 | 2+ | 0 |