Text

curated by joecohen

Type	Name	Files	Added	Size	DLs	Seeders	Leechers
	Yale YouTube Video Text	1	2014-10-20	434.77MB	8,352	10+	0
	Enwiki Word2vec model 1000 Dimensions	1	2015-04-09	8.63GB	3,499	16	0
	Structured Web Data Extraction Dataset (SWDE)	1	2015-11-29	207.31MB	4,224	5	0
	Online News Popularity Data Set	1	2016-02-11	7.48MB	3,170	7+	0
	Sentiment Labelled Sentences Data Set	1	2016-08-26	512.21kB	553	6+	0
	MovieLens 20M Dataset	1	2016-12-16	198.70MB	2,299	8+	0
	Microsoft Academic Graph - 2016/02/05	1	2016-12-25	28.94GB	282	2+	0
	IMDb Large Movie Review Dataset	1	2018-10-16	26.40MB	1,020	7+	0
	Wikitext-103	1	2018-10-16	190.20MB	1,444	4+	0
	Wikitext-2	1	2018-10-16	4.07MB	286	4+	0
	WMT 2015 French/English parallel texts	1	2018-10-16	2.60GB	2,280	4+	0
	AG News	1	2018-10-16	11.78MB	233	5+	0
	Amazon reviews - Full	1	2018-10-16	643.70MB	1,374	7+	0
	Amazon reviews - Polarity	1	2018-10-16	688.34MB	1,137	4+	0
	DBPedia ontology	1	2018-10-16	68.34MB	164	4+	0
	Sogou news	1	2018-10-16	384.27MB	278	4+	0
	Yelp reviews - Full	1	2018-10-16	196.15MB	419	4+	0
	Yelp reviews - Polarity	1	2018-10-16	166.37MB	459	4+	0
	Indiana University - Chest X-Rays (XML Reports)	1	2018-11-22	1.11MB	50,439	18+	0
	Europarl v7 - training-parallel-europarl-v7.tgz (CS-EN, DE-EN, ES-EN, FR-EN)	1	2019-02-04	657.63MB	56	4+	0
	UN corpus - training-parallel-un.tgz (ES-EN, FR-EN)	1	2019-02-04	2.37GB	70	4+	0
	Common Crawl corpus - training-parallel-commoncrawl.tgz (CS-EN, DE-EN, ES-EN, FR-EN, RU-EN)	1	2019-02-04	918.31MB	130	3+	0
	r/WritingPrompts, Text (2018)	1	2019-06-19	87.47MB	464	8	0
	Reading Text in the Wild with Convolutional Neural Networks	1	2021-11-12	10.68GB	49,310	25	0
	SMS Spam Collection Data Set	2	2015-11-28	695.38kB	872	7+	0
	30M Factoid Question-Answer Corpus (30MQA)	2	2018-11-29	529.34MB	5,114	11+	0
	Flickr8k Dataset	2	2019-03-09	1.12GB	15,715	13+	0
	Lerman Twitter 2010 Dataset	3	2014-08-15	292.17MB	3,678	11+	0
	Synthetic Data for Text Localisation in Natural Images	15	2021-11-15	73.50GB	4,159	11	4
	PMC Open Access Subset	16	2020-05-24	84.14GB	372	10+	0
	OpenWebText (Gokaslan's distribution, 2019), GPT-2 Tokenized	395	2019-06-01	16.02GB	240	3	0
	Phishing corpus	4555	2019-01-02	37.48MB	1,186	8+	0
