<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:academictorrents="http://academictorrents.com/" version="2.0">
<channel>
<title>Text - Academic Torrents</title>
<description>collection curated by joecohen</description>
<link>https://academictorrents.com/collection/text</link>
<item>
<title>Synthetic Data for Text Localisation in Natural Images (Dataset)</title>
<description>@inproceedings{gupta16,
author= {Ankush Gupta and Andrea Vedaldi and Andrew Zisserman},
title= {Synthetic Data for Text Localisation in Natural Images},
booktitle= {IEEE Conference on Computer Vision and Pattern Recognition},
year= {2016},
abstract= {This is a synthetically generated dataset, in which word instances are placed in natural scene images, while taking into account the scene layout.

The dataset consists of *800 thousand* images with approximately *8 million* synthetic word instances. Each text instance is annotated with its text-string, word-level and character-level bounding-boxes.},
keywords= {},
terms= {You (the "Researcher"), have requested permission to use the SynthText in the Wild database (the "Database") at the University of Oxford. In exchange for such permission, the Researcher hereby agrees to the following terms and conditions:

1. Researcher shall use the Database only for non-commercial* research and educational purposes.

2. University of Oxford makes no representations or warranties regarding the Database, including but not limited to warranties of non-infringement or fitness for a particular purpose.

3. Researcher accepts full responsibility for his or her use of the Database and shall defend and indemnify University of Oxford, including their employees, Trustees, officers and agents, against any and all claims arising from Researcher's use of the Database, including but not limited to Researcher's use of any copies of copyrighted images that he or she may create from the Database.

4. Researcher may provide research associates and colleagues with access to the Database provided that they first agree to be bound by these terms and conditions.

5. University of Oxford reserves the right to terminate Researcher's access to the Database at any time.

6. If Researcher is employed by a for-profit, commercial entity*, Researcher's employer shall also be bound by these terms and conditions, and Researcher hereby represents that he or she is fully authorized to enter into this agreement on behalf of such employer.

*  For commercial applications and licensing, contact Roy Azoulay at roy.azoulay@innovation.ox.ac.uk},
license= {},
superseded= {},
url= {https://www.robots.ox.ac.uk/~vgg/data/scenetext/}
}

</description>
<link>https://academictorrents.com/download/2dba9518166cbd141534cbf381aa3e99a087e83c</link>
</item>
<item>
<title>Reading Text in the Wild with Convolutional Neural Networks (Dataset)</title>
<description>@article{jaderberg16,
author= {Max Jaderberg and Karen Simonyan and Andrea Vedaldi and Andrew Zisserman},
title= {Reading Text in the Wild with Convolutional Neural Networks},
journal= {International Journal of Computer Vision},
number= {1},
volume= {116},
pages= {1--20},
month= {jan},
year= {2016},
abstract= {The exact data used to train our deep convolutional neural networks (see our [research page](http://www.robots.ox.ac.uk/~vgg/research/text/)) is included in this torrent.

This is a synthetically generated dataset, which we found sufficient for training text recognition on real-world images.

![Synthetic Data Engine process](https://i.imgur.com/cqmgbUa.png)

This dataset consists of *9 million images* covering *90k English words*, and includes the training, validation and test splits used in our work.},
keywords= {},
terms= {},
license= {},
superseded= {},
url= {https://www.robots.ox.ac.uk/~vgg/data/text/}
}

</description>
<link>https://academictorrents.com/download/3d0b4f09080703d2a9c6be50715b46389fdb3af1</link>
</item>
<item>
<title>PMC Open Access Subset (Dataset)</title>
<description>@article{,
title= {PMC Open Access Subset},
journal= {},
author= {NIH/NLM},
year= {},
url= {https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/},
abstract= {https://i.imgur.com/GBSDr8v.png

mirror of ftp.ncbi.nlm.nih.gov:/pub/pmc/oa_bulk

PubMed Central® (PMC) is a free full-text archive of biomedical and life sciences journal literature at the U.S. National Institutes of Health's National Library of Medicine (NIH/NLM).

https://www.ncbi.nlm.nih.gov/pmc/

The PMC Open Access Subset is a part of the total collection of articles in PMC. The articles in the OA Subset are made available under a Creative Commons or similar license that generally allows more liberal redistribution and reuse than a traditional copyrighted work.

https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/},
keywords= {PMC, PubMed Central},
terms= {},
license= {CC},
superseded= {}
}

</description>
<link>https://academictorrents.com/download/06d6badd7d1b0cfee00081c28fddd5e15e106165</link>
</item>
<item>
<title>r/WritingPrompts, Text (2018) (Dataset)</title>
<description>@article{,
title= {r/WritingPrompts, Text (2018)},
journal= {},
author= {},
year= {},
url= {},
abstract= {r/WritingPrompts data, formatted for GPT-2 training. },
keywords= {},
terms= {},
license= {},
superseded= {}
}

</description>
<link>https://academictorrents.com/download/b4fa678ca4a330cf7078750b93eaefb1680a9053</link>
</item>
<item>
<title>OpenWebText (Gokaslan's distribution, 2019), GPT-2 Tokenized (Dataset)</title>
<description>@article{,
title= {OpenWebText (Gokaslan's distribution, 2019), GPT-2 Tokenized},
journal= {},
author= {eukaryote31 and Joshua Peterson and Aaron Gokaslan and Vanya Cohen},
year= {},
url= {},
abstract= {Code by eukaryote31 and Joshua Peterson: https://github.com/jcpeterson/openwebtext and https://github.com/eukaryote31/openwebtext

Scraped by Aaron Gokaslan and Vanya Cohen: https://skylion007.github.io/OpenWebTextCorpus/

Tokenized by eukaryote31},
keywords= {},
terms= {},
license= {},
superseded= {}
}

</description>
<link>https://academictorrents.com/download/36c39b25657ce1639ccec0a91cf242b42e1f01db</link>
</item>
<item>
<title>Flickr8k Dataset (Dataset)</title>
<description>@article{,
title= {Flickr8k Dataset},
keywords= {},
author= {Micah Hodosh and Peter Young and Julia Hockenmaier},
abstract= {8,000 photos and up to 5 captions for each photo.

We introduce a new benchmark collection for sentence-based image description and search, consisting of 8,000 images that are each paired with five different captions which provide clear descriptions of the salient entities and events. … The images were chosen from six different Flickr groups, and tend not to contain any well-known people or locations, but were manually selected to depict a variety of scenes and situations

https://i.imgur.com/6RxAndT.png

## Citation

Hodosh, Micah, Peter Young, and Julia Hockenmaier. "Framing image description as a ranking task: Data, models and evaluation metrics." Journal of Artificial Intelligence Research 47 (2013): 853-899.

},
terms= {},
license= {},
superseded= {},
url= {}
}

</description>
<link>https://academictorrents.com/download/9dea07ba660a722ae1008c4c8afdd303b6f6e53b</link>
</item>
<item>
<title>Common Crawl corpus - training-parallel-commoncrawl.tgz (CS-EN, DE-EN, ES-EN, FR-EN, RU-EN) (Dataset)</title>
<description>@article{,
title= {Common Crawl corpus - training-parallel-commoncrawl.tgz (CS-EN, DE-EN, ES-EN, FR-EN, RU-EN)},
keywords= {},
author= {},
abstract= {},
terms= {},
license= {},
superseded= {},
url= {http://www.statmt.org/wmt13/translation-task.html}
}

</description>
<link>https://academictorrents.com/download/2a4e272c4fd06abc3b3ee022fd2fd9e220b37c33</link>
</item>
<item>
<title>UN corpus - training-parallel-un.tgz (ES-EN, FR-EN) (Dataset)</title>
<description>@article{,
title= {UN corpus - training-parallel-un.tgz (ES-EN, FR-EN)},
keywords= {},
author= {WMT13},
abstract= {},
terms= {},
license= {},
superseded= {},
url= {http://www.statmt.org/wmt13/translation-task.html}
}

</description>
<link>https://academictorrents.com/download/e4dc3c28d6035a64af928dbdcbc8d6cc0d62d39c</link>
</item>
<item>
<title>Europarl v7 - training-parallel-europarl-v7.tgz (CS-EN, DE-EN, ES-EN, FR-EN) (Dataset)</title>
<description>@article{,
title= {Europarl v7 - training-parallel-europarl-v7.tgz (CS-EN, DE-EN, ES-EN, FR-EN)},
keywords= {},
author= {WMT16},
abstract= {},
terms= {},
license= {},
superseded= {},
url= {http://www.statmt.org/wmt13/translation-task.html}
}

</description>
<link>https://academictorrents.com/download/2c4dbfe50cda15026ebc2579b13edd532b10e911</link>
</item>
<item>
<title>Phishing corpus (Dataset)</title>
<description>@article{,
title= {Phishing corpus},
journal= {},
author= {Vit Listik},
year= {},
url= {},
abstract= {Downloaded at http://monkey.org/~jose/wiki/doku.php?id=phishingcorpus (2015-02-01)},
keywords= {email, phishing, eml, emails},
terms= {},
license= {},
superseded= {}
}

</description>
<link>https://academictorrents.com/download/a77cda9a9d89a60dbdfbe581adf6e2df9197995a</link>
</item>
<item>
<title>30M Factoid Question-Answer Corpus (30MQA) (Dataset)</title>
<description>@article{,
title= {30M Factoid Question-Answer Corpus (30MQA)},
keywords= {},
author= {Iulian Vlad Serban and Alberto García-Durán and Caglar Gulcehre and Sungjin Ahn and Sarath Chandar and Aaron Courville and Yoshua Bengio},
abstract= {The 30M Factoid Question-Answer Corpus consists of 30M natural language questions in English and their corresponding facts in the knowledge base Freebase.

The dataset is formatted as a text file, where each line contains:

```
&lt;subject&gt; \t &lt;relationship&gt; \t &lt;object&gt; \t natural language question,
```
 
where &lt;subject&gt;, &lt;relationship&gt; and &lt;object&gt; are  the subject, relationship and object identifier in Freebase corresponding to the natural language question.
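A minimal sketch of reading one line in the layout described above, assuming the four fields are separated by tab characters. The names `FactoidRow` and `parse_row` and the example row are illustrative and not part of the dataset.

```python
from typing import NamedTuple


class FactoidRow(NamedTuple):
    subject: str
    relationship: str
    obj: str
    question: str


def parse_row(line):
    """Split one tab-separated line into its four fields."""
    # maxsplit=3 keeps any stray tab inside the question text intact.
    fields = [part.strip() for part in line.rstrip("\n").split("\t", 3)]
    if len(fields) != 4:
        raise ValueError("expected 4 tab-separated fields")
    return FactoidRow(*fields)


# Hypothetical row; the identifiers are placeholders, not real Freebase ids.
example = "subject-id\trelationship-id\tobject-id\twhat is the book about ?\n"
row = parse_row(example)
print(row.question)
```

Stripping each field tolerates optional spaces around the tab separators.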

For a more detailed description, have a look at our paper:

Generating Factoid Questions With Recurrent Neural Networks: The 30M Factoid Question-Answer Corpus
http://arxiv.org/abs/1603.06807

Sample:

```
&lt;http://rdf.freebase.com/ns/m.04whkz5&gt;www.freebase.com/book/written_work/subjects&lt;http://rdf.freebase.com/ns/m.01cj3p&gt;what is the book e about ?
&lt;http://rdf.freebase.com/ns/m.0tp2p24&gt;www.freebase.com/music/release_track/release&lt;http://rdf.freebase.com/ns/m.0sjc7c1&gt;in what release does the release track cardiac arrest come from ?
&lt;http://rdf.freebase.com/ns/m.04j0t75&gt;www.freebase.com/film/film/country&lt;http://rdf.freebase.com/ns/m.07ssc&gt;what country is the debt from ?
&lt;http://rdf.freebase.com/ns/m.0ftqr&gt;www.freebase.com/music/producer/tracks_produced&lt;http://rdf.freebase.com/ns/m.0p600l&gt;what songs have nobuo uematsu produced ?
&lt;http://rdf.freebase.com/ns/m.036p007&gt;www.freebase.com/music/release/producers&lt;http://rdf.freebase.com/ns/m.0677ng&gt;who produced eve-olution ?
&lt;http://rdf.freebase.com/ns/m.0ms5mg&gt;www.freebase.com/music/recording/artist&lt;http://rdf.freebase.com/ns/m.0mjn2&gt;which artist recorded most of us are sad ?
```
},
terms= {},
license= {Creative Commons Attribution 3.0 Unported},
superseded= {},
url= {}
}

</description>
<link>https://academictorrents.com/download/973fb709bdb9db6066213bbc5529482a190098ce</link>
</item>
<item>
<title>Indiana University - Chest X-Rays (XML Reports) (Dataset)</title>
<description>@article{,
title= {Indiana University - Chest X-Rays (XML Reports)},
keywords= {chest x-ray, radiology},
author= {},
abstract= {1000 radiology reports for the chest x-ray images from the Indiana University hospital network.

To identify images associated with the reports, use XML tag &lt;parentImage id="image-id"&gt;. More than one image could be associated with a report.
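A minimal sketch of collecting the image ids from one report with Python's standard library. The stand-in report is built in code with a made-up root element and ids; only the parentImage tag and its id attribute come from the description above.

```python
import xml.etree.ElementTree as ET


def parent_image_ids(report_xml):
    """Return the id attribute of every parentImage element in a report."""
    root = ET.fromstring(report_xml)
    # iter() walks the whole tree, so nesting depth does not matter.
    return [el.get("id") for el in root.iter("parentImage") if el.get("id")]


# Build a tiny stand-in report; the root element name and ids are made up.
report = ET.Element("report")
ET.SubElement(report, "parentImage", id="image-1")
ET.SubElement(report, "parentImage", id="image-2")
xml_text = ET.tostring(report, encoding="unicode")

print(parent_image_ids(xml_text))
```

For a real report file, `ET.parse(path).getroot()` could replace the `ET.fromstring` call.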

https://i.imgur.com/PWo3x47.png},
terms= {},
license= {Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License},
superseded= {},
url= {https://openi.nlm.nih.gov/faq.php}
}

</description>
<link>https://academictorrents.com/download/66450ba52ba3f83fbf82ef9c91f2bde0e845aba9</link>
</item>
<item>
<title>Yelp reviews - Polarity (Dataset)</title>
<description>@article{,
title= {Yelp reviews - Polarity},
keywords= {fastai},
journal= {},
author= {Xiang Zhang et al., 2015},
year= {},
url= {https://arxiv.org/abs/1509.01626},
license= {},
abstract= {1,569,264 samples from the Yelp Dataset Challenge 2015. This subset has 280,000 training samples and 19,000 test samples in each polarity.},
superseded= {},
terms= {}
}

</description>
<link>https://academictorrents.com/download/271777225ff3c6dec8055e231c70731a1da2518f</link>
</item>
<item>
<title>Yelp reviews - Full (Dataset)</title>
<description>@article{,
title= {Yelp reviews - Full},
keywords= {fastai},
journal= {},
author= {Xiang Zhang et al., 2015},
year= {},
url= {https://arxiv.org/abs/1509.01626},
license= {},
abstract= {1,569,264 samples from the Yelp Dataset Challenge 2015. This full dataset has 130,000 training samples and 10,000 testing samples in each star.},
superseded= {},
terms= {}
}

</description>
<link>https://academictorrents.com/download/66ab083bda0c508de6c641baabb1ec17f72dc480</link>
</item>
<item>
<title>Sogou news (Dataset)</title>
<description>@article{,
title= {Sogou news},
keywords= {fastai},
journal= {},
author= {Xiang Zhang et al., 2015},
year= {},
url= {https://arxiv.org/abs/1509.01626},
license= {},
abstract= {2,909,551 news articles from the SogouCA and SogouCS news corpora, in 5 categories. The number of training samples selected for each class is 90,000 and testing 12,000. Note that the Chinese characters have been converted to Pinyin.},
superseded= {},
terms= {}
}

</description>
<link>https://academictorrents.com/download/b2b847b5e1946b0479baa838a0b0547178e5ebe8</link>
</item>
<item>
<title>DBPedia ontology (Dataset)</title>
<description>@article{,
title= {DBPedia ontology},
keywords= {fastai},
journal= {},
author= {Xiang Zhang et al., 2015},
year= {},
url= {https://arxiv.org/abs/1509.01626},
license= {},
abstract= {40,000 training samples and 5,000 testing samples from 14 nonoverlapping classes from DBpedia 2014.},
superseded= {},
terms= {}
}

</description>
<link>https://academictorrents.com/download/881118da3e05d63f91dbadf84317381203f3cb24</link>
</item>
<item>
<title>Amazon reviews - Polarity (Dataset)</title>
<description>@article{,
title= {Amazon reviews - Polarity},
keywords= {fastai},
journal= {},
author= {Xiang Zhang et al., 2015},
year= {},
url= {https://arxiv.org/abs/1509.01626},
license= {},
abstract= {34,686,770 Amazon reviews from 6,643,669 users on 2,441,053 products, from the Stanford Network Analysis Project (SNAP). This subset contains 1,800,000 training samples and 200,000 testing samples in each polarity sentiment.},
superseded= {},
terms= {}
}

</description>
<link>https://academictorrents.com/download/db0cd5603a0d154ec3dcfc6ff7862d47d3884b83</link>
</item>
<item>
<title>Amazon reviews - Full (Dataset)</title>
<description>@article{,
title= {Amazon reviews - Full},
keywords= {fastai},
journal= {},
author= {Xiang Zhang et al., 2015},
year= {},
url= {https://arxiv.org/abs/1509.01626},
license= {},
abstract= {34,686,770 Amazon reviews from 6,643,669 users on 2,441,053 products, from the Stanford Network Analysis Project (SNAP). This full dataset contains 600,000 training samples and 130,000 testing samples in each class.},
superseded= {},
terms= {}
}

</description>
<link>https://academictorrents.com/download/66ddbb6d5f49aa6c36a01ca5e814f1beef00b5b7</link>
</item>
<item>
<title>AG News (Dataset)</title>
<description>@article{,
title= {AG News},
keywords= {fastai},
journal= {},
author= {Xiang Zhang et al., 2015},
year= {},
url= {https://arxiv.org/abs/1509.01626},
license= {},
abstract= {496,835 categorized news articles from &gt;2000 news sources from the 4 largest classes from AG’s corpus of news articles, using only the title and description fields. The number of training samples for each class is 30,000 and testing 1900.},
superseded= {},
terms= {}
}

</description>
<link>https://academictorrents.com/download/758bf646e3ffd39d20f9a3d9efbdb0e1eade5022</link>
</item>
<item>
<title>WMT 2015 French/English parallel texts (Dataset)</title>
<description>@article{,
title= {WMT 2015 French/English parallel texts},
keywords= {fastai},
journal= {},
author= {Callison-Burch et al., 2009},
year= {},
url= {https://www.cis.upenn.edu/~ccb/publications/findings-of-the-wmt09-shared-tasks.pdf},
license= {},
abstract= {French/English parallel texts for training translation models. Over 20 million sentences in French and English. Dataset created by Chris Callison-Burch, who crawled millions of web pages and then used a set of simple heuristics to transform French URLs onto English URLs, and assumed that these documents are translations of each other.},
superseded= {},
terms= {}
}

</description>
<link>https://academictorrents.com/download/2bc57fed1ea43b24296e096aa8746f6faee9513e</link>
</item>
<item>
<title>Wikitext-2 (Dataset)</title>
<description>@article{,
title= {Wikitext-2},
keywords= {fastai},
journal= {},
author= {Stephen Merity et al., 2016},
year= {},
url= {https://arxiv.org/abs/1609.07843},
license= {},
abstract= {A subset of Wikitext-103; useful for testing language model training on smaller datasets.},
superseded= {},
terms= {}
}

</description>
<link>https://academictorrents.com/download/ac7ffa98b66427246a316a81b2ea31c9b58ea5b6</link>
</item>
<item>
<title>Wikitext-103 (Dataset)</title>
<description>@article{,
title= {Wikitext-103},
keywords= {fastai},
journal= {},
author= {Stephen Merity et al., 2016},
year= {},
url= {https://arxiv.org/abs/1609.07843},
license= {},
abstract= {A collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. Widely used for language modeling, including the pretrained models used in the fastai library and ULMFiT algorithm.},
superseded= {},
terms= {}
}

</description>
<link>https://academictorrents.com/download/a4fee5547056c845e31ab952598f43b42333183c</link>
</item>
<item>
<title>IMDb Large Movie Review Dataset (Dataset)</title>
<description>@article{,
title= {IMDb Large Movie Review Dataset},
keywords= {fastai},
journal= {},
author= {Andrew L. Maas et al., 2011},
year= {},
url= {http://ai.stanford.edu/~amaas/papers/wvSent_acl2011.pdf},
license= {},
abstract= {A dataset for binary sentiment classification containing 25,000 highly polarized movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.},
superseded= {},
terms= {}
}

</description>
<link>https://academictorrents.com/download/fd24bc44d461b10288469e05a64a8344eb079f15</link>
</item>
<item>
<title>Microsoft Academic Graph - 2016/02/05 (Dataset)</title>
<description>@article{,
title= {Microsoft Academic Graph - 2016/02/05},
keywords= {},
author= {Arnab Sinha and Zhihong Shen and Yang Song and Hao Ma and Darrin Eide and Bo-June (Paul) Hsu and Kuansan Wang},
abstract= {The Microsoft Academic Graph is a heterogeneous graph containing scientific publication records, citation relationships between those publications, as well as authors, institutions, journals, conferences, and fields of study. This graph is used to power experiences in Bing, Cortana, and in Microsoft Academic.

},
terms= {We kindly request that any published research that makes use of this data cites our data paper listed below.

Arnab Sinha, Zhihong Shen, Yang Song, Hao Ma, Darrin Eide, Bo-June (Paul) Hsu, and Kuansan Wang. 2015. An Overview of Microsoft Academic Service (MAS) and Applications. In Proceedings of the 24th International Conference on World Wide Web (WWW ’15 Companion). ACM, New York, NY, USA, 243-246. DOI=http://dx.doi.org/10.1145/2740908.2742839},
url= {https://academicgraph.blob.core.windows.net/graph-2016-02-05/index.html}
}

</description>
<link>https://academictorrents.com/download/1e0a00b9c606cf87c03e676f75929463c7756fb5</link>
</item>
<item>
<title>MovieLens 20M Dataset (Dataset)</title>
<description>@article{,
title= {MovieLens 20M Dataset},
keywords= {},
journal= {},
author= {},
year= {},
url= {http://files.grouplens.org/datasets/movielens/ml-20m-README.html},
license= {},
abstract= {Stable benchmark dataset. 20 million ratings and 465,000 tag applications applied to 27,000 movies by 138,000 users. Includes tag genome data with 12 million relevance scores across 1,100 tags. Released 4/2015; updated 10/2016 to update links.csv and add tag genome data.

### Summary

This dataset (ml-20m) describes 5-star rating and free-text tagging activity from MovieLens, a movie recommendation service. It contains 20000263 ratings and 465564 tag applications across 27278 movies. These data were created by 138493 users between January 09, 1995 and March 31, 2015. This dataset was generated on October 17, 2016.

Users were selected at random for inclusion. All selected users had rated at least 20 movies. No demographic information is included. Each user is represented by an id, and no other information is provided.

The data are contained in six files: genome-scores.csv, genome-tags.csv, links.csv, movies.csv, ratings.csv and tags.csv. More details about the contents and use of all these files follow.

This and other GroupLens data sets are publicly available for download at http://grouplens.org/datasets/.

#### Further Information About GroupLens

GroupLens is a research group in the Department of Computer Science and Engineering at the University of Minnesota. Since its inception in 1992, GroupLens's research projects have explored a variety of fields including:

* recommender systems
* online communities
* mobile and ubiquitous technologies
* digital libraries
* local geographic information systems

GroupLens Research operates a movie recommender based on collaborative filtering, MovieLens, which is the source of these data. We encourage you to visit http://movielens.org to try it out! If you have exciting ideas for experimental work to conduct on MovieLens, send us an email at grouplens-info@cs.umn.edu - we are always interested in working with external collaborators.

#### Content and Use of Files

#### Verifying the Dataset Contents

We encourage you to verify that the dataset you have on your computer is identical to the ones hosted at grouplens.org. This is an important step if you downloaded the dataset from a location other than grouplens.org, or if you wish to publish research results based on analysis of the MovieLens dataset.

We provide an MD5 checksum with the same name as the downloadable .zip file, but with a .md5 file extension. To verify the dataset:

On Linux:
md5sum ml-20m.zip; cat ml-20m.zip.md5

On OSX:
md5 ml-20m.zip; cat ml-20m.zip.md5

Windows users can download a tool from Microsoft (or elsewhere) that verifies MD5 checksums.

Check that the two lines of output contain the same hash value.
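Where no command-line tool is at hand, the same check can be sketched in Python. This is an illustration, not part of the GroupLens distribution: the function names are made up, and the expected hash is assumed to be copied by hand from the .md5 file, whose exact layout is not described here.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Reading in chunks avoids loading a large archive into memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_hex):
    """Compare a file's MD5 with an expected hex string, ignoring case."""
    return md5_of_file(path) == expected_hex.strip().lower()
```

A call such as `matches_checksum("ml-20m.zip", expected)` would return True when the download is intact.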

#### Formatting and Encoding

The dataset files are written as comma-separated values files with a single header row. Columns that contain commas (,) are escaped using double-quotes ("). These files are encoded as UTF-8. If accented characters in movie titles or tag values (e.g. Misérables, Les (1995)) display incorrectly, make sure that any program reading the data, such as a text editor, terminal, or script, is configured for UTF-8.
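A minimal sketch of reading such a file with Python's csv module, which handles the double-quote escaping. The two-line sample is a hypothetical excerpt in the movies.csv layout, not copied from the dataset.

```python
import csv
import io

# Hypothetical excerpt: a header row plus one row whose title contains a comma.
sample = 'movieId,title,genres\n73,"Misérables, Les (1995)",Drama|War\n'

# For a real file, use open(path, newline="", encoding="utf-8") instead.
reader = csv.DictReader(io.StringIO(sample))
rows = list(reader)
print(rows[0]["title"])
```

Passing `encoding="utf-8"` explicitly avoids depending on the platform's default encoding.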

#### User Ids

MovieLens users were selected at random for inclusion. Their ids have been anonymized. User ids are consistent between ratings.csv and tags.csv (i.e., the same id refers to the same user across the two files).

#### Movie Ids

Only movies with at least one rating or tag are included in the dataset. These movie ids are consistent with those used on the MovieLens web site (e.g., id 1 corresponds to the URL https://movielens.org/movies/1). Movie ids are consistent between ratings.csv, tags.csv, movies.csv, and links.csv (i.e., the same id refers to the same movie across these four data files).

#### Ratings Data File Structure (ratings.csv)

All ratings are contained in the file ratings.csv. Each line of this file after the header row represents one rating of one movie by one user, and has the following format:

userId,movieId,rating,timestamp

The lines within this file are ordered first by userId, then, within user, by movieId.

Ratings are made on a 5-star scale, with half-star increments (0.5 stars - 5.0 stars).

Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.
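A minimal sketch of turning one ratings.csv row into typed values, including the conversion of the timestamp to a UTC datetime. The function name and the example row are illustrative.

```python
from datetime import datetime, timezone


def parse_rating_row(row):
    """Convert one ratings.csv row (a list of four strings) to typed values."""
    user_id, movie_id, rating, timestamp = row
    return (
        int(user_id),
        int(movie_id),
        float(rating),
        # Seconds since 1970-01-01 00:00 UTC, kept timezone-aware.
        datetime.fromtimestamp(int(timestamp), tz=timezone.utc),
    )


# Hypothetical row in userId,movieId,rating,timestamp order.
parsed = parse_rating_row(["1", "2", "3.5", "1112486027"])
print(parsed[3].isoformat())
```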

#### Tags Data File Structure (tags.csv)

All tags are contained in the file tags.csv. Each line of this file after the header row represents one tag applied to one movie by one user, and has the following format:

userId,movieId,tag,timestamp

The lines within this file are ordered first by userId, then, within user, by movieId.

Tags are user-generated metadata about movies. Each tag is typically a single word or short phrase. The meaning, value, and purpose of a particular tag is determined by each user.

Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.

#### Movies Data File Structure (movies.csv)

Movie information is contained in the file movies.csv. Each line of this file after the header row represents one movie, and has the following format:

movieId,title,genres

Movie titles are entered manually or imported from https://www.themoviedb.org/, and include the year of release in parentheses. Errors and inconsistencies may exist in these titles.

Genres are a pipe-separated list, and are selected from the following:

Action
Adventure
Animation
Children's
Comedy
Crime
Documentary
Drama
Fantasy
Film-Noir
Horror
Musical
Mystery
Romance
Sci-Fi
Thriller
War
Western
(no genres listed)
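
A minimal sketch of splitting the genres field. Mapping the "(no genres listed)" placeholder to an empty list is a design choice made for this illustration, not something the dataset prescribes.

```python
NO_GENRES = "(no genres listed)"


def split_genres(field):
    """Return the genres in a pipe-separated field as a list."""
    if field == NO_GENRES:
        return []
    return field.split("|")


print(split_genres("Adventure|Animation|Comedy"))
```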

#### Links Data File Structure (links.csv)

Identifiers that can be used to link to other sources of movie data are contained in the file links.csv. Each line of this file after the header row represents one movie, and has the following format:

movieId,imdbId,tmdbId

movieId is an identifier for movies used by https://movielens.org. E.g., the movie Toy Story has the link https://movielens.org/movies/1.

imdbId is an identifier for movies used by http://www.imdb.com. E.g., the movie Toy Story has the link http://www.imdb.com/title/tt0114709/.

tmdbId is an identifier for movies used by https://www.themoviedb.org. E.g., the movie Toy Story has the link https://www.themoviedb.org/movie/862.

Use of the resources listed above is subject to the terms of each provider.
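
A minimal sketch of building the three links for one links.csv row, following the Toy Story examples above. It assumes imdbId is stored as digits only, without the "tt" prefix, and is read as a string so leading zeros survive. The function name is illustrative.

```python
def external_links(movie_id, imdb_id, tmdb_id):
    """Build the three URLs for one links.csv row (all ids given as strings)."""
    return {
        "movielens": f"https://movielens.org/movies/{movie_id}",
        # Assumption: imdbId holds digits only, without the "tt" prefix.
        "imdb": f"http://www.imdb.com/title/tt{imdb_id}/",
        "tmdb": f"https://www.themoviedb.org/movie/{tmdb_id}",
    }


links = external_links("1", "0114709", "862")
print(links["imdb"])
```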

#### Tag Genome (genome-scores.csv and genome-tags.csv)

This data set includes a current copy of the Tag Genome.

The tag genome is a data structure that contains tag relevance scores for movies. The structure is a dense matrix: each movie in the genome has a value for every tag in the genome.

As described in this article, the tag genome encodes how strongly movies exhibit particular properties represented by tags (atmospheric, thought-provoking, realistic, etc.). The tag genome was computed using a machine learning algorithm on user-contributed content including tags, ratings, and textual reviews.

The genome is split into two files. The file genome-scores.csv contains movie-tag relevance data in the following format:

movieId,tagId,relevance

The second file, genome-tags.csv, provides the tag descriptions for the tag IDs in the genome file, in the following format:

tagId,tag

The tagId values are generated when the data set is exported, so they may vary from version to version of the MovieLens data sets.
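
A minimal sketch of arranging genome-scores.csv rows as the dense movie-by-tag matrix described above, using plain Python lists. The function name and the scores are made up for illustration.

```python
def build_dense_matrix(score_rows):
    """Arrange (movieId, tagId, relevance) triples as a movie-by-tag matrix."""
    movie_ids = sorted({m for m, _, _ in score_rows})
    tag_ids = sorted({t for _, t, _ in score_rows})
    lookup = {(m, t): r for m, t, r in score_rows}
    # Dense means every (movie, tag) pair is present, so plain indexing is safe.
    matrix = [[lookup[(m, t)] for t in tag_ids] for m in movie_ids]
    return movie_ids, tag_ids, matrix


# Hypothetical scores: two movies, two tags.
rows = [(1, 1, 0.025), (1, 2, 0.9), (2, 1, 0.4), (2, 2, 0.1)]
movie_ids, tag_ids, matrix = build_dense_matrix(rows)
print(matrix)
```

With roughly 12 million scores, an array library would be a better fit than nested lists. The structure is the same.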

#### Cross-Validation

Prior versions of the MovieLens dataset included either pre-computed cross-folds or scripts to perform this computation. We no longer bundle either of these features with the dataset, since most modern toolkits provide this as a built-in feature. If you wish to learn about standard approaches to cross-fold computation in the context of recommender systems evaluation, see LensKit for tools, documentation, and open-source code examples.},
superseded= {},
terms= {### Usage License

Neither the University of Minnesota nor any of the researchers involved can guarantee the correctness of the data, its suitability for any particular purpose, or the validity of results based on the use of the data set. The data set may be used for any research purposes under the following conditions:

The user may not state or imply any endorsement from the University of Minnesota or the GroupLens Research Group.
The user must acknowledge the use of the data set in publications resulting from the use of the data set (see below for citation information).
The user may not redistribute the data without separate permission.
The user may not use this information for any commercial or revenue-bearing purposes without first obtaining permission from a faculty member of the GroupLens Research Project at the University of Minnesota.
The executable software scripts are provided "as is" without warranty of any kind, either expressed or implied, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose. The entire risk as to the quality and performance of them is with you. Should the program prove defective, you assume the cost of all necessary servicing, repair or correction.
In no event shall the University of Minnesota, its affiliates or employees be liable to you for any damages arising out of the use or inability to use these programs (including but not limited to loss of data or data being rendered inaccurate).

If you have any further questions or comments, please email grouplens-info@umn.edu

### Citation

To acknowledge use of the dataset in publications, please cite the following paper:

F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages. DOI=http://dx.doi.org/10.1145/2827872}
}

</description>
<link>https://academictorrents.com/download/296054417b4d8eeeb4c7b1c842570bf792ee4d14</link>
</item>
<item>
<title>Sentiment Labelled Sentences Data Set  (Dataset)</title>
<description>@article{,
title= {Sentiment Labelled Sentences Data Set },
keywords= {},
journal= {},
author= {},
year= {},
url= {},
license= {},
abstract= {This dataset was created for the Paper 'From Group to Individual Labels using Deep Features', Kotzias et. al,. KDD 2015 
Please cite the paper if you want to use it :) It contains sentences labelled with positive or negative sentiment. 

### Format: 
sentence score 

### Details: 
Score is either 1 (for positive) or 0 (for negative)
The sentences come from three different websites/fields: 

imdb.com 
amazon.com 
yelp.com 

For each website, there exist 500 positive and 500 negative sentences. Those were selected randomly from larger datasets of reviews.
We attempted to select sentences that have a clearly positive or negative connotation; the goal was for no neutral sentences to be selected.
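
A minimal sketch of parsing one line in the "sentence score" format described above. It assumes the score is the last whitespace-separated token on the line. The function name and the example line are illustrative and not taken from the dataset.

```python
def parse_labelled_line(line):
    """Split a 'sentence score' line into (sentence, label)."""
    # Assumption: the score is the last whitespace-separated token on the line.
    sentence, score = line.rstrip().rsplit(None, 1)
    label = int(score)
    if label not in (0, 1):
        raise ValueError("score must be 0 or 1")
    return sentence, label


# Hypothetical line, not taken from the dataset.
print(parse_labelled_line("The battery life is great.\t1"))
```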



### Attribute Information:
The attributes are text sentences, extracted from reviews of products, movies, and restaurants


### Relevant Papers:
'From Group to Individual Labels using Deep Features', Kotzias et al., KDD 2015
},
superseded= {},
terms= {}
}

</description>
<link>https://academictorrents.com/download/07e05fc1229555e124df72160a01b2540d04cebf</link>
</item>
<item>
<title>Online News Popularity Data Set  (Dataset)</title>
<description>@article{,
title= {Online News Popularity Data Set },
keywords= {},
journal= {},
author= {Kelwin Fernandes and Pedro Vinagre and Paulo Cortez and Pedro Sernadela},
year= {},
url= {},
license= {},
abstract= {##Data Set Information:

* The articles were published by Mashable (www.mashable.com), and the rights to reproduce their content belong to them. Hence, this dataset does not share the original content but some statistics associated with it. The original content can be publicly accessed and retrieved using the provided urls.
* Acquisition date: January 8, 2015
* The relative performance values were estimated by the authors using a Random Forest classifier and a rolling window as the assessment method. See their article for more details on how the relative performance values were set.


##Attribute Information:

Number of Attributes: 61 (58 predictive attributes, 2 non-predictive, 1 goal field) 

0. url: URL of the article (non-predictive) 
1. timedelta: Days between the article publication and the dataset acquisition (non-predictive) 
2. n_tokens_title: Number of words in the title 
3. n_tokens_content: Number of words in the content 
4. n_unique_tokens: Rate of unique words in the content 
5. n_non_stop_words: Rate of non-stop words in the content 
6. n_non_stop_unique_tokens: Rate of unique non-stop words in the content 
7. num_hrefs: Number of links 
8. num_self_hrefs: Number of links to other articles published by Mashable 
9. num_imgs: Number of images 
10. num_videos: Number of videos 
11. average_token_length: Average length of the words in the content 
12. num_keywords: Number of keywords in the metadata 
13. data_channel_is_lifestyle: Is data channel 'Lifestyle'? 
14. data_channel_is_entertainment: Is data channel 'Entertainment'? 
15. data_channel_is_bus: Is data channel 'Business'? 
16. data_channel_is_socmed: Is data channel 'Social Media'? 
17. data_channel_is_tech: Is data channel 'Tech'? 
18. data_channel_is_world: Is data channel 'World'? 
19. kw_min_min: Worst keyword (min. shares) 
20. kw_max_min: Worst keyword (max. shares) 
21. kw_avg_min: Worst keyword (avg. shares) 
22. kw_min_max: Best keyword (min. shares) 
23. kw_max_max: Best keyword (max. shares) 
24. kw_avg_max: Best keyword (avg. shares) 
25. kw_min_avg: Avg. keyword (min. shares) 
26. kw_max_avg: Avg. keyword (max. shares) 
27. kw_avg_avg: Avg. keyword (avg. shares) 
28. self_reference_min_shares: Min. shares of referenced articles in Mashable 
29. self_reference_max_shares: Max. shares of referenced articles in Mashable 
30. self_reference_avg_sharess: Avg. shares of referenced articles in Mashable 
31. weekday_is_monday: Was the article published on a Monday? 
32. weekday_is_tuesday: Was the article published on a Tuesday? 
33. weekday_is_wednesday: Was the article published on a Wednesday? 
34. weekday_is_thursday: Was the article published on a Thursday? 
35. weekday_is_friday: Was the article published on a Friday? 
36. weekday_is_saturday: Was the article published on a Saturday? 
37. weekday_is_sunday: Was the article published on a Sunday? 
38. is_weekend: Was the article published on the weekend? 
39. LDA_00: Closeness to LDA topic 0 
40. LDA_01: Closeness to LDA topic 1 
41. LDA_02: Closeness to LDA topic 2 
42. LDA_03: Closeness to LDA topic 3 
43. LDA_04: Closeness to LDA topic 4 
44. global_subjectivity: Text subjectivity 
45. global_sentiment_polarity: Text sentiment polarity 
46. global_rate_positive_words: Rate of positive words in the content 
47. global_rate_negative_words: Rate of negative words in the content 
48. rate_positive_words: Rate of positive words among non-neutral tokens 
49. rate_negative_words: Rate of negative words among non-neutral tokens 
50. avg_positive_polarity: Avg. polarity of positive words 
51. min_positive_polarity: Min. polarity of positive words 
52. max_positive_polarity: Max. polarity of positive words 
53. avg_negative_polarity: Avg. polarity of negative words 
54. min_negative_polarity: Min. polarity of negative words 
55. max_negative_polarity: Max. polarity of negative words 
56. title_subjectivity: Title subjectivity 
57. title_sentiment_polarity: Title polarity 
58. abs_title_subjectivity: Absolute subjectivity level 
59. abs_title_sentiment_polarity: Absolute polarity level 
60. shares: Number of shares (target)
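
As a sketch of how the attributes split into roles, the snippet below separates the two non-predictive columns and the target from the predictive ones. The helper name `split_roles` and the shortened toy header are illustrative assumptions; the column names follow the numbered list above, and the whitespace stripping is a precaution rather than a documented property of the file.

```python
NON_PREDICTIVE = ("url", "timedelta")  # attributes 0 and 1
TARGET = "shares"                      # attribute 60


def split_roles(header):
    """Group column names into non-predictive, predictive and target."""
    # Assumption: names may carry stray whitespace, so strip them first.
    names = [name.strip() for name in header]
    predictive = [n for n in names if n not in NON_PREDICTIVE and n != TARGET]
    return list(NON_PREDICTIVE), predictive, TARGET


# Toy header with a few of the listed attributes (not the full 61).
header = ["url", " timedelta", " n_tokens_title", " num_imgs", " shares"]
meta, features, target = split_roles(header)
print(features)
```

Applied to the full header, the predictive group should contain the 58 attributes numbered 2 to 59.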


##Relevant Papers:

K. Fernandes, P. Vinagre and P. Cortez. A Proactive Intelligent Decision Support System for Predicting the Popularity of Online News. Proceedings of the 17th EPIA 2015 - Portuguese Conference on Artificial Intelligence, September, Coimbra, Portugal.



##Citation Request:

K. Fernandes, P. Vinagre and P. Cortez. A Proactive Intelligent Decision Support System for Predicting the Popularity of Online News. Proceedings of the 17th EPIA 2015 - Portuguese Conference on Artificial Intelligence, September, Coimbra, Portugal.

##Source:

Kelwin Fernandes (kafc '@' inesctec.pt, kelwinfc '@' gmail.com) - INESC TEC, Porto, Portugal/Universidade do Porto, Portugal. 
Pedro Vinagre (pedro.vinagre.sousa '@' gmail.com) - ALGORITMI Research Centre, Universidade do Minho, Portugal 
Paulo Cortez - ALGORITMI Research Centre, Universidade do Minho, Portugal 
Pedro Sernadela - Universidade de Aveiro},
superseded= {},
terms= {}
}

</description>
<link>https://academictorrents.com/download/95d3b03397a0bafd74a662fe13ba3550c13b7ce1</link>
</item>
<item>
<title>Structured Web Data Extraction Dataset (SWDE) (Dataset)</title>
<description>@article{,
title= {Structured Web Data Extraction Dataset (SWDE)},
keywords= {},
journal= {},
author= {Qiang Hao},
year= {2011},
url= {https://swde.codeplex.com/},
license= {},
abstract= {## Motivation
This dataset is a real-world web page collection used for research on the automatic extraction of structured data (e.g., attribute-value pairs of entities) from the Web. We hope it could serve as a useful benchmark for evaluating and comparing different methods for structured web data extraction.

## Contents of the Dataset

Currently the dataset involves:

8 verticals with diverse semantics;
80 web sites (10 per vertical);
124,291 web pages (200 ~ 2,000 per web site), each containing a single data record with detailed information of an entity;
32 attributes (3 ~ 5 per vertical) associated with carefully labeled ground-truth of corresponding values in each web page. The goal of structured data extraction is to automatically identify the values of these attributes from web pages.
The involved verticals are summarized as follows:

|Vertical  |#Sites|#Pages|#Attributes|Attributes|
|-----------|------|----------|----------------|-----------------|
|Auto|10|17,923|4|model, price, engine, fuel_economy|
|Book|10|20,000|5|title, author, isbn_13, publisher, publication_date|
|Camera|10  |5,258|3|model, price, manufacturer|
|Job|10|20,000|4|title, company, location, date_posted|
|Movie|10|20,000|4|title, director, genre, mpaa_rating|
|NBA Player|10|  4,405|4|name, team, height, weight|
|Restaurant|10|20,000|4|name, address, phone, cuisine|
|University|10|16,705|4|name, phone, website, type|


## Format of Web Pages

Each web page in the dataset is stored as one .htm file (in UTF-8 encoding) where the first tag encodes the source URL of the page.

## Format of Ground-truth Files

For each web site, the page-level ground-truth of attribute values has been labeled using handcrafted regular expressions and stored in .txt files (in UTF-8 encoding) named as: "&lt;vertical&gt;-&lt;site&gt;-&lt;attribute&gt;.txt".

In each such file:

The first line stores the names of vertical, site, and attribute, separated by TAB characters ('\t').

The second line stores some statistics (separated by TABs) w.r.t. the corresponding site and attribute, including:

* the total number of pages,
* the number of pages containing attribute values,
* the total number of attribute values contained in the pages,
* the number of unique attribute values.

Each remaining line stores the ground-truth information (separated by TABs) of one page, in sequence of:
* page ID,
* the number of attribute values in the page,
* attribute values ("&lt;NULL&gt;" in case of non-existence).
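
The layout above can be read with a short parser. This is a sketch under stated assumptions: the function name `parse_ground_truth` and the sample content are made up, the statistics are assumed to be integers, and the "&lt;NULL&gt;" marker is built from character codes so that the snippet needs no escaping.

```python
NULL_MARKER = chr(60) + "NULL" + chr(62)  # "NULL" wrapped in angle brackets


def parse_ground_truth(lines):
    """Parse one ground-truth file given as a list of text lines."""
    rows = [line.rstrip("\r\n").split("\t") for line in lines if line.strip()]
    vertical, site, attribute = rows[0][:3]
    # Second line: total pages, pages with values, total values, unique values.
    stats = [int(field) for field in rows[1][:4]]
    pages = {}
    for fields in rows[2:]:
        page_id, count = fields[0], int(fields[1])
        # Drop the NULL marker so pages without a value get an empty list.
        values = [v for v in fields[2:] if v != NULL_MARKER]
        pages[page_id] = (count, values)
    return (vertical, site, attribute), stats, pages


# Made-up sample content, for illustration only.
sample = [
    "book\texamplesite\tauthor\n",
    "2\t1\t2\t2\n",
    "0000\t2\tJane Doe\tJohn Roe\n",
    "0001\t0\t" + NULL_MARKER + "\n",
]
names, stats, pages = parse_ground_truth(sample)
print(names, stats, pages["0001"])
```

Keeping the stored count next to the parsed values makes it easy to check that the two agree for every page.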

## Notes on Ground-truth Labeling

The ground-truth labeling was conducted in the DOM-node level. More specifically, the candidate attribute values in a web page are the non-empty strings contained in text nodes in the corresponding DOM tree.
One page (although containing a single data record) may contain multiple distinct values that correspond to an attribute (e.g., multiple authors of a book, multiple granularity levels of addresses).
Currently, when a text node presents a mixture of multiple attributes, its string value is labeled with each of these attributes, if no substitute is available.
Before being stored in .txt files, the raw attribute values were refined by removing redundant separators (e.g., ' ', '\t', '\n').

## Reference

We would appreciate it if you cite the following paper when using the dataset:

    Qiang Hao, Rui Cai, Yanwei Pang, and Lei Zhang. "From One Tree to a Forest: a Unified Solution for Structured Web Data Extraction". In Proc. of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2011), pp.775-784, Beijing, China. July 24-28, 2011.

## Contact

If you have questions about this dataset, please contact Qiang Hao (haoq@live.com).},
superseded= {},
terms= {}
}

</description>
<link>https://academictorrents.com/download/411576c7e80787e4b40452360f5f24acba9b5159</link>
</item>
<item>
<title>SMS Spam Collection Data Set  (Dataset)</title>
<description>@article{,
title= {SMS Spam Collection Data Set },
journal= {},
author= {Tiago A. Almeida and José María Gómez Hidalgo},
year= {},
url= {},
abstract= {==Data Set Information:

This corpus has been collected from free or free-for-research sources on the Internet: 

-&gt; A collection of 425 SMS spam messages was manually extracted from the Grumbletext Web site. This is a UK forum in which cell phone users make public claims about SMS spam messages, most of them without reporting the very spam message received. The identification of the text of spam messages in the claims is a very hard and time-consuming task, and it involved carefully scanning hundreds of web pages. The Grumbletext Web site is: [Web Link]. 
-&gt; A subset of 3,375 randomly chosen SMS ham messages from the NUS SMS Corpus (NSC), which is a dataset of about 10,000 legitimate messages collected for research at the Department of Computer Science at the National University of Singapore. The messages largely originate from Singaporeans and mostly from students attending the University. These messages were collected from volunteers who were made aware that their contributions were going to be made publicly available. The NUS SMS Corpus is available at: [Web Link]. 
-&gt; A list of 450 SMS ham messages collected from Caroline Tag's PhD Thesis available at [Web Link]. 
-&gt; Finally, we have incorporated the SMS Spam Corpus v.0.1 Big. It has 1,002 SMS ham messages and 322 spam messages and it is publicly available at: [Web Link]. This corpus has been used in the following academic research: 

==Source:

Tiago A. Almeida (talmeida ufscar.br) 
Department of Computer Science 
Federal University of Sao Carlos (UFSCar) 
Sorocaba, Sao Paulo - Brazil 

José María Gómez Hidalgo (jmgomezh yahoo.es) 
R&amp;D Department Optenet 
Las Rozas, Madrid - Spain


==Publication and More Information

We offer a comprehensive study of this corpus in the following papers. These works present a number of interesting statistics, studies and baseline results for many traditional machine learning methods.

Almeida, T.A., Gómez Hidalgo, J.M., Yamakami, A. Contributions to the Study of SMS Spam Filtering: New Collection and Results.  Proceedings of the 2011 ACM Symposium on Document Engineering (DOCENG'11), Mountain View, CA, USA, 2011.

Gómez Hidalgo, J.M., Almeida, T.A., Yamakami, A. On the Validity of a New SMS Spam Collection.  Proceedings of the 11th IEEE International Conference on Machine Learning and Applications (ICMLA'12), Boca Raton, FL, USA, 2012.

Almeida, T.A., Gómez Hidalgo, J.M., Silva, T.P.  Towards SMS Spam Filtering: Results under a New Dataset.   International Journal of Information Security Science (IJISS), 2(1), 1-18, 2013. 


==Attribute Information:

The collection consists of just one text file, where each line has the correct class followed by the raw message. Some examples are shown below: 

ham What you doing?how are you? 
ham Ok lar... Joking wif u oni... 
ham dun say so early hor... U c already then say... 
ham MY NO. IN LUTON 0125698789 RING ME IF UR AROUND! H* 
ham Siva is in hostel aha:-. 
ham Cos i was out shopping wif darren jus now n i called him 2 ask wat present he wan lor. Then he started guessing who i was wif n he finally guessed darren lor. 
spam FreeMsg: Txt: CALL to No: 86888 &amp; claim your reward of 3 hours talk time to use from your phone now! ubscribe6GBP/ mnth inc 3hrs 16 stop?txtStop 
spam Sunshine Quiz! Win a super Sony DVD recorder if you canname the capital of Australia? Text MQUIZ to 82277. B 
spam URGENT! Your Mobile No 07808726822 was awarded a L2,000 Bonus Caller Prize on 02/09/03! This is our 2nd attempt to contact YOU! Call 0871-872-9758 BOX95QU 

Note: the messages are not chronologically sorted.
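
A minimal sketch of reading this one-line-per-message format. It assumes the class label and the message are separated by whitespace (a tab or a space), which the description does not specify; the function name `parse_sms_line` is illustrative, and the sample lines are shortened from the examples above.

```python
from collections import Counter


def parse_sms_line(line):
    """Split one line into (label, message)."""
    # Assumption: the label and message are separated by whitespace
    # (a tab or a space); split only once so the message stays whole.
    label, message = line.rstrip("\n").split(None, 1)
    if label not in ("ham", "spam"):
        raise ValueError("unexpected class label: " + label)
    return label, message


sample = [
    "ham Ok lar... Joking wif u oni...\n",
    "spam Sunshine Quiz! Win a super Sony DVD recorder\n",
    "ham Siva is in hostel aha:-.\n",
]
counts = Counter(parse_sms_line(line)[0] for line in sample)
print(counts["ham"], counts["spam"])
```

Counting labels this way gives a quick check of the ham/spam balance before training a filter.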


==Citation Request:

If you find this dataset useful, please make a reference to our paper and the web page: [Web Link] in your papers, research, etc. 
Send us a message at talmeida ufscar.br or jmgomezh yahoo.es in case you make use of the corpus. 

We would like to thank Min-Yen Kan and his team for making the NUS SMS Corpus available.},
keywords= {},
terms= {}
}

</description>
<link>https://academictorrents.com/download/25932ba42d983dd7b4474d8f59ab56cdc25d9107</link>
</item>
<item>
<title>Enwiki Word2vec model 1000 Dimensions (Dataset)</title>
<description>@article{,
title= {Enwiki Word2vec model 1000 Dimensions},
journal= {},
author= {Idio},
year= {2015},
url= {https://github.com/idio/wiki2vec},
license= {},
abstract= {Gensim Word2vec model built on the English Wikipedia, 1000 dimensions, 10cbow, no stemming},
keywords= {Wikipedia, nlp, word2vec, english, gensim, deeplearning, natural language, wiki},
terms= {}
}

</description>
<link>https://academictorrents.com/download/5d18911e7036870197bf5e23cf1be96d3353518a</link>
</item>
<item>
<title>Yale YouTube Video Text (Dataset)</title>
<description>@article{,
title= {Yale YouTube Video Text},
journal= {},
author= {Yale},
year= {},
url= {http://vision.ucsd.edu/content/youtube-video-text},
abstract= {YouTube Video Text (YVT) contains 30 videos. Each video is 15 seconds long at 30 frames per second in HD 720p quality, and was collected from YouTube. The text content in the dataset can be divided into two categories: overlay text (e.g., captions, song titles, logos) and scene text (e.g., street signs, business signs, words on shirts).},
keywords= {Dataset}
}

</description>
<link>https://academictorrents.com/download/156802226bcf5747e0bea4e4f14c03b3b952de80</link>
</item>
<item>
<title>Lerman Twitter 2010 Dataset (Dataset)</title>
<description>@article{,
title= {Lerman Twitter 2010 Dataset},
journal= {},
author= {Kristina Lerman },
year= {2010},
license= {This data is made available to the community for research purposes only},
url= {http://www.isi.edu/~lerman/downloads/twitter/twitter2010.html},
abstract= {The Twitter_2010 data set contains tweets containing URLs that were posted on Twitter during October 2010. In addition to tweets, the data set also includes the followee links of tweeting users, allowing us to reconstruct the follower graph of active (tweeting) users.
URLs: 66,059
tweets: 2,859,764
users: 736,930
links: 36,743,448
Tweets

Table (in csv format) link_status_search_with_ordering_real_csv contains tweets with the following information

link: URL within the text of the tweet
id: tweet id
create_at: date added to the db
create_at_long
inreplyto_screen_name: screen name of user this tweet is replying to
inreplyto_user_id: user id of user this tweet is replying to
source: device from which the tweet originated
bad_user_id: alternate user id
user_screen_name: tweeting user screen name
order_of_users: tweet's index within sequence of tweets of the same URL
user_id: user id
Table (in csv format) distinct_users_from_search_table_real_map contains names of tweeting users, and the following information for each user:

user_id: user id
user_screen_name: user name
indegree: number of followers
outdegree: number of friends/followees
bad_user_id: alternate user id
Follower graph

File active_follower_real_sql contains zipped SQL dump of links between tweeting users in the form:

user_id: user id
follower_id: user id of the follower
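
As a sketch of how the follower graph can be rebuilt from these pairs, the snippet below counts followers and followees per user. The function name `degrees` and the sample pairs are hypothetical. Counts derived this way cover only links between tweeting users, so they may differ from the indegree and outdegree columns of the users table.

```python
from collections import Counter


def degrees(links):
    """Count followers (indegree) and followees (outdegree) per user."""
    indegree, outdegree = Counter(), Counter()
    for user_id, follower_id in links:
        indegree[user_id] += 1        # user_id gains one follower
        outdegree[follower_id] += 1   # follower_id follows one more user
    return indegree, outdegree


# Hypothetical (user_id, follower_id) pairs, for illustration only.
links = [(1, 2), (1, 3), (2, 3)]
indegree, outdegree = degrees(links)
print(indegree[1], outdegree[3])
```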
Empirical characterization of this data is described in 
Kristina Lerman, Rumi Ghosh, Tawan Surachawala (2012) "Social Contagion: An Empirical Study of Information Spread on Digg and Twitter Follower Graphs." This data is made available to the community for research purposes only. If you use the data in a publication, please cite the above paper.},
keywords= {twitter},
terms= {}
}

</description>
<link>https://academictorrents.com/download/d8b3a315172c8d804528762f37fa67db14577cdb</link>
</item>
</channel>
</rss>
