Name: The Pile An 800GB Dataset of Diverse Text for Language Modeling
Creator: EleutherAI
Published: 2021-03-01 01:37:09
License: https://academictorrents.com/nolicensespecified

EleutherAI_ThePile_v1 (51 files)

pile/train/16.jsonl.zst	15.27GB
pile/train/15.jsonl.zst	15.28GB
pile/train/14.jsonl.zst	15.22GB
pile/train/13.jsonl.zst	15.21GB
pile/train/12.jsonl.zst	15.26GB
pile/train/11.jsonl.zst	15.22GB
pile/train/10.jsonl.zst	15.23GB
pile/train/09.jsonl.zst	15.22GB
pile/train/08.jsonl.zst	15.23GB
pile/train/07.jsonl.zst	15.31GB
pile/train/06.jsonl.zst	15.26GB
pile/train/05.jsonl.zst	15.21GB
pile/train/04.jsonl.zst	15.19GB
pile/train/03.jsonl.zst	15.19GB
pile/train/02.jsonl.zst	15.21GB
pile/train/01.jsonl.zst	15.21GB
pile/train/00.jsonl.zst	15.24GB
pile/test.jsonl.zst	460.25MB
pile/SHA256SUMS.txt	2.78kB
README.txt	0.10kB
pile/train/17.jsonl.zst	15.31GB
pile/train/18.jsonl.zst	15.31GB
pile/train/19.jsonl.zst	15.28GB
pile/train/20.jsonl.zst	15.21GB
pile/train/21.jsonl.zst	15.31GB
pile/train/22.jsonl.zst	15.30GB
pile/train/23.jsonl.zst	15.29GB
pile/train/24.jsonl.zst	15.19GB
pile/train/25.jsonl.zst	15.20GB
pile/train/26.jsonl.zst	15.20GB
pile/train/27.jsonl.zst	15.22GB
pile/train/28.jsonl.zst	15.22GB
pile/train/29.jsonl.zst	15.22GB
pile/val.jsonl.zst	470.91MB
pile_preliminary_components/2020-09-08-arxiv-extracts-nofallback-until-2007-068.tar.gz	17.48GB
pile_preliminary_components/EuroParliamentProceedings_1996_2011.jsonl.zst	1.48GB
pile_preliminary_components/FreeLaw_Opinions.jsonl.zst	17.01GB
pile_preliminary_components/Literotica.jsonl.zst	4.43GB
pile_preliminary_components/NIH_ExPORTER_awarded_grant_text.jsonl.zst	630.78MB
pile_preliminary_components/PMC_extracts.tar.gz	28.28GB
pile_preliminary_components/PUBMED_title_abstracts_2019_baseline.jsonl.zst	6.90GB
pile_preliminary_components/PhilArchive.jsonl.zst	797.71MB
pile_preliminary_components/books1.tar.gz	2.40GB
pile_preliminary_components/books3.tar.gz	39.52GB
pile_preliminary_components/github.tar	113.35GB
pile_preliminary_components/hn.tar.gz	706.52MB
pile_preliminary_components/openwebtext2.jsonl.zst.tar	29.34GB
pile_preliminary_components/pile_uspto.tar	11.79GB
pile_preliminary_components/stackexchange_dataset.tar	36.80GB
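
The listing includes `pile/SHA256SUMS.txt`, which suggests the downloaded shards can be checked for corruption. The following is a minimal sketch of such a check. It assumes the file uses the conventional `sha256sum` layout (hex digest, whitespace, relative path per line); the actual layout is not shown in this listing, and the helper names `sha256_of_file`, `parse_sums`, and `verify` are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in fixed-size chunks so multi-GB shards never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_sums(text):
    """Parse '<hex digest> <path>' lines (assumed sha256sum layout) into {path: digest}."""
    expected = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        digest, name = line.split(maxsplit=1)
        # sha256sum marks binary-mode entries with a leading '*'
        expected[name.lstrip("*")] = digest.lower()
    return expected


def verify(sums_path, root):
    """Return {path: True/False}; a missing file counts as a failed check."""
    expected = parse_sums(Path(sums_path).read_text())
    results = {}
    for name, digest in expected.items():
        target = Path(root) / name
        results[name] = target.is_file() and sha256_of_file(target) == digest
    return results
```

A call such as `verify("pile/SHA256SUMS.txt", "pile")` would then report which files match, assuming the paths in the checksum file are relative to the `pile` directory.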

Type: Dataset

Tags:

Bibtex:

@article{,
title= {The Pile An 800GB Dataset of Diverse Text for Language Modeling},
journal= {},
author= {EleutherAI},
year= {},
url= {https://the-eye.eu/public/AI/},
abstract= {## What is the Pile?

The Pile is an 825 GiB diverse, open-source language modelling data set that consists of 22 smaller, high-quality datasets combined together.

## Why is the Pile a good training set?
Recent work has shown that, especially for large models, diversity in data sources improves the model's general cross-domain knowledge as well as its downstream generalization capability. In our evaluations, models trained on the Pile not only show moderate improvements on traditional language modeling benchmarks but also show significant improvements on Pile BPB.

## Why is the Pile a good benchmark?
To score well on Pile BPB (bits per byte), a model must be able to understand many disparate domains, including books, GitHub repositories, webpages, chat logs, and medical, physics, math, computer science, and philosophy papers. Pile BPB is a measure of world knowledge and reasoning ability in these domains, making it a robust benchmark of general, cross-domain text modeling ability for large language models.},
keywords= {},
terms= {},
license= {},
superseded= {}
}
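
The bits-per-byte (BPB) metric named in the abstract can be illustrated with a small conversion. This is a minimal sketch, assuming the model's loss on a text is available as a summed negative log-likelihood in nats; the function name `bits_per_byte` is hypothetical and not part of any Pile tooling.

```python
import math


def bits_per_byte(total_nll_nats, text):
    """Convert a summed negative log-likelihood (in nats) into bits per UTF-8 byte."""
    n_bytes = len(text.encode("utf-8"))
    if n_bytes == 0:
        raise ValueError("text must contain at least one byte")
    # Dividing by ln(2) converts nats to bits; dividing by the byte count
    # makes the score independent of the tokenizer used by the model.
    return total_nll_nats / (math.log(2) * n_bytes)
```

Because the denominator counts bytes rather than tokens, scores from models with different vocabularies remain comparable under this definition.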

Date	2023-08-24 13:54:33
Submitter Name	Thomas Heldrup, Vesterbrogade 15, 1 floor, 1620 Copenhagen V, Denmark
Submitter Email	Thomas.heldrup@rettighedsalliancen.dk
Provide a description of the content in question:	The book "Afrikas Horn" by Wilbur Smith, published by Lindhardt og Ringhof A/S in Denmark. There are 108 additional works we represent that are infringed at the URL. An official description of "Afrikas Horn" by Wilbur Smith is available at the following link: https://www.lindhardtogringhof.dk/afrikas-horn-3
How are you authorized to make the request?	Authorised agent
How is the content not covered under the Fair Use Act sections 107 or 108?	The work originates from an illegal filesharing site called bibliotik.me (this is explicit from the paper documenting "The Pile", found here: https://arxiv.org/abs/2101.00027). As the origin of the copy of the content is an illegal source, the content cannot be claimed to fall under the Fair Use doctrine.
Provide a statement that the complaining party has a good faith belief.	I have a good faith belief that the use of the work in this notice is not authorised by the copyright owner, its agent, or the law.
