The Pile: An 800GB Dataset of Diverse Text for Language Modeling
EleutherAI

folder EleutherAI_ThePile_v1 (51 files)
  README.txt  0.10 kB
  pile/SHA256SUMS.txt  2.78 kB
  pile/test.jsonl.zst  460.25 MB
  pile/val.jsonl.zst  470.91 MB
  pile/train/00.jsonl.zst  15.24 GB
  pile/train/01.jsonl.zst  15.21 GB
  pile/train/02.jsonl.zst  15.21 GB
  pile/train/03.jsonl.zst  15.19 GB
  pile/train/04.jsonl.zst  15.19 GB
  pile/train/05.jsonl.zst  15.21 GB
  pile/train/06.jsonl.zst  15.26 GB
  pile/train/07.jsonl.zst  15.31 GB
  pile/train/08.jsonl.zst  15.23 GB
  pile/train/09.jsonl.zst  15.22 GB
  pile/train/10.jsonl.zst  15.23 GB
  pile/train/11.jsonl.zst  15.22 GB
  pile/train/12.jsonl.zst  15.26 GB
  pile/train/13.jsonl.zst  15.21 GB
  pile/train/14.jsonl.zst  15.22 GB
  pile/train/15.jsonl.zst  15.28 GB
  pile/train/16.jsonl.zst  15.27 GB
  pile/train/17.jsonl.zst  15.31 GB
  pile/train/18.jsonl.zst  15.31 GB
  pile/train/19.jsonl.zst  15.28 GB
  pile/train/20.jsonl.zst  15.21 GB
  pile/train/21.jsonl.zst  15.31 GB
  pile/train/22.jsonl.zst  15.30 GB
  pile/train/23.jsonl.zst  15.29 GB
  pile/train/24.jsonl.zst  15.19 GB
  pile/train/25.jsonl.zst  15.20 GB
  pile/train/26.jsonl.zst  15.20 GB
  pile/train/27.jsonl.zst  15.22 GB
  pile/train/28.jsonl.zst  15.22 GB
  pile/train/29.jsonl.zst  15.22 GB
  pile_preliminary_components/2020-09-08-arxiv-extracts-nofallback-until-2007-068.tar.gz  17.48 GB
  pile_preliminary_components/EuroParliamentProceedings_1996_2011.jsonl.zst  1.48 GB
  pile_preliminary_components/FreeLaw_Opinions.jsonl.zst  17.01 GB
  pile_preliminary_components/Literotica.jsonl.zst  4.43 GB
  pile_preliminary_components/NIH_ExPORTER_awarded_grant_text.jsonl.zst  630.78 MB
  pile_preliminary_components/PMC_extracts.tar.gz  28.28 GB
  pile_preliminary_components/PUBMED_title_abstracts_2019_baseline.jsonl.zst  6.90 GB
  pile_preliminary_components/PhilArchive.jsonl.zst  797.71 MB
  pile_preliminary_components/books1.tar.gz  2.40 GB
  pile_preliminary_components/books3.tar.gz  39.52 GB
  pile_preliminary_components/github.tar  113.35 GB
  pile_preliminary_components/hn.tar.gz  706.52 MB
  pile_preliminary_components/openwebtext2.jsonl.zst.tar  29.34 GB
  pile_preliminary_components/pile_uspto.tar  11.79 GB
  pile_preliminary_components/stackexchange_dataset.tar  36.80 GB
(The listing above shows 49 of the 51 files; 2 additional files are not shown.)
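Each split file (pile/train/*.jsonl.zst, pile/val.jsonl.zst, pile/test.jsonl.zst) is a Zstandard-compressed JSON Lines file with one document per line. Below is a minimal sketch of checking a shard against pile/SHA256SUMS.txt and streaming its records; it assumes the third-party `zstandard` package, the standard `sha256sum`-style checksum format, and the Pile's published record layout of `{"text": ..., "meta": {"pile_set_name": ...}}`.

```python
import hashlib
import io
import json

import zstandard as zstd  # third-party: pip install zstandard


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large shards never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def stream_records(path: str):
    """Yield one JSON document at a time from a .jsonl.zst shard,
    decompressing as a stream rather than all at once."""
    with open(path, "rb") as fh:
        reader = zstd.ZstdDecompressor().stream_reader(fh)
        for line in io.TextIOWrapper(reader, encoding="utf-8"):
            yield json.loads(line)


# Usage: compare the printed digest to the matching line in
# pile/SHA256SUMS.txt ("<hex digest>  <relative path>"), then peek at records.
print(sha256_of("pile/train/00.jsonl.zst"))

for i, record in enumerate(stream_records("pile/train/00.jsonl.zst")):
    print(record["meta"].get("pile_set_name"), record["text"][:80])
    if i == 2:
        break
```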
Type: Dataset

Bibtex:
@article{eleutherai_the_pile,
title= {The Pile: An 800GB Dataset of Diverse Text for Language Modeling},
journal= {},
author= {EleutherAI},
year= {},
url= {https://the-eye.eu/public/AI/},
abstract= {## What is the Pile?

The Pile is an 825 GiB diverse, open-source language modelling dataset that combines 22 smaller, high-quality datasets.

## Why is the Pile a good training set?
Recent work has shown that, especially for large models, diversity in data sources improves a model's general cross-domain knowledge as well as its downstream generalization capability. In our evaluations, not only do models trained on the Pile show moderate improvements on traditional language modeling benchmarks, but they also show significant improvements on Pile BPB.

## Why is the Pile a good benchmark?
To score well on Pile BPB (bits per byte), a model must be able to understand many disparate domains, including books, GitHub repositories, webpages, chat logs, and medical, physics, math, computer science, and philosophy papers. Pile BPB is a measure of world knowledge and reasoning ability across these domains, making it a robust benchmark of general, cross-domain text modeling ability for large language models.},
keywords= {},
terms= {},
license= {},
superseded= {}
}
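For context on the Pile BPB metric mentioned in the abstract: bits per byte normalizes a model's total cross-entropy loss by the size of the evaluated text in UTF-8 bytes, which makes scores comparable across models with different tokenizers. A minimal sketch of the conversion (the function name and example numbers are illustrative, not taken from the Pile evaluation code):

```python
import math


def bits_per_byte(total_loss_nats: float, num_utf8_bytes: int) -> float:
    """Convert summed cross-entropy loss (in nats) over a text into
    bits per byte: divide by ln(2) to get bits, then by the byte count."""
    return total_loss_nats / (math.log(2) * num_utf8_bytes)


# Illustrative numbers: 2.0e6 nats of total loss over a 1.5 MB document
# works out to roughly 1.92 bits per byte.
print(bits_per_byte(2.0e6, 1_500_000))
```

Because the denominator is bytes of raw text rather than tokens, a model cannot improve its score simply by using a coarser tokenization; lower BPB genuinely means the text is compressed better.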

