main (257 files)
data/articles_0a.parquet | 797.63MB
data/articles_0b.parquet | 801.09MB
data/articles_0c.parquet | 795.18MB
data/articles_0d.parquet | 797.39MB
data/articles_0e.parquet | 795.71MB
data/articles_0f.parquet | 794.47MB
data/articles_00.parquet | 818.75MB
data/articles_01.parquet | 797.43MB
data/articles_02.parquet | 797.84MB
data/articles_03.parquet | 794.41MB
data/articles_04.parquet | 796.44MB
data/articles_05.parquet | 793.85MB
data/articles_06.parquet | 796.30MB
data/articles_07.parquet | 796.68MB
data/articles_08.parquet | 798.36MB
data/articles_09.parquet | 799.20MB
data/articles_1a.parquet | 796.58MB
data/articles_1b.parquet | 799.90MB
data/articles_1c.parquet | 797.23MB
data/articles_1d.parquet | 796.84MB
data/articles_1e.parquet | 797.02MB
data/articles_1f.parquet | 793.08MB
data/articles_2a.parquet | 795.72MB
data/articles_2b.parquet | 796.88MB
data/articles_2c.parquet | 794.51MB
data/articles_2d.parquet | 799.53MB
data/articles_2e.parquet | 803.85MB
data/articles_2f.parquet | 800.11MB
data/articles_3a.parquet | 794.45MB
data/articles_3b.parquet | 796.33MB
data/articles_3c.parquet | 796.64MB
data/articles_3d.parquet | 796.02MB
data/articles_3e.parquet | 794.07MB
data/articles_3f.parquet | 796.40MB
data/articles_4a.parquet | 795.82MB
data/articles_4b.parquet | 798.94MB
data/articles_4c.parquet | 796.15MB
data/articles_4d.parquet | 795.83MB
data/articles_4e.parquet | 797.13MB
data/articles_4f.parquet | 795.45MB
data/articles_5a.parquet | 797.98MB
data/articles_5b.parquet | 799.98MB
data/articles_5c.parquet | 798.02MB
data/articles_5d.parquet | 796.24MB
data/articles_5e.parquet | 795.11MB
data/articles_5f.parquet | 797.80MB
data/articles_6a.parquet | 798.62MB
data/articles_6b.parquet | 795.84MB
data/articles_6c.parquet | 796.27MB
Type: Dataset
Bibtex:
@article{,
title= {NNTP Discussion Archives},
journal= {},
author= {nyuuzyou},
year= {},
url= {https://huggingface.co/datasets/nyuuzyou/nntp-text-387m},
abstract= {
# NNTP Discussion Archives
A large-scale collection of text discussions from public NNTP (Network News Transfer Protocol) newsgroups spanning over two decades.
## Dataset Statistics
| Metric | Value |
|--------|-------|
| Total messages | 386,629,949 |
| Unique newsgroups | 159,345 |
| Date range | 2002 - 2026 |
| Total size | ~191 GB (compressed) |
| File format | Parquet (ZSTD) |
| Number of files | 256 |
| Average content length | ~1,400 characters |
## Schema
| Column | Type | Description |
|--------|------|-------------|
| `message_id` | `string` | Original message identifier (unchanged) |
| `newsgroups` | `string` | Target newsgroup(s), comma-separated if cross-posted |
| `author` | `string` | Message author with email addresses redacted as `[email]` |
| `subject` | `string` | Subject line |
| `date` | `string` | RFC 2822 formatted date string |
| `content` | `string` | Message body with email addresses redacted as `[email]` |
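
As a minimal sketch of working with these columns, the snippet below splits the comma-separated `newsgroups` value and parses the RFC 2822 `date` string using only the Python standard library. The helper names and the sample row are hypothetical and made up for illustration; they are not part of the dataset or its tooling.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime
from typing import List, Optional


def split_newsgroups(value: str) -> List[str]:
    """Split the comma-separated `newsgroups` field into group names."""
    return [g.strip() for g in value.split(",") if g.strip()]


def parse_date(value: str) -> Optional[datetime]:
    """Parse an RFC 2822 date string; return None if it is malformed."""
    try:
        return parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None


# Hypothetical row shaped like the schema above (all values are made up).
row = {
    "message_id": "<example-1@news.invalid>",
    "newsgroups": "alt.atheism, talk.politics.misc",
    "author": "Example User <[email]>",
    "subject": "Re: example",
    "date": "Tue, 05 Mar 2002 14:30:00 +0000",
    "content": "Hello",
}

groups = split_newsgroups(row["newsgroups"])
posted = parse_date(row["date"])
```

Returning `None` for unparseable dates is a choice made for this sketch, on the assumption that raw newsgroup data may contain malformed `Date` headers.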
## Top Newsgroups by Volume
| Newsgroup | Messages |
|-----------|----------|
| alt.atheism | 5,658,023 |
| free.usenet | 4,691,561 |
| alt.fan.rush-limbaugh | 4,659,639 |
| alt.politics | 3,919,772 |
| fr.soc.politique | 3,554,434 |
| it.sport.calcio.milan | 2,961,804 |
| it.politica | 2,802,687 |
| alt.politics.bush | 2,786,316 |
| talk.politics.misc | 2,784,668 |
| Other (159,336 groups) | 475,430,274 |
*Cross-posted messages are counted once per newsgroup, so totals exceed the 386M unique messages.*
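
The per-newsgroup counting described in the note can be sketched as follows. The input values are hypothetical, and using a set so that a group repeated within one header counts only once is an assumption of this sketch, not a documented detail of how the table was produced.

```python
from collections import Counter

# Hypothetical `newsgroups` values for three messages (made up).
newsgroups_column = [
    "alt.atheism",
    "alt.atheism,alt.politics",
    "alt.politics,talk.politics.misc,alt.atheism",
]

per_group = Counter()
for value in newsgroups_column:
    # Assumption: a group repeated within one header counts only once.
    for group in {g.strip() for g in value.split(",") if g.strip()}:
        per_group[group] += 1

unique_messages = len(newsgroups_column)
per_group_total = sum(per_group.values())
```

Here three unique messages produce a per-group total of six, which is the same effect that makes the table's totals exceed the unique message count.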
## Data Processing
**Filtering:** Binary-focused groups (`*.binaries.*`, `*.pictures.*`, `*.multimedia.*`), binary posts with file-sharing indicators, messages exceeding 500KB, and messages with unrecoverable encoding errors are excluded. Spam is largely **not filtered**: the dataset includes advertisements, phishing, and low-quality posts present in raw newsgroups.
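
A rough sketch of the group and size filters, assuming shell-style glob matching on group names. Reading "500KB" as 500 × 1024 bytes of UTF-8 text, and dropping a cross-posted message when any of its groups matches, are both assumptions. The function names are hypothetical, and the check for file-sharing indicators is not shown.

```python
from fnmatch import fnmatchcase

BINARY_GROUP_PATTERNS = ("*.binaries.*", "*.pictures.*", "*.multimedia.*")
MAX_BYTES = 500 * 1024  # assumption: "500KB" read as 500 KiB of UTF-8 text


def is_binary_group(group: str) -> bool:
    """Return True if the group name matches a binary-focused pattern."""
    return any(fnmatchcase(group, p) for p in BINARY_GROUP_PATTERNS)


def keep_message(newsgroups: str, content: str) -> bool:
    """Apply the group and size filters to a single message."""
    groups = [g.strip() for g in newsgroups.split(",") if g.strip()]
    # Assumption: drop the message if any of its groups is binary-focused.
    if any(is_binary_group(g) for g in groups):
        return False
    return len(content.encode("utf-8")) <= MAX_BYTES
```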
**Encoding:** Messages are normalized to UTF-8 with the following decoding pipeline:
- Quoted-Printable: MIME-encoded content decoded to text
- Base64: Text base64 content decoded; binary base64 excluded
- Legacy encodings: Content that is not valid UTF-8 is converted to UTF-8 after detecting its legacy encoding (Windows-1252, ISO-8859-*, KOI8-R, Shift-JIS, GBK, and others)
- MIME encoded-word headers decoded to UTF-8
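
The decoding steps above can be approximated with the Python standard library. This is only a sketch: the fixed fallback order stands in for real encoding detection (an assumption), the function names are hypothetical, and the exclusion of binary base64 payloads is not implemented here.

```python
import base64
import quopri
from email.header import decode_header, make_header

# Assumption: a fixed fallback order stands in for real encoding detection.
FALLBACK_ENCODINGS = ("utf-8", "cp1252", "latin-1")


def decode_bytes(raw: bytes) -> str:
    """Try UTF-8 first, then a few legacy encodings."""
    for enc in FALLBACK_ENCODINGS:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Only reached if the fallback list is edited to drop latin-1.
    return raw.decode("utf-8", errors="replace")


def decode_body(payload: bytes, transfer_encoding: str = "") -> str:
    """Undo the Content-Transfer-Encoding, then decode to text."""
    cte = transfer_encoding.lower()
    if cte == "quoted-printable":
        payload = quopri.decodestring(payload)
    elif cte == "base64":
        payload = base64.b64decode(payload)
    return decode_bytes(payload)


def decode_subject(value: str) -> str:
    """Decode MIME encoded-word headers such as =?utf-8?q?...?=."""
    return str(make_header(decode_header(value)))
```

For example, `decode_body(b"caf=C3=A9", "quoted-printable")` and `decode_subject("=?iso-8859-1?q?caf=E9?=")` both yield `café`.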
**Deduplication:** Exact content duplicates removed via xxHash64 hashing (first occurrence retained).
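
First-occurrence deduplication by content hash might look like the sketch below. It uses the third-party `xxhash` package when installed and otherwise falls back to a 64-bit BLAKE2b digest from `hashlib`, which is a stand-in and not xxHash64. Hashing the UTF-8 bytes of `content` is an assumption, and the function names are hypothetical.

```python
import hashlib

try:
    import xxhash  # third-party package providing xxh64

    def content_hash(text: str) -> int:
        return xxhash.xxh64(text.encode("utf-8")).intdigest()

except ImportError:
    # Stand-in when xxhash is missing: 64-bit BLAKE2b (not xxHash64).
    def content_hash(text: str) -> int:
        digest = hashlib.blake2b(text.encode("utf-8"), digest_size=8).digest()
        return int.from_bytes(digest, "big")


def dedupe(rows):
    """Yield rows whose `content` hash is new (first occurrence wins)."""
    seen = set()
    for row in rows:
        h = content_hash(row["content"])
        if h not in seen:
            seen.add(h)
            yield row
```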
**Privacy:** Email addresses in `author` and `content` fields redacted as `[email]`; `message_id` unchanged.
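
The redaction step can be illustrated with a simple regular expression. The pattern below is an assumption chosen for readability; the pattern actually used to build the dataset is not documented here, and `redact_emails` is a hypothetical name.

```python
import re

# Assumption: a deliberately simple address pattern for illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_emails(text: str) -> str:
    """Replace anything that looks like an email address with [email]."""
    return EMAIL_RE.sub("[email]", text)
```

Applied to `Jane Doe <jane.doe@example.com>`, this returns `Jane Doe <[email]>`, matching the format shown in the `author` column.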
## Considerations
- Messages were posted to public newsgroups
- Content reflects unmoderated discussions and may contain controversial opinions},
keywords= {discussions, historical},
terms= {},
license= {},
superseded= {}
}