NNTP Discussion Archives
nyuuzyou

folder main (257 files)
filedata/articles_0a.parquet 797.63MB
filedata/articles_0b.parquet 801.09MB
filedata/articles_0c.parquet 795.18MB
filedata/articles_0d.parquet 797.39MB
filedata/articles_0e.parquet 795.71MB
filedata/articles_0f.parquet 794.47MB
filedata/articles_00.parquet 818.75MB
filedata/articles_01.parquet 797.43MB
filedata/articles_02.parquet 797.84MB
filedata/articles_03.parquet 794.41MB
filedata/articles_04.parquet 796.44MB
filedata/articles_05.parquet 793.85MB
filedata/articles_06.parquet 796.30MB
filedata/articles_07.parquet 796.68MB
filedata/articles_08.parquet 798.36MB
filedata/articles_09.parquet 799.20MB
filedata/articles_1a.parquet 796.58MB
filedata/articles_1b.parquet 799.90MB
filedata/articles_1c.parquet 797.23MB
filedata/articles_1d.parquet 796.84MB
filedata/articles_1e.parquet 797.02MB
filedata/articles_1f.parquet 793.08MB
filedata/articles_2a.parquet 795.72MB
filedata/articles_2b.parquet 796.88MB
filedata/articles_2c.parquet 794.51MB
filedata/articles_2d.parquet 799.53MB
filedata/articles_2e.parquet 803.85MB
filedata/articles_2f.parquet 800.11MB
filedata/articles_3a.parquet 794.45MB
filedata/articles_3b.parquet 796.33MB
filedata/articles_3c.parquet 796.64MB
filedata/articles_3d.parquet 796.02MB
filedata/articles_3e.parquet 794.07MB
filedata/articles_3f.parquet 796.40MB
filedata/articles_4a.parquet 795.82MB
filedata/articles_4b.parquet 798.94MB
filedata/articles_4c.parquet 796.15MB
filedata/articles_4d.parquet 795.83MB
filedata/articles_4e.parquet 797.13MB
filedata/articles_4f.parquet 795.45MB
filedata/articles_5a.parquet 797.98MB
filedata/articles_5b.parquet 799.98MB
filedata/articles_5c.parquet 798.02MB
filedata/articles_5d.parquet 796.24MB
filedata/articles_5e.parquet 795.11MB
filedata/articles_5f.parquet 797.80MB
filedata/articles_6a.parquet 798.62MB
filedata/articles_6b.parquet 795.84MB
filedata/articles_6c.parquet 796.27MB
Type: Dataset

Bibtex:
@article{,
title= {NNTP Discussion Archives},
journal= {},
author= {nyuuzyou},
year= {},
url= {https://huggingface.co/datasets/nyuuzyou/nntp-text-387m},
abstract= {
# NNTP Discussion Archives

A large-scale collection of text discussions from public NNTP (Network News Transfer Protocol) newsgroups spanning over two decades.

## Dataset Statistics

| Metric | Value |
|--------|-------|
| Total messages | 386,629,949 |
| Unique newsgroups | 159,345 |
| Date range | 2002 - 2026 |
| Total size | ~191 GB (compressed) |
| File format | Parquet (ZSTD) |
| Number of files | 256 |
| Average content length | ~1,400 characters |

## Schema

| Column | Type | Description |
|--------|------|-------------|
| `message_id` | `string` | Original message identifier (unchanged) |
| `newsgroups` | `string` | Target newsgroup(s), comma-separated if cross-posted |
| `author` | `string` | Message author with email addresses redacted as `[email]` |
| `subject` | `string` | Subject line |
| `date` | `string` | RFC 2822-formatted date string |
| `content` | `string` | Message body with email addresses redacted as `[email]` |
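
As a minimal sketch of how a row with this schema might be consumed, the snippet below splits the comma-separated `newsgroups` value and parses the RFC 2822 `date` string with the Python standard library. The sample row and the `parse_row` helper are hypothetical and only mirror the column names above; they are not taken from the dataset.

```python
from email.utils import parsedate_to_datetime

# Hypothetical row shaped like the schema above; all values are invented.
row = {
    "message_id": "<example-1@news.invalid>",
    "newsgroups": "alt.atheism,talk.politics.misc",
    "author": "Example Poster <[email]>",
    "subject": "Re: example",
    "date": "Tue, 05 Mar 2002 14:30:00 +0100",
    "content": "Body text, contact me at [email]",
}

def parse_row(row):
    """Return (list of newsgroups, parsed datetime or None)."""
    groups = [g.strip() for g in row["newsgroups"].split(",") if g.strip()]
    try:
        posted = parsedate_to_datetime(row["date"])
    except (TypeError, ValueError):
        # Date headers in archived posts may be malformed; keep the row anyway.
        posted = None
    return groups, posted

groups, posted = parse_row(row)
```

Because `date` is stored as a string, any sorting or range filtering by time would need a parsing step like this one first.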

## Top Newsgroups by Volume

| Newsgroup | Messages |
|-----------|----------|
| alt.atheism | 5,658,023 |
| free.usenet | 4,691,561 |
| alt.fan.rush-limbaugh | 4,659,639 |
| alt.politics | 3,919,772 |
| fr.soc.politique | 3,554,434 |
| it.sport.calcio.milan | 2,961,804 |
| it.politica | 2,802,687 |
| alt.politics.bush | 2,786,316 |
| talk.politics.misc | 2,784,668 |
| Other (159,336 groups) | 475,430,274 |

*Cross-posted messages are counted once per newsgroup, so totals exceed the 386M unique messages.*
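
The counting rule in the note above can be illustrated with a small sketch: each message contributes one count to every newsgroup in its `newsgroups` field, so the per-group total can exceed the number of messages. The `count_by_newsgroup` helper and the sample values are hypothetical.

```python
from collections import Counter

def count_by_newsgroup(newsgroups_values):
    """Count messages per newsgroup, once for each group a message was posted to."""
    counts = Counter()
    for value in newsgroups_values:
        for group in value.split(","):
            group = group.strip()
            if group:
                counts[group] += 1
    return counts

# Three invented messages, two of them cross-posted.
sample = [
    "alt.atheism",
    "alt.atheism,alt.politics",
    "alt.politics,talk.politics.misc",
]
counts = count_by_newsgroup(sample)
total_across_groups = sum(counts.values())
```

Here three messages yield five per-group counts, which is the same effect that makes the table total larger than the unique message count.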

## Data Processing

**Filtering:** The following are excluded: binary-focused groups (`*.binaries.*`, `*.pictures.*`, `*.multimedia.*`), binary posts with file-sharing indicators, messages exceeding 500KB, and messages with unrecoverable encoding errors. Spam is largely **not filtered**: the dataset includes advertisements, phishing, and low-quality posts present in raw newsgroups.
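
A rough sketch of the group-name and size rules, not the actual pipeline: it matches the three wildcard patterns with `fnmatch` and applies a size cap. Reading "500KB" as 500 × 1024 bytes, dropping a cross-posted message when any of its groups matches, and the helper names are all assumptions. Detection of file-sharing indicators is not shown.

```python
from fnmatch import fnmatchcase

BINARY_GROUP_PATTERNS = ("*.binaries.*", "*.pictures.*", "*.multimedia.*")
MAX_BYTES = 500 * 1024  # assumption: "500KB" interpreted as 500 KiB

def is_binary_group(group):
    """True if the newsgroup name matches one of the binary-focused patterns."""
    return any(fnmatchcase(group, pattern) for pattern in BINARY_GROUP_PATTERNS)

def keep_message(newsgroups, raw_body):
    """Apply the group-name and size rules to one message (raw_body is bytes)."""
    groups = [g.strip() for g in newsgroups.split(",") if g.strip()]
    # Assumption: a single binary-focused group is enough to drop the message.
    if any(is_binary_group(g) for g in groups):
        return False
    return len(raw_body) <= MAX_BYTES
```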

**Encoding:** Messages are normalized to UTF-8 with the following decoding pipeline:
- Quoted-Printable: MIME-encoded content decoded to text
- Base64: Text base64 content decoded; binary base64 excluded
- Legacy encodings: Content that is not valid UTF-8 is decoded using detected legacy encodings, including Windows-1252, ISO-8859-*, KOI8-R, Shift-JIS, and GBK
- MIME encoded-word headers decoded to UTF-8
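
The decoding steps above could look roughly like the following sketch, which uses only the Python standard library. It is not the pipeline used to build the dataset: the fixed fallback order (UTF-8, then Windows-1252, then Latin-1) stands in for real encoding detection, the binary-base64 check is omitted, and the function names are hypothetical.

```python
import base64
import quopri
from email.header import decode_header

# Assumption: a fixed fallback order instead of statistical encoding detection.
FALLBACK_ENCODINGS = ("utf-8", "cp1252", "latin-1")

def decode_body(raw, transfer_encoding=""):
    """Undo the transfer encoding, then decode the bytes to text."""
    cte = transfer_encoding.strip().lower()
    if cte == "quoted-printable":
        raw = quopri.decodestring(raw)
    elif cte == "base64":
        raw = base64.b64decode(raw)
    for encoding in FALLBACK_ENCODINGS:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Only reached if the fallback list is edited to omit latin-1,
    # which accepts every byte value.
    return raw.decode("utf-8", errors="replace")

def decode_header_value(value):
    """Decode MIME encoded-words (RFC 2047) in a header value."""
    parts = []
    for chunk, charset in decode_header(value):
        if isinstance(chunk, bytes):
            try:
                parts.append(chunk.decode(charset or "ascii", errors="replace"))
            except LookupError:
                # Unknown charset label: fall back to latin-1.
                parts.append(chunk.decode("latin-1"))
        else:
            parts.append(chunk)
    return "".join(parts)
```

A production pipeline would likely need a detector that can distinguish multi-byte encodings such as Shift-JIS and GBK, which this fallback chain cannot do.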

**Deduplication:** Exact content duplicates removed via xxHash64 hashing (first occurrence retained).
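
A minimal sketch of first-occurrence deduplication by content hash. To stay within the standard library it uses an 8-byte BLAKE2b digest as a stand-in for xxHash64, so the hash values differ from those of the real pipeline. The `content_key` and `deduplicate` helpers are hypothetical.

```python
import hashlib

def content_key(content):
    """64-bit key for a message body (BLAKE2b stand-in, not xxHash64)."""
    return hashlib.blake2b(content.encode("utf-8"), digest_size=8).digest()

def deduplicate(rows):
    """Yield rows in order, skipping any whose content hash was already seen."""
    seen = set()
    for row in rows:
        key = content_key(row["content"])
        if key in seen:
            continue  # later duplicates are dropped; the first one was kept
        seen.add(key)
        yield row

# Invented rows: "c" repeats the content of "a".
rows = [
    {"message_id": "a", "content": "same text"},
    {"message_id": "b", "content": "other text"},
    {"message_id": "c", "content": "same text"},
]
kept_ids = [r["message_id"] for r in deduplicate(rows)]
```

Since only the body is hashed, two posts with identical text but different authors or newsgroups would collapse into one row under this scheme.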

**Privacy:** Email addresses in `author` and `content` fields redacted as `[email]`; `message_id` unchanged.
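
The redaction step might be approximated as below. The regular expression is an assumption chosen for illustration and is deliberately simple; the pattern actually used for the dataset is not documented here.

```python
import re

# Assumption: a simple address pattern; real-world addresses can be more exotic.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def redact_emails(text):
    """Replace anything that looks like an email address with the literal [email]."""
    return EMAIL_RE.sub("[email]", text)

redacted_author = redact_emails("Jane Doe <jane.doe@example.com>")
```

Message identifiers typically have the same `local@domain` shape, so a pattern like this would alter them too; that is consistent with applying redaction only to `author` and `content` while leaving `message_id` untouched.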

## Considerations

- Messages were posted to public newsgroups
- Content reflects unmoderated discussions and may contain controversial opinions},
keywords= {discussions, historical},
terms= {},
license= {},
superseded= {}
}

