Name: NotaBug Code Dataset
Creator: nyuuzyou
Published: 2026-01-09 09:02:09
License: https://academictorrents.com/nolicensespecified

main (13 files)

data/notabug_0000.parquet	946.30MB
data/notabug_0001.parquet	1.05GB
data/notabug_0002.parquet	1.09GB
data/notabug_0003.parquet	1.14GB
data/notabug_0004.parquet	1.05GB
data/notabug_0005.parquet	782.05MB
data/notabug_0006.parquet	1.19GB
data/notabug_0007.parquet	1.08GB
data/notabug_0008.parquet	933.83MB
data/notabug_0009.parquet	856.30MB
data/notabug_0010.parquet	902.77MB
data/notabug_0011.parquet	1.58GB
README.md	5.44kB

Type: Dataset

Tags:

Bibtex:

@article{,
title= {NotaBug Code Dataset},
journal= {},
author= {nyuuzyou},
year= {},
url= {https://huggingface.co/datasets/nyuuzyou/notabug-code},
abstract= {# NotaBug Code Dataset

## Dataset Description

This dataset was compiled from code repositories hosted on [NotaBug.org](https://notabug.org), a free code hosting platform that emphasizes software freedom and privacy. NotaBug is built on a fully free software stack and is popular among free software advocates and privacy-conscious developers.

### Dataset Summary

| Statistic | Value |
|-----------|-------|
| **Total Files** | 12,622,961 |
| **Total Repositories** | 11,660 |
| **Total Size** | 12 GB (compressed Parquet) |
| **Languages / File Types** | 6,306 (inferred from file extension) |
| **File Format** | Parquet with Zstd compression (12 files) |

### Key Features

- **Free-software-focused corpus**: Contains code from repositories hosted on a platform dedicated to software freedom
- **Diverse language coverage**: Spans thousands of file types identified by file extension
- **Rich metadata**: Includes repository name, file path, detected language, license information, and file size

### Languages

The dataset includes files from many programming languages and file types. Languages are detected by file extension. The top 30 languages by file count:

| Rank | Language | File Count |
|------|----------|------------|
| 1 | C++ | 2,219,208 |
| 2 | po | 2,022,441 |
| 3 | none | 1,572,451 |
| 4 | PHP | 951,354 |
| 5 | patch | 637,317 |
| 6 | svg | 547,170 |
| 7 | XML | 502,139 |
| 8 | Python | 392,476 |
| 9 | Text | 296,953 |
| 10 | JavaScript | 233,368 |
| 11 | JSON | 198,981 |
| 12 | Scheme | 192,409 |
| 13 | Markdown | 182,342 |
| 14 | info | 155,078 |
| 15 | slackbuild | 154,859 |
| 16 | HTML | 149,824 |
| 17 | Shell | 133,325 |
| 18 | log | 127,393 |
| 19 | Makefile | 112,989 |
| 20 | INI | 110,537 |
| 21 | Lua | 84,303 |
| 22 | in | 75,138 |
| 23 | Assembly | 74,519 |
| 24 | list | 58,346 |
| 25 | Java | 48,781 |
| 26 | CSS | 48,112 |
| 27 | mk | 47,373 |
| 28 | dtsi | 43,825 |
| 29 | diff | 42,125 |
| 30 | el | 41,017 |

### Licenses

The dataset includes files from repositories with various licenses:

| License | File Count |
|---------|------------|
| mit | 10,029,349 |
| mpl-2.0 | 1,178,420 |
| unknown | 888,840 |
| gpl-2.0 | 333,538 |
| gpl-3.0 | 158,975 |
| unlicense | 11,805 |
| cc-by-4.0 | 8,367 |
| bsd-2-clause | 4,718 |
| agpl-3.0 | 3,055 |
| cc-by-sa-4.0 | 2,309 |
| wtfpl | 1,314 |
| cc0-1.0 | 1,188 |
| bsd-3-clause | 601 |
| cc-by-nc-4.0 | 269 |
| lgpl-3.0 | 137 |
| lgpl-2.1 | 76 |

## Dataset Structure

### Data Fields

| Field | Type | Description |
|-------|------|-------------|
| `code` | string | Content of the source file (UTF-8 encoded) |
| `repo_name` | string | Name of the NotaBug repository (format: `username/repo`) |
| `path` | string | Path of the file within the repository (relative to repo root) |
| `language` | string | Programming language as inferred by file extension |
| `license` | string | License of the repository (lowercase SPDX-style identifier such as `mit`, or "unknown") |
| `size` | int64 | Size of the source file in bytes |
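
A minimal sketch of checking one row against the fields listed above. The helper name `validate_record` and the `EXPECTED_FIELDS` table are illustrative and not part of the dataset tooling; the type mapping (string to `str`, int64 to `int`) is an assumption about how rows look once loaded into Python.

```python
from typing import Any

# Field names and the Python types assumed to correspond to the table above
# (string -> str, int64 -> int).
EXPECTED_FIELDS = {
    "code": str,
    "repo_name": str,
    "path": str,
    "language": str,
    "license": str,
    "size": int,
}


def validate_record(record: dict[str, Any]) -> list[str]:
    """Return a list of problems found in one row; an empty list means none."""
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            problems.append(f"wrong type for {name}")
    # The documented format for repo_name is `username/repo`.
    repo = record.get("repo_name")
    if isinstance(repo, str) and repo.count("/") != 1:
        problems.append("repo_name is not username/repo")
    return problems
```

A check like this is cheap to run on a sample of rows before committing to a longer processing job.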

### Data Format

- **Format**: Apache Parquet with Zstd compression
- **File Structure**: 12 files (`notabug_0000.parquet` to `notabug_0011.parquet`)
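
The shard layout above can be read one file at a time. The following is a sketch, assuming the third-party `pyarrow` package is installed and the files sit under a local `data/` directory; the helper names `shard_paths` and `iter_rows` are hypothetical.

```python
from pathlib import PurePosixPath
from typing import Iterator, Optional


def shard_paths(root: str = "data", n_shards: int = 12) -> list[str]:
    """Build the shard names notabug_0000.parquet .. notabug_0011.parquet."""
    return [
        str(PurePosixPath(root) / f"notabug_{i:04d}.parquet")
        for i in range(n_shards)
    ]


def iter_rows(
    path: str,
    columns: Optional[list[str]] = None,
    batch_size: int = 1024,
) -> Iterator[dict]:
    """Stream rows from one shard without loading the whole file into memory."""
    # Imported lazily so shard_paths() works without pyarrow installed.
    import pyarrow.parquet as pq

    parquet_file = pq.ParquetFile(path)
    for batch in parquet_file.iter_batches(batch_size=batch_size, columns=columns):
        # Each batch is a RecordBatch; to_pylist() yields one dict per row.
        yield from batch.to_pylist()
```

Passing `columns=["repo_name", "path", "license", "size"]` skips the large `code` column, which helps when only metadata is needed. Zstd decompression is handled by the Parquet reader, provided the installed `pyarrow` build includes that codec.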

### Data Splits

All examples are in the train split. There is no validation or test split.

### Example Data Point

```
{
    'code': '#!/usr/bin/env python2\n# -*- coding: utf-8 -*-\n# Copyright (C) 2014...',
    'repo_name': 'intermsofthewhole/libreboot',
    'path': 'resources/utilities/i945gpu/intel-regs.py',
    'language': 'Python',
    'license': 'mit',
    'size': 3733
}
```

## Dataset Creation

### Source Data

All data originates from public repositories hosted on [NotaBug.org](https://notabug.org).

### Language Detection

Programming languages are inferred from each file's extension.
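
The exact extension table used for this dataset is not documented. The sketch below only illustrates the general technique. The mapping `EXTENSION_TO_LANGUAGE`, the `"none"` label for files without an extension, and the fallback to the bare extension are all assumptions, loosely suggested by labels such as `none`, `po`, and `patch` in the language table.

```python
import os

# Hypothetical extension table; the dataset's real mapping is not documented.
EXTENSION_TO_LANGUAGE = {
    ".py": "Python",
    ".php": "PHP",
    ".js": "JavaScript",
    ".lua": "Lua",
    ".java": "Java",
}


def infer_language(path: str) -> str:
    """Guess a language label from the file extension of `path`."""
    _, ext = os.path.splitext(os.path.basename(path))
    ext = ext.lower()
    if not ext:
        # Assumed label for files that have no extension at all.
        return "none"
    # Assumed fallback: unmapped extensions keep the bare extension as label.
    return EXTENSION_TO_LANGUAGE.get(ext, ext.lstrip("."))
```

Extension-based inference is fast but coarse: it cannot tell a Python 2 script from a Python 3 one, and it mislabels files whose extension does not match their content.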

### License Detection

Licenses are detected by scanning for license files in repositories and matching against known license patterns. Repositories without a detectable license are marked as "unknown".
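
Because undetected licenses are stored as "unknown", downstream users may want to keep only rows with licenses they have reviewed. A minimal sketch, assuming rows are plain dictionaries with the documented `license` field; the allowlist is only an example built from labels that appear in this dataset, and the name `filter_by_license` is hypothetical.

```python
from typing import Iterable, Iterator

# Example allowlist using license labels that appear in this dataset.
# Which licenses are acceptable is a decision for each user to make.
ALLOWED_LICENSES = frozenset(
    {"mit", "bsd-2-clause", "bsd-3-clause", "unlicense", "cc0-1.0"}
)


def filter_by_license(
    rows: Iterable[dict],
    allowed: frozenset = ALLOWED_LICENSES,
) -> Iterator[dict]:
    """Yield only rows whose `license` field is in `allowed`."""
    for row in rows:
        # Rows lacking the field are treated like undetected licenses.
        if row.get("license", "unknown") in allowed:
            yield row
```

Note that the label describes the repository as a whole, so individual files inside it may still carry different terms.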

### File Filtering

- **Long Lines**: Files with any line exceeding 1,000 characters were excluded
- **Deduplication**: No deduplication was performed on the dataset
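
The two rules above can be sketched as follows. This is not the pipeline that built the dataset: `has_long_line` reimplements the stated 1,000-character rule under the assumption that line terminators are not counted, and `drop_exact_duplicates` is an optional step a user might add, since no deduplication was performed.

```python
import hashlib
from typing import Iterable, Iterator

MAX_LINE_LENGTH = 1000  # threshold stated in the filtering rule above


def has_long_line(code: str, limit: int = MAX_LINE_LENGTH) -> bool:
    """Return True if any line of `code` is longer than `limit` characters."""
    # splitlines() drops the line terminators, so they are not counted.
    return any(len(line) > limit for line in code.splitlines())


def drop_exact_duplicates(rows: Iterable[dict]) -> Iterator[dict]:
    """Yield rows whose `code` has not been seen before (exact matches only)."""
    seen = set()
    for row in rows:
        digest = hashlib.sha256(row["code"].encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            yield row
```

Hashing catches only byte-identical files. Near-duplicates, such as forks that differ by a header comment, need a fuzzy method like MinHash, which is outside the scope of this sketch.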

## Considerations for Using the Data

### Personal and Sensitive Information

The dataset may contain:
- Email addresses in code comments or configuration files
- API keys or credentials that were accidentally committed
- Personal information in comments or documentation

Users should exercise caution and implement appropriate filtering when using this data.
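
As one example of such filtering, email addresses can be masked with a regular expression. This is a minimal sketch with a deliberately simple pattern; the name `redact_emails` and the `[EMAIL]` placeholder are illustrative choices, and the approach does not detect API keys or other credentials.

```python
import re

# Deliberately simple pattern: it will miss unusual addresses and may
# match some strings that are not real email addresses.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_emails(code: str, placeholder: str = "[EMAIL]") -> str:
    """Replace anything that looks like an email address with `placeholder`."""
    return EMAIL_PATTERN.sub(placeholder, code)
```

For credentials, a dedicated secret scanner is likely a better fit than hand-written patterns.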

### Licensing Information

This dataset is a collection of source code from repositories with various licenses. Any use of all or part of the code gathered in this dataset must abide by the terms of the original licenses, including attribution clauses when relevant. The license field in each data point indicates the license of the source repository.},
keywords= {},
terms= {},
license= {},
superseded= {}
}
