NotaBug Code Dataset
nyuuzyou

main (13 files)
data/notabug_0000.parquet 946.30MB
data/notabug_0001.parquet 1.05GB
data/notabug_0002.parquet 1.09GB
data/notabug_0003.parquet 1.14GB
data/notabug_0004.parquet 1.05GB
data/notabug_0005.parquet 782.05MB
data/notabug_0006.parquet 1.19GB
data/notabug_0007.parquet 1.08GB
data/notabug_0008.parquet 933.83MB
data/notabug_0009.parquet 856.30MB
data/notabug_0010.parquet 902.77MB
data/notabug_0011.parquet 1.58GB
README.md 5.44kB
Type: Dataset

Bibtex:
@article{,
title= {NotaBug Code Dataset},
journal= {},
author= {nyuuzyou},
year= {},
url= {https://huggingface.co/datasets/nyuuzyou/notabug-code},
abstract= {# NotaBug Code Dataset

## Dataset Description

This dataset was compiled from code repositories hosted on [NotaBug.org](https://notabug.org), a free code hosting platform that emphasizes software freedom and privacy. NotaBug is built on a fully free software stack and is popular among free software advocates and privacy-conscious developers.

### Dataset Summary

| Statistic | Value |
|-----------|-------|
| **Total Files** | 12,622,961 |
| **Total Repositories** | 11,660 |
| **Total Size** | 12 GB (compressed Parquet) |
| **Programming Languages** | 6,306 (by file extension) |
| **File Format** | Parquet with Zstd compression (12 files) |

### Key Features

- **Free-software-focused corpus**: Contains code from repositories on a platform dedicated to software freedom
- **Diverse language coverage**: Spans thousands of file types identified by file extension
- **Rich metadata**: Includes repository name, file path, detected language, license information, and file size

### Languages

The dataset includes files from many programming languages and file types. Languages are detected by file extension. The top 30 languages by file count:

| Rank | Language | File Count |
|------|----------|------------|
| 1 | C++ | 2,219,208 |
| 2 | po | 2,022,441 |
| 3 | none | 1,572,451 |
| 4 | PHP | 951,354 |
| 5 | patch | 637,317 |
| 6 | svg | 547,170 |
| 7 | XML | 502,139 |
| 8 | Python | 392,476 |
| 9 | Text | 296,953 |
| 10 | JavaScript | 233,368 |
| 11 | JSON | 198,981 |
| 12 | Scheme | 192,409 |
| 13 | Markdown | 182,342 |
| 14 | info | 155,078 |
| 15 | slackbuild | 154,859 |
| 16 | HTML | 149,824 |
| 17 | Shell | 133,325 |
| 18 | log | 127,393 |
| 19 | Makefile | 112,989 |
| 20 | INI | 110,537 |
| 21 | Lua | 84,303 |
| 22 | in | 75,138 |
| 23 | Assembly | 74,519 |
| 24 | list | 58,346 |
| 25 | Java | 48,781 |
| 26 | CSS | 48,112 |
| 27 | mk | 47,373 |
| 28 | dtsi | 43,825 |
| 29 | diff | 42,125 |
| 30 | el | 41,017 |

### Licenses

The dataset includes files from repositories with various licenses:

| License | File Count |
|---------|------------|
| mit | 10,029,349 |
| mpl-2.0 | 1,178,420 |
| unknown | 888,840 |
| gpl-2.0 | 333,538 |
| gpl-3.0 | 158,975 |
| unlicense | 11,805 |
| cc-by-4.0 | 8,367 |
| bsd-2-clause | 4,718 |
| agpl-3.0 | 3,055 |
| cc-by-sa-4.0 | 2,309 |
| wtfpl | 1,314 |
| cc0-1.0 | 1,188 |
| bsd-3-clause | 601 |
| cc-by-nc-4.0 | 269 |
| lgpl-3.0 | 137 |
| lgpl-2.1 | 76 |
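
Because the license mix is uneven, downstream users may want to keep only rows whose repository license is on an allowlist. The following is a minimal sketch, assuming rows are plain dictionaries with a `license` key as described in the data fields. The `PERMISSIVE` set and the `filter_by_license` helper are illustrative names, not part of the dataset, and the choice of which identifiers count as acceptable is an assumption to adapt to your own requirements.

```python
from typing import Iterable, Iterator

# Hypothetical allowlist for illustration; adjust to your own compliance needs.
PERMISSIVE = frozenset(
    ("mit", "bsd-2-clause", "bsd-3-clause", "unlicense", "cc0-1.0", "wtfpl")
)


def filter_by_license(rows: Iterable[dict], allowed=PERMISSIVE) -> Iterator[dict]:
    """Yield only rows whose 'license' value is in the allowlist."""
    for row in rows:
        if row.get("license") in allowed:
            yield row


# Tiny in-memory sample standing in for real dataset rows.
rows = [
    {"path": "a.py", "license": "mit"},
    {"path": "b.c", "license": "gpl-3.0"},
    {"path": "c.txt", "license": "unknown"},
]
kept = [row["path"] for row in filter_by_license(rows)]
print(kept)
```

Rows marked `unknown` are dropped by this sketch, since nothing is known about their terms.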

## Dataset Structure

### Data Fields

| Field | Type | Description |
|-------|------|-------------|
| `code` | string | Content of the source file (UTF-8 encoded) |
| `repo_name` | string | Name of the NotaBug repository (format: `username/repo`) |
| `path` | string | Path of the file within the repository (relative to repo root) |
| `language` | string | Programming language as inferred by file extension |
| `license` | string | License of the repository (lowercase SPDX-style identifier such as `mit` or `gpl-3.0`, or `unknown`) |
| `size` | int64 | Size of the source file in bytes |
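
As a quick sanity check when consuming rows, the field table can be turned into a small validator. This is a sketch only: `EXPECTED_FIELDS` and `validation_errors` are hypothetical names, and the checks on `repo_name` (exactly one `/`) and `size` (non-negative) are assumptions drawn from the descriptions in the table rather than guarantees made by the dataset.

```python
# Field names and Python types mirroring the table (int64 maps to int).
EXPECTED_FIELDS = {
    "code": str,
    "repo_name": str,
    "path": str,
    "language": str,
    "license": str,
    "size": int,
}


def validation_errors(record: dict) -> list:
    """Return a list of human-readable problems; an empty list means the record looks fine."""
    errors = []
    for name, expected in EXPECTED_FIELDS.items():
        if name not in record:
            errors.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            errors.append(f"wrong type for {name}: {type(record[name]).__name__}")
    repo = record.get("repo_name")
    # Assumption: 'username/repo' means exactly one slash.
    if isinstance(repo, str) and repo.count("/") != 1:
        errors.append("repo_name is not in username/repo form")
    size = record.get("size")
    if isinstance(size, int) and size < 0:
        errors.append("size is negative")
    return errors


record = {
    "code": "#!/usr/bin/env python2\n# -*- coding: utf-8 -*-\n# Copyright (C) 2014...",
    "repo_name": "intermsofthewhole/libreboot",
    "path": "resources/utilities/i945gpu/intel-regs.py",
    "language": "Python",
    "license": "mit",
    "size": 3733,
}
print(validation_errors(record))
```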

### Data Format

- **Format**: Apache Parquet with Zstd compression
- **File Structure**: 12 files (`notabug_0000.parquet` to `notabug_0011.parquet`)
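
The shard naming scheme makes it easy to enumerate the files and stream rows from a local copy. The sketch below assumes the shards sit under a `data/` directory (as the file listing suggests) and that the third-party `pyarrow` package is installed; `shard_names` and `iter_rows` are hypothetical helper names.

```python
def shard_names(count: int = 12, prefix: str = "data/notabug_") -> list:
    """Build the shard file names notabug_0000.parquet .. notabug_0011.parquet."""
    return [f"{prefix}{index:04d}.parquet" for index in range(count)]


def iter_rows(path: str, columns=None, batch_size: int = 1024):
    """Stream rows from one shard as dictionaries without loading the whole file."""
    # Imported lazily so that shard_names() works without pyarrow installed.
    import pyarrow.parquet as pq

    parquet_file = pq.ParquetFile(path)
    for batch in parquet_file.iter_batches(batch_size=batch_size, columns=columns):
        yield from batch.to_pylist()


names = shard_names()
print(names[0], names[-1])
```

Passing `columns=["path", "language", "license", "size"]` to `iter_rows` skips the large `code` column when only metadata is needed.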

### Data Splits

All examples are in the train split. There is no validation or test split.

### Example Data Point

```
{
    'code': '#!/usr/bin/env python2\n# -*- coding: utf-8 -*-\n# Copyright (C) 2014...',
    'repo_name': 'intermsofthewhole/libreboot',
    'path': 'resources/utilities/i945gpu/intel-regs.py',
    'language': 'Python',
    'license': 'mit',
    'size': 3733
}
```

## Dataset Creation

### Source Data

All data originates from public repositories hosted on [NotaBug.org](https://notabug.org).

### Language Detection

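Programming languages are inferred from each file's extension. As the language table shows, extensions without a mapped language name appear to be stored as the raw extension (for example `po`, `patch`, or `dtsi`).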
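
A minimal sketch of extension-based inference is shown here to make the labels easier to interpret. It is not the dataset's actual implementation: the `EXTENSION_TO_LANGUAGE` map is a small hypothetical sample, and returning `none` for files without an extension is an assumption based on the `none` entry in the language table. Any special handling the dataset applies to names such as `Makefile` is not reproduced.

```python
import os

# Hypothetical sample map; the dataset's real mapping is not published here.
EXTENSION_TO_LANGUAGE = {
    "py": "Python",
    "cpp": "C++",
    "js": "JavaScript",
    "lua": "Lua",
    "md": "Markdown",
}


def infer_language(path: str) -> str:
    """Guess a language label from the file extension of a repository path."""
    extension = os.path.splitext(os.path.basename(path))[1].lstrip(".").lower()
    if not extension:
        # Assumption: files with no extension are labelled 'none'.
        return "none"
    # Fall back to the raw extension when no language name is mapped.
    return EXTENSION_TO_LANGUAGE.get(extension, extension)


print(infer_language("resources/utilities/i945gpu/intel-regs.py"))
print(infer_language("locale/de.po"))
print(infer_language("README"))
```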

### License Detection

Licenses are detected by scanning for license files in repositories and matching against known license patterns. Repositories without a detectable license are marked as "unknown".
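
The general approach can be sketched as follows. This is an illustration under stated assumptions, not the pipeline that produced the dataset: the file-name pattern, the handful of text patterns, the restriction to root-level files, and the `detect_license` helper are all hypothetical.

```python
import re

# Assumption: license files are recognised by common root-level names.
LICENSE_FILE_RE = re.compile(r"^(licen[sc]e|copying|unlicense)(\.[a-z]+)?$", re.IGNORECASE)

# Hypothetical text patterns for a few of the identifiers listed in this README.
LICENSE_PATTERNS = [
    ("agpl-3.0", re.compile(r"GNU AFFERO GENERAL PUBLIC LICENSE\s+Version 3", re.IGNORECASE)),
    ("gpl-3.0", re.compile(r"GNU GENERAL PUBLIC LICENSE\s+Version 3", re.IGNORECASE)),
    ("gpl-2.0", re.compile(r"GNU GENERAL PUBLIC LICENSE\s+Version 2", re.IGNORECASE)),
    ("mpl-2.0", re.compile(r"Mozilla Public License Version 2\.0", re.IGNORECASE)),
    ("mit", re.compile(r"Permission is hereby granted, free of charge", re.IGNORECASE)),
]


def detect_license(repo_files: dict) -> str:
    """Map a repository (path -> file text) to a license identifier or 'unknown'."""
    for path in sorted(repo_files):
        # Only consider files at the repository root whose name looks like a license file.
        if "/" in path or not LICENSE_FILE_RE.match(path):
            continue
        for identifier, pattern in LICENSE_PATTERNS:
            if pattern.search(repo_files[path]):
                return identifier
    return "unknown"


print(detect_license({"LICENSE": "MIT License\n\nPermission is hereby granted, free of charge, to any person"}))
```

Because detection of this kind works per repository, a single vendored file under a different license would still carry the repository-level label.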

### File Filtering

- **Long Lines**: Files with any line exceeding 1,000 characters were excluded
- **Deduplication**: No deduplication was performed on the dataset
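
Since no deduplication was performed, users who need unique files can apply their own. The sketch below shows the long-line rule as a predicate and a simple exact-match deduplication by content hash. It assumes line length is measured in characters (the README does not say whether bytes were used), and `has_long_line` and `drop_exact_duplicates` are hypothetical helper names.

```python
import hashlib

MAX_LINE_LENGTH = 1000  # Threshold stated in the filtering rule.


def has_long_line(code: str, limit: int = MAX_LINE_LENGTH) -> bool:
    """Return True if any line is longer than the limit (measured in characters)."""
    return any(len(line) > limit for line in code.splitlines())


def drop_exact_duplicates(rows):
    """Yield rows whose 'code' content has not been seen before (exact matches only)."""
    seen = set()
    for row in rows:
        digest = hashlib.sha256(row["code"].encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            yield row


rows = [
    {"path": "a/util.py", "code": "print('hi')\n"},
    {"path": "b/util.py", "code": "print('hi')\n"},
    {"path": "c/main.py", "code": "print('bye')\n"},
]
unique_paths = [row["path"] for row in drop_exact_duplicates(rows)]
print(unique_paths)
```

Exact hashing only removes byte-identical content; near-duplicates such as forks with small changes would need a fuzzy method.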

## Considerations for Using the Data

### Personal and Sensitive Information

The dataset may contain:
- Email addresses in code comments or configuration files
- API keys or credentials that were accidentally committed
- Personal information in comments or documentation

Users should exercise caution and implement appropriate filtering when using this data.
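
One possible starting point for such filtering is a pair of regular expressions, one for email addresses and one for assignments that look like hard-coded secrets. This is a rough sketch under stated assumptions: both patterns and the helper names `redact_emails` and `looks_like_secret` are illustrative, will miss many cases, and can produce false positives.

```python
import re

# Simplified email pattern; real-world addresses can be more varied.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Heuristic: a key-like name assigned a long token of 16 or more characters.
SECRET_RE = re.compile(
    r"(?i)(api[_-]?key|secret|token|password)\w*\s*[:=]\s*['\"]?[A-Za-z0-9_\-]{16,}"
)


def redact_emails(code: str, placeholder: str = "[email]") -> str:
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_RE.sub(placeholder, code)


def looks_like_secret(code: str) -> bool:
    """Return True if the text contains an assignment resembling a credential."""
    return SECRET_RE.search(code) is not None


print(redact_emails("# Author: Jane <jane@example.org>"))
print(looks_like_secret('API_KEY = "abcd1234efgh5678ijkl"'))
```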

### Licensing Information

This dataset is a collection of source code from repositories with various licenses. Any use of all or part of the code gathered in this dataset must abide by the terms of the original licenses, including attribution clauses when relevant. The license field in each data point indicates the license of the source repository.},
keywords= {},
terms= {},
license= {},
superseded= {}
}
