main (13 files)
data/notabug_0000.parquet | 946.30MB
data/notabug_0001.parquet | 1.05GB
data/notabug_0002.parquet | 1.09GB
data/notabug_0003.parquet | 1.14GB
data/notabug_0004.parquet | 1.05GB
data/notabug_0005.parquet | 782.05MB
data/notabug_0006.parquet | 1.19GB
data/notabug_0007.parquet | 1.08GB
data/notabug_0008.parquet | 933.83MB
data/notabug_0009.parquet | 856.30MB
data/notabug_0010.parquet | 902.77MB
data/notabug_0011.parquet | 1.58GB
README.md | 5.44kB
Type: Dataset
Bibtex:
@article{,
title= {NotaBug Code Dataset},
journal= {},
author= {nyuuzyou},
year= {},
url= {https://huggingface.co/datasets/nyuuzyou/notabug-code},
abstract= {# NotaBug Code Dataset
## Dataset Description
This dataset was compiled from code repositories hosted on [NotaBug.org](https://notabug.org), a free code hosting platform that emphasizes software freedom and privacy. NotaBug is built on a fully free software stack and is popular among free software advocates and privacy-conscious developers.
### Dataset Summary
| Statistic | Value |
|-----------|-------|
| **Total Files** | 12,622,961 |
| **Total Repositories** | 11,660 |
| **Total Size** | 12 GB (compressed Parquet) |
| **Programming Languages** | 6,306 (by file extension) |
| **File Format** | Parquet with Zstd compression (12 files) |
### Key Features
- **Free-software-focused corpus**: Contains code from repositories on a platform dedicated to software freedom
- **Diverse language coverage**: Spans thousands of file types identified by file extension
- **Rich metadata**: Includes repository name, file path, detected language, license information, and file size
### Languages
The dataset includes files from many programming languages and file types. Languages are detected by file extension. The top 30 languages by file count:
| Rank | Language | File Count |
|------|----------|------------|
| 1 | C++ | 2,219,208 |
| 2 | po | 2,022,441 |
| 3 | none | 1,572,451 |
| 4 | PHP | 951,354 |
| 5 | patch | 637,317 |
| 6 | svg | 547,170 |
| 7 | XML | 502,139 |
| 8 | Python | 392,476 |
| 9 | Text | 296,953 |
| 10 | JavaScript | 233,368 |
| 11 | JSON | 198,981 |
| 12 | Scheme | 192,409 |
| 13 | Markdown | 182,342 |
| 14 | info | 155,078 |
| 15 | slackbuild | 154,859 |
| 16 | HTML | 149,824 |
| 17 | Shell | 133,325 |
| 18 | log | 127,393 |
| 19 | Makefile | 112,989 |
| 20 | INI | 110,537 |
| 21 | Lua | 84,303 |
| 22 | in | 75,138 |
| 23 | Assembly | 74,519 |
| 24 | list | 58,346 |
| 25 | Java | 48,781 |
| 26 | CSS | 48,112 |
| 27 | mk | 47,373 |
| 28 | dtsi | 43,825 |
| 29 | diff | 42,125 |
| 30 | el | 41,017 |
### Licenses
The dataset includes files from repositories with various licenses:
| License | File Count |
|---------|------------|
| mit | 10,029,349 |
| mpl-2.0 | 1,178,420 |
| unknown | 888,840 |
| gpl-2.0 | 333,538 |
| gpl-3.0 | 158,975 |
| unlicense | 11,805 |
| cc-by-4.0 | 8,367 |
| bsd-2-clause | 4,718 |
| agpl-3.0 | 3,055 |
| cc-by-sa-4.0 | 2,309 |
| wtfpl | 1,314 |
| cc0-1.0 | 1,188 |
| bsd-3-clause | 601 |
| cc-by-nc-4.0 | 269 |
| lgpl-3.0 | 137 |
| lgpl-2.1 | 76 |
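Because license terms differ widely across the rows above, downstream users may want to restrict the data to a chosen set of licenses. The following is a minimal sketch, assuming records are plain Python dicts that carry a `license` key; the allow-list `ALLOWED_LICENSES` and the helper names are illustrative and not part of the dataset.

```python
from collections import Counter
from typing import Iterable, Iterator

# Hypothetical allow-list; pick the licenses that fit your own requirements.
ALLOWED_LICENSES = frozenset({"mit", "unlicense", "bsd-2-clause", "bsd-3-clause", "cc0-1.0"})


def filter_by_license(records: Iterable[dict], allowed: frozenset = ALLOWED_LICENSES) -> Iterator[dict]:
    """Yield only the records whose `license` value is in the allow-list."""
    for record in records:
        if record.get("license") in allowed:
            yield record


def license_counts(records: Iterable[dict]) -> Counter:
    """Count records per license value, in the spirit of the table above."""
    return Counter(record.get("license", "unknown") for record in records)


# Tiny in-memory sample standing in for real rows.
sample = [
    {"path": "a.py", "license": "mit"},
    {"path": "b.c", "license": "gpl-3.0"},
    {"path": "c.txt", "license": "unknown"},
]
kept = list(filter_by_license(sample))
```

Filtering lazily with a generator keeps memory use flat, which matters when iterating over millions of rows.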
## Dataset Structure
### Data Fields
| Field | Type | Description |
|-------|------|-------------|
| `code` | string | Content of the source file (UTF-8 encoded) |
| `repo_name` | string | Name of the NotaBug repository (format: `username/repo`) |
| `path` | string | Path of the file within the repository (relative to repo root) |
| `language` | string | Programming language as inferred by file extension |
| `license` | string | License of the repository (lowercase SPDX-style identifier such as `mit` or `gpl-3.0`, or "unknown") |
| `size` | int64 | Size of the source file in bytes |
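As a rough illustration of the schema above, a record can be checked against the six documented fields. This is a sketch only: `FIELD_TYPES` and `validate_record` are hypothetical helpers, and the `username/repo` check is a loose reading of the format stated in the table.

```python
# Field names and types as listed in the Data Fields table (int64 maps to Python int).
FIELD_TYPES = {
    "code": str,
    "repo_name": str,
    "path": str,
    "language": str,
    "license": str,
    "size": int,
}


def validate_record(record: dict) -> list:
    """Return a list of problems; an empty list means the record looks well-formed."""
    problems = []
    for field, expected in FIELD_TYPES.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            problems.append(f"wrong type for {field}: {type(record[field]).__name__}")
    repo_name = record.get("repo_name")
    if isinstance(repo_name, str) and "/" not in repo_name:
        problems.append("repo_name is not in username/repo form")
    return problems


example = {
    "code": "#!/usr/bin/env python2\n",
    "repo_name": "intermsofthewhole/libreboot",
    "path": "resources/utilities/i945gpu/intel-regs.py",
    "language": "Python",
    "license": "mit",
    "size": 3733,
}
issues = validate_record(example)
```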
### Data Format
- **Format**: Apache Parquet with Zstd compression
- **File Structure**: 12 files (`notabug_0000.parquet` to `notabug_0011.parquet`)
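
The shard naming above is regular enough to generate programmatically. The sketch below assumes the files have been downloaded into a local `data/` directory and that the third-party `pyarrow` package is available for reading; `shard_paths` and `iter_records` are illustrative names, not part of any official loader.

```python
from pathlib import PurePosixPath
from typing import Iterator

NUM_SHARDS = 12  # notabug_0000.parquet .. notabug_0011.parquet


def shard_paths(root: str = "data") -> list:
    """Build the 12 shard paths in order (the local root directory is an assumption)."""
    return [str(PurePosixPath(root) / f"notabug_{i:04d}.parquet") for i in range(NUM_SHARDS)]


def iter_records(path: str,
                 columns=("repo_name", "path", "language", "license", "size"),
                 batch_size: int = 1024) -> Iterator[dict]:
    """Stream rows from one shard in batches instead of loading it whole.

    Requires pyarrow; it is imported lazily so shard_paths() works without it.
    """
    import pyarrow.parquet as pq

    parquet_file = pq.ParquetFile(path)
    for batch in parquet_file.iter_batches(batch_size=batch_size, columns=list(columns)):
        yield from batch.to_pylist()


paths = shard_paths()
```

Leaving `code` out of the default column list keeps metadata scans cheap; pass it explicitly when the file contents are needed.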
### Data Splits
All examples are in the train split. There is no validation or test split.
### Example Data Point
```
{
'code': '#!/usr/bin/env python2\n# -*- coding: utf-8 -*-\n# Copyright (C) 2014...',
'repo_name': 'intermsofthewhole/libreboot',
'path': 'resources/utilities/i945gpu/intel-regs.py',
'language': 'Python',
'license': 'mit',
'size': 3733
}
```
## Dataset Creation
### Source Data
All data originates from public repositories hosted on [NotaBug.org](https://notabug.org).
### Language Detection
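Programming languages are inferred from file extensions. Judging by the language table, files without an extension are labeled `none`, and extensions with no mapped language name appear under the bare extension (for example `po` or `patch`).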
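Extension-based inference can be sketched as below. The mapping is a deliberately tiny, hypothetical stand-in, since the card does not publish the actual table. The fallback behavior (bare extension for unmapped types, `none` for files without an extension) is an assumption read off the labels in the language table, not documented behavior.

```python
from pathlib import PurePosixPath

# Hypothetical subset of an extension-to-language table.
EXTENSION_TO_LANGUAGE = {
    "py": "Python",
    "cpp": "C++",
    "php": "PHP",
    "js": "JavaScript",
    "md": "Markdown",
    "sh": "Shell",
}


def infer_language(path: str) -> str:
    """Guess a language label from the file extension alone."""
    extension = PurePosixPath(path).suffix.lower()[1:]  # drop the leading dot
    if not extension:
        return "none"  # assumed label for files without an extension
    # Assumed fallback: unmapped extensions keep the bare extension as the label.
    return EXTENSION_TO_LANGUAGE.get(extension, extension)


label = infer_language("resources/utilities/i945gpu/intel-regs.py")
```

A scheme like this never inspects file contents, so labels should be read as file-type hints rather than verified languages.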
### License Detection
Licenses are detected by scanning for license files in repositories and matching against known license patterns. Repositories without a detectable license are marked as "unknown".
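One way such detection could work is sketched below. The file names and regular expressions are hypothetical examples and are not the patterns used to build this dataset; `detect_license` takes a mapping from file path to file text for a single repository.

```python
import re

# Hypothetical list of file names that commonly hold license text.
LICENSE_FILENAMES = ("LICENSE", "LICENSE.txt", "LICENSE.md", "COPYING", "COPYING.txt")

# Hypothetical patterns, checked in order; more specific GNU variants come first.
LICENSE_PATTERNS = [
    ("agpl-3.0", re.compile(r"GNU AFFERO GENERAL PUBLIC LICENSE\s+Version 3", re.IGNORECASE)),
    ("gpl-3.0", re.compile(r"GNU GENERAL PUBLIC LICENSE\s+Version 3", re.IGNORECASE)),
    ("gpl-2.0", re.compile(r"GNU GENERAL PUBLIC LICENSE\s+Version 2", re.IGNORECASE)),
    ("mit", re.compile(r"Permission is hereby granted, free of charge", re.IGNORECASE)),
]


def detect_license(repo_files: dict) -> str:
    """Return a license identifier for a repository, or "unknown" if nothing matches."""
    for name in LICENSE_FILENAMES:
        text = repo_files.get(name)
        if text is None:
            continue
        for identifier, pattern in LICENSE_PATTERNS:
            if pattern.search(text):
                return identifier
    return "unknown"


repo = {"LICENSE": "MIT License\n\nPermission is hereby granted, free of charge, to any person"}
detected = detect_license(repo)
```

Because the label comes from repository-level files, it says nothing about individual files that carry their own license headers.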
### File Filtering
- **Long Lines**: Files with any line exceeding 1,000 characters were excluded
- **Deduplication**: No deduplication was performed on the dataset
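
The long-line rule is simple to reproduce, and since no deduplication was performed, users may want to add their own. The sketch below shows both steps under stated assumptions: the 1,000-character limit comes from the list above, while exact-match deduplication by SHA-256 of the `code` field is one possible approach chosen for illustration, not something applied to this dataset.

```python
import hashlib
from typing import Iterable, Iterator

MAX_LINE_LENGTH = 1000  # limit stated in the File Filtering list


def has_long_line(code: str, limit: int = MAX_LINE_LENGTH) -> bool:
    """True if any single line is longer than `limit` characters."""
    return any(len(line) > limit for line in code.splitlines())


def drop_exact_duplicates(records: Iterable[dict]) -> Iterator[dict]:
    """Yield the first record for each distinct `code` value (exact match only)."""
    seen = set()
    for record in records:
        digest = hashlib.sha256(record["code"].encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        yield record


rows = [
    {"path": "a.py", "code": "print(1)\n"},
    {"path": "copy/a.py", "code": "print(1)\n"},
    {"path": "b.py", "code": "print(2)\n"},
]
unique = list(drop_exact_duplicates(rows))
```

Exact hashing only catches byte-identical files; forks with small edits would need near-duplicate detection, which is outside the scope of this sketch.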
## Considerations for Using the Data
### Personal and Sensitive Information
The dataset may contain:
- Email addresses in code comments or configuration files
- API keys or credentials that were accidentally committed
- Personal information in comments or documentation
Users should exercise caution and implement appropriate filtering when using this data.
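As a starting point for such filtering, the sketch below redacts email addresses and flags lines that look like hard-coded credentials. Both regular expressions are rough, hypothetical heuristics that will miss cases and produce false positives; they are not a substitute for a dedicated secret scanner.

```python
import re

# Loose email pattern; intentionally simple and not RFC-complete.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Hypothetical heuristic: a credential-like name assigned a long token-like value.
SECRET_RE = re.compile(
    r"(?:api[_-]?key|secret|token|password)\s*[:=]\s*['\"]?[A-Za-z0-9_\-]{16,}",
    re.IGNORECASE,
)


def redact_emails(code: str, placeholder: str = "<EMAIL>") -> str:
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_RE.sub(placeholder, code)


def looks_like_secret(code: str) -> bool:
    """True if the text contains something resembling a hard-coded credential."""
    return SECRET_RE.search(code) is not None


redacted = redact_emails("# Author: jane@example.org")
```

Flagged files can then be dropped or reviewed manually, depending on how conservative the downstream use needs to be.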
### Licensing Information
This dataset is a collection of source code from repositories with various licenses. Any use of all or part of the code gathered in this dataset must abide by the terms of the original licenses, including attribution clauses when relevant. The license field in each data point indicates the license of the source repository.},
keywords= {},
terms= {},
license= {},
superseded= {}
}