main (18 files)
data/gitgud-00000.parquet | 1.18GB
data/gitgud-00001.parquet | 1.26GB
data/gitgud-00002.parquet | 1.18GB
data/gitgud-00003.parquet | 1.14GB
data/gitgud-00004.parquet | 1.14GB
data/gitgud-00005.parquet | 1.17GB
data/gitgud-00006.parquet | 1.14GB
data/gitgud-00007.parquet | 1.16GB
data/gitgud-00008.parquet | 1.13GB
data/gitgud-00009.parquet | 1.12GB
data/gitgud-00010.parquet | 1.08GB
data/gitgud-00011.parquet | 1.13GB
data/gitgud-00012.parquet | 1.17GB
data/gitgud-00013.parquet | 1.16GB
data/gitgud-00014.parquet | 1.11GB
data/gitgud-00015.parquet | 1.15GB
data/gitgud-00016.parquet | 348.46MB
README.md | 7.26kB
Type: Dataset
Bibtex:
@article{,
title= {GitGud Code Dataset},
journal= {},
author= {nyuuzyou},
year= {},
url= {https://huggingface.co/datasets/nyuuzyou/gitgud-code},
abstract= {# GitGud Code Dataset
## Dataset Description
This dataset was compiled from code repositories hosted on [GitGud.io](https://gitgud.io), a GitLab-based code hosting platform. GitGud.io is an alternative Git hosting service used by various developer communities and open-source projects.
### Dataset Summary
| Statistic | Value |
|-----------|-------|
| **Total Files** | 16,322,315 |
| **Total Repositories** | 7,204 |
| **Total Size** | 17.46 GB (compressed Parquet) |
| **Programming Languages and File Types** | 2,185 |
| **File Format** | Parquet with Zstd compression (17 files) |
### Key Features
- **Diverse code corpus**: Contains code from over 7,000 repositories across various domains
- **Wide language coverage**: Spans 2,185 programming languages and file types detected by file extension mapping
- **Rich metadata**: Includes repository name, file path, detected language, license information, and file size
- **Quality filtered**: Binary files, files containing any line longer than 1,000 characters, and license files are removed
### Languages
The dataset includes 2,185 programming languages and file types. The top 30 languages by file count:
| Rank | Language | File Count |
|------|----------|------------|
| 1 | tw (Twine) | 3,301,366 |
| 2 | XML | 3,281,566 |
| 3 | svg | 1,744,500 |
| 4 | C# | 1,367,799 |
| 5 | JavaScript | 1,252,710 |
| 6 | C++ | 731,619 |
| 7 | erb | 710,279 |
| 8 | JSON | 398,139 |
| 9 | Text | 377,948 |
| 10 | twee | 300,576 |
| 11 | csv | 205,230 |
| 12 | HTML | 170,711 |
| 13 | Markdown | 160,735 |
| 14 | TypeScript | 147,173 |
| 15 | Lua | 117,079 |
| 16 | PHP | 116,059 |
| 17 | none | 111,791 |
| 18 | pal | 110,626 |
| 19 | CSS | 108,664 |
| 20 | Python | 106,261 |
| 21 | dm | 98,333 |
| 22 | Ruby | 93,685 |
| 23 | _comment | 91,730 |
| 24 | Java | 81,190 |
| 25 | YAML | 63,289 |
| 26 | ActionScript | 62,210 |
| 27 | Git | 43,748 |
| 28 | mdwn | 42,654 |
| 29 | mk | 41,789 |
| 30 | INI | 39,760 |
### Licenses
The dataset includes files from repositories with various licenses:
| License | File Count |
|---------|------------|
| mit | 9,517,343 |
| bsd-3-clause | 3,315,732 |
| unknown | 2,935,736 |
| mpl-2.0 | 338,040 |
| gpl-2.0 | 79,415 |
| lgpl-2.1 | 38,429 |
| gpl-3.0 | 25,964 |
| apache-2.0 | 20,562 |
| cc-by-4.0 | 18,703 |
| agpl-3.0 | 15,367 |
| cc-by-nc-4.0 | 6,362 |
| wtfpl | 6,163 |
| bsd-2-clause | 3,749 |
| zlib | 482 |
| unlicense | 261 |
| cc-by-sa-4.0 | 7 |
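A minimal sketch of how the `license` field might be used to keep only files from repositories under licenses a project accepts. The allowlist and the `filter_by_license` helper are hypothetical choices for illustration, and the sample records are made-up stand-ins that only mimic the dataset fields.

```python
# Hypothetical allowlist; pick the identifiers that match your own requirements.
ALLOWED_LICENSES = {"mit", "bsd-2-clause", "bsd-3-clause", "apache-2.0", "zlib", "unlicense"}


def filter_by_license(records, allowed=ALLOWED_LICENSES):
    """Yield only records whose `license` value is in the allowlist."""
    for record in records:
        if record.get("license", "unknown") in allowed:
            yield record


# Made-up records that mimic the dataset schema (not real rows).
sample = [
    {"repo_name": "username/game-mod", "path": "src/Player.cs", "license": "mit"},
    {"repo_name": "username/tool", "path": "main.py", "license": "unknown"},
    {"repo_name": "username/lib", "path": "lib.c", "license": "gpl-3.0"},
]
kept = list(filter_by_license(sample))
print([record["path"] for record in kept])
```

Files labeled `unknown` are dropped by this sketch because no license could be detected for their repository.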
## Dataset Structure
### Data Fields
| Field | Type | Description |
|-------|------|-------------|
| `code` | string | Content of the source file (UTF-8 encoded) |
| `repo_name` | string | Name of the GitGud repository (format: `username/repo`) |
| `path` | string | Path of the file within the repository (relative to repo root) |
| `language` | string | Programming language detected by file extension mapping |
| `license` | string | License of the repository (SPDX identifier or "unknown") |
| `size` | int64 | Size of the source file in bytes |
### Data Format
- **Format**: Apache Parquet with Zstd compression (level 19)
- **File Structure**: 17 files (`gitgud-00000.parquet` to `gitgud-00016.parquet`)
- **Rows per shard**: ~1,000,000 (except last shard: 322,315)
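As a small illustration of the shard layout, the sketch below generates the shard file names and checks the row arithmetic. The `data/` prefix follows the repository file listing, the helper name `shard_paths` is hypothetical, and the total is only approximate because full shards hold roughly, not necessarily exactly, 1,000,000 rows.

```python
NUM_SHARDS = 17
ROWS_PER_FULL_SHARD = 1_000_000  # approximate figure from the README
ROWS_LAST_SHARD = 322_315


def shard_paths(num_shards=NUM_SHARDS):
    """Return shard paths from gitgud-00000.parquet to gitgud-00016.parquet."""
    return [f"data/gitgud-{index:05d}.parquet" for index in range(num_shards)]


paths = shard_paths()
# 16 full shards plus the shorter final shard.
approx_total_rows = (NUM_SHARDS - 1) * ROWS_PER_FULL_SHARD + ROWS_LAST_SHARD
print(paths[0], paths[-1], approx_total_rows)
```

A list like `paths` could be passed as the `data_files` argument of `datasets.load_dataset` to read only selected shards, assuming the Hugging Face `datasets` library is installed.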
### Data Splits
All examples are in the train split. There is no validation or test split.
### Example Data Point
```
{
'code': 'using System;\nusing System.Collections.Generic;\n...',
'repo_name': 'username/game-mod',
'path': 'src/GameMod/Player.cs',
'language': 'C#',
'license': 'mit',
'size': 2048
}
```
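A record with this shape can be checked against the field table before further processing. The following is a minimal sketch; `validate_record` and `EXPECTED_FIELDS` are hypothetical names, and the checks cover only field presence and Python types.

```python
# Field names and types taken from the Data Fields table (int64 maps to int here).
EXPECTED_FIELDS = {
    "code": str,
    "repo_name": str,
    "path": str,
    "language": str,
    "license": str,
    "size": int,
}


def validate_record(record):
    """Return a list of problems found in one record (empty if none are found)."""
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            problems.append(f"wrong type for {name}: {type(record[name]).__name__}")
    if isinstance(record.get("size"), int) and record["size"] < 0:
        problems.append("size must be non-negative")
    return problems


example = {
    "code": "using System;\nusing System.Collections.Generic;\n...",
    "repo_name": "username/game-mod",
    "path": "src/GameMod/Player.cs",
    "language": "C#",
    "license": "mit",
    "size": 2048,
}
print(validate_record(example))
```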
## Dataset Creation
### Pipeline Overview
The dataset was created through a multi-stage pipeline:
1. **Repository Discovery**: Scraping public repository URLs from GitGud.io's GitLab API v4 endpoint using multiple sort orderings (`id`, `name`, `path`, `updated_at`, `star_count`, `last_activity_at`, `similarity`)
2. **Branch Enumeration**: Fetching all branches for each repository via the GitLab API
3. **Archive Download**: Downloading `.tar.gz` archives for each repository/branch combination
4. **Content Extraction**: Extracting and filtering source code files from archives
5. **Parquet Generation**: Writing filtered records to Parquet shards with Zstd compression
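The first three stages can be sketched as URL construction against the GitLab REST API v4. This is an illustration only: the base URL, the `visibility=public` filter, the page size, and all function names are assumptions, and the actual pipeline code is not part of this README. No network requests are made.

```python
from urllib.parse import urlencode

API_ROOT = "https://gitgud.io/api/v4"  # assumed base URL of the GitLab API v4
SORT_ORDERINGS = [
    "id", "name", "path", "updated_at",
    "star_count", "last_activity_at", "similarity",
]


def project_list_url(order_by, page=1, per_page=100):
    """Stage 1: one page of the public project list under a given ordering."""
    query = urlencode({
        "visibility": "public",
        "order_by": order_by,
        "page": page,
        "per_page": per_page,
    })
    return f"{API_ROOT}/projects?{query}"


def branches_url(project_id):
    """Stage 2: branch listing for one project."""
    return f"{API_ROOT}/projects/{project_id}/repository/branches"


def archive_url(project_id, branch):
    """Stage 3: .tar.gz archive for one project/branch combination."""
    query = urlencode({"sha": branch})
    return f"{API_ROOT}/projects/{project_id}/repository/archive.tar.gz?{query}"


for ordering in SORT_ORDERINGS:
    print(project_list_url(ordering))
```

Querying the same listing under several orderings is presumably a way to reach repositories that a single paginated pass would miss; the README does not state the reason.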
### Language Detection
Programming languages are detected using file extension mapping. The pipeline maps ~80 programming languages by their file extensions, including:
- **Major languages**: Python, JavaScript, TypeScript, C, C++, C#, Java, Go, Rust, Ruby, PHP
- **Configuration**: JSON, YAML, TOML, XML, INI
- **Markup**: HTML, CSS, Markdown, LaTeX
- **Game development**: GLSL, HLSL, GDScript
- **And many more**
Files with unrecognized extensions are labeled with the extension itself (without the dot prefix). Files without extensions are labeled as "none" or by special filename matching (e.g., "Dockerfile", "Makefile").
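The labeling rules above can be sketched as follows. The mapping shown is a tiny hypothetical subset of the roughly 80 mapped languages, and `detect_language` is an illustrative name, not the pipeline's actual function.

```python
from pathlib import PurePosixPath

# Tiny illustrative subset; the real extension table is not published in this README.
EXTENSION_TO_LANGUAGE = {
    ".py": "Python",
    ".js": "JavaScript",
    ".ts": "TypeScript",
    ".cs": "C#",
    ".cpp": "C++",
    ".json": "JSON",
    ".md": "Markdown",
}
# Extensionless files recognized by name.
SPECIAL_FILENAMES = {"Dockerfile": "Dockerfile", "Makefile": "Makefile"}


def detect_language(path):
    """Label a file by special filename, known extension, raw extension, or 'none'."""
    file_path = PurePosixPath(path)
    if file_path.name in SPECIAL_FILENAMES:
        return SPECIAL_FILENAMES[file_path.name]
    suffix = file_path.suffix
    if not suffix:
        return "none"
    # Unknown extensions fall back to the extension without its leading dot.
    return EXTENSION_TO_LANGUAGE.get(suffix.lower(), suffix[1:])


for sample_path in ["src/GameMod/Player.cs", "story/passage.tw", "Dockerfile", "docs/README"]:
    print(sample_path, "->", detect_language(sample_path))
```

The fallback to the raw extension explains why labels such as `tw`, `erb`, and `pal` appear in the language table alongside named languages.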
### License Detection
Licenses are detected by:
1. Scanning for license files (`LICENSE`, `LICENSE.txt`, `LICENSE.md`, `COPYING`, `COPYING.txt`, `COPYING.md`)
2. Matching license text against known patterns (MIT, Apache 2.0, GPL variants, BSD, Creative Commons, MPL, ISC, Unlicense, Artistic, WTFPL, Zlib, etc.)
3. Defaulting to "unknown" if no license can be detected
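A minimal sketch of these three steps, assuming the repository's root-level files are available as a mapping from file name to text. The regular expressions are hypothetical and cover only a few licenses; real license matching needs many more patterns and care with near-identical texts.

```python
import re

LICENSE_FILENAMES = {
    "LICENSE", "LICENSE.txt", "LICENSE.md",
    "COPYING", "COPYING.txt", "COPYING.md",
}

# Hypothetical patterns, checked in order.
LICENSE_PATTERNS = [
    ("mit", re.compile(r"\bMIT License\b|Permission is hereby granted, free of charge", re.I)),
    ("apache-2.0", re.compile(r"Apache License,?\s+Version 2\.0", re.I)),
    ("gpl-3.0", re.compile(r"GNU GENERAL PUBLIC LICENSE\s+Version 3", re.I)),
    ("gpl-2.0", re.compile(r"GNU GENERAL PUBLIC LICENSE\s+Version 2", re.I)),
]


def detect_license(files):
    """Return a license identifier for a repository, or "unknown"."""
    for name in sorted(LICENSE_FILENAMES):
        text = files.get(name)
        if text is None:
            continue  # step 1: this candidate license file is absent
        for license_id, pattern in LICENSE_PATTERNS:
            if pattern.search(text):  # step 2: match against known patterns
                return license_id
    return "unknown"  # step 3: nothing matched


print(detect_license({"LICENSE": "MIT License\n\nCopyright (c) ..."}))
print(detect_license({"README.md": "No license file here"}))
```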
### File Filtering
Filtering is applied to ensure data quality:
#### Size Limits
| Limit | Value |
|-------|-------|
| Max repository archive size | 64 MB |
| Max line length | 1,000 characters |
#### Content Filtering
- **Binary Detection**: Files with null bytes in the first 1KB are excluded
- **Encoding Validation**: Files are decoded as UTF-8, with fallback to latin-1, cp1252, and iso-8859-1 when UTF-8 decoding fails
- **Long Lines**: Files with any line exceeding 1,000 characters are excluded
- **License Files**: License files (LICENSE, COPYING, etc.) are excluded from the dataset (but used for license detection)
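The content filters above can be combined into a single check. This is a sketch under stated assumptions: `keep_file` and `decode_text` are hypothetical names, and the order of the checks is a guess rather than the pipeline's documented behavior.

```python
MAX_LINE_LENGTH = 1_000
BINARY_SNIFF_BYTES = 1024  # "first 1KB"
ENCODINGS = ("utf-8", "latin-1", "cp1252", "iso-8859-1")
LICENSE_FILENAMES = {
    "LICENSE", "LICENSE.txt", "LICENSE.md",
    "COPYING", "COPYING.txt", "COPYING.md",
}


def decode_text(raw):
    """Try each encoding in turn and return the first successful decoding."""
    for encoding in ENCODINGS:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    return None


def keep_file(filename, raw):
    """Return the decoded text if the file passes all filters, else None."""
    if filename in LICENSE_FILENAMES:
        return None  # license files are used for detection only
    if b"\x00" in raw[:BINARY_SNIFF_BYTES]:
        return None  # treated as binary
    text = decode_text(raw)
    if text is None:
        return None
    if any(len(line) > MAX_LINE_LENGTH for line in text.splitlines()):
        return None  # at least one overly long line
    return text


print(keep_file("main.py", b"print('hi')\n") is not None)
print(keep_file("blob.bin", b"\x00\x01\x02") is not None)
```

In Python, `latin-1` can decode any byte sequence, so in this sketch the encodings listed after it are never reached and decoding cannot fail.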
### Source Data
All data originates from public repositories hosted on [GitGud.io](https://gitgud.io).
## Considerations for Using the Data
### Personal and Sensitive Information
The dataset may contain:
- Email addresses in code comments or configuration files
- API keys or credentials that were accidentally committed
- Personal information in comments or documentation
Users should exercise caution and implement appropriate filtering when using this data.
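As one example of such filtering, the sketch below masks email addresses in a `code` string. The regular expression is a deliberately simple, hypothetical pattern that will miss some addresses and may match some non-addresses; it does not detect API keys or other credentials.

```python
import re

# Simple illustrative pattern, not a complete email grammar.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_emails(code, placeholder="<EMAIL>"):
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_RE.sub(placeholder, code)


print(redact_emails("# contact: jane.doe@example.com"))
```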
### Licensing Information
This dataset is a collection of source code from repositories with various licenses. Any use of all or part of the code gathered in this dataset must abide by the terms of the original licenses, including attribution clauses when relevant. The license field in each data point indicates the license of the source repository.},
keywords= {},
terms= {},
license= {},
superseded= {}
}