GitGud Code Dataset

Name: GitGud Code Dataset
Creator: nyuuzyou
Published: 2026-01-09 11:12:45
License: https://academictorrents.com/nolicensespecified

main (18 files)

data/gitgud-00000.parquet	1.18GB
data/gitgud-00001.parquet	1.26GB
data/gitgud-00002.parquet	1.18GB
data/gitgud-00003.parquet	1.14GB
data/gitgud-00004.parquet	1.14GB
data/gitgud-00005.parquet	1.17GB
data/gitgud-00006.parquet	1.14GB
data/gitgud-00007.parquet	1.16GB
data/gitgud-00008.parquet	1.13GB
data/gitgud-00009.parquet	1.12GB
data/gitgud-00010.parquet	1.08GB
data/gitgud-00011.parquet	1.13GB
data/gitgud-00012.parquet	1.17GB
data/gitgud-00013.parquet	1.16GB
data/gitgud-00014.parquet	1.11GB
data/gitgud-00015.parquet	1.15GB
data/gitgud-00016.parquet	348.46MB
README.md	7.26kB

Type: Dataset

Tags:

Bibtex:

@article{,
title= {GitGud Code Dataset},
journal= {},
author= {nyuuzyou},
year= {},
url= {https://huggingface.co/datasets/nyuuzyou/gitgud-code},
abstract= {# GitGud Code Dataset

## Dataset Description

This dataset was compiled from code repositories hosted on [GitGud.io](https://gitgud.io), a GitLab-based code hosting platform. GitGud.io is an alternative Git hosting service used by various developer communities and open-source projects.

### Dataset Summary

| Statistic | Value |
|-----------|-------|
| **Total Files** | 16,322,315 |
| **Total Repositories** | 7,204 |
| **Total Size** | 17.46 GB (compressed Parquet) |
| **Languages and File Types** | 2,185 |
| **File Format** | Parquet with Zstd compression (17 files) |

### Key Features

- **Diverse code corpus**: Contains code from over 7,000 repositories across various domains
- **Wide language coverage**: Spans 2,185 programming languages and file types detected by file extension mapping
- **Rich metadata**: Includes repository name, file path, detected language, license information, and file size
- **Quality filtered**: Binary files, files containing lines longer than 1,000 characters, and license files are excluded

### Languages

The dataset includes 2,185 programming languages and file types. The top 30 languages by file count:

| Rank | Language | File Count |
|------|----------|------------|
| 1 | tw (Twine) | 3,301,366 |
| 2 | XML | 3,281,566 |
| 3 | svg | 1,744,500 |
| 4 | C# | 1,367,799 |
| 5 | JavaScript | 1,252,710 |
| 6 | C++ | 731,619 |
| 7 | erb | 710,279 |
| 8 | JSON | 398,139 |
| 9 | Text | 377,948 |
| 10 | twee | 300,576 |
| 11 | csv | 205,230 |
| 12 | HTML | 170,711 |
| 13 | Markdown | 160,735 |
| 14 | TypeScript | 147,173 |
| 15 | Lua | 117,079 |
| 16 | PHP | 116,059 |
| 17 | none | 111,791 |
| 18 | pal | 110,626 |
| 19 | CSS | 108,664 |
| 20 | Python | 106,261 |
| 21 | dm | 98,333 |
| 22 | Ruby | 93,685 |
| 23 | _comment | 91,730 |
| 24 | Java | 81,190 |
| 25 | YAML | 63,289 |
| 26 | ActionScript | 62,210 |
| 27 | Git | 43,748 |
| 28 | mdwn | 42,654 |
| 29 | mk | 41,789 |
| 30 | INI | 39,760 |

### Licenses

The dataset includes files from repositories with various licenses:

| License | File Count |
|---------|------------|
| mit | 9,517,343 |
| bsd-3-clause | 3,315,732 |
| unknown | 2,935,736 |
| mpl-2.0 | 338,040 |
| gpl-2.0 | 79,415 |
| lgpl-2.1 | 38,429 |
| gpl-3.0 | 25,964 |
| apache-2.0 | 20,562 |
| cc-by-4.0 | 18,703 |
| agpl-3.0 | 15,367 |
| cc-by-nc-4.0 | 6,362 |
| wtfpl | 6,163 |
| bsd-2-clause | 3,749 |
| zlib | 482 |
| unlicense | 261 |
| cc-by-sa-4.0 | 7 |

## Dataset Structure

### Data Fields

| Field | Type | Description |
|-------|------|-------------|
| `code` | string | Content of the source file (UTF-8 encoded) |
| `repo_name` | string | Name of the GitGud repository (format: `username/repo`) |
| `path` | string | Path of the file within the repository (relative to repo root) |
| `language` | string | Programming language detected by file extension mapping |
| `license` | string | License of the repository (lowercase SPDX-style identifier such as `mit`, or "unknown") |
| `size` | int64 | Size of the source file in bytes |

### Data Format

- **Format**: Apache Parquet with Zstd compression (level 19)
- **File Structure**: 17 files (`gitgud-00000.parquet` to `gitgud-00016.parquet`)
- **Rows per shard**: ~1,000,000 (except last shard: 322,315)
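
The shard layout above can be enumerated programmatically. The following is a minimal sketch, not part of the original pipeline: the `data/` prefix is taken from the file listing, the helper names are hypothetical, and the row-count check assumes every full shard holds exactly 1,000,000 rows, which the figures above state only approximately.

```python
NUM_SHARDS = 17                   # gitgud-00000 .. gitgud-00016
ROWS_PER_FULL_SHARD = 1_000_000   # assumption: "~1,000,000" treated as exact
ROWS_LAST_SHARD = 322_315


def shard_paths(prefix: str = "data") -> list:
    """Return the relative path of every Parquet shard, in order."""
    return [f"{prefix}/gitgud-{i:05d}.parquet" for i in range(NUM_SHARDS)]


def expected_total_rows() -> int:
    """Rows implied by the shard layout under the assumption above."""
    return (NUM_SHARDS - 1) * ROWS_PER_FULL_SHARD + ROWS_LAST_SHARD


print(shard_paths()[0], shard_paths()[-1])
print(expected_total_rows())
```

Under that assumption the computed total equals the 16,322,315 files reported in the dataset summary.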

### Data Splits

All examples are in the train split. There is no validation or test split.

### Example Data Point

```
{
    'code': 'using System;\nusing System.Collections.Generic;\n...',
    'repo_name': 'username/game-mod',
    'path': 'src/GameMod/Player.cs',
    'language': 'C#',
    'license': 'mit',
    'size': 2048
}
```
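
Records with this shape can be filtered on the `language` and `license` fields. The sketch below works on plain dictionaries, and the helper `select` is hypothetical. `stream_train` assumes the Hugging Face `datasets` library is installed and that the repository id `nyuuzyou/gitgud-code` (taken from the dataset URL) resolves to these Parquet shards; it is defined here but not called.

```python
from typing import Iterable, Iterator


def select(records: Iterable[dict], languages: set, licenses: set) -> Iterator[dict]:
    """Yield records whose language and license are both in the allowed sets."""
    for record in records:
        if record["language"] in languages and record["license"] in licenses:
            yield record


def stream_train():
    """Stream the train split instead of downloading every shard first.

    Assumption: the `datasets` package is installed. It is imported lazily
    so the rest of this sketch runs without it.
    """
    from datasets import load_dataset
    return load_dataset("nyuuzyou/gitgud-code", split="train", streaming=True)


# Hand-written sample rows, not taken from the dataset.
sample = [
    {"code": "using System;\n", "repo_name": "username/game-mod",
     "path": "src/GameMod/Player.cs", "language": "C#", "license": "mit", "size": 14},
    {"code": "print('hi')\n", "repo_name": "username/tool",
     "path": "tool.py", "language": "Python", "license": "unknown", "size": 12},
]
kept = list(select(sample, languages={"C#", "Python"}, licenses={"mit", "bsd-3-clause"}))
print([r["path"] for r in kept])
```

Passing `stream_train()` as the `records` argument would apply the same filter to the streamed rows, since each streamed row is a dictionary with the fields listed above.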

## Dataset Creation

### Pipeline Overview

The dataset was created through a multi-stage pipeline:

1. **Repository Discovery**: Scraping public repository URLs from GitGud.io's GitLab API v4 endpoint using multiple sort orderings (`id`, `name`, `path`, `updated_at`, `star_count`, `last_activity_at`, `similarity`)
2. **Branch Enumeration**: Fetching all branches for each repository via the GitLab API
3. **Archive Download**: Downloading `.tar.gz` archives for each repository/branch combination
4. **Content Extraction**: Extracting and filtering source code files from archives
5. **Parquet Generation**: Writing filtered records to Parquet shards with Zstd compression
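
Steps 1 to 3 correspond to standard GitLab REST API v4 routes. The sketch below only builds request URLs and performs no network access. The helper names, the page size of 100, and the use of the `repository/archive.tar.gz` route are assumptions, since the description does not state which exact requests the pipeline issued.

```python
from urllib.parse import urlencode

API_BASE = "https://gitgud.io/api/v4"

# Sort orderings listed in the pipeline description.
ORDERINGS = ["id", "name", "path", "updated_at", "star_count",
             "last_activity_at", "similarity"]


def projects_url(order_by: str, page: int, per_page: int = 100) -> str:
    """URL for one page of the project listing under a given ordering."""
    query = urlencode({"order_by": order_by, "per_page": per_page, "page": page})
    return f"{API_BASE}/projects?{query}"


def branches_url(project_id: int) -> str:
    """URL listing the branches of one project."""
    return f"{API_BASE}/projects/{project_id}/repository/branches"


def archive_url(project_id: int, branch: str) -> str:
    """URL of the .tar.gz archive for one repository/branch combination."""
    query = urlencode({"sha": branch})
    return f"{API_BASE}/projects/{project_id}/repository/archive.tar.gz?{query}"


print(projects_url("id", 1))
print(branches_url(42))
print(archive_url(42, "feature/x"))
```

Iterating over `ORDERINGS` and de-duplicating the results by project id would merge the repositories found under the different listings.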

### Language Detection

Programming languages are detected using file extension mapping. The pipeline maps ~80 programming languages by their file extensions, including:
- **Major languages**: Python, JavaScript, TypeScript, C, C++, C#, Java, Go, Rust, Ruby, PHP
- **Configuration**: JSON, YAML, TOML, XML, INI
- **Markup**: HTML, CSS, Markdown, LaTeX
- **Game development**: GLSL, HLSL, GDScript
- **And many more**

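Files with unrecognized extensions are labeled with the extension itself (without the dot prefix), which is why the dataset reports far more labels than the ~80 mapped languages. Files without extensions are labeled by special filename matching where it applies (e.g., "Dockerfile", "Makefile") and as "none" otherwise.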
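
The labeling rules above can be sketched as follows. This is an illustration under stated assumptions rather than the pipeline's code: `EXTENSION_MAP` is a small hypothetical excerpt of the real mapping, `SPECIAL_FILENAMES` contains only the two names given as examples, and case-insensitive extension matching is assumed.

```python
from pathlib import PurePosixPath

# Hypothetical excerpt; the real pipeline maps ~80 languages.
EXTENSION_MAP = {
    ".py": "Python", ".js": "JavaScript", ".ts": "TypeScript",
    ".cs": "C#", ".cpp": "C++", ".json": "JSON", ".md": "Markdown",
}
SPECIAL_FILENAMES = {"Dockerfile", "Makefile"}


def detect_language(path: str) -> str:
    """Label a file by mapped extension, raw extension, special name, or "none"."""
    p = PurePosixPath(path)
    suffix = p.suffix.lower()      # assumption: matching ignores case
    if suffix in EXTENSION_MAP:
        return EXTENSION_MAP[suffix]
    if suffix:
        return suffix[1:]          # unrecognized extension, dot removed
    if p.name in SPECIAL_FILENAMES:
        return p.name
    return "none"


for example in ["src/GameMod/Player.cs", "story/intro.tw", "Makefile", "bin/run"]:
    print(example, "->", detect_language(example))
```

With this excerpt, `Player.cs` is labeled `C#`, `intro.tw` falls through to the raw extension `tw`, `Makefile` is matched by name, and `bin/run` is labeled `none`.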

### License Detection

Licenses are detected by:
1. Scanning for license files (`LICENSE`, `LICENSE.txt`, `LICENSE.md`, `COPYING`, `COPYING.txt`, `COPYING.md`)
2. Matching license text against known patterns (MIT, Apache 2.0, GPL variants, BSD, Creative Commons, MPL, ISC, Unlicense, Artistic, WTFPL, Zlib, etc.)
3. Defaulting to "unknown" if no license can be detected
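
A minimal sketch of this three-step detection is shown below. The file names come from step 1; the regular expressions are hypothetical stand-ins for the pipeline's "known patterns" and cover only a few licenses.

```python
import re

LICENSE_FILENAMES = {"LICENSE", "LICENSE.txt", "LICENSE.md",
                     "COPYING", "COPYING.txt", "COPYING.md"}

# Hypothetical patterns, checked in order; the first match wins.
PATTERNS = [
    ("mit", r"permission is hereby granted, free of charge"),
    ("apache-2.0", r"apache license,? version 2\.0"),
    ("agpl-3.0", r"gnu affero general public license version 3"),
    ("lgpl-2.1", r"gnu lesser general public license version 2\.1"),
    ("gpl-3.0", r"gnu general public license version 3"),
    ("gpl-2.0", r"gnu general public license version 2"),
    ("unlicense", r"this is free and unencumbered software"),
]


def detect_license(files: dict) -> str:
    """Return a license id for a mapping of file name -> text, or "unknown"."""
    for name, text in files.items():
        if name not in LICENSE_FILENAMES:
            continue
        normalized = " ".join(text.lower().split())  # collapse whitespace
        for license_id, pattern in PATTERNS:
            if re.search(pattern, normalized):
                return license_id
    return "unknown"


print(detect_license({"LICENSE": "MIT License\n\nPermission is hereby granted, free of charge, to any person"}))
print(detect_license({"README.md": "A project without a license file"}))
```

Lowercasing and collapsing whitespace before matching keeps the patterns insensitive to the line wrapping of the license text.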

### File Filtering

Filtering is applied to ensure data quality:

#### Size Limits
| Limit | Value |
|-------|-------|
| Max repository archive size | 64 MB |
| Max line length | 1,000 characters |

#### Content Filtering
- **Binary Detection**: Files with null bytes in the first 1KB are excluded
- **UTF-8 Validation**: Files are decoded as UTF-8; if that fails, latin-1, cp1252, and iso-8859-1 are tried as fallbacks
- **Long Lines**: Files with any line exceeding 1,000 characters are excluded
- **License Files**: License files (LICENSE, COPYING, etc.) are excluded from the dataset (but used for license detection)
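
The content filters above can be combined into a single check. The sketch below is an assumption-laden illustration, not the pipeline's implementation: `filter_file` is a hypothetical name, "1KB" is read as 1024 bytes, and license files are recognized by the six file names listed under License Detection.

```python
from pathlib import PurePosixPath
from typing import Optional

MAX_LINE_LENGTH = 1000
BINARY_PROBE_BYTES = 1024  # assumption: "1KB" means 1024 bytes
LICENSE_FILENAMES = {"LICENSE", "LICENSE.txt", "LICENSE.md",
                     "COPYING", "COPYING.txt", "COPYING.md"}
ENCODINGS = ["utf-8", "latin-1", "cp1252", "iso-8859-1"]


def filter_file(path: str, raw: bytes) -> Optional[str]:
    """Return the decoded text if the file passes every filter, else None."""
    if PurePosixPath(path).name in LICENSE_FILENAMES:
        return None  # license files are used for detection only
    if b"\x00" in raw[:BINARY_PROBE_BYTES]:
        return None  # null byte near the start: treated as binary
    text = None
    for encoding in ENCODINGS:
        try:
            text = raw.decode(encoding)
            break
        except UnicodeDecodeError:
            continue
    if text is None:
        return None
    if any(len(line) > MAX_LINE_LENGTH for line in text.splitlines()):
        return None  # at least one overly long line
    return text


print(filter_file("a.py", b"print('hi')\n") is not None)
print(filter_file("a.bin", b"\x00\x01") is None)
```

Because latin-1 assigns a character to every byte value, the latin-1 fallback always succeeds in this sketch, so the later encodings in the list are never reached.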

### Source Data

All data originates from public repositories hosted on [GitGud.io](https://gitgud.io).

## Considerations for Using the Data

### Personal and Sensitive Information

The dataset may contain:
- Email addresses in code comments or configuration files
- API keys or credentials that were accidentally committed
- Personal information in comments or documentation

Users should exercise caution and implement appropriate filtering when using this data.
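
As one example of such filtering, email addresses in the `code` field could be masked before use. The sketch below is deliberately simple and is only an assumption about what "appropriate filtering" might look like: the pattern will miss unusual addresses, may match strings that are not addresses, and does nothing about API keys or other credentials.

```python
import re

# Simple heuristic: local part, "@", then at least two dot-separated labels.
EMAIL_RE = re.compile(r"[A-Za-z0-9._+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def redact_emails(code: str, placeholder: str = "<EMAIL>") -> str:
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_RE.sub(placeholder, code)


print(redact_emails("# Contact: jane.doe@example.com for details"))
```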

### Licensing Information

This dataset is a collection of source code from repositories with various licenses. Any use of all or part of the code gathered in this dataset must abide by the terms of the original licenses, including attribution clauses when relevant. The license field in each data point indicates the license of the source repository.},
keywords= {},
terms= {},
license= {},
superseded= {}
}
