GitGud Code Dataset
nyuuzyou

main (18 files)
data/gitgud-00000.parquet 1.18GB
data/gitgud-00001.parquet 1.26GB
data/gitgud-00002.parquet 1.18GB
data/gitgud-00003.parquet 1.14GB
data/gitgud-00004.parquet 1.14GB
data/gitgud-00005.parquet 1.17GB
data/gitgud-00006.parquet 1.14GB
data/gitgud-00007.parquet 1.16GB
data/gitgud-00008.parquet 1.13GB
data/gitgud-00009.parquet 1.12GB
data/gitgud-00010.parquet 1.08GB
data/gitgud-00011.parquet 1.13GB
data/gitgud-00012.parquet 1.17GB
data/gitgud-00013.parquet 1.16GB
data/gitgud-00014.parquet 1.11GB
data/gitgud-00015.parquet 1.15GB
data/gitgud-00016.parquet 348.46MB
README.md 7.26kB
Type: Dataset

Bibtex:
@article{,
title= {GitGud Code Dataset},
journal= {},
author= {nyuuzyou},
year= {},
url= {https://huggingface.co/datasets/nyuuzyou/gitgud-code},
abstract= {# GitGud Code Dataset

## Dataset Description

This dataset was compiled from code repositories hosted on [GitGud.io](https://gitgud.io), a GitLab-based code hosting platform. GitGud.io serves as an alternative git hosting service used by various developer communities and open-source projects.

### Dataset Summary

| Statistic | Value |
|-----------|-------|
| **Total Files** | 16,322,315 |
| **Total Repositories** | 7,204 |
| **Total Size** | 17.46 GB (compressed Parquet) |
| **Languages and File Types** | 2,185 |
| **File Format** | Parquet with Zstd compression (17 files) |

### Key Features

- **Diverse code corpus**: Contains code from over 7,000 repositories across various domains
- **Wide language coverage**: Spans 2,185 programming languages and file types detected by file extension mapping
- **Rich metadata**: Includes repository name, file path, detected language, license information, and file size
- **Quality filtered**: Binary files, files with overly long lines, and license files are removed

### Languages

The dataset includes 2,185 programming languages and file types. The top 30 languages by file count:

| Rank | Language | File Count |
|------|----------|------------|
| 1 | tw (Twine) | 3,301,366 |
| 2 | XML | 3,281,566 |
| 3 | svg | 1,744,500 |
| 4 | C# | 1,367,799 |
| 5 | JavaScript | 1,252,710 |
| 6 | C++ | 731,619 |
| 7 | erb | 710,279 |
| 8 | JSON | 398,139 |
| 9 | Text | 377,948 |
| 10 | twee | 300,576 |
| 11 | csv | 205,230 |
| 12 | HTML | 170,711 |
| 13 | Markdown | 160,735 |
| 14 | TypeScript | 147,173 |
| 15 | Lua | 117,079 |
| 16 | PHP | 116,059 |
| 17 | none | 111,791 |
| 18 | pal | 110,626 |
| 19 | CSS | 108,664 |
| 20 | Python | 106,261 |
| 21 | dm | 98,333 |
| 22 | Ruby | 93,685 |
| 23 | _comment | 91,730 |
| 24 | Java | 81,190 |
| 25 | YAML | 63,289 |
| 26 | ActionScript | 62,210 |
| 27 | Git | 43,748 |
| 28 | mdwn | 42,654 |
| 29 | mk | 41,789 |
| 30 | INI | 39,760 |

### Licenses

The dataset includes files from repositories with various licenses:

| License | File Count |
|---------|------------|
| mit | 9,517,343 |
| bsd-3-clause | 3,315,732 |
| unknown | 2,935,736 |
| mpl-2.0 | 338,040 |
| gpl-2.0 | 79,415 |
| lgpl-2.1 | 38,429 |
| gpl-3.0 | 25,964 |
| apache-2.0 | 20,562 |
| cc-by-4.0 | 18,703 |
| agpl-3.0 | 15,367 |
| cc-by-nc-4.0 | 6,362 |
| wtfpl | 6,163 |
| bsd-2-clause | 3,749 |
| zlib | 482 |
| unlicense | 261 |
| cc-by-sa-4.0 | 7 |
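
As a quick consistency check, the per-license counts above can be summed and compared with the total file count from the Dataset Summary. The sketch below copies the figures from the tables; the names `LICENSE_COUNTS` and `license_share` are illustrative and not part of the dataset.

```python
# File counts per license, copied from the table above.
LICENSE_COUNTS = {
    "mit": 9_517_343,
    "bsd-3-clause": 3_315_732,
    "unknown": 2_935_736,
    "mpl-2.0": 338_040,
    "gpl-2.0": 79_415,
    "lgpl-2.1": 38_429,
    "gpl-3.0": 25_964,
    "apache-2.0": 20_562,
    "cc-by-4.0": 18_703,
    "agpl-3.0": 15_367,
    "cc-by-nc-4.0": 6_362,
    "wtfpl": 6_163,
    "bsd-2-clause": 3_749,
    "zlib": 482,
    "unlicense": 261,
    "cc-by-sa-4.0": 7,
}

TOTAL_FILES = 16_322_315  # "Total Files" in the Dataset Summary


def license_share(license_id: str) -> float:
    """Fraction of all files that carry the given license label."""
    return LICENSE_COUNTS[license_id] / TOTAL_FILES


# The per-license counts add up to the dataset total.
print(sum(LICENSE_COUNTS.values()) == TOTAL_FILES)
print(f"mit: {license_share('mit'):.1%}")
print(f"unknown: {license_share('unknown'):.1%}")
```

The counts sum to the dataset total, so every file carries exactly one license label, and roughly 18% of files fall under `unknown`.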

## Dataset Structure

### Data Fields

| Field | Type | Description |
|-------|------|-------------|
| `code` | string | Content of the source file (UTF-8 encoded) |
| `repo_name` | string | Name of the GitGud repository (format: `username/repo`) |
| `path` | string | Path of the file within the repository (relative to repo root) |
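| `language` | string | Language label from file extension mapping; unrecognized extensions are used as the label (without the dot), and extensionless files are labeled `none` unless matched by filename |
| `license` | string | License of the repository (lowercase SPDX-style identifier such as `mit`, or `unknown`) |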
| `size` | int64 | Size of the source file in bytes |

### Data Format

- **Format**: Apache Parquet with Zstd compression (level 19)
- **File Structure**: 17 files (`gitgud-00000.parquet` to `gitgud-00016.parquet`)
- **Rows per shard**: ~1,000,000 (except last shard: 322,315)
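
The shard layout can be reproduced from the totals above. This is a minimal sketch, assuming every shard except the last holds exactly 1,000,000 rows (the card only says "~1,000,000"); `shard_name` and `shard_layout` are illustrative helpers.

```python
TOTAL_ROWS = 16_322_315     # "Total Files" in the Dataset Summary
ROWS_PER_SHARD = 1_000_000  # assumption: all shards but the last are exactly full


def shard_name(index: int) -> str:
    """File name of the shard with the given zero-based index."""
    return f"gitgud-{index:05d}.parquet"


def shard_layout(total_rows: int = TOTAL_ROWS, rows_per_shard: int = ROWS_PER_SHARD):
    """Return a list of (file name, row count) pairs covering all rows."""
    full, remainder = divmod(total_rows, rows_per_shard)
    layout = [(shard_name(i), rows_per_shard) for i in range(full)]
    if remainder:
        layout.append((shard_name(full), remainder))
    return layout


layout = shard_layout()
print(len(layout))   # number of shards
print(layout[-1])    # name and row count of the last shard
```

Under that assumption the layout has 17 entries, and the last one is `gitgud-00016.parquet` with 322,315 rows, matching the figures above.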

### Data Splits

All examples are in the train split. There is no validation or test split.

### Example Data Point

```
{
    'code': 'using System;\nusing System.Collections.Generic;\n...',
    'repo_name': 'username/game-mod',
    'path': 'src/GameMod/Player.cs',
    'language': 'C#',
    'license': 'mit',
    'size': 2048
}
```
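
Records shaped like the example above can be filtered on the `language` and `license` fields. The sketch below works on plain dictionaries and is independent of any loader; `select_records` and the `PERMISSIVE` set are illustrative, and which licenses count as permissive is an assumption each user should review.

```python
# Assumption: a user-chosen set of permissive license labels from this dataset.
PERMISSIVE = frozenset(
    {"mit", "bsd-2-clause", "bsd-3-clause", "apache-2.0", "zlib", "unlicense"}
)


def select_records(records, languages=None, licenses=PERMISSIVE):
    """Yield records whose language and license match the given sets.

    Passing None for either argument disables that filter.
    """
    for record in records:
        if languages is not None and record["language"] not in languages:
            continue
        if licenses is not None and record["license"] not in licenses:
            continue
        yield record


sample = [
    {"code": "using System;", "repo_name": "username/game-mod",
     "path": "src/GameMod/Player.cs", "language": "C#",
     "license": "mit", "size": 13},
    {"code": "print('hi')", "repo_name": "username/tool",
     "path": "tool.py", "language": "Python",
     "license": "unknown", "size": 11},
]

kept = list(select_records(sample, languages={"C#", "Python"}))
print([r["path"] for r in kept])
```

Here only the C# record survives, because the Python record has the `unknown` license label.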

## Dataset Creation

### Pipeline Overview

The dataset was created through a multi-stage pipeline:

1. **Repository Discovery**: Scraping public repository URLs from GitGud.io's GitLab API v4 endpoint using multiple sort orderings (`id`, `name`, `path`, `updated_at`, `star_count`, `last_activity_at`, `similarity`)
2. **Branch Enumeration**: Fetching all branches for each repository via the GitLab API
3. **Archive Download**: Downloading `.tar.gz` archives for each repository/branch combination
4. **Content Extraction**: Extracting and filtering source code files from archives
5. **Parquet Generation**: Writing filtered records to Parquet shards with Zstd compression
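
The first three stages map onto standard GitLab API v4 endpoints. The sketch below only builds request URLs and performs no network access. The API root `https://gitgud.io/api/v4`, the page size, and the helper names are assumptions; the exact requests used by the original pipeline are not documented here.

```python
from urllib.parse import urlencode

API_ROOT = "https://gitgud.io/api/v4"  # assumption: standard GitLab API path

# Sort orderings listed in step 1.
SORT_ORDERINGS = [
    "id", "name", "path", "updated_at",
    "star_count", "last_activity_at", "similarity",
]


def projects_url(order_by: str, page: int = 1, per_page: int = 100) -> str:
    """URL for one page of the project listing under a given ordering."""
    query = urlencode({"order_by": order_by, "page": page, "per_page": per_page})
    return f"{API_ROOT}/projects?{query}"


def branches_url(project_id: int, page: int = 1, per_page: int = 100) -> str:
    """URL for one page of a project's branch list."""
    query = urlencode({"page": page, "per_page": per_page})
    return f"{API_ROOT}/projects/{project_id}/repository/branches?{query}"


def archive_url(project_id: int, branch: str) -> str:
    """URL of the .tar.gz archive for one repository/branch combination."""
    query = urlencode({"sha": branch})
    return f"{API_ROOT}/projects/{project_id}/repository/archive.tar.gz?{query}"


for ordering in SORT_ORDERINGS:
    print(projects_url(ordering))
print(branches_url(42))
print(archive_url(42, "feature/x"))
```

Listing the same projects under several orderings yields overlapping results, so a crawler built this way would need to deduplicate repositories, for example by project ID.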

### Language Detection

Programming languages are detected using file extension mapping. The pipeline maps ~80 programming languages by their file extensions, including:
- **Major languages**: Python, JavaScript, TypeScript, C, C++, C#, Java, Go, Rust, Ruby, PHP
- **Configuration**: JSON, YAML, TOML, XML, INI
- **Markup**: HTML, CSS, Markdown, LaTeX
- **Game development**: GLSL, HLSL, GDScript
- **And many more**

Files with unrecognized extensions are labeled with the extension itself (without the dot prefix). Files without extensions are labeled as "none" or by special filename matching (e.g., "Dockerfile", "Makefile").
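
The labeling rules above can be sketched as a small function. The mapping below is a tiny illustrative subset, not the pipeline's actual table of roughly 80 languages. The precedence of filename matching over extension lookup and the lowercasing of extensions are assumptions.

```python
import posixpath

# Illustrative subset of an extension-to-language mapping (assumption).
EXTENSION_MAP = {
    ".py": "Python",
    ".js": "JavaScript",
    ".ts": "TypeScript",
    ".cs": "C#",
    ".cpp": "C++",
    ".json": "JSON",
    ".md": "Markdown",
    ".lua": "Lua",
}

# Special filenames mentioned in the text; the labels are assumed.
SPECIAL_FILENAMES = {"Dockerfile": "Dockerfile", "Makefile": "Makefile"}


def detect_language(path: str) -> str:
    """Label a file by filename or extension, following the rules above."""
    name = posixpath.basename(path)
    if name in SPECIAL_FILENAMES:
        return SPECIAL_FILENAMES[name]
    _, ext = posixpath.splitext(name)
    if len(ext) <= 1:
        # No extension at all (or only a trailing dot).
        return "none"
    ext = ext.lower()
    # Unrecognized extensions become the label, without the leading dot.
    return EXTENSION_MAP.get(ext, ext[1:])


for p in ["src/GameMod/Player.cs", "story/intro.tw", "README", "build/Makefile"]:
    print(p, "->", detect_language(p))
```

This fallback explains why the language table contains labels such as `tw`, `pal`, and `mk`: they are bare extensions that the mapping does not recognize.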

### License Detection

Licenses are detected by:
1. Scanning for license files (`LICENSE`, `LICENSE.txt`, `LICENSE.md`, `COPYING`, `COPYING.txt`, `COPYING.md`)
2. Matching license text against known patterns (MIT, Apache 2.0, GPL variants, BSD, Creative Commons, MPL, ISC, Unlicense, Artistic, WTFPL, Zlib, etc.)
3. Defaulting to "unknown" if no license can be detected
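
A minimal sketch of these three steps follows. The regular expressions are hypothetical stand-ins for the pipeline's patterns and cover only four licenses. Restricting the scan to files at the repository root is also an assumption.

```python
import re

LICENSE_FILENAMES = {
    "LICENSE", "LICENSE.txt", "LICENSE.md",
    "COPYING", "COPYING.txt", "COPYING.md",
}

# Hypothetical patterns keyed by the license labels used in this dataset.
LICENSE_PATTERNS = [
    ("mit", re.compile(r"permission is hereby granted, free of charge", re.I)),
    ("apache-2.0", re.compile(r"apache license,? version 2\.0", re.I)),
    ("mpl-2.0", re.compile(r"mozilla public license,? version 2\.0", re.I)),
    ("unlicense", re.compile(r"this is free and unencumbered software", re.I)),
]


def detect_license(root_files: dict) -> str:
    """Return a license label for a repository.

    root_files maps file names at the repository root to their text.
    """
    for name in sorted(root_files):
        if name not in LICENSE_FILENAMES:
            continue
        # Collapse whitespace so patterns can match across line breaks.
        text = " ".join(root_files[name].split())
        for license_id, pattern in LICENSE_PATTERNS:
            if pattern.search(text):
                return license_id
    return "unknown"


repo = {
    "LICENSE": "MIT License\n\nPermission is hereby granted, free of charge,\n"
               "to any person obtaining a copy of this software ...",
    "main.py": "print('hello')",
}
print(detect_license(repo))
print(detect_license({"main.py": "print('hello')"}))
```

Text matching of this kind is heuristic: a repository whose license file is modified or unusual falls through to `unknown`, which is consistent with the large `unknown` share in the license table.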

### File Filtering

Filtering is applied to ensure data quality:

#### Size Limits
| Limit | Value |
|-------|-------|
| Max repository archive size | 64 MB |
| Max line length | 1,000 characters |

#### Content Filtering
- **Binary Detection**: Files with null bytes in the first 1KB are excluded
- **UTF-8 Validation**: Files are decoded as UTF-8, with fallbacks to latin-1, cp1252, and iso-8859-1 when UTF-8 decoding fails
- **Long Lines**: Files with any line exceeding 1,000 characters are excluded
- **License Files**: License files (LICENSE, COPYING, etc.) are excluded from the dataset (but used for license detection)
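
The content filters above can be combined into one function that returns the decoded text or `None` when a file is rejected. This is a sketch under the stated rules, not the pipeline's code; the order of the checks and the helper names are assumptions.

```python
MAX_LINE_LENGTH = 1_000     # characters
BINARY_SNIFF_BYTES = 1_024  # "first 1KB"
LICENSE_FILENAMES = {
    "LICENSE", "LICENSE.txt", "LICENSE.md",
    "COPYING", "COPYING.txt", "COPYING.md",
}
# Encodings are tried in the order listed above. latin-1 decodes any byte
# sequence, so in this sketch the later fallbacks are never reached.
ENCODINGS = ("utf-8", "latin-1", "cp1252", "iso-8859-1")


def decode_text(raw: bytes):
    """Decode bytes with the first encoding that succeeds, else None."""
    for encoding in ENCODINGS:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    return None


def filter_file(path: str, raw: bytes):
    """Return the file's text if it passes all filters, otherwise None."""
    name = path.rsplit("/", 1)[-1]
    if name in LICENSE_FILENAMES:
        return None  # license files are used for detection only
    if b"\x00" in raw[:BINARY_SNIFF_BYTES]:
        return None  # treated as binary
    text = decode_text(raw)
    if text is None:
        return None
    if any(len(line) > MAX_LINE_LENGTH for line in text.splitlines()):
        return None  # at least one overly long line
    return text


print(filter_file("src/app.py", b"print('ok')\n"))
print(filter_file("assets/logo.png", b"\x89PNG\x00\x00"))
print(filter_file("dist/app.min.js", b"x" * 1001))
```

A line of exactly 1,000 characters passes, since only lines exceeding the limit cause exclusion.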

### Source Data

All data originates from public repositories hosted on [GitGud.io](https://gitgud.io).

## Considerations for Using the Data

### Personal and Sensitive Information

The dataset may contain:
- Email addresses in code comments or configuration files
- API keys or credentials that were accidentally committed
- Personal information in comments or documentation

Users should exercise caution and implement appropriate filtering when using this data.
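
One possible starting point for such filtering is a regex-based redaction pass over the `code` field. The patterns below are hypothetical heuristics that will miss many cases and produce false positives. They are a sketch, not a substitute for a dedicated secret scanner.

```python
import re

# Simple email pattern (heuristic, not RFC-complete).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Hypothetical heuristic: a credential-like name assigned a long token.
SECRET_RE = re.compile(
    r"(?i)\b(api[_-]?key|secret|token|password)\b\s*[:=]\s*[\"']?([A-Za-z0-9_\-]{16,})"
)


def redact(code: str) -> str:
    """Replace email addresses and credential-like values with placeholders."""
    code = EMAIL_RE.sub("<EMAIL>", code)
    code = SECRET_RE.sub(
        lambda m: m.group(0).replace(m.group(2), "<SECRET>"), code
    )
    return code


print(redact("# contact: dev@example.com"))
print(redact('API_KEY = "abcdefghijklmnop1234"'))
```

The same function can be applied to each record before training or publication, leaving the other fields untouched.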

### Licensing Information

This dataset is a collection of source code from repositories with various licenses. Any use of all or part of the code gathered in this dataset must abide by the terms of the original licenses, including attribution clauses when relevant. The license field in each data point indicates the license of the source repository.},
keywords= {},
terms= {},
license= {},
superseded= {}
}
