GitCode Code Dataset
nyuuzyou

main (35 files)
data/gitcode_0000.parquet 1.26GB
data/gitcode_0001.parquet 1.07GB
data/gitcode_0002.parquet 1.22GB
data/gitcode_0003.parquet 1.19GB
data/gitcode_0004.parquet 1.25GB
data/gitcode_0005.parquet 1.32GB
data/gitcode_0006.parquet 1.31GB
data/gitcode_0007.parquet 1.28GB
data/gitcode_0008.parquet 1.32GB
data/gitcode_0009.parquet 1.32GB
data/gitcode_0010.parquet 1.33GB
data/gitcode_0011.parquet 1.32GB
data/gitcode_0012.parquet 1.30GB
data/gitcode_0013.parquet 1.30GB
data/gitcode_0014.parquet 1.21GB
data/gitcode_0015.parquet 1.22GB
data/gitcode_0016.parquet 1.19GB
data/gitcode_0017.parquet 1.27GB
data/gitcode_0018.parquet 1.25GB
data/gitcode_0019.parquet 1.26GB
data/gitcode_0020.parquet 1.24GB
data/gitcode_0021.parquet 1.26GB
data/gitcode_0022.parquet 1.31GB
data/gitcode_0023.parquet 1.31GB
data/gitcode_0024.parquet 1.23GB
data/gitcode_0025.parquet 1.20GB
data/gitcode_0026.parquet 1.30GB
data/gitcode_0027.parquet 1.26GB
data/gitcode_0028.parquet 1.14GB
data/gitcode_0029.parquet 1.32GB
data/gitcode_0030.parquet 1.24GB
data/gitcode_0031.parquet 1.23GB
data/gitcode_0032.parquet 1.28GB
data/gitcode_0033.parquet 710.65MB
README.md 8.84kB
Type: Dataset

Bibtex:
@article{,
title= {GitCode Code Dataset},
journal= {},
author= {nyuuzyou},
year= {},
url= {https://huggingface.co/datasets/nyuuzyou/gitcode-code},
abstract= {# GitCode Code Dataset

## Dataset Description

This dataset was compiled from code repositories hosted on [GitCode](https://gitcode.com), a code hosting platform in China backed by CSDN (China Software Developer Network). GitCode serves as a domestic alternative to GitHub and is widely used by Chinese developers, students, and enterprises to host open-source projects and educational resources. This makes the dataset particularly valuable for training code models that need to handle Chinese-language comments and documentation as well as Chinese coding conventions.

### Dataset Summary

| Statistic | Value |
|-----------|-------|
| **Total Files** | 48,142,567 |
| **Total Repositories** | 85,632 |
| **Total Size** | 40 GB (compressed Parquet) |
| **Programming Languages** | 537 |
| **File Format** | Parquet with Zstd compression (34 files) |

### Key Features

- **Chinese code corpus**: Contains code from over 85,000 repositories, many featuring Chinese comments, documentation, and variable names
- **Diverse language coverage**: Spans 537 programming languages identified by [go-enry](https://github.com/go-enry/go-enry) (based on GitHub Linguist rules)
- **Rich metadata**: Includes repository name, file path, detected language, license information, and file size
- **Open-source and educational projects**: Includes code from individual developers, students, and Chinese enterprises
- **Quality filtered**: Extensive filtering to remove vendor code, build artifacts, generated files, and low-quality content

### Languages

The dataset includes 537 programming languages. The top 30 languages by file count:

| Rank | Language | File Count |
|------|----------|------------|
| 1 | C++ | 9,513,619 |
| 2 | C | 8,220,317 |
| 3 | Java | 5,362,924 |
| 4 | Python | 3,428,302 |
| 5 | TypeScript | 3,166,959 |
| 6 | JavaScript | 2,540,280 |
| 7 | HTML | 1,578,824 |
| 8 | Kotlin | 1,413,651 |
| 9 | C# | 1,232,638 |
| 10 | Go | 1,159,708 |
| 11 | Rust | 812,959 |
| 12 | Dart | 767,731 |
| 13 | TSX | 749,355 |
| 14 | PHP | 663,953 |
| 15 | Shell | 629,436 |
| 16 | Vue | 563,754 |
| 17 | Makefile | 471,588 |
| 18 | CMake | 460,428 |
| 19 | CSS | 381,628 |
| 20 | Ruby | 350,213 |
| 21 | Objective-C | 347,251 |
| 22 | LLVM | 297,591 |
| 23 | Unix Assembly | 291,826 |
| 24 | Swift | 206,725 |
| 25 | Objective-C++ | 160,526 |
| 26 | Scala | 157,367 |
| 27 | QML | 157,088 |
| 28 | Lua | 149,114 |
| 29 | SCSS | 141,661 |
| 30 | GLSL | 129,124 |

### Licenses

The dataset includes files from repositories under various licenses. Repositories with restrictive licenses (CC-BY-ND variants, Commons Clause, SSPL) were excluded. File counts by detected license:

| License | File Count |
|---------|------------|
| unknown | 23,567,463 |
| apache-2.0 | 8,722,445 |
| mit | 7,743,613 |
| gpl-2.0 | 3,528,526 |
| agpl-3.0 | 2,300,580 |
| lgpl | 1,013,654 |
| bsd-3-clause | 528,980 |
| gpl-3.0 | 305,332 |
| public-domain | 163,493 |
| bsd-2-clause | 94,426 |
| bsd | 69,967 |
| isc | 36,117 |
| unlicense | 28,411 |
| cc0-1.0 | 26,799 |
| mpl-2.0 | 9,459 |
| Other licenses | ~5,000 |
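
Because more than half of the files carry an `unknown` license, downstream users may want to filter on the `license` field before training. The following is a minimal sketch, not part of the dataset tooling: the allowlist is an illustrative selection of permissive keys from the table above, and `permissive_only` and the sample records are hypothetical.

```python
# Illustrative allowlist of permissive keys taken from the table above.
# This is an assumption for the example; adjust it to your own policy.
PERMISSIVE = {
    "apache-2.0", "mit", "bsd-3-clause", "bsd-2-clause", "bsd",
    "isc", "unlicense", "cc0-1.0", "public-domain",
}


def permissive_only(records):
    """Yield only the records whose `license` field is in the allowlist."""
    for record in records:
        if record["license"] in PERMISSIVE:
            yield record


# Hypothetical records carrying only the fields needed for the demo.
sample = [
    {"path": "a.py", "license": "mit"},
    {"path": "b.c", "license": "gpl-2.0"},
    {"path": "c.go", "license": "unknown"},
]
kept = list(permissive_only(sample))
print([r["path"] for r in kept])  # prints ['a.py']
```

Writing the filter as a generator keeps memory use flat when it is applied to a streamed shard rather than an in-memory list.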

## Dataset Structure

### Data Fields

| Field | Type | Description |
|-------|------|-------------|
| `code` | string | Content of the source file (UTF-8 encoded) |
| `repo_name` | string | Name of the GitCode repository (format: `username/repo`) |
| `path` | string | Path of the file within the repository (relative to repo root) |
| `language` | string | Programming language as identified by [go-enry](https://github.com/go-enry/go-enry) |
| `license` | string | License of the repository (lowercase SPDX-style key such as `apache-2.0`, or "unknown") |
| `size` | int64 | Size of the source file in bytes |

### Data Format

- **Format**: Apache Parquet with Zstd compression
- **File Structure**: 34 files (`gitcode_0000.parquet` to `gitcode_0033.parquet`)
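
The shard naming scheme can be enumerated programmatically. This is a small sketch that assumes the shards sit under a `data/` directory, as the repository file listing suggests; the helper name `shard_paths` is hypothetical.

```python
def shard_paths(num_shards: int = 34, prefix: str = "data/") -> list[str]:
    """Return relative paths from gitcode_0000.parquet to gitcode_0033.parquet."""
    return [f"{prefix}gitcode_{i:04d}.parquet" for i in range(num_shards)]


paths = shard_paths()
print(paths[0], paths[-1], len(paths))
```

The resulting paths can be handed to any Parquet reader, which makes it straightforward to process a subset of shards instead of downloading all of them.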

### Data Splits

All examples are in the train split. There is no validation or test split.

### Example Data Point

```
{
    'code': '#!/bin/sh\n# defines variable\ngit=`ls -d */` && path=`pwd` && gitpath=$path/$git\n...',
    'repo_name': '00fly/git-auto-push',
    'path': 'all-in-one-cron-push.sh',
    'language': 'Shell',
    'license': 'apache-2.0',
    'size': 546
}
```
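
A record like the one above can be checked against the documented fields before further processing. This is a minimal sketch, assuming records arrive as plain Python dictionaries; `EXPECTED_FIELDS` and `validate_record` are hypothetical names, and the `code` value is shortened for the demo.

```python
# Field names and Python types corresponding to the Data Fields table.
EXPECTED_FIELDS = {
    "code": str,
    "repo_name": str,
    "path": str,
    "language": str,
    "license": str,
    "size": int,
}


def validate_record(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record matches."""
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            problems.append(f"{name} should be {expected_type.__name__}")
    # repo_name is documented as `username/repo`, so it should contain a slash.
    repo_name = record.get("repo_name")
    if isinstance(repo_name, str) and "/" not in repo_name:
        problems.append("repo_name is not in username/repo form")
    return problems


example = {
    "code": "#!/bin/sh\n# defines variable\n...",
    "repo_name": "00fly/git-auto-push",
    "path": "all-in-one-cron-push.sh",
    "language": "Shell",
    "license": "apache-2.0",
    "size": 546,
}
print(validate_record(example))  # prints []
```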

## Dataset Creation

### Pipeline Overview

The dataset was created through a multi-stage pipeline:

1. **User Discovery**: Collecting usernames via GitCode API
2. **Repository Discovery**: Fetching repository lists for each user via GitCode API
3. **Repository Cloning**: Using Git partial clone with blob size filtering
4. **Content Extraction**: Extracting and filtering source code files
5. **Parquet Generation**: Writing filtered records to Parquet shards with Zstd compression
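
The cloning step (step 3) can be approximated with Git's standard `--filter=blob:limit=<size>` option. The sketch below is an assumption about how such a step might look, not the pipeline's actual code: the 1 MB threshold mirrors the file size limit documented under File Filtering, and the clone URL scheme and helper names are hypothetical.

```python
import subprocess


def partial_clone_command(url: str, dest: str, max_blob: str = "1m") -> list[str]:
    """Build a `git clone` command that omits blobs larger than `max_blob`."""
    return ["git", "clone", f"--filter=blob:limit={max_blob}", url, dest]


def partial_clone(url: str, dest: str, max_blob: str = "1m") -> None:
    """Run the clone; raises CalledProcessError if git exits with an error."""
    subprocess.run(partial_clone_command(url, dest, max_blob), check=True)


# Hypothetical clone URL, used only to show the resulting command line.
cmd = partial_clone_command(
    "https://gitcode.com/00fly/git-auto-push.git", "git-auto-push"
)
print(" ".join(cmd))
```

Note that Git fetches omitted blobs on demand when they are needed to populate the working tree, so a pipeline that wants to avoid large blobs entirely would also have to control the checkout (for example with `--no-checkout`).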

### Language Detection

Programming languages are detected using [go-enry](https://github.com/go-enry/go-enry), a Go port of GitHub's Linguist library. Only files classified as **Programming** or **Markup** language types are included (Data and Prose types are excluded).

### License Detection

Licenses are detected by:
1. Scanning for license files (`LICENSE`, `LICENSE.txt`, `LICENSE.md`, `COPYING`, etc.)
2. Matching license text against known patterns (MIT, Apache 2.0, GPL variants, BSD, Creative Commons, etc.)
3. Defaulting to "unknown" if no license can be detected

**Blocked Licenses**: The following restrictive licenses are excluded from the dataset:
- `cc-by-nd`, `cc-by-nd-2.0`, `cc-by-nd-3.0`, `cc-by-nd-4.0` (Creative Commons No-Derivatives)
- `commons-clause`
- `sspl`, `sspl-1.0` (Server Side Public License)
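
The detection and blocking logic described above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's implementation: the three text patterns are hypothetical examples, the filename set covers only the names listed explicitly, and a real matcher would need many more patterns.

```python
import re

# License filenames listed above, compared case-insensitively.
LICENSE_FILENAMES = {"license", "license.txt", "license.md", "copying"}

# Blocked license keys, exactly as listed above.
BLOCKED = {
    "cc-by-nd", "cc-by-nd-2.0", "cc-by-nd-3.0", "cc-by-nd-4.0",
    "commons-clause", "sspl", "sspl-1.0",
}

# Hypothetical patterns for illustration only.
PATTERNS = [
    ("apache-2.0", re.compile(r"Apache License\s+Version 2\.0", re.I)),
    ("mit", re.compile(r"Permission is hereby granted, free of charge", re.I)),
    ("sspl-1.0", re.compile(r"Server Side Public License", re.I)),
]


def is_license_file(filename: str) -> bool:
    """True if the file name is one of the known license file names."""
    return filename.lower() in LICENSE_FILENAMES


def detect_license(text: str) -> str:
    """Return the first matching license key, or "unknown" if none match."""
    for key, pattern in PATTERNS:
        if pattern.search(text):
            return key
    return "unknown"


def is_blocked(license_key: str) -> bool:
    """True if repositories with this license are excluded from the dataset."""
    return license_key in BLOCKED


print(detect_license("Apache License\n    Version 2.0, January 2004"))
```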

### File Filtering

Extensive filtering is applied to ensure data quality:

#### Size Limits
| Limit | Value |
|-------|-------|
| Max single file size | 1 MB |
| Max line length | 1,000 characters |

#### Excluded Directories
- **Configuration**: `.git/`, `.github/`, `.gitlab/`, `.vscode/`, `.idea/`
- **Vendor/Dependencies**: `node_modules/`, `vendor/`, `third_party/`, `bower_components/`
- **Build Output**: `build/`, `dist/`, `out/`, `bin/`, `target/`, `__pycache__/`, `.next/`, `.nuxt/`, `coverage/`, `.nyc_output/`

#### Excluded Files
- **Lock Files**: `package-lock.json`, `yarn.lock`, `pnpm-lock.yaml`, `Cargo.lock`, `go.sum`, `composer.lock`, `Gemfile.lock`, `poetry.lock`, `Pipfile.lock`
- **Minified Files**: Any file containing `.min.` in the name
- **Binary Files**: `.exe`, `.dll`, `.so`, `.dylib`, `.a`, `.o`, `.obj`, `.lib`, `.jar`, `.class`, `.pyc`, `.pyo`, `.wasm`, `.zip`, `.tar`, `.gz`, `.bz2`, `.7z`, `.rar`, `.png`, `.jpg`, `.jpeg`, `.gif`, `.bmp`, `.ico`, `.svg`, `.webp`, `.mp3`, `.mp4`, `.avi`, `.mov`, `.wav`, `.pdf`, `.doc`, `.docx`, `.xls`, `.xlsx`, `.ttf`, `.otf`, `.woff`, `.woff2`, `.eot`
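
The directory and file exclusions above amount to a path-based filter. The sketch below assumes that an excluded directory name is matched at any depth of the path and that extensions are compared case-insensitively; neither detail is stated in the source. The binary extension set is abbreviated, and `is_excluded_path` is a hypothetical name.

```python
from pathlib import PurePosixPath

EXCLUDED_DIRS = {
    ".git", ".github", ".gitlab", ".vscode", ".idea",
    "node_modules", "vendor", "third_party", "bower_components",
    "build", "dist", "out", "bin", "target", "__pycache__",
    ".next", ".nuxt", "coverage", ".nyc_output",
}
LOCK_FILES = {
    "package-lock.json", "yarn.lock", "pnpm-lock.yaml", "Cargo.lock",
    "go.sum", "composer.lock", "Gemfile.lock", "poetry.lock", "Pipfile.lock",
}
# Abbreviated for the example; the full extension list is given above.
BINARY_EXTENSIONS = {".exe", ".dll", ".so", ".png", ".jpg", ".pdf", ".zip", ".woff2"}


def is_excluded_path(path: str) -> bool:
    """True if a repo-relative path is dropped by the path-based rules."""
    p = PurePosixPath(path)
    # Any directory component (everything except the file name) may exclude.
    if any(part in EXCLUDED_DIRS for part in p.parts[:-1]):
        return True
    if p.name in LOCK_FILES or ".min." in p.name:
        return True
    return p.suffix.lower() in BINARY_EXTENSIONS


print(is_excluded_path("node_modules/lodash/index.js"))  # prints True
print(is_excluded_path("src/main.rs"))  # prints False
```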

#### Content Filtering
- **UTF-8 Validation**: Files must be valid UTF-8 encoded text
- **Binary Detection**: Files detected as binary by go-enry are excluded
- **Generated Files**: Files with generation markers in the first 500 bytes are excluded:
  - `generated by`, `do not edit`, `auto-generated`, `autogenerated`, `automatically generated`, `@generated`, `<auto-generated`, `this file is generated`
- **Empty Files**: Files that are empty or contain only whitespace are excluded
- **Long Lines**: Files with any line exceeding 1,000 characters are excluded
- **go-enry Filters**: Additional filtering using go-enry's `IsVendor()`, `IsImage()`, `IsDotFile()`, and `IsGenerated()` functions
- **Documentation-only Repos**: Repositories containing only documentation files (no actual code) are skipped
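
Several of the content rules above can be expressed as one predicate over the raw file bytes. This is a sketch under assumptions, not the pipeline's code: it treats 1 MB as 10**6 bytes, matches generation markers case-insensitively, and leaves out the go-enry checks, which have no standard-library equivalent. The name `keep_content` is hypothetical.

```python
GENERATED_MARKERS = (
    "generated by", "do not edit", "auto-generated", "autogenerated",
    "automatically generated", "@generated", "<auto-generated",
    "this file is generated",
)
MAX_FILE_BYTES = 1_000_000  # assumption: "1 MB" means 10**6 bytes
MAX_LINE_CHARS = 1_000


def keep_content(raw: bytes) -> bool:
    """True if the file passes the size, UTF-8, emptiness, marker and line rules."""
    if len(raw) > MAX_FILE_BYTES:
        return False
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    if not text.strip():
        return False
    # Only the first 500 bytes are scanned for generation markers.
    head = raw[:500].decode("utf-8", errors="ignore").lower()
    if any(marker in head for marker in GENERATED_MARKERS):
        return False
    return all(len(line) <= MAX_LINE_CHARS for line in text.splitlines())


print(keep_content(b"print('hello')\n"))  # prints True
print(keep_content(b"// Code generated by protoc. DO NOT EDIT.\n"))  # prints False
```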

### Source Data

All data originates from public repositories hosted on [GitCode](https://gitcode.com).

## Considerations for Using the Data

### Personal and Sensitive Information

The dataset may contain:
- Email addresses in code comments or configuration files
- API keys or credentials that were accidentally committed
- Personal information in comments or documentation

Users should exercise caution and implement appropriate filtering when using this data.
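
One possible starting point for such filtering is a regex-based redaction pass. The sketch below is illustrative only: both patterns are hypothetical, will miss many real secrets, and may produce false positives, so a dedicated secret scanner would be more suitable for production use.

```python
import re

# Simple email pattern; intentionally loose.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Hypothetical pattern for quoted assignments such as api_key = "..." .
SECRET_RE = re.compile(
    r"(?i)\b(api[_-]?key|secret|token|password)\b(\s*[:=]\s*)(['\"])[^'\"]{8,}\3"
)


def _mask_secret(match: re.Match) -> str:
    """Keep the variable name and quoting, replace the value."""
    name, sep, quote = match.group(1), match.group(2), match.group(3)
    return f"{name}{sep}{quote}<REDACTED>{quote}"


def redact(code: str) -> str:
    """Replace email addresses and quoted secret-like values with placeholders."""
    code = EMAIL_RE.sub("<EMAIL>", code)
    return SECRET_RE.sub(_mask_secret, code)


print(redact('api_key = "abcd1234efgh"  # contact: dev@example.com'))
```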

### Licensing Information

This dataset is a collection of source code from repositories with various licenses. Any use of all or part of the code gathered in this dataset must abide by the terms of the original licenses, including attribution clauses when relevant. The license field in each data point indicates the license of the source repository.},
keywords= {},
terms= {},
license= {},
superseded= {}
}
