GitVerse Code Dataset
nyuuzyou

Folder main (2 files)
data.parquet 2.11 GB
README.md 4.38 kB
Type: Dataset
Bibtex:
@article{,
title= {GitVerse Code Dataset},
journal= {},
author= {nyuuzyou},
year= {},
url= {https://huggingface.co/datasets/nyuuzyou/gitverse-code},
abstract= {# GitVerse Code Dataset

## Dataset Description

This dataset was compiled from code repositories hosted on [GitVerse](https://gitverse.ru), a Russian code hosting platform used as an alternative to GitHub by Russian developers, enterprises, and open-source projects. Many repositories contain Russian-language comments, documentation, and identifiers, which makes the dataset useful for training code models that need to handle Russian text and Russian coding conventions.

### Dataset Summary

| Statistic | Value |
|-----------|-------|
| **Total Files** | 2,802,994 |
| **Total Repositories** | 9,014 |
| **Total Size** | 2.11 GB (compressed Parquet) |
| **Programming Languages** | 416 |
| **File Format** | Parquet (single file) |

### Key Features

- **Russian code corpus**: Contains code from over 9,000 repositories, many featuring Russian comments, documentation, and variable names
- **Diverse language coverage**: Spans 416 programming languages identified by [github-linguist](https://github.com/github-linguist/linguist)

### Languages

The dataset includes 416 programming languages. The top 30 languages by file count:

| Rank | Language | File Count |
|------|----------|------------|
| 1 | C | 580,713 |
| 2 | JavaScript | 275,744 |
| 3 | C++ | 197,896 |
| 4 | Shell | 166,527 |
| 5 | Python | 116,065 |
| 6 | Markdown | 112,811 |
| 7 | TypeScript | 107,867 |
| 8 | Java | 88,429 |
| 9 | PHP | 80,341 |
| 10 | Makefile | 77,619 |
| 11 | XML | 75,320 |
| 12 | Go | 69,155 |
| 13 | C# | 68,185 |
| 14 | Text | 65,677 |
| 15 | JSON | 64,253 |
| 16 | SVG | 58,107 |
| 17 | HTML | 43,261 |
| 18 | YAML | 40,178 |
| 19 | Unity3D Asset | 33,917 |
| 20 | Rust | 32,872 |
| 21 | LLVM | 29,819 |
| 22 | Unix Assembly | 27,672 |
| 23 | Roff | 25,884 |
| 24 | CSS | 21,809 |
| 25 | TSX | 21,637 |
| 26 | reStructuredText | 19,683 |
| 27 | Perl | 18,576 |
| 28 | Gettext Catalog | 17,071 |
| 29 | Diff | 14,225 |
| 30 | CMake | 14,132 |
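
A per-language tally like the table above can be recomputed from the `language` field. The following is a minimal sketch using only the standard library; the helper name `count_languages` and the sample records are hypothetical, and it assumes each record is a dict carrying the fields described under Data Fields.

```python
from collections import Counter


def count_languages(records, top_n=30):
    """Return the top_n (language, file_count) pairs, most frequent first."""
    counts = Counter(record["language"] for record in records)
    return counts.most_common(top_n)


# Tiny hypothetical sample; real records would come from data.parquet.
sample = [
    {"language": "C", "file_name": "a.c", "file_text": "int main(void) { return 0; }"},
    {"language": "C", "file_name": "b.c", "file_text": "void f(void) {}"},
    {"language": "Python", "file_name": "c.py", "file_text": "print('hi')"},
]

# Print the result in the same row layout as the table above.
for rank, (language, file_count) in enumerate(count_languages(sample), start=1):
    print(f"| {rank} | {language} | {file_count:,} |")
```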

## Dataset Structure

### Data Fields

| Field | Type | Description |
|-------|------|-------------|
| `file_text` | string | The full text content of the file (UTF-8 encoded) |
| `language` | string | Programming language as identified by [github-linguist](https://github.com/github-linguist/linguist) |
| `file_name` | string | A unique identifier for the file within the dataset |

### Data Format

- **Format**: Apache Parquet
- **File Structure**: Single file (`data.parquet`)
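
Because everything sits in one large Parquet file, reading it in batches keeps memory use bounded. The sketch below assumes the third-party `pyarrow` package is installed and that `data.parquet` has been downloaded to the working directory; the helper names `iter_rows` and `only_language` are hypothetical.

```python
def iter_rows(path="data.parquet", columns=("language", "file_name"), batch_size=10_000):
    """Yield one dict per row, reading the Parquet file batch by batch."""
    # Imported lazily so the module loads even where pyarrow is absent.
    import pyarrow.parquet as pq  # assumption: installed via `pip install pyarrow`

    parquet_file = pq.ParquetFile(path)
    for batch in parquet_file.iter_batches(batch_size=batch_size, columns=list(columns)):
        # Each batch is a RecordBatch; to_pylist() converts it to a list of dicts.
        yield from batch.to_pylist()


def only_language(rows, language):
    """Keep rows whose `language` field equals the given name."""
    return (row for row in rows if row["language"] == language)
```

For example, `only_language(iter_rows(), "Python")` would lazily yield the Python rows. Leaving `file_text` out of `columns` avoids loading file contents when only metadata is needed.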

### Data Splits

All examples are in the train split. There is no validation or test split.
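
Users who need a held-out set have to carve one out themselves. One possible approach, shown here as a sketch and not as anything the dataset prescribes, is to hash `file_name` so that each file lands in the same split on every run. The function name `assign_split` and the 5% default are assumptions.

```python
import hashlib


def assign_split(file_name, validation_percent=5):
    """Map a file name to 'train' or 'validation' deterministically."""
    digest = hashlib.sha256(file_name.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and reduce it to a 0-99 bucket.
    bucket = int.from_bytes(digest[:8], "big") % 100
    return "validation" if bucket < validation_percent else "train"


print(assign_split("004_work.code.bsl"))
```

The schema lists no repository field, so this splits per file; near-duplicate files from the same repository may end up on both sides.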

### Example Data Point

```
{
    'file_text': 'Процедура ОбработкаПроведения(Отказ, Режим)\n\t// Нерабочий вариант без ошибок\n...',
    'language': '1C Enterprise',
    'file_name': '004_work.code.bsl'
}
```

## Dataset Creation

### Language Detection

Programming languages are detected using [github-linguist](https://github.com/github-linguist/linguist), GitHub's library for language detection and syntax highlighting.

### Source Data

All data originates from public repositories hosted on [GitVerse](https://gitverse.ru).

## Considerations for Using the Data

### Personal and Sensitive Information

The dataset may contain:
- Email addresses in code comments or configuration files
- API keys or credentials that were accidentally committed
- Personal information in comments or documentation

Users should exercise caution and implement appropriate filtering when using this data.
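
As one illustration of such filtering, the sketch below redacts e-mail addresses and values assigned to credential-like names. The patterns are deliberately simple and hypothetical: they will miss many real secrets and may over-match, so a dedicated secret scanner is likely a better fit for production use.

```python
import re

# Hypothetical, intentionally simple patterns (not exhaustive).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SECRET_RE = re.compile(
    r"(?i)(\b(?:api[_-]?key|secret|password|token)\b\s*[:=]\s*['\"]?)[^\s'\"]{8,}"
)


def redact(text):
    """Replace e-mail addresses and likely credential values with placeholders."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    # Keep the name and assignment operator, replace only the value.
    return SECRET_RE.sub(lambda match: match.group(1) + "<REDACTED>", text)


print(redact("API_KEY = 'abcd1234efgh'  # ask dev@example.com"))
```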

### Licensing Information

This dataset is a collection of source code from repositories with various licenses. Any use of all or part of the code gathered in this dataset must abide by the terms of the original licenses, including attribution clauses when relevant. Users of this dataset should respect the rights of the authors and use the data responsibly.},
keywords= {},
terms= {},
license= {},
superseded= {}
}

