main (2 files): data.parquet (2.11 GB), README.md (4.38 kB)
Type: Dataset
Bibtex:
@article{,
title= {GitVerse Code Dataset},
journal= {},
author= {nyuuzyou},
year= {},
url= {https://huggingface.co/datasets/nyuuzyou/gitverse-code},
abstract= {# GitVerse Code Dataset
## Dataset Description
This dataset was compiled from code repositories hosted on [GitVerse](https://gitverse.ru), a Russian code hosting platform that serves as an alternative to GitHub for the Russian developer community. GitVerse is used by Russian developers, enterprises, and open-source projects, so the dataset is useful for training code models that need to handle Russian-language comments, documentation, and coding conventions.
### Dataset Summary
| Statistic | Value |
|-----------|-------|
| **Total Files** | 2,802,994 |
| **Total Repositories** | 9,014 |
| **Total Size** | 2.11 GB (compressed Parquet) |
| **Programming Languages** | 416 |
| **File Format** | Parquet (single file) |
### Key Features
- **Russian code corpus**: Contains code from over 9,000 repositories, many featuring Russian comments, documentation, and variable names
- **Diverse language coverage**: Spans 416 programming languages identified by [github-linguist](https://github.com/github-linguist/linguist)
### Languages
The dataset includes 416 programming languages. The top 30 languages by file count:
| Rank | Language | File Count |
|------|----------|------------|
| 1 | C | 580,713 |
| 2 | JavaScript | 275,744 |
| 3 | C++ | 197,896 |
| 4 | Shell | 166,527 |
| 5 | Python | 116,065 |
| 6 | Markdown | 112,811 |
| 7 | TypeScript | 107,867 |
| 8 | Java | 88,429 |
| 9 | PHP | 80,341 |
| 10 | Makefile | 77,619 |
| 11 | XML | 75,320 |
| 12 | Go | 69,155 |
| 13 | C# | 68,185 |
| 14 | Text | 65,677 |
| 15 | JSON | 64,253 |
| 16 | SVG | 58,107 |
| 17 | HTML | 43,261 |
| 18 | YAML | 40,178 |
| 19 | Unity3D Asset | 33,917 |
| 20 | Rust | 32,872 |
| 21 | LLVM | 29,819 |
| 22 | Unix Assembly | 27,672 |
| 23 | Roff | 25,884 |
| 24 | CSS | 21,809 |
| 25 | TSX | 21,637 |
| 26 | reStructuredText | 19,683 |
| 27 | Perl | 18,576 |
| 28 | Gettext Catalog | 17,071 |
| 29 | Diff | 14,225 |
| 30 | CMake | 14,132 |
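A ranking like the one above can be recomputed from the per-file `language` value. The following is a minimal sketch, assuming records are available as dicts with a `language` key; the helper name `top_languages` and the sample rows are illustrative and not part of any dataset tooling.

```python
from collections import Counter
from typing import Iterable, Mapping


def top_languages(records: Iterable[Mapping[str, str]], n: int = 30) -> list[tuple[str, int]]:
    """Return the n most common languages as (language, file_count) pairs."""
    counts = Counter(record["language"] for record in records)
    return counts.most_common(n)


# Illustrative rows only; real rows come from data.parquet.
sample = [
    {"language": "C"},
    {"language": "C"},
    {"language": "Python"},
]

# Print the result in the same table layout used above.
for rank, (language, file_count) in enumerate(top_languages(sample), start=1):
    print(f"| {rank} | {language} | {file_count:,} |")
```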
## Dataset Structure
### Data Fields
| Field | Type | Description |
|-------|------|-------------|
| `file_text` | string | The full text content of the file (UTF-8 encoded) |
| `language` | string | Programming language as identified by [github-linguist](https://github.com/github-linguist/linguist) |
| `file_name` | string | A unique identifier for the file within the dataset |
### Data Format
- **Format**: Apache Parquet
- **File Structure**: Single file (`data.parquet`)
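Because everything lives in one Parquet file of roughly 2 GB, it can help to read it in batches and to project only the columns that are needed. The sketch below assumes the third-party `pyarrow` package is installed and that `data.parquet` has been downloaded locally; the function name `iter_rows` and its defaults are illustrative choices.

```python
from typing import Iterator, Sequence


def iter_rows(path: str = "data.parquet",
              columns: Sequence[str] = ("language", "file_name"),
              batch_size: int = 10_000) -> Iterator[dict]:
    """Yield rows as dicts, reading only the requested columns in batches."""
    # pyarrow is a third-party dependency (pip install pyarrow); it is imported
    # lazily so that defining this function does not require it.
    import pyarrow.parquet as pq

    parquet_file = pq.ParquetFile(path)
    for batch in parquet_file.iter_batches(batch_size=batch_size, columns=list(columns)):
        # Each RecordBatch is converted to a list of plain Python dicts.
        yield from batch.to_pylist()
```

Leaving `file_text` out of `columns` keeps memory use low when only metadata is needed, for example to count files per language.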
### Data Splits
All examples are in the train split. There is no validation or test split.
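Since only a train split is provided, anyone who needs held-out data has to derive it. The sketch below shows one possible approach, not a method endorsed by the dataset author: load the train split with the Hugging Face `datasets` library (a third-party dependency, assumed installed) and assign each file to a bucket by hashing its `file_name`. The names `load_train_split` and `assign_split` and the 5% default are assumptions made for illustration.

```python
import hashlib

DATASET_ID = "nyuuzyou/gitverse-code"


def load_train_split(streaming: bool = True):
    """Load the only available split ("train") from the Hugging Face Hub."""
    # Lazy import: `datasets` is only needed when this function is called.
    from datasets import load_dataset

    return load_dataset(DATASET_ID, split="train", streaming=streaming)


def assign_split(file_name: str, eval_percent: int = 5) -> str:
    """Deterministically map a file name to "train" or "eval"."""
    digest = hashlib.sha256(file_name.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable value in 0..99
    return "eval" if bucket < eval_percent else "train"


print(assign_split("004_work.code.bsl"))
```

Hashing keeps the assignment stable across runs and machines. Note that the dataset exposes no repository field, so files from the same repository can land in both buckets.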
### Example Data Point
```
{
'file_text': 'Процедура ОбработкаПроведения(Отказ, Режим)\n\t// Нерабочий вариант без ошибок\n...',
'language': '1C Enterprise',
'file_name': '004_work.code.bsl'
}
```
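Given records shaped like the example above, a common first step is to select files by `language` and, optionally, keep only those whose text contains Cyrillic characters. This is a minimal sketch under those assumptions; the helper `select`, the Cyrillic heuristic (basic Unicode block U+0400 to U+04FF), and the second sample record are illustrative.

```python
import re
from typing import Iterable, Iterator, Mapping

CYRILLIC = re.compile(r"[\u0400-\u04FF]")  # basic Cyrillic Unicode block


def select(records: Iterable[Mapping[str, str]], language: str,
           require_cyrillic: bool = False) -> Iterator[Mapping[str, str]]:
    """Yield records in the given language, optionally only those with Cyrillic text."""
    for record in records:
        if record["language"] != language:
            continue
        if require_cyrillic and not CYRILLIC.search(record["file_text"]):
            continue
        yield record


records = [
    {"file_text": "Процедура ОбработкаПроведения(Отказ, Режим)",
     "language": "1C Enterprise", "file_name": "004_work.code.bsl"},
    # Hypothetical record, added only to exercise the filter.
    {"file_text": "int main(void) { return 0; }",
     "language": "C", "file_name": "main.c"},
]

for record in select(records, "1C Enterprise", require_cyrillic=True):
    print(record["file_name"])
```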
## Dataset Creation
### Language Detection
Programming languages are detected using [github-linguist](https://github.com/github-linguist/linguist), GitHub's library for language detection and syntax highlighting.
### Source Data
All data originates from public repositories hosted on [GitVerse](https://gitverse.ru).
## Considerations for Using the Data
### Personal and Sensitive Information
The dataset may contain:
- Email addresses in code comments or configuration files
- API keys or credentials that were accidentally committed
- Personal information in comments or documentation
Users should exercise caution and implement appropriate filtering when using this data.
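As a starting point for such filtering, the sketch below redacts email addresses and quoted values assigned to key-like names. The patterns are deliberately simple heuristics chosen for illustration; they will miss many real secrets and may flag harmless strings, so they are not a substitute for a dedicated secret scanner. The function name `scrub` and the placeholder strings are assumptions.

```python
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Heuristic: a key-like name assigned a quoted value of 16 or more characters.
SECRET = re.compile(
    r"(api[_-]?key|secret|token|password)\b(\s*[:=]\s*)(['\"])[^'\"\s]{16,}\3",
    re.IGNORECASE,
)


def scrub(text: str) -> str:
    """Replace email addresses and likely credentials with placeholders."""
    text = EMAIL.sub("[REDACTED_EMAIL]", text)
    # Keep the name, separator, and quote style; replace only the value.
    return SECRET.sub(
        lambda m: f"{m.group(1)}{m.group(2)}{m.group(3)}[REDACTED]{m.group(3)}",
        text,
    )


print(scrub('API_KEY = "abcd1234abcd1234abcd"  # contact: dev@example.com'))
```

Applying `scrub` to `file_text` before training reduces, but does not eliminate, the risk of exposing personal data or credentials.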
### Licensing Information
This dataset is a collection of source code from repositories with various licenses. Any use of all or part of the code gathered in this dataset must abide by the terms of the original licenses, including attribution clauses when relevant. Users of this dataset should respect the rights of the authors and use the data responsibly.},
keywords= {},
terms= {},
license= {},
superseded= {}
}