main (7 files)
data/gitflic-00000.parquet | 1.16GB
data/gitflic-00001.parquet | 1.84GB
data/gitflic-00002.parquet | 744.56MB
data/gitflic-00003.parquet | 1.48GB
data/gitflic-00004.parquet | 807.77MB
data/gitflic-00005.parquet | 878.21MB
README.md | 4.53kB
Type: Dataset
Bibtex:
@article{,
title= {GitFlic Code Dataset},
journal= {},
author= {nyuuzyou},
year= {},
url= {https://huggingface.co/datasets/nyuuzyou/gitflic-code},
abstract= {# GitFlic Code Dataset
## Dataset Description
This dataset was compiled from code repositories hosted on [GitFlic](https://gitflic.ru), the first Russian service for storing and working with source code, based on the Git version control system. GitFlic is widely used by Russian developers, enterprises, and open-source projects, which makes this dataset useful for training code models that need to handle Russian-language comments and documentation and the coding conventions of Russian projects.
### Dataset Summary
| Statistic | Value |
|-----------|-------|
| **Total Files** | 5,975,978 |
| **Total Repositories** | 12,527 |
| **Total Size** | 6.44 GB (compressed Parquet) |
| **Programming Languages** | 690 |
| **File Format** | Parquet with Zstd compression (6 files) |
### Key Features
- **Russian code corpus**: Contains code from over 12,000 repositories, many featuring Russian comments, documentation, and variable names
- **Diverse language coverage**: Spans 690 programming languages identified by [github-linguist](https://github.com/github-linguist/linguist)
- **Deduplicated**: Duplicate files have been removed, and binary files have been filtered out
- **Quality filtered**: Non-code content has been filtered out to improve data quality
### Languages
The dataset includes 690 programming languages. The top 30 languages by file count:
| Rank | Language | File Count |
|------|----------|------------|
| 1 | C | 739,012 |
| 2 | Java | 634,899 |
| 3 | C++ | 587,528 |
| 4 | JavaScript | 422,832 |
| 5 | PHP | 365,105 |
| 6 | XML | 291,920 |
| 7 | Markdown | 211,574 |
| 8 | Shell | 207,178 |
| 9 | Python | 206,443 |
| 10 | Unity3D Asset | 150,654 |
| 11 | SVG | 150,136 |
| 12 | TypeScript | 141,886 |
| 13 | Text | 139,406 |
| 14 | JSON | 126,214 |
| 15 | HTML | 122,341 |
| 16 | Go | 109,740 |
| 17 | YAML | 89,416 |
| 18 | Roff | 82,609 |
| 19 | C# | 77,520 |
| 20 | Makefile | 63,594 |
| 21 | LLVM | 55,680 |
| 22 | Scala | 53,395 |
| 23 | Unix Assembly | 49,909 |
| 24 | Rust | 35,553 |
| 25 | reStructuredText | 35,023 |
| 26 | Objective-C | 34,151 |
| 27 | Ruby | 33,366 |
| 28 | CMake | 33,030 |
| 29 | CSS | 31,664 |
| 30 | TSX | 31,397 |
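A per-language table like the one above can be recomputed from the `language` field of each record. The following is a minimal sketch using only the Python standard library; the `sample` rows are hand-made placeholders, not real dataset rows, and `language_counts` is a hypothetical helper name.

```python
from collections import Counter
from typing import Iterable, Mapping


def language_counts(records: Iterable[Mapping[str, str]]) -> Counter:
    """Count records per value of the `language` field."""
    return Counter(rec["language"] for rec in records)


# Hand-made placeholder rows that only mimic the three documented fields.
sample = [
    {"language": "C", "file_name": "a.c", "file_text": "int main(void) { return 0; }"},
    {"language": "Java", "file_name": "A.java", "file_text": "class A {}"},
    {"language": "C", "file_name": "b.c", "file_text": "void f(void) {}"},
]

counts = language_counts(sample)
top = counts.most_common(1)  # most frequent language first
```

Because `language_counts` accepts any iterable, it can also be fed a generator that streams rows from the Parquet files instead of an in-memory list.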
## Dataset Structure
### Data Fields
| Field | Type | Description |
|-------|------|-------------|
| `file_text` | string | Content of the source file (UTF-8 encoded) |
| `language` | string | Programming language as identified by [github-linguist](https://github.com/github-linguist/linguist) |
| `file_name` | string | A unique identifier for the file within the dataset |
### Data Format
- **Format**: Apache Parquet with Zstd compression
- **File Structure**: 6 files (`gitflic-00000.parquet` to `gitflic-00005.parquet`)
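One way to read the six files without loading everything into memory is to stream record batches. The sketch below assumes the files have been downloaded into a local `data/` directory and that the third-party `pyarrow` package is installed; `shard_paths` and `iter_rows` are hypothetical helper names, and the column names come from the Data Fields table.

```python
from typing import Iterator, List, Optional


def shard_paths(data_dir: str = "data", n_shards: int = 6) -> List[str]:
    """Build the file names gitflic-00000.parquet .. gitflic-00005.parquet."""
    return [f"{data_dir}/gitflic-{i:05d}.parquet" for i in range(n_shards)]


def iter_rows(
    paths: List[str],
    language: Optional[str] = None,
    batch_size: int = 1024,
) -> Iterator[dict]:
    """Yield rows as dicts, batch by batch, optionally filtered by language."""
    # Imported lazily so the path helper works without pyarrow installed.
    import pyarrow.parquet as pq

    for path in paths:
        parquet_file = pq.ParquetFile(path)
        for batch in parquet_file.iter_batches(batch_size=batch_size):
            for row in batch.to_pylist():
                if language is None or row["language"] == language:
                    yield row


# Usage (requires the Parquet files to be present locally):
# for row in iter_rows(shard_paths(), language="Python"):
#     print(row["file_name"], len(row["file_text"]))
```

Streaming in small batches keeps memory use bounded, which matters here because individual files are between roughly 0.7 GB and 1.8 GB compressed.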
### Data Splits
All examples are in the train split. There is no validation or test split.
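Since only a train split is provided, users who need held-out data have to create it themselves. A minimal sketch of one common approach, a deterministic hash-based split, is shown below. The function name `assign_split` and the 5% default are illustrative assumptions; the key could be `file_name` if it is unique in practice, or the file contents otherwise.

```python
import hashlib


def assign_split(key: str, val_fraction: float = 0.05) -> str:
    """Map a key to 'validation' or 'train' deterministically via SHA-256."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # First 4 bytes as an integer in [0, 2**32), scaled to [0, 1).
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return "validation" if bucket < val_fraction else "train"


# The same key always lands in the same split, across runs and machines.
split_a = assign_split("Application.java")
split_b = assign_split("Application.java")
```

Hashing avoids storing a separate index of held-out files and keeps the assignment stable when the data is re-read in a different order.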
### Example Data Point
```
{
    'file_text': 'package com.example.demo;\n\nimport org.springframework.boot.SpringApplication;\n...',
    'language': 'Java',
    'file_name': 'Application.java'
}
```
## Dataset Creation
### Language Detection
Programming languages are detected using [github-linguist](https://github.com/github-linguist/linguist), GitHub's library for language detection.
### Source Data
All data originates from public repositories hosted on [GitFlic](https://gitflic.ru).
## Considerations for Using the Data
### Personal and Sensitive Information
The dataset may contain:
- Email addresses in code comments or configuration files
- API keys or credentials that were accidentally committed
- Personal information in comments or documentation
Users should exercise caution and implement appropriate filtering when using this data.
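As a starting point for such filtering, the sketch below masks email addresses and quoted values assigned to secret-looking names. Both regular expressions are rough heuristics chosen for illustration, not part of the dataset's own pipeline; they will miss many cases and may produce false positives, so a dedicated secret scanner is advisable for serious use.

```python
import re

# Simple email pattern; intentionally loose.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Hypothetical heuristic: a quoted value of 8+ characters assigned to a
# name such as api_key, secret, token, or password.
SECRET_RE = re.compile(
    r"\b(api[_-]?key|secret|token|password)\b(\s*[:=]\s*)(['\"])[^'\"]{8,}\3",
    re.IGNORECASE,
)


def redact(text: str) -> str:
    """Replace emails and secret-looking quoted values with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    # Keep the name, separator, and quotes; replace only the value.
    return SECRET_RE.sub(r"\1\2\3[REDACTED]\3", text)


masked_email = redact("// contact: ivan@example.ru")
masked_key = redact('API_KEY = "abcd1234efgh"')
```

Applying `redact` to `file_text` before training reduces, but does not eliminate, the risk of memorizing personal data or credentials.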
### Licensing Information
This dataset is a collection of source code from repositories with various licenses. Any use of all or part of the code gathered in this dataset must abide by the terms of the original licenses, including attribution clauses when relevant. Users of this dataset should respect the rights of the authors and use the data responsibly.},
keywords= {},
terms= {},
license= {},
superseded= {}
}