GitFlic Code Dataset
nyuuzyou

main (7 files)
data/gitflic-00000.parquet 1.16GB
data/gitflic-00001.parquet 1.84GB
data/gitflic-00002.parquet 744.56MB
data/gitflic-00003.parquet 1.48GB
data/gitflic-00004.parquet 807.77MB
data/gitflic-00005.parquet 878.21MB
README.md 4.53kB
Type: Dataset

Bibtex:
@article{,
title= {GitFlic Code Dataset},
journal= {},
author= {nyuuzyou},
year= {},
url= {https://huggingface.co/datasets/nyuuzyou/gitflic-code},
abstract= {# GitFlic Code Dataset

## Dataset Description

This dataset was compiled from code repositories hosted on [GitFlic](https://gitflic.ru), the first Russian service for storing and working with source code, based on the Git version control system. GitFlic is widely used by Russian developers, enterprises, and open-source projects, which makes this dataset particularly valuable for training code models that need to understand Russian-language text and Russian coding conventions.

### Dataset Summary

| Statistic | Value |
|-----------|-------|
| **Total Files** | 5,975,978 |
| **Total Repositories** | 12,527 |
| **Total Size** | 6.44 GB (compressed Parquet) |
| **Programming Languages** | 690 |
| **File Format** | Parquet with Zstd compression (6 files) |

### Key Features

- **Russian code corpus**: Contains code from over 12,000 repositories, many featuring Russian comments, documentation, and variable names
- **Diverse language coverage**: Spans 690 programming languages identified by [github-linguist](https://github.com/github-linguist/linguist)
- **Deduplicated**: The dataset has been deduplicated and filtered to remove binary files
- **Quality filtered**: Filtered to ensure data quality and remove non-code content

### Languages

The dataset includes 690 programming languages. The top 30 languages by file count:

| Rank | Language | File Count |
|------|----------|------------|
| 1 | C | 739,012 |
| 2 | Java | 634,899 |
| 3 | C++ | 587,528 |
| 4 | JavaScript | 422,832 |
| 5 | PHP | 365,105 |
| 6 | XML | 291,920 |
| 7 | Markdown | 211,574 |
| 8 | Shell | 207,178 |
| 9 | Python | 206,443 |
| 10 | Unity3D Asset | 150,654 |
| 11 | SVG | 150,136 |
| 12 | TypeScript | 141,886 |
| 13 | Text | 139,406 |
| 14 | JSON | 126,214 |
| 15 | HTML | 122,341 |
| 16 | Go | 109,740 |
| 17 | YAML | 89,416 |
| 18 | Roff | 82,609 |
| 19 | C# | 77,520 |
| 20 | Makefile | 63,594 |
| 21 | LLVM | 55,680 |
| 22 | Scala | 53,395 |
| 23 | Unix Assembly | 49,909 |
| 24 | Rust | 35,553 |
| 25 | reStructuredText | 35,023 |
| 26 | Objective-C | 34,151 |
| 27 | Ruby | 33,366 |
| 28 | CMake | 33,030 |
| 29 | CSS | 31,664 |
| 30 | TSX | 31,397 |

## Dataset Structure

### Data Fields

| Field | Type | Description |
|-------|------|-------------|
| `file_text` | string | Content of the source file (UTF-8 encoded) |
| `language` | string | Programming language as identified by [github-linguist](https://github.com/github-linguist/linguist) |
| `file_name` | string | A unique identifier for the file within the dataset |
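
As an illustration of how these three fields might be used together, the sketch below filters rows by `language` and tallies files per language. It assumes each row is available as a plain Python dict keyed by the field names above; the helper names `filter_by_language` and `language_counts` are hypothetical and not part of the dataset.

```python
from collections import Counter


def filter_by_language(rows, languages):
    """Yield only the rows whose `language` field is in `languages`."""
    wanted = set(languages)
    for row in rows:
        if row["language"] in wanted:
            yield row


def language_counts(rows):
    """Count rows per `language` value."""
    return Counter(row["language"] for row in rows)
```

Because `filter_by_language` is a generator, it can be chained onto any row iterator without holding a whole shard in memory.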

### Data Format

- **Format**: Apache Parquet with Zstd compression
- **File Structure**: 6 files (`gitflic-00000.parquet` to `gitflic-00005.parquet`)
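
A minimal sketch of reading the shards locally, assuming the files have been downloaded into a `data/` directory (the directory name is an assumption) and that the third-party `pyarrow` package is installed. The helper names `shard_paths` and `iter_rows` are hypothetical. Reading in batches avoids loading a shard of more than 1 GB into memory at once.

```python
from pathlib import Path

NUM_SHARDS = 6  # gitflic-00000.parquet through gitflic-00005.parquet


def shard_paths(data_dir="data"):
    """Return the expected shard paths in order (data_dir is an assumption)."""
    return [Path(data_dir) / f"gitflic-{i:05d}.parquet" for i in range(NUM_SHARDS)]


def iter_rows(path, batch_size=1000):
    """Yield one dict per source file, reading the shard batch by batch."""
    import pyarrow.parquet as pq  # third-party; imported lazily

    parquet_file = pq.ParquetFile(path)
    for batch in parquet_file.iter_batches(batch_size=batch_size):
        yield from batch.to_pylist()
```

Each yielded dict should carry the `file_text`, `language`, and `file_name` keys described under Data Fields.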

### Data Splits

All examples are in the train split. There is no validation or test split.
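
Since only a train split is provided, anyone who needs held-out data has to carve it out themselves. One common approach, sketched here as an assumption rather than anything the dataset prescribes, is to hash a stable per-file key (for example `file_name`) so that the assignment is deterministic across runs. The function name `assign_split` and the 5% default are hypothetical.

```python
import hashlib


def assign_split(key, holdout_percent=5):
    """Map a string key to 'train' or 'validation' deterministically."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in 0..99
    return "validation" if bucket < holdout_percent else "train"
```

Hashing instead of random sampling means the same file always lands in the same split, regardless of shard order or how many times the data is re-read.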

### Example Data Point

```
{
    'file_text': 'package com.example.demo;\n\nimport org.springframework.boot.SpringApplication;\n...',
    'language': 'Java',
    'file_name': 'Application.java'
}
```

## Dataset Creation

### Language Detection

Programming languages are detected using [github-linguist](https://github.com/github-linguist/linguist), GitHub's library for language detection.

### Source Data

All data originates from public repositories hosted on [GitFlic](https://gitflic.ru).

## Considerations for Using the Data

### Personal and Sensitive Information

The dataset may contain:
- Email addresses in code comments or configuration files
- API keys or credentials that were accidentally committed
- Personal information in comments or documentation

Users should exercise caution and implement appropriate filtering when using this data.
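
As a starting point for such filtering, the sketch below flags email addresses and strings that look like hard-coded credentials. Both regular expressions are rough heuristics chosen for illustration, not a vetted secret scanner; they will miss some cases and flag some harmless ones. The names `scan_file` and `is_clean` are hypothetical.

```python
import re

# Heuristic patterns (assumptions for illustration, not exhaustive).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SECRET_RE = re.compile(
    r"""\b(?:api[_-]?key|secret|token|password)\b\s*[:=]\s*['"]?[A-Za-z0-9_\-]{16,}""",
    re.IGNORECASE,
)


def scan_file(text):
    """Return the email addresses and credential-like assignments found in text."""
    return {
        "emails": EMAIL_RE.findall(text),
        "possible_secrets": [m.group(0) for m in SECRET_RE.finditer(text)],
    }


def is_clean(text):
    """True when neither heuristic matches anything in text."""
    found = scan_file(text)
    return not found["emails"] and not found["possible_secrets"]
```

Applied to the `file_text` field, `is_clean` can drop flagged files outright, while `scan_file` supports redaction or manual review instead.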

### Licensing Information

This dataset is a collection of source code from repositories with various licenses. Any use of all or part of the code gathered in this dataset must abide by the terms of the original licenses, including attribution clauses when relevant. Users of this dataset should respect the rights of the authors and use the data responsibly.},
keywords= {},
terms= {},
license= {},
superseded= {}
}

