Name: Mos.Hub Code Dataset
Creator: nyuuzyou
Published: 2026-01-09 11:08:52
License: https://academictorrents.com/nolicensespecified
main (2 files)
data.parquet	554.02MB
README.md	4.46kB
Type: Dataset
Tags:
Bibtex:
@article{,
title= {Mos.Hub Code Dataset},
journal= {},
author= {nyuuzyou},
year= {},
url= {https://huggingface.co/datasets/nyuuzyou/moshub-code},
abstract= {# Mos.Hub Code Dataset

## Dataset Description

This dataset was compiled from code repositories hosted on [Mos.Hub](https://hub.mos.ru), a code hosting platform operated by the Moscow Government. Mos.Hub is a Git-based service for storing and working with source code, used primarily by Russian developers and for government-related projects.

### Dataset Summary

| Statistic | Value |
|-----------|-------|
| **Total Files** | 15,740,580 |
| **Total Repositories** | 16,130 |
| **Total Size** | 529 MB (compressed Parquet) |
| **Uncompressed Size** | ~29 GB |
| **Programming Languages** | 297 |
| **File Format** | Parquet (single file) |

### Key Features

- **Russian code corpus**: Contains code from repositories hosted on Moscow's official code platform, featuring Russian comments and documentation
- **Diverse language coverage**: Spans 297 programming languages identified by [github-linguist](https://github.com/github-linguist/linguist)
- **Quality filtered**: Binary files and low-quality content have been removed

### Languages

The dataset includes 297 programming languages. The top 30 languages by file count:

| Rank | Language | File Count |
|------|----------|------------|
| 1 | Ruby | 8,333,731 |
| 2 | JavaScript | 1,786,730 |
| 3 | YAML | 1,757,614 |
| 4 | Vue | 699,171 |
| 5 | Markdown | 639,585 |
| 6 | Haml | 538,837 |
| 7 | GraphQL | 269,485 |
| 8 | JSON | 214,354 |
| 9 | PHP | 191,150 |
| 10 | SVG | 172,884 |
| 11 | Shell | 172,451 |
| 12 | Go | 88,089 |
| 13 | Ignore List | 87,432 |
| 14 | SCSS | 80,716 |
| 15 | Python | 77,532 |
| 16 | C++ | 63,177 |
| 17 | HTML+ERB | 62,605 |
| 18 | Text | 48,400 |
| 19 | Jest Snapshot | 43,638 |
| 20 | HTML | 42,489 |
| 21 | C | 38,354 |
| 22 | reStructuredText | 26,342 |
| 23 | Rust | 24,818 |
| 24 | E-mail | 23,993 |
| 25 | XML | 22,715 |
| 26 | Java | 14,807 |
| 27 | Gettext Catalog | 14,429 |
| 28 | C# | 13,405 |
| 29 | CSS | 12,657 |
| 30 | Protocol Buffer Text Format | 12,181 |

## Dataset Structure

### Data Fields

| Field | Type | Description |
|-------|------|-------------|
| `file_text` | string | Content of the source file (UTF-8 encoded) |
| `language` | string | Programming language as identified by [github-linguist](https://github.com/github-linguist/linguist) |
| `file_name` | string | Name of the source file |

### Data Format

- **Format**: Apache Parquet
- **File Structure**: Single file (`data.parquet`)
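
As a rough illustration of working with the single-file layout, the sketch below streams `data.parquet` in batches instead of loading the roughly 29 GB of uncompressed text at once. It assumes the `pyarrow` package is installed and that the file sits in the working directory; `iter_records` and `count_languages` are hypothetical helper names, not part of the dataset.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator


def iter_records(path: str, batch_size: int = 1024) -> Iterator[Dict[str, str]]:
    """Yield one dict per row, reading the Parquet file batch by batch."""
    # Assumption: pyarrow is installed. It is imported lazily so the
    # counting helper below can be used without it.
    import pyarrow.parquet as pq

    parquet_file = pq.ParquetFile(path)
    columns = ["file_text", "language", "file_name"]
    for batch in parquet_file.iter_batches(batch_size=batch_size, columns=columns):
        yield from batch.to_pylist()


def count_languages(records: Iterable[Dict[str, str]]) -> Counter:
    """Count records per value of the `language` field."""
    return Counter(record["language"] for record in records)
```

Calling `count_languages(iter_records("data.parquet")).most_common(5)` would then list the five most frequent languages while keeping only one batch in memory at a time.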

### Data Splits

All examples are in the train split. There is no validation or test split.

### Example Data Point

```
{
    'file_text': 'package main\n\nimport "fmt"\n\nfunc main() {\n    fmt.Println("Hello")\n}\n',
    'language': 'Go',
    'file_name': 'main.go'
}
```

## Dataset Creation

### Source Data

All data originates from public repositories hosted on [Mos.Hub](https://hub.mos.ru).

### Language Detection

Programming languages are detected using [github-linguist](https://github.com/github-linguist/linguist), GitHub's library for detecting programming languages.

### Filtering

- **Deduplication**: The dataset has been deduplicated to ensure unique code files
- **Binary Files**: Binary files have been removed from the dataset
- **UTF-8 Validation**: Files must be valid UTF-8 encoded text
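
The three filters above could be approximated as in the following sketch. The description does not say how deduplication or binary detection was actually done, so two assumptions are made here: duplicates are exact byte-for-byte matches found via a SHA-256 digest, and a NUL byte marks a file as binary. `decode_text` and `filter_files` are hypothetical names.

```python
import hashlib
from typing import Iterable, Iterator, Optional, Tuple


def decode_text(raw: bytes) -> Optional[str]:
    """Return the decoded text, or None for binary or non-UTF-8 content."""
    # Assumption: a NUL byte marks the file as binary (a common heuristic).
    if b"\x00" in raw:
        return None
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None


def filter_files(files: Iterable[Tuple[str, bytes]]) -> Iterator[Tuple[str, str]]:
    """Yield (file_name, text) for valid UTF-8 text files, first copy only."""
    seen = set()
    for file_name, raw in files:
        text = decode_text(raw)
        if text is None:
            continue
        # Assumption: exact-match deduplication on the raw bytes.
        digest = hashlib.sha256(raw).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        yield file_name, text
```

Exact hashing only catches identical files; near-duplicates such as copies that differ by whitespace would need a different technique.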

## Considerations for Using the Data

### Personal and Sensitive Information

The dataset may contain:
- Email addresses in code comments or configuration files
- API keys or credentials that were accidentally committed
- Personal information in comments or documentation

Users should exercise caution and implement appropriate filtering when using this data.
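
One possible starting point for such filtering is a regular-expression pass over `file_text`, sketched below. The patterns are illustrative assumptions rather than a vetted secret scanner: they will miss many credential formats and flag some harmless strings. `find_sensitive` and `redact_emails` are hypothetical names.

```python
import re
from typing import List

# Assumption: a simple address pattern, not a full RFC 5322 matcher.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Assumption: a credential looks like `<keyword> = <8+ non-space characters>`.
SECRET_RE = re.compile(
    r"(?i)\b(api[_-]?key|secret|token|password)\b\s*[:=]\s*['\"]?([^\s'\"]{8,})"
)


def find_sensitive(text: str) -> List[str]:
    """Return labels for the kinds of sensitive-looking content found in text."""
    labels = []
    if EMAIL_RE.search(text):
        labels.append("email")
    if SECRET_RE.search(text):
        labels.append("credential")
    return labels


def redact_emails(text: str, placeholder: str = "<EMAIL>") -> str:
    """Replace every matched e-mail address with a placeholder."""
    return EMAIL_RE.sub(placeholder, text)
```

Records for which `find_sensitive` returns a non-empty list could be dropped or sent for manual review, depending on how strict the downstream use needs to be.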

### Licensing Information

The licenses used in the source repositories were analyzed when this dataset was compiled, to support ethical collection and use of the data. Users should respect the rights of the original authors and use the data responsibly.},
keywords= {},
terms= {},
license= {},
superseded= {}
}