main (2 files)
data.parquet | 554.02MB
README.md | 4.46kB
Type: Dataset
Bibtex:
@article{,
title= {Mos.Hub Code Dataset},
journal= {},
author= {nyuuzyou},
year= {},
url= {https://huggingface.co/datasets/nyuuzyou/moshub-code},
abstract= {# Mos.Hub Code Dataset
## Dataset Description
This dataset was compiled from code repositories hosted on [Mos.Hub](https://hub.mos.ru), a Git-based code hosting platform operated by the Moscow Government. The platform is used primarily by Russian developers and for government-related projects.
### Dataset Summary
| Statistic | Value |
|-----------|-------|
| **Total Files** | 15,740,580 |
| **Total Repositories** | 16,130 |
| **Total Size** | 529 MB (compressed Parquet) |
| **Uncompressed Size** | ~29 GB |
| **Programming Languages** | 297 |
| **File Format** | Parquet (single file) |
### Key Features
- **Russian code corpus**: Contains code from repositories hosted on Moscow's official code platform, featuring Russian comments and documentation
- **Diverse language coverage**: Spans 297 programming languages identified by [github-linguist](https://github.com/github-linguist/linguist)
- **Quality filtered**: Binary files and low-quality content have been removed
### Languages
The dataset includes 297 programming languages. The top 30 languages by file count:
| Rank | Language | File Count |
|------|----------|------------|
| 1 | Ruby | 8,333,731 |
| 2 | JavaScript | 1,786,730 |
| 3 | YAML | 1,757,614 |
| 4 | Vue | 699,171 |
| 5 | Markdown | 639,585 |
| 6 | Haml | 538,837 |
| 7 | GraphQL | 269,485 |
| 8 | JSON | 214,354 |
| 9 | PHP | 191,150 |
| 10 | SVG | 172,884 |
| 11 | Shell | 172,451 |
| 12 | Go | 88,089 |
| 13 | Ignore List | 87,432 |
| 14 | SCSS | 80,716 |
| 15 | Python | 77,532 |
| 16 | C++ | 63,177 |
| 17 | HTML+ERB | 62,605 |
| 18 | Text | 48,400 |
| 19 | Jest Snapshot | 43,638 |
| 20 | HTML | 42,489 |
| 21 | C | 38,354 |
| 22 | reStructuredText | 26,342 |
| 23 | Rust | 24,818 |
| 24 | E-mail | 23,993 |
| 25 | XML | 22,715 |
| 26 | Java | 14,807 |
| 27 | Gettext Catalog | 14,429 |
| 28 | C# | 13,405 |
| 29 | CSS | 12,657 |
| 30 | Protocol Buffer Text Format | 12,181 |
## Dataset Structure
### Data Fields
| Field | Type | Description |
|-------|------|-------------|
| `file_text` | string | Content of the source file (UTF-8 encoded) |
| `language` | string | Programming language as identified by [github-linguist](https://github.com/github-linguist/linguist) |
| `file_name` | string | Name of the source file |
### Data Format
- **Format**: Apache Parquet
- **File Structure**: Single file (`data.parquet`)
### Data Splits
All examples are in the train split. There is no validation or test split.
### Example Data Point
```
{
'file_text': 'package main\n\nimport "fmt"\n\nfunc main() {\n    fmt.Println("Hello")\n}\n',
'language': 'Go',
'file_name': 'main.go'
}
```
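A minimal sketch of reading records with these three fields, assuming `pyarrow` is installed and `data.parquet` has been downloaded locally. The helper names `iter_records` and `count_languages` are hypothetical and not part of the dataset. Reading in batches avoids holding the roughly 29 GB of uncompressed text in memory at once.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator


def iter_records(path: str, batch_size: int = 1024) -> Iterator[Dict[str, str]]:
    """Yield one dict per row, reading the Parquet file batch by batch."""
    # Imported lazily so the pure-Python helper below works without pyarrow.
    import pyarrow.parquet as pq

    parquet_file = pq.ParquetFile(path)
    columns = ["file_text", "language", "file_name"]
    for batch in parquet_file.iter_batches(batch_size=batch_size, columns=columns):
        yield from batch.to_pylist()


def count_languages(records: Iterable[Dict[str, str]]) -> Counter:
    """Count files per value of the `language` field."""
    return Counter(record["language"] for record in records)


# Tiny in-memory sample shaped like the example data point above.
sample = [
    {"file_text": "package main\n", "language": "Go", "file_name": "main.go"},
    {"file_text": "puts 'hi'\n", "language": "Ruby", "file_name": "hi.rb"},
    {"file_text": "puts 'bye'\n", "language": "Ruby", "file_name": "bye.rb"},
]
language_counts = count_languages(sample)
```

On the real file, `count_languages(iter_records("data.parquet"))` would be expected to approximate the per-language file counts listed in the Languages table, though that has not been verified here.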
## Dataset Creation
### Source Data
All data originates from public repositories hosted on [Mos.Hub](https://hub.mos.ru).
### Language Detection
Programming languages are detected using [github-linguist](https://github.com/github-linguist/linguist), GitHub's library for detecting programming languages.
### Filtering
- **Deduplication**: The dataset has been deduplicated to ensure unique code files
- **Binary Files**: Binary files have been removed from the dataset
- **UTF-8 Validation**: Files must be valid UTF-8 encoded text
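The README does not say how these filters were implemented. The following is only an illustrative sketch of one common approach, assuming exact-duplicate removal by SHA-256 content hash and a NUL-byte heuristic for binary detection. The function names `decode_text` and `unique_texts` are hypothetical.

```python
import hashlib
from typing import Iterable, Iterator, Optional


def decode_text(raw: bytes) -> Optional[str]:
    """Return the decoded text, or None for binary or non-UTF-8 content."""
    if b"\x00" in raw:  # assumption: a NUL byte marks the file as binary
        return None
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None


def unique_texts(blobs: Iterable[bytes]) -> Iterator[str]:
    """Yield each valid text once, skipping exact duplicates."""
    seen = set()
    for raw in blobs:
        text = decode_text(raw)
        if text is None:
            continue
        digest = hashlib.sha256(raw).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield text


blobs = [b"print('a')\n", b"print('a')\n", b"\x89PNG\x00\x1a", b"\xff\xfe", b"x = 1\n"]
kept = list(unique_texts(blobs))
```

Exact hashing only catches byte-identical files. Near-duplicate detection (for example, files differing only in whitespace) would need a different technique and is not shown here.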
## Considerations for Using the Data
### Personal and Sensitive Information
The dataset may contain:
- Email addresses in code comments or configuration files
- API keys or credentials that were accidentally committed
- Personal information in comments or documentation
Users should exercise caution and implement appropriate filtering when using this data.
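One possible starting point for such filtering is sketched below. The regular expressions are illustrative assumptions, not an exhaustive or validated detector, and `redact_emails` and `looks_like_secret` are hypothetical helper names.

```python
import re

# Simplified e-mail pattern; real addresses can be more varied than this.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Assumption: secrets often appear as `name = "long value"` assignments.
SECRET_RE = re.compile(
    r"(?i)\b(api[_-]?key|secret|token|password)\b\s*[:=]\s*['\"][^'\"]{8,}['\"]"
)


def redact_emails(text: str) -> str:
    """Replace anything matching EMAIL_RE with a placeholder."""
    return EMAIL_RE.sub("<EMAIL>", text)


def looks_like_secret(text: str) -> bool:
    """Return True if the text contains a likely hard-coded credential."""
    return SECRET_RE.search(text) is not None


snippet = '# contact: dev@example.com\nAPI_KEY = "abcd1234efgh5678"\n'
redacted = redact_emails(snippet)
flagged = looks_like_secret(snippet)
```

Patterns like these produce both false positives and false negatives, so a dedicated secret scanner is likely a better fit for production use.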
### Licensing Information
The licenses used in the source repositories were analyzed when compiling this dataset, with the aim of ethical collection and use of the data. Users should respect the rights of the original authors and use the data responsibly.},
keywords= {},
terms= {},
license= {},
superseded= {}
}