main (3 files)
documents.parquet | 136.15MB
presentations.parquet | 168.06MB
README.md | 1.91kB
Type: Dataset
Bibtex:
@article{,
title= {Russian Educational Text Collection},
journal= {},
author= {nyuuzyou},
year= {},
url= {https://huggingface.co/datasets/nyuuzyou/edutexts},
abstract= {# Dataset Card for Russian Educational Text Collection
### Dataset Summary
This dataset contains approximately 1.38M educational texts, primarily in Russian, with some content in Ukrainian and English. The content is extracted from presentations and documents, including educational presentations, essays, and various academic documents covering diverse topics from natural sciences to literature.
### Languages
- Russian (ru) - primary language
- Ukrainian (uk) - secondary language
- English (en) - secondary language
Russian is the predominant language in the dataset; Ukrainian and English content appears less frequently.
## Dataset Structure
### Data Fields
The dataset is split into two parquet files:
- presentations (1,335,171 entries):
  - `title`: Title of the presentation (string)
  - `slide_text`: Array of slide contents (list of strings)
- documents (47,474 entries):
  - `title`: Title of the document (string)
  - `document_text`: Full text content of the document (string)
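
A minimal sketch of reading the fields listed above, assuming the files have been downloaded locally under the names shown in the file listing and that pandas with a parquet engine (such as pyarrow) is installed. The helper names `slides_to_text` and `load_presentations`, and the added `text` column, are hypothetical and not part of the dataset.

```python
from typing import Iterable, Optional


def slides_to_text(slides: Optional[Iterable[str]], sep: str = "\n\n") -> str:
    """Join the strings of one `slide_text` entry, skipping empty slides."""
    if slides is None:
        return ""
    return sep.join(s.strip() for s in slides if s and s.strip())


def load_presentations(path: str = "presentations.parquet"):
    """Read the presentations file and add a hypothetical `text` column."""
    # Imported lazily so the helper above works without pandas installed.
    import pandas as pd

    # Only the two documented columns are requested.
    df = pd.read_parquet(path, columns=["title", "slide_text"])
    df["text"] = df["slide_text"].apply(slides_to_text)
    return df
```

Calling `load_presentations()` would return one row per presentation, with the slides flattened into a single string. The documents file should need no flattening, since `document_text` is already one string per entry.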
## Additional Information
### License
This dataset is dedicated to the public domain under the Creative Commons Zero (CC0) license. This means you can:
* Use it for any purpose, including commercial projects
* Modify it however you like
* Distribute it without asking permission
No attribution is required, but it's always appreciated!},
keywords= {},
terms= {},
license= {Creative Commons Zero (CC0)},
superseded= {}
}