<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:academictorrents="https://academictorrents.com" version="2.0">
<channel>
<title>Academic Torrents</title>
<description>Recent Torrents</description>
<link>https://academictorrents.com/</link>
<item>
<title>Public MediaWiki Collection</title>
<category>Dataset</category>
<infohash>3ec30d9d8817f62d338ae76783d24ba207b6e9de</infohash>
<guid>https://academictorrents.com/details/3ec30d9d8817f62d338ae76783d24ba207b6e9de</guid>
<link>https://academictorrents.com/details/3ec30d9d8817f62d338ae76783d24ba207b6e9de</link>
<description># Dataset Card for Public MediaWiki Collection ### Dataset Summary This dataset contains 1,662,448 articles harvested from 930 random public MediaWiki instances found across the Internet. The collection was created by extracting current page content from these wikis, preserving article text, metadata, and structural information. The dataset represents a diverse cross-section of public wiki content spanning multiple domains, topics, and languages. ### Languages The dataset is multilingual, covering 35+ languages found across the collected wiki instances. ## Dataset Structure ### Data Fields This dataset includes the following fields: -  id : Unique identifier for the article (string) -  title : Title of the article (string) -  text : Main content of the article (string) -  metadata : Dictionary containing: -  templates : List of templates used in the article -  categories : List of categories the article belongs to -  wikilinks : List of internal wiki links and their text -  external_links : List of external links -  sections : List of section titles and their levels ### Data Splits All examples are in a single split. ## Additional Information ### License This dataset is released under the Creative Commons Attribution-ShareAlike 4.0 International (CC-BY-SA 4.0) license, consistent with the licensing of the source MediaWiki instances. To learn more about CC-BY-SA 4.0, visit: https://creativecommons.org/licenses/by-sa/4.0/</description>
<size>1167900670</size>
</item><item>
<title>9111.ru Questions Dataset</title>
<category>Dataset</category>
<infohash>3fa77d9c4028fd6aa8a6dbdad67a218fc1ad7a5d</infohash>
<guid>https://academictorrents.com/details/3fa77d9c4028fd6aa8a6dbdad67a218fc1ad7a5d</guid>
<link>https://academictorrents.com/details/3fa77d9c4028fd6aa8a6dbdad67a218fc1ad7a5d</link>
<description># Dataset Card for 9111.ru Questions ### Dataset Summary This dataset includes legal questions and answers from the Russian law forum [9111.ru](https://9111.ru). It contains inquiries from users and corresponding responses from lawyers. The dataset was created by processing around 21 million questions, providing a significant corpus of legal discussions. ### Languages The dataset is mostly in Russian, but other languages may be present. ## Dataset Structure ### Data Fields This dataset includes the following fields: -  id : Identifier for the item (integer) -  title : Title of the question (string) -  description : Description of the question (string) -  answers : An array of answer objects (array), each containing: -  user_name : Name of the user who answered (string) -  status : Status of the user (string) -  rating : Rating of the user (integer) -  text : Text of the answer (string) ### Data Format The dataset is stored as Apache Parquet files with zstd compression (level 19), split into 3 shards: -  questions-00000-of-00003.parquet  -  questions-00001-of-00003.parquet  -  questions-00002-of-00003.parquet  ### Data Splits All examples are in the train split; there is no validation split.</description>
<size>2938274461</size>
</item><item>
<title>Fandom.com Community Database Dumps Dataset</title>
<category>Dataset</category>
<infohash>0a0ad3dd44e05af1725fd8d17f5aeba856078d5f</infohash>
<guid>https://academictorrents.com/details/0a0ad3dd44e05af1725fd8d17f5aeba856078d5f</guid>
<link>https://academictorrents.com/details/0a0ad3dd44e05af1725fd8d17f5aeba856078d5f</link>
<description># Dataset Card for Fandom.com Community Database Dumps ### Dataset Summary This dataset contains 7,040,984 current pages from all available [Fandom.com community wiki dumps](https://community.fandom.com/wiki/Help:Database_download) as of February 18, 2025. The dataset was created by processing the "Current pages" database dumps from all available Fandom.com wikis. These dumps contain only the current versions of pages without edit history and includes article text, metadata, and structural information across multiple languages. ### Languages The dataset is multilingual, covering [40+ languages](https://community.fandom.com/wiki/Help:Language_codes). ## Dataset Structure ### Data Fields This dataset includes the following fields: -  id : Unique identifier for the article (string) -  title : Title of the article (string) -  text : Main content of the article (string) -  metadata : Dictionary containing: -  templates : List of templates used in the article -  categories : List of categories the article belongs to -  wikilinks : List of internal wiki links and their text -  external_links : List of external links -  sections : List of section titles and their levels ### Data Splits All examples are in a single split. ## Additional Information ### License This dataset inherits the licenses from the source Fandom communities, which use Creative Commons Attribution-ShareAlike 3.0 (CC-BY-SA 3.0). To learn more about CC-BY-SA 3.0, visit: https://creativecommons.org/licenses/by-sa/3.0/</description>
<size>6224651822</size>
</item><item>
<title>ca-on_province_of_ontario-2024A000235_drape_eastern_ontario_orthoimagery_2024_16cm_v0.1.0-beta.pmtiles</title>
<category>Dataset</category>
<infohash>adb5741cdcb9352848cc80c976629b44720a04c2</infohash>
<guid>https://academictorrents.com/details/adb5741cdcb9352848cc80c976629b44720a04c2</guid>
<link>https://academictorrents.com/details/adb5741cdcb9352848cc80c976629b44720a04c2</link>
<description>High‑resolution aerial imagery from Ontario’s DRAPE 2024, packaged for fast web maps and offline use. Smooth panning, crisp detail, open data.</description>
<size>215750229448</size>
</item><item>
<title>Street-Level Imagery Dataset</title>
<category>Dataset</category>
<infohash>207ba45161f6ba12114cb6d97ad25d222d5125c9</infohash>
<guid>https://academictorrents.com/details/207ba45161f6ba12114cb6d97ad25d222d5125c9</guid>
<link>https://academictorrents.com/details/207ba45161f6ba12114cb6d97ad25d222d5125c9</link>
<description># Street-Level Imagery Dataset Metadata for street-level imagery across Eastern Europe and Northern Asia. Each record includes image URLs, coordinates, camera orientation, timestamps, and links to similar images captured at the same location over time. ## Summary | Statistic | Value | |---|---| | Total Records | 934,191 | | Unique Images | 905,940 | | Time Span | 2016–2025 | | File Format | Parquet | ## Geographic Coverage | Boundary | Value | |---|---| | Minimum Longitude | 20.49° E | | Maximum Longitude | 152.32° E | | Minimum Latitude | 38.55° N | | Maximum Latitude | 69.05° N | Coverage spans urban centers and rural routes. Density is higher in populated areas. ## Camera Specifications **Directions** | Direction | Count | |---|---| | Front | 740,079 | | Right | 194,112 | **Resolutions** | Preview Size | Full Size | Count | |---|---|---| | 284×160 | 1920×1080 | 932,171 | | 90×160 | 1080×1920 | 1,886 | | 284×160 | 1536×864 | 77 | | 213×160 | 2016×1512 | 41 | ## Data Structure ### Fields | Field | Type | Description | |---|---|---| |  id  | string | Unique image identifier | |  sourceId  | string | Source device or session identifier | |  heading  | float64 | Camera heading (0–360°) | |  cameraDirection  | string | Mount position ( front  or  right ) | |  timestamp  | string | ISO 8601 capture time | |  imagePreview  | struct | Thumbnail URL and dimensions | |  imageFull  | struct | Full resolution URL and dimensions | |  pos  | array | [longitude, latitude] | |  geometry  | struct | GeoJSON Point geometry | |  similar  | array | Related images at location | |  targetGeometry  | struct | Optional target reference | ### Image URL Schema Two resolution variants per entry: "imagePreview": {"url": "https://...", "width": 284, "height": 160}, "imageFull": {"url": "https://...", "width": 1920, "height": 1080} ### Temporal Links Records reference similar images from other timestamps at the same coordinates. Average of 14.3 links per location. ## Limitations Image URLs may [rot](https://en.wikipedia.org/wiki/Link_rot). Coverage concentrates in urban areas, and historical density varies by location. ## License Research use permitted. Comply with source terms of service and local data regulations.</description>
<size>651707506</size>
</item><item>
<title>Subreddit comments/submissions 2005-06 to 2025-12</title>
<category>Dataset</category>
<infohash>3e3f64dee22dc304cdd2546254ca1f8e8ae542b4</infohash>
<guid>https://academictorrents.com/details/3e3f64dee22dc304cdd2546254ca1f8e8ae542b4</guid>
<link>https://academictorrents.com/details/3e3f64dee22dc304cdd2546254ca1f8e8ae542b4</link>
<description>This is the top 40,000 subreddits from reddit's history in separate files. You can use your torrent client to download only the subreddits you're interested in. These are from the pushshift dumps from 2005-06 to 2025-12, which can be found here: https://academictorrents.com/details/3d426c47c767d40f82c7ef0f47c3acacedd2bf44 These are zstandard compressed ndjson files. Example python scripts for parsing the data can be found here: https://github.com/Watchful1/PushshiftDumps If you have questions, please DM u/Watchful on reddit or respond to this post: https://www.reddit.com/r/pushshift/comments/1r5z42j/comment/o5mjcvn/</description>
<size>3965777405390</size>
</item><item>
<title>Lung Ultrasound Dataset (LUS-Dataset-Katumba)</title>
<category>Dataset</category>
<infohash>e6e9a5594174aaffee53b8f086e3bf86c02c45ad</infohash>
<guid>https://academictorrents.com/details/e6e9a5594174aaffee53b8f086e3bf86c02c45ad</guid>
<link>https://academictorrents.com/details/e6e9a5594174aaffee53b8f086e3bf86c02c45ad</link>
<description>This dataset contains a curated benchmark collection of 1,062 labelled lung ultrasound (LUS) images collected from patients at Mulago National Referral Hospital and Kiruddu Referral Hospital in Kampala, Uganda. The images were acquired and annotated by senior radiologists to support the development and evaluation of artificial intelligence (AI) models for pulmonary disease diagnosis. Each image is categorized into one of three classes: Probably COVID-19 (COVID-19), Diseased Lung but Probably Not COVID-19 (Other Lung Disease), and Healthy Lung. The dataset addresses key challenges in LUS interpretation, including inter-operator variability, low signal-to-noise ratios, and reliance on expert sonographers. It is particularly suitable for training and testing convolutional neural network (CNN)-based models for medical image classification tasks in low-resource settings. The images are provided in standard formats such as PNG or JPEG, with corresponding labels stored in structured files like CSV or JSON to facilitate ease of use in machine learning workflows. In this second version of the dataset, we have extended the resource by including a folder containing the original unprocessed raw data, as well as the scripts used to process, clean, and sort the data into the final labelled set. These additions promote transparency and reproducibility, allowing researchers to understand the full data pipeline and adapt it for their own applications. This resource is intended to advance research in deep learning for lung ultrasound analysis and to contribute toward building more accessible and reliable diagnostic tools in global health.     
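For orientation, loading such a labels file might look like the following sketch. The file name  labels.csv  and its column names are hypothetical, chosen only for illustration; check the actual layout of the release.

```python
# Sketch of loading a structured labels file, assuming a hypothetical
# labels.csv with columns "filename" and "label" (the release may use
# different file names, column names, or JSON instead of CSV).
import csv
from collections import Counter

def load_labels(path):
    """Return a {filename: label} mapping from a CSV labels file."""
    with open(path, newline="", encoding="utf-8") as f:
        return {row["filename"]: row["label"] for row in csv.DictReader(f)}

def class_counts(labels):
    """Count images per class (COVID-19 / Other Lung Disease / Healthy Lung)."""
    return Counter(labels.values())
```

The mapping can then be used to pair each PNG/JPEG image with its class when building training and test splits.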
Katumba, Andrew; Murindanyi, Sudi; Okila, Nixson; Nakatumba-Nabende, Joyce; Mwikirize, Cosmas; Serugunda, Jonathan; Bugeza, Samuel; Oriekot, Anthony; Bosa, Juliet; Nabawanuka, Eva (2025), “A Dataset of Lung Ultrasound Images for Automated AI-based Lung Disease Classification”, Mendeley Data, V2, doi: 10.17632/hb3p34ytvx.2</description>
<size>281447804</size>
</item><item>
<title>Wikipedia Asian languages 2026-02-01</title>
<category>Dataset</category>
<infohash>1bde58b51e4aad60f03ce1b688b691552fb3041e</infohash>
<guid>https://academictorrents.com/details/1bde58b51e4aad60f03ce1b688b691552fb3041e</guid>
<link>https://academictorrents.com/details/1bde58b51e4aad60f03ce1b688b691552fb3041e</link>
<description>Wikipedia database dumps of Asian-language wikis with 10k articles or more. Wikipedia Multistream 2026-02-01. These 85 languages are included: Acehnese, Armenian, Assamese, Azerbaijani, Balinese, Bangla, Banjar, Banyumasan, Bashkir, Bishnupriya, Buginese, Burmese, Cantonese, Cebuano, Central Bikol, Central Kurdish, Chechen, Chinese, Chuvash, Classical Chinese, Dimli, Eastern Mari, Georgian, Gilaki, Gorontalo, Gujarati, Hakka, Hebrew, Hindi, Iloko, Indonesian, Japanese, Javanese, Kannada, Kara-Kalpak, Kazakh, Khmer, Korean, Kurdish, Kyrgyz, Maithili, Malay, Malayalam, Manipuri, Marathi, Mazanderani, Minangkabau, Mindong, Mingrelian, Minnan, Mongolian, Nepali, Newari, Odia, Ossetic, Pampangan, Pashto, Persian, Punjabi, Russian, Sanskrit, Santali, Saraiki, Shan, Sindhi, Sinhala, South Azerbaijani, Sundanese, Tagalog, Tajik, Talysh, Tamil, Tatar, Telugu, Thai, Turkish, Urdu, Uzbek, Vietnamese, Waray, Western Armenian, Western Mari, Western Punjabi, Wu, Yakut.</description>
<size>31208780863</size>
</item><item>
<title>Reddit comments/submissions 2026-01</title>
<category>Dataset</category>
<infohash>8412b89151101d88c915334c45d9c223169a1a60</infohash>
<guid>https://academictorrents.com/details/8412b89151101d88c915334c45d9c223169a1a60</guid>
<link>https://academictorrents.com/details/8412b89151101d88c915334c45d9c223169a1a60</link>
<description>Reddit comments and submissions from 2026-01. Documentation, json schemas and more can be found at https://github.com/ArthurHeitmann/arctic_shift Helper scripts for processing files can be found at https://github.com/Watchful1/PushshiftDumps</description>
<size>61629104259</size>
</item><item>
<title>Begemot.ai Dataset</title>
<category>Dataset</category>
<infohash>3ada9903be4621cf7e34cd5cf44f191b4124ccfe</infohash>
<guid>https://academictorrents.com/details/3ada9903be4621cf7e34cd5cf44f191b4124ccfe</guid>
<link>https://academictorrents.com/details/3ada9903be4621cf7e34cd5cf44f191b4124ccfe</link>
<description># Dataset Card for Begemot.ai ### Dataset Summary This dataset has 2,728,999 educational project descriptions in Russian. They were generated using AI on the Begemot.ai website. The content includes project titles, descriptions, chapters and chapter content on various educational topics. ### Languages The dataset is primarily in Russian (ru). ## Dataset Structure ### Data Fields This dataset includes the following fields: -  id : Unique identifier for the project (integer) -  url : URL of the project page (string) -  title : Title of the educational project (string) -  type : Type of project (string) -  description : Detailed description of the project (string) -  chapters : List of chapter titles (list of strings) -  chapter_content : JSON string mapping chapter titles to their content ### Data Splits All examples are in a single split.</description>
<size>1708074039</size>
</item><item>
<title>OpenPOCUS - Lung Ultrasound Image Database</title>
<category>Dataset</category>
<infohash>63ad0470f43e022cc73407be9c760449d947cb97</infohash>
<guid>https://academictorrents.com/details/63ad0470f43e022cc73407be9c760449d947cb97</guid>
<link>https://academictorrents.com/details/63ad0470f43e022cc73407be9c760449d947cb97</link>
<description>https://i.imgur.com/s0eFv64.png Background Lung ultrasound (LUS) offers advantages over traditional imaging for diagnosing pulmonary conditions, with superior accuracy compared to chest X-ray and similar performance to CT at lower cost. Despite these benefits, widespread adoption is limited by operator dependency, moderate interrater reliability, and training requirements. Deep learning (DL) could potentially address these challenges, but development of effective algorithms is hindered by the scarcity of comprehensive image repositories with proper metadata.</description>
<size>5256546243</size>
</item><item>
<title>Russian Educational Text Collection</title>
<category>Dataset</category>
<infohash>1f6b373346a0fa34de6b4d916984d698e0a623b3</infohash>
<guid>https://academictorrents.com/details/1f6b373346a0fa34de6b4d916984d698e0a623b3</guid>
<link>https://academictorrents.com/details/1f6b373346a0fa34de6b4d916984d698e0a623b3</link>
<description># Dataset Card for Russian Educational Text Collection ### Dataset Summary This dataset contains approximately 1.38M educational texts primarily in Russian with some content in Ukrainian and English. The content is extracted from presentations and documents, including educational presentations, essays, and various academic documents covering diverse topics from natural sciences to literature. ### Languages - Russian (ru) - primary language - Ukrainian (uk) - secondary language - English (en) - secondary language Russian is the predominant language in the dataset, while Ukrainian and English content appears less frequently. ## Dataset Structure ### Data Fields The dataset is split into two parquet files: - presentations (1,335,171 entries): -  title : Title of the presentation (string) -  slide_text : Array of slide contents (list of strings) - documents (47,474 entries): -  title : Title of the document (string) -  document_text : Full text content of the document (string) ## Additional Information ### License This dataset is dedicated to the public domain under the Creative Commons Zero (CC0) license. This means you can: * Use it for any purpose, including commercial projects * Modify it however you like * Distribute it without asking permission No attribution is required, but it's always appreciated!</description>
<size>304218686</size>
</item><item>
<title>Animations Dataset</title>
<category>Dataset</category>
<infohash>8799f1e66bf0a63a77e89a7917fbe281c13bcd9f</infohash>
<guid>https://academictorrents.com/details/8799f1e66bf0a63a77e89a7917fbe281c13bcd9f</guid>
<link>https://academictorrents.com/details/8799f1e66bf0a63a77e89a7917fbe281c13bcd9f</link>
<description># Dataset Card for Animations Dataset ### Dataset Summary This dataset contains 50,849 animations with their associated metadata and source images. Each animation consists of multiple frames composed of simple sketch-level drawings, text elements, and potentially embedded images. The dataset provides complete information about each animation, including frame components, source images, timing between frames, and canvas settings. This makes it suitable for various tasks such as animation analysis, generation, and modification. ### Languages The dataset is primarily monolingual: - English (en): Any text elements within animations are predominantly in English. ## Dataset Structure ### Data Files The dataset is stored as Parquet files with ZSTD compression: -  train-00000.parquet  through  train-00003.parquet  - Total: 4 shards, ~4.2 GB compressed ### Data Fields Each row in the Parquet files contains the following columns: | Column | Type | Description | |---|---|---| |  id  |  string  | Unique identifier (UUID) for the animation | |  settings  |  string  | JSON object containing canvas configuration | |  dtimes  |  list[int64]  | Time delays between frames in milliseconds | |  frames_data  |  string  | JSON array describing each frame's elements | |  images  |  list[binary]  | PNG images used in the animation (decoded bytes) | #### Settings Object The  settings  JSON contains: -  canvas_width ,  canvas_height : Dimensions of the animation canvas -  fillcolor : Background color of the canvas (if specified) -  default_font : Default font used for text elements -  default_font_size : Default font size #### Frames Data Structure The  frames_data  JSON is an array of arrays, where each inner array represents a frame's elements: -  type_for_loader : Element type (e.g., "text", "image") -  data : Object containing element properties: -  type : Element type -  centerx ,  centery : Position coordinates on the canvas -  text : Text content (for text elements) -  font ,  size : Font properties -  rotate_angle ,  angle : Rotation properties -  strokeColor ,  fillColor ,  textColor : Color properties -  src : Index into the  images  array (for image elements) -  children_data : Array of child elements (if any) ### Data Splits | Split | Number of Examples | |---|---| |  train  | 50,849 |</description>
<size>4374676620</size>
</item><item>
<title>Sprite Compositing &amp; Animation Dataset</title>
<category>Dataset</category>
<infohash>df2a3742526f44dac4dbb80299333e84132c5b45</infohash>
<guid>https://academictorrents.com/details/df2a3742526f44dac4dbb80299333e84132c5b45</guid>
<link>https://academictorrents.com/details/df2a3742526f44dac4dbb80299333e84132c5b45</link>
<description># Sprite Compositing &amp; Animation Dataset A diverse dataset of image sequences with their source sprite assets. Contains animations, slideshows, and composited scenes created from transparent PNG sprites with additional effects and overlays. ## Dataset Statistics | Metric | Value | |---|---| | Total animations | 50,849 | | Total frames | 1,191,969 | | Total source sprites | 312,926 | | Avg frames/animation | 23.4 | | Avg sprites/animation | 6.2 | | Frame resolution | 800 x 450-600 | | Sprite resolution | Variable | | Total size | ~27 GB | | Format | Parquet (ZSTD compressed) | ## Schema | Column | Type | Description | |---|---|---| |  id  |  string  | Unique identifier (UUID) | |  source_images  |  list[binary]  | PNG bytes of source sprite assets (RGBA), sorted by index. Can be empty. | |  frames  |  list[binary]  | PNG bytes of final rendered frames (RGB 800x450-600), sorted by sequence order | |  num_sources  |  int64  | Number of source sprites (0 for rows without source assets) | |  num_frames  |  int64  | Number of frames in the sequence | ## Data Structure Each sample contains: 1. **Source Images** ( source_images ): Transparent PNG sprites/assets (RGBA mode) used to compose the final frames. Variable sizes. May include characters, objects, etc. 2. **Frames** ( frames ): Final rendered image sequence (RGB mode, typically 800x450-600). These are the result of compositing source sprites with additional effects such as: - Text overlays - Drawings and sketches - Backgrounds - Animations and transitions - Visual effects ### Content Variations - **Animations**: Smooth frame-by-frame animations of sprites (e.g., character movement) - **Slideshows**: Discrete scene transitions using source assets - **Composited Scenes**: Source sprites combined with text, drawings, and effects - **Sketches**: Hand-drawn or illustrated frames with optional sprite references **Note**: Not all frames are strict animations - many are slideshows or scene compositions where source assets are combined with additional elements.</description>
<size>27130846899</size>
</item><item>
<title>NNTP Discussion Archives</title>
<category>Dataset</category>
<infohash>cac053d01e256ae3001bf40c5c98eefa86cdc870</infohash>
<guid>https://academictorrents.com/details/cac053d01e256ae3001bf40c5c98eefa86cdc870</guid>
<link>https://academictorrents.com/details/cac053d01e256ae3001bf40c5c98eefa86cdc870</link>
<description># NNTP Discussion Archives A large-scale collection of text discussions from public NNTP (Network News Transfer Protocol) newsgroups spanning over two decades. ## Dataset Statistics | Metric | Value | |---|---| | Total messages | 386,629,949 | | Unique newsgroups | 159,345 | | Date range | 2002 - 2026 | | Total size | ~191 GB (compressed) | | File format | Parquet (ZSTD) | | Number of files | 256 | | Average content length | ~1,400 characters | ## Schema | Column | Type | Description | |---|---|---| |  message_id  |  string  | Original message identifier (unchanged) | |  newsgroups  |  string  | Target newsgroup(s), comma-separated if cross-posted | |  author  |  string  | Message author with email addresses redacted as  [email]  | |  subject  |  string  | Subject line | |  date  |  string  | RFC 2822 formatted date string | |  content  |  string  | Message body with email addresses redacted as  [email]  | ## Top Newsgroups by Volume | Newsgroup | Messages | |---|---| | alt.atheism | 5,658,023 | | free.usenet | 4,691,561 | | alt.fan.rush-limbaugh | 4,659,639 | | alt.politics | 3,919,772 | | fr.soc.politique | 3,554,434 | | it.sport.calcio.milan | 2,961,804 | | it.politica | 2,802,687 | | alt.politics.bush | 2,786,316 | | talk.politics.misc | 2,784,668 | | Other (159,336 groups) | 475,430,274 | *Cross-posted messages are counted once per newsgroup, so totals exceed the 386M unique messages.* ## Data Processing **Filtering:** Binary-focused groups (*.binaries.*, *.pictures.*, *.multimedia.*), binary posts with file-sharing indicators, messages exceeding 500KB, and unrecoverable encoding errors are excluded.
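As a rough illustration, the group and size exclusions just described could be expressed as a predicate like the one below. This is a sketch only; the field handling and pattern matching are illustrative, not the pipeline actually used.

```python
# Sketch of the exclusion rules described above (illustrative only; the
# actual processing pipeline for this dataset is not published here).
import fnmatch

BINARY_GROUP_PATTERNS = ["*.binaries.*", "*.pictures.*", "*.multimedia.*"]
MAX_BODY_BYTES = 500 * 1024  # messages exceeding 500KB are excluded

def keep_message(newsgroups: str, body: bytes) -> bool:
    """Return True if a message passes the binary-group and size filters."""
    for group in newsgroups.split(","):
        name = group.strip()
        if any(fnmatch.fnmatch(name, pat) for pat in BINARY_GROUP_PATTERNS):
            return False  # posted to a binary-focused group
    return len(body) <= MAX_BODY_BYTES
```
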
Spam is largely **not filtered**: the dataset includes advertisements, phishing, and low-quality posts present in raw newsgroups. **Encoding:** Messages are normalized to UTF-8 with the following decoding pipeline: - Quoted-Printable: MIME-encoded content decoded to text - Base64: Text base64 content decoded; binary base64 excluded - Legacy encodings: Invalid UTF-8 sequences re-encoded using Windows-1252, ISO-8859-*, KOI8-R, Shift-JIS, GBK, and other detected legacy encodings - MIME encoded-word headers decoded to UTF-8 **Deduplication:** Exact content duplicates removed via xxHash64 hashing (first occurrence retained). **Privacy:** Email addresses in  author  and  content  fields redacted as  [email] ;  message_id  unchanged. ## Considerations - Messages were posted to public newsgroups - Content reflects unmoderated discussions and may contain controversial opinions</description>
<size>204065504201</size>
</item><item>
<title>Reddit comments/submissions 2005-06 to 2025-12</title>
<category>Dataset</category>
<infohash>3d426c47c767d40f82c7ef0f47c3acacedd2bf44</infohash>
<guid>https://academictorrents.com/details/3d426c47c767d40f82c7ef0f47c3acacedd2bf44</guid>
<link>https://academictorrents.com/details/3d426c47c767d40f82c7ef0f47c3acacedd2bf44</link>
<description>Reddit comments and submissions from 2005-06 to 2025-12 collected by pushshift and u/RaiderBDev. These are zstandard compressed ndjson files. Example python scripts for parsing the data can be found here https://github.com/Watchful1/PushshiftDumps The more recent dumps are collected by u/RaiderBDev</description>
<size>3804096351995</size>
</item><item>
<title>RU-OK? Uptime measurements of Russian/Belarusian DDoS targets of IT ARMY</title>
<category>Dataset</category>
<infohash>87b8ba53e3f7d58ac3845f1be81f49682c2c68f2</infohash>
<guid>https://academictorrents.com/details/87b8ba53e3f7d58ac3845f1be81f49682c2c68f2</guid>
<link>https://academictorrents.com/details/87b8ba53e3f7d58ac3845f1be81f49682c2c68f2</link>
<description>In 2022, Russia began a full-scale invasion of Ukraine in the escalating Russo-Ukrainian war. Ukrainian ingenuity quickly led to the creation of a volunteer cyberwarfare organization, [IT Army of Ukraine](https://en.wikipedia.org/wiki/IT_Army_of_Ukraine), which conducted both defensive and offensive operations. Notably, they invited anyone with an internet connection to DDoS an ever-growing list of Russian and Belarusian websites, with the goal of disrupting infrastructure and draining Russia’s own cyberwarfare capabilities. I made a very quick project to assess the status of Russian and Belarusian internet properties (via [RIPE Atlas](https://atlas.ripe.net/)) being targeted by hacktivists. Specifically, I evaluated almost every target listed by the IT ARMY Telegram group with many unique probes between 2022-02-27 (the day after IT ARMY was created) and 2022-05-30 to check for service availability. I wanted to check connectivity from within Russia’s borders because I saw many mixed reports across Twitter and Reddit, with international parties (Americans, Ukrainians, etc.) claiming many sites had been knocked offline, while Russians chimed in that many sites remained online for them. The truth is more complex - some sites were significantly disrupted and took time to recover globally, while others had existing mitigations in place, others seemed to deprioritize or sinkhole international traffic, etc. This research was included in several news articles around the world: * Ukraine’s IT army is doing well, hitting Russia with ‘cost and chaos’ - [VentureBeat](https://venturebeat.com/2022/03/04/ukraines-it-army-is-doing-well-hitting-russia-with-cost-and-chaos/) * Ukraine deserves an IT army.
We have to live with the fallout - [VentureBeat](https://venturebeat.com/2022/03/04/ukraine-deserves-an-it-army-we-have-to-live-with-the-fallout/) * Ukraine: We’ve repelled ‘nonstop’ DDoS attacks from Russia - [VentureBeat](https://venturebeat.com/2022/03/07/ukraine-weve-repelled-nonstop-ddos-attacks-from-russia/) * Guerre en Ukraine : les cyberattaques contre la Russie, le « cri de colère » d’une armée de volontaires - [Le Monde](https://www.lemonde.fr/pixels/article/2022/03/25/guerre-en-ukraine-face-a-la-russie-les-cyberattaques-en-forme-de-cri-de-colere-d-une-armee-de-volontaires_6119064_4408996.html) * Ukraine Demanded Cloudflare Stop Protecting Russians From Cyberattacks. Cloudflare Said No - [Forbes](https://www.forbes.com/sites/thomasbrewster/2022/03/07/cloudflare-rejects-ukraines-call-to-stop-protecting-russians-from-cyberattacks/) The data and methodology for RU-OK was originally published on my GitHub, where I hope it will remain. However, I’ve received the occasional nastygram about this research and recently received a takedown request from a Russian cybersecurity firm, claiming that sensitive information is being stored in my repository. There isn’t, of course, and all the data is public measurements against public endpoints. Still, I’m concerned that fraudulent reports could result in my repo getting deleted, so I’m creating a censorship-resistant copy and distributing it on my blog and on Academic Torrents. It’s long overdue anyway. I encourage anyone curious to take a dig through the data, as you can watch both the immediate impact of DDoS attacks as well as Russian government and company resilience change over several months as these attacks became commonplace.</description>
<size>1609147173</size>
</item><item>
<title>Wikipedia Wikidata 2026-01-01</title>
<category>Dataset</category>
<infohash>91f29e60cc4a65747a346109ef49a48808c6a2cd</infohash>
<guid>https://academictorrents.com/details/91f29e60cc4a65747a346109ef49a48808c6a2cd</guid>
<link>https://academictorrents.com/details/91f29e60cc4a65747a346109ef49a48808c6a2cd</link>
<description>Database dump of the Wikidata wiki. Wikipedia Multistream 2026-01-01.</description>
<size>175980564292</size>
</item><item>
<title>Wikipedia Commons 2026-01-01</title>
<category>Dataset</category>
<infohash>83f2cfd35db16f696000bd3dee56e3837fe3e60c</infohash>
<guid>https://academictorrents.com/details/83f2cfd35db16f696000bd3dee56e3837fe3e60c</guid>
<link>https://academictorrents.com/details/83f2cfd35db16f696000bd3dee56e3837fe3e60c</link>
<description>Database dump of Wikipedia Commons. Wikipedia Multistream 2026-01-01.</description>
<size>106644000008</size>
</item><item>
<title>Reddit comments/submissions 2025-12</title>
<category>Dataset</category>
<infohash>481bf2eac43172ae724fd6c75dbcb8e27de77734</infohash>
<guid>https://academictorrents.com/details/481bf2eac43172ae724fd6c75dbcb8e27de77734</guid>
<link>https://academictorrents.com/details/481bf2eac43172ae724fd6c75dbcb8e27de77734</link>
<description>Reddit comments and submissions from 2025-12 Documentation, json schemas and more can be found at https://github.com/ArthurHeitmann/arctic_shift Helper scripts for processing files can be found at https://github.com/Watchful1/PushshiftDumps</description>
<size>56418645823</size>
</item><item>
<title>GitGud Code Dataset</title>
<category>Dataset</category>
<infohash>221571632238b826f0aa6ec4f370af633575cae4</infohash>
<guid>https://academictorrents.com/details/221571632238b826f0aa6ec4f370af633575cae4</guid>
<link>https://academictorrents.com/details/221571632238b826f0aa6ec4f370af633575cae4</link>
<description># GitGud Code Dataset ## Dataset Description This dataset was compiled from code repositories hosted on [GitGud.io](https://gitgud.io), a GitLab-based code hosting platform. GitGud.io serves as an alternative git hosting service used by various developer communities and open-source projects. ### Dataset Summary | Statistic | Value | |&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;-|&amp;mdash;&amp;mdash;&amp;mdash;-| | **Total Files** | 16,322,315 | | **Total Repositories** | 7,204 | | **Total Size** | 17.46 GB (compressed Parquet) | | **Programming Languages** | 2,185 | | **File Format** | Parquet with Zstd compression (17 files) | ### Key Features - **Diverse code corpus**: Contains code from over 7,000 repositories across various domains - **Wide language coverage**: Spans 2,185 programming languages and file types detected by file extension mapping - **Rich metadata**: Includes repository name, file path, detected language, license information, and file size - **Quality filtered**: Filtering applied to remove binary files, overly long lines, and license files ### Languages The dataset includes 2,185 programming languages and file types. 
The top 30 languages by file count: | Rank | Language | File Count | |&amp;mdash;&amp;mdash;&amp;mdash;|&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;|&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;| | 1 | tw (Twine) | 3,301,366 | | 2 | XML | 3,281,566 | | 3 | svg | 1,744,500 | | 4 | C# | 1,367,799 | | 5 | JavaScript | 1,252,710 | | 6 | C++ | 731,619 | | 7 | erb | 710,279 | | 8 | JSON | 398,139 | | 9 | Text | 377,948 | | 10 | twee | 300,576 | | 11 | csv | 205,230 | | 12 | HTML | 170,711 | | 13 | Markdown | 160,735 | | 14 | TypeScript | 147,173 | | 15 | Lua | 117,079 | | 16 | PHP | 116,059 | | 17 | none | 111,791 | | 18 | pal | 110,626 | | 19 | CSS | 108,664 | | 20 | Python | 106,261 | | 21 | dm | 98,333 | | 22 | Ruby | 93,685 | | 23 | _comment | 91,730 | | 24 | Java | 81,190 | | 25 | YAML | 63,289 | | 26 | ActionScript | 62,210 | | 27 | Git | 43,748 | | 28 | mdwn | 42,654 | | 29 | mk | 41,789 | | 30 | INI | 39,760 | ### Licenses The dataset includes files from repositories with various licenses: | License | File Count | |&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;-|&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;| | mit | 9,517,343 | | bsd-3-clause | 3,315,732 | | unknown | 2,935,736 | | mpl-2.0 | 338,040 | | gpl-2.0 | 79,415 | | lgpl-2.1 | 38,429 | | gpl-3.0 | 25,964 | | apache-2.0 | 20,562 | | cc-by-4.0 | 18,703 | | agpl-3.0 | 15,367 | | cc-by-nc-4.0 | 6,362 | | wtfpl | 6,163 | | bsd-2-clause | 3,749 | | zlib | 482 | | unlicense | 261 | | cc-by-sa-4.0 | 7 | ## Dataset Structure ### Data Fields | Field | Type | Description | |&amp;mdash;&amp;mdash;&amp;mdash;-|&amp;mdash;&amp;mdash;&amp;mdash;|&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;-| |  code  | string | Content of the source file (UTF-8 encoded) | |  repo_name  | string | Name of the GitGud repository (format:  username/repo ) | |  path  | string | Path of the file within the repository (relative to repo root) | |  language  | string | 
Programming language detected by file extension mapping | |  license  | string | License of the repository (SPDX identifier or "unknown") | |  size  | int64 | Size of the source file in bytes | ### Data Format - **Format**: Apache Parquet with Zstd compression (level 19) - **File Structure**: 17 files ( gitgud-00000.parquet  to  gitgud-00016.parquet ) - **Rows per shard**: ~1,000,000 (except last shard: 322,315) ### Data Splits All examples are in the train split. There is no validation or test split. ### Example Data Point       code :  using System;\nusing System.Collections.Generic;\n... ,  repo_name :  username/game-mod ,  path :  src/GameMod/Player.cs ,  language :  C# ,  license :  mit ,  size : 2048      ## Dataset Creation ### Pipeline Overview The dataset was created through a multi-stage pipeline: 1. **Repository Discovery**: Scraping public repository URLs from GitGud.io's GitLab API v4 endpoint using multiple sort orderings ( id ,  name ,  path ,  updated_at ,  star_count ,  last_activity_at ,  similarity ) 2. **Branch Enumeration**: Fetching all branches for each repository via the GitLab API 3. **Archive Download**: Downloading  .tar.gz  archives for each repository/branch combination 4. **Content Extraction**: Extracting and filtering source code files from archives 5. **Parquet Generation**: Writing filtered records to Parquet shards with Zstd compression ### Language Detection Programming languages are detected using file extension mapping. The pipeline maps ~80 programming languages by their file extensions, including: - **Major languages**: Python, JavaScript, TypeScript, C, C++, C#, Java, Go, Rust, Ruby, PHP - **Configuration**: JSON, YAML, TOML, XML, INI - **Markup**: HTML, CSS, Markdown, LaTeX - **Game development**: GLSL, HLSL, GDScript - **And many more** Files with unrecognized extensions are labeled with the extension itself (without the dot prefix). 
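As a rough illustration of this labeling rule (a minimal sketch with a hypothetical mapping table, not the pipeline's actual code):

```python
# Hypothetical excerpt of the extension-to-language table;
# the real pipeline maps ~80 languages this way.
EXT_MAP = {"py": "Python", "js": "JavaScript", "cs": "C#", "cpp": "C++", "rs": "Rust"}

def detect_language(path: str) -> str:
    """Label a file by its extension, keeping the bare extension for unknown types."""
    name = path.rsplit("/", 1)[-1]
    if "." not in name:
        return "none"  # extensionless files (special names like Makefile handled separately)
    ext = name.rsplit(".", 1)[-1].lower()
    return EXT_MAP.get(ext, ext)  # unmapped extensions become the label themselves

print(detect_language("src/GameMod/Player.cs"))  # C#
print(detect_language("story.tw"))               # tw
```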
Files without extensions are labeled as "none" or by special filename matching (e.g., "Dockerfile", "Makefile"). ### License Detection Licenses are detected by: 1. Scanning for license files ( LICENSE ,  LICENSE.txt ,  LICENSE.md ,  COPYING ,  COPYING.txt ,  COPYING.md ) 2. Matching license text against known patterns (MIT, Apache 2.0, GPL variants, BSD, Creative Commons, MPL, ISC, Unlicense, Artistic, WTFPL, Zlib, etc.) 3. Defaulting to "unknown" if no license can be detected ### File Filtering Filtering is applied to ensure data quality: #### Size Limits | Limit | Value | |&amp;mdash;&amp;mdash;&amp;mdash;-|&amp;mdash;&amp;mdash;&amp;mdash;-| | Max repository archive size | 64 MB | | Max line length | 1,000 characters | #### Content Filtering - **Binary Detection**: Files with null bytes in the first 1KB are excluded - **UTF-8 Validation**: Files must be decodable as UTF-8 (with fallback to latin-1, cp1252, iso-8859-1) - **Long Lines**: Files with any line exceeding 1,000 characters are excluded - **License Files**: License files (LICENSE, COPYING, etc.) are excluded from the dataset (but used for license detection) ### Source Data All data originates from public repositories hosted on [GitGud.io](https://gitgud.io). ## Considerations for Using the Data ### Personal and Sensitive Information The dataset may contain: - Email addresses in code comments or configuration files - API keys or credentials that were accidentally committed - Personal information in comments or documentation Users should exercise caution and implement appropriate filtering when using this data. ### Licensing Information This dataset is a collection of source code from repositories with various licenses. Any use of all or part of the code gathered in this dataset must abide by the terms of the original licenses, including attribution clauses when relevant. The license field in each data point indicates the license of the source repository.</description>
<size>18751337953</size>
</item><item>
<title>Mos.Hub Code Dataset</title>
<category>Dataset</category>
<infohash>991f0d7eaa11bfda7f08e9bd82466458982cd430</infohash>
<guid>https://academictorrents.com/details/991f0d7eaa11bfda7f08e9bd82466458982cd430</guid>
<link>https://academictorrents.com/details/991f0d7eaa11bfda7f08e9bd82466458982cd430</link>
<description># Mos.Hub Code Dataset ## Dataset Description This dataset was compiled from code repositories hosted on [Mos.Hub](https://hub.mos.ru) (hub.mos.ru), a code hosting platform operated by the Moscow Government. Mos.Hub is a service for storing and working with source code, based on the Git version control system, primarily used by Russian developers and government-related projects. ### Dataset Summary | Statistic | Value | |---|---| | **Total Files** | 15,740,580 | | **Total Repositories** | 16,130 | | **Total Size** | 529 MB (compressed Parquet) | | **Uncompressed Size** | ~29 GB | | **Programming Languages** | 297 | | **File Format** | Parquet (single file) | ### Key Features - **Russian code corpus**: Contains code from repositories hosted on Moscow's official code platform, featuring Russian comments and documentation - **Diverse language coverage**: Spans 297 programming languages identified by [github-linguist](https://github.com/github-linguist/linguist) - **Quality filtered**: Binary files and low-quality content have been removed ### Languages The dataset includes 297 programming languages. 
The top 30 languages by file count: | Rank | Language | File Count | |&amp;mdash;&amp;mdash;&amp;mdash;|&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;|&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;| | 1 | Ruby | 8,333,731 | | 2 | JavaScript | 1,786,730 | | 3 | YAML | 1,757,614 | | 4 | Vue | 699,171 | | 5 | Markdown | 639,585 | | 6 | Haml | 538,837 | | 7 | GraphQL | 269,485 | | 8 | JSON | 214,354 | | 9 | PHP | 191,150 | | 10 | SVG | 172,884 | | 11 | Shell | 172,451 | | 12 | Go | 88,089 | | 13 | Ignore List | 87,432 | | 14 | SCSS | 80,716 | | 15 | Python | 77,532 | | 16 | C++ | 63,177 | | 17 | HTML+ERB | 62,605 | | 18 | Text | 48,400 | | 19 | Jest Snapshot | 43,638 | | 20 | HTML | 42,489 | | 21 | C | 38,354 | | 22 | reStructuredText | 26,342 | | 23 | Rust | 24,818 | | 24 | E-mail | 23,993 | | 25 | XML | 22,715 | | 26 | Java | 14,807 | | 27 | Gettext Catalog | 14,429 | | 28 | C# | 13,405 | | 29 | CSS | 12,657 | | 30 | Protocol Buffer Text Format | 12,181 | ## Dataset Structure ### Data Fields | Field | Type | Description | |&amp;mdash;&amp;mdash;&amp;mdash;-|&amp;mdash;&amp;mdash;&amp;mdash;|&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;-| |  file_text  | string | Content of the source file (UTF-8 encoded) | |  language  | string | Programming language as identified by [github-linguist](https://github.com/github-linguist/linguist) | |  file_name  | string | Name of the source file | ### Data Format - **Format**: Apache Parquet - **File Structure**: Single file ( data.parquet ) ### Data Splits All examples are in the train split. There is no validation or test split. ### Example Data Point       file_text :  package mainnnimport "fmt"nnfunc main() n    fmt.Println("Hello")nn ,  language :  Go ,  file_name :  main.go       ## Dataset Creation ### Source Data All data originates from public repositories hosted on [Mos.Hub](https://hub.mos.ru). 
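To give a feel for the schema, a minimal usage sketch (assuming pandas; the rows below are mocked stand-ins for illustration, and on the real dump you would load  data.parquet  directly):

```python
import pandas as pd

# On the real dump: df = pd.read_parquet("data.parquet")
# Mocked rows mirroring the dataset's three fields: file_text, language, file_name.
df = pd.DataFrame(
    {
        "file_text": ['package main\n\nimport "fmt"\n', "puts 'hello'\n", "console.log('hi');\n"],
        "language": ["Go", "Ruby", "JavaScript"],
        "file_name": ["main.go", "hello.rb", "app.js"],
    }
)

# Typical query: select one language and measure how much text it contributes.
ruby = df[df["language"] == "Ruby"]
total_chars = int(ruby["file_text"].str.len().sum())
print(len(ruby), total_chars)  # 1 13
```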
### Language Detection Programming languages are detected using [github-linguist](https://github.com/github-linguist/linguist), GitHub's library for detecting programming languages. ### Filtering - **Deduplication**: The dataset has been deduplicated to ensure unique code files - **Binary Files**: Binary files have been removed from the dataset - **UTF-8 Validation**: Files must be valid UTF-8 encoded text ## Considerations for Using the Data ### Personal and Sensitive Information The dataset may contain: - Email addresses in code comments or configuration files - API keys or credentials that were accidentally committed - Personal information in comments or documentation Users should exercise caution and implement appropriate filtering when using this data. ### Licensing Information This dataset has been compiled with an analysis of the licenses used in the repositories to ensure ethical collection and use of the data. Users of this dataset should respect the rights of the authors and use the data responsibly.</description>
<size>554021034</size>
</item><item>
<title>Google Code Archive Dataset</title>
<category>Dataset</category>
<infohash>a342da363792ac5fa018039d5a57c81be74e4b52</infohash>
<guid>https://academictorrents.com/details/a342da363792ac5fa018039d5a57c81be74e4b52</guid>
<link>https://academictorrents.com/details/a342da363792ac5fa018039d5a57c81be74e4b52</link>
<description>## Dataset Description This dataset was compiled from the [Google Code Archive](https://code.google.com/archive/), a preserved snapshot of projects hosted on Google Code, Google's open-source project hosting service that operated from 2006 to 2016. Google Code was one of the major code hosting platforms of its era, hosting hundreds of thousands of open-source projects before its shutdown. The archive provides a unique historical record of open-source development during a formative period of modern software engineering. ### Dataset Summary | Statistic | Value | |---|---| | **Total Files** | 65,825,565 | | **Total Repositories** | 488,618 | | **Total Size** | 47 GB (compressed Parquet) | | **Programming Languages** | 454 | | **File Format** | Parquet with Zstd compression (71 files) | ### Key Features - **Historical open-source corpus**: Contains code from over 488K repositories hosted on Google Code during 2006-2016 - **Diverse language coverage**: Spans 454 programming languages identified by [go-enry](https://github.com/go-enry/go-enry) (based on GitHub Linguist rules) - **Rich metadata**: Includes repository name, file path, detected language, license information, and file size - **Quality filtered**: Extensive filtering to remove vendor code, build artifacts, generated files, and low-quality content - **Era-specific patterns**: Captures coding conventions and library usage from an earlier era of software development ### Languages The dataset includes 454 programming languages. 
The top 30 languages by file count: | Rank | Language | File Count | |&amp;mdash;&amp;mdash;&amp;mdash;|&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;|&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;| | 1 | Java | 16,331,993 | | 2 | PHP | 12,764,574 | | 3 | HTML | 5,705,184 | | 4 | C++ | 5,090,685 | | 5 | JavaScript | 4,937,765 | | 6 | C | 4,179,202 | | 7 | C# | 3,872,245 | | 8 | Python | 2,207,240 | | 9 | CSS | 1,697,385 | | 10 | Objective-C | 1,186,050 | | 11 | Shell | 639,183 | | 12 | Java Server Pages | 541,498 | | 13 | ActionScript | 540,557 | | 14 | Makefile | 481,563 | | 15 | ASP.NET | 381,389 | | 16 | Smarty | 339,555 | | 17 | Ruby | 331,743 | | 18 | Go | 316,427 | | 19 | Perl | 307,960 | | 20 | Vim Script | 216,236 | | 21 | Lua | 215,226 | | 22 | HTML+PHP | 150,781 | | 23 | HTML+Razor | 149,131 | | 24 | MATLAB | 145,686 | | 25 | Batchfile | 138,523 | | 26 | Pascal | 135,992 | | 27 | Visual Basic .NET | 118,732 | | 28 | TeX | 110,379 | | 29 | Less | 98,221 | | 30 | Unix Assembly | 94,758 | ### Licenses The dataset includes files from repositories with various licenses as specified in the Google Code Archive: | License | File Count | |&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;-|&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;| | Apache License 2.0 (asf20) | 21,568,143 | | GNU GPL v3 (gpl3) | 14,843,470 | | GNU GPL v2 (gpl2) | 6,824,185 | | Other Open Source (oos) | 5,433,436 | | MIT License (mit) | 4,754,567 | | GNU LGPL (lgpl) | 4,073,137 | | BSD License (bsd) | 3,787,348 | | Artistic License (art) | 1,910,047 | | Eclipse Public License (epl) | 1,587,289 | | Mozilla Public License 1.1 (mpl11) | 580,102 | | Multiple Licenses (multiple) | 372,457 | | Google Summer of Code (gsoc) | 63,292 | | Public Domain (publicdomain) | 28,092 | ## Dataset Structure ### Data Fields | Field | Type | Description | 
|&amp;mdash;&amp;mdash;&amp;mdash;-|&amp;mdash;&amp;mdash;&amp;mdash;|&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;-| |  code  | string | Content of the source file (UTF-8 encoded) | |  repo_name  | string | Name of the Google Code project | |  path  | string | Path of the file within the repository (relative to repo root) | |  language  | string | Programming language as identified by [go-enry](https://github.com/go-enry/go-enry) | |  license  | string | License of the repository (Google Code license identifier) | |  size  | int64 | Size of the source file in bytes | ### Data Format - **Format**: Apache Parquet with Zstd compression - **File Structure**: 71 files ( google_code_0000.parquet  to  google_code_0070.parquet ) ### Data Splits All examples are in the train split. There is no validation or test split. ### Example Data Point</description>
<size>50126651493</size>
</item><item>
<title>NotaBug Code Dataset</title>
<category>Dataset</category>
<infohash>ba10507193e4169b1b2420fb81e6b4999840e0f5</infohash>
<guid>https://academictorrents.com/details/ba10507193e4169b1b2420fb81e6b4999840e0f5</guid>
<link>https://academictorrents.com/details/ba10507193e4169b1b2420fb81e6b4999840e0f5</link>
<description># NotaBug Code Dataset ## Dataset Description This dataset was compiled from code repositories hosted on [NotaBug.org](https://notabug.org), a free code hosting platform that emphasizes software freedom and privacy. NotaBug is built on a fully free software stack and is popular among free software advocates and privacy-conscious developers. ### Dataset Summary | Statistic | Value | |&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;-|&amp;mdash;&amp;mdash;&amp;mdash;-| | **Total Files** | 12,622,961 | | **Total Repositories** | 11,660 | | **Total Size** | 12 GB (compressed Parquet) | | **Programming Languages** | 6,306 (by file extension) | | **File Format** | Parquet with Zstd compression (12 files) | ### Key Features - **Free software focused corpus**: Contains code from repositories on a platform dedicated to software freedom - **Diverse language coverage**: Spans thousands of file types identified by file extension - **Rich metadata**: Includes repository name, file path, detected language, license information, and file size ### Languages The dataset includes files from many programming languages and file types. Languages are detected by file extension. 
The top 30 languages by file count: | Rank | Language | File Count | |&amp;mdash;&amp;mdash;&amp;mdash;|&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;|&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;| | 1 | C++ | 2,219,208 | | 2 | po | 2,022,441 | | 3 | none | 1,572,451 | | 4 | PHP | 951,354 | | 5 | patch | 637,317 | | 6 | svg | 547,170 | | 7 | XML | 502,139 | | 8 | Python | 392,476 | | 9 | Text | 296,953 | | 10 | JavaScript | 233,368 | | 11 | JSON | 198,981 | | 12 | Scheme | 192,409 | | 13 | Markdown | 182,342 | | 14 | info | 155,078 | | 15 | slackbuild | 154,859 | | 16 | HTML | 149,824 | | 17 | Shell | 133,325 | | 18 | log | 127,393 | | 19 | Makefile | 112,989 | | 20 | INI | 110,537 | | 21 | Lua | 84,303 | | 22 | in | 75,138 | | 23 | Assembly | 74,519 | | 24 | list | 58,346 | | 25 | Java | 48,781 | | 26 | CSS | 48,112 | | 27 | mk | 47,373 | | 28 | dtsi | 43,825 | | 29 | diff | 42,125 | | 30 | el | 41,017 | ### Licenses The dataset includes files from repositories with various licenses: | License | File Count | |&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;-|&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;| | mit | 10,029,349 | | mpl-2.0 | 1,178,420 | | unknown | 888,840 | | gpl-2.0 | 333,538 | | gpl-3.0 | 158,975 | | unlicense | 11,805 | | cc-by-4.0 | 8,367 | | bsd-2-clause | 4,718 | | agpl-3.0 | 3,055 | | cc-by-sa-4.0 | 2,309 | | wtfpl | 1,314 | | cc0-1.0 | 1,188 | | bsd-3-clause | 601 | | cc-by-nc-4.0 | 269 | | lgpl-3.0 | 137 | | lgpl-2.1 | 76 | ## Dataset Structure ### Data Fields | Field | Type | Description | |&amp;mdash;&amp;mdash;&amp;mdash;-|&amp;mdash;&amp;mdash;&amp;mdash;|&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;-| |  code  | string | Content of the source file (UTF-8 encoded) | |  repo_name  | string | Name of the NotaBug repository (format:  username/repo ) | |  path  | string | Path of the file within the repository (relative to repo root) | |  language  | string | Programming 
language as inferred by file extension | |  license  | string | License of the repository (SPDX identifier or "unknown") | |  size  | int64 | Size of the source file in bytes | ### Data Format - **Format**: Apache Parquet with Zstd compression - **File Structure**: 12 files ( notabug_0000.parquet  to  notabug_0011.parquet ) ### Data Splits All examples are in the train split. There is no validation or test split. ### Example Data Point       code :  #!/usr/bin/env python2n# -*- coding: utf-8 -*-n# Copyright (C) 2014... ,  repo_name :  intermsofthewhole/libreboot ,  path :  resources/utilities/i945gpu/intel-regs.py ,  language :  Python ,  license :  mit ,  size : 3733      ## Dataset Creation ### Source Data All data originates from public repositories hosted on [NotaBug.org](https://notabug.org). ### Language Detection Programming languages are detected by file extension inference. ### License Detection Licenses are detected by scanning for license files in repositories and matching against known license patterns. Repositories without a detectable license are marked as "unknown". ### File Filtering - **Long Lines**: Files with any line exceeding 1,000 characters were excluded - **Deduplication**: No deduplication was performed on the dataset ## Considerations for Using the Data ### Personal and Sensitive Information The dataset may contain: - Email addresses in code comments or configuration files - API keys or credentials that were accidentally committed - Personal information in comments or documentation Users should exercise caution and implement appropriate filtering when using this data. ### Licensing Information This dataset is a collection of source code from repositories with various licenses. Any use of all or part of the code gathered in this dataset must abide by the terms of the original licenses, including attribution clauses when relevant. The license field in each data point indicates the license of the source repository.</description>
<size>12602550935</size>
</item><item>
<title>JihuLab Code Dataset</title>
<category>Dataset</category>
<infohash>004c9565d325cf8951aa23d3a56ebb806898fc7f</infohash>
<guid>https://academictorrents.com/details/004c9565d325cf8951aa23d3a56ebb806898fc7f</guid>
<link>https://academictorrents.com/details/004c9565d325cf8951aa23d3a56ebb806898fc7f</link>
<description># JihuLab Code Dataset ## Dataset Description This dataset was compiled from code repositories hosted on [JihuLab](https://jihulab.com), a GitLab-based code hosting platform operated by JiHu (GitLab's Chinese joint venture). JihuLab serves as the primary GitLab instance for Chinese developers and enterprises, offering localized services and compliance with Chinese regulations. This dataset is particularly valuable for training code models with Chinese language understanding and enterprise-level coding practices. ### Dataset Summary | Statistic | Value | |---|---| | **Total Files** | 1,853,253 | | **Total Repositories** | 11,589 | | **Total Size** | 1.5 GB (compressed Parquet) / 12.76 GB (uncompressed) | | **Programming Languages** | 304 | | **File Format** | Parquet with Zstd compression | ### Key Features - **Chinese developer ecosystem**: Contains code from JihuLab, GitLab's official Chinese distribution, featuring Chinese comments, documentation, and variable names - **Diverse language coverage**: Spans 304 programming languages identified by [go-enry](https://github.com/go-enry/go-enry) (based on GitHub Linguist rules) - **Rich metadata**: Includes repository name, file path, detected language, license information, and file size - **Enterprise and open-source projects**: Includes code from both individual developers and Chinese enterprises using GitLab - **Quality filtered**: Extensive filtering to remove vendor code, build artifacts, generated files, and low-quality content ### Languages The dataset includes 304 programming languages. 
The top 30 languages by file count: | Rank | Language | File Count | |&amp;mdash;&amp;mdash;&amp;mdash;|&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;|&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;| | 1 | Java | 348,517 | | 2 | C | 209,924 | | 3 | JavaScript | 191,164 | | 4 | Python | 172,798 | | 5 | C++ | 136,046 | | 6 | Go | 80,000 | | 7 | TypeScript | 79,067 | | 8 | HTML | 69,173 | | 9 | C# | 64,511 | | 10 | Rust | 50,515 | | 11 | Shell | 43,352 | | 12 | Vue | 40,687 | | 13 | TSX | 36,844 | | 14 | CSS | 34,779 | | 15 | Makefile | 26,227 | | 16 | Ruby | 25,812 | | 17 | PHP | 21,401 | | 18 | CMake | 15,292 | | 19 | Kotlin | 14,220 | | 20 | BitBake | 13,060 | | 21 | SCSS | 10,957 | | 22 | Scala | 9,333 | | 23 | Dart | 9,125 | | 24 | Lua | 7,413 | | 25 | ASP.NET | 7,005 | | 26 | Vim Script | 5,710 | | 27 | Unix Assembly | 5,239 | | 28 | Starlark | 5,134 | | 29 | Objective-C | 4,931 | | 30 | Factor | 4,920 | ### Licenses The dataset includes files from repositories with various licenses. 
Repositories with restrictive licenses (CC-BY-ND variants, Commons Clause, SSPL) were excluded: | License | File Count | |&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;-|&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;| | apache-2.0 | 551,008 | | unknown | 535,320 | | mit | 320,834 | | agpl-3.0 | 169,922 | | gpl-2.0 | 112,829 | | bsd | 65,104 | | cc0-1.0 | 13,557 | | lgpl-3.0 | 12,871 | | lgpl-2.1 | 9,960 | | bsd-3-clause | 9,109 | | bsl-1.1 | 8,972 | | epl-1.0 | 7,494 | | gpl-3.0 | 7,476 | | unlicense | 6,265 | | cc-by-3.0 | 4,717 | | cc-by-nc-sa | 4,339 | | mpl-2.0 | 3,847 | | cc-by-4.0 | 2,459 | | cc-by-nc-sa-4.0 | 1,715 | | cc-by-sa-4.0 | 1,701 | | bsd-2-clause | 1,599 | | cc-by-nc-nd-4.0 | 1,222 | | isc | 520 | | wtfpl | 274 | | cc-by-nc-4.0 | 122 | | cc-by-sa | 13 | | cc-by-sa-3.0 | 4 | ## Dataset Structure ### Data Fields | Field | Type | Description | |&amp;mdash;&amp;mdash;&amp;mdash;-|&amp;mdash;&amp;mdash;&amp;mdash;|&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;-| |  code  | string | Content of the source file (UTF-8 encoded) | |  repo_name  | string | Name of the JihuLab repository (format:  username/repo  or  group/subgroup/repo ) | |  path  | string | Path of the file within the repository (relative to repo root) | |  language  | string | Programming language as identified by [go-enry](https://github.com/go-enry/go-enry) | |  license  | string | License of the repository (SPDX identifier or "unknown") | |  size  | int64 | Size of the source file in bytes | ### Data Format - **Format**: Apache Parquet with Zstd compression - **File Structure**: Single consolidated file ( data.parquet ) ### Data Splits All examples are in the train split. There is no validation or test split. ### Example Data Point       code :  package com.example.demo;nnimport org.springframework.boot.*;nimport org.springframework.boot.autoconfigure.*;n... 
,  repo_name :  SmallQ/demo ,  path :  src/main/java/com/example/demo/DemoApplication.java ,  language :  Java ,  license :  unknown ,  size : 400      ## Dataset Creation ### Pipeline Overview The dataset was created through a multi-stage pipeline: 1. **Repository Discovery**: Paginated API requests to JihuLab's GitLab API ( /api/v4/projects ) to enumerate public repositories 2. **Branch Selection**: Using the repository's default branch (typically  main  or  master ) 3. **Repository Downloading**: Downloading repository archives via JihuLab's archive endpoint 4. **Content Extraction**: Extracting and filtering source code files 5. **Parquet Generation**: Writing filtered records to Parquet with Zstd compression ### Language Detection Programming languages are detected using [go-enry](https://github.com/go-enry/go-enry), a Go port of GitHub's Linguist library. Only files classified as **Programming** or **Markup** language types are included (Data and Prose types are excluded). ### License Detection Licenses are detected by: 1. Scanning for license files ( LICENSE ,  LICENSE.txt ,  LICENSE.md ,  COPYING , etc.) 2. Matching license text against known patterns (MIT, Apache 2.0, GPL variants, BSD, Creative Commons, etc.) 3. 
Defaulting to "unknown" if no license can be detected

**Blocked Licenses**: The following restrictive licenses are excluded from the dataset:
- `cc-by-nd`, `cc-by-nd-2.0`, `cc-by-nd-3.0`, `cc-by-nd-4.0` (Creative Commons No-Derivatives)
- `commons-clause`
- `sspl`, `sspl-1.0` (Server Side Public License)

### File Filtering

Extensive filtering is applied to ensure data quality:

#### Size Limits

| Limit | Value |
|-------|-------|
| Max repository ZIP size | 48 MB |
| Max single file size | 1 MB |
| Max line length | 1,000 characters |

#### Excluded Directories

- **Configuration**: `.git/`, `.github/`, `.gitlab/`, `.vscode/`, `.idea/`, `.vs/`, `.settings/`, `.eclipse/`, `.project/`, `.metadata/`
- **Vendor/Dependencies**: `node_modules/`, `bower_components/`, `jspm_packages/`, `vendor/`, `third_party/`, `3rdparty/`, `external/`, `packages/`, `deps/`, `lib/vendor/`, `target/dependency/`, `Pods/`
- **Build Output**: `build/`, `dist/`, `out/`, `bin/`, `target/`, `release/`, `debug/`, `.next/`, `.nuxt/`, `_site/`, `_build/`, `__pycache__/`, `.pytest_cache/`, `cmake-build-*`, `.gradle/`, `.maven/`

#### Excluded Files

- **Lock Files**: `package-lock.json`, `yarn.lock`, `pnpm-lock.yaml`, `Gemfile.lock`, `Cargo.lock`, `poetry.lock`, `Pipfile.lock`, `composer.lock`, `go.sum`, `mix.lock`
- **Minified Files**: Any file containing `.min.` in the name
- **Binary Files**: `.exe`, `.dll`, `.so`, `.dylib`, `.a`, `.lib`, `.o`, `.obj`, `.jar`, `.war`, `.ear`, `.class`, `.pyc`, `.pyo`, `.wasm`, `.bin`, `.dat`, `.pdf`, `.doc`, `.docx`, `.xls`, `.xlsx`, `.ppt`, `.pptx`, `.zip`, `.tar`, `.gz`, `.bz2`, `.7z`, `.rar`, `.jpg`, `.jpeg`, `.png`, `.gif`, `.bmp`, `.ico`, `.svg`, `.mp3`, `.mp4`, `.avi`, `.mov`, `.wav`, `.flac`, `.ttf`, `.otf`, `.woff`, `.woff2`, `.eot`
- **System Files**: `.DS_Store`, `Thumbs.db`

#### Content Filtering

- **UTF-8 Validation**: Files must be valid UTF-8 encoded text
- **Binary Detection**: Files detected as binary by go-enry are excluded
- **Generated Files**: Files with generation markers in the first 500 bytes are excluded: `generated by`, `do not edit`, `auto-generated`, `autogenerated`, `automatically generated`, `code generator`, `generated code`, `this file is generated`, `@generated`, `&lt;auto-generated`
- **Empty Files**: Files that are empty or contain only whitespace are excluded
- **Long Lines**: Files with any line exceeding 1,000 characters are excluded
- **go-enry Filters**: Additional filtering using go-enry's `IsVendor()`, `IsImage()`, `IsDotFile()`, `IsTest()`, and `IsGenerated()` functions
- **Documentation-only Repos**: Repositories containing only documentation files (no actual code) are skipped

### Source Data

All data originates from public repositories hosted on [JihuLab](https://jihulab.com).

## Considerations for Using the Data

### Personal and Sensitive Information

The dataset may contain:
- Email addresses in code comments or configuration files
- API keys or credentials that were accidentally committed
- Personal information in comments or documentation

Users should exercise caution and implement appropriate filtering when using this data.

### Licensing Information

This dataset is a collection of source code from repositories with various licenses. Any use of all or part of the code gathered in this dataset must abide by the terms of the original licenses, including attribution clauses when relevant. The `license` field in each data point indicates the license of the source repository.</description>
<size>1506303268</size>
</item><item>
<title>GitVerse Code Dataset</title>
<category>Dataset</category>
<infohash>8d4375542da8412f19d7d867413e3c29eeb6f4a0</infohash>
<guid>https://academictorrents.com/details/8d4375542da8412f19d7d867413e3c29eeb6f4a0</guid>
<link>https://academictorrents.com/details/8d4375542da8412f19d7d867413e3c29eeb6f4a0</link>
<description># GitVerse Code Dataset

## Dataset Description

This dataset was compiled from code repositories hosted on [GitVerse](https://gitverse.ru), a Russian code hosting platform and an alternative to GitHub in the Russian developer community. GitVerse is used by Russian developers, enterprises, and open-source projects, making this dataset particularly valuable for training code models with Russian language understanding and Russian coding conventions.

### Dataset Summary

| Statistic | Value |
|-----------|-------|
| **Total Files** | 2,802,994 |
| **Total Repositories** | 9,014 |
| **Total Size** | 2 GB (compressed Parquet) |
| **Programming Languages** | 416 |
| **File Format** | Parquet (single file) |

### Key Features

- **Russian code corpus**: Contains code from over 9,000 repositories, many featuring Russian comments, documentation, and variable names
- **Diverse language coverage**: Spans 416 programming languages identified by [github-linguist](https://github.com/github-linguist/linguist)

### Languages

The dataset includes 416 programming languages.
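The per-language file counts can be recomputed directly from `data.parquet`. A minimal sketch, assuming pandas with a parquet engine is installed; the small in-memory frame here stands in for the real shard:

```python
import pandas as pd

# In practice, load the dataset shard directly:
#   df = pd.read_parquet("data.parquet")
# A tiny in-memory sample with the same three fields stands in here.
df = pd.DataFrame({
    "file_text": ["int main(void) { return 0; }", "print('hi')", "int x = 1;"],
    "language": ["C", "Python", "C"],
    "file_name": ["a.c", "b.py", "c.c"],
})

# Per-language file counts, descending; run over the full dataset,
# this is how a ranking like the one below is produced.
counts = df["language"].value_counts()
print(counts.head(2))
```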
The top 30 languages by file count:

| Rank | Language | File Count |
|------|----------|------------|
| 1 | C | 580,713 |
| 2 | JavaScript | 275,744 |
| 3 | C++ | 197,896 |
| 4 | Shell | 166,527 |
| 5 | Python | 116,065 |
| 6 | Markdown | 112,811 |
| 7 | TypeScript | 107,867 |
| 8 | Java | 88,429 |
| 9 | PHP | 80,341 |
| 10 | Makefile | 77,619 |
| 11 | XML | 75,320 |
| 12 | Go | 69,155 |
| 13 | C# | 68,185 |
| 14 | Text | 65,677 |
| 15 | JSON | 64,253 |
| 16 | SVG | 58,107 |
| 17 | HTML | 43,261 |
| 18 | YAML | 40,178 |
| 19 | Unity3D Asset | 33,917 |
| 20 | Rust | 32,872 |
| 21 | LLVM | 29,819 |
| 22 | Unix Assembly | 27,672 |
| 23 | Roff | 25,884 |
| 24 | CSS | 21,809 |
| 25 | TSX | 21,637 |
| 26 | reStructuredText | 19,683 |
| 27 | Perl | 18,576 |
| 28 | Gettext Catalog | 17,071 |
| 29 | Diff | 14,225 |
| 30 | CMake | 14,132 |

## Dataset Structure

### Data Fields

| Field | Type | Description |
|-------|------|-------------|
| `file_text` | string | The full text content of the file (UTF-8 encoded) |
| `language` | string | Programming language as identified by [github-linguist](https://github.com/github-linguist/linguist) |
| `file_name` | string | A unique identifier for the file within the dataset |

### Data Format

- **Format**: Apache Parquet
- **File Structure**: Single file (`data.parquet`)

### Data Splits

All examples are in the train split. There is no validation or test split.

### Example Data Point

- `file_text`: `"Процедура ОбработкаПроведения(Отказ, Режим)\n\t// Нерабочий вариант без ошибок\n..."`
- `language`: `1C Enterprise`
- `file_name`: `004_work.code.bsl`

## Dataset Creation

### Language Detection

Programming languages are detected using [github-linguist](https://github.com/github-linguist/linguist), GitHub's library for language detection and syntax highlighting.

### Source Data

All data originates from public repositories hosted on [GitVerse](https://gitverse.ru).

## Considerations for Using the Data

### Personal and Sensitive Information

The dataset may contain:
- Email addresses in code comments or configuration files
- API keys or credentials that were accidentally committed
- Personal information in comments or documentation

Users should exercise caution and implement appropriate filtering when using this data.

### Licensing Information

This dataset is a collection of source code from repositories with various licenses. Any use of all or part of the code gathered in this dataset must abide by the terms of the original licenses, including attribution clauses when relevant. Users of this dataset should respect the rights of the authors and use the data responsibly.</description>
<size>2113649354</size>
</item><item>
<title>GitFlic Code Dataset</title>
<category>Dataset</category>
<infohash>ac3849c0ddceb432d9733ed19a000befe88e1e82</infohash>
<guid>https://academictorrents.com/details/ac3849c0ddceb432d9733ed19a000befe88e1e82</guid>
<link>https://academictorrents.com/details/ac3849c0ddceb432d9733ed19a000befe88e1e82</link>
<description># GitFlic Code Dataset

## Dataset Description

This dataset was compiled from code repositories hosted on [GitFlic](https://gitflic.ru), the first Russian service for storing and working with source code, based on the Git version control system. GitFlic is widely used by Russian developers, enterprises, and open-source projects, making this dataset particularly valuable for training code models with strong Russian language understanding and Russian coding conventions.

### Dataset Summary

| Statistic | Value |
|-----------|-------|
| **Total Files** | 5,975,978 |
| **Total Repositories** | 12,527 |
| **Total Size** | 6.44 GB (compressed Parquet) |
| **Programming Languages** | 690 |
| **File Format** | Parquet with Zstd compression (6 files) |

### Key Features

- **Russian code corpus**: Contains code from over 12,000 repositories, many featuring Russian comments, documentation, and variable names
- **Diverse language coverage**: Spans 690 programming languages identified by [github-linguist](https://github.com/github-linguist/linguist)
- **Deduplicated**: The dataset has been deduplicated and filtered to remove binary files
- **Quality filtered**: Filtered to ensure data quality and remove non-code content

### Languages

The dataset includes 690 programming languages. The top 30 languages by file count:

| Rank | Language | File Count |
|------|----------|------------|
| 1 | C | 739,012 |
| 2 | Java | 634,899 |
| 3 | C++ | 587,528 |
| 4 | JavaScript | 422,832 |
| 5 | PHP | 365,105 |
| 6 | XML | 291,920 |
| 7 | Markdown | 211,574 |
| 8 | Shell | 207,178 |
| 9 | Python | 206,443 |
| 10 | Unity3D Asset | 150,654 |
| 11 | SVG | 150,136 |
| 12 | TypeScript | 141,886 |
| 13 | Text | 139,406 |
| 14 | JSON | 126,214 |
| 15 | HTML | 122,341 |
| 16 | Go | 109,740 |
| 17 | YAML | 89,416 |
| 18 | Roff | 82,609 |
| 19 | C# | 77,520 |
| 20 | Makefile | 63,594 |
| 21 | LLVM | 55,680 |
| 22 | Scala | 53,395 |
| 23 | Unix Assembly | 49,909 |
| 24 | Rust | 35,553 |
| 25 | reStructuredText | 35,023 |
| 26 | Objective-C | 34,151 |
| 27 | Ruby | 33,366 |
| 28 | CMake | 33,030 |
| 29 | CSS | 31,664 |
| 30 | TSX | 31,397 |

## Dataset Structure

### Data Fields

| Field | Type | Description |
|-------|------|-------------|
| `file_text` | string | Content of the source file (UTF-8 encoded) |
| `language` | string | Programming language as identified by [github-linguist](https://github.com/github-linguist/linguist) |
| `file_name` | string | A unique identifier for the file within the dataset |

### Data Format

- **Format**: Apache Parquet with Zstd compression
- **File Structure**: 6 files (`gitflic-00000.parquet` to `gitflic-00005.parquet`)

### Data Splits

All examples are in the train split. There is no validation or test split.

### Example Data Point

- `file_text`: `"package com.example.demo;\n\nimport org.springframework.boot.SpringApplication;\n..."`
- `language`: `Java`
- `file_name`: `Application.java`

## Dataset Creation

### Language Detection

Programming languages are detected using [github-linguist](https://github.com/github-linguist/linguist), GitHub's library for language detection.

### Source Data

All data originates from public repositories hosted on [GitFlic](https://gitflic.ru).

## Considerations for Using the Data

### Personal and Sensitive Information

The dataset may contain:
- Email addresses in code comments or configuration files
- API keys or credentials that were accidentally committed
- Personal information in comments or documentation

Users should exercise caution and implement appropriate filtering when using this data.

### Licensing Information

This dataset is a collection of source code from repositories with various licenses. Any use of all or part of the code gathered in this dataset must abide by the terms of the original licenses, including attribution clauses when relevant. Users of this dataset should respect the rights of the authors and use the data responsibly.</description>
<size>6911382788</size>
</item><item>
<title>Gitee Code Dataset</title>
<category>Dataset</category>
<infohash>e572ddd8459e96ed50ba40f1ee991734805f2259</infohash>
<guid>https://academictorrents.com/details/e572ddd8459e96ed50ba40f1ee991734805f2259</guid>
<link>https://academictorrents.com/details/e572ddd8459e96ed50ba40f1ee991734805f2259</link>
<description># Gitee Code Dataset

## Dataset Description

This dataset was compiled from code repositories hosted on [Gitee](https://gitee.com), China's largest code hosting platform and a leading alternative to GitHub in the Chinese developer community. Gitee is widely used by Chinese developers, enterprises, and open-source projects, making this dataset particularly valuable for training code models with strong Chinese language understanding and Chinese coding conventions.

### Dataset Summary

| Statistic | Value |
|-----------|-------|
| **Total Files** | 819,472,785 |
| **Total Repositories** | 3,105,923 |
| **Total Size** | 536 GB (compressed Parquet) |
| **Programming Languages** | 554 |
| **File Format** | Parquet with Zstd compression (468 files) |

### Key Features

- **Large-scale Chinese code corpus**: Contains code from over 3 million repositories, many featuring Chinese comments, documentation, and variable names
- **Diverse language coverage**: Spans 554 programming languages identified by [go-enry](https://github.com/go-enry/go-enry) (based on GitHub Linguist rules)
- **Rich metadata**: Includes repository name, file path, detected language, license information, and file size
- **Enterprise and open-source projects**: Includes code from both individual developers and Chinese enterprises
- **Quality filtered**: Extensive filtering to remove vendor code, build artifacts, generated files, and low-quality content

### Languages

The dataset includes 554 programming languages. The top 30 languages by file count:

| Rank | Language | File Count |
|------|----------|------------|
| 1 | Java | 293,439,777 |
| 2 | JavaScript | 77,715,425 |
| 3 | C | 62,836,721 |
| 4 | C++ | 49,134,251 |
| 5 | HTML | 46,191,063 |
| 6 | Vue | 40,468,646 |
| 7 | PHP | 37,132,954 |
| 8 | C# | 33,842,369 |
| 9 | Python | 25,192,704 |
| 10 | CSS | 20,802,464 |
| 11 | TypeScript | 20,122,528 |
| 12 | Go | 16,176,561 |
| 13 | Shell | 8,371,429 |
| 14 | Makefile | 6,341,964 |
| 15 | Java Server Pages | 6,224,523 |
| 16 | TSX | 5,768,542 |
| 17 | CMake | 5,581,774 |
| 18 | SCSS | 5,291,031 |
| 19 | Objective-C | 4,922,736 |
| 20 | Less | 4,669,672 |
| 21 | Ruby | 3,027,385 |
| 22 | Kotlin | 2,986,211 |
| 23 | Scala | 2,869,640 |
| 24 | Rust | 2,466,122 |
| 25 | Starlark | 2,027,514 |
| 26 | Dart | 2,010,079 |
| 27 | Unix Assembly | 1,900,320 |
| 28 | Fluent | 1,882,380 |
| 29 | HTML+Razor | 1,863,914 |
| 30 | Swift | 1,607,477 |

### Licenses

The dataset includes files from repositories with various licenses. Repositories with restrictive licenses (CC-BY-ND variants, Commons Clause, SSPL) were excluded:

| License | File Count |
|---------|------------|
| apache-2.0 | 273,706,950 |
| mit | 201,880,040 |
| unknown | 195,868,240 |
| agpl-3.0 | 60,181,320 |
| bsd | 30,013,190 |
| gpl-2.0 | 27,831,530 |
| lgpl-3.0 | 11,746,750 |
| lgpl-2.1 | 4,807,600 |
| bsd-3-clause | 4,442,480 |
| cc0-1.0 | 3,144,920 |
| gpl-3.0 | 1,631,590 |
| unlicense | 1,181,930 |
| bsd-2-clause | 1,154,300 |
| epl-1.0 | 1,045,470 |
| Other licenses | ~5,800,000 |

## Dataset Structure

### Data Fields

| Field | Type | Description |
|-------|------|-------------|
| `code` | string | Content of the source file (UTF-8 encoded) |
| `repo_name` | string | Name of the Gitee repository (format: `username/repo`) |
| `path` | string | Path of the file within the repository (relative to repo root) |
| `language` | string | Programming language as identified by [go-enry](https://github.com/go-enry/go-enry) |
| `license` | string | License of the repository (SPDX identifier or "unknown") |
| `size` | int64 | Size of the source file in bytes |

### Data Format

- **Format**: Apache Parquet with Zstd compression
- **File Structure**: 468 files (`gitee_0000.parquet` to `gitee_0467.parquet`)

### Data Splits

All examples are in the train split. There is no validation or test split.

### Example Data Point

- `code`: `"package com.example.demo;\n\nimport org.springframework.boot.SpringApplication;\n..."`
- `repo_name`: `username/spring-demo`
- `path`: `src/main/java/com/example/demo/Application.java`
- `language`: `Java`
- `license`: `apache-2.0`
- `size`: `1234`

## Dataset Creation

### Pipeline Overview

The dataset was created through a multi-stage pipeline:

1. **Repository Discovery**
2. **Branch Selection**: Selecting the main branch for each repository (priority: `master` &gt; `main` &gt; `develop` &gt; `dev` &gt; first branch)
3. **Repository Downloading**
4. **Content Extraction**: Extracting and filtering source code files
5. **Parquet Generation**: Writing filtered records to Parquet shards with Zstd compression

### Language Detection

Programming languages are detected using [go-enry](https://github.com/go-enry/go-enry), a Go port of GitHub's Linguist library. Only files classified as **Programming** or **Markup** language types are included (Data and Prose types are excluded).

### License Detection

Licenses are detected by:

1. Scanning for license files (`LICENSE`, `LICENSE.txt`, `LICENSE.md`, `COPYING`, etc.)
2. Matching license text against known patterns (MIT, Apache 2.0, GPL variants, BSD, Creative Commons, etc.)
3. Defaulting to "unknown" if no license can be detected

**Blocked Licenses**: The following restrictive licenses are excluded from the dataset:
- `cc-by-nd`, `cc-by-nd-2.0`, `cc-by-nd-3.0`, `cc-by-nd-4.0` (Creative Commons No-Derivatives)
- `commons-clause`
- `sspl`, `sspl-1.0` (Server Side Public License)

### File Filtering

Extensive filtering is applied to ensure data quality:

#### Size Limits

| Limit | Value |
|-------|-------|
| Max repository ZIP size | 48 MB |
| Max single file size | 1 MB |
| Max line length | 1,000 characters |

#### Excluded Directories

- **Configuration**: `.git/`, `.github/`, `.gitlab/`, `.vscode/`, `.idea/`, `.vs/`, `.settings/`, `.eclipse/`, `.project/`, `.metadata/`
- **Vendor/Dependencies**: `node_modules/`, `bower_components/`, `jspm_packages/`, `vendor/`, `third_party/`, `3rdparty/`, `external/`, `packages/`, `deps/`, `lib/vendor/`, `target/dependency/`, `Pods/`
- **Build Output**: `build/`, `dist/`, `out/`, `bin/`, `target/`, `release/`, `debug/`, `.next/`, `.nuxt/`, `_site/`, `_build/`, `__pycache__/`, `.pytest_cache/`, `cmake-build-*`, `.gradle/`, `.maven/`

#### Excluded Files

- **Lock Files**: `package-lock.json`, `yarn.lock`, `pnpm-lock.yaml`, `Gemfile.lock`, `Cargo.lock`, `poetry.lock`, `Pipfile.lock`, `composer.lock`, `go.sum`, `mix.lock`
- **Minified Files**: Any file containing `.min.` in the name
- **Binary Files**: `.exe`, `.dll`, `.so`, `.dylib`, `.a`, `.lib`, `.o`, `.obj`, `.jar`, `.war`, `.ear`, `.class`, `.pyc`, `.pyo`, `.wasm`, `.bin`, `.dat`, `.pdf`, `.doc`, `.docx`, `.xls`, `.xlsx`, `.ppt`, `.pptx`, `.zip`, `.tar`, `.gz`, `.bz2`, `.7z`, `.rar`, `.jpg`, `.jpeg`, `.png`, `.gif`, `.bmp`, `.ico`, `.svg`, `.mp3`, `.mp4`, `.avi`, `.mov`, `.wav`, `.flac`, `.ttf`, `.otf`, `.woff`, `.woff2`, `.eot`
- **System Files**: `.DS_Store`, `Thumbs.db`

#### Content Filtering

- **UTF-8 Validation**: Files must be valid UTF-8 encoded text
- **Binary Detection**: Files detected as binary by go-enry are excluded
- **Generated Files**: Files with generation markers in the first 500 bytes are excluded: `generated by`, `do not edit`, `auto-generated`, `autogenerated`, `automatically generated`, `code generator`, `generated code`, `this file is generated`, `@generated`, `&lt;auto-generated`
- **Empty Files**: Files that are empty or contain only whitespace are excluded
- **Long Lines**: Files with any line exceeding 1,000 characters are excluded
- **go-enry Filters**: Additional filtering using go-enry's `IsVendor()`, `IsImage()`, `IsDotFile()`, `IsTest()`, and `IsGenerated()` functions
- **Documentation-only Repos**: Repositories containing only documentation files (no actual code) are skipped

### Source Data

All data originates from public repositories hosted on [Gitee](https://gitee.com).
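The content-filtering rules above can be sketched in a few lines. This is an illustrative Python version, not the actual pipeline (which is Go-based and built on go-enry); `keep_file` is a hypothetical helper name:

```python
# Illustrative sketch of the content filters: UTF-8 validation,
# empty-file check, generation-marker scan, and long-line limit.
GENERATION_MARKERS = (
    b"generated by", b"do not edit", b"auto-generated", b"autogenerated",
    b"automatically generated", b"code generator", b"generated code",
    b"this file is generated", b"@generated",
    b"\x3cauto-generated",  # "\x3c" is "<", escaped for embedding here
)
MAX_LINE_LENGTH = 1000

def keep_file(raw: bytes) -> bool:
    """Return True if the raw file contents pass the content filters."""
    # UTF-8 validation: reject files that are not valid UTF-8 text.
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    # Reject empty or whitespace-only files.
    if not text.strip():
        return False
    # Scan the first 500 bytes for generation markers (case-insensitive).
    head = raw[:500].lower()
    if any(marker in head for marker in GENERATION_MARKERS):
        return False
    # Reject files with any line exceeding the length limit.
    if any(len(line) > MAX_LINE_LENGTH for line in text.splitlines()):
        return False
    return True

print(keep_file(b"int main(void) { return 0; }"))              # True
print(keep_file(b"// Code generated by protoc. DO NOT EDIT.")) # False
```

The binary-detection, vendor, and test-file filters are delegated to go-enry in the real pipeline and are omitted from this sketch.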
## Considerations for Using the Data

### Personal and Sensitive Information

The dataset may contain:
- Email addresses in code comments or configuration files
- API keys or credentials that were accidentally committed
- Personal information in comments or documentation

Users should exercise caution and implement appropriate filtering when using this data.

### Licensing Information

This dataset is a collection of source code from repositories with various licenses. Any use of all or part of the code gathered in this dataset must abide by the terms of the original licenses, including attribution clauses when relevant. The `license` field in each data point indicates the license of the source repository.</description>
<size>574748274837</size>
</item><item>
<title>GitCode Code Dataset</title>
<category>Dataset</category>
<infohash>c1dd455d6f9bbeb7f1e88de92a952df5e04f64c4</infohash>
<guid>https://academictorrents.com/details/c1dd455d6f9bbeb7f1e88de92a952df5e04f64c4</guid>
<link>https://academictorrents.com/details/c1dd455d6f9bbeb7f1e88de92a952df5e04f64c4</link>
<description># GitCode Code Dataset

## Dataset Description

This dataset was compiled from code repositories hosted on [GitCode](https://gitcode.com), a code hosting platform in China backed by CSDN (China Software Developer Network). GitCode serves as a domestic alternative to GitHub, widely used by Chinese developers, students, and enterprises for hosting open-source projects and educational resources, making this dataset particularly valuable for training code models with Chinese language understanding and Chinese coding conventions.

### Dataset Summary

| Statistic | Value |
|-----------|-------|
| **Total Files** | 48,142,567 |
| **Total Repositories** | 85,632 |
| **Total Size** | 40 GB (compressed Parquet) |
| **Programming Languages** | 537 |
| **File Format** | Parquet with Zstd compression (34 files) |

### Key Features

- **Chinese code corpus**: Contains code from over 85,000 repositories, many featuring Chinese comments, documentation, and variable names
- **Diverse language coverage**: Spans 537 programming languages identified by [go-enry](https://github.com/go-enry/go-enry) (based on GitHub Linguist rules)
- **Rich metadata**: Includes repository name, file path, detected language, license information, and file size
- **Open-source and educational projects**: Includes code from individual developers, students, and Chinese enterprises
- **Quality filtered**: Extensive filtering to remove vendor code, build artifacts, generated files, and low-quality content

### Languages

The dataset includes 537 programming languages. The top 30 languages by file count:

| Rank | Language | File Count |
|------|----------|------------|
| 1 | C++ | 9,513,619 |
| 2 | C | 8,220,317 |
| 3 | Java | 5,362,924 |
| 4 | Python | 3,428,302 |
| 5 | TypeScript | 3,166,959 |
| 6 | JavaScript | 2,540,280 |
| 7 | HTML | 1,578,824 |
| 8 | Kotlin | 1,413,651 |
| 9 | C# | 1,232,638 |
| 10 | Go | 1,159,708 |
| 11 | Rust | 812,959 |
| 12 | Dart | 767,731 |
| 13 | TSX | 749,355 |
| 14 | PHP | 663,953 |
| 15 | Shell | 629,436 |
| 16 | Vue | 563,754 |
| 17 | Makefile | 471,588 |
| 18 | CMake | 460,428 |
| 19 | CSS | 381,628 |
| 20 | Ruby | 350,213 |
| 21 | Objective-C | 347,251 |
| 22 | LLVM | 297,591 |
| 23 | Unix Assembly | 291,826 |
| 24 | Swift | 206,725 |
| 25 | Objective-C++ | 160,526 |
| 26 | Scala | 157,367 |
| 27 | QML | 157,088 |
| 28 | Lua | 149,114 |
| 29 | SCSS | 141,661 |
| 30 | GLSL | 129,124 |

### Licenses

The dataset includes files from repositories with various licenses. Repositories with restrictive licenses (CC-BY-ND variants, Commons Clause, SSPL) were excluded:

| License | File Count |
|---------|------------|
| unknown | 23,567,463 |
| apache-2.0 | 8,722,445 |
| mit | 7,743,613 |
| gpl-2.0 | 3,528,526 |
| agpl-3.0 | 2,300,580 |
| lgpl | 1,013,654 |
| bsd-3-clause | 528,980 |
| gpl-3.0 | 305,332 |
| public-domain | 163,493 |
| bsd-2-clause | 94,426 |
| bsd | 69,967 |
| isc | 36,117 |
| unlicense | 28,411 |
| cc0-1.0 | 26,799 |
| mpl-2.0 | 9,459 |
| Other licenses | ~5,000 |

## Dataset Structure

### Data Fields

| Field | Type | Description |
|-------|------|-------------|
| `code` | string | Content of the source file (UTF-8 encoded) |
| `repo_name` | string | Name of the GitCode repository (format: `username/repo`) |
| `path` | string | Path of the file within the repository (relative to repo root) |
| `language` | string | Programming language as identified by [go-enry](https://github.com/go-enry/go-enry) |
| `license` | string | License of the repository (SPDX identifier or "unknown") |
| `size` | int64 | Size of the source file in bytes |

### Data Format

- **Format**: Apache Parquet with Zstd compression
- **File Structure**: 34 files (`gitcode_0000.parquet` to `gitcode_0033.parquet`)

### Data Splits

All examples are in the train split. There is no validation or test split.

### Example Data Point</description>
<size>42220690870</size>
</item><item>
<title>Russian QnA 333K</title>
<category>Dataset</category>
<infohash>8282e7191ddec974eb94c56ce83d424cc8184204</infohash>
<guid>https://academictorrents.com/details/8282e7191ddec974eb94c56ce83d424cc8184204</guid>
<link>https://academictorrents.com/details/8282e7191ddec974eb94c56ce83d424cc8184204</link>
<description># Dataset Card for Russian QnA

### Dataset Summary

This dataset contains a collection of questions and answers in Russian. The dataset includes questions across various categories with corresponding answers, ratings, and metadata.

### Languages

The dataset content is primarily in Russian:
- Russian (ru)

## Dataset Structure

### Data Files

- Single file containing all Q&amp;A records: `data.parquet`

### Data Fields

Each record contains the following fields:
- `question_id`: Unique identifier for the question.
- `question_title`: Title/subject of the question.
- `question_description`: Extended description or body of the question.
- `question_images`: Array of image URLs associated with the question.
- `category`: Category/topic area of the question (e.g., "здоровье и медицина", i.e. "health and medicine").
- `tags`: Array of tags associated with the question.
- `question_rating`: Rating/score of the question.
- `answers`: Array of answer objects, each containing:
  - `answer_text`: Text content of the answer
  - `answer_images`: Array of image URLs in the answer
  - `answer_rating`: Rating/score of the answer

### Data Splits

The dataset contains a single split with all Q&amp;A records:

| Split | Description | Number of Examples |
| :---- | :----------------------- | -----------------: |
| `train` | All question-answer pairs | 333,029 |</description>
<size>225873026</size>
</item></channel>
</rss>
