<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:academictorrents="https://academictorrents.com" version="2.0">
<channel>
<title>Academic Torrents</title>
<description>Recent Torrents</description>
<link>https://academictorrents.com/</link>
<item>
<title>Reddit comments/submissions 2026-03</title>
<category>Dataset</category>
<infohash>668087bb8c8c9c763b27a1a4c5e7fcb6add25f2c</infohash>
<guid>https://academictorrents.com/details/668087bb8c8c9c763b27a1a4c5e7fcb6add25f2c</guid>
<link>https://academictorrents.com/details/668087bb8c8c9c763b27a1a4c5e7fcb6add25f2c</link>
<description>Reddit comments and submissions from 2026-03. Documentation, JSON schemas, and more can be found at https://github.com/ArthurHeitmann/arctic_shift. Helper scripts for processing files can be found at https://github.com/Watchful1/PushshiftDumps.</description>
<size>68404242141</size>
</item><item>
<title>Places in the Wild: Ecologically-sampled RAW photographs</title>
<category>Dataset</category>
<infohash>0484bed6b741505f66503bb141e09ae20a4d59d3</infohash>
<guid>https://academictorrents.com/details/0484bed6b741505f66503bb141e09ae20a4d59d3</guid>
<link>https://academictorrents.com/details/0484bed6b741505f66503bb141e09ae20a4d59d3</link>
<description>Places in the Wild comprises over 67,000 RAW-format images, each captured with a 45-megapixel Canon EOS R5 full-frame mirrorless camera at 5-degree intervals, providing 360-degree coverage across over 800 unique locations. These locations span 260 basic-level scene categories, including both indoor and outdoor environments such as bedrooms, train stations, forests, and parking garages.</description>
<size>2618655068160</size>
</item><item>
<title>UIdataGB Gallblader Diseases Dataset</title>
<category>Dataset</category>
<infohash>aa2092fb6910c90c467a62293f74042a6ff7d251</infohash>
<guid>https://academictorrents.com/details/aa2092fb6910c90c467a62293f74042a6ff7d251</guid>
<link>https://academictorrents.com/details/aa2092fb6910c90c467a62293f74042a6ff7d251</link>
<description>The dataset is composed of ultrasound images of the gallbladder (GB) organ from inside the gastrointestinal tract. The dataset includes 9 classes according to anatomical landmarks. Each class represents a GB disease. Published: 23 January 2024 | Version 1 | DOI: 10.17632/r6h24d2d3y.1 Turki, Amina; Mahdi Obaid, Ahmed; Bellaaj, Hatem; Ksantini, Mohamed; Altaee, Abdulla (2024), “Gallblader Diseases Dataset”, Mendeley Data, V1, doi: 10.17632/r6h24d2d3y.1 "The UIdataGB dataset consists of 10692 images, annotated and verified by medical doctors and experienced radiologists. It includes 9 classes according to anatomical landmarks. Each class contains nearly 1200 images; therefore, the dataset is balanced in terms of diseases. In total, 1782 patients were involved in the data collection; the number of female images was 6246, with an average age of 63.4, while the number of male images was 4446, with an average age of 59.6. The number of images is sufficient for different tasks, e.g., image retrieval, ML, DL, and transfer learning (TL). The anatomical landmark of the GB determines the pathological finding, such as cholecystitis, stones of the GB, and polyps. The dataset consists of images with a resolution of 900×1200 pixels, sorted into nine separate folders named according to their content. Tables 1 and 2 show the distribution of diseases in terms of images and patient numbers, as well as the distribution of images according to gender." https://www.sciencedirect.com/science/article/pii/S2352340924003950</description>
<size>2042724460</size>
</item><item>
<title>enwiki-20260401-pages-articles-multistream-index.txt.bz2</title>
<category>Paper</category>
<infohash>8ed9dcc05b0fdecb47f998186e9e5c30f8212cfc</infohash>
<guid>https://academictorrents.com/details/8ed9dcc05b0fdecb47f998186e9e5c30f8212cfc</guid>
<link>https://academictorrents.com/details/8ed9dcc05b0fdecb47f998186e9e5c30f8212cfc</link>
<description>English Wikipedia Multistream Index 2026-04-01 https://en.wikipedia.org/wiki/Wikipedia:Database_download Corresponding multistream file: https://academictorrents.com/details/2b2ffc80941b61dcfe55fc444a6a78a60eef5944</description>
<size>280349081</size>
</item><item>
<title>enwiki-20260401-pages-articles-multistream.xml.bz2</title>
<category>Paper</category>
<infohash>2b2ffc80941b61dcfe55fc444a6a78a60eef5944</infohash>
<guid>https://academictorrents.com/details/2b2ffc80941b61dcfe55fc444a6a78a60eef5944</guid>
<link>https://academictorrents.com/details/2b2ffc80941b61dcfe55fc444a6a78a60eef5944</link>
<description>English Wikipedia Multistream 2026-04-01 https://en.wikipedia.org/wiki/Wikipedia:Database_download Corresponding index file: https://academictorrents.com/details/8ed9dcc05b0fdecb47f998186e9e5c30f8212cfc</description>
<size>26207949994</size>
</item><item>
<title>EchoCP_dataset</title>
<category>Dataset</category>
<infohash>22508526b4bf4b08641b3dbba010eb083388322c</infohash>
<guid>https://academictorrents.com/details/22508526b4bf4b08641b3dbba010eb083388322c</guid>
<link>https://academictorrents.com/details/22508526b4bf4b08641b3dbba010eb083388322c</link>
<description>EchoCP, a dataset of contrast transthoracic echocardiography (cTTE) for patent foramen ovale (PFO) diagnosis, is published. We present EchoCP, the first dataset for cTTE-based PFO diagnosis. EchoCP contains both Valsalva maneuver (VM) and rest echocardiography videos captured from 30 patients. Data annotation, including diagnosis annotation and segmentation annotation, was performed by four experienced cardiovascular sonographers. As there are more than a thousand images in each patient's video, sparse labeling of the segmentation (only representative frames are selected) is adopted. For each patient, two videos are captured, corresponding to the rest and VM states. Note that in the rest state, patients simply relax and breathe normally, while in the VM state, patients close their mouths and pinch their noses shut while expelling air, as if blowing up a balloon. Each video is captured in the apical-4-chamber view and contains at least ten cardiac cycles. For the VM state, the maneuver is performed three to five times during acquisition, and the most representative recording was selected. If you use our dataset, please consider citing our paper in MICCAI 2021: Tianchen Wang, Zhihe Li, Shanshan Bi, Meiping Huang, Jiawei Zhang, Jian Zhuang, Yiyu Shi, Hongwen Fei, Xiaowei Xu, "ImageCHD: A 3D Computed Tomography Image Dataset for Classification of Congenital Heart Disease," in Proc. of Medical Image Computing and Computer Assisted Interventions (MICCAI), Online, 2021. https://arxiv.org/abs/2101.10799 HIGHLIGHT 20231101: We have deployed the dataset on Kaggle! Please send emails to xiao.wei.xu@foxmail.com if you have any questions about the dataset and the benchmark.</description>
<size>5554695068</size>
</item><item>
<title>cardiacUDC_dataset</title>
<category>Dataset</category>
<infohash>55cce2068badb8204b5de896922afc301c37a691</infohash>
<guid>https://academictorrents.com/details/55cce2068badb8204b5de896922afc301c37a691</guid>
<link>https://academictorrents.com/details/55cce2068badb8204b5de896922afc301c37a691</link>
<description>We collect CardiacUDA from our two hospitals: Site G and Site R. To guarantee that all echocardiogram videos are standards-compliant, all cases of CardiacUDA are collected, annotated, and approved by 5-6 experienced physicians. For ethical compliance, we have obtained approval from the medical institutions. Each patient was scanned in four views: parasternal left ventricle long axis (LVLA), pulmonary artery long axis (PALA), left ventricular short axis (LVSA), and apical four-chamber heart (A4C), resulting in four videos per patient. The resolution of each video was either 800x600 or 1024x768, depending on the scanner used (Philips or HITACHI). A total of 516 and 476 videos were collected from Site G and Site R, respectively, from approximately 100 different patients. Each video consists of over 100 frames, covering at least one heartbeat cycle. We provide pixel-level annotations for each view: masks for the left ventricle (LV) and right ventricle (RV) in the LVLA view, masks for the pulmonary artery (PA) in the PALA view, masks for the LV and RV in the LVSA view, and masks for the LV, RV, left atrium (LA), and right atrium (RA) in the A4C view. The videos from both sites were divided in a ratio of 8:1:1 for training, validation, and testing, respectively. To lower annotation costs, only five frames per video in the training set are provided with pixel-level annotation masks. To better measure model performance, we provide pixel-level annotations for every frame of each video in the validation and testing sets. HIGHLIGHT 20231101: We have deployed the dataset on Kaggle! Please refer to the code (https://github.com/xmed-lab/GraphEcho) and our ICCV paper (https://arxiv.org/abs/2309.11145) for more details. Please send emails to xiao.wei.xu@foxmail.com if you have any questions about the dataset and the benchmark.</description>
<size>4547904911</size>
</item><item>
<title>ImperialAIEchocardiographyDataset_2020-12-05</title>
<category>Dataset</category>
<infohash>eeb1fb0e2e0e7bb8c6873f3357e45b545cba9fac</infohash>
<guid>https://academictorrents.com/details/eeb1fb0e2e0e7bb8c6873f3357e45b545cba9fac</guid>
<link>https://academictorrents.com/details/eeb1fb0e2e0e7bb8c6873f3357e45b545cba9fac</link>
<description>This is the latest version of the datasets and code; they are constantly being added to, and the code lives on GitHub.
2020-12-05 release: Unity Imaging Echocardiography Model Development Dataset images and labels.
Unity Imaging code: https://github.com/UnityImaging
For reproducibility, specific snapshots of the datasets and code used for publication are below.
Images - png-cache.zip
1) We curate a collection of DICOM files that will contribute to a dataset.
2) Each DICOM file is assigned to a dataset class - currently there are two: 01 - development (training / tuning / internal validation) images; 02 - external validation images.
3) Each DICOM file is given a 64-character hexadecimal code, e.g. 4d44413619e0161c5ab795bc1b899f7fb4bd0b2f5ab2efc881ecfc663d3bfb66
4) Each image within a DICOM (typically an individual frame for echo) is given a number padded to 4 digits, from 0000 to 9999.
5) These images are extracted from the DICOM file, burnt-in metadata is masked, and each image is saved as a PNG with its code as the filename - e.g. 01-4d44413619e0161c5ab795bc1b899f7fb4bd0b2f5ab2efc881ecfc663d3bfb66-0000.png
6) The individual images that make up a dataset for a paper are saved in a folder called png-cache, with subdirectories for the dataset class (e.g. /01) and then the first two pairs of hexadecimal digits (e.g. /4d/44), i.e. /png-cache/01/4d/44/4d44413619e0161c5ab795bc1b899f7fb4bd0b2f5ab2efc881ecfc663d3bfb66-0000.png
7) This folder is then compressed to form png-cache.zip
Not all files have an associated label - e.g. all the frames of a video may be included, but only a few of them have expert labels.
Labels - labels.zip
These are stored as JSON files. The development dataset (provided as labels-all.json) is divided into: labels-train.json - training; labels-tune.json - tuning; labels-ival.json - internal validation.
For each image file (which acts as the key), there is a dictionary for every possible label. Each label for an image may have one of these types:
"off": the structure is definitely not in the image, i.e. the outputs would be expected to be all zeros.
"blurred": the structure might be in the image, but there is no label available (either it was too blurry, or no one has tried to label it), i.e. the output would need to be masked from the loss function.
"point": the structure is a single point, with the x and y coordinates in the x and y keys.
"curve": the structure is a curve, represented as a cubic spline, with the x and y coordinates of the control points in the x and y keys.
For convenience, each of the .json files has an equivalent .txt file listing the contained images.</description>
<size>1274051668</size>
</item><item>
<title>Cardiac Assessment and Classification of Ultrasound (CACTUS) dataset</title>
<category>Dataset</category>
<infohash>329c0ee4a0037a2628e2f2dba826066f764f193c</infohash>
<guid>https://academictorrents.com/details/329c0ee4a0037a2628e2f2dba826066f764f193c</guid>
<link>https://academictorrents.com/details/329c0ee4a0037a2628e2f2dba826066f764f193c</link>
<description>The Cardiac Assessment and Classification of Ultrasound (CACTUS) dataset is an open, quality-graded dataset designed for the evaluation and classification of cardiac ultrasound images. The dataset was created as part of the ARQUS project, which aims to develop an autonomous robotic system capable of performing ultrasound scans and extracting quantitative measurements. The project is funded by NSERC (the Natural Sciences and Engineering Research Council of Canada). The dataset contains ultrasound images obtained from scans of the CAE Blue Phantom, a synthetic model used to simulate the human heart. These images represent a variety of heart views and exhibit different quality levels. A detailed grading schema was developed by two medical imaging experts to assess the quality of each image, which ensures that the dataset contains a diverse range of both high- and low-quality ultrasound scans. The CACTUS dataset is particularly valuable for applications in artificial intelligence, specifically in the domain of echocardiography. It has been used in the development of automated systems for the classification of cardiac ultrasound images and the assessment of image quality, which can assist medical practitioners by automating these traditionally labor-intensive tasks.
1. Title of Dataset: CACTUS: An open dataset and framework for automated Cardiac Assessment and Classification of Ultrasound images using deep transfer learning.
2. Author Information
A. Principal Investigator Contact Information
Name: Hanae Elmekki
Institution: Concordia University
Email: hanae.elmekki@mail.concordia.ca
B. Associate or Co-investigator Contact Information
Name: Amanda Spilkin
Institution: Concordia University
Email: amanda.spilkin@mail.concordia.ca
3. Date of data collection (single date, range, approximate date): 2025-03-05
4. Geographic location of data collection: Concordia University, Montreal, Quebec, Canada
5. Information about funding sources that supported the collection of the data: Natural Sciences and Engineering Research Council of Canada (NSERC), Discovery Horizons Program and Individual Discovery Grant
SHARING/ACCESS INFORMATION
1. Licenses/restrictions placed on the data: These data are available under a CC BY 4.0 license &lt;https://creativecommons.org/licenses/by/4.0/&gt;
2. Links to publications that cite or use the data: https://doi.org/10.1016/j.compbiomed.2025.110003</description>
<size>13184921358</size>
</item><item>
<title>EchoUNAL a benchmarking study for automatic classification of six echocardiographic views</title>
<category>Dataset</category>
<infohash>50969c9b0b452e66e9648d1d31c60f8adbf3ec9b</infohash>
<guid>https://academictorrents.com/details/50969c9b0b452e66e9648d1d31c60f8adbf3ec9b</guid>
<link>https://academictorrents.com/details/50969c9b0b452e66e9648d1d31c60f8adbf3ec9b</link>
<description># EchoUNAL Dataset &amp; Classifier

This repository contains the dataset and resources for the paper "The EchoCardiography open data base EchoUNAL: a benchmarking study for automatic classification of six echocardiographic views".

## Description

The study presents a comparative evaluation of four CNN architectures for classifying six standard echocardiographic views:

* **A4C**: Apical Four Chamber
* **A5C**: Apical Five Chamber
* **PLAX**: Parasternal Long Axis
* **PSAX**: Parasternal Short Axis
* **S4C**: Subcostal Four Chambers
* **IVC**: Inferior Vena Cava

## Dataset Details

The database was created specifically for this study and is shared publicly to encourage further research.

* **Data Source**: The dataset was obtained from **89 healthy volunteers**, with informed consent from all participants.
* **Acquisition Hardware**: A **Butterfly iQ+** ultrasound device was used for image acquisition.
* **Acquisition Protocol**: Images were captured by two cardiologists specializing in echocardiography. Some videos were excluded due to poor acoustic window quality.
* **Total Contents**: The final dataset comprises **346 videos**, distributed as follows:
  * **A4C**: 61 videos
  * **A5C**: 49 videos
  * **PLAX**: 66 videos
  * **PSAX**: 61 videos
  * **S4C**: 54 videos
  * **IVC**: 55 videos

## Data Collection &amp; Preprocessing

* **Original Format**: Videos were stored in `.AVI` format with a 500x500 pixel resolution. The original frame rate varied between 26 and 30 fps.
* **Applied Preprocessing**:
  1. The ultrasound interface was cropped to center the cardiac image area.
  2. The frame rate was standardized to **30 fps** using interpolation.
  3. A manual, frame-level relabeling was performed to ensure label accuracy. A **"no view"** label was used for frames without a clear view, though this class was later excluded from the training setup due to its high heterogeneity.
  4. All video frames were resized to **224x224** pixels, a standard input size for the pretrained CNN architectures.

## Acknowledgments

This work was funded by the project "Estimation of cardiac work as an index of cardiovascular function in echocardiographic videos", code 60946, from the call for research projects SUE Distrito Capital of 2023.

J. D. S. Avila et al., "The EchoCardiography open data base EchoUNAL: a benchmarking study for automatic classification of six echocardiographic views," 2025 21st International Symposium on Biomedical Image Processing and Analysis (SIPAIM), Pasto, Colombia, 2025, pp. 1-4, doi: 10.1109/SIPAIM67325.2025.11283340.

keywords: Training;Visualization;Ultrasonic imaging;Systematics;Echocardiography;Computational modeling;Benchmark testing;Convolutional neural networks;Videos;Residual neural networks</description>
<size>4302279602</size>
</item><item>
<title>Phishing &amp; Malware Website Snapshots </title>
<category>Dataset</category>
<infohash>8beb63ba7bb1ed7affb2fbe77ec18f4ab6a55d20</infohash>
<guid>https://academictorrents.com/details/8beb63ba7bb1ed7affb2fbe77ec18f4ab6a55d20</guid>
<link>https://academictorrents.com/details/8beb63ba7bb1ed7affb2fbe77ec18f4ab6a55d20</link>
<description># Phishing &amp; Malware Website Snapshots

136,414 phishing and malware website snapshots captured by a headless Chromium browser between July 24 and August 15, 2024. URLs were confirmed or high-confidence phishing/malware at the time of collection, though some hosts had already been blocked or taken down when the snapshot was taken. Each row contains the full HTML source, extracted visible text, complete network traffic from HAR recording, parsed page features, and resource fingerprints. Collected in mid-2024, published March 2026.

All of the phishing domains and infrastructure captured here are long dead. The value is in page content, kit patterns, and network behavior rather than live IOCs. 55,339 unique domains, 23,588 of which appear more than once.

## Use Cases

- Training phishing/malware classifiers (URL-level, page-level, or multimodal)
- Static rule generation for phishing kit detection
- Threat intelligence on phishing infrastructure and hosting
- Feature engineering for browser-based detection extensions
- Academic research on web-based social engineering

## Schema

Single flat Parquet table, 42 columns per row. Nested columns use Parquet `list&lt;struct&gt;` types.
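Because every snapshot is one row in a single flat table, cross-row analyses reduce to simple grouping. As a minimal sketch (the rows below are invented stand-ins mirroring the documented columns, not real records), one way to surface shared phishing kits is to index the `resource_hashes` entries by `body_sha256` and look for hashes served from more than one domain:

```python
# Sketch: index served resources by content hash to spot kit reuse across
# domains. The two rows are invented stand-ins with the documented fields
# (domain, resource_hashes with url / body_sha256); real rows would come
# from reading the Parquet shards.
from collections import defaultdict

rows = [
    {"domain": "bank-verify.example",
     "resource_hashes": [{"url": "https://bank-verify.example/kit.js",
                          "body_sha256": "aa11"}]},
    {"domain": "login-check.example",
     "resource_hashes": [{"url": "https://login-check.example/kit.js",
                          "body_sha256": "aa11"}]},
]

domains_by_hash = defaultdict(set)
for row in rows:
    for res in row["resource_hashes"]:
        domains_by_hash[res["body_sha256"]].add(row["domain"])

# A hash served by more than one domain suggests a shared phishing kit.
shared = {h: sorted(d) for h, d in domains_by_hash.items() if len(d) > 1}
print(shared)
```

On the real table, the same grouping over the 2,349,290 resource hashes would flag kit files reused across the 55,339 domains.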
### Identification &amp; Metadata

| Column | Type | Description |
|---|---|---|
| `id` | string | Unique archive ID (`hostname_rand5`) |
| `domain` | string | Target hostname |
| `url` | string | Original URL visited |
| `scan_timestamp` | string | ISO timestamp of capture |
| `language` | string | Declared page language |

### Page Content

| Column | Type | Description |
|---|---|---|
| `html` | large_string | Full page HTML source |
| `html_length` | int64 | HTML byte length |
| `extracted_text` | large_string | Visible text via trafilatura |
| `text_length` | int64 | Extracted text byte length |
| `title` | string | Page `&lt;title&gt;` content |
| `meta_tags` | list&lt;struct&gt; | All `&lt;meta&gt;` tags (name + content) |
| `html_comments` | list&lt;string&gt; | HTML comments (useful for kit signatures) |

### Scripts &amp; Styles

| Column | Type | Description |
|---|---|---|
| `external_script_urls` | list&lt;string&gt; | External `&lt;script src&gt;` URLs |
| `inline_scripts` | list&lt;large_string&gt; | Full inline JavaScript |
| `inline_scripts_count` | int32 | Number of inline `&lt;script&gt;` blocks |
| `inline_scripts_total_bytes` | int64 | Total inline JS size |
| `external_css_urls` | list&lt;string&gt; | External stylesheet URLs |
| `inline_style_count` | int32 | Number of inline `&lt;style&gt;` blocks |

### Forms &amp; Inputs

| Column | Type | Description |
|---|---|---|
| `form_actions` | list&lt;string&gt; | `&lt;form action&gt;` URLs |
| `form_count` | int32 | Number of `&lt;form&gt;` elements |
| `input_fields` | list&lt;struct&gt; | Input fields with type, name, placeholder |
| `input_count` | int32 | Number of `&lt;input&gt;` elements |
| `has_password_field` | bool | Page contains a password input |
| `has_file_upload` | bool | Page contains a file upload input |

### Linked Resources

| Column | Type | Description |
|---|---|---|
| `favicon_urls` | list&lt;string&gt; | Favicon link hrefs |
| `favicon_hashes` | list&lt;struct&gt; | MD5/SHA256 of favicon content |
| `anchor_hrefs` | list&lt;string&gt; | All `&lt;a href&gt;` targets |
| `image_srcs` | list&lt;string&gt; | All `&lt;img src&gt;` URLs |
| `iframe_srcs` | list&lt;string&gt; | All `&lt;iframe src&gt;` URLs |
| `external_domains` | list&lt;string&gt; | Unique external domains referenced |

### Network &amp; HTTP

| Column | Type | Description |
|---|---|---|
| `final_url` | string | URL after redirects |
| `redirect_chain` | list&lt;string&gt; | Full redirect path |
| `server_header` | string | `Server` response header |
| `x_powered_by` | string | `X-Powered-By` response header |
| `content_security_policy` | string | `Content-Security-Policy` header |
| `http_status` | int32 | Final HTTP status code |
| `network_requests` | list&lt;struct&gt; | All HAR entries (see below) |
| `network_request_count` | int32 | Total network requests |
| `resource_hashes` | list&lt;struct&gt; | MD5/SHA256 of served resources (see below) |

### Availability Flags

| Column | Type | Description |
|---|---|---|
| `has_html` | bool | Row has HTML content |
| `has_har` | bool | Row has HAR data |
| `has_text` | bool | Row has extracted text |

### Nested Struct: `network_requests`

Each entry: `method`, `url`, `url_domain`, `status`, `mime_type`, `response_size`, `server_ip`, `is_redirect`, `redirect_url`, `response_headers` (list of key/value structs), `request_cookies`, `response_cookies` (with name, domain, httpOnly, secure).

### Nested Struct: `resource_hashes`

Each entry: `url`, `mime_type`, `resource_type` (script / style / image / font / document / other), `body_size`, `body_md5`, `body_sha256`, `is_favicon`.

### Nested Struct: `input_fields`

Each entry: `type`, `name`, `placeholder`, `id`, `class_name`, `required`, `autocomplete`.

### Nested Struct: `meta_tags`

Each entry: `name`, `content`.

## Statistics

| Metric | Value |
|---|---|
| Rows | 136,414 |
| Unique domains | 55,339 |
| Multi-snapshot domains | 23,588 |
| Rows with HTML | 135,407 (99.3%) |
| Rows with HAR data | 126,747 (92.9%) |
| Rows with extracted text | 134,114 (98.3%) |
| Rows with forms | 52,089 (38.2%) |
| Rows with password fields | 33,199 (24.3%) |
| Total resource hashes | 2,349,290 |
| Median network requests/row | 11 |
| Mean network requests/row | 20.2 |
| Median HTML size | 25 KB |
| Mean HTML size | 124 KB |
| Total inline JS | 6.1 GB (uncompressed) |
| Shards | 28 |
| Total size on disk | 2.0 GB (zstd level 19) |

### Language Distribution (top 10)

| Language | Rows |
|---|---|
| en | 50,925 |
| en-US | 9,037 |
| ru | 7,926 |
| fr | 3,618 |
| en-us | 1,759 |
| zh-CN | 1,483 |
| ja | 1,378 |
| (empty) | 1,290 |
| de | 1,226 |
| fr-FR | 777 |

## Collection Method

A headless Chromium browser (Playwright) running inside Docker, controlled by a Flask API, visited each URL. Per visit:

1. Navigate to the URL with HAR recording active
2. Save rendered HTML (`index.html`)
3. Capture all network requests and responses (`requests.har`, HAR 1.2 format)
4. Extract visible text via trafilatura (`trafilatura.txt`)
5. Compress into a `.7z` archive

During conversion to Parquet, HAR response bodies were hashed (MD5 + SHA256) rather than stored. This reduced the dataset from ~76 GB of raw archives to 2 GB of compressed Parquet. Full HTML and inline JavaScript are preserved. Archives containing blocked or error pages were excluded. Filtered titles: "403 Forbidden", "Not found", "Attention Required! | Cloudflare", "Suspected phishing site | Cloudflare". Empty and corrupted archives were also removed. 31,793 archives were filtered in total.

## Safety

This dataset contains snapshots of malicious websites. The HTML and scripts include:

- Credential harvesting forms that mimic legitimate services
- Obfuscated JavaScript for redirects, fingerprinting, or exploit delivery
- References to attacker-controlled infrastructure

Do not execute JavaScript or render HTML from this dataset in a browser without sandboxing. This data is for defensive security research.

## License

CC0-1.0</description>
<size>2097072302</size>
</item><item>
<title>PPT Online</title>
<category>Dataset</category>
<infohash>9a63af9f7305cbf9f060f1e4080ef5d703a3a4f5</infohash>
<guid>https://academictorrents.com/details/9a63af9f7305cbf9f060f1e4080ef5d703a3a4f5</guid>
<link>https://academictorrents.com/details/9a63af9f7305cbf9f060f1e4080ef5d703a3a4f5</link>
<description>### Dataset Summary

This dataset contains metadata about 1,418,349 PowerPoint (.ppt) files hosted on the ppt-online.org platform. PPT Online is a service designed to display PowerPoint presentations. The dataset includes information such as presentation titles, categories, file sizes, and content snippets. The majority of the presentations are in Russian, Ukrainian, Belarusian, Kazakh, and English, but other languages are also present.

### Languages

The dataset is multilingual, with the primary languages being Russian, Ukrainian, Belarusian, Kazakh, and English. However, presentations in other languages are also included.

## Dataset Structure

### Data Fields

This dataset includes the following fields:

- `id`: Unique identifier for the presentation (integer)
- `title`: Title of the PowerPoint presentation (string)
- `category`: Category or topic of the presentation (string)
- `file_size`: Size of the PowerPoint file (string)
- `body_content`: Snippet or summary of the presentation content, generated by a service and of fairly low quality (string)

### Data Splits

All examples are in a single split.</description>
<size>3053250450</size>
</item><item>
<title>March 2026 Public Data File from Crossref</title>
<category>Dataset</category>
<infohash>b5ee0e102689b3e67023dd024694c0f5f124646f</infohash>
<guid>https://academictorrents.com/details/b5ee0e102689b3e67023dd024694c0f5f124646f</guid>
<link>https://academictorrents.com/details/b5ee0e102689b3e67023dd024694c0f5f124646f</link>
<description>Note that this Crossref metadata is always openly available. The difference here is that we’ve done the time-saving work of putting all of the records registered through March 2026 into one file for download. To keep this metadata current, you can access new records via our public API at: https://api.crossref.org And, if you do use our API, we encourage you to read the section of the documentation on "etiquette". That is, how to use the API without making it impossible for others to use.</description>
<size>223053729528</size>
</item><item>
<title>[Sample Dataset] March 2026 Public Data File from Crossref</title>
<category>Dataset</category>
<infohash>f692802461d1277d14551526ecb292a9637af254</infohash>
<guid>https://academictorrents.com/details/f692802461d1277d14551526ecb292a9637af254</guid>
<link>https://academictorrents.com/details/f692802461d1277d14551526ecb292a9637af254</link>
<description>[Sample Dataset] March 2026 Public Data File from Crossref. This dataset includes 100 random JSON records from the Crossref metadata corpus.</description>
<size>30841803</size>
</item><item>
<title>Reddit comments/submissions 2026-02</title>
<category>Dataset</category>
<infohash>c5ba00048236b60f819dbf010e9034d24fc291fb</infohash>
<guid>https://academictorrents.com/details/c5ba00048236b60f819dbf010e9034d24fc291fb</guid>
<link>https://academictorrents.com/details/c5ba00048236b60f819dbf010e9034d24fc291fb</link>
<description>Reddit comments and submissions from 2026-02. Documentation, JSON schemas, and more can be found at https://github.com/ArthurHeitmann/arctic_shift. Helper scripts for processing files can be found at https://github.com/Watchful1/PushshiftDumps.</description>
<size>60377154786</size>
</item><item>
<title>IUGC Ultrasound Dataset (MICCAI 2025)</title>
<category>Dataset</category>
<infohash>71dee5920278325bb73eb735cade3b0f3550e9f9</infohash>
<guid>https://academictorrents.com/details/71dee5920278325bb73eb735cade3b0f3550e9f9</guid>
<link>https://academictorrents.com/details/71dee5920278325bb73eb735cade3b0f3550e9f9</link>
<description>In 2018, the World Health Organization (WHO) published 56 recommendations to improve the quality of intrapartum care and enhance women’s childbirth experiences. In response, the WHO developed the Labour Care Guide (LCG) in 2020, a next-generation tool designed to promote evidence-based, respectful, and woman-centered care during labor and delivery. The LCG was created through expert consultations, primary research with maternity healthcare providers, and usability studies across multiple countries. It serves as a practical tool for monitoring labor progress and maternal and fetal well-being by recording key clinical parameters. When deviations from normal labor progression are detected, the LCG highlights these issues, prompting timely interventions to ensure safe and effective care. Intrapartum ultrasound for labor progression analysis is a crucial examination in labor management. The core operation in this analysis is the identification of landmarks from intrapartum ultrasound images. These landmarks serve as the basis for subsequent quantitative evaluations of angles and distances, which offer valuable diagnostic information regarding labor arrest and influence decisions about the timing and type of intervention. However, obtaining reliable landmark annotations typically demands experienced physicians, and even for proficient obstetricians, manual landmark identification is a time-consuming and labor-intensive endeavor. Consequently, the development of fully automatic and precise landmark localization techniques has been an area of significant and persistent need. 
The Intrapartum Ultrasound Grand Challenge (IUGC) 2025 is a collaborative initiative involving the "Deep Learning in Intrapartum Ultrasound Image Analysis" cooperative group and prominent clinical societies such as the International Society of Ultrasound in Obstetrics &amp; Gynecology (ISUOG), the World Association of Perinatal Medicine (WAPM), the Perinatal Medicine Foundation (PMF), and the National Institute for Health and Care Excellence (NICE). The objective of this partnership is to formulate and promote clinically relevant challenges, thereby maximizing the potential clinical impact of innovative algorithmic contributions from participating teams. Since its inception at MICCAI 2023, the IUGC has advanced Pubic Symphysis-Fetal Head Segmentation (PSFHS) by facilitating and benchmarking algorithmic progress and providing high-quality annotated image datasets. In MICCAI 2024, the IUGC expanded to incorporate multiple benchmarks: (1) The analysis objects were extended from images to videos; (2) The tasks were augmented from image segmentation to classification, segmentation, and biometric parameter measurement; (3) The quantitative parameters were increased from one (i.e., Angle of Progression (AOP)) to two (i.e., AOP and head-symphysis distance (HSD)); and (4) The data sources were broadened from being solely from Asia to include Asia, Europe, and Africa. This novel and inventive design has established a benchmarking ecosystem for the systematic comparison of algorithms across diverse tasks and clinical challenges. 
The significance of the IUGC 2025 lies in its concentration on addressing the actual clinical assessment of labor progress, covering (1) end-to-end measurements (which are currently indirect measurements based on segmentation results); (2) all fetal descent stations during the childbirth process (comprising five “minus”, one “zero”, and three “plus” stations for reliable longitudinal assessment of labor progress); (3) computational tasks (such as regression and detection); and (4) learning methods (semi-supervised, weakly-supervised, and barely-supervised learning). In line with the IUGC's goal of addressing clinical requirements, authoritative and leading clinical organizations have allied with the IUGC. We have extended the IUGC 2024 Challenge from an indirect ultrasound measurement based on segmentation results to an end-to-end measurement based on landmarks. Specifically, we provide 300 labeled cases and 31,421 unlabeled cases in the training set, 100 visible cases for validation, and 501 hidden cases for testing. The targets are the coordinates of three landmarks and the corresponding biometric parameter. In addition to the typical Mean Radial Error (MRE) and the absolute difference between predicted and manually measured parameters, our evaluation metrics also emphasize inference speed. In summary, the IUGC 2025 challenge exhibits four primary characteristics: (1) Task: Employing semi-supervised landmark detection. (2) Dataset: Curating a large-scale and diverse fetal ultrasound dataset that accounts for all fetal descent stations during the childbirth process. It comprises 28,919 ultrasound images from over 20 medical groups. (3) Evaluation measures: Focusing on detection accuracy. (4) Annotation: Multiple raters independently annotate a subset of test cases to compare algorithmic performance against human expert inter-rater variability.</description>
<size>1204118461</size>
</item><item>
<title>Wikipedia European languages 2026-03-01</title>
<category>Dataset</category>
<infohash>357aed6775e72b4bac4688497590a262d87d2e2a</infohash>
<guid>https://academictorrents.com/details/357aed6775e72b4bac4688497590a262d87d2e2a</guid>
<link>https://academictorrents.com/details/357aed6775e72b4bac4688497590a262d87d2e2a</link>
<description>Wikipedia database dumps of European language wikis of 10k articles or more. enwiki excluded. Wikipedia Multistream 2026-03-01. These 68 languages are included: Albanian, Alemannic, Aragonese, Asturian, Basque, Bavarian, Belarusian, Venetian, Bosnian, Breton, Bulgarian, Catalan, Croatian, Czech, Danish, Dutch, Emilian-Romagnol, Esperanto, Estonian, Faroese, Finnish, French, Galician, German, Greek, Hungarian, Icelandic, Irish, Italian, Ladin, Latin, Latvian, Ligurian, Limburgish, Lithuanian, Lombard, Low German, Luxembourgish, Macedonian, Maltese, Neapolitan, North Frisian, Norwegian, Nynorsk, Occitan, Piedmontese, Polish, Portuguese, Romanian, Romansh, Rusyn, Samogitian, Scots, Scottish Gaelic, Serbian, Serbo-Croatian, Sicilian, Silesian, Slovak, Slovenian, Spanish, Swedish, Ukrainian, Upper Sorbian, Walloon, Welsh, West Frisian, Yiddish.</description>
<size>52792595523</size>
</item><item>
<title>Public MediaWiki Collection</title>
<category>Dataset</category>
<infohash>3ec30d9d8817f62d338ae76783d24ba207b6e9de</infohash>
<guid>https://academictorrents.com/details/3ec30d9d8817f62d338ae76783d24ba207b6e9de</guid>
<link>https://academictorrents.com/details/3ec30d9d8817f62d338ae76783d24ba207b6e9de</link>
<description># Dataset Card for Public MediaWiki Collection ### Dataset Summary This dataset contains 1,662,448 articles harvested from 930 random public MediaWiki instances found across the Internet. The collection was created by extracting current page content from these wikis, preserving article text, metadata, and structural information. The dataset represents a diverse cross-section of public wiki content spanning multiple domains, topics, and languages. ### Languages The dataset is multilingual, covering 35+ languages found across the collected wiki instances. ## Dataset Structure ### Data Fields This dataset includes the following fields: -  id : Unique identifier for the article (string) -  title : Title of the article (string) -  text : Main content of the article (string) -  metadata : Dictionary containing: -  templates : List of templates used in the article -  categories : List of categories the article belongs to -  wikilinks : List of internal wiki links and their text -  external_links : List of external links -  sections : List of section titles and their levels ### Data Splits All examples are in a single split. ## Additional Information ### License This dataset is released under the Creative Commons Attribution-ShareAlike 4.0 International (CC-BY-SA 4.0) license, consistent with the licensing of the source MediaWiki instances. To learn more about CC-BY-SA 4.0, visit: https://creativecommons.org/licenses/by-sa/4.0/</description>
<size>1167900670</size>
</item><item>
<title>9111.ru Questions Dataset</title>
<category>Dataset</category>
<infohash>3fa77d9c4028fd6aa8a6dbdad67a218fc1ad7a5d</infohash>
<guid>https://academictorrents.com/details/3fa77d9c4028fd6aa8a6dbdad67a218fc1ad7a5d</guid>
<link>https://academictorrents.com/details/3fa77d9c4028fd6aa8a6dbdad67a218fc1ad7a5d</link>
<description># Dataset Card for 9111.ru Questions ### Dataset Summary This dataset includes legal questions and answers from the Russian law forum [9111.ru](https://9111.ru). It contains inquiries from users and corresponding responses from lawyers. The dataset was created by processing around 21 million questions, providing a significant corpus of legal discussions. ### Languages The dataset is mostly in Russian, but there may be other languages present. ## Dataset Structure ### Data Fields This dataset includes the following fields: -  id : Identifier for the item (integer) -  title : Title of the question (string) -  description : Description of the question (string) -  answers : An array of answer objects (array) -  user_name : Name of the user who answered (string) -  status : Status of the user (string) -  rating : Rating of the user (integer) -  text : Text of the answer (string) ### Data Format The dataset is stored as Apache Parquet files with zstd compression (level 19), split into 3 shards: -  questions-00000-of-00003.parquet  -  questions-00001-of-00003.parquet  -  questions-00002-of-00003.parquet  ### Data Splits All examples are in the train split, there is no validation split.</description>
<size>2938274461</size>
</item><item>
<title>Fandom.com Community Database Dumps Dataset</title>
<category>Dataset</category>
<infohash>0a0ad3dd44e05af1725fd8d17f5aeba856078d5f</infohash>
<guid>https://academictorrents.com/details/0a0ad3dd44e05af1725fd8d17f5aeba856078d5f</guid>
<link>https://academictorrents.com/details/0a0ad3dd44e05af1725fd8d17f5aeba856078d5f</link>
<description># Dataset Card for Fandom.com Community Database Dumps ### Dataset Summary This dataset contains 7,040,984 current pages from all available [Fandom.com community wiki dumps](https://community.fandom.com/wiki/Help:Database_download) as of February 18, 2025. The dataset was created by processing the "Current pages" database dumps from all available Fandom.com wikis. These dumps contain only the current versions of pages without edit history and include article text, metadata, and structural information across multiple languages. ### Languages The dataset is multilingual, covering [40+ languages](https://community.fandom.com/wiki/Help:Language_codes). ## Dataset Structure ### Data Fields This dataset includes the following fields: -  id : Unique identifier for the article (string) -  title : Title of the article (string) -  text : Main content of the article (string) -  metadata : Dictionary containing: -  templates : List of templates used in the article -  categories : List of categories the article belongs to -  wikilinks : List of internal wiki links and their text -  external_links : List of external links -  sections : List of section titles and their levels ### Data Splits All examples are in a single split. ## Additional Information ### License This dataset inherits the licenses from the source Fandom communities, which use Creative Commons Attribution-ShareAlike 3.0 (CC-BY-SA 3.0). To learn more about CC-BY-SA 3.0, visit: https://creativecommons.org/licenses/by-sa/3.0/</description>
<size>6224651822</size>
</item><item>
<title>ca-on_province_of_ontario-2024A000235_drape_eastern_ontario_orthoimagery_2024_16cm_v0.1.0-beta.pmtiles</title>
<category>Dataset</category>
<infohash>adb5741cdcb9352848cc80c976629b44720a04c2</infohash>
<guid>https://academictorrents.com/details/adb5741cdcb9352848cc80c976629b44720a04c2</guid>
<link>https://academictorrents.com/details/adb5741cdcb9352848cc80c976629b44720a04c2</link>
<description>High‑resolution aerial imagery from Ontario’s DRAPE 2024, packaged for fast web maps and offline use. Smooth panning, crisp detail, open data. Want to preview the file? Go to https://source.coop/dataforcanada/d4c-datapkg-orthoimagery/processed/ca-on_province_of_ontario-2024A000235_drape_eastern_ontario_orthoimagery_2024_16cm_v0.1.0-beta.pmtiles We are aware that there is a nodata issue with the product and will fix it in the next release.</description>
<size>215750229448</size>
</item><item>
<title>Street-Level Imagery Dataset</title>
<category>Dataset</category>
<infohash>207ba45161f6ba12114cb6d97ad25d222d5125c9</infohash>
<guid>https://academictorrents.com/details/207ba45161f6ba12114cb6d97ad25d222d5125c9</guid>
<link>https://academictorrents.com/details/207ba45161f6ba12114cb6d97ad25d222d5125c9</link>
<description># Street-Level Imagery Dataset

Metadata for street-level imagery across Eastern Europe and Northern Asia. Each record includes image URLs, coordinates, camera orientation, timestamps, and links to similar images captured at the same location over time.

## Summary

| Statistic | Value |
|---|---|
| Total Records | 934,191 |
| Unique Images | 905,940 |
| Time Span | 2016–2025 |
| File Format | Parquet |

## Geographic Coverage

| Boundary | Value |
|---|---|
| Minimum Longitude | 20.49° E |
| Maximum Longitude | 152.32° E |
| Minimum Latitude | 38.55° N |
| Maximum Latitude | 69.05° N |

Coverage spans urban centers and rural routes. Density is higher in populated areas.

## Camera Specifications

**Directions**

| Direction | Count |
|---|---|
| Front | 740,079 |
| Right | 194,112 |

**Resolutions**

| Preview Size | Full Size | Count |
|---|---|---|
| 284×160 | 1920×1080 | 932,171 |
| 90×160 | 1080×1920 | 1,886 |
| 284×160 | 1536×864 | 77 |
| 213×160 | 2016×1512 | 41 |

## Data Structure

### Fields

| Field | Type | Description |
|---|---|---|
| id | string | Unique image identifier |
| sourceId | string | Source device or session identifier |
| heading | float64 | Camera heading (0–360°) |
| cameraDirection | string | Mount position (front or right) |
| timestamp | string | ISO 8601 capture time |
| imagePreview | struct | Thumbnail URL and dimensions |
| imageFull | struct | Full resolution URL and dimensions |
| pos | array | [longitude, latitude] |
| geometry | struct | GeoJSON Point geometry |
| similar | array | Related images at location |
| targetGeometry | struct | Optional target reference |

### Image URL Schema

Two resolution variants per entry:

```json
"imagePreview": {"url": "https://...", "width": 284, "height": 160},
"imageFull": {"url": "https://...", "width": 1920, "height": 1080}
```

### Temporal Links

Records reference similar images from other timestamps at the same coordinates. Average of 14.3 links per location.

## Limitations

Image URLs may [rot](https://en.wikipedia.org/wiki/Link_rot). Coverage concentrates in urban areas, and historical density varies by location.

## License

Research use permitted. Comply with source terms of service and local data regulations.</description>
<size>651707506</size>
</item><item>
<title>Subreddit comments/submissions 2005-06 to 2025-12</title>
<category>Dataset</category>
<infohash>3e3f64dee22dc304cdd2546254ca1f8e8ae542b4</infohash>
<guid>https://academictorrents.com/details/3e3f64dee22dc304cdd2546254ca1f8e8ae542b4</guid>
<link>https://academictorrents.com/details/3e3f64dee22dc304cdd2546254ca1f8e8ae542b4</link>
<description>This is the top 40,000 subreddits from reddit's history in separate files. You can use your torrent client to only download the subreddits you're interested in. These are from the pushshift dumps from 2005-06 to 2025-12, which can be found here: https://academictorrents.com/details/3d426c47c767d40f82c7ef0f47c3acacedd2bf44 These are zstandard-compressed ndjson files. Example python scripts for parsing the data can be found here: https://github.com/Watchful1/PushshiftDumps If you have questions, please DM u/Watchful1 on reddit or respond to this post: https://www.reddit.com/r/pushshift/comments/1r5z42j/comment/o5mjcvn/</description>
<size>3965777405390</size>
</item><item>
<title>Lung Ultrasound Dataset (LUS-Dataset-Katumba)</title>
<category>Dataset</category>
<infohash>e6e9a5594174aaffee53b8f086e3bf86c02c45ad</infohash>
<guid>https://academictorrents.com/details/e6e9a5594174aaffee53b8f086e3bf86c02c45ad</guid>
<link>https://academictorrents.com/details/e6e9a5594174aaffee53b8f086e3bf86c02c45ad</link>
<description>This dataset contains a curated benchmark collection of 1,062 labelled lung ultrasound (LUS) images collected from patients at Mulago National Referral Hospital and Kiruddu Referral Hospital in Kampala, Uganda. The images were acquired and annotated by senior radiologists to support the development and evaluation of artificial intelligence (AI) models for pulmonary disease diagnosis. Each image is categorized into one of three classes: Probably COVID-19 (COVID-19), Diseased Lung but Probably Not COVID-19 (Other Lung Disease), and Healthy Lung. The dataset addresses key challenges in LUS interpretation, including inter-operator variability, low signal-to-noise ratios, and reliance on expert sonographers. It is particularly suitable for training and testing convolutional neural network (CNN)-based models for medical image classification tasks in low-resource settings. The images are provided in standard formats such as PNG or JPEG, with corresponding labels stored in structured files like CSV or JSON to facilitate ease of use in machine learning workflows. In this second version of the dataset, we have extended the resource by including a folder containing the original unprocessed raw data, as well as the scripts used to process, clean, and sort the data into the final labelled set. These additions promote transparency and reproducibility, allowing researchers to understand the full data pipeline and adapt it for their own applications. This resource is intended to advance research in deep learning for lung ultrasound analysis and to contribute toward building more accessible and reliable diagnostic tools in global health.     
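Since the labels ship in structured files such as CSV alongside the images, they can be paired with filenames in a few lines. The sketch below assumes a hypothetical labels.csv with filename,label columns; the actual file names and layout inside the torrent may differ:

```python
import csv
import io

# Hypothetical labels.csv content illustrating the three-class scheme
# described above; replace with the real label file from the dataset.
SAMPLE_CSV = """filename,label
img_0001.png,COVID-19
img_0002.png,Other Lung Disease
img_0003.png,Healthy Lung
"""

def load_labels(csv_file):
    """Build a filename -> class-label mapping from a filename,label CSV."""
    reader = csv.DictReader(csv_file)
    return {row["filename"]: row["label"] for row in reader}

labels = load_labels(io.StringIO(SAMPLE_CSV))
print(labels["img_0001.png"])  # COVID-19
```

The resulting dictionary can feed a standard image-classification pipeline (e.g., as the label lookup for a PyTorch Dataset).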
Katumba, Andrew; Murindanyi, Sudi; Okila, Nixson; Nakatumba-Nabende, Joyce; Mwikirize, Cosmas; Serugunda, Jonathan; Bugeza, Samuel; Oriekot, Anthony; Bosa, Juliet; Nabawanuka, Eva (2025), “A Dataset of Lung Ultrasound Images for Automated AI-based Lung Disease Classification”, Mendeley Data, V2, doi: 10.17632/hb3p34ytvx.2</description>
<size>281447804</size>
</item><item>
<title>Wikipedia Asian languages 2026-02-01</title>
<category>Dataset</category>
<infohash>1bde58b51e4aad60f03ce1b688b691552fb3041e</infohash>
<guid>https://academictorrents.com/details/1bde58b51e4aad60f03ce1b688b691552fb3041e</guid>
<link>https://academictorrents.com/details/1bde58b51e4aad60f03ce1b688b691552fb3041e</link>
<description>Wikipedia database dumps of Asian language wikis of 10k articles or more. Wikipedia Multistream 2026-02-01. These 85 languages are included: Acehnese, Armenian, Assamese, Azerbaijani, Balinese, Bangla, Banjar, Banyumasan, Bashkir, Bishnupriya, Buginese, Burmese, Cantonese, Cebuano, Central Bikol, Central Kurdish, Chechen, Chinese, Chuvash, Classical Chinese, Dimli, Eastern Mari, Georgian, Gilaki, Gorontalo, Gujarati, Hakka, Hebrew, Hindi, Iloko, Indonesian, Japanese, Javanese, Kannada, Kara-Kalpak, Kazakh, Khmer, Korean, Kurdish, Kyrgyz, Maithili, Malay, Malayalam, Manipuri, Marathi, Mazanderani, Minangkabau, Mindong, Mingrelian, Minnan, Mongolian, Nepali, Newari, Odia, Ossetic, Pampangan, Pashto, Persian, Punjabi, Russian, Sanskrit, Santali, Saraiki, Shan, Sindhi, Sinhala, South Azerbaijani, Sundanese, Tagalog, Tajik, Talysh, Tamil, Tatar, Telugu, Thai, Turkish, Urdu, Uzbek, Vietnamese, Waray, Western Armenian, Western Mari, Western Punjabi, Wu, Yakut.</description>
<size>31208780863</size>
</item><item>
<title>Reddit comments/submissions 2026-01</title>
<category>Dataset</category>
<infohash>8412b89151101d88c915334c45d9c223169a1a60</infohash>
<guid>https://academictorrents.com/details/8412b89151101d88c915334c45d9c223169a1a60</guid>
<link>https://academictorrents.com/details/8412b89151101d88c915334c45d9c223169a1a60</link>
<description>Reddit comments and submissions from 2026-01. Documentation, json schemas and more can be found at https://github.com/ArthurHeitmann/arctic_shift Helper scripts for processing files can be found at https://github.com/Watchful1/PushshiftDumps</description>
<size>61629104259</size>
</item><item>
<title>Begemot.ai Dataset</title>
<category>Dataset</category>
<infohash>3ada9903be4621cf7e34cd5cf44f191b4124ccfe</infohash>
<guid>https://academictorrents.com/details/3ada9903be4621cf7e34cd5cf44f191b4124ccfe</guid>
<link>https://academictorrents.com/details/3ada9903be4621cf7e34cd5cf44f191b4124ccfe</link>
<description># Dataset Card for Begemot.ai ### Dataset Summary This dataset has 2,728,999 educational project descriptions in Russian. They were generated using AI on the Begemot.ai website. The content includes project titles, descriptions, chapters and chapter content on various educational topics. ### Languages The dataset is primarily in Russian (ru). ## Dataset Structure ### Data Fields This dataset includes the following fields: -  id : Unique identifier for the project (integer) -  url : URL of the project page (string) -  title : Title of the educational project (string) -  type : Type of project (string) -  description : Detailed description of the project (string) -  chapters : List of chapter titles (list of strings) -  chapter_content : JSON string mapping chapter titles to their content ### Data Splits All examples are in a single split.</description>
<size>1708074039</size>
</item><item>
<title>OpenPOCUS - Lung Ultrasound Image Database</title>
<category>Dataset</category>
<infohash>63ad0470f43e022cc73407be9c760449d947cb97</infohash>
<guid>https://academictorrents.com/details/63ad0470f43e022cc73407be9c760449d947cb97</guid>
<link>https://academictorrents.com/details/63ad0470f43e022cc73407be9c760449d947cb97</link>
<description>https://i.imgur.com/s0eFv64.png

Background

Lung ultrasound (LUS) offers advantages over traditional imaging for diagnosing pulmonary conditions, with superior accuracy compared to chest X-ray and similar performance to CT at lower cost. Despite these benefits, widespread adoption is limited by operator dependency, moderate interrater reliability, and training requirements. Deep learning (DL) could potentially address these challenges, but development of effective algorithms is hindered by the scarcity of comprehensive image repositories with proper metadata.</description>
<size>5256546243</size>
</item><item>
<title>Russian Educational Text Collection</title>
<category>Dataset</category>
<infohash>1f6b373346a0fa34de6b4d916984d698e0a623b3</infohash>
<guid>https://academictorrents.com/details/1f6b373346a0fa34de6b4d916984d698e0a623b3</guid>
<link>https://academictorrents.com/details/1f6b373346a0fa34de6b4d916984d698e0a623b3</link>
<description># Dataset Card for Russian Educational Text Collection ### Dataset Summary This dataset contains approximately 1.38M educational texts primarily in Russian with some content in Ukrainian and English. The content is extracted from presentations and documents, including educational presentations, essays, and various academic documents covering diverse topics from natural sciences to literature. ### Languages - Russian (ru) - primary language - Ukrainian (uk) - secondary language - English (en) - secondary language Russian is the predominant language in the dataset, while Ukrainian and English content appears less frequently. ## Dataset Structure ### Data Fields The dataset is split into two parquet files: - presentations (1,335,171 entries): -  title : Title of the presentation (string) -  slide_text : Array of slide contents (list of strings) - documents (47,474 entries): -  title : Title of the document (string) -  document_text : Full text content of the document (string) ## Additional Information ### License This dataset is dedicated to the public domain under the Creative Commons Zero (CC0) license. This means you can: * Use it for any purpose, including commercial projects * Modify it however you like * Distribute it without asking permission No attribution is required, but it's always appreciated!</description>
<size>304218686</size>
</item><item>
<title>Animations Dataset</title>
<category>Dataset</category>
<infohash>8799f1e66bf0a63a77e89a7917fbe281c13bcd9f</infohash>
<guid>https://academictorrents.com/details/8799f1e66bf0a63a77e89a7917fbe281c13bcd9f</guid>
<link>https://academictorrents.com/details/8799f1e66bf0a63a77e89a7917fbe281c13bcd9f</link>
<description># Dataset Card for Animations Dataset

### Dataset Summary

This dataset contains 50,849 animations with their associated metadata and source images. Each animation consists of multiple frames composed of simple sketch-level drawings, text elements, and potentially embedded images. The dataset provides complete information about each animation, including frame components, source images, timing between frames, and canvas settings. This makes it suitable for various tasks such as animation analysis, generation, and modification.

### Languages

The dataset is primarily monolingual:
- English (en): Any text elements within animations are predominantly in English.

## Dataset Structure

### Data Files

The dataset is stored as Parquet files with ZSTD compression:
- train-00000.parquet through train-00003.parquet
- Total: 4 shards, ~4.2 GB compressed

### Data Fields

Each row in the Parquet files contains the following columns:

| Column | Type | Description |
|---|---|---|
| id | string | Unique identifier (UUID) for the animation |
| settings | string | JSON object containing canvas configuration |
| dtimes | list[int64] | Time delays between frames in milliseconds |
| frames_data | string | JSON array describing each frame's elements |
| images | list[binary] | PNG images used in the animation (decoded bytes) |

#### Settings Object

The settings JSON contains:
- canvas_width, canvas_height: Dimensions of the animation canvas
- fillcolor: Background color of the canvas (if specified)
- default_font: Default font used for text elements
- default_font_size: Default font size

#### Frames Data Structure

The frames_data JSON is an array of arrays, where each inner array represents a frame's elements:
- type_for_loader: Element type (e.g., "text", "image")
- data: Object containing element properties:
  - type: Element type
  - centerx, centery: Position coordinates on the canvas
  - text: Text content (for text elements)
  - font, size: Font properties
  - rotate_angle, angle: Rotation properties
  - strokeColor, fillColor, textColor: Color properties
  - src: Index into the images array (for image elements)
- children_data: Array of child elements (if any)

### Data Splits

| Split | Number of Examples |
|---|---|
| train | 50,849 |</description>
<size>4374676620</size>
</item></channel>
</rss>
