<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:academictorrents="https://academictorrents.com" version="2.0">
<channel>
<title>Academic Torrents</title>
<description>Recent Torrents</description>
<link>https://academictorrents.com/</link>
<item>
<title>Public MediaWiki Collection</title>
<category>Dataset</category>
<infohash>3ec30d9d8817f62d338ae76783d24ba207b6e9de</infohash>
<guid>https://academictorrents.com/details/3ec30d9d8817f62d338ae76783d24ba207b6e9de</guid>
<link>https://academictorrents.com/details/3ec30d9d8817f62d338ae76783d24ba207b6e9de</link>
<description># Dataset Card for Public MediaWiki Collection ### Dataset Summary This dataset contains 1,662,448 articles harvested from 930 random public MediaWiki instances found across the Internet. The collection was created by extracting current page content from these wikis, preserving article text, metadata, and structural information. The dataset represents a diverse cross-section of public wiki content spanning multiple domains, topics, and languages. ### Languages The dataset is multilingual, covering 35+ languages found across the collected wiki instances. ## Dataset Structure ### Data Fields This dataset includes the following fields: -  id : Unique identifier for the article (string) -  title : Title of the article (string) -  text : Main content of the article (string) -  metadata : Dictionary containing: -  templates : List of templates used in the article -  categories : List of categories the article belongs to -  wikilinks : List of internal wiki links and their text -  external_links : List of external links -  sections : List of section titles and their levels ### Data Splits All examples are in a single split. ## Additional Information ### License This dataset is released under the Creative Commons Attribution-ShareAlike 4.0 International (CC-BY-SA 4.0) license, consistent with the licensing of the source MediaWiki instances. To learn more about CC-BY-SA 4.0, visit: https://creativecommons.org/licenses/by-sa/4.0/</description>
<size>1167900670</size>
</item><item>
<title>9111.ru Questions Dataset</title>
<category>Dataset</category>
<infohash>3fa77d9c4028fd6aa8a6dbdad67a218fc1ad7a5d</infohash>
<guid>https://academictorrents.com/details/3fa77d9c4028fd6aa8a6dbdad67a218fc1ad7a5d</guid>
<link>https://academictorrents.com/details/3fa77d9c4028fd6aa8a6dbdad67a218fc1ad7a5d</link>
<description># Dataset Card for 9111.ru Questions ### Dataset Summary This dataset includes legal questions and answers from the Russian law forum [9111.ru](https://9111.ru). It contains inquiries from users and corresponding responses from lawyers. The dataset was created by processing around 21 million questions, providing a significant corpus of legal discussions. ### Languages The dataset is mostly in Russian, but other languages may be present. ## Dataset Structure ### Data Fields This dataset includes the following fields: -  id : Identifier for the item (integer) -  title : Title of the question (string) -  description : Description of the question (string) -  answers : An array of answer objects (array), each containing: -  user_name : Name of the user who answered (string) -  status : Status of the user (string) -  rating : Rating of the user (integer) -  text : Text of the answer (string) ### Data Format The dataset is stored as Apache Parquet files with zstd compression (level 19), split into 3 shards: -  questions-00000-of-00003.parquet  -  questions-00001-of-00003.parquet  -  questions-00002-of-00003.parquet  ### Data Splits All examples are in the train split; there is no validation split.</description>
<size>2938274461</size>
</item><item>
<title>Fandom.com Community Database Dumps Dataset</title>
<category>Dataset</category>
<infohash>0a0ad3dd44e05af1725fd8d17f5aeba856078d5f</infohash>
<guid>https://academictorrents.com/details/0a0ad3dd44e05af1725fd8d17f5aeba856078d5f</guid>
<link>https://academictorrents.com/details/0a0ad3dd44e05af1725fd8d17f5aeba856078d5f</link>
<description># Dataset Card for Fandom.com Community Database Dumps ### Dataset Summary This dataset contains 7,040,984 current pages from all available [Fandom.com community wiki dumps](https://community.fandom.com/wiki/Help:Database_download) as of February 18, 2025. The dataset was created by processing the "Current pages" database dumps from all available Fandom.com wikis. These dumps contain only the current versions of pages without edit history and includes article text, metadata, and structural information across multiple languages. ### Languages The dataset is multilingual, covering [40+ languages](https://community.fandom.com/wiki/Help:Language_codes). ## Dataset Structure ### Data Fields This dataset includes the following fields: -  id : Unique identifier for the article (string) -  title : Title of the article (string) -  text : Main content of the article (string) -  metadata : Dictionary containing: -  templates : List of templates used in the article -  categories : List of categories the article belongs to -  wikilinks : List of internal wiki links and their text -  external_links : List of external links -  sections : List of section titles and their levels ### Data Splits All examples are in a single split. ## Additional Information ### License This dataset inherits the licenses from the source Fandom communities, which use Creative Commons Attribution-ShareAlike 3.0 (CC-BY-SA 3.0). To learn more about CC-BY-SA 3.0, visit: https://creativecommons.org/licenses/by-sa/3.0/</description>
<size>6224651822</size>
</item><item>
<title>ca-on_province_of_ontario-2024A000235_drape_eastern_ontario_orthoimagery_2024_16cm_v0.1.0-beta.pmtiles</title>
<category>Dataset</category>
<infohash>adb5741cdcb9352848cc80c976629b44720a04c2</infohash>
<guid>https://academictorrents.com/details/adb5741cdcb9352848cc80c976629b44720a04c2</guid>
<link>https://academictorrents.com/details/adb5741cdcb9352848cc80c976629b44720a04c2</link>
<description>High‑resolution aerial imagery from Ontario’s DRAPE 2024, packaged for fast web maps and offline use. Smooth panning, crisp detail, open data.</description>
<size>215750229448</size>
</item><item>
<title>Street-Level Imagery Dataset</title>
<category>Dataset</category>
<infohash>207ba45161f6ba12114cb6d97ad25d222d5125c9</infohash>
<guid>https://academictorrents.com/details/207ba45161f6ba12114cb6d97ad25d222d5125c9</guid>
<link>https://academictorrents.com/details/207ba45161f6ba12114cb6d97ad25d222d5125c9</link>
<description># Street-Level Imagery Dataset Metadata for street-level imagery across Eastern Europe and Northern Asia. Each record includes image URLs, coordinates, camera orientation, timestamps, and links to similar images captured at the same location over time. ## Summary | Statistic | Value | |---|---| | Total Records | 934,191 | | Unique Images | 905,940 | | Time Span | 2016–2025 | | File Format | Parquet | ## Geographic Coverage | Boundary | Value | |---|---| | Minimum Longitude | 20.49° E | | Maximum Longitude | 152.32° E | | Minimum Latitude | 38.55° N | | Maximum Latitude | 69.05° N | Coverage spans urban centers and rural routes. Density is higher in populated areas. ## Camera Specifications **Directions** | Direction | Count | |---|---| | Front | 740,079 | | Right | 194,112 | **Resolutions** | Preview Size | Full Size | Count | |---|---|---| | 284×160 | 1920×1080 | 932,171 | | 90×160 | 1080×1920 | 1,886 | | 284×160 | 1536×864 | 77 | | 213×160 | 2016×1512 | 41 | ## Data Structure ### Fields | Field | Type | Description | |---|---|---| |  id  | string | Unique image identifier | |  sourceId  | string | Source device or session identifier | |  heading  | float64 | Camera heading (0–360°) | |  cameraDirection  | string | Mount position ( front  or  right ) | |  timestamp  | string | ISO 8601 capture time | |  imagePreview  | struct | Thumbnail URL and dimensions | |  imageFull  | struct | Full resolution URL and dimensions | |  pos  | array | [longitude, latitude] | |  geometry  | struct | GeoJSON Point geometry | |  similar  | array | Related images at location | |  targetGeometry  | struct | Optional target reference | ### Image URL Schema Two resolution variants per entry: "imagePreview": {"url": "https://...", "width": 284, "height": 160}, "imageFull": {"url": "https://...", "width": 1920, "height": 1080} ### Temporal Links Records reference similar images from other timestamps at the same coordinates. Average of 14.3 links per location. ## Limitations Image URLs may [rot](https://en.wikipedia.org/wiki/Link_rot). Coverage concentrates in urban areas, and historical density varies by location. ## License Research use permitted. Comply with source terms of service and local data regulations.</description>
<size>651707506</size>
</item><item>
<title>Subreddit comments/submissions 2005-06 to 2025-12</title>
<category>Dataset</category>
<infohash>3e3f64dee22dc304cdd2546254ca1f8e8ae542b4</infohash>
<guid>https://academictorrents.com/details/3e3f64dee22dc304cdd2546254ca1f8e8ae542b4</guid>
<link>https://academictorrents.com/details/3e3f64dee22dc304cdd2546254ca1f8e8ae542b4</link>
<description>This is the top 40,000 subreddits from reddit's history in separate files. You can use your torrent client to download only the subreddits you're interested in. These are from the pushshift dumps from 2005-06 to 2025-12, which can be found here: https://academictorrents.com/details/3d426c47c767d40f82c7ef0f47c3acacedd2bf44 These are zstandard compressed ndjson files. Example python scripts for parsing the data can be found here: https://github.com/Watchful1/PushshiftDumps If you have questions, please DM u/Watchful on reddit or respond to this post: https://www.reddit.com/r/pushshift/comments/1r5z42j/comment/o5mjcvn/</description>
<size>3965777405390</size>
</item><item>
<title>Lung Ultrasound Dataset (LUS-Dataset-Katumba)</title>
<category>Dataset</category>
<infohash>e6e9a5594174aaffee53b8f086e3bf86c02c45ad</infohash>
<guid>https://academictorrents.com/details/e6e9a5594174aaffee53b8f086e3bf86c02c45ad</guid>
<link>https://academictorrents.com/details/e6e9a5594174aaffee53b8f086e3bf86c02c45ad</link>
<description>This dataset contains a curated benchmark collection of 1,062 labelled lung ultrasound (LUS) images collected from patients at Mulago National Referral Hospital and Kiruddu Referral Hospital in Kampala, Uganda. The images were acquired and annotated by senior radiologists to support the development and evaluation of artificial intelligence (AI) models for pulmonary disease diagnosis. Each image is categorized into one of three classes: Probably COVID-19 (COVID-19), Diseased Lung but Probably Not COVID-19 (Other Lung Disease), and Healthy Lung. The dataset addresses key challenges in LUS interpretation, including inter-operator variability, low signal-to-noise ratios, and reliance on expert sonographers. It is particularly suitable for training and testing convolutional neural network (CNN)-based models for medical image classification tasks in low-resource settings. The images are provided in standard formats such as PNG or JPEG, with corresponding labels stored in structured files like CSV or JSON to facilitate ease of use in machine learning workflows. In this second version of the dataset, we have extended the resource by including a folder containing the original unprocessed raw data, as well as the scripts used to process, clean, and sort the data into the final labelled set. These additions promote transparency and reproducibility, allowing researchers to understand the full data pipeline and adapt it for their own applications. This resource is intended to advance research in deep learning for lung ultrasound analysis and to contribute toward building more accessible and reliable diagnostic tools in global health.     
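For orientation, loading such a labels file might look like the following sketch. The file name  labels.csv  and its column names are hypothetical, chosen only for illustration; check the actual layout of the release.

```python
# Sketch of loading a structured labels file, assuming a hypothetical
# labels.csv with columns "filename" and "label" (the release may use
# different file names, column names, or JSON instead of CSV).
import csv
from collections import Counter

def load_labels(path):
    """Return a {filename: label} mapping from a CSV labels file."""
    with open(path, newline="", encoding="utf-8") as f:
        return {row["filename"]: row["label"] for row in csv.DictReader(f)}

def class_counts(labels):
    """Count images per class (COVID-19 / Other Lung Disease / Healthy Lung)."""
    return Counter(labels.values())
```

The mapping can then be used to pair each PNG/JPEG image with its class when building training and test splits.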
Katumba, Andrew; Murindanyi, Sudi; Okila, Nixson; Nakatumba-Nabende, Joyce; Mwikirize, Cosmas; Serugunda, Jonathan; Bugeza, Samuel; Oriekot, Anthony; Bosa, Juliet; Nabawanuka, Eva (2025), “A Dataset of Lung Ultrasound Images for Automated AI-based Lung Disease Classification”, Mendeley Data, V2, doi: 10.17632/hb3p34ytvx.2</description>
<size>281447804</size>
</item><item>
<title>Wikipedia Asian languages 2026-02-01</title>
<category>Dataset</category>
<infohash>1bde58b51e4aad60f03ce1b688b691552fb3041e</infohash>
<guid>https://academictorrents.com/details/1bde58b51e4aad60f03ce1b688b691552fb3041e</guid>
<link>https://academictorrents.com/details/1bde58b51e4aad60f03ce1b688b691552fb3041e</link>
<description>Wikipedia database dumps of Asian-language wikis with 10k articles or more. Wikipedia Multistream 2026-02-01. These 85 languages are included: Acehnese, Armenian, Assamese, Azerbaijani, Balinese, Bangla, Banjar, Banyumasan, Bashkir, Bishnupriya, Buginese, Burmese, Cantonese, Cebuano, Central Bikol, Central Kurdish, Chechen, Chinese, Chuvash, Classical Chinese, Dimli, Eastern Mari, Georgian, Gilaki, Gorontalo, Gujarati, Hakka, Hebrew, Hindi, Iloko, Indonesian, Japanese, Javanese, Kannada, Kara-Kalpak, Kazakh, Khmer, Korean, Kurdish, Kyrgyz, Maithili, Malay, Malayalam, Manipuri, Marathi, Mazanderani, Minangkabau, Mindong, Mingrelian, Minnan, Mongolian, Nepali, Newari, Odia, Ossetic, Pampangan, Pashto, Persian, Punjabi, Russian, Sanskrit, Santali, Saraiki, Shan, Sindhi, Sinhala, South Azerbaijani, Sundanese, Tagalog, Tajik, Talysh, Tamil, Tatar, Telugu, Thai, Turkish, Urdu, Uzbek, Vietnamese, Waray, Western Armenian, Western Mari, Western Punjabi, Wu, Yakut.</description>
<size>31208780863</size>
</item><item>
<title>Reddit comments/submissions 2026-01</title>
<category>Dataset</category>
<infohash>8412b89151101d88c915334c45d9c223169a1a60</infohash>
<guid>https://academictorrents.com/details/8412b89151101d88c915334c45d9c223169a1a60</guid>
<link>https://academictorrents.com/details/8412b89151101d88c915334c45d9c223169a1a60</link>
<description>Reddit comments and submissions from 2026-01. Documentation, json schemas and more can be found at https://github.com/ArthurHeitmann/arctic_shift Helper scripts for processing files can be found at https://github.com/Watchful1/PushshiftDumps</description>
<size>61629104259</size>
</item><item>
<title>Begemot.ai Dataset</title>
<category>Dataset</category>
<infohash>3ada9903be4621cf7e34cd5cf44f191b4124ccfe</infohash>
<guid>https://academictorrents.com/details/3ada9903be4621cf7e34cd5cf44f191b4124ccfe</guid>
<link>https://academictorrents.com/details/3ada9903be4621cf7e34cd5cf44f191b4124ccfe</link>
<description># Dataset Card for Begemot.ai ### Dataset Summary This dataset has 2,728,999 educational project descriptions in Russian. They were generated using AI on the Begemot.ai website. The content includes project titles, descriptions, chapters and chapter content on various educational topics. ### Languages The dataset is primarily in Russian (ru). ## Dataset Structure ### Data Fields This dataset includes the following fields: -  id : Unique identifier for the project (integer) -  url : URL of the project page (string) -  title : Title of the educational project (string) -  type : Type of project (string) -  description : Detailed description of the project (string) -  chapters : List of chapter titles (list of strings) -  chapter_content : JSON string mapping chapter titles to their content ### Data Splits All examples are in a single split.</description>
<size>1708074039</size>
</item><item>
<title>OpenPOCUS - Lung Ultrasound Image Database</title>
<category>Dataset</category>
<infohash>63ad0470f43e022cc73407be9c760449d947cb97</infohash>
<guid>https://academictorrents.com/details/63ad0470f43e022cc73407be9c760449d947cb97</guid>
<link>https://academictorrents.com/details/63ad0470f43e022cc73407be9c760449d947cb97</link>
<description>https://i.imgur.com/s0eFv64.png Background Lung ultrasound (LUS) offers advantages over traditional imaging for diagnosing pulmonary conditions, with superior accuracy compared to chest X-ray and similar performance to CT at lower cost. Despite these benefits, widespread adoption is limited by operator dependency, moderate interrater reliability, and training requirements. Deep learning (DL) could potentially address these challenges, but development of effective algorithms is hindered by the scarcity of comprehensive image repositories with proper metadata.</description>
<size>5256546243</size>
</item><item>
<title>Russian Educational Text Collection</title>
<category>Dataset</category>
<infohash>1f6b373346a0fa34de6b4d916984d698e0a623b3</infohash>
<guid>https://academictorrents.com/details/1f6b373346a0fa34de6b4d916984d698e0a623b3</guid>
<link>https://academictorrents.com/details/1f6b373346a0fa34de6b4d916984d698e0a623b3</link>
<description># Dataset Card for Russian Educational Text Collection ### Dataset Summary This dataset contains approximately 1.38M educational texts primarily in Russian with some content in Ukrainian and English. The content is extracted from presentations and documents, including educational presentations, essays, and various academic documents covering diverse topics from natural sciences to literature. ### Languages - Russian (ru) - primary language - Ukrainian (uk) - secondary language - English (en) - secondary language Russian is the predominant language in the dataset, while Ukrainian and English content appears less frequently. ## Dataset Structure ### Data Fields The dataset is split into two parquet files: - presentations (1,335,171 entries): -  title : Title of the presentation (string) -  slide_text : Array of slide contents (list of strings) - documents (47,474 entries): -  title : Title of the document (string) -  document_text : Full text content of the document (string) ## Additional Information ### License This dataset is dedicated to the public domain under the Creative Commons Zero (CC0) license. This means you can: * Use it for any purpose, including commercial projects * Modify it however you like * Distribute it without asking permission No attribution is required, but it's always appreciated!</description>
<size>304218686</size>
</item><item>
<title>Animations Dataset</title>
<category>Dataset</category>
<infohash>8799f1e66bf0a63a77e89a7917fbe281c13bcd9f</infohash>
<guid>https://academictorrents.com/details/8799f1e66bf0a63a77e89a7917fbe281c13bcd9f</guid>
<link>https://academictorrents.com/details/8799f1e66bf0a63a77e89a7917fbe281c13bcd9f</link>
<description># Dataset Card for Animations Dataset ### Dataset Summary This dataset contains 50,849 animations with their associated metadata and source images. Each animation consists of multiple frames composed of simple sketch-level drawings, text elements, and potentially embedded images. The dataset provides complete information about each animation, including frame components, source images, timing between frames, and canvas settings. This makes it suitable for various tasks such as animation analysis, generation, and modification. ### Languages The dataset is primarily monolingual: - English (en): Any text elements within animations are predominantly in English. ## Dataset Structure ### Data Files The dataset is stored as Parquet files with ZSTD compression: -  train-00000.parquet  through  train-00003.parquet  - Total: 4 shards, ~4.2 GB compressed ### Data Fields Each row in the Parquet files contains the following columns: | Column | Type | Description | |---|---|---| |  id  |  string  | Unique identifier (UUID) for the animation | |  settings  |  string  | JSON object containing canvas configuration | |  dtimes  |  list[int64]  | Time delays between frames in milliseconds | |  frames_data  |  string  | JSON array describing each frame's elements | |  images  |  list[binary]  | PNG images used in the animation (decoded bytes) | #### Settings Object The  settings  JSON contains: -  canvas_width ,  canvas_height : Dimensions of the animation canvas -  fillcolor : Background color of the canvas (if specified) -  default_font : Default font used for text elements -  default_font_size : Default font size #### Frames Data Structure The  frames_data  JSON is an array of arrays, where each inner array represents a frame's elements: -  type_for_loader : Element type (e.g., "text", "image") -  data : Object containing element properties: -  type : Element type -  centerx ,  centery : Position coordinates on the canvas -  text : Text content (for text elements) -  font ,  size : Font properties -  rotate_angle ,  angle : Rotation properties -  strokeColor ,  fillColor ,  textColor : Color properties -  src : Index into the  images  array (for image elements) -  children_data : Array of child elements (if any) ### Data Splits | Split | Number of Examples | |---|---| |  train  | 50,849 |</description>
<size>4374676620</size>
</item><item>
<title>Sprite Compositing &amp; Animation Dataset</title>
<category>Dataset</category>
<infohash>df2a3742526f44dac4dbb80299333e84132c5b45</infohash>
<guid>https://academictorrents.com/details/df2a3742526f44dac4dbb80299333e84132c5b45</guid>
<link>https://academictorrents.com/details/df2a3742526f44dac4dbb80299333e84132c5b45</link>
<description># Sprite Compositing &amp; Animation Dataset A diverse dataset of image sequences with their source sprite assets. Contains animations, slideshows, and composited scenes created from transparent PNG sprites with additional effects and overlays. ## Dataset Statistics | Metric | Value | |---|---| | Total animations | 50,849 | | Total frames | 1,191,969 | | Total source sprites | 312,926 | | Avg frames/animation | 23.4 | | Avg sprites/animation | 6.2 | | Frame resolution | 800 x 450-600 | | Sprite resolution | Variable | | Total size | ~27 GB | | Format | Parquet (ZSTD compressed) | ## Schema | Column | Type | Description | |---|---|---| |  id  |  string  | Unique identifier (UUID) | |  source_images  |  list[binary]  | PNG bytes of source sprite assets (RGBA), sorted by index. Can be empty. | |  frames  |  list[binary]  | PNG bytes of final rendered frames (RGB 800x450-600), sorted by sequence order | |  num_sources  |  int64  | Number of source sprites (0 for rows without source assets) | |  num_frames  |  int64  | Number of frames in the sequence | ## Data Structure Each sample contains: 1. **Source Images** ( source_images ): Transparent PNG sprites/assets (RGBA mode) used to compose the final frames. Variable sizes. May include characters, objects, etc. 2. **Frames** ( frames ): Final rendered image sequence (RGB mode, typically 800x450-600). These are the result of compositing source sprites with additional effects such as: - Text overlays - Drawings and sketches - Backgrounds - Animations and transitions - Visual effects ### Content Variations - **Animations**: Smooth frame-by-frame animations of sprites (e.g., character movement) - **Slideshows**: Discrete scene transitions using source assets - **Composited Scenes**: Source sprites combined with text, drawings, and effects - **Sketches**: Hand-drawn or illustrated frames with optional sprite references **Note**: Not all frames are strict animations - many are slideshows or scene compositions where source assets are combined with additional elements.</description>
<size>27130846899</size>
</item><item>
<title>NNTP Discussion Archives</title>
<category>Dataset</category>
<infohash>cac053d01e256ae3001bf40c5c98eefa86cdc870</infohash>
<guid>https://academictorrents.com/details/cac053d01e256ae3001bf40c5c98eefa86cdc870</guid>
<link>https://academictorrents.com/details/cac053d01e256ae3001bf40c5c98eefa86cdc870</link>
<description># NNTP Discussion Archives A large-scale collection of text discussions from public NNTP (Network News Transfer Protocol) newsgroups spanning over two decades. ## Dataset Statistics | Metric | Value | |---|---| | Total messages | 386,629,949 | | Unique newsgroups | 159,345 | | Date range | 2002 - 2026 | | Total size | ~191 GB (compressed) | | File format | Parquet (ZSTD) | | Number of files | 256 | | Average content length | ~1,400 characters | ## Schema | Column | Type | Description | |---|---|---| |  message_id  |  string  | Original message identifier (unchanged) | |  newsgroups  |  string  | Target newsgroup(s), comma-separated if cross-posted | |  author  |  string  | Message author with email addresses redacted as  [email]  | |  subject  |  string  | Subject line | |  date  |  string  | RFC 2822 formatted date string | |  content  |  string  | Message body with email addresses redacted as  [email]  | ## Top Newsgroups by Volume | Newsgroup | Messages | |---|---| | alt.atheism | 5,658,023 | | free.usenet | 4,691,561 | | alt.fan.rush-limbaugh | 4,659,639 | | alt.politics | 3,919,772 | | fr.soc.politique | 3,554,434 | | it.sport.calcio.milan | 2,961,804 | | it.politica | 2,802,687 | | alt.politics.bush | 2,786,316 | | talk.politics.misc | 2,784,668 | | Other (159,336 groups) | 475,430,274 | *Cross-posted messages are counted once per newsgroup, so totals exceed the 386M unique messages.* ## Data Processing **Filtering:** Binary-focused groups (*.binaries.*, *.pictures.*, *.multimedia.*), binary posts with file-sharing indicators, messages exceeding 500KB, and unrecoverable encoding errors are excluded.
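As a rough illustration, the group and size exclusions just described could be expressed as a predicate like the one below. This is a sketch only; the field handling and pattern matching are illustrative, not the pipeline actually used.

```python
# Sketch of the exclusion rules described above (illustrative only; the
# actual processing pipeline for this dataset is not published here).
import fnmatch

BINARY_GROUP_PATTERNS = ["*.binaries.*", "*.pictures.*", "*.multimedia.*"]
MAX_BODY_BYTES = 500 * 1024  # messages exceeding 500KB are excluded

def keep_message(newsgroups: str, body: bytes) -> bool:
    """Return True if a message passes the binary-group and size filters."""
    for group in newsgroups.split(","):
        name = group.strip()
        if any(fnmatch.fnmatch(name, pat) for pat in BINARY_GROUP_PATTERNS):
            return False  # posted to a binary-focused group
    return len(body) <= MAX_BODY_BYTES
```
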
Spam is largely **not filtered**: the dataset includes advertisements, phishing, and low-quality posts present in raw newsgroups. **Encoding:** Messages are normalized to UTF-8 with the following decoding pipeline: - Quoted-Printable: MIME-encoded content decoded to text - Base64: Text base64 content decoded; binary base64 excluded - Legacy encodings: Invalid UTF-8 sequences re-encoded using Windows-1252, ISO-8859-*, KOI8-R, Shift-JIS, GBK, and other detected legacy encodings - MIME encoded-word headers decoded to UTF-8 **Deduplication:** Exact content duplicates removed via xxHash64 hashing (first occurrence retained). **Privacy:** Email addresses in  author  and  content  fields redacted as  [email] ;  message_id  unchanged. ## Considerations - Messages were posted to public newsgroups - Content reflects unmoderated discussions and may contain controversial opinions</description>
<size>204065504201</size>
</item><item>
<title>Reddit comments/submissions 2005-06 to 2025-12</title>
<category>Dataset</category>
<infohash>3d426c47c767d40f82c7ef0f47c3acacedd2bf44</infohash>
<guid>https://academictorrents.com/details/3d426c47c767d40f82c7ef0f47c3acacedd2bf44</guid>
<link>https://academictorrents.com/details/3d426c47c767d40f82c7ef0f47c3acacedd2bf44</link>
<description>Reddit comments and submissions from 2005-06 to 2025-12 collected by pushshift and u/RaiderBDev. These are zstandard compressed ndjson files. Example python scripts for parsing the data can be found here https://github.com/Watchful1/PushshiftDumps The more recent dumps are collected by u/RaiderBDev</description>
<size>3804096351995</size>
</item><item>
<title>RU-OK? Uptime measurements of Russian/Belarusian DDoS targets of IT ARMY</title>
<category>Dataset</category>
<infohash>87b8ba53e3f7d58ac3845f1be81f49682c2c68f2</infohash>
<guid>https://academictorrents.com/details/87b8ba53e3f7d58ac3845f1be81f49682c2c68f2</guid>
<link>https://academictorrents.com/details/87b8ba53e3f7d58ac3845f1be81f49682c2c68f2</link>
<description>In 2022, Russia began a full-scale invasion of Ukraine in the escalating Russo-Ukrainian war. Ukrainian ingenuity quickly led to the creation of a volunteer cyberwarfare organization, [IT Army of Ukraine](https://en.wikipedia.org/wiki/IT_Army_of_Ukraine), which conducted both defensive and offensive operations. Notably, they invited anyone with an internet connection to DDoS an ever-growing list of Russian and Belarusian websites, with the goal of disrupting infrastructure and draining Russia’s own cyberwarfare capabilities. I made a very quick project to assess the status of Russian and Belarusian internet properties (via [RIPE Atlas](https://atlas.ripe.net/)) being targeted by hacktivists. Specifically, I evaluated almost every target listed by the IT ARMY Telegram group with many unique probes between 2022-02-27 (the day after IT ARMY was created) and 2022-05-30 to check for service availability. I wanted to check connectivity from within Russia’s borders because I saw many mixed reports across Twitter and Reddit, with international parties (Americans, Ukrainians, etc.) claiming many sites had been knocked offline, while Russians chimed in that many sites remained online for them. The truth is more complex - some sites were significantly disrupted and took time to recover globally, while others had existing mitigations in place, others seemed to deprioritize or sinkhole international traffic, etc. This research was included in several news articles around the world: * Ukraine’s IT army is doing well, hitting Russia with ‘cost and chaos’ - [VentureBeat](https://venturebeat.com/2022/03/04/ukraines-it-army-is-doing-well-hitting-russia-with-cost-and-chaos/) * Ukraine deserves an IT army.
We have to live with the fallout - [VentureBeat](https://venturebeat.com/2022/03/04/ukraine-deserves-an-it-army-we-have-to-live-with-the-fallout/) * Ukraine: We’ve repelled ‘nonstop’ DDoS attacks from Russia - [VentureBeat](https://venturebeat.com/2022/03/07/ukraine-weve-repelled-nonstop-ddos-attacks-from-russia/) * Guerre en Ukraine : les cyberattaques contre la Russie, le « cri de colère » d’une armée de volontaires - [Le Monde](https://www.lemonde.fr/pixels/article/2022/03/25/guerre-en-ukraine-face-a-la-russie-les-cyberattaques-en-forme-de-cri-de-colere-d-une-armee-de-volontaires_6119064_4408996.html) * Ukraine Demanded Cloudflare Stop Protecting Russians From Cyberattacks. Cloudflare Said No - [Forbes](https://www.forbes.com/sites/thomasbrewster/2022/03/07/cloudflare-rejects-ukraines-call-to-stop-protecting-russians-from-cyberattacks/) The data and methodology for RU-OK was originally published on my GitHub, where I hope it will remain. However, I’ve received the occasional nastygram about this research and recently received a takedown request from a Russian cybersecurity firm, claiming that sensitive information is being stored in my repository. There isn’t, of course, and all the data is public measurements against public endpoints. Still, I’m concerned that fraudulent reports could result in my repo getting deleted, so I’m creating a censorship-resistant copy and distributing it on my blog and on Academic Torrents. It’s long overdue anyway. I encourage anyone curious to take a dig through the data, as you can watch both the immediate impact of DDoS attacks as well as Russian government and company resilience change over several months as these attacks became commonplace.</description>
<size>1609147173</size>
</item><item>
<title>Wikipedia Wikidata 2026-01-01</title>
<category>Dataset</category>
<infohash>91f29e60cc4a65747a346109ef49a48808c6a2cd</infohash>
<guid>https://academictorrents.com/details/91f29e60cc4a65747a346109ef49a48808c6a2cd</guid>
<link>https://academictorrents.com/details/91f29e60cc4a65747a346109ef49a48808c6a2cd</link>
<description>Database dump of the Wikidata wiki. Wikipedia Multistream 2026-01-01.</description>
<size>175980564292</size>
</item><item>
<title>Wikipedia Commons 2026-01-01</title>
<category>Dataset</category>
<infohash>83f2cfd35db16f696000bd3dee56e3837fe3e60c</infohash>
<guid>https://academictorrents.com/details/83f2cfd35db16f696000bd3dee56e3837fe3e60c</guid>
<link>https://academictorrents.com/details/83f2cfd35db16f696000bd3dee56e3837fe3e60c</link>
<description>Database dump of Wikipedia Commons. Wikipedia Multistream 2026-01-01.</description>
<size>106644000008</size>
</item><item>
<title>Reddit comments/submissions 2025-12</title>
<category>Dataset</category>
<infohash>481bf2eac43172ae724fd6c75dbcb8e27de77734</infohash>
<guid>https://academictorrents.com/details/481bf2eac43172ae724fd6c75dbcb8e27de77734</guid>
<link>https://academictorrents.com/details/481bf2eac43172ae724fd6c75dbcb8e27de77734</link>
<description>Reddit comments and submissions from 2025-12 Documentation, json schemas and more can be found at https://github.com/ArthurHeitmann/arctic_shift Helper scripts for processing files can be found at https://github.com/Watchful1/PushshiftDumps</description>
<size>56418645823</size>
</item><item>
<title>GitGud Code Dataset</title>
<category>Dataset</category>
<infohash>221571632238b826f0aa6ec4f370af633575cae4</infohash>
<guid>https://academictorrents.com/details/221571632238b826f0aa6ec4f370af633575cae4</guid>
<link>https://academictorrents.com/details/221571632238b826f0aa6ec4f370af633575cae4</link>
<description># GitGud Code Dataset ## Dataset Description This dataset was compiled from code repositories hosted on [GitGud.io](https://gitgud.io), a GitLab-based code hosting platform. GitGud.io serves as an alternative git hosting service used by various developer communities and open-source projects. ### Dataset Summary | Statistic | Value | |&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;-|&amp;mdash;&amp;mdash;&amp;mdash;-| | **Total Files** | 16,322,315 | | **Total Repositories** | 7,204 | | **Total Size** | 17.46 GB (compressed Parquet) | | **Programming Languages** | 2,185 | | **File Format** | Parquet with Zstd compression (17 files) | ### Key Features - **Diverse code corpus**: Contains code from over 7,000 repositories across various domains - **Wide language coverage**: Spans 2,185 programming languages and file types detected by file extension mapping - **Rich metadata**: Includes repository name, file path, detected language, license information, and file size - **Quality filtered**: Filtering applied to remove binary files, overly long lines, and license files ### Languages The dataset includes 2,185 programming languages and file types. 
The top 30 languages by file count: | Rank | Language | File Count | |&amp;mdash;&amp;mdash;&amp;mdash;|&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;|&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;| | 1 | tw (Twine) | 3,301,366 | | 2 | XML | 3,281,566 | | 3 | svg | 1,744,500 | | 4 | C# | 1,367,799 | | 5 | JavaScript | 1,252,710 | | 6 | C++ | 731,619 | | 7 | erb | 710,279 | | 8 | JSON | 398,139 | | 9 | Text | 377,948 | | 10 | twee | 300,576 | | 11 | csv | 205,230 | | 12 | HTML | 170,711 | | 13 | Markdown | 160,735 | | 14 | TypeScript | 147,173 | | 15 | Lua | 117,079 | | 16 | PHP | 116,059 | | 17 | none | 111,791 | | 18 | pal | 110,626 | | 19 | CSS | 108,664 | | 20 | Python | 106,261 | | 21 | dm | 98,333 | | 22 | Ruby | 93,685 | | 23 | _comment | 91,730 | | 24 | Java | 81,190 | | 25 | YAML | 63,289 | | 26 | ActionScript | 62,210 | | 27 | Git | 43,748 | | 28 | mdwn | 42,654 | | 29 | mk | 41,789 | | 30 | INI | 39,760 | ### Licenses The dataset includes files from repositories with various licenses: | License | File Count | |&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;-|&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;| | mit | 9,517,343 | | bsd-3-clause | 3,315,732 | | unknown | 2,935,736 | | mpl-2.0 | 338,040 | | gpl-2.0 | 79,415 | | lgpl-2.1 | 38,429 | | gpl-3.0 | 25,964 | | apache-2.0 | 20,562 | | cc-by-4.0 | 18,703 | | agpl-3.0 | 15,367 | | cc-by-nc-4.0 | 6,362 | | wtfpl | 6,163 | | bsd-2-clause | 3,749 | | zlib | 482 | | unlicense | 261 | | cc-by-sa-4.0 | 7 | ## Dataset Structure ### Data Fields | Field | Type | Description | |&amp;mdash;&amp;mdash;&amp;mdash;-|&amp;mdash;&amp;mdash;&amp;mdash;|&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;-| |  code  | string | Content of the source file (UTF-8 encoded) | |  repo_name  | string | Name of the GitGud repository (format:  username/repo ) | |  path  | string | Path of the file within the repository (relative to repo root) | |  language  | string | 
Programming language detected by file extension mapping | |  license  | string | License of the repository (SPDX identifier or "unknown") | |  size  | int64 | Size of the source file in bytes | ### Data Format - **Format**: Apache Parquet with Zstd compression (level 19) - **File Structure**: 17 files ( gitgud-00000.parquet  to  gitgud-00016.parquet ) - **Rows per shard**: ~1,000,000 (except last shard: 322,315) ### Data Splits All examples are in the train split. There is no validation or test split. ### Example Data Point       code :  using System;\nusing System.Collections.Generic;\n... ,  repo_name :  username/game-mod ,  path :  src/GameMod/Player.cs ,  language :  C# ,  license :  mit ,  size : 2048      ## Dataset Creation ### Pipeline Overview The dataset was created through a multi-stage pipeline: 1. **Repository Discovery**: Scraping public repository URLs from GitGud.io's GitLab API v4 endpoint using multiple sort orderings ( id ,  name ,  path ,  updated_at ,  star_count ,  last_activity_at ,  similarity ) 2. **Branch Enumeration**: Fetching all branches for each repository via the GitLab API 3. **Archive Download**: Downloading  .tar.gz  archives for each repository/branch combination 4. **Content Extraction**: Extracting and filtering source code files from archives 5. **Parquet Generation**: Writing filtered records to Parquet shards with Zstd compression ### Language Detection Programming languages are detected using file extension mapping. The pipeline maps ~80 programming languages by their file extensions, including: - **Major languages**: Python, JavaScript, TypeScript, C, C++, C#, Java, Go, Rust, Ruby, PHP - **Configuration**: JSON, YAML, TOML, XML, INI - **Markup**: HTML, CSS, Markdown, LaTeX - **Game development**: GLSL, HLSL, GDScript - **And many more** Files with unrecognized extensions are labeled with the extension itself (without the dot prefix). 
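As a rough illustration of this labeling rule (a minimal sketch with a hypothetical mapping table, not the pipeline's actual code):

```python
# Hypothetical excerpt of the extension-to-language table;
# the real pipeline maps ~80 languages this way.
EXT_MAP = {"py": "Python", "js": "JavaScript", "cs": "C#", "cpp": "C++", "rs": "Rust"}

def detect_language(path: str) -> str:
    """Label a file by its extension, keeping the bare extension for unknown types."""
    name = path.rsplit("/", 1)[-1]
    if "." not in name:
        return "none"  # extensionless files (special names like Makefile handled separately)
    ext = name.rsplit(".", 1)[-1].lower()
    return EXT_MAP.get(ext, ext)  # unmapped extensions become the label themselves

print(detect_language("src/GameMod/Player.cs"))  # C#
print(detect_language("story.tw"))               # tw
```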
Files without extensions are labeled as "none" or by special filename matching (e.g., "Dockerfile", "Makefile"). ### License Detection Licenses are detected by: 1. Scanning for license files ( LICENSE ,  LICENSE.txt ,  LICENSE.md ,  COPYING ,  COPYING.txt ,  COPYING.md ) 2. Matching license text against known patterns (MIT, Apache 2.0, GPL variants, BSD, Creative Commons, MPL, ISC, Unlicense, Artistic, WTFPL, Zlib, etc.) 3. Defaulting to "unknown" if no license can be detected ### File Filtering Filtering is applied to ensure data quality: #### Size Limits | Limit | Value | |&amp;mdash;&amp;mdash;&amp;mdash;-|&amp;mdash;&amp;mdash;&amp;mdash;-| | Max repository archive size | 64 MB | | Max line length | 1,000 characters | #### Content Filtering - **Binary Detection**: Files with null bytes in the first 1KB are excluded - **UTF-8 Validation**: Files must be decodable as UTF-8 (with fallback to latin-1, cp1252, iso-8859-1) - **Long Lines**: Files with any line exceeding 1,000 characters are excluded - **License Files**: License files (LICENSE, COPYING, etc.) are excluded from the dataset (but used for license detection) ### Source Data All data originates from public repositories hosted on [GitGud.io](https://gitgud.io). ## Considerations for Using the Data ### Personal and Sensitive Information The dataset may contain: - Email addresses in code comments or configuration files - API keys or credentials that were accidentally committed - Personal information in comments or documentation Users should exercise caution and implement appropriate filtering when using this data. ### Licensing Information This dataset is a collection of source code from repositories with various licenses. Any use of all or part of the code gathered in this dataset must abide by the terms of the original licenses, including attribution clauses when relevant. The license field in each data point indicates the license of the source repository.</description>
<size>18751337953</size>
</item><item>
<title>Mos.Hub Code Dataset</title>
<category>Dataset</category>
<infohash>991f0d7eaa11bfda7f08e9bd82466458982cd430</infohash>
<guid>https://academictorrents.com/details/991f0d7eaa11bfda7f08e9bd82466458982cd430</guid>
<link>https://academictorrents.com/details/991f0d7eaa11bfda7f08e9bd82466458982cd430</link>
<description># Mos.Hub Code Dataset ## Dataset Description This dataset was compiled from code repositories hosted on [Mos.Hub](https://hub.mos.ru) (hub.mos.ru), a code hosting platform operated by the Moscow Government. Mos.Hub is a service for storing and working with source code, based on the Git version control system, primarily used by Russian developers and government-related projects. ### Dataset Summary | Statistic | Value | |---|---| | **Total Files** | 15,740,580 | | **Total Repositories** | 16,130 | | **Total Size** | 529 MB (compressed Parquet) | | **Uncompressed Size** | ~29 GB | | **Programming Languages** | 297 | | **File Format** | Parquet (single file) | ### Key Features - **Russian code corpus**: Contains code from repositories hosted on Moscow's official code platform, featuring Russian comments and documentation - **Diverse language coverage**: Spans 297 programming languages identified by [github-linguist](https://github.com/github-linguist/linguist) - **Quality filtered**: Binary files and low-quality content have been removed ### Languages The dataset includes 297 programming languages. 
The top 30 languages by file count: | Rank | Language | File Count | |&amp;mdash;&amp;mdash;&amp;mdash;|&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;|&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;| | 1 | Ruby | 8,333,731 | | 2 | JavaScript | 1,786,730 | | 3 | YAML | 1,757,614 | | 4 | Vue | 699,171 | | 5 | Markdown | 639,585 | | 6 | Haml | 538,837 | | 7 | GraphQL | 269,485 | | 8 | JSON | 214,354 | | 9 | PHP | 191,150 | | 10 | SVG | 172,884 | | 11 | Shell | 172,451 | | 12 | Go | 88,089 | | 13 | Ignore List | 87,432 | | 14 | SCSS | 80,716 | | 15 | Python | 77,532 | | 16 | C++ | 63,177 | | 17 | HTML+ERB | 62,605 | | 18 | Text | 48,400 | | 19 | Jest Snapshot | 43,638 | | 20 | HTML | 42,489 | | 21 | C | 38,354 | | 22 | reStructuredText | 26,342 | | 23 | Rust | 24,818 | | 24 | E-mail | 23,993 | | 25 | XML | 22,715 | | 26 | Java | 14,807 | | 27 | Gettext Catalog | 14,429 | | 28 | C# | 13,405 | | 29 | CSS | 12,657 | | 30 | Protocol Buffer Text Format | 12,181 | ## Dataset Structure ### Data Fields | Field | Type | Description | |&amp;mdash;&amp;mdash;&amp;mdash;-|&amp;mdash;&amp;mdash;&amp;mdash;|&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;-| |  file_text  | string | Content of the source file (UTF-8 encoded) | |  language  | string | Programming language as identified by [github-linguist](https://github.com/github-linguist/linguist) | |  file_name  | string | Name of the source file | ### Data Format - **Format**: Apache Parquet - **File Structure**: Single file ( data.parquet ) ### Data Splits All examples are in the train split. There is no validation or test split. ### Example Data Point       file_text :  package mainnnimport "fmt"nnfunc main() n    fmt.Println("Hello")nn ,  language :  Go ,  file_name :  main.go       ## Dataset Creation ### Source Data All data originates from public repositories hosted on [Mos.Hub](https://hub.mos.ru). 
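To give a feel for the schema, a minimal usage sketch (assuming pandas; the rows below are mocked stand-ins for illustration, and on the real dump you would load  data.parquet  directly):

```python
import pandas as pd

# On the real dump: df = pd.read_parquet("data.parquet")
# Mocked rows mirroring the dataset's three fields: file_text, language, file_name.
df = pd.DataFrame(
    {
        "file_text": ['package main\n\nimport "fmt"\n', "puts 'hello'\n", "console.log('hi');\n"],
        "language": ["Go", "Ruby", "JavaScript"],
        "file_name": ["main.go", "hello.rb", "app.js"],
    }
)

# Typical query: select one language and measure how much text it contributes.
ruby = df[df["language"] == "Ruby"]
total_chars = int(ruby["file_text"].str.len().sum())
print(len(ruby), total_chars)  # 1 13
```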
### Language Detection Programming languages are detected using [github-linguist](https://github.com/github-linguist/linguist), GitHub's library for detecting programming languages. ### Filtering - **Deduplication**: The dataset has been deduplicated to ensure unique code files - **Binary Files**: Binary files have been removed from the dataset - **UTF-8 Validation**: Files must be valid UTF-8 encoded text ## Considerations for Using the Data ### Personal and Sensitive Information The dataset may contain: - Email addresses in code comments or configuration files - API keys or credentials that were accidentally committed - Personal information in comments or documentation Users should exercise caution and implement appropriate filtering when using this data. ### Licensing Information This dataset has been compiled with an analysis of the licenses used in the repositories to ensure ethical collection and use of the data. Users of this dataset should respect the rights of the authors and use the data responsibly.</description>
<size>554021034</size>
</item><item>
<title>Google Code Archive Dataset</title>
<category>Dataset</category>
<infohash>a342da363792ac5fa018039d5a57c81be74e4b52</infohash>
<guid>https://academictorrents.com/details/a342da363792ac5fa018039d5a57c81be74e4b52</guid>
<link>https://academictorrents.com/details/a342da363792ac5fa018039d5a57c81be74e4b52</link>
<description>## Dataset Description This dataset was compiled from the [Google Code Archive](https://code.google.com/archive/), a preserved snapshot of projects hosted on Google Code, Google's open-source project hosting service that operated from 2006 to 2016. Google Code was one of the major code hosting platforms of its era, hosting hundreds of thousands of open-source projects before its shutdown. The archive provides a unique historical record of open-source development during a formative period of modern software engineering. ### Dataset Summary | Statistic | Value | |---|---| | **Total Files** | 65,825,565 | | **Total Repositories** | 488,618 | | **Total Size** | 47 GB (compressed Parquet) | | **Programming Languages** | 454 | | **File Format** | Parquet with Zstd compression (71 files) | ### Key Features - **Historical open-source corpus**: Contains code from over 488K repositories hosted on Google Code during 2006-2016 - **Diverse language coverage**: Spans 454 programming languages identified by [go-enry](https://github.com/go-enry/go-enry) (based on GitHub Linguist rules) - **Rich metadata**: Includes repository name, file path, detected language, license information, and file size - **Quality filtered**: Extensive filtering to remove vendor code, build artifacts, generated files, and low-quality content - **Era-specific patterns**: Captures coding conventions and library usage from an earlier era of software development ### Languages The dataset includes 454 programming languages. 
The top 30 languages by file count: | Rank | Language | File Count | |&amp;mdash;&amp;mdash;&amp;mdash;|&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;|&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;| | 1 | Java | 16,331,993 | | 2 | PHP | 12,764,574 | | 3 | HTML | 5,705,184 | | 4 | C++ | 5,090,685 | | 5 | JavaScript | 4,937,765 | | 6 | C | 4,179,202 | | 7 | C# | 3,872,245 | | 8 | Python | 2,207,240 | | 9 | CSS | 1,697,385 | | 10 | Objective-C | 1,186,050 | | 11 | Shell | 639,183 | | 12 | Java Server Pages | 541,498 | | 13 | ActionScript | 540,557 | | 14 | Makefile | 481,563 | | 15 | ASP.NET | 381,389 | | 16 | Smarty | 339,555 | | 17 | Ruby | 331,743 | | 18 | Go | 316,427 | | 19 | Perl | 307,960 | | 20 | Vim Script | 216,236 | | 21 | Lua | 215,226 | | 22 | HTML+PHP | 150,781 | | 23 | HTML+Razor | 149,131 | | 24 | MATLAB | 145,686 | | 25 | Batchfile | 138,523 | | 26 | Pascal | 135,992 | | 27 | Visual Basic .NET | 118,732 | | 28 | TeX | 110,379 | | 29 | Less | 98,221 | | 30 | Unix Assembly | 94,758 | ### Licenses The dataset includes files from repositories with various licenses as specified in the Google Code Archive: | License | File Count | |&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;-|&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;| | Apache License 2.0 (asf20) | 21,568,143 | | GNU GPL v3 (gpl3) | 14,843,470 | | GNU GPL v2 (gpl2) | 6,824,185 | | Other Open Source (oos) | 5,433,436 | | MIT License (mit) | 4,754,567 | | GNU LGPL (lgpl) | 4,073,137 | | BSD License (bsd) | 3,787,348 | | Artistic License (art) | 1,910,047 | | Eclipse Public License (epl) | 1,587,289 | | Mozilla Public License 1.1 (mpl11) | 580,102 | | Multiple Licenses (multiple) | 372,457 | | Google Summer of Code (gsoc) | 63,292 | | Public Domain (publicdomain) | 28,092 | ## Dataset Structure ### Data Fields | Field | Type | Description | 
|&amp;mdash;&amp;mdash;&amp;mdash;-|&amp;mdash;&amp;mdash;&amp;mdash;|&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;-| |  code  | string | Content of the source file (UTF-8 encoded) | |  repo_name  | string | Name of the Google Code project | |  path  | string | Path of the file within the repository (relative to repo root) | |  language  | string | Programming language as identified by [go-enry](https://github.com/go-enry/go-enry) | |  license  | string | License of the repository (Google Code license identifier) | |  size  | int64 | Size of the source file in bytes | ### Data Format - **Format**: Apache Parquet with Zstd compression - **File Structure**: 71 files ( google_code_0000.parquet  to  google_code_0070.parquet ) ### Data Splits All examples are in the train split. There is no validation or test split. ### Example Data Point</description>
<size>50126651493</size>
</item><item>
<title>NotaBug Code Dataset</title>
<category>Dataset</category>
<infohash>ba10507193e4169b1b2420fb81e6b4999840e0f5</infohash>
<guid>https://academictorrents.com/details/ba10507193e4169b1b2420fb81e6b4999840e0f5</guid>
<link>https://academictorrents.com/details/ba10507193e4169b1b2420fb81e6b4999840e0f5</link>
<description># NotaBug Code Dataset ## Dataset Description This dataset was compiled from code repositories hosted on [NotaBug.org](https://notabug.org), a free code hosting platform that emphasizes software freedom and privacy. NotaBug is built on a fully free software stack and is popular among free software advocates and privacy-conscious developers. ### Dataset Summary | Statistic | Value | |&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;-|&amp;mdash;&amp;mdash;&amp;mdash;-| | **Total Files** | 12,622,961 | | **Total Repositories** | 11,660 | | **Total Size** | 12 GB (compressed Parquet) | | **Programming Languages** | 6,306 (by file extension) | | **File Format** | Parquet with Zstd compression (12 files) | ### Key Features - **Free software focused corpus**: Contains code from repositories on a platform dedicated to software freedom - **Diverse language coverage**: Spans thousands of file types identified by file extension - **Rich metadata**: Includes repository name, file path, detected language, license information, and file size ### Languages The dataset includes files from many programming languages and file types. Languages are detected by file extension. 
The top 30 languages by file count: | Rank | Language | File Count | |&amp;mdash;&amp;mdash;&amp;mdash;|&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;|&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;| | 1 | C++ | 2,219,208 | | 2 | po | 2,022,441 | | 3 | none | 1,572,451 | | 4 | PHP | 951,354 | | 5 | patch | 637,317 | | 6 | svg | 547,170 | | 7 | XML | 502,139 | | 8 | Python | 392,476 | | 9 | Text | 296,953 | | 10 | JavaScript | 233,368 | | 11 | JSON | 198,981 | | 12 | Scheme | 192,409 | | 13 | Markdown | 182,342 | | 14 | info | 155,078 | | 15 | slackbuild | 154,859 | | 16 | HTML | 149,824 | | 17 | Shell | 133,325 | | 18 | log | 127,393 | | 19 | Makefile | 112,989 | | 20 | INI | 110,537 | | 21 | Lua | 84,303 | | 22 | in | 75,138 | | 23 | Assembly | 74,519 | | 24 | list | 58,346 | | 25 | Java | 48,781 | | 26 | CSS | 48,112 | | 27 | mk | 47,373 | | 28 | dtsi | 43,825 | | 29 | diff | 42,125 | | 30 | el | 41,017 | ### Licenses The dataset includes files from repositories with various licenses: | License | File Count | |&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;-|&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;| | mit | 10,029,349 | | mpl-2.0 | 1,178,420 | | unknown | 888,840 | | gpl-2.0 | 333,538 | | gpl-3.0 | 158,975 | | unlicense | 11,805 | | cc-by-4.0 | 8,367 | | bsd-2-clause | 4,718 | | agpl-3.0 | 3,055 | | cc-by-sa-4.0 | 2,309 | | wtfpl | 1,314 | | cc0-1.0 | 1,188 | | bsd-3-clause | 601 | | cc-by-nc-4.0 | 269 | | lgpl-3.0 | 137 | | lgpl-2.1 | 76 | ## Dataset Structure ### Data Fields | Field | Type | Description | |&amp;mdash;&amp;mdash;&amp;mdash;-|&amp;mdash;&amp;mdash;&amp;mdash;|&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;-| |  code  | string | Content of the source file (UTF-8 encoded) | |  repo_name  | string | Name of the NotaBug repository (format:  username/repo ) | |  path  | string | Path of the file within the repository (relative to repo root) | |  language  | string | Programming 
language as inferred by file extension | |  license  | string | License of the repository (SPDX identifier or "unknown") | |  size  | int64 | Size of the source file in bytes | ### Data Format - **Format**: Apache Parquet with Zstd compression - **File Structure**: 12 files ( notabug_0000.parquet  to  notabug_0011.parquet ) ### Data Splits All examples are in the train split. There is no validation or test split. ### Example Data Point       code :  #!/usr/bin/env python2n# -*- coding: utf-8 -*-n# Copyright (C) 2014... ,  repo_name :  intermsofthewhole/libreboot ,  path :  resources/utilities/i945gpu/intel-regs.py ,  language :  Python ,  license :  mit ,  size : 3733      ## Dataset Creation ### Source Data All data originates from public repositories hosted on [NotaBug.org](https://notabug.org). ### Language Detection Programming languages are detected by file extension inference. ### License Detection Licenses are detected by scanning for license files in repositories and matching against known license patterns. Repositories without a detectable license are marked as "unknown". ### File Filtering - **Long Lines**: Files with any line exceeding 1,000 characters were excluded - **Deduplication**: No deduplication was performed on the dataset ## Considerations for Using the Data ### Personal and Sensitive Information The dataset may contain: - Email addresses in code comments or configuration files - API keys or credentials that were accidentally committed - Personal information in comments or documentation Users should exercise caution and implement appropriate filtering when using this data. ### Licensing Information This dataset is a collection of source code from repositories with various licenses. Any use of all or part of the code gathered in this dataset must abide by the terms of the original licenses, including attribution clauses when relevant. The license field in each data point indicates the license of the source repository.</description>
<size>12602550935</size>
</item><item>
<title>JihuLab Code Dataset</title>
<category>Dataset</category>
<infohash>004c9565d325cf8951aa23d3a56ebb806898fc7f</infohash>
<guid>https://academictorrents.com/details/004c9565d325cf8951aa23d3a56ebb806898fc7f</guid>
<link>https://academictorrents.com/details/004c9565d325cf8951aa23d3a56ebb806898fc7f</link>
<description># JihuLab Code Dataset ## Dataset Description This dataset was compiled from code repositories hosted on [JihuLab](https://jihulab.com), a GitLab-based code hosting platform operated by JiHu (GitLab's Chinese joint venture). JihuLab serves as the primary GitLab instance for Chinese developers and enterprises, offering localized services and compliance with Chinese regulations. This dataset is particularly valuable for training code models with Chinese language understanding and enterprise-level coding practices. ### Dataset Summary | Statistic | Value | |---|---| | **Total Files** | 1,853,253 | | **Total Repositories** | 11,589 | | **Total Size** | 1.5 GB (compressed Parquet) / 12.76 GB (uncompressed) | | **Programming Languages** | 304 | | **File Format** | Parquet with Zstd compression | ### Key Features - **Chinese developer ecosystem**: Contains code from JihuLab, GitLab's official Chinese distribution, featuring Chinese comments, documentation, and variable names - **Diverse language coverage**: Spans 304 programming languages identified by [go-enry](https://github.com/go-enry/go-enry) (based on GitHub Linguist rules) - **Rich metadata**: Includes repository name, file path, detected language, license information, and file size - **Enterprise and open-source projects**: Includes code from both individual developers and Chinese enterprises using GitLab - **Quality filtered**: Extensive filtering to remove vendor code, build artifacts, generated files, and low-quality content ### Languages The dataset includes 304 programming languages. 
The top 30 languages by file count: | Rank | Language | File Count | |&amp;mdash;&amp;mdash;&amp;mdash;|&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;|&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;| | 1 | Java | 348,517 | | 2 | C | 209,924 | | 3 | JavaScript | 191,164 | | 4 | Python | 172,798 | | 5 | C++ | 136,046 | | 6 | Go | 80,000 | | 7 | TypeScript | 79,067 | | 8 | HTML | 69,173 | | 9 | C# | 64,511 | | 10 | Rust | 50,515 | | 11 | Shell | 43,352 | | 12 | Vue | 40,687 | | 13 | TSX | 36,844 | | 14 | CSS | 34,779 | | 15 | Makefile | 26,227 | | 16 | Ruby | 25,812 | | 17 | PHP | 21,401 | | 18 | CMake | 15,292 | | 19 | Kotlin | 14,220 | | 20 | BitBake | 13,060 | | 21 | SCSS | 10,957 | | 22 | Scala | 9,333 | | 23 | Dart | 9,125 | | 24 | Lua | 7,413 | | 25 | ASP.NET | 7,005 | | 26 | Vim Script | 5,710 | | 27 | Unix Assembly | 5,239 | | 28 | Starlark | 5,134 | | 29 | Objective-C | 4,931 | | 30 | Factor | 4,920 | ### Licenses The dataset includes files from repositories with various licenses. 
Repositories with restrictive licenses (CC-BY-ND variants, Commons Clause, SSPL) were excluded: | License | File Count | |&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;-|&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;| | apache-2.0 | 551,008 | | unknown | 535,320 | | mit | 320,834 | | agpl-3.0 | 169,922 | | gpl-2.0 | 112,829 | | bsd | 65,104 | | cc0-1.0 | 13,557 | | lgpl-3.0 | 12,871 | | lgpl-2.1 | 9,960 | | bsd-3-clause | 9,109 | | bsl-1.1 | 8,972 | | epl-1.0 | 7,494 | | gpl-3.0 | 7,476 | | unlicense | 6,265 | | cc-by-3.0 | 4,717 | | cc-by-nc-sa | 4,339 | | mpl-2.0 | 3,847 | | cc-by-4.0 | 2,459 | | cc-by-nc-sa-4.0 | 1,715 | | cc-by-sa-4.0 | 1,701 | | bsd-2-clause | 1,599 | | cc-by-nc-nd-4.0 | 1,222 | | isc | 520 | | wtfpl | 274 | | cc-by-nc-4.0 | 122 | | cc-by-sa | 13 | | cc-by-sa-3.0 | 4 | ## Dataset Structure ### Data Fields | Field | Type | Description | |&amp;mdash;&amp;mdash;&amp;mdash;-|&amp;mdash;&amp;mdash;&amp;mdash;|&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;-| |  code  | string | Content of the source file (UTF-8 encoded) | |  repo_name  | string | Name of the JihuLab repository (format:  username/repo  or  group/subgroup/repo ) | |  path  | string | Path of the file within the repository (relative to repo root) | |  language  | string | Programming language as identified by [go-enry](https://github.com/go-enry/go-enry) | |  license  | string | License of the repository (SPDX identifier or "unknown") | |  size  | int64 | Size of the source file in bytes | ### Data Format - **Format**: Apache Parquet with Zstd compression - **File Structure**: Single consolidated file ( data.parquet ) ### Data Splits All examples are in the train split. There is no validation or test split. ### Example Data Point       code :  package com.example.demo;nnimport org.springframework.boot.*;nimport org.springframework.boot.autoconfigure.*;n... 
,  repo_name :  SmallQ/demo ,  path :  src/main/java/com/example/demo/DemoApplication.java ,  language :  Java ,  license :  unknown ,  size : 400      ## Dataset Creation ### Pipeline Overview The dataset was created through a multi-stage pipeline: 1. **Repository Discovery**: Paginated API requests to JihuLab's GitLab API ( /api/v4/projects ) to enumerate public repositories 2. **Branch Selection**: Using the repository's default branch (typically  main  or  master ) 3. **Repository Downloading**: Downloading repository archives via JihuLab's archive endpoint 4. **Content Extraction**: Extracting and filtering source code files 5. **Parquet Generation**: Writing filtered records to Parquet with Zstd compression ### Language Detection Programming languages are detected using [go-enry](https://github.com/go-enry/go-enry), a Go port of GitHub's Linguist library. Only files classified as **Programming** or **Markup** language types are included (Data and Prose types are excluded). ### License Detection Licenses are detected by: 1. Scanning for license files ( LICENSE ,  LICENSE.txt ,  LICENSE.md ,  COPYING , etc.) 2. Matching license text against known patterns (MIT, Apache 2.0, GPL variants, BSD, Creative Commons, etc.) 3. 
Defaulting to "unknown" if no license can be detected

**Blocked Licenses**: The following restrictive licenses are excluded from the dataset:
- `cc-by-nd`, `cc-by-nd-2.0`, `cc-by-nd-3.0`, `cc-by-nd-4.0` (Creative Commons No-Derivatives)
- `commons-clause`
- `sspl`, `sspl-1.0` (Server Side Public License)

### File Filtering

Extensive filtering is applied to ensure data quality:

#### Size Limits

| Limit | Value |
|-------|-------|
| Max repository ZIP size | 48 MB |
| Max single file size | 1 MB |
| Max line length | 1,000 characters |

#### Excluded Directories

- **Configuration**: `.git/`, `.github/`, `.gitlab/`, `.vscode/`, `.idea/`, `.vs/`, `.settings/`, `.eclipse/`, `.project/`, `.metadata/`
- **Vendor/Dependencies**: `node_modules/`, `bower_components/`, `jspm_packages/`, `vendor/`, `third_party/`, `3rdparty/`, `external/`, `packages/`, `deps/`, `lib/vendor/`, `target/dependency/`, `Pods/`
- **Build Output**: `build/`, `dist/`, `out/`, `bin/`, `target/`, `release/`, `debug/`, `.next/`, `.nuxt/`, `_site/`, `_build/`, `__pycache__/`, `.pytest_cache/`, `cmake-build-*`, `.gradle/`, `.maven/`

#### Excluded Files

- **Lock Files**: `package-lock.json`, `yarn.lock`, `pnpm-lock.yaml`, `Gemfile.lock`, `Cargo.lock`, `poetry.lock`, `Pipfile.lock`, `composer.lock`, `go.sum`, `mix.lock`
- **Minified Files**: Any file containing `.min.` in the name
- **Binary Files**: `.exe`, `.dll`, `.so`, `.dylib`, `.a`, `.lib`, `.o`, `.obj`, `.jar`, `.war`, `.ear`, `.class`, `.pyc`, `.pyo`, `.wasm`, `.bin`, `.dat`, `.pdf`, `.doc`, `.docx`, `.xls`, `.xlsx`, `.ppt`, `.pptx`, `.zip`, `.tar`, `.gz`, `.bz2`, `.7z`, `.rar`, `.jpg`, `.jpeg`, `.png`, `.gif`, `.bmp`, `.ico`, `.svg`, `.mp3`, `.mp4`, `.avi`, `.mov`, `.wav`, `.flac`, `.ttf`, `.otf`, `.woff`, `.woff2`, `.eot`
- **System Files**: `.DS_Store`, `Thumbs.db`

#### Content Filtering

- **UTF-8 Validation**: Files must be valid UTF-8 encoded text
- **Binary Detection**: Files detected as binary by go-enry are excluded
- **Generated Files**: Files with generation markers in the first 500 bytes are excluded: `generated by`, `do not edit`, `auto-generated`, `autogenerated`, `automatically generated`, `code generator`, `generated code`, `this file is generated`, `@generated`, `&lt;auto-generated`
- **Empty Files**: Files that are empty or contain only whitespace are excluded
- **Long Lines**: Files with any line exceeding 1,000 characters are excluded
- **go-enry Filters**: Additional filtering using go-enry's `IsVendor()`, `IsImage()`, `IsDotFile()`, `IsTest()`, and `IsGenerated()` functions
- **Documentation-only Repos**: Repositories containing only documentation files (no actual code) are skipped

### Source Data

All data originates from public repositories hosted on [JihuLab](https://jihulab.com).

## Considerations for Using the Data

### Personal and Sensitive Information

The dataset may contain:
- Email addresses in code comments or configuration files
- API keys or credentials that were accidentally committed
- Personal information in comments or documentation

Users should exercise caution and implement appropriate filtering when using this data.

### Licensing Information

This dataset is a collection of source code from repositories with various licenses. Any use of all or part of the code gathered in this dataset must abide by the terms of the original licenses, including attribution clauses when relevant. The `license` field in each data point indicates the license of the source repository.</description>
<size>1506303268</size>
</item><item>
<title>GitVerse Code Dataset</title>
<category>Dataset</category>
<infohash>8d4375542da8412f19d7d867413e3c29eeb6f4a0</infohash>
<guid>https://academictorrents.com/details/8d4375542da8412f19d7d867413e3c29eeb6f4a0</guid>
<link>https://academictorrents.com/details/8d4375542da8412f19d7d867413e3c29eeb6f4a0</link>
<description># GitVerse Code Dataset

## Dataset Description

This dataset was compiled from code repositories hosted on [GitVerse](https://gitverse.ru), a Russian code hosting platform and an alternative to GitHub in the Russian developer community. GitVerse is used by Russian developers, enterprises, and open-source projects, making this dataset particularly valuable for training code models with Russian language understanding and Russian coding conventions.

### Dataset Summary

| Statistic | Value |
|-----------|-------|
| **Total Files** | 2,802,994 |
| **Total Repositories** | 9,014 |
| **Total Size** | 2 GB (compressed Parquet) |
| **Programming Languages** | 416 |
| **File Format** | Parquet (single file) |

### Key Features

- **Russian code corpus**: Contains code from over 9,000 repositories, many featuring Russian comments, documentation, and variable names
- **Diverse language coverage**: Spans 416 programming languages identified by [github-linguist](https://github.com/github-linguist/linguist)

### Languages

The dataset includes 416 programming languages.
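The per-language file counts can be recomputed directly from `data.parquet`. A minimal sketch, assuming pandas with a parquet engine is installed; the small in-memory frame here stands in for the real shard:

```python
import pandas as pd

# In practice, load the dataset shard directly:
#   df = pd.read_parquet("data.parquet")
# A tiny in-memory sample with the same three fields stands in here.
df = pd.DataFrame({
    "file_text": ["int main(void) { return 0; }", "print('hi')", "int x = 1;"],
    "language": ["C", "Python", "C"],
    "file_name": ["a.c", "b.py", "c.c"],
})

# Per-language file counts, descending; run over the full dataset,
# this is how a ranking like the one below is produced.
counts = df["language"].value_counts()
print(counts.head(2))
```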
The top 30 languages by file count:

| Rank | Language | File Count |
|------|----------|------------|
| 1 | C | 580,713 |
| 2 | JavaScript | 275,744 |
| 3 | C++ | 197,896 |
| 4 | Shell | 166,527 |
| 5 | Python | 116,065 |
| 6 | Markdown | 112,811 |
| 7 | TypeScript | 107,867 |
| 8 | Java | 88,429 |
| 9 | PHP | 80,341 |
| 10 | Makefile | 77,619 |
| 11 | XML | 75,320 |
| 12 | Go | 69,155 |
| 13 | C# | 68,185 |
| 14 | Text | 65,677 |
| 15 | JSON | 64,253 |
| 16 | SVG | 58,107 |
| 17 | HTML | 43,261 |
| 18 | YAML | 40,178 |
| 19 | Unity3D Asset | 33,917 |
| 20 | Rust | 32,872 |
| 21 | LLVM | 29,819 |
| 22 | Unix Assembly | 27,672 |
| 23 | Roff | 25,884 |
| 24 | CSS | 21,809 |
| 25 | TSX | 21,637 |
| 26 | reStructuredText | 19,683 |
| 27 | Perl | 18,576 |
| 28 | Gettext Catalog | 17,071 |
| 29 | Diff | 14,225 |
| 30 | CMake | 14,132 |

## Dataset Structure

### Data Fields

| Field | Type | Description |
|-------|------|-------------|
| `file_text` | string | The full text content of the file (UTF-8 encoded) |
| `language` | string | Programming language as identified by [github-linguist](https://github.com/github-linguist/linguist) |
| `file_name` | string | A unique identifier for the file within the dataset |

### Data Format

- **Format**: Apache Parquet
- **File Structure**: Single file (`data.parquet`)

### Data Splits

All examples are in the train split. There is no validation or test split.

### Example Data Point

- `file_text`: `"Процедура ОбработкаПроведения(Отказ, Режим)\n\t// Нерабочий вариант без ошибок\n..."`
- `language`: `1C Enterprise`
- `file_name`: `004_work.code.bsl`

## Dataset Creation

### Language Detection

Programming languages are detected using [github-linguist](https://github.com/github-linguist/linguist), GitHub's library for language detection and syntax highlighting.

### Source Data

All data originates from public repositories hosted on [GitVerse](https://gitverse.ru).

## Considerations for Using the Data

### Personal and Sensitive Information

The dataset may contain:
- Email addresses in code comments or configuration files
- API keys or credentials that were accidentally committed
- Personal information in comments or documentation

Users should exercise caution and implement appropriate filtering when using this data.

### Licensing Information

This dataset is a collection of source code from repositories with various licenses. Any use of all or part of the code gathered in this dataset must abide by the terms of the original licenses, including attribution clauses when relevant. Users of this dataset should respect the rights of the authors and use the data responsibly.</description>
<size>2113649354</size>
</item><item>
<title>GitFlic Code Dataset</title>
<category>Dataset</category>
<infohash>ac3849c0ddceb432d9733ed19a000befe88e1e82</infohash>
<guid>https://academictorrents.com/details/ac3849c0ddceb432d9733ed19a000befe88e1e82</guid>
<link>https://academictorrents.com/details/ac3849c0ddceb432d9733ed19a000befe88e1e82</link>
<description># GitFlic Code Dataset

## Dataset Description

This dataset was compiled from code repositories hosted on [GitFlic](https://gitflic.ru), the first Russian service for storing and working with source code, based on the Git version control system. GitFlic is widely used by Russian developers, enterprises, and open-source projects, making this dataset particularly valuable for training code models with strong Russian language understanding and Russian coding conventions.

### Dataset Summary

| Statistic | Value |
|-----------|-------|
| **Total Files** | 5,975,978 |
| **Total Repositories** | 12,527 |
| **Total Size** | 6.44 GB (compressed Parquet) |
| **Programming Languages** | 690 |
| **File Format** | Parquet with Zstd compression (6 files) |

### Key Features

- **Russian code corpus**: Contains code from over 12,000 repositories, many featuring Russian comments, documentation, and variable names
- **Diverse language coverage**: Spans 690 programming languages identified by [github-linguist](https://github.com/github-linguist/linguist)
- **Deduplicated**: The dataset has been deduplicated and filtered to remove binary files
- **Quality filtered**: Filtered to ensure data quality and remove non-code content

### Languages

The dataset includes 690 programming languages. The top 30 languages by file count:

| Rank | Language | File Count |
|------|----------|------------|
| 1 | C | 739,012 |
| 2 | Java | 634,899 |
| 3 | C++ | 587,528 |
| 4 | JavaScript | 422,832 |
| 5 | PHP | 365,105 |
| 6 | XML | 291,920 |
| 7 | Markdown | 211,574 |
| 8 | Shell | 207,178 |
| 9 | Python | 206,443 |
| 10 | Unity3D Asset | 150,654 |
| 11 | SVG | 150,136 |
| 12 | TypeScript | 141,886 |
| 13 | Text | 139,406 |
| 14 | JSON | 126,214 |
| 15 | HTML | 122,341 |
| 16 | Go | 109,740 |
| 17 | YAML | 89,416 |
| 18 | Roff | 82,609 |
| 19 | C# | 77,520 |
| 20 | Makefile | 63,594 |
| 21 | LLVM | 55,680 |
| 22 | Scala | 53,395 |
| 23 | Unix Assembly | 49,909 |
| 24 | Rust | 35,553 |
| 25 | reStructuredText | 35,023 |
| 26 | Objective-C | 34,151 |
| 27 | Ruby | 33,366 |
| 28 | CMake | 33,030 |
| 29 | CSS | 31,664 |
| 30 | TSX | 31,397 |

## Dataset Structure

### Data Fields

| Field | Type | Description |
|-------|------|-------------|
| `file_text` | string | Content of the source file (UTF-8 encoded) |
| `language` | string | Programming language as identified by [github-linguist](https://github.com/github-linguist/linguist) |
| `file_name` | string | A unique identifier for the file within the dataset |

### Data Format

- **Format**: Apache Parquet with Zstd compression
- **File Structure**: 6 files (`gitflic-00000.parquet` to `gitflic-00005.parquet`)

### Data Splits

All examples are in the train split. There is no validation or test split.

### Example Data Point

- `file_text`: `"package com.example.demo;\n\nimport org.springframework.boot.SpringApplication;\n..."`
- `language`: `Java`
- `file_name`: `Application.java`

## Dataset Creation

### Language Detection

Programming languages are detected using [github-linguist](https://github.com/github-linguist/linguist), GitHub's library for language detection.

### Source Data

All data originates from public repositories hosted on [GitFlic](https://gitflic.ru).

## Considerations for Using the Data

### Personal and Sensitive Information

The dataset may contain:
- Email addresses in code comments or configuration files
- API keys or credentials that were accidentally committed
- Personal information in comments or documentation

Users should exercise caution and implement appropriate filtering when using this data.

### Licensing Information

This dataset is a collection of source code from repositories with various licenses. Any use of all or part of the code gathered in this dataset must abide by the terms of the original licenses, including attribution clauses when relevant. Users of this dataset should respect the rights of the authors and use the data responsibly.</description>
<size>6911382788</size>
</item><item>
<title>Gitee Code Dataset</title>
<category>Dataset</category>
<infohash>e572ddd8459e96ed50ba40f1ee991734805f2259</infohash>
<guid>https://academictorrents.com/details/e572ddd8459e96ed50ba40f1ee991734805f2259</guid>
<link>https://academictorrents.com/details/e572ddd8459e96ed50ba40f1ee991734805f2259</link>
<description># Gitee Code Dataset

## Dataset Description

This dataset was compiled from code repositories hosted on [Gitee](https://gitee.com), China's largest code hosting platform and a leading alternative to GitHub in the Chinese developer community. Gitee is widely used by Chinese developers, enterprises, and open-source projects, making this dataset particularly valuable for training code models with strong Chinese language understanding and Chinese coding conventions.

### Dataset Summary

| Statistic | Value |
|-----------|-------|
| **Total Files** | 819,472,785 |
| **Total Repositories** | 3,105,923 |
| **Total Size** | 536 GB (compressed Parquet) |
| **Programming Languages** | 554 |
| **File Format** | Parquet with Zstd compression (468 files) |

### Key Features

- **Large-scale Chinese code corpus**: Contains code from over 3 million repositories, many featuring Chinese comments, documentation, and variable names
- **Diverse language coverage**: Spans 554 programming languages identified by [go-enry](https://github.com/go-enry/go-enry) (based on GitHub Linguist rules)
- **Rich metadata**: Includes repository name, file path, detected language, license information, and file size
- **Enterprise and open-source projects**: Includes code from both individual developers and Chinese enterprises
- **Quality filtered**: Extensive filtering to remove vendor code, build artifacts, generated files, and low-quality content

### Languages

The dataset includes 554 programming languages. The top 30 languages by file count:

| Rank | Language | File Count |
|------|----------|------------|
| 1 | Java | 293,439,777 |
| 2 | JavaScript | 77,715,425 |
| 3 | C | 62,836,721 |
| 4 | C++ | 49,134,251 |
| 5 | HTML | 46,191,063 |
| 6 | Vue | 40,468,646 |
| 7 | PHP | 37,132,954 |
| 8 | C# | 33,842,369 |
| 9 | Python | 25,192,704 |
| 10 | CSS | 20,802,464 |
| 11 | TypeScript | 20,122,528 |
| 12 | Go | 16,176,561 |
| 13 | Shell | 8,371,429 |
| 14 | Makefile | 6,341,964 |
| 15 | Java Server Pages | 6,224,523 |
| 16 | TSX | 5,768,542 |
| 17 | CMake | 5,581,774 |
| 18 | SCSS | 5,291,031 |
| 19 | Objective-C | 4,922,736 |
| 20 | Less | 4,669,672 |
| 21 | Ruby | 3,027,385 |
| 22 | Kotlin | 2,986,211 |
| 23 | Scala | 2,869,640 |
| 24 | Rust | 2,466,122 |
| 25 | Starlark | 2,027,514 |
| 26 | Dart | 2,010,079 |
| 27 | Unix Assembly | 1,900,320 |
| 28 | Fluent | 1,882,380 |
| 29 | HTML+Razor | 1,863,914 |
| 30 | Swift | 1,607,477 |

### Licenses

The dataset includes files from repositories with various licenses. Repositories with restrictive licenses (CC-BY-ND variants, Commons Clause, SSPL) were excluded:

| License | File Count |
|---------|------------|
| apache-2.0 | 273,706,950 |
| mit | 201,880,040 |
| unknown | 195,868,240 |
| agpl-3.0 | 60,181,320 |
| bsd | 30,013,190 |
| gpl-2.0 | 27,831,530 |
| lgpl-3.0 | 11,746,750 |
| lgpl-2.1 | 4,807,600 |
| bsd-3-clause | 4,442,480 |
| cc0-1.0 | 3,144,920 |
| gpl-3.0 | 1,631,590 |
| unlicense | 1,181,930 |
| bsd-2-clause | 1,154,300 |
| epl-1.0 | 1,045,470 |
| Other licenses | ~5,800,000 |

## Dataset Structure

### Data Fields

| Field | Type | Description |
|-------|------|-------------|
| `code` | string | Content of the source file (UTF-8 encoded) |
| `repo_name` | string | Name of the Gitee repository (format: `username/repo`) |
| `path` | string | Path of the file within the repository (relative to repo root) |
| `language` | string | Programming language as identified by [go-enry](https://github.com/go-enry/go-enry) |
| `license` | string | License of the repository (SPDX identifier or "unknown") |
| `size` | int64 | Size of the source file in bytes |

### Data Format

- **Format**: Apache Parquet with Zstd compression
- **File Structure**: 468 files (`gitee_0000.parquet` to `gitee_0467.parquet`)

### Data Splits

All examples are in the train split. There is no validation or test split.

### Example Data Point

- `code`: `"package com.example.demo;\n\nimport org.springframework.boot.SpringApplication;\n..."`
- `repo_name`: `username/spring-demo`
- `path`: `src/main/java/com/example/demo/Application.java`
- `language`: `Java`
- `license`: `apache-2.0`
- `size`: `1234`

## Dataset Creation

### Pipeline Overview

The dataset was created through a multi-stage pipeline:

1. **Repository Discovery**
2. **Branch Selection**: Selecting the main branch for each repository (priority: `master` &gt; `main` &gt; `develop` &gt; `dev` &gt; first branch)
3. **Repository Downloading**
4. **Content Extraction**: Extracting and filtering source code files
5. **Parquet Generation**: Writing filtered records to Parquet shards with Zstd compression

### Language Detection

Programming languages are detected using [go-enry](https://github.com/go-enry/go-enry), a Go port of GitHub's Linguist library. Only files classified as **Programming** or **Markup** language types are included (Data and Prose types are excluded).

### License Detection

Licenses are detected by:

1. Scanning for license files (`LICENSE`, `LICENSE.txt`, `LICENSE.md`, `COPYING`, etc.)
2. Matching license text against known patterns (MIT, Apache 2.0, GPL variants, BSD, Creative Commons, etc.)
3. Defaulting to "unknown" if no license can be detected

**Blocked Licenses**: The following restrictive licenses are excluded from the dataset:
- `cc-by-nd`, `cc-by-nd-2.0`, `cc-by-nd-3.0`, `cc-by-nd-4.0` (Creative Commons No-Derivatives)
- `commons-clause`
- `sspl`, `sspl-1.0` (Server Side Public License)

### File Filtering

Extensive filtering is applied to ensure data quality:

#### Size Limits

| Limit | Value |
|-------|-------|
| Max repository ZIP size | 48 MB |
| Max single file size | 1 MB |
| Max line length | 1,000 characters |

#### Excluded Directories

- **Configuration**: `.git/`, `.github/`, `.gitlab/`, `.vscode/`, `.idea/`, `.vs/`, `.settings/`, `.eclipse/`, `.project/`, `.metadata/`
- **Vendor/Dependencies**: `node_modules/`, `bower_components/`, `jspm_packages/`, `vendor/`, `third_party/`, `3rdparty/`, `external/`, `packages/`, `deps/`, `lib/vendor/`, `target/dependency/`, `Pods/`
- **Build Output**: `build/`, `dist/`, `out/`, `bin/`, `target/`, `release/`, `debug/`, `.next/`, `.nuxt/`, `_site/`, `_build/`, `__pycache__/`, `.pytest_cache/`, `cmake-build-*`, `.gradle/`, `.maven/`

#### Excluded Files

- **Lock Files**: `package-lock.json`, `yarn.lock`, `pnpm-lock.yaml`, `Gemfile.lock`, `Cargo.lock`, `poetry.lock`, `Pipfile.lock`, `composer.lock`, `go.sum`, `mix.lock`
- **Minified Files**: Any file containing `.min.` in the name
- **Binary Files**: `.exe`, `.dll`, `.so`, `.dylib`, `.a`, `.lib`, `.o`, `.obj`, `.jar`, `.war`, `.ear`, `.class`, `.pyc`, `.pyo`, `.wasm`, `.bin`, `.dat`, `.pdf`, `.doc`, `.docx`, `.xls`, `.xlsx`, `.ppt`, `.pptx`, `.zip`, `.tar`, `.gz`, `.bz2`, `.7z`, `.rar`, `.jpg`, `.jpeg`, `.png`, `.gif`, `.bmp`, `.ico`, `.svg`, `.mp3`, `.mp4`, `.avi`, `.mov`, `.wav`, `.flac`, `.ttf`, `.otf`, `.woff`, `.woff2`, `.eot`
- **System Files**: `.DS_Store`, `Thumbs.db`

#### Content Filtering

- **UTF-8 Validation**: Files must be valid UTF-8 encoded text
- **Binary Detection**: Files detected as binary by go-enry are excluded
- **Generated Files**: Files with generation markers in the first 500 bytes are excluded: `generated by`, `do not edit`, `auto-generated`, `autogenerated`, `automatically generated`, `code generator`, `generated code`, `this file is generated`, `@generated`, `&lt;auto-generated`
- **Empty Files**: Files that are empty or contain only whitespace are excluded
- **Long Lines**: Files with any line exceeding 1,000 characters are excluded
- **go-enry Filters**: Additional filtering using go-enry's `IsVendor()`, `IsImage()`, `IsDotFile()`, `IsTest()`, and `IsGenerated()` functions
- **Documentation-only Repos**: Repositories containing only documentation files (no actual code) are skipped

### Source Data

All data originates from public repositories hosted on [Gitee](https://gitee.com).
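The content-filtering rules above can be sketched in a few lines. This is an illustrative Python version, not the actual pipeline (which is Go-based and built on go-enry); `keep_file` is a hypothetical helper name:

```python
# Illustrative sketch of the content filters: UTF-8 validation,
# empty-file check, generation-marker scan, and long-line limit.
GENERATION_MARKERS = (
    b"generated by", b"do not edit", b"auto-generated", b"autogenerated",
    b"automatically generated", b"code generator", b"generated code",
    b"this file is generated", b"@generated",
    b"\x3cauto-generated",  # "\x3c" is "<", escaped for embedding here
)
MAX_LINE_LENGTH = 1000

def keep_file(raw: bytes) -> bool:
    """Return True if the raw file contents pass the content filters."""
    # UTF-8 validation: reject files that are not valid UTF-8 text.
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    # Reject empty or whitespace-only files.
    if not text.strip():
        return False
    # Scan the first 500 bytes for generation markers (case-insensitive).
    head = raw[:500].lower()
    if any(marker in head for marker in GENERATION_MARKERS):
        return False
    # Reject files with any line exceeding the length limit.
    if any(len(line) > MAX_LINE_LENGTH for line in text.splitlines()):
        return False
    return True

print(keep_file(b"int main(void) { return 0; }"))              # True
print(keep_file(b"// Code generated by protoc. DO NOT EDIT.")) # False
```

The binary-detection, vendor, and test-file filters are delegated to go-enry in the real pipeline and are omitted from this sketch.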
## Considerations for Using the Data

### Personal and Sensitive Information

The dataset may contain:
- Email addresses in code comments or configuration files
- API keys or credentials that were accidentally committed
- Personal information in comments or documentation

Users should exercise caution and implement appropriate filtering when using this data.

### Licensing Information

This dataset is a collection of source code from repositories with various licenses. Any use of all or part of the code gathered in this dataset must abide by the terms of the original licenses, including attribution clauses when relevant. The `license` field in each data point indicates the license of the source repository.</description>
<size>574748274837</size>
</item><item>
<title>GitCode Code Dataset</title>
<category>Dataset</category>
<infohash>c1dd455d6f9bbeb7f1e88de92a952df5e04f64c4</infohash>
<guid>https://academictorrents.com/details/c1dd455d6f9bbeb7f1e88de92a952df5e04f64c4</guid>
<link>https://academictorrents.com/details/c1dd455d6f9bbeb7f1e88de92a952df5e04f64c4</link>
<description># GitCode Code Dataset

## Dataset Description

This dataset was compiled from code repositories hosted on [GitCode](https://gitcode.com), a code hosting platform in China backed by CSDN (China Software Developer Network). GitCode serves as a domestic alternative to GitHub, widely used by Chinese developers, students, and enterprises for hosting open-source projects and educational resources, making this dataset particularly valuable for training code models with Chinese language understanding and Chinese coding conventions.

### Dataset Summary

| Statistic | Value |
|-----------|-------|
| **Total Files** | 48,142,567 |
| **Total Repositories** | 85,632 |
| **Total Size** | 40 GB (compressed Parquet) |
| **Programming Languages** | 537 |
| **File Format** | Parquet with Zstd compression (34 files) |

### Key Features

- **Chinese code corpus**: Contains code from over 85,000 repositories, many featuring Chinese comments, documentation, and variable names
- **Diverse language coverage**: Spans 537 programming languages identified by [go-enry](https://github.com/go-enry/go-enry) (based on GitHub Linguist rules)
- **Rich metadata**: Includes repository name, file path, detected language, license information, and file size
- **Open-source and educational projects**: Includes code from individual developers, students, and Chinese enterprises
- **Quality filtered**: Extensive filtering to remove vendor code, build artifacts, generated files, and low-quality content

### Languages

The dataset includes 537 programming languages. The top 30 languages by file count:

| Rank | Language | File Count |
|------|----------|------------|
| 1 | C++ | 9,513,619 |
| 2 | C | 8,220,317 |
| 3 | Java | 5,362,924 |
| 4 | Python | 3,428,302 |
| 5 | TypeScript | 3,166,959 |
| 6 | JavaScript | 2,540,280 |
| 7 | HTML | 1,578,824 |
| 8 | Kotlin | 1,413,651 |
| 9 | C# | 1,232,638 |
| 10 | Go | 1,159,708 |
| 11 | Rust | 812,959 |
| 12 | Dart | 767,731 |
| 13 | TSX | 749,355 |
| 14 | PHP | 663,953 |
| 15 | Shell | 629,436 |
| 16 | Vue | 563,754 |
| 17 | Makefile | 471,588 |
| 18 | CMake | 460,428 |
| 19 | CSS | 381,628 |
| 20 | Ruby | 350,213 |
| 21 | Objective-C | 347,251 |
| 22 | LLVM | 297,591 |
| 23 | Unix Assembly | 291,826 |
| 24 | Swift | 206,725 |
| 25 | Objective-C++ | 160,526 |
| 26 | Scala | 157,367 |
| 27 | QML | 157,088 |
| 28 | Lua | 149,114 |
| 29 | SCSS | 141,661 |
| 30 | GLSL | 129,124 |

### Licenses

The dataset includes files from repositories with various licenses. Repositories with restrictive licenses (CC-BY-ND variants, Commons Clause, SSPL) were excluded:

| License | File Count |
|---------|------------|
| unknown | 23,567,463 |
| apache-2.0 | 8,722,445 |
| mit | 7,743,613 |
| gpl-2.0 | 3,528,526 |
| agpl-3.0 | 2,300,580 |
| lgpl | 1,013,654 |
| bsd-3-clause | 528,980 |
| gpl-3.0 | 305,332 |
| public-domain | 163,493 |
| bsd-2-clause | 94,426 |
| bsd | 69,967 |
| isc | 36,117 |
| unlicense | 28,411 |
| cc0-1.0 | 26,799 |
| mpl-2.0 | 9,459 |
| Other licenses | ~5,000 |

## Dataset Structure

### Data Fields

| Field | Type | Description |
|-------|------|-------------|
| `code` | string | Content of the source file (UTF-8 encoded) |
| `repo_name` | string | Name of the GitCode repository (format: `username/repo`) |
| `path` | string | Path of the file within the repository (relative to repo root) |
| `language` | string | Programming language as identified by [go-enry](https://github.com/go-enry/go-enry) |
| `license` | string | License of the repository (SPDX identifier or "unknown") |
| `size` | int64 | Size of the source file in bytes |

### Data Format

- **Format**: Apache Parquet with Zstd compression
- **File Structure**: 34 files (`gitcode_0000.parquet` to `gitcode_0033.parquet`)

### Data Splits

All examples are in the train split. There is no validation or test split.

### Example Data Point</description>
<size>42220690870</size>
</item><item>
<title>Russian QnA 333K</title>
<category>Dataset</category>
<infohash>8282e7191ddec974eb94c56ce83d424cc8184204</infohash>
<guid>https://academictorrents.com/details/8282e7191ddec974eb94c56ce83d424cc8184204</guid>
<link>https://academictorrents.com/details/8282e7191ddec974eb94c56ce83d424cc8184204</link>
<description># Dataset Card for Russian QnA

### Dataset Summary

This dataset contains a collection of questions and answers in Russian. The dataset includes questions across various categories with corresponding answers, ratings, and metadata.

### Languages

The dataset content is primarily in Russian:
- Russian (ru)

## Dataset Structure

### Data Files

- Single file containing all Q&amp;A records: `data.parquet`

### Data Fields

Each record contains the following fields:
- `question_id`: Unique identifier for the question.
- `question_title`: Title/subject of the question.
- `question_description`: Extended description or body of the question.
- `question_images`: Array of image URLs associated with the question.
- `category`: Category/topic area of the question (e.g., "здоровье и медицина", i.e. "health and medicine").
- `tags`: Array of tags associated with the question.
- `question_rating`: Rating/score of the question.
- `answers`: Array of answer objects, each containing:
  - `answer_text`: Text content of the answer
  - `answer_images`: Array of image URLs in the answer
  - `answer_rating`: Rating/score of the answer

### Data Splits

The dataset contains a single split with all Q&amp;A records:

| Split | Description | Number of Examples |
| :---- | :----------------------- | -----------------: |
| `train` | All question-answer pairs | 333,029 |</description>
<size>225873026</size>
</item></channel>
</rss>
