NIH Chest X-ray Dataset of 14 Common Thorax Disease Categories
National Institutes of Health - Clinical Center

folder ChestXray-NIHCC (15 files)
fileBBox_List_2017.csv 92.42kB
fileData_Entry_2017.csv 8.66MB
fileimages_001.tar.gz 2.01GB
fileimages_002.tar.gz 3.95GB
fileimages_003.tar.gz 3.93GB
fileimages_004.tar.gz 3.84GB
fileimages_005.tar.gz 3.94GB
fileimages_006.tar.gz 3.99GB
fileimages_007.tar.gz 4.02GB
fileimages_008.tar.gz 4.02GB
fileimages_009.tar.gz 4.11GB
fileimages_010.tar.gz 4.18GB
fileimages_011.tar.gz 4.19GB
fileimages_012.tar.gz 2.91GB
fileREADME_ChestXray.pdf 846.74kB
Type: Dataset

title= {NIH Chest X-ray Dataset of 14 Common Thorax Disease Categories},
journal= {},
author= {National Institutes of Health - Clinical Center},
year= {},
url= {},
abstract= {![](

(1, Atelectasis; 2, Cardiomegaly; 3, Effusion; 4, Infiltration; 5, Mass; 6, Nodule; 7, Pneumonia; 8, Pneumothorax; 9, Consolidation; 10, Edema; 11, Emphysema; 12, Fibrosis; 13, Pleural_Thickening; 14 Hernia) 

### Background & Motivation: 
Chest X-ray exam is one of the most frequent and cost-effective medical imaging examination. However clinical diagnosis of chest X-ray can be challenging, and sometimes believed to be harder than diagnosis via chest CT imaging. Even some promising work have been reported in the past, and especially in recent deep learning work on Tuberculosis (TB) classification. To achieve clinically relevant computer-aided detection and diagnosis (CAD) in real world medical sites on all data settings of chest X-rays is still very difficult, if not impossible when only several thousands of images are employed for study. This is evident from [2] where the performance deep neural networks for thorax disease recognition is severely limited by the availability of only 4143 frontal view images [3] (Openi is the previous largest publicly available chest X-ray dataset to date).

In this database, we provide an enhanced version (with 6 more disease categories and more images as well) of the dataset used in the recent work [1] which is approximately 27 times of the number of frontal chest x-ray images in [3]. Our dataset is extracted from the clinical PACS database at National Institutes of Health Clinical Center and consists of ~60% of all frontal chest x-rays in the hospital. Therefore we expect this dataset is significantly more representative to the real patient population distributions and realistic clinical diagnosis challenges, than any previous chest x-ray datasets. Of course, the size of our dataset, in terms of the total numbers of images and thorax disease frequencies, would better facilitate deep neural network training [2]. Refer to [1] on the details of how the dataset is extracted and image labels are mined through natural language processing (NLP).

### Details:
ChestX-ray dataset comprises 112,120 frontal-view X-ray images of 30,805 unique patients with the text-mined fourteen disease image labels (where each image can have multi-labels), mined from the associated radiological reports using natural language processing. Fourteen common thoracic pathologies include Atelectasis, Consolidation, Infiltration, Pneumothorax, Edema, Emphysema, Fibrosis, Effusion, Pneumonia, Pleural_thickening, Cardiomegaly, Nodule, Mass and Hernia, which is an extension of the 8 common disease patterns listed in our CVPR2017 paper. Note that original radiology reports (associated with these chest x-ray studies) are not meant to be publicly shared for many reasons. The text-mined disease labels are expected to have accuracy >90%.Please find more details and benchmark performance of trained models based on 14 disease labels in our arxiv paper:

### Contents:
1. 112,120 frontal-view chest X-ray PNG images in 1024*1024 resolution (under images folder)
2. Meta data for all images (Data_Entry_2017.csv): Image Index, Finding Labels, Follow-up #, Patient ID, Patient Age, Patient Gender, View Position, Original Image Size and Original Image Pixel Spacing.
3. Bounding boxes for ~1000 images (BBox_List_2017.csv):Image Index, Finding Label, Bbox[x, y, w, h]. [x y] are coordinates of each box's topleft corner. [w h] represent the width and height of each box.

If you find the dataset useful for your research projects, please cite our CVPR 2017 paper:Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, MohammadhadiBagheri, Ronald M. Summers.ChestX-ray8: Hospital-scale Chest X-ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases, IEEE CVPR, pp. 3462-3471,2017

@InProceedings{wang2017chestxray,author    = {Wang, Xiaosong and Peng, Yifan and Lu, Le and Lu, Zhiyong and Bagheri, Mohammadhadi and Summers, Ronald},
title = {ChestX-ray8: Hospital-scale Chest X-ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases},
booktitle = {2017 IEEE Conference on Computer Vision and Pattern Recognition(CVPR)},
pages     = {3462--3471},
year      = {2017}}

### Questions/Comments:

### Limitations:
1. The image labels are NLP extracted so there would be some erroneous labels but the NLP labelling accuracy is estimated to be >90%. 
2. Very limited numbers of disease region bounding boxes. 
3. Chest x-ray radiology reports are not anticipated to be publicly shared. Parties who use this public dataset are encouraged to share their “updated” image labels and/or new bounding boxes in their own studied later, maybe through manual annotation.

### Acknowledgement:
This work was supported by the Intramural Research Program of the NIH Clinical Center ( and National Library of Medicine ( We thank NVIDIA Corporation for the GPU donations.

### Reference:
[1] Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, MohammadhadiBagheri, Ronald Summers, ChestX-ray8: Hospital-scale Chest X-ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common ThoraxDiseases, IEEE CVPR, pp. 3462-3471,2017

[2] Hoo-chang Shin, Kirk Roberts, Le Lu, Dina Demner-Fushman, Jianhua Yao, Ronald M. Summers, Learning to Read Chest X-Rays: Recurrent Neural CascadeModel for Automated Image Annotation, IEEE CVPR, pp. 2497-2506, 2016

[3] Open-i: An open access biomedical search engine. https: //

keywords= {},
terms= {},
license= {"The usage of the data set is unrestricted"},
superseded= {}

Send Feedback