NIH RePORTER 25-04-25
National Institutes of Health

folder nih-reporter-exporter-25-04-25 (456 files)
ClinicalStudies.csv 4.57MB
databook/career-development.zip 49.25kB
databook/data-by-disability.zip 4.41kB
databook/data-by-ethnicity.zip 4.42kB
databook/data-by-race.zip 4.65kB
databook/data-by-sex.zip 54.27kB
databook/extramural-demographics.zip 42.99kB
databook/extramural-overview.zip 52.67kB
databook/funding-priorities.zip 6.43kB
databook/funding-rates.zip 19.72kB
databook/historical/NDB_2008_Final.pdf 8.65MB
databook/historical/NDB_2008_Final.ppt 7.59MB
databook/historical/NDB_2009_Final.pdf 14.71MB
databook/historical/NDB_2009_Final.ppt 10.49MB
databook/historical/NDB_2010_Final.pdf 13.31MB
databook/historical/NDB_2010_Final.ppt 13.14MB
databook/historical/NDB_2011_Final.pdf 13.51MB
databook/historical/NDB_2011_Final.ppt 17.56MB
databook/historical/NDB_2012_Final.pdf 13.65MB
databook/historical/NDB_2012_Final.ppt 24.15MB
databook/historical/NDB_2013_Final.pdf 14.59MB
databook/historical/NDB_2013_Final.ppt 25.15MB
databook/historical/NDB_2014_Final.pdf 6.49MB
databook/historical/NDB_2014_Final.ppt 26.98MB
databook/historical/NDB_2015_Final.pdf 8.71MB
databook/historical/NDB_2015_Final.pptx 11.40MB
databook/historical/NDB_2016_Final.pdf 16.34MB
databook/historical/NDB_2016_Final.pptx 10.02MB
databook/historical/NDB_2017_Final.pdf 7.15MB
databook/historical/NDB_2017_Final.pptx 7.78MB
databook/historical/NEDB_Core_Deck.pdf 6.37MB
databook/historical/NEDB_Core_Deck.ppt 14.97MB
databook/intramural-data-book.zip 45.82kB
databook/investigator-career-stage.zip 24.65kB
databook/national-statistics-on-graduate-students.zip 41.27kB
databook/national-statistics-on-postdoctorates.zip 18.34kB
databook/nih-budget-history.zip 24.15kB
databook/nih-peer-review.zip 28.00kB
databook/overview.zip 9.92kB
databook/ph-d-recipients.zip 32.91kB
databook/r01-equivalent-grants.zip 28.73kB
databook/research-center-grants.zip 14.22kB
databook/research-grants.zip 95.27kB
databook/research-project-grants.zip 52.19kB
databook/research-training-grants-and-fellowships.zip 43.80kB
databook/research-training-training-grants.zip 60.68kB
databook/small-business-research-sbir-sttr.zip 41.66kB
databook/success-rates-non-research-project-grants.zip 45.20kB
databook/success-rates-r01-equivalent-and-research-project-grants.zip 35.77kB
Type: Dataset
Tags: Research, bibliometrics, NIH, scholarly-publishing, grants, publications, patents, funding, demographics, clinical-trials

Bibtex:
@article{,
title= {NIH RePORTER 25-04-25},
journal= {},
author= {National Institutes of Health},
year= {},
url= {https://sciop.net/datasets/nih-reporter},
abstract= {RePORTER is the NIH's database of its grants, awards, and publications.

> In addition to carrying out its scientific mission, NIH exemplifies and promotes the highest level of public accountability. To that end, the Research Portfolio Online Reporting Tools (RePORT) website provides access to reports, data, and analyses of NIH research activities, including information on NIH expenditures and the results of NIH-supported research.
> 
> One of the tools available on the RePORT website is the RePORTER (RePORT Expenditures and Results) module. RePORTER is an electronic tool that allows users to search a repository of both intramural and extramural NIH-funded research projects and access publications and patents resulting from NIH funding.
> 
> In addition to RePORTER, the RePORT website also contains other tools that provide access to reports and summary statistics on NIH funding and the organizations and people involved in NIH research and training. One of these tools is the NIH Data Book, which summarizes the most commonly asked questions about the NIH budget and extramural programs. Another tool is called Awards by Location, which summarizes NIH awards for a particular fiscal year by the location and organization of the awardees.

This upload also contains the [NIH databook](https://report.nih.gov/nihdatabook), exported to xlsx and csv.

## Threat

This data might seem very low risk, since it is just uncontroversial information about what the NIH has funded in the past, but the Trump administration has moved to clear records from the NSF's systems, so one can expect RePORTER data to follow soon after.

Initially set on 'watchlist' because no specific threat has been identified.

## Method

### ExPORTER

Data was acquired from NIH's ExPORTER site: https://reporter.nih.gov/exporter/

The page does not contain regular links with download URLs; instead, the table is loaded dynamically and the download links are opaque, single-use query-parameter blobs.

A simple Playwright script captures the data, and pandas is used to extract the data dictionaries:

```python
from pathlib import Path

import pandas as pd
from playwright.sync_api import sync_playwright
from tqdm import trange

TABS = [
    "https://reporter.nih.gov/exporter/projects",
    "https://reporter.nih.gov/exporter/abstracts",
    "https://reporter.nih.gov/exporter/publications",
    "https://reporter.nih.gov/exporter/patents",
    "https://reporter.nih.gov/exporter/clinicalstudies",
    "https://reporter.nih.gov/exporter/linktables"
]

def scrape_reporter():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        for tab in TABS:
            page.goto(tab)
            print(tab)
            # wait for the table within the page to load
            page.wait_for_load_state("networkidle")
            # this really actually is the best way to find the download buttons.
            locs = page.locator("i.fa-cloud-download-alt")
            loc_count = locs.count()
            for idx in trange(loc_count):
                with page.expect_download() as download_info:
                    locs.nth(idx).click()
                download = download_info.value
                download.save_as("./" + download.suggested_filename)

def scrape_data_dictionary():
    path = Path() / "data_dictionaries"
    path.mkdir(exist_ok=True)
    tabs = pd.read_html("https://report.nih.gov/exporter-data-dictionary")
    names = [
        "project", "abstract", "publication", "patent", "clinical_study", 
        "link_table", "publication_author_affiliation_link_table"
    ]
    for tab, name in zip(tabs, names):
        tab.to_csv(path / (name + ".csv"))


if __name__ == "__main__":
    scrape_reporter()
    scrape_data_dictionary()
```
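As a quick sanity check, the downloads can be inspected with pandas, which reads a zip archive containing a single CSV directly. A minimal sketch, with illustrative filenames (the actual archive names depend on the fiscal years offered on the ExPORTER page) and an encoding argument that some older exports may need:

```python
from pathlib import Path

import pandas as pd

# illustrative names -- the real archive names vary by fiscal year, and the
# dictionary comes from scrape_data_dictionary() above
PROJECTS_ZIP = Path("RePORTER_PRJ_C_FY2023.zip")
PROJECT_DICTIONARY = Path("data_dictionaries") / "project.csv"

# pandas reads a zip containing a single csv transparently
projects = pd.read_csv(PROJECTS_ZIP, encoding="latin-1", low_memory=False)
dictionary = pd.read_csv(PROJECT_DICTIONARY, index_col=0)

print(projects.shape)
print(dictionary.head())
```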

### Databook

Similarly, the databook was scraped with Playwright: each report was exported to Excel and re-exported to CSV, and the folders were then zipped according to the databook's categories (the zipping step is sketched at the end of the script below).

```python
from pathlib import Path

import pandas as pd
from playwright.sync_api import sync_playwright
from tqdm import trange
from slugify import slugify


BASE_URL = "https://report.nih.gov/nihdatabook/category/"
REPORT_CATEGORIES=31
OUTPUT = Path("databook")
OUTPUT.mkdir(exist_ok=True)

def scrape_databook():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()
        for i in trange(REPORT_CATEGORIES, position=0):
            page.goto(BASE_URL + str(i))
            # wait for the table within the page to load
            page.wait_for_load_state("networkidle")
            header = page.locator("h1").first.inner_text()
            if "not found" in header.lower():
                continue
            directory = OUTPUT / slugify(header)
            directory.mkdir(exist_ok=True, parents=True)

            exports = page.locator("svg.fa-download").filter(visible=True)
            report_count = exports.count()
            for idx in trange(report_count, position=1):
                exports.nth(idx).click()
                with page.expect_download() as download_info:
                    page.get_by_role("button", name="Export Data as Excel").click()
                download = download_info.value
                download.save_as(directory / download.suggested_filename)

def scrape_historical():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()
        page.goto("https://report.nih.gov/nihdatabook/page/historical-data-books")
        directory = OUTPUT / "historical"
        directory.mkdir(exist_ok=True)

        downloads = page.get_by_text("Download").filter(visible=True)
        downloads_count = downloads.count()
        for idx in trange(downloads_count):
            with page.expect_download() as download_info:
                downloads.nth(idx).click()
            download = download_info.value
            download.save_as(directory / download.suggested_filename)

def reexport_csv():
    """Re-export all the xlsx's into csvs"""
    for xls in OUTPUT.rglob("*.xlsx"):
        # sheet_name=None loads every sheet as a dict of {sheet name: DataFrame}
        tabs = pd.read_excel(xls, sheet_name=None)
        if len(tabs) == 1:
            # single-sheet workbooks become one csv alongside the xlsx
            next(iter(tabs.values())).to_csv(xls.with_suffix(".csv"), index=False)
        else:
            # multi-sheet workbooks get one csv per sheet, numbered and named
            for idx, (tab_name, tab) in enumerate(tabs.items()):
                tab.to_csv(
                    xls.with_suffix(f".{idx}.{tab_name}.csv"),
                    index=False,
                )


if __name__ == "__main__":
    scrape_databook()
    scrape_historical()
    reexport_csv()
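    # The zipping of category folders described above is not part of the
    # original script; the following is a sketch using shutil.make_archive,
    # leaving the historical decks as loose pdf/ppt files to match the file
    # listing for this upload.
    import shutil
    for directory in OUTPUT.iterdir():
        if directory.is_dir() and directory.name != "historical":
            # make_archive appends the .zip suffix itself
            shutil.make_archive(str(OUTPUT / directory.name), "zip", root_dir=directory)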
```},
keywords= {nih, grants, publications, research, patents, scholarly-publishing, bibliometrics, funding, demographics, clinical-trials},
terms= {},
license= {},
superseded= {}
}
