NIH RePORTER 25-04-25
National Institutes of Health

folder nih-reporter-exporter-25-04-25 (456 files)
ClinicalStudies.csv 4.57MB
databook/career-development.zip 49.25kB
databook/data-by-disability.zip 4.41kB
databook/data-by-ethnicity.zip 4.42kB
databook/data-by-race.zip 4.65kB
databook/data-by-sex.zip 54.27kB
databook/extramural-demographics.zip 42.99kB
databook/extramural-overview.zip 52.67kB
databook/funding-priorities.zip 6.43kB
databook/funding-rates.zip 19.72kB
databook/historical/NDB_2008_Final.pdf 8.65MB
databook/historical/NDB_2008_Final.ppt 7.59MB
databook/historical/NDB_2009_Final.pdf 14.71MB
databook/historical/NDB_2009_Final.ppt 10.49MB
databook/historical/NDB_2010_Final.pdf 13.31MB
databook/historical/NDB_2010_Final.ppt 13.14MB
databook/historical/NDB_2011_Final.pdf 13.51MB
databook/historical/NDB_2011_Final.ppt 17.56MB
databook/historical/NDB_2012_Final.pdf 13.65MB
databook/historical/NDB_2012_Final.ppt 24.15MB
databook/historical/NDB_2013_Final.pdf 14.59MB
databook/historical/NDB_2013_Final.ppt 25.15MB
databook/historical/NDB_2014_Final.pdf 6.49MB
databook/historical/NDB_2014_Final.ppt 26.98MB
databook/historical/NDB_2015_Final.pdf 8.71MB
databook/historical/NDB_2015_Final.pptx 11.40MB
databook/historical/NDB_2016_Final.pdf 16.34MB
databook/historical/NDB_2016_Final.pptx 10.02MB
databook/historical/NDB_2017_Final.pdf 7.15MB
databook/historical/NDB_2017_Final.pptx 7.78MB
databook/historical/NEDB_Core_Deck.pdf 6.37MB
databook/historical/NEDB_Core_Deck.ppt 14.97MB
databook/intramural-data-book.zip 45.82kB
databook/investigator-career-stage.zip 24.65kB
databook/national-statistics-on-graduate-students.zip 41.27kB
databook/national-statistics-on-postdoctorates.zip 18.34kB
databook/nih-budget-history.zip 24.15kB
databook/nih-peer-review.zip 28.00kB
databook/overview.zip 9.92kB
databook/ph-d-recipients.zip 32.91kB
databook/r01-equivalent-grants.zip 28.73kB
databook/research-center-grants.zip 14.22kB
databook/research-grants.zip 95.27kB
databook/research-project-grants.zip 52.19kB
databook/research-training-grants-and-fellowships.zip 43.80kB
databook/research-training-training-grants.zip 60.68kB
databook/small-business-research-sbir-sttr.zip 41.66kB
databook/success-rates-non-research-project-grants.zip 45.20kB
databook/success-rates-r01-equivalent-and-research-project-grants.zip 35.77kB
Type: Dataset
Tags: Research, bibliometrics, NIH, scholarly-publishing, grants, publications, patents, funding, demographics, clinical-trials

Bibtex:
@article{,
title= {NIH RePORTER 25-04-25},
journal= {},
author= {National Institutes of Health},
year= {},
url= {https://sciop.net/datasets/nih-reporter},
abstract= {RePORTER is the NIH's database of its grants, awards, and publications.

> In addition to carrying out its scientific mission, NIH exemplifies and promotes the highest level of public accountability. To that end, the Research Portfolio Online Reporting Tools (RePORT) website provides access to reports, data, and analyses of NIH research activities, including information on NIH expenditures and the results of NIH-supported research.
> 
> One of the tools available on the RePORT website is the RePORTER (RePORT Expenditures and Results) module. RePORTER is an electronic tool that allows users to search a repository of both intramural and extramural NIH-funded research projects and access publications and patents resulting from NIH funding.
> 
> In addition to RePORTER, the RePORT website also contains other tools that provide access to reports and summary statistics on NIH funding and the organizations and people involved in NIH research and training. One of these tools is the NIH Data Book, which summarizes the most commonly asked questions about the NIH budget and extramural programs. Another tool is called Awards by Location, which summarizes NIH awards for a particular fiscal year by the location and organization of the awardees.

This upload also contains the [NIH databook](https://report.nih.gov/nihdatabook), exported to xlsx and csv.

## Threat

This data might seem very low risk, since it is just uncontroversial information about what the NIH has funded in the past, but the Trump administration has moved to clear records from the NSF's systems, so one can expect RePORTER data to follow soon after.

Initially set on 'watchlist' because no specific threat has been identified.

## Method

### ExPORTER

Data was acquired from NIH's ExPORTER site: https://reporter.nih.gov/exporter/

The page does not contain regular links with download URLs; instead, the table is loaded dynamically and the download links are opaque, single-use query-parameter blobs.

A simple Playwright script captures the data, and pandas is used to extract the data dictionaries:

```python
from pathlib import Path

import pandas as pd
from playwright.sync_api import sync_playwright
from tqdm import trange

TABS = [
    "https://reporter.nih.gov/exporter/projects",
    "https://reporter.nih.gov/exporter/abstracts",
    "https://reporter.nih.gov/exporter/publications",
    "https://reporter.nih.gov/exporter/patents",
    "https://reporter.nih.gov/exporter/clinicalstudies",
    "https://reporter.nih.gov/exporter/linktables"
]

def scrape_reporter():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        for tab in TABS:
            page.goto(tab)
            print(tab)
            # wait for the table within the page to load
            page.wait_for_load_state("networkidle")
            # this really actually is the best way to find the download buttons.
            locs = page.locator("i.fa-cloud-download-alt")
            loc_count = locs.count()
            for idx in trange(loc_count):
                with page.expect_download() as download_info:
                    locs.nth(idx).click()
                download = download_info.value
                download.save_as("./" + download.suggested_filename)

def scrape_data_dictionary():
    path = Path() / "data_dictionaries"
    path.mkdir(exist_ok=True)
    tabs = pd.read_html("https://report.nih.gov/exporter-data-dictionary")
    names = [
        "project", "abstract", "publication", "patent", "clinical_study", 
        "link_table", "publication_author_affiliation_link_table"
    ]
    for tab, name in zip(tabs, names):
        tab.to_csv(path / (name + ".csv"))


if __name__ == "__main__":
    scrape_reporter()
    scrape_data_dictionary()
```
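As a quick sanity check, the downloads can be inspected with pandas, which reads a zip archive containing a single CSV directly. A minimal sketch, with illustrative filenames (the actual archive names depend on the fiscal years offered on the ExPORTER page) and an encoding argument that some older exports may need:

```python
from pathlib import Path

import pandas as pd

# illustrative names -- the real archive names vary by fiscal year, and the
# dictionary comes from scrape_data_dictionary() above
PROJECTS_ZIP = Path("RePORTER_PRJ_C_FY2023.zip")
PROJECT_DICTIONARY = Path("data_dictionaries") / "project.csv"

# pandas reads a zip containing a single csv transparently
projects = pd.read_csv(PROJECTS_ZIP, encoding="latin-1", low_memory=False)
dictionary = pd.read_csv(PROJECT_DICTIONARY, index_col=0)

print(projects.shape)
print(dictionary.head())
```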

### Databook

Similarly, the databook was scraped with Playwright: each report was exported to Excel and re-exported to CSV, and the folders were then zipped according to the databook's categories (the zipping step is sketched at the end of the script below).

```python
from pathlib import Path

import pandas as pd
from playwright.sync_api import sync_playwright
from tqdm import trange
from slugify import slugify


BASE_URL = "https://report.nih.gov/nihdatabook/category/"
REPORT_CATEGORIES=31
OUTPUT = Path("databook")
OUTPUT.mkdir(exist_ok=True)

def scrape_databook():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()
        for i in trange(REPORT_CATEGORIES, position=0):
            page.goto(BASE_URL + str(i))
            # wait for the table within the page to load
            page.wait_for_load_state("networkidle")
            header = page.locator("h1").first.inner_text()
            if "not found" in header.lower():
                continue
            directory = OUTPUT / slugify(header)
            directory.mkdir(exist_ok=True, parents=True)

            exports = page.locator("svg.fa-download").filter(visible=True)
            report_count = exports.count()
            for idx in trange(report_count, position=1):
                exports.nth(idx).click()
                with page.expect_download() as download_info:
                    page.get_by_role("button", name="Export Data as Excel").click()
                download = download_info.value
                download.save_as(directory / download.suggested_filename)

def scrape_historical():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()
        page.goto("https://report.nih.gov/nihdatabook/page/historical-data-books")
        directory = OUTPUT / "historical"
        directory.mkdir(exist_ok=True)

        downloads = page.get_by_text("Download").filter(visible=True)
        downloads_count = downloads.count()
        for idx in trange(downloads_count):
            with page.expect_download() as download_info:
                downloads.nth(idx).click()
            download = download_info.value
            download.save_as(directory / download.suggested_filename)

def reexport_csv():
    """Re-export all the xlsx's into csvs"""
    for xls in OUTPUT.rglob("*.xlsx"):
        # sheet_name=None loads every sheet as a dict of {sheet name: DataFrame}
        tabs = pd.read_excel(xls, sheet_name=None)
        if len(tabs) == 1:
            # single-sheet workbooks become one csv alongside the xlsx
            next(iter(tabs.values())).to_csv(xls.with_suffix(".csv"), index=False)
        else:
            # multi-sheet workbooks get one csv per sheet, numbered and named
            for idx, (tab_name, tab) in enumerate(tabs.items()):
                tab.to_csv(
                    xls.with_suffix(f".{idx}.{tab_name}.csv"),
                    index=False,
                )


if __name__ == "__main__":
    scrape_databook()
    scrape_historical()
    reexport_csv()
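    # The zipping of category folders described above is not part of the
    # original script; the following is a sketch using shutil.make_archive,
    # leaving the historical decks as loose pdf/ppt files to match the file
    # listing for this upload.
    import shutil
    for directory in OUTPUT.iterdir():
        if directory.is_dir() and directory.name != "historical":
            # make_archive appends the .zip suffix itself
            shutil.make_archive(str(OUTPUT / directory.name), "zip", root_dir=directory)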
```},
keywords= {nih, grants, publications, research, patents, scholarly-publishing, bibliometrics, funding, demographics, clinical-trials},
terms= {},
license= {},
superseded= {}
}
