| Field | Value |
| --- | --- |
| Info hash | 5cc64ab8d71e5f6c7732818842ea9a744a47bdfc |
| Last mirror activity | 0:54 ago |
| Size | 4.00GB (3,999,268,864 bytes) |
| Added | 2025-04-26 01:11:15 |
| Views | 1 |
| Hits | 7 |
| ID | 5437 |
| Type | multi |
| Downloaded | 16 time(s) |
| Uploaded by | |
| Folder | nih-reporter-exporter-25-04-25 |
| Num files | 456 files |
| Mirrors | 11 complete, 2 downloading = 13 mirror(s) total |
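The info hash above is the SHA-1 of the bencoded `info` dictionary in the .torrent file, so it can be checked locally before seeding. This is not part of the upload itself; it is a generic, stdlib-only sketch (the `bdecode`/`info_hash` helpers are written here for illustration):

```python
import hashlib


def bdecode(data, i=0):
    """Decode one bencoded value starting at byte index i; return (value, next_index)."""
    c = data[i:i + 1]
    if c == b"i":  # integer: i<digits>e
        end = data.index(b"e", i)
        return int(data[i + 1:end]), end + 1
    if c == b"l":  # list: l<items>e
        i += 1
        out = []
        while data[i:i + 1] != b"e":
            v, i = bdecode(data, i)
            out.append(v)
        return out, i + 1
    if c == b"d":  # dict: d<key><value>...e
        i += 1
        out = {}
        while data[i:i + 1] != b"e":
            k, i = bdecode(data, i)
            v, i = bdecode(data, i)
            out[k] = v
        return out, i + 1
    # byte string: <length>:<bytes>
    colon = data.index(b":", i)
    n = int(data[i:colon])
    start = colon + 1
    return data[start:start + n], start + n


def info_hash(torrent_bytes):
    """SHA-1 hex digest of the raw bencoded `info` value of a .torrent file."""
    assert torrent_bytes[:1] == b"d", "torrent must be a bencoded dict"
    i = 1
    while torrent_bytes[i:i + 1] != b"e":
        key, i = bdecode(torrent_bytes, i)
        start = i
        _, i = bdecode(torrent_bytes, i)
        if key == b"info":
            # hash the raw bytes of the info dict, not a re-encoding
            return hashlib.sha1(torrent_bytes[start:i]).hexdigest()
    raise ValueError("no info dict found")
```

Usage: `info_hash(Path("nih-reporter.torrent").read_bytes())` should print the hash listed in the table.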

Type: Dataset
Tags: Research, bibliometrics, NIH, scholarly-publishing, grants, publications, patents, funding, demographics, clinical-trials
Bibtex:
@article{,
  title= {NIH RePORTER 25-04-25},
  journal= {},
  author= {National Institutes of Health},
  year= {},
  url= {https://sciop.net/datasets/nih-reporter},
  abstract= {RePORTER is the NIH's database of its grants, awards, and publications.

> In addition to carrying out its scientific mission, NIH exemplifies and promotes the highest level of public accountability. To that end, the Research Portfolio Online Reporting Tools (RePORT) website provides access to reports, data, and analyses of NIH research activities, including information on NIH expenditures and the results of NIH-supported research.
>
> One of the tools available on the RePORT website is the RePORTER (RePORT Expenditures and Results) module. RePORTER is an electronic tool that allows users to search a repository of both intramural and extramural NIH-funded research projects and access publications and patents resulting from NIH funding.
>
> In addition to RePORTER, the RePORT website also contains other tools that provide access to reports and summary statistics on NIH funding and the organizations and people involved in NIH research and training. One of these tools is the NIH Data Book, which summarizes the most commonly asked questions about the NIH budget and extramural programs. Another tool is called Awards by Location, which summarizes NIH awards for a particular fiscal year by the location and organization of the awardees.

This upload also contains the [NIH databook](https://report.nih.gov/nihdatabook), exported to xlsx and csv.

## Threat

This data might seem very low risk, since it is just uncontroversial data about what the NIH has funded in the past. However, the Trump administration has moved to clear records from the NSF's records system, so one can expect RePORTER data to follow soon after. Initially set on 'watchlist' because no specific threat has been identified.

## Method

### ExPORTER

Data was acquired from NIH's ExPORTER site: https://reporter.nih.gov/exporter/

The page does not contain regular links with download URLs; instead, the table is dynamically loaded and the links are opaque, single-use query-parameter blobs. A simple playwright script captures the data, and pandas is used to extract the data dictionaries:

```python
from pathlib import Path

import pandas as pd
from playwright.sync_api import sync_playwright
from tqdm import trange

TABS = [
    "https://reporter.nih.gov/exporter/projects",
    "https://reporter.nih.gov/exporter/abstracts",
    "https://reporter.nih.gov/exporter/publications",
    "https://reporter.nih.gov/exporter/patents",
    "https://reporter.nih.gov/exporter/clinicalstudies",
    "https://reporter.nih.gov/exporter/linktables",
]


def scrape_reporter():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        for tab in TABS:
            page.goto(tab)
            print(tab)
            # wait for the table within the page to load
            page.wait_for_load_state("networkidle")
            # this really actually is the best way to find the download buttons.
            locs = page.locator("i.fa-cloud-download-alt")
            loc_count = locs.count()
            for idx in trange(loc_count):
                with page.expect_download() as download_info:
                    locs.nth(idx).click()
                download = download_info.value
                download.save_as("./" + download.suggested_filename)


def scrape_data_dictionary():
    path = Path() / "data_dictionaries"
    path.mkdir(exist_ok=True)
    tabs = pd.read_html("https://report.nih.gov/exporter-data-dictionary")
    names = [
        "project",
        "abstract",
        "publication",
        "patent",
        "clinical_study",
        "link_table",
        "publication_author_affiliation_link_table",
    ]
    for tab, name in zip(tabs, names):
        tab.to_csv(path / (name + ".csv"))


if __name__ == "__main__":
    scrape_reporter()
    scrape_data_dictionary()
```

### Databook

Similarly, the databook was scraped with playwright, exporting each report to Excel, re-exporting to CSV, and then zipping folders based on the categories in the databook:

```python
from pathlib import Path

import pandas as pd
from playwright.sync_api import sync_playwright
from slugify import slugify
from tqdm import trange

BASE_URL = "https://report.nih.gov/nihdatabook/category/"
REPORT_CATEGORIES = 31
OUTPUT = Path("databook")
OUTPUT.mkdir(exist_ok=True)


def scrape_databook():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()
        for i in trange(REPORT_CATEGORIES, position=0):
            page.goto(BASE_URL + str(i))
            # wait for the table within the page to load
            page.wait_for_load_state("networkidle")
            header = page.locator("h1").first.inner_text()
            if "not found" in header.lower():
                continue
            directory = OUTPUT / slugify(header)
            directory.mkdir(exist_ok=True, parents=True)
            exports = page.locator("svg.fa-download").filter(visible=True)
            report_count = exports.count()
            for idx in trange(report_count, position=1):
                exports.nth(idx).click()
                with page.expect_download() as download_info:
                    page.get_by_role("button", name="Export Data as Excel").click()
                download = download_info.value
                download.save_as(directory / download.suggested_filename)


def scrape_historical():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()
        page.goto("https://report.nih.gov/nihdatabook/page/historical-data-books")
        directory = OUTPUT / "historical"
        directory.mkdir(exist_ok=True)
        downloads = page.get_by_text("Download").filter(visible=True)
        downloads_count = downloads.count()
        for idx in trange(downloads_count):
            with page.expect_download() as download_info:
                downloads.nth(idx).click()
            download = download_info.value
            download.save_as(directory / download.suggested_filename)


def reexport_csv():
    """Re-export all the xlsx's into csvs"""
    for xls in OUTPUT.rglob("*.xlsx"):
        tabs = pd.read_excel(xls, sheet_name=None)
        if len(tabs) == 1:
            tabs[list(tabs.keys())[0]].to_csv(
                xls.with_suffix(".csv"),
                index=False,
            )
        else:
            for idx, (tab_name, tab) in enumerate(tabs.items()):
                tab.to_csv(
                    xls.with_suffix(f".{idx}.{tab_name}.csv"),
                    index=False,
                )


if __name__ == "__main__":
    scrape_databook()
    scrape_historical()
    reexport_csv()
```},
  keywords= {nih, grants, publications, research, patents, scholarly-publishing, bibliometrics, funding, demographics, clinical-trials},
  terms= {},
  license= {},
  superseded= {}
}
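The method mentions zipping folders by databook category, but that step does not appear in the scripts above. A minimal sketch of how it might have been done (an assumption, not the uploader's actual code; `zip_categories` is a hypothetical helper, assuming one zip per category directory under `databook/`):

```python
import shutil
from pathlib import Path

OUTPUT = Path("databook")


def zip_categories(output=OUTPUT):
    """Create one .zip archive per category folder, named after the folder."""
    for directory in sorted(output.iterdir()):
        if directory.is_dir():
            # make_archive appends .zip; archives land next to the folders
            shutil.make_archive(str(directory), "zip", root_dir=directory)
```

`shutil.make_archive` stores the contents of `root_dir` at the archive root, so `databook/funding/report.csv` becomes `report.csv` inside `databook/funding.zip`.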