Type: Dataset
Tags: usa, united states, archives.gov, nara
Bibtex:
@article{, title= {catalog.archives.gov-lgbt}, journal= {}, author= {}, year= {}, url= {}, abstract= {A partial mirror of catalog.archives.gov, filtered by the LGBTQ-related keywords listed in keywords.txt. The keywords were taken from catalog links that were collected manually from https://www.archives.gov/research/lgbt. For each keyword in keywords.txt, we ran a search and attempted to download all metadata (folder "search-results") and attachments (folders "tifs", "jpgs", "pdfs", "others"). Folders are packed as Zstandard-compressed tarballs to save space and to reduce overhead in the torrent metadata. Unpacked, the data totals approximately 3 TB, of which the tifs account for 2.6 TB. Overview: search-results.tar.zst contains all JSON metadata that would be available on the search results pages. This includes descriptions, authorship, the year of each record found, and a list of download URLs for PDFs etc. It is best to download that archive first to determine whether this dataset contains something specific you need. pdfs.tar.zst, tifs.tar.zst, jpgs.tar.zst, and others.tar.zst contain the actual downloads, split by file type for compression purposes. download-urls.txt.zst contains the list of AWS S3 URLs that were downloaded into those folders. generate-urls.py was used to scrape the catalog for metadata. The detailed scraping procedure is outlined in steps.txt. Data captured around 2025-02-23.}, keywords= {united states,usa,archives.gov,nara}, terms= {}, license= {}, superseded= {} }
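
The abstract suggests checking the search-results metadata before fetching the larger archives. Below is a minimal Python sketch of one way to do that. It is not part of the dataset and is not the authors' tooling. It assumes search-results.tar.zst has already been unpacked into a local folder (for example with a tar build that supports Zstandard) and that the metadata files end in ".json"; the file extension, the folder layout, and the function names `contains_term` and `find_matching_files` are all assumptions made for illustration. Because the JSON schema is not described here, the sketch searches every string value rather than specific fields.

```python
import json
from pathlib import Path


def contains_term(node, term):
    """Recursively check whether any string value in decoded JSON contains term.

    `term` is expected to be lowercase already; dictionary keys are not searched.
    """
    if isinstance(node, str):
        return term in node.lower()
    if isinstance(node, dict):
        return any(contains_term(value, term) for value in node.values())
    if isinstance(node, list):
        return any(contains_term(value, term) for value in node)
    return False


def find_matching_files(root, term):
    """Return sorted paths of *.json files under root that mention term.

    Hypothetical helper: assumes the unpacked metadata uses a .json extension.
    The match is case-insensitive.
    """
    term = term.lower()
    matches = []
    for path in sorted(Path(root).rglob("*.json")):
        try:
            with open(path, encoding="utf-8") as handle:
                data = json.load(handle)
        except (json.JSONDecodeError, UnicodeDecodeError):
            continue  # skip files that are not valid UTF-8 JSON
        if contains_term(data, term):
            matches.append(path)
    return matches
```

Calling `find_matching_files("search-results", "stonewall")` would list the metadata files that mention the term anywhere in their string values. Those files can then be inspected for the download URLs of the attachments you need.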