@article{,
title= {catalog.archives.gov-lgbt},
journal= {},
author= {},
year= {},
url= {},
abstract= {A partial mirror of catalog.archives.gov, filtered for the LGBTQ-related
keywords found in keywords.txt. The keyword list was derived from catalog
links that were, in turn, manually collected from
https://www.archives.gov/research/lgbt.
For each keyword in keywords.txt, we run a catalog search and attempt to
download all metadata (folder "search-results") and attachments (folders
"tifs", "jpgs", "pdfs", "others").
Folders are packed as Zstandard-compressed tarballs to save space and to
reduce overhead in torrent metadata. Unpacked, the data totals approximately
3 TB, of which the tifs account for 2.6 TB.
Overview:
search-results.tar.zst contains all JSON metadata that would be available on
search results pages. This includes descriptions, authorship, the year of each
record found, and a list of download URLs for PDFs etc. It is best to download
this file first to determine whether the dataset contains something
specific you need.
pdfs.tar.zst, tifs.tar.zst, jpgs.tar.zst, others.tar.zst contain the actual
downloads, split by file type for compression purposes.
download-urls.txt.zst contains the list of AWS S3 URLs that were downloaded
into those folders.
generate-urls.py was used to scrape the catalog for metadata. The detailed
scraping procedure is outlined in steps.txt.
Data captured around 2025-02-23.},
keywords= {united states,usa,archives.gov,nara},
terms= {},
license= {},
superseded= {}
}