Name: The New York Times Annotated Corpus
Creator: Sandhaus, Evan
Published: 2022-07-01 06:57:40
License: https://academictorrents.com/nolicensespecified

LDC2008T19_NYT_Annotated_Corpus.tar.zst	3.23GB

Type: Dataset

Tags: Dataset, nlp, english, natural language, language, corpus, news, data, text, New York Times, Annotated, linguistics, NITF, tagged, articles, natural language processing, NYT

Metadata:

@article{,
title= {The New York Times Annotated Corpus},
journal= {},
author= {Sandhaus, Evan},
year= {2008},
url= {https://catalog.ldc.upenn.edu/LDC2008T19},
abstract= {The New York Times Annotated Corpus contains over 1.8 million articles written and published by the New York Times between January 1, 1987 and June 19, 2007 with article metadata provided by the New York Times Newsroom, the New York Times Indexing Service and the online production staff at nytimes.com. The corpus includes:

- Over 1.8 million articles (excluding wire services articles that appeared during the covered period).
- Over 650,000 article summaries written by library scientists.
- Over 1,500,000 articles manually tagged by library scientists with tags drawn from a normalized indexing vocabulary of people, organizations, locations and topic descriptors.
- Over 275,000 algorithmically-tagged articles that have been hand verified by the online production staff at nytimes.com.
- Java tools for parsing corpus documents from .xml into a memory resident object.

As part of the New York Times' indexing procedures, most articles are manually summarized and tagged by a staff of library scientists. This collection contains over 650,000 article-summary pairs, which may prove useful in the development and evaluation of algorithms for automated document summarization. Also, over 1.5 million documents have at least one tag. Articles are tagged for persons, places, organizations, titles and topics using a controlled vocabulary that is applied consistently across articles. For instance, if one article mentions "Bill Clinton" and another refers to "President William Jefferson Clinton", both articles will be tagged with "CLINTON, BILL".

The New York Times has established a community website for researchers working on the data set at http://groups.google.com/group/nytnlp and encourages feedback and discussion about the corpus.

Data

The text in this corpus is formatted in News Industry Text Format (NITF) developed by the International Press Telecommunications Council, an independent association of news agencies and publishers. NITF is an XML specification that provides a standardized representation for the content and structure of discrete news articles. NITF encompasses structural markup such as bylines, headlines and paragraphs. The format also provides management attributes for categorizing articles into topics, summarization usage restrictions and revision histories. The goals of NITF are to answer the essential questions inherent in news articles: Who, What, When, Where and Why.

- Who: Who owns the copyright, who has rights to republish the article and who the article is about.
- What: The subjects reported, the named entities inside the article and the events it describes.
- When: When the article was written, when it was issued and when it was revised.
- Where: Where the article was written, where the events took place and where it was delivered.
- Why: The metadata describing the newsworthiness of the article.

An updated manual is available at https://catalog.ldc.upenn.edu/docs/LDC2008T19/new_york_times_annotated_corpus.pdf


Metadata

Item Name:	The New York Times Annotated Corpus
Author(s):	Evan Sandhaus
LDC Catalog No.:	LDC2008T19
ISBN:	1-58563-486-7
ISLRN:	429-488-225-160-9
DOI:	https://doi.org/10.35111/77ba-9x74
Release Date:	October 17, 2008
Member Year(s):	2008
DCMI Type(s):	Text
Data Source(s):	newswire
Application(s):	summarization, metadata extraction, information retrieval, information extraction
Language(s):	English
Language ID(s):	eng
License(s):	The New York Times Annotated Corpus Agreement https://catalog.ldc.upenn.edu/license/the-new-york-times-annotated-corpus-ldc2008t19.pdf
Online Documentation:	LDC2008T19 Documents https://catalog.ldc.upenn.edu/docs/LDC2008T19/
Citation:	Sandhaus, Evan. The New York Times Annotated Corpus LDC2008T19. Web Download. Philadelphia: Linguistic Data Consortium, 2008.},
keywords= {Dataset, nlp, english, natural language, language, corpus, news, data, text, New York Times, Annotated, linguistics, NITF, tagged, articles, natural language processing, NYT},
terms= {},
doi= {11272.1/AB2/GZC6PL},
isbn= {1-58563-486-7},
islrn= {429-488-225-160-9},
ldc= {LDC2008T19},
language= {English},
license= {},
superseded= {}
}
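
Since the abstract describes the articles as NITF XML with normalized tags such as "CLINTON, BILL", a short sketch of reading one document may help. This is a minimal illustration in Python using only the standard library, not the Java tools shipped with the corpus. The sample document is hand-written and hypothetical, and the element names used (hl1, byline, p, person) are assumptions based on general NITF conventions; check them against the corpus manual before relying on them.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written NITF-style document for illustration only.
# It is NOT taken from the corpus, and the element layout is an assumption.
SAMPLE_NITF = """<nitf>
  <head>
    <title>Example Headline</title>
    <docdata>
      <identified-content>
        <person>CLINTON, BILL</person>
        <location>WASHINGTON (DC)</location>
      </identified-content>
    </docdata>
  </head>
  <body>
    <body.head>
      <hedline><hl1>Example Headline</hl1></hedline>
      <byline>By A Reporter</byline>
    </body.head>
    <body.content>
      <p>First paragraph.</p>
      <p>Second paragraph.</p>
    </body.content>
  </body>
</nitf>"""


def parse_article(xml_text):
    """Extract headline, byline, paragraphs and person tags from an NITF-style string."""
    root = ET.fromstring(xml_text)
    return {
        # findtext returns None when the element is absent
        "headline": root.findtext(".//hl1"),
        "byline": root.findtext(".//byline"),
        # Body text: every <p> element, in document order
        "paragraphs": [p.text for p in root.iter("p")],
        # Normalized person tags from the controlled vocabulary
        "people": [e.text for e in root.iter("person")],
    }


article = parse_article(SAMPLE_NITF)
print(article["headline"])
print(article["people"])
```

Because the person tags are normalized, two articles that mention the same individual under different names would both yield the same string in the "people" list, which makes grouping articles by tag a simple dictionary lookup.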

Citation:

Sandhaus, E. (2008). The New York Times Annotated Corpus [Data set]. Academic Torrents. https://academictorrents.com/details/a48d52cbb929b7e2a601dcc1e48a30d0f16284ca

Date	2025-06-10 09:02:04
Submitter Name	Denise DiPersio
Submitter Email	dipersio@ldc.upenn.edu
Provide a description of the content in question:	This distribution is not authorized by LDC. Please remove these, and any other datasets, from the site immediately. The data, annotations and related material in LDC Databases are protected under US copyright laws and under various legal agreements between LDC/University of Pennsylvania and source data providers. LDC/University of Pennsylvania owns the LDC Databases and/or has the right to distribute them. LDC’s membership agreements and license agreements provide that users cannot publish, retransmit, disclose, display, copy, reproduce or redistribute the LDC Databases to others outside their organizations.
How are you authorized to make the request?	LDC/University of Pennsylvania either owns the data or has the right to distribute it pursuant to legal agreements with data providers.
How is the content not covered under the Fair Use Act sections 107 or 108?	We are not qualified to say what would constitute a fair use of LDC Databases. However, in this instance, the user has either breached an existing license agreement or hacked our catalog to obtain these datasets and is offering them on the platform to the same community served by LDC.
Provide a statement that the complaining party has a good faith belief.	We have a good faith belief that use of the material in the manner complained of is not authorized by the copyright owner(s), its agent, or the law.
