Penn Treebank II 2 - LDC95T7 - Academic Torrents

LDC95T7_Penn_Treebank_2.tar.zst	134.99MB

Type: Dataset

Tags: Dataset, nlp, natural language, corpus, news, text, newswire, Treebank, LDC, corpora, Penn Treebank, PTB, LDC95T7, PTB2, transcript, transcribed speech

Metadata:

@article{,
title= {Penn Treebank II 2 - LDC95T7},
journal= {},
author= {Mitchell P. Marcus and Beatrice Santorini and Mary Ann Marcinkiewicz},
year= {1995},
doi= {10.35111/wf9p-g717},
isbn= {1-58563-054-3},
islrn= {650-146-755-602-3},
languages= {english},
language= {english},
url= {https://doi.org/10.35111/wf9p-g717},
abstract= {# Penn Treebank II

## Metadata

* Item Name:	Treebank-2
* Author(s):	Mitchell P. Marcus, Beatrice Santorini, Mary Ann Marcinkiewicz
* LDC Catalog No.:	LDC95T7
* ISBN:	1-58563-054-3
* ISLRN:	650-146-755-602-3
* DOI:	https://doi.org/10.35111/wf9p-g717
* Member Year(s):	1995
* DCMI Type(s):	Text
* Data Source(s):	varied, transcribed speech, newswire, microphone speech
* Application(s):	parsing, natural language processing, tagging
* Language(s):	English
* Language ID(s):	eng
* License(s):	LDC User Agreement for Non-Members
* Online Documentation:	LDC95T7 Documents
* Licensing Instructions:	Subscription & Standard Members, and Non-Members
* Citation:	Marcus, Mitchell P., Beatrice Santorini, and Mary Ann Marcinkiewicz. Treebank-2 LDC95T7. Web Download. Philadelphia: Linguistic Data Consortium, 1995.
* Related Works:
	* isAnnotationOf
		* LDC93T3A TIPSTER Complete
	* hasAnnotation
		* LDC2004T14 Proposition Bank I
		* LDC2005T33 BBN Pronoun Coreference and Entity Type Corpus
		* LDC2008T05 Penn Discourse Treebank Version 2.0
		* LDC2008T23 NomBank v 1.0
		* LDC2012T04 2009 CoNLL Shared Task Part 2
		* LDC2013T22 The ARRAU Corpus of Anaphoric Information
		* LDC2016T10 SDP 2014 & 2015: Broad Coverage Semantic Dependency Parsing
		* LDC2019T05 Penn Discourse Treebank Version 3.0
	* hasOutcome
		* LDC2005T13 CCGbank
		* LDC2008T24 COMNOM v 1.0
	* isContinuationOf
		* LDC93T1 ACL/DCI
	* hasContinuation
		* LDC99T42 Treebank-3
	* isSimilarWith
		* LDC2000T43 BLLIP 1987-89 WSJ Corpus Release 1

## Intro

**Original release was included in LDC Catalog No. LDC93T1**

**Original Treebank Release**

This release contains over 1.6 million words of hand-parsed material from the Dow Jones News Service, plus an additional one million words tagged for part-of-speech. This material is a subset of the language model corpus for the DARPA CSR large-vocabulary speech recognition project.

It also contains the first fully parsed version of the Brown Corpus, which has also been completely retagged using the Penn Treebank (PTB) tag set. Also included are tagged and parsed data from Department of Energy abstracts, IBM computer manuals, MUC-3 and ATIS.

In addition, the release includes source code for programs that the PTB project used to create portions of the data. Source code is also included for "tgrep," a program that lets the user search for specific constituents in tree structures. All software is provided "as is." (We have learned since publication that the tgrep source code provided on the CD-ROM is not readily portable, and compiling tgrep requires modification of the source files. Also included is a pre-compiled program file for tgrep, built for use on Sun SPARC systems.)

**Release - 2**

The PTB Project Release 2 features the new PTB-2 bracketing style, which is designed to allow the extraction of simple predicate/argument structure. Over one million words of text are provided with this bracketing applied, along with a complete style manual explaining the bracketing and new versions of tools for searching and treating bracketed data. This release also contains all the annotated text material from the earlier Treebank Preliminary Release, including the Brown Corpus. While these materials have not all been converted to the newer bracketing style, they have been cleaned up to remove problems that had appeared in the earlier release.

The contents of Treebank Release 2 are as follows:

-   One million words of 1989 Wall Street Journal material annotated in Treebank-2 style.
-   A small sample of ATIS-3 material annotated in Treebank-2 style.
-   A 300-page style manual for Treebank-2 bracketing, as well as the part-of-speech tagging guidelines.
-   The contents of the previous Treebank release (Version 0.5), with cleaner versions of the WSJ, Brown Corpus, and ATIS material (annotated in Treebank-1 style).
-   Tools for processing Treebank data, including "tgrep," a tree-searching and manipulation package (note that the usability of this release of tgrep is limited: users of Sun SPARC systems should have no problem, but others may find the software difficult or impossible to port).
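
The kind of constituent search that "tgrep" performs can be pictured with a small sketch. The Python below is not tgrep and does not use its query syntax; it is only an illustration, assuming the labeled-bracket convention in which a constituent is written like `(NP (DT a) (NN tree))` and assuming every opening bracket is followed by a label. The sample sentence and the helper names `tokenize`, `parse`, `find`, and `words` are invented for this example and are not part of the release.

```python
from typing import Iterator, List, Tuple

# A node is (label, children); a child is either another node or a word string.
Tree = Tuple[str, list]


def tokenize(text: str) -> List[str]:
    """Split a bracketed string into '(', ')' and symbol tokens."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens: List[str], pos: int = 0) -> Tuple[Tree, int]:
    """Parse one '(LABEL child ...)' group; return the node and the next position."""
    if tokens[pos] != "(":
        raise ValueError("expected '(' at token %d" % pos)
    label = tokens[pos + 1]
    pos += 2
    children: list = []
    while tokens[pos] != ")":
        if tokens[pos] == "(":
            child, pos = parse(tokens, pos)
            children.append(child)
        else:
            children.append(tokens[pos])
            pos += 1
    return (label, children), pos + 1


def find(tree: Tree, label: str) -> Iterator[Tree]:
    """Yield every subtree with the given label, in pre-order."""
    node_label, children = tree
    if node_label == label:
        yield tree
    for child in children:
        if not isinstance(child, str):
            yield from find(child, label)


def words(tree: Tree) -> List[str]:
    """Collect the words under a node, left to right."""
    out: List[str] = []
    for child in tree[1]:
        if isinstance(child, str):
            out.append(child)
        else:
            out.extend(words(child))
    return out


# Invented sample sentence, not taken from the corpus.
SAMPLE = "(S (NP (DT The) (NN parser)) (VP (VBD found) (NP (DT a) (NN tree))))"
tree, _ = parse(tokenize(SAMPLE))
noun_phrases = [" ".join(words(np)) for np in find(tree, "NP")]
print(noun_phrases)
```

A recursive-descent reader keeps the sketch short; a search over real corpus files would also need to handle whatever wrappers, comments, and null elements the style manual describes.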

In addition, the PTB Project has provided some updates, announcements and a discussion forum for users. A file of updates and further information is available via anonymous FTP from ftp.cis.upenn.edu, in pub/treebank/doc/update.cd2.

The PTB project selected 2,499 stories from a three-year Wall Street Journal (WSJ) collection of 98,732 stories for syntactic annotation. These 2,499 stories have been distributed in both the Treebank-2 (LDC95T7) and Treebank-3 (LDC99T42) releases of PTB, and both releases include the raw text for each story. Three "map" files, available in a compressed file (pennTB\_tipster\_wsj\_map.tar.gz) as an additional download for users who have licensed Treebank-2, give the relation between the 2,499 PTB filenames and the corresponding WSJ DOCNO strings in TIPSTER.

## Updates

See also:

- [Penn Treebank III - 1999](https://academictorrents.com/details/504381e9ebabbdd41e1611b543a6ce0d2dde7695)
- [Penn Treebank Revised - 2015](https://academictorrents.com/details/e15c3ac148c5d13b74171a86ee3b4e0439418bb7)},
keywords= {Dataset, nlp, natural language, corpus, news, text, newswire, Treebank, LDC, corpora, Penn Treebank, PTB, LDC95T7, PTB2, transcript, transcribed speech},
terms= {},
license= {},
superseded= {}
}

Citation:

Marcus, M. P., Santorini, B., & Marcinkiewicz, M. A. (1995). Penn Treebank II 2 - LDC95T7 [Data set]. Academic Torrents. https://academictorrents.com/details/ad4131ac6712481cce61cbc478ef4f1e62863f38

Date	2025-06-10 09:12:24
Submitter Name	Denise DiPersio
Submitter Email	dipersio@ldc.upenn.edu
Provide a description of the content in question:	This distribution is not authorized by LDC. Please remove these, and any other datasets, from the site immediately. The data, annotations and related material in LDC Databases are protected under US copyright laws and under various legal agreements between LDC/University of Pennsylvania and source data providers. LDC/University of Pennsylvania owns the LDC Databases and/or has the right to distribute them. LDC’s membership agreements and license agreements provide that users cannot publish, retransmit, disclose, display, copy, reproduce or redistribute the LDC Databases to others outside their organizations.
How are you authorized to make the request?	LDC/University of Pennsylvania either owns the data or has the right to distribute it pursuant to legal agreements with data providers.
How is the content not covered under the Fair Use Act sections 107 or 108?	We are not qualified to say what would constitute a fair use of LDC Databases. However, in this instance, the user has either breached an existing license agreement or hacked our catalog to obtain these datasets and is offering them on the platform to the same community served by LDC
Provide a statement that the complaining party has a good faith belief.	We have a good faith belief that use of the material in the manner complained of is not authorized by the copyright owner(s), its agent, or the law.
