Penn Treebank II 2 - LDC95T7
Mitchell P. Marcus and Beatrice Santorini and Mary Ann Marcinkiewicz

Type: Dataset
Tags: Dataset, nlp, natural language, corpus, news, text, newswire, Treebank, LDC, corpora, Penn Treebank, PTB, LDC95T7, PTB2, transcript, transcribed speech

title= {Penn Treebank II 2 - LDC95T7},
journal= {},
author= {Mitchell P. Marcus and Beatrice Santorini and Mary Ann Marcinkiewicz},
year= {1995},
doi= {10.35111/wf9p-g717},
isbn= {1-58563-054-3},
islrn= {650-146-755-602-3},
languages= {english},
language= {english},
url= {},
abstract= {# Penn Treebank II

## Metadata

* Item Name:	Treebank-2
* Author(s):	Mitchell P. Marcus, Beatrice Santorini, Mary Ann Marcinkiewicz
* LDC Catalog No.:	LDC95T7
* ISBN:	1-58563-054-3
* ISLRN:	650-146-755-602-3
* DOI:	10.35111/wf9p-g717
* Member Year(s):	1995
* DCMI Type(s):	Text
* Data Source(s):	varied, transcribed speech, newswire, microphone speech
* Application(s):	parsing, natural language processing, tagging
* Language(s):	English
* Language ID(s):	eng
* License(s):	LDC User Agreement for Non-Members
* Online Documentation:	LDC95T7 Documents
* Licensing Instructions:	Subscription & Standard Members, and Non-Members
* Citation:	Marcus, Mitchell P., Beatrice Santorini, and Mary Ann Marcinkiewicz. Treebank-2 LDC95T7. Web Download. Philadelphia: Linguistic Data Consortium, 1995.
* Related Works:
	* isAnnotationOf
		* LDC93T3A TIPSTER Complete
	* hasAnnotation
		* LDC2004T14 Proposition Bank I
		* LDC2005T33 BBN Pronoun Coreference and Entity Type Corpus
		* LDC2008T05 Penn Discourse Treebank Version 2.0
		* LDC2008T23 NomBank v 1.0
		* LDC2012T04 2009 CoNLL Shared Task Part 2
		* LDC2013T22 The ARRAU Corpus of Anaphoric Information
		* LDC2016T10 SDP 2014 & 2015: Broad Coverage Semantic Dependency Parsing
		* LDC2019T05 Penn Discourse Treebank Version 3.0
	* hasOutcome
		* LDC2005T13 CCGbank
		* LDC2008T24 COMNOM v 1.0
	* isContinuationOf
	* hasContinuation
		* LDC99T42 Treebank-3
	* isSimilarWith
		* LDC2000T43 BLLIP 1987-89 WSJ Corpus Release 1

## Intro

**Original release was included in LDC Catalog No. LDC93T1**

**Original Treebank Release**

This release contains over 1.6 million words of hand-parsed material from the Dow Jones News Service, plus an additional one million words tagged for part-of-speech. This material is a subset of the language model corpus for the DARPA CSR large-vocabulary speech recognition project.

It also contains the first fully parsed version of the Brown Corpus, which has also been completely retagged using the Penn Treebank (PTB) tag set. Also included are tagged and parsed data from Department of Energy abstracts, IBM computer manuals, MUC-3 and ATIS.

In addition, the release includes source code for programs that the PTB project used to create portions of the data. Source code is also included for "tgrep," a program that lets the user search for specific constituents in tree structures. All software is provided "as is." (We have learned since publication that the tgrep source code provided on the CD-ROM is not readily portable, and that compiling tgrep requires modifying the source files. A pre-compiled tgrep program file, built for use on Sun SPARC systems, is also included.)

**Release - 2**

The PTB Project Release 2 features the new PTB-2 bracketing style, which is designed to allow the extraction of simple predicate/argument structure. Over one million words of text are provided with this bracketing applied, along with a complete style manual explaining the bracketing and new versions of tools for searching and treating bracketed data. This release also contains all the annotated text material from the earlier Treebank Preliminary Release, including the Brown Corpus. While these materials have not all been converted to the newer bracketing style, they have been cleaned up to remove problems that had appeared in the earlier release.
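
As a rough illustration of how labeled bracketing can be read by a program, the Python sketch below parses one bracketed tree and pulls out the constituent tagged as the subject. It is a minimal sketch under stated assumptions, not the tgrep tool and not part of the release: the example sentence is invented rather than taken from the corpus, the function names (`tokenize`, `parse`, `leaves`, `find`) are hypothetical, and the parser assumes every opening bracket is immediately followed by a label, so real corpus files may need extra handling.

```python
def tokenize(text):
    # Pad parentheses with spaces so split() separates them from labels and words.
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens, pos=0):
    # Parse one bracketed constituent starting at tokens[pos].
    # Returns (tree, next_pos); a tree is a list [label, child, ...]
    # and a leaf child is a plain string.
    if tokens[pos] != "(":
        raise ValueError("expected '(' at token %d" % pos)
    node = [tokens[pos + 1]]  # assumption: a label always follows '('
    pos += 2
    while tokens[pos] != ")":
        if tokens[pos] == "(":
            child, pos = parse(tokens, pos)
        else:
            child, pos = tokens[pos], pos + 1
        node.append(child)
    return node, pos + 1


def leaves(tree):
    # Collect the words under a tree, left to right.
    words = []
    for child in tree[1:]:
        if isinstance(child, str):
            words.append(child)
        else:
            words.extend(leaves(child))
    return words


def find(tree, label):
    # Return every subtree whose label matches exactly.
    found = [tree] if tree[0] == label else []
    for child in tree[1:]:
        if not isinstance(child, str):
            found.extend(find(child, label))
    return found


# Invented example sentence; NP-SBJ marks the subject noun phrase.
EXAMPLE = "(S (NP-SBJ (DT The) (NN company)) (VP (VBD sold) (NP (DT the) (NN unit))) (. .))"

tree, _ = parse(tokenize(EXAMPLE))
subjects = [" ".join(leaves(np)) for np in find(tree, "NP-SBJ")]
print(subjects)
```

Nested lists keep the sketch dependency-free; a fuller reader would also need to handle null elements and co-indexing, which this example ignores.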

The contents of Treebank Release 2 are as follows:

-   One million words of 1989 Wall Street Journal material annotated in Treebank-2 style.
-   A small sample of ATIS-3 material annotated in Treebank-2 style.
-   300-page style manual for Treebank-2 bracketing, as well as the part-of-speech tagging guidelines.
-   The contents of the previous Treebank release (Version 0.5), with cleaner versions of the WSJ, Brown Corpus, and ATIS material (annotated in Treebank-1 style).
-   Tools for processing Treebank data, including "tgrep," a tree-searching and manipulation package. Note that the usability of this release of tgrep is limited: users of Sun SPARC systems should have no problem, but others may find the software difficult or impossible to port.

In addition, the PTB Project has provided some updates, announcements and a discussion forum for users. A file of updates and further information is available via anonymous FTP from, in pub/treebank/doc/update.cd2.

The PTB project selected 2,499 stories from a three-year Wall Street Journal (WSJ) collection of 98,732 stories for syntactic annotation. These 2,499 stories have been distributed in both the Treebank-2 (LDC95T7) and Treebank-3 (LDC99T42) releases of PTB, and both releases include the raw text for each story. Three "map" files, which relate the 2,499 PTB filenames to the corresponding WSJ DOCNO strings in TIPSTER, are available in a compressed file (pennTB\_tipster\_wsj\_map.tar.gz) as an additional download for users who have licensed Treebank-2.

## Updates

See also:

- Penn Treebank III - 1999
- Penn Treebank Revised - 2015},
keywords= {LDC, PTB, penn treebank, treebank, NLP, natural language, corpus, corpora, dataset, LDC95T7, PTB2, text, news, newswire, transcript, transcribed speech},
terms= {},
license= {},
superseded= {}
