CCGBank: CCG Combinatory Categorial Grammar for Penn Treebank 2 - LDC2005T13
Julia Hockenmaier and Mark Steedman

Type: Dataset
Tags: nlp, natural language, Treebank, LDC, Penn Treebank, CCG, Combinatory Categorial Grammar, grammar, 2005, LDC2005T13, CCGBank

title= {CCGBank: CCG Combinatory Categorial Grammar for Penn Treebank 2 - LDC2005T13},
journal= {},
author= {Julia Hockenmaier and Mark Steedman},
year= {2005},
url= {},
isbn= {1-58563-340-2},
doi= {10.35111/a589-6d76},
ldc= {LDC2005T13},
islrn= {181-921-208-336-7},
abstract= {# CCGbank

* Item Name:	CCGbank
* Author(s):	Julia Hockenmaier, Mark Steedman
* LDC Catalog No.:	LDC2005T13
* ISBN:	1-58563-340-2
* ISLRN:	181-921-208-336-7
* DOI:	10.35111/a589-6d76
* Release Date:	May 15, 2005
* Member Year(s):	2005
* DCMI Type(s):	Text
* Data Source(s):	newswire
* Project(s):	GALE, TIDES
* Application(s):	automatic content extraction, cross-lingual information retrieval, information detection, natural language processing
* Language(s):	English
* Language ID(s):	eng
* Citation:	Hockenmaier, Julia, and Mark Steedman. CCGbank LDC2005T13. Web Download. Philadelphia: Linguistic Data Consortium, 2005.

## Introduction

CCGbank was developed by the University of Edinburgh and contains approximately 49,000 sentences of English text annotated with Combinatory Categorial Grammar (CCG) derivations. The sentences in this corpus are drawn from Treebank-2 (LDC95T7) and represent 99.44% of the entire treebank. For the remaining 274 sentences, the translation algorithm failed to produce a CCG derivation.

CCG is a grammatical theory which provides a completely transparent interface between surface syntax and underlying semantics. Each (complete or partial) syntactic derivation corresponds directly to an interpretable structure. This allows CCG to provide an account for the incremental nature of human language processing. The syntactic rules of CCG are based on categorial calculus and combinatory logic. The main attraction of using CCG for parsing is that it facilitates the recovery of the non-local dependencies involved in constructions such as extraction, coordination, control, and raising.
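As a rough illustration of how categories combine, the sketch below implements the two simplest CCG rules, forward application (X/Y Y => X) and backward application (Y X\Y => X). The class and function names (`Atom`, `Functor`, `forward_apply`, `backward_apply`) are hypothetical and chosen for this example only; they are not part of CCGbank or of any tool distributed with it, and the full grammar also uses composition and type-raising, which are not shown.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Atom:
    """An atomic category such as S or NP."""
    name: str

    def __str__(self) -> str:
        return self.name


@dataclass(frozen=True)
class Functor:
    """A complex category: result/arg looks right, result\\arg looks left."""
    result: "Category"
    slash: str  # "/" or "\\"
    arg: "Category"

    def __str__(self) -> str:
        return f"({self.result}{self.slash}{self.arg})"


Category = Union[Atom, Functor]


def forward_apply(left: Category, right: Category) -> Optional[Category]:
    """Forward application: X/Y followed by Y yields X."""
    if isinstance(left, Functor) and left.slash == "/" and left.arg == right:
        return left.result
    return None


def backward_apply(left: Category, right: Category) -> Optional[Category]:
    """Backward application: Y followed by X\\Y yields X."""
    if isinstance(right, Functor) and right.slash == "\\" and right.arg == left:
        return right.result
    return None


NP, S = Atom("NP"), Atom("S")
TV = Functor(Functor(S, "\\", NP), "/", NP)  # transitive verb: (S\NP)/NP

vp = forward_apply(TV, NP)   # verb + object  -> S\NP
s = backward_apply(NP, vp)   # subject + VP   -> S
```

Because each rule application returns a category, every partial derivation (here, the verb phrase `vp`) is itself a well-typed constituent, which is the property the paragraph above describes.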

## Data

There are three sets of files which mirror the directory and file structure of the Penn Treebank: the human readable files in HTML format, the machine-readable corpus files (.auto), and the predicate-argument structure files (.parg). The corpus also includes a lexicon specifying the categories that the words of a language can take and files detailing grammar rule instantiations.
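The machine-readable files can be read with a few lines of code. The following is a minimal sketch, not an official reader: it assumes that each sentence in a .auto file is a header line beginning with `ID=` followed by one bracketed derivation line, and that lexical leaves have the form `(<L category modified-POS original-POS word predicate-argument-category>)`. That layout is an assumption not stated in this description, so check it against the documentation shipped with the corpus. The helper names `leaves` and `read_auto` and the example line are hypothetical.

```python
import re

# Assumed leaf layout (see the lead-in): five whitespace-separated fields.
LEAF = re.compile(r"\(<L (\S+) (\S+) (\S+) (\S+) (\S+)>\)")


def leaves(derivation):
    """Return (word, category) pairs for each lexical leaf in a derivation line."""
    return [(m.group(4), m.group(1)) for m in LEAF.finditer(derivation)]


def read_auto(lines):
    """Yield (sentence_id, leaves) for each header/derivation pair."""
    sent_id = None
    for line in lines:
        line = line.strip()
        if line.startswith("ID="):
            # Keep only the identifier from the first "ID=..." token.
            sent_id = line.split()[0][3:]
        elif line.startswith("(") and sent_id is not None:
            yield sent_id, leaves(line)
            sent_id = None


# Illustrative input written for this sketch, not copied from the corpus.
example = [
    "ID=wsj_0001.1 PARSER=GOLD NUMPARSE=1",
    "(<T S[dcl] 1 2> (<L NP NNP NNP John NP>) "
    "(<L S[dcl]\\NP VBZ VBZ sleeps S[dcl]\\NP_1>) )",
]
parsed = list(read_auto(example))
```

To process a real file, pass an open file handle instead of the list, for example `read_auto(open(path, encoding="utf-8"))`, where `path` points to one of the .auto files.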

## Samples

For an example of the data in this corpus, please view this sample (HTML).

## Update

The current version, 1.1, is a bug fix that supersedes the old package. It is available for download.},
keywords= {nlp, natural language, Treebank, LDC, Penn Treebank, CCGBank, CCG, Combinatory Categorial Grammar, grammar, 2005, LDC2005T13},
terms= {},
license= {},
superseded= {}
