LDC2015T13_Penn_Treebank_revised.tar.zst | 6.86MB |
Type: Dataset
Tags: nlp, english, natural language, corpus, news, text, newswire, Treebank, LDC, corpora, Penn Treebank, Penn, 2015, LDC2015T13, parsing, tagging, part of speech, WSJ, PTB
Bibtex:
@article{,
title= {Penn Treebank Revised: English News Text Treebank LDC2015T13},
journal= {},
author= {Ann Bies and Justin Mott and Colin Warner},
year= {2015},
url= {https://doi.org/10.35111/xpjy-at91},
doi= {10.35111/xpjy-at91},
isbn= {1-58563-724-6},
dcmi= {text},
languages= {english},
language= {english},
ldc= {LDC2015T13},
abstract= {# Penn Treebank Revised: English News Text Treebank - 2015

## Metadata

* Item Name: English News Text Treebank: Penn Treebank Revised
* Author(s): Ann Bies, Justin Mott, Colin Warner
* LDC Catalog No.: LDC2015T13
* ISBN: 1-58563-724-6
* DOI: https://doi.org/10.35111/xpjy-at91
* Release Date: July 15, 2015
* Member Year(s): 2015
* DCMI Type(s): Text
* Data Source(s): newswire
* Application(s): parsing, tagging, part of speech tagging, natural language processing
* Language(s): English
* Language ID(s): eng
* License(s): LDC User Agreement for Non-Members
* Online Documentation: LDC2015T13 Documents
* Licensing Instructions: Subscription & Standard Members, and Non-Members
* Citation: Bies, Ann, Justin Mott, and Colin Warner. English News Text Treebank: Penn Treebank Revised LDC2015T13. Web Download. Philadelphia: Linguistic Data Consortium, 2015.

## Introduction

English News Text Treebank: Penn Treebank Revised was developed by the Linguistic Data Consortium (LDC) with funding through a gift from Google Inc. It consists of a combination of automated and manual revisions of the [Penn Treebank](https://catalog.ldc.upenn.edu/LDC99T42) annotation of Wall Street Journal (WSJ) stories. The data comprises 1,203,648 word-level tokens in 49,191 sentence-level tokens, spanning all 2,312 of the original Penn Treebank WSJ files.

## Data

This release includes revised tokenization, part-of-speech, and syntactic treebank annotation intended to bring the full WSJ treebank section into compliance with the agreed-upon policies and updates implemented for current English treebank annotation specifications at LDC. Examples of treebanks annotated under those specifications include English Web Treebank ([LDC2012T13](https://catalog.ldc.upenn.edu/LDC2012T13)), OntoNotes ([LDC2013T19](https://catalog.ldc.upenn.edu/LDC2013T19)), and English translation treebanks such as English Translation Treebank: An-Nahar Newswire ([LDC2012T02](https://catalog.ldc.upenn.edu/LDC2012T02)). English Treebank Supplemental Guidelines are included in this release.

## Samples

Please view these [treebank](https://catalog.ldc.upenn.edu/desc/addenda/LDC2015T13.tree.txt) and [tokenized](https://catalog.ldc.upenn.edu/desc/addenda/LDC2015T13.txt) samples.

## Updates

None at this time.
},
keywords= {nlp, english, natural language, corpus, news, text, newswire, Treebank, LDC, corpora, PTB, Penn Treebank, Penn, 2015, LDC2015T13, parsing, tagging, part of speech, WSJ},
terms= {},
license= {},
superseded= {}
}
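
The abstract describes tokenization, part-of-speech, and syntactic annotation. As a rough illustration of how such annotation is commonly consumed, the sketch below reads one tree in the parenthesized Penn Treebank bracket style and lists its (word, tag) pairs. It assumes the release's tree files follow that bracket style; the sample sentence is hand-written for illustration and is not taken from the corpus, and the names `parse_tree` and `tagged_words` are hypothetical helpers, not part of any LDC tooling.

```python
import re

# Hypothetical tree written by hand in Penn Treebank bracket style.
# It is NOT taken from LDC2015T13.
SAMPLE = (
    "( (S (NP-SBJ (DT The) (NN market)) "
    "(VP (VBD fell) (ADVP (RB sharply))) (. .)) )"
)

# A token is an opening bracket, a closing bracket, or a run of other
# non-space characters (a label or a word).
TOKEN_RE = re.compile(r"\(|\)|[^\s()]+")


def parse_tree(text):
    """Parse one bracketed tree into nested lists: [label, child, ...]."""
    tokens = TOKEN_RE.findall(text)
    pos = 0

    def parse_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        # The outermost wrapper "( (S ...) )" has no label of its own.
        label = ""
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(parse_node())
            else:
                children.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1  # consume ")"
        return [label] + children

    return parse_node()


def tagged_words(node):
    """Yield (word, tag) pairs from nodes whose only child is a word."""
    label, children = node[0], node[1:]
    if len(children) == 1 and isinstance(children[0], str):
        yield children[0], label
        return
    for child in children:
        if isinstance(child, list):
            yield from tagged_words(child)


if __name__ == "__main__":
    for word, tag in tagged_words(parse_tree(SAMPLE)):
        print(word, tag)
```

Empty elements, which Penn Treebank-style annotation tags as `-NONE-`, would be yielded like any other leaf by this sketch; a caller interested only in surface tokens could filter on that tag.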