LDC99T42_Penn_Treebank_3.tar.zst | 29.83MB |
Type: Dataset
Tags: Dataset, nlp, natural language, corpus, text, linguistics, Treebank, corpora, Penn Treebank, PTB
Bibtex:
@article{,
title= {Penn Treebank III 3 LDC99T42},
journal= {},
author= {Mitchell P. Marcus and Beatrice Santorini and Mary Ann Marcinkiewicz and Ann Taylor},
year= {1999},
isbn= {1-58563-163-9},
islrn= {141-282-691-413-2},
dcmi= {text},
language= {english},
doi= {10.35111/gq1x-j780},
url= {https://doi.org/10.35111/gq1x-j780},
abstract= {# Penn Treebank III

## Metadata

- _Item Name:_ Treebank-3
- _Author(s):_ Mitchell P. Marcus, Beatrice Santorini, Mary Ann Marcinkiewicz, Ann Taylor
- _LDC Catalog No.:_ LDC99T42
- _ISBN:_ 1-58563-163-9
- _ISLRN:_ 141-282-691-413-2
- _DOI:_ [https://doi.org/10.35111/gq1x-j780](https://doi.org/10.35111/gq1x-j780)
- _Member Year(s):_ 1999
- _DCMI Type(s):_ Text
- _Data Source(s):_ telephone speech, newswire, microphone speech, transcribed speech, varied
- _Project(s):_ TIDES, GALE
- _Application(s):_ parsing, natural language processing, tagging
- _Language(s):_ English
- _Language ID(s):_ eng
- _License(s):_ [LDC User Agreement for Non-Members](https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf)
- _Online Documentation:_ [LDC99T42 Documents](https://catalog.ldc.upenn.edu/docs/LDC99T42/)
- _Citation:_ Marcus, Mitchell P., et al. Treebank-3 LDC99T42. Web Download. Philadelphia: Linguistic Data Consortium, 1999.

## Introduction

This release contains the following [Treebank-2](http://catalog.ldc.upenn.edu/LDC95T7) Material:

- One million words of 1989 Wall Street Journal material annotated in Treebank II style.
- A small sample of ATIS-3 material annotated in Treebank II style.
- A fully tagged version of the Brown Corpus.

and the following new material:

- Switchboard tagged, dysfluency-annotated, and parsed text
- Brown parsed text

The Treebank bracketing style is designed to allow the extraction of simple predicate/argument structure. Over one million words of text are provided with this bracketing applied.
## Data

The Penn Treebank (PTB) project selected 2,499 stories from a three-year Wall Street Journal (WSJ) collection of 98,732 stories for syntactic annotation. These 2,499 stories have been distributed in both the Treebank-2 ([LDC95T7](https://catalog.ldc.upenn.edu/LDC95T7)) and Treebank-3 ([LDC99T42](https://catalog.ldc.upenn.edu/LDC99T42)) releases of PTB. Treebank-2 includes the raw text for each story. Three "map" files, which give the relation between the 2,499 PTB filenames and the corresponding WSJ DOCNO strings in TIPSTER, are available in a compressed file (pennTB\_tipster\_wsj\_map.tar.gz) as an additional download for users who have licensed Treebank-2.

## Samples

Please view the following samples:

- [Part-of-Speech Tags](https://catalog.ldc.upenn.edu/desc/addenda/LDC99T42.pos.txt)
- [Dysfluency Annotation](https://catalog.ldc.upenn.edu/desc/addenda/LDC99T42.dff.txt)
- [Dysfluency Annotation & Part-of-Speech Tags](https://catalog.ldc.upenn.edu/desc/addenda/LDC99T42.mgd.txt)
- [Dysfluency Annotation, Part-of-Speech Tags & Turns Joined](https://catalog.ldc.upenn.edu/desc/addenda/LDC99T42.dps.txt)
- [Syntactic Annotation](https://catalog.ldc.upenn.edu/desc/addenda/LDC99T42.prd.txt)
- [Syntactic Annotation & Part-of-Speech Tags](https://catalog.ldc.upenn.edu/desc/addenda/LDC99T42.mrg.txt)

## Updates

After publication, it was discovered that not all of the postscript (\*.ps) files had been converted to pdfs and that some of the converted pdfs contained errors. For pdf copies of the documentation files, please go to [addenda](https://catalog.ldc.upenn.edu/desc/addenda/LDC1999T42) for a list of the files available.

As of October 5, 2016, 252 previously missing wsj files from [Treebank-2](http://catalog.ldc.upenn.edu/LDC95T7) were added. As of February, 2017, 2,499 "raw" wsj files were added from Treebank-2 ([LDC95T7](https://catalog.ldc.upenn.edu/LDC95T7)).
Corpus downloads after these dates will include these missing files.},
keywords= {Dataset, nlp, natural language, corpus, text, linguistics, Treebank, corpora, Penn Treebank, PTB},
terms= {},
license= {},
superseded= {}
}
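
The abstract above notes that the Treebank bracketing style is designed to allow the extraction of simple predicate/argument structure. As a rough illustration of what reading such bracketing can look like, here is a minimal Python sketch that parses one parenthesized tree into nested tuples and lists its word/tag pairs. The function names `parse_brackets` and `tagged_words` are hypothetical, and the input string is a toy example written for this sketch, not a sentence from the corpus. Real corpus files may use conventions this sketch does not handle (for example extra wrapping brackets or trace elements), so treat it as a starting point and check it against the LDC documentation.

```python
def parse_brackets(text):
    """Parse one bracketed tree string into nested (label, children) tuples.

    Children are either further (label, children) tuples or plain word strings.
    """
    # Pad parentheses with spaces so a simple split() yields clean tokens.
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError(f"expected '(' at token {pos}, got {tokens[pos]!r}")
        pos += 1                      # consume "("
        label = tokens[pos]
        pos += 1                      # consume the node label
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # consume ")"
        return (label, children)

    tree = read_node()
    if pos != len(tokens):
        raise ValueError("unexpected tokens after the end of the tree")
    return tree


def tagged_words(tree):
    """Yield (word, tag) pairs from the leaves of the tree, left to right."""
    label, children = tree
    for child in children:
        if isinstance(child, str):
            yield (child, label)
        else:
            yield from tagged_words(child)


# Toy input written for illustration only; it is not taken from the corpus.
TOY_TREE = "(S (NP (DT The) (NN cat)) (VP (VBD sat)))"

tree = parse_brackets(TOY_TREE)
pairs = list(tagged_words(tree))
print(pairs)
```

Nested tuples keep the sketch dependency-free; a dedicated tree class or an existing NLP toolkit would be a more natural choice for real work on the corpus.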