Name: Chinese Gigaword 5th edition LDC2011T13
Creator: Robert Parker and David Graff and Ke Chen and Junbo Kong and Kazuaki Maeda
Published: 2022-07-12 03:43:27
License: https://academictorrents.com/nolicensespecified

LDC2011T13_chinese_gigaword_5e.tar.zst	4.39GB

Type: Dataset

Tags: Dataset, nlp, natural language, model, corpus, Annotated, DARPA, LDC, gigaword, Mandarin, language model, Chinese

Metadata:

@article{,
title= {Chinese Gigaword 5th edition LDC2011T13},
journal= {},
author= {Robert Parker and David Graff and Ke Chen and Junbo Kong and Kazuaki Maeda},
isbn= {1-58563-599-5},
islrn= {162-966-215-437-6},
year= {2011},
dcmi= {Text},
languages= {Chinese, Mandarin Chinese, Mandarin, cmn},
doi= {10.35111/102m-dr17},
url= {https://doi.org/10.35111/102m-dr17},
applications= {language modeling, information retrieval, natural language processing},
abstract= {# Chinese Gigaword Fifth Edition - Linguistic Data Consortium

### Introduction

Chinese Gigaword Fifth Edition was produced by the Linguistic Data Consortium (LDC). It is a comprehensive archive of newswire text data that has been acquired from Chinese news sources by LDC at the University of Pennsylvania. Chinese Gigaword Fifth Edition includes all of the content of the fourth edition of Chinese Gigaword ([LDC2009T27](http://catalog.ldc.upenn.edu/LDC2009T27)) plus new data covering the period from January 2009 through December 2010.

Eight distinct sources of Chinese newswire are represented here:

-   Agence France Presse (afp\_cmn)
-   Central News Agency, Taiwan (cna\_cmn)
-   Central News Service (cns\_cmn)
-   Guangming Daily (gmw\_cmn)
-   People's Daily (pda\_cmn)
-   People's Liberation Army Daily (pla\_cmn)
-   Xinhua News Agency (xin\_cmn)
-   Zaobao Newspaper (zbn\_cmn)

The seven-letter codes in the parentheses above are used for the directory names and data files for each source and are also used (in ALL\_CAPS) as part of the unique DOC id string assigned to each news article.
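
The relationship between source codes, directory names, and DOC id prefixes can be sketched in Python. This is a minimal illustration, not part of any LDC tooling: the names `SOURCES` and `doc_id_prefix` are hypothetical, and the only facts used are the eight codes listed above and the statement that DOC ids carry the code in upper case.

```python
# Source codes as listed above, mapped to the news source names.
SOURCES = {
    "afp_cmn": "Agence France Presse",
    "cna_cmn": "Central News Agency, Taiwan",
    "cns_cmn": "Central News Service",
    "gmw_cmn": "Guangming Daily",
    "pda_cmn": "People's Daily",
    "pla_cmn": "People's Liberation Army Daily",
    "xin_cmn": "Xinhua News Agency",
    "zbn_cmn": "Zaobao Newspaper",
}


def doc_id_prefix(code: str) -> str:
    """Return the ALL_CAPS form of a source code, as used inside DOC ids."""
    if code not in SOURCES:
        raise ValueError(f"unknown source code: {code!r}")
    return code.upper()
```

For example, `doc_id_prefix("xin_cmn")` returns `"XIN_CMN"`.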

Articles covering the period from January 2009 through December 2010 have been added to the Agence France Presse, Central News Agency (CNA), Central News Service, Guangming Daily, People's Liberation Army Daily and Xinhua News Agency data sets. The data from People's Daily covers the period from late June 2009 through December 2010. No new data from Zaobao has been added. Additionally, Zaobao and CNA data included in previous releases were found to contain non-normalized full-width characters. Those files have been normalized to correct that issue.

### Data

Each data file name consists of the 7-letter prefix (e.g., xin\_cmn) and an underscore character (\_) followed by a 6-digit date (representing the year and month during which the file contents were originally published by the respective news source), followed by a .gz file extension, indicating that the file contents have been compressed using the GNU gzip compression utility (RFC 1952). So, each file contains all the usable data received by LDC for the given month from the given news source.
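
The naming scheme described above can be checked with a short Python sketch. It assumes the prefix is always three lowercase letters, an underscore, and three lowercase letters, and that the 6-digit date is laid out as YYYYMM; `FILE_NAME`, `MonthFile`, and `parse_file_name` are hypothetical names introduced only for this illustration.

```python
import re
from typing import NamedTuple

# Pattern assumed from the description: 7-character prefix, underscore,
# 6-digit date (YYYYMM), then the .gz extension.
FILE_NAME = re.compile(r"^([a-z]{3}_[a-z]{3})_(\d{4})(\d{2})\.gz$")


class MonthFile(NamedTuple):
    source: str
    year: int
    month: int


def parse_file_name(name: str) -> MonthFile:
    """Split a data file name into source code, year, and month."""
    match = FILE_NAME.match(name)
    if match is None:
        raise ValueError(f"not a Gigaword data file name: {name!r}")
    source, year, month = match.groups()
    return MonthFile(source, int(year), int(month))
```

Given a name such as `xin_cmn_200901.gz`, `parse_file_name` returns `MonthFile(source='xin_cmn', year=2009, month=1)`.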

All text data are presented in SGML form, using a very simple, minimal markup structure. The file gigaword\_c.dtd in the docs directory provides the formal Document Type Declaration for parsing the SGML content. The corpus has been fully validated by a standard SGML parser utility (nsgmls), using this DTD file.

For this release, all sources have received a uniform treatment in terms of quality control, and we have applied a rudimentary (and \_approximate\_) categorization of DOC units into four distinct types. The classification is indicated by the type=string attribute that is included in each opening DOC tag. The four types are:

-   story: This is by far the most frequent type, and it represents the most typical newswire item: a coherent report on a particular topic or event, consisting of paragraphs and full sentences.
-   multi: This type of DOC contains a series of unrelated blurbs, each of which briefly describes a particular topic or event; this is typically applied to DOCs that contain summaries of today's news, news briefs in ... (some general area like finance or sports), and so on.
-   advis: (short for advisory) These are DOCs which the news service addresses to news editors -- they are not intended for publication to the end users (the populations who read the news). We also find a lot of formulaic, repetitive content in DOCs of this type (contact phone numbers, etc.).
-   other: This represents DOCs that clearly do not fall into any of the above types -- in general, items of this type are intended for broad circulation (they are not advisories), they may be topically coherent (unlike multi type DOCs), and they typically do not contain paragraphs or sentences (they aren't really stories); these are things like lists of sports scores, stock prices, temperatures around the world, and so on.
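
A rough way to see how the four types are distributed in one monthly file is to scan the opening DOC tags. The Python sketch below rests on several assumptions that should be checked against gigaword\_c.dtd and a real file: that the opening tag has the form `<DOC id="..." type="...">` with double-quoted values, that it sits on a single line, and that the text is UTF-8 encoded. It uses a regular expression rather than a real SGML parser, and `count_doc_types` is a hypothetical helper name.

```python
import gzip
import re
from collections import Counter

# Assumption: the opening tag looks like <DOC id="..." type="story">,
# with double-quoted attribute values, all on one line.
DOC_OPEN = re.compile(r'<DOC\b[^>]*\btype="([^"]+)"')


def count_doc_types(path: str) -> Counter:
    """Count DOC units per type in one gzip-compressed data file."""
    counts = Counter()
    # Assumption: the decompressed text is UTF-8.
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            match = DOC_OPEN.search(line)
            if match:
                counts[match.group(1)] += 1
    return counts
```

Because the categorization is described as approximate, counts obtained this way are best treated as a guide for filtering (for instance, keeping only story DOCs for language modeling) rather than as exact statistics.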

### Sample

Please view this [sample](https://catalog.ldc.upenn.edu/desc/addenda/LDC2011T13.jpg).

### Sponsorship

This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication does not necessarily reflect the position or policy of the Government, and no official endorsement should be inferred.

### Metadata

- _Item Name:_ Chinese Gigaword Fifth Edition
- _Author(s):_ Robert Parker, David Graff, Ke Chen, Junbo Kong, Kazuaki Maeda
- _LDC Catalog No.:_ LDC2011T13
- _ISBN:_ 1-58563-599-5
- _ISLRN:_ 162-966-215-437-6
- _DOI:_ [https://doi.org/10.35111/102m-dr17](https://doi.org/10.35111/102m-dr17)
- _Release Date:_ November 16, 2011
- _Member Year(s):_ 2011
- _DCMI Type(s):_ Text
- _Data Source(s):_ newswire
- _Project(s):_ GALE
- _Application(s):_ language modeling, information retrieval, natural language processing
- _Language(s):_ Mandarin Chinese
- _Language ID(s):_ cmn
- _License(s):_ [LDC User Agreement for Non-Members](https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf)
- _Online Documentation:_ [LDC2011T13 Documents](https://catalog.ldc.upenn.edu/docs/LDC2011T13/)
- _Licensing Instructions:_ [Subscription & Standard Members, and Non-Members](http://www.ldc.upenn.edu/language-resources/data/obtaining)
- _Citation:_ Parker, Robert, et al. Chinese Gigaword Fifth Edition LDC2011T13. Web Download. Philadelphia: Linguistic Data Consortium, 2011.

},
keywords= {Dataset, nlp, natural language, model, corpus, Annotated, DARPA, LDC, gigaword, Mandarin, language model, Chinese},
terms= {},
license= {},
superseded= {}
}

Citation:

Parker, R., Graff, D., Chen, K., Kong, J., & Maeda, K. (2011). Chinese Gigaword 5th edition LDC2011T13 [Data set]. Academic Torrents. https://academictorrents.com/details/3154947ad1bc51b82d616b58aecb0de0c993a837

Date	2025-06-10 09:11:30
Submitter Name	Denise DiPersio
Submitter Email	dipersio@ldc.upenn.edu
Provide a description of the content in question:	This distribution is not authorized by LDC. Please remove these, and any other datasets, from the site immediately. The data, annotations and related material in LDC Databases are protected under US copyright laws and under various legal agreements between LDC/University of Pennsylvania and source data providers. LDC/University of Pennsylvania owns the LDC Databases and/or has the right to distribute them. LDC’s membership agreements and license agreements provide that users cannot publish, retransmit, disclose, display, copy, reproduce or redistribute the LDC Databases to others outside their organizations.
How are you authorized to make the request?	LDC/University of Pennsylvania either owns the data or has the right to distribute it pursuant to legal agreements with data providers.
How is the content not covered under the Fair Use Act sections 107 or 108?	We are not qualified to say what would constitute a fair use of LDC Databases. However, in this instance, the user has either breached an existing license agreement or hacked our catalog to obtain these datasets and is offering them on the platform to the same community served by LDC.
Provide a statement that the complaining party has a good faith belief.	We have a good faith belief that use of the material in the manner complained of is not authorized by the copyright owner(s), its agent, or the law.
