main (72 files)
data/google_code_0000.parquet | 734.06MB
data/google_code_0001.parquet | 669.87MB
data/google_code_0002.parquet | 649.12MB
data/google_code_0003.parquet | 692.76MB
data/google_code_0004.parquet | 712.58MB
data/google_code_0005.parquet | 693.88MB
data/google_code_0006.parquet | 712.86MB
data/google_code_0007.parquet | 728.65MB
data/google_code_0008.parquet | 723.45MB
data/google_code_0009.parquet | 715.97MB
data/google_code_0010.parquet | 706.24MB
data/google_code_0011.parquet | 755.59MB
data/google_code_0012.parquet | 733.36MB
data/google_code_0013.parquet | 689.51MB
data/google_code_0014.parquet | 726.90MB
data/google_code_0015.parquet | 712.51MB
data/google_code_0016.parquet | 679.11MB
data/google_code_0017.parquet | 738.23MB
data/google_code_0018.parquet | 692.97MB
data/google_code_0019.parquet | 671.82MB
data/google_code_0020.parquet | 714.01MB
data/google_code_0021.parquet | 701.05MB
data/google_code_0022.parquet | 671.48MB
data/google_code_0023.parquet | 720.06MB
data/google_code_0024.parquet | 737.45MB
data/google_code_0025.parquet | 724.12MB
data/google_code_0026.parquet | 708.04MB
data/google_code_0027.parquet | 738.88MB
data/google_code_0028.parquet | 702.19MB
data/google_code_0029.parquet | 701.68MB
data/google_code_0030.parquet | 671.92MB
data/google_code_0031.parquet | 716.41MB
data/google_code_0032.parquet | 738.84MB
data/google_code_0033.parquet | 736.72MB
data/google_code_0034.parquet | 685.81MB
data/google_code_0035.parquet | 704.99MB
data/google_code_0036.parquet | 689.84MB
data/google_code_0037.parquet | 685.29MB
data/google_code_0038.parquet | 721.99MB
data/google_code_0039.parquet | 731.18MB
data/google_code_0040.parquet | 758.89MB
data/google_code_0041.parquet | 654.67MB
data/google_code_0042.parquet | 693.49MB
data/google_code_0043.parquet | 690.05MB
data/google_code_0044.parquet | 710.00MB
data/google_code_0045.parquet | 693.12MB
data/google_code_0046.parquet | 715.06MB
data/google_code_0047.parquet | 722.32MB
data/google_code_0048.parquet | 695.32MB
Type: Dataset
Tags:
Bibtex:
@article{,
title= {Google Code Archive Dataset},
journal= {},
author= {nyuuzyou},
year= {},
url= {https://huggingface.co/datasets/nyuuzyou/google-code-archive},
abstract= {## Dataset Description
This dataset was compiled from the [Google Code Archive](https://code.google.com/archive/), a preserved snapshot of projects hosted on Google Code, Google's open-source project hosting service that operated from 2006 to 2016. Google Code was one of the major code hosting platforms of its era, hosting hundreds of thousands of open-source projects before its shutdown. The archive provides a unique historical record of open-source development during a formative period of modern software engineering.
### Dataset Summary
| Statistic | Value |
|-----------|-------|
| **Total Files** | 65,825,565 |
| **Total Repositories** | 488,618 |
| **Total Size** | 47 GB (compressed Parquet) |
| **Programming Languages** | 454 |
| **File Format** | Parquet with Zstd compression (71 files) |
### Key Features
- **Historical open-source corpus**: Contains code from over 488K repositories hosted on Google Code during 2006-2016
- **Diverse language coverage**: Spans 454 programming languages identified by [go-enry](https://github.com/go-enry/go-enry) (based on GitHub Linguist rules)
- **Rich metadata**: Includes repository name, file path, detected language, license information, and file size
- **Quality filtered**: Extensive filtering to remove vendor code, build artifacts, generated files, and low-quality content
- **Era-specific patterns**: Captures the coding conventions and library usage typical of open-source projects from 2006-2016
### Languages
The dataset includes 454 programming languages. The top 30 languages by file count:
| Rank | Language | File Count |
|------|----------|------------|
| 1 | Java | 16,331,993 |
| 2 | PHP | 12,764,574 |
| 3 | HTML | 5,705,184 |
| 4 | C++ | 5,090,685 |
| 5 | JavaScript | 4,937,765 |
| 6 | C | 4,179,202 |
| 7 | C# | 3,872,245 |
| 8 | Python | 2,207,240 |
| 9 | CSS | 1,697,385 |
| 10 | Objective-C | 1,186,050 |
| 11 | Shell | 639,183 |
| 12 | Java Server Pages | 541,498 |
| 13 | ActionScript | 540,557 |
| 14 | Makefile | 481,563 |
| 15 | ASP.NET | 381,389 |
| 16 | Smarty | 339,555 |
| 17 | Ruby | 331,743 |
| 18 | Go | 316,427 |
| 19 | Perl | 307,960 |
| 20 | Vim Script | 216,236 |
| 21 | Lua | 215,226 |
| 22 | HTML+PHP | 150,781 |
| 23 | HTML+Razor | 149,131 |
| 24 | MATLAB | 145,686 |
| 25 | Batchfile | 138,523 |
| 26 | Pascal | 135,992 |
| 27 | Visual Basic .NET | 118,732 |
| 28 | TeX | 110,379 |
| 29 | Less | 98,221 |
| 30 | Unix Assembly | 94,758 |
### Licenses
The dataset includes files from repositories with various licenses as specified in the Google Code Archive:
| License | File Count |
|---------|------------|
| Apache License 2.0 (asf20) | 21,568,143 |
| GNU GPL v3 (gpl3) | 14,843,470 |
| GNU GPL v2 (gpl2) | 6,824,185 |
| Other Open Source (oos) | 5,433,436 |
| MIT License (mit) | 4,754,567 |
| GNU LGPL (lgpl) | 4,073,137 |
| BSD License (bsd) | 3,787,348 |
| Artistic License (art) | 1,910,047 |
| Eclipse Public License (epl) | 1,587,289 |
| Mozilla Public License 1.1 (mpl11) | 580,102 |
| Multiple Licenses (multiple) | 372,457 |
| Google Summer of Code (gsoc) | 63,292 |
| Public Domain (publicdomain) | 28,092 |
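Because every record carries one of the identifiers above in its `license` field, a subset can be selected by license before use. The following is a minimal sketch, not part of the dataset tooling: the function name `filter_by_license`, the sample records, and the choice to treat `asf20`, `mit`, and `bsd` as the allowed set are all assumptions made for illustration.

```python
from collections import Counter

# Identifiers come from the table above. Which ones to allow is a
# policy decision; this particular set is only an example.
ALLOWED = {"asf20", "mit", "bsd"}

def filter_by_license(records, allowed=ALLOWED):
    """Return an iterator over records whose `license` field is in `allowed`."""
    return (r for r in records if r["license"] in allowed)

# Hypothetical records that only carry the fields needed here.
sample = [
    {"path": "a.java", "license": "asf20"},
    {"path": "b.c", "license": "gpl3"},
    {"path": "c.py", "license": "mit"},
]

kept = list(filter_by_license(sample))
counts = Counter(r["license"] for r in sample)  # per-license file counts
print([r["path"] for r in kept], dict(counts))
```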
## Dataset Structure
### Data Fields
| Field | Type | Description |
|-------|------|-------------|
| `code` | string | Content of the source file (UTF-8 encoded) |
| `repo_name` | string | Name of the Google Code project |
| `path` | string | Path of the file within the repository (relative to repo root) |
| `language` | string | Programming language as identified by [go-enry](https://github.com/go-enry/go-enry) |
| `license` | string | License of the repository (Google Code license identifier) |
| `size` | int64 | Size of the source file in bytes |
### Data Format
- **Format**: Apache Parquet with Zstd compression
- **File Structure**: 71 files (`google_code_0000.parquet` to `google_code_0070.parquet`)
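The shard names follow a fixed four-digit pattern, so they can be generated rather than typed out. This is a small sketch under assumptions: the helper name `shard_paths` is hypothetical, and the `data/` prefix is assumed from the file listing that accompanies this dataset.

```python
def shard_paths(num_shards: int = 71, prefix: str = "data/") -> list[str]:
    """Build the shard paths google_code_0000.parquet .. google_code_0070.parquet."""
    return [f"{prefix}google_code_{i:04d}.parquet" for i in range(num_shards)]

paths = shard_paths()
print(paths[0], paths[-1], len(paths))
```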
### Data Splits
All examples are in the train split. There is no validation or test split.
### Example Data Point
```
{
'code': 'public class HundredIntegers {\n\tpublic static void main (String[] args) {\n\t\tfor (int i = 1; i<=100; i++) {\n\t\t\tSystem.out.println(i);\n\t\t}\n\t}\n}',
'repo_name': '100integers',
'path': 'HundredIntegers.java',
'language': 'Java',
'license': 'epl',
'size': 147
}
```
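A record like the one above can be checked against the field table before further processing. The sketch below is illustrative only: `validate_record` is a hypothetical helper, and the `code` value is shortened, so its length does not correspond to the `size` shown.

```python
# Field names and Python types corresponding to the Data Fields table.
EXPECTED_FIELDS = {
    "code": str,
    "repo_name": str,
    "path": str,
    "language": str,
    "license": str,
    "size": int,
}

def validate_record(record: dict) -> list[str]:
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            problems.append(f"wrong type for {name}: {type(record[name]).__name__}")
    return problems

example = {
    "code": "public class HundredIntegers {}",  # shortened for illustration
    "repo_name": "100integers",
    "path": "HundredIntegers.java",
    "language": "Java",
    "license": "epl",
    "size": 147,
}
print(validate_record(example))
```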
## Dataset Creation
### Pipeline Overview
The dataset was created through a multi-stage pipeline:
1. **Project Discovery**: Fetching project metadata from the Google Code Archive
2. **Source Filtering**: Selecting projects that have source code available (`hasSource: true`)
3. **Archive Downloading**: Downloading source archives from the Google Code Archive storage
4. **Content Extraction**: Extracting and filtering source code files
5. **Parquet Generation**: Writing filtered records to Parquet shards with Zstd compression
### Language Detection
Programming languages are detected using [go-enry](https://github.com/go-enry/go-enry), a Go port of GitHub's Linguist library. Only files classified as **Programming** or **Markup** language types are included (Data and Prose types are excluded).
### License Detection
Licenses are obtained directly from the Google Code Archive project metadata. The archive preserves the original license selection made by project owners when creating their repositories on Google Code.
### File Filtering
Extensive filtering is applied to ensure data quality:
#### Size Limits
| Limit | Value |
|-------|-------|
| Max repository archive size | 64 MB |
| Max single file size | 2 MB |
| Max line length | 1,000 characters |
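The three limits can be expressed as a single check. This is a sketch rather than the pipeline's actual code: it assumes that "MB" means 1024 × 1024 bytes and that line length is counted in decoded characters, neither of which the description states. The function name is hypothetical.

```python
MAX_ARCHIVE_BYTES = 64 * 1024 * 1024  # 64 MB, assuming binary megabytes
MAX_FILE_BYTES = 2 * 1024 * 1024      # 2 MB, same assumption
MAX_LINE_CHARS = 1_000

def within_size_limits(text: str, archive_bytes: int) -> bool:
    """Check one decoded file, and the archive it came from, against the limits."""
    if archive_bytes > MAX_ARCHIVE_BYTES:
        return False
    if len(text.encode("utf-8")) > MAX_FILE_BYTES:
        return False
    return all(len(line) <= MAX_LINE_CHARS for line in text.splitlines())

print(within_size_limits("int main() { return 0; }\n", archive_bytes=10_000))
```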
#### Excluded Directories
- **Configuration**: `.git/`, `.github/`, `.gitlab/`, `.vscode/`, `.idea/`, `.vs/`, `.settings/`, `.eclipse/`, `.project/`, `.metadata/`
- **Vendor/Dependencies**: `node_modules/`, `bower_components/`, `jspm_packages/`, `vendor/`, `third_party/`, `3rdparty/`, `external/`, `packages/`, `deps/`, `lib/vendor/`, `target/dependency/`, `Pods/`
- **Build Output**: `build/`, `dist/`, `out/`, `bin/`, `target/`, `release/`, `debug/`, `.next/`, `.nuxt/`, `_site/`, `_build/`, `__pycache__/`, `.pytest_cache/`, `cmake-build-*`, `.gradle/`, `.maven/`
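One way to apply the directory lists above is to test each directory component of a path against the patterns. The sketch below assumes that a match at any depth excludes the file and that matching is case-sensitive; the description does not say how the original pipeline matched. `in_excluded_dir` is a hypothetical name.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Single-component patterns from the lists above. `lib/vendor/` and
# `target/dependency/` are already covered by `vendor` and `target`.
EXCLUDED_DIRS = [
    ".git", ".github", ".gitlab", ".vscode", ".idea", ".vs", ".settings",
    ".eclipse", ".project", ".metadata",
    "node_modules", "bower_components", "jspm_packages", "vendor",
    "third_party", "3rdparty", "external", "packages", "deps", "Pods",
    "build", "dist", "out", "bin", "target", "release", "debug", ".next",
    ".nuxt", "_site", "_build", "__pycache__", ".pytest_cache",
    "cmake-build-*", ".gradle", ".maven",
]

def in_excluded_dir(path: str) -> bool:
    """True if any directory component of `path` matches an excluded pattern."""
    directories = PurePosixPath(path).parts[:-1]  # drop the file name itself
    return any(fnmatchcase(d, pat) for d in directories for pat in EXCLUDED_DIRS)

print(in_excluded_dir("src/node_modules/x/index.js"))
print(in_excluded_dir("src/main/App.java"))
```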
#### Excluded Files
- **Lock Files**: `package-lock.json`, `yarn.lock`, `pnpm-lock.yaml`, `Gemfile.lock`, `Cargo.lock`, `poetry.lock`, `Pipfile.lock`, `composer.lock`, `go.sum`, `mix.lock`
- **Minified Files**: Any file containing `.min.` in the name
- **Binary Files**: `.exe`, `.dll`, `.so`, `.dylib`, `.a`, `.lib`, `.o`, `.obj`, `.jar`, `.war`, `.ear`, `.class`, `.pyc`, `.pyo`, `.wasm`, `.bin`, `.dat`, `.pdf`, `.doc`, `.docx`, `.xls`, `.xlsx`, `.ppt`, `.pptx`, `.zip`, `.tar`, `.gz`, `.bz2`, `.7z`, `.rar`, `.jpg`, `.jpeg`, `.png`, `.gif`, `.bmp`, `.ico`, `.svg`, `.mp3`, `.mp4`, `.avi`, `.mov`, `.wav`, `.flac`, `.ttf`, `.otf`, `.woff`, `.woff2`, `.eot`
- **System Files**: `.DS_Store`, `thumbs.db`
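The file-level rules above can be sketched as a name check. Several details here are assumptions: the binary-extension set is only a subset of the full list, and the case handling (exact match for lock files, case-insensitive for system files and extensions) is a guess at reasonable behavior. `is_excluded_file` is a hypothetical name.

```python
from pathlib import PurePosixPath

LOCK_FILES = {
    "package-lock.json", "yarn.lock", "pnpm-lock.yaml", "Gemfile.lock",
    "Cargo.lock", "poetry.lock", "Pipfile.lock", "composer.lock",
    "go.sum", "mix.lock",
}
SYSTEM_FILES = {".ds_store", "thumbs.db"}  # compared in lower case
# Subset of the binary extensions listed above, kept short for brevity.
BINARY_EXTENSIONS = {".exe", ".dll", ".so", ".jar", ".class", ".pyc",
                     ".zip", ".png", ".pdf", ".woff2"}

def is_excluded_file(path: str) -> bool:
    """True if the file name matches a lock, minified, binary, or system rule."""
    name = PurePosixPath(path).name
    if name in LOCK_FILES or name.lower() in SYSTEM_FILES:
        return True
    if ".min." in name:  # minified files such as jquery.min.js
        return True
    return PurePosixPath(name).suffix.lower() in BINARY_EXTENSIONS

print(is_excluded_file("web/js/jquery.min.js"))
print(is_excluded_file("src/Main.java"))
```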
#### Content Filtering
- **UTF-8 Validation**: Files must be valid UTF-8 encoded text
- **Binary Detection**: Files detected as binary by go-enry are excluded
- **Generated Files**: Files with any of the following generation markers in the first 500 bytes are excluded: `generated by`, `do not edit`, `auto-generated`, `autogenerated`, `@generated`, `<auto-generated`
- **Empty Files**: Files that are empty or contain only whitespace are excluded
- **Long Lines**: Files with any line exceeding 1,000 characters are excluded
- **go-enry Filters**: Additional filtering using go-enry's `IsVendor()`, `IsImage()`, `IsDotFile()`, `IsTest()`, and `IsGenerated()` functions
- **Documentation-only Repos**: Repositories containing only documentation files (no actual code) are skipped
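The checks above that do not depend on go-enry can be approximated in a few lines. This is a sketch under assumptions, not the original implementation: marker matching is assumed to be case-insensitive, and the go-enry binary detection and `Is...()` filters are not reproduced. `passes_content_filter` is a hypothetical name.

```python
GENERATION_MARKERS = (
    b"generated by", b"do not edit", b"auto-generated",
    b"autogenerated", b"@generated", b"<auto-generated",
)

def passes_content_filter(raw: bytes) -> bool:
    """Apply the UTF-8, empty-file, generated-file, and long-line checks."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return False  # not valid UTF-8
    if not text.strip():
        return False  # empty or whitespace only
    head = raw[:500].lower()  # case-insensitive matching is an assumption
    if any(marker in head for marker in GENERATION_MARKERS):
        return False
    return all(len(line) <= 1000 for line in text.splitlines())

print(passes_content_filter(b"print('hi')\n"))
print(passes_content_filter(b"// DO NOT EDIT\nint x;\n"))
```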
### Source Data
All data originates from the [Google Code Archive](https://code.google.com/archive/), which preserves projects hosted on Google Code before its shutdown in January 2016.
## Considerations for Using the Data
### Historical Context
This dataset represents code from 2006-2016 and may contain:
- Outdated coding patterns and deprecated APIs
- Legacy library dependencies that are no longer maintained
- Security vulnerabilities that have since been discovered and patched
- Code written for older language versions (Python 2, older Java versions, etc.)
Users should be aware that this code reflects historical practices and may not represent modern best practices.
### Personal and Sensitive Information
The dataset may contain:
- Email addresses in code comments or configuration files
- API keys or credentials that were accidentally committed
- Personal information in comments or documentation
Users should exercise caution and implement appropriate filtering when using this data.
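As one example of such filtering, email addresses can be masked with a regular expression. The pattern below is deliberately simple and is only a sketch: it will miss unusual addresses, may match strings that are not real addresses, and does nothing about API keys or other credentials. The helper name and the `<EMAIL>` placeholder are assumptions.

```python
import re

# Simple pattern: local part, "@", then a domain with at least one dot.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")

def mask_emails(code: str, placeholder: str = "<EMAIL>") -> str:
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_RE.sub(placeholder, code)

print(mask_emails("// author: jane.doe@example.com"))
```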
### Licensing Information
This dataset is a collection of source code from repositories with various licenses. Any use of all or part of the code gathered in this dataset must abide by the terms of the original licenses, including attribution clauses when relevant. The license field in each data point indicates the license of the source repository.},
keywords= {},
terms= {},
license= {},
superseded= {}
}