Name: Google Code Archive Dataset
Creator: nyuuzyou
Published: 2026-01-09 09:18:14
License: https://academictorrents.com/nolicensespecified

main (72 files)

data/google_code_0000.parquet	734.06MB
data/google_code_0001.parquet	669.87MB
data/google_code_0002.parquet	649.12MB
data/google_code_0003.parquet	692.76MB
data/google_code_0004.parquet	712.58MB
data/google_code_0005.parquet	693.88MB
data/google_code_0006.parquet	712.86MB
data/google_code_0007.parquet	728.65MB
data/google_code_0008.parquet	723.45MB
data/google_code_0009.parquet	715.97MB
data/google_code_0010.parquet	706.24MB
data/google_code_0011.parquet	755.59MB
data/google_code_0012.parquet	733.36MB
data/google_code_0013.parquet	689.51MB
data/google_code_0014.parquet	726.90MB
data/google_code_0015.parquet	712.51MB
data/google_code_0016.parquet	679.11MB
data/google_code_0017.parquet	738.23MB
data/google_code_0018.parquet	692.97MB
data/google_code_0019.parquet	671.82MB
data/google_code_0020.parquet	714.01MB
data/google_code_0021.parquet	701.05MB
data/google_code_0022.parquet	671.48MB
data/google_code_0023.parquet	720.06MB
data/google_code_0024.parquet	737.45MB
data/google_code_0025.parquet	724.12MB
data/google_code_0026.parquet	708.04MB
data/google_code_0027.parquet	738.88MB
data/google_code_0028.parquet	702.19MB
data/google_code_0029.parquet	701.68MB
data/google_code_0030.parquet	671.92MB
data/google_code_0031.parquet	716.41MB
data/google_code_0032.parquet	738.84MB
data/google_code_0033.parquet	736.72MB
data/google_code_0034.parquet	685.81MB
data/google_code_0035.parquet	704.99MB
data/google_code_0036.parquet	689.84MB
data/google_code_0037.parquet	685.29MB
data/google_code_0038.parquet	721.99MB
data/google_code_0039.parquet	731.18MB
data/google_code_0040.parquet	758.89MB
data/google_code_0041.parquet	654.67MB
data/google_code_0042.parquet	693.49MB
data/google_code_0043.parquet	690.05MB
data/google_code_0044.parquet	710.00MB
data/google_code_0045.parquet	693.12MB
data/google_code_0046.parquet	715.06MB
data/google_code_0047.parquet	722.32MB
data/google_code_0048.parquet	695.32MB

Type: Dataset

Tags:

Bibtex:

@article{,
title= {Google Code Archive Dataset},
journal= {},
author= {nyuuzyou},
year= {},
url= {https://huggingface.co/datasets/nyuuzyou/google-code-archive},
abstract= {## Dataset Description

This dataset was compiled from the [Google Code Archive](https://code.google.com/archive/), a preserved snapshot of projects hosted on Google Code, Google's open-source project hosting service that operated from 2006 to 2016. Google Code was one of the major code hosting platforms of its era, hosting hundreds of thousands of open-source projects before its shutdown. The archive provides a unique historical record of open-source development during a formative period of modern software engineering.

### Dataset Summary

| Statistic | Value |
|-----------|-------|
| **Total Files** | 65,825,565 |
| **Total Repositories** | 488,618 |
| **Total Size** | 47 GB (compressed Parquet) |
| **Programming Languages** | 454 |
| **File Format** | Parquet with Zstd compression (71 files) |

### Key Features

- **Historical open-source corpus**: Contains code from over 488K repositories hosted on Google Code during 2006-2016
- **Diverse language coverage**: Spans 454 programming languages identified by [go-enry](https://github.com/go-enry/go-enry) (based on GitHub Linguist rules)
- **Rich metadata**: Includes repository name, file path, detected language, license information, and file size
- **Quality filtered**: Filtered to remove vendor code, build artifacts, generated files, and low-quality content
- **Era-specific patterns**: Captures the coding conventions and library usage typical of 2006-2016

### Languages

The dataset includes 454 programming languages. The top 30 languages by file count:

| Rank | Language | File Count |
|------|----------|------------|
| 1 | Java | 16,331,993 |
| 2 | PHP | 12,764,574 |
| 3 | HTML | 5,705,184 |
| 4 | C++ | 5,090,685 |
| 5 | JavaScript | 4,937,765 |
| 6 | C | 4,179,202 |
| 7 | C# | 3,872,245 |
| 8 | Python | 2,207,240 |
| 9 | CSS | 1,697,385 |
| 10 | Objective-C | 1,186,050 |
| 11 | Shell | 639,183 |
| 12 | Java Server Pages | 541,498 |
| 13 | ActionScript | 540,557 |
| 14 | Makefile | 481,563 |
| 15 | ASP.NET | 381,389 |
| 16 | Smarty | 339,555 |
| 17 | Ruby | 331,743 |
| 18 | Go | 316,427 |
| 19 | Perl | 307,960 |
| 20 | Vim Script | 216,236 |
| 21 | Lua | 215,226 |
| 22 | HTML+PHP | 150,781 |
| 23 | HTML+Razor | 149,131 |
| 24 | MATLAB | 145,686 |
| 25 | Batchfile | 138,523 |
| 26 | Pascal | 135,992 |
| 27 | Visual Basic .NET | 118,732 |
| 28 | TeX | 110,379 |
| 29 | Less | 98,221 |
| 30 | Unix Assembly | 94,758 |

### Licenses

The dataset includes files from repositories with various licenses as specified in the Google Code Archive:

| License | File Count |
|---------|------------|
| Apache License 2.0 (asf20) | 21,568,143 |
| GNU GPL v3 (gpl3) | 14,843,470 |
| GNU GPL v2 (gpl2) | 6,824,185 |
| Other Open Source (oos) | 5,433,436 |
| MIT License (mit) | 4,754,567 |
| GNU LGPL (lgpl) | 4,073,137 |
| BSD License (bsd) | 3,787,348 |
| Artistic License (art) | 1,910,047 |
| Eclipse Public License (epl) | 1,587,289 |
| Mozilla Public License 1.1 (mpl11) | 580,102 |
| Multiple Licenses (multiple) | 372,457 |
| Google Summer of Code (gsoc) | 63,292 |
| Public Domain (publicdomain) | 28,092 |
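
As a quick consistency check, the per-license counts in the table above can be summed and compared with the total file count from the Dataset Summary. The sketch below copies the numbers from the two tables; the names `LICENSE_FILE_COUNTS`, `TOTAL_FILES`, and `license_shares` are illustrative and not part of any dataset tooling.

```python
# File counts per Google Code license identifier, copied from the table above.
LICENSE_FILE_COUNTS = {
    "asf20": 21_568_143,
    "gpl3": 14_843_470,
    "gpl2": 6_824_185,
    "oos": 5_433_436,
    "mit": 4_754_567,
    "lgpl": 4_073_137,
    "bsd": 3_787_348,
    "art": 1_910_047,
    "epl": 1_587_289,
    "mpl11": 580_102,
    "multiple": 372_457,
    "gsoc": 63_292,
    "publicdomain": 28_092,
}

TOTAL_FILES = 65_825_565  # "Total Files" from the Dataset Summary table


def license_shares(counts: dict[str, int]) -> dict[str, float]:
    """Return each license's share of all files, as a percentage."""
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}


if __name__ == "__main__":
    # The per-license counts should account for every file in the dataset.
    assert sum(LICENSE_FILE_COUNTS.values()) == TOTAL_FILES
    for name, pct in license_shares(LICENSE_FILE_COUNTS).items():
        print(f"{name:>12}: {pct:5.1f}%")
```

The counts add up to exactly 65,825,565, so every file in the dataset carries one of the 13 license identifiers listed.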

## Dataset Structure

### Data Fields

| Field | Type | Description |
|-------|------|-------------|
| `code` | string | Content of the source file (UTF-8 encoded) |
| `repo_name` | string | Name of the Google Code project |
| `path` | string | Path of the file within the repository (relative to repo root) |
| `language` | string | Programming language as identified by [go-enry](https://github.com/go-enry/go-enry) |
| `license` | string | License of the repository (Google Code license identifier) |
| `size` | int64 | Size of the source file in bytes |

### Data Format

- **Format**: Apache Parquet with Zstd compression
- **File Structure**: 71 files (`google_code_0000.parquet` to `google_code_0070.parquet`)
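
A minimal sketch of how the shards might be read locally, assuming the files have been downloaded into a `data/` directory and that the third-party `pyarrow` package is installed. The helper names `shard_paths` and `iter_records` are hypothetical; only the file naming pattern and the column names come from this card.

```python
from pathlib import Path
from typing import Iterator, Optional

NUM_SHARDS = 71  # google_code_0000.parquet .. google_code_0070.parquet


def shard_paths(data_dir: str = "data") -> list[Path]:
    """Build the expected shard paths from the documented naming pattern."""
    return [Path(data_dir) / f"google_code_{i:04d}.parquet" for i in range(NUM_SHARDS)]


def iter_records(path: Path, columns: Optional[list[str]] = None,
                 batch_size: int = 1024) -> Iterator[dict]:
    """Stream rows from one shard in batches instead of loading it whole."""
    import pyarrow.parquet as pq  # imported lazily; third-party dependency

    parquet_file = pq.ParquetFile(path)
    for batch in parquet_file.iter_batches(batch_size=batch_size, columns=columns):
        yield from batch.to_pylist()


if __name__ == "__main__":
    first = shard_paths()[0]
    if first.exists():  # only read when the shard is actually present
        wanted = ["repo_name", "path", "language", "size"]
        for row in iter_records(first, columns=wanted):
            print(row)
            break
```

Reading in batches and selecting only the metadata columns avoids loading the `code` column, which accounts for most of each shard's size.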

### Data Splits

All examples are in the train split. There is no validation or test split.

### Example Data Point

```
{
    'code': 'public class HundredIntegers {\n\tpublic static void main (String[] args) {\n\t\tfor (int i = 1; i<=100; i++) {\n\t\t\tSystem.out.println(i);\n\t\t}\n\t}\n}',
    'repo_name': '100integers',
    'path': 'HundredIntegers.java',
    'language': 'Java',
    'license': 'epl',
    'size': 147
}
```
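
Records with this shape can be filtered on the `language` and `license` fields. The sketch below works on any iterable of dictionaries. The `select` helper, the second sample record, and the chosen license identifiers are illustrative assumptions.

```python
from typing import Iterable, Iterator


def select(records: Iterable[dict], languages: set[str],
           licenses: set[str]) -> Iterator[dict]:
    """Yield records whose language and license are both in the allowed sets."""
    for record in records:
        if record["language"] in languages and record["license"] in licenses:
            yield record


# Two hand-written records for demonstration (the `code` field is left out
# for brevity). The first reuses the metadata of the example data point
# above; the second is made up.
sample = [
    {"repo_name": "100integers", "path": "HundredIntegers.java",
     "language": "Java", "license": "epl", "size": 147},
    {"repo_name": "demo-project", "path": "main.py",
     "language": "Python", "license": "gpl3", "size": 12},
]

kept = list(select(sample, languages={"Java", "Python"}, licenses={"epl", "mit"}))
print([r["repo_name"] for r in kept])
```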

## Dataset Creation

### Pipeline Overview

The dataset was created through a multi-stage pipeline:

1. **Project Discovery**: Fetching project metadata from the Google Code Archive
2. **Source Filtering**: Selecting projects that have source code available (`hasSource: true`)
3. **Archive Downloading**: Downloading source archives from the Google Code Archive storage
4. **Content Extraction**: Extracting and filtering source code files
5. **Parquet Generation**: Writing filtered records to Parquet shards with Zstd compression

### Language Detection

Programming languages are detected using [go-enry](https://github.com/go-enry/go-enry), a Go port of GitHub's Linguist library. Only files classified as **Programming** or **Markup** language types are included (Data and Prose types are excluded).

### License Detection

Licenses are obtained directly from the Google Code Archive project metadata. The archive preserves the original license selection made by project owners when creating their repositories on Google Code.

### File Filtering

Extensive filtering is applied to ensure data quality:

#### Size Limits
| Limit | Value |
|-------|-------|
| Max repository archive size | 64 MB |
| Max single file size | 2 MB |
| Max line length | 1,000 characters |

#### Excluded Directories
- **Configuration**: `.git/`, `.github/`, `.gitlab/`, `.vscode/`, `.idea/`, `.vs/`, `.settings/`, `.eclipse/`, `.project/`, `.metadata/`
- **Vendor/Dependencies**: `node_modules/`, `bower_components/`, `jspm_packages/`, `vendor/`, `third_party/`, `3rdparty/`, `external/`, `packages/`, `deps/`, `lib/vendor/`, `target/dependency/`, `Pods/`
- **Build Output**: `build/`, `dist/`, `out/`, `bin/`, `target/`, `release/`, `debug/`, `.next/`, `.nuxt/`, `_site/`, `_build/`, `__pycache__/`, `.pytest_cache/`, `cmake-build-*`, `.gradle/`, `.maven/`
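
A directory filter of this kind can be approximated by checking each path component against the excluded names. This is a sketch of one possible approach, not the pipeline's actual code. It lists only a subset of the directories above, and it assumes case-sensitive matching and that an excluded directory matches at any depth.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Subset of the excluded directory names listed above (single path components).
EXCLUDED_DIR_PATTERNS = [
    ".git", ".idea", ".settings",
    "node_modules", "vendor", "third_party", "3rdparty",
    "build", "dist", "out", "bin", "target", "__pycache__",
    "cmake-build-*",
]


def in_excluded_dir(path: str) -> bool:
    """Return True if any parent directory of `path` matches an excluded name."""
    directories = PurePosixPath(path).parts[:-1]  # drop the file name itself
    return any(
        fnmatchcase(part, pattern)
        for part in directories
        for pattern in EXCLUDED_DIR_PATTERNS
    )


for candidate in ("trunk/src/Main.java",
                  "trunk/node_modules/lib/index.js",
                  "cmake-build-debug/CMakeCache.txt"):
    print(candidate, in_excluded_dir(candidate))
```

Only directory components are compared, so a file such as `build.gradle` at the repository root is not mistaken for the `build/` directory.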

#### Excluded Files
- **Lock Files**: `package-lock.json`, `yarn.lock`, `pnpm-lock.yaml`, `Gemfile.lock`, `Cargo.lock`, `poetry.lock`, `Pipfile.lock`, `composer.lock`, `go.sum`, `mix.lock`
- **Minified Files**: Any file containing `.min.` in the name
- **Binary Files**: `.exe`, `.dll`, `.so`, `.dylib`, `.a`, `.lib`, `.o`, `.obj`, `.jar`, `.war`, `.ear`, `.class`, `.pyc`, `.pyo`, `.wasm`, `.bin`, `.dat`, `.pdf`, `.doc`, `.docx`, `.xls`, `.xlsx`, `.ppt`, `.pptx`, `.zip`, `.tar`, `.gz`, `.bz2`, `.7z`, `.rar`, `.jpg`, `.jpeg`, `.png`, `.gif`, `.bmp`, `.ico`, `.svg`, `.mp3`, `.mp4`, `.avi`, `.mov`, `.wav`, `.flac`, `.ttf`, `.otf`, `.woff`, `.woff2`, `.eot`
- **System Files**: `.DS_Store`, `thumbs.db`
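
The name-based rules above could be expressed roughly as follows. This sketch covers only a subset of the listed names and extensions. The function name `is_excluded_file` and the case-insensitive handling of extensions and system files are assumptions, not details confirmed by the card.

```python
from pathlib import PurePosixPath

# Subsets of the lists above, kept short for readability.
LOCK_FILES = {"package-lock.json", "yarn.lock", "Gemfile.lock",
              "Cargo.lock", "composer.lock", "go.sum"}
BINARY_EXTENSIONS = {".exe", ".dll", ".so", ".jar", ".class", ".pyc",
                     ".zip", ".png", ".jpg", ".pdf"}
SYSTEM_FILES = {".ds_store", "thumbs.db"}  # compared in lower case


def is_excluded_file(path: str) -> bool:
    """Apply the name-based exclusion rules to a single file path."""
    name = PurePosixPath(path).name
    lowered = name.lower()
    if name in LOCK_FILES or lowered in SYSTEM_FILES:
        return True
    if ".min." in lowered:  # minified assets such as jquery.min.js
        return True
    return PurePosixPath(lowered).suffix in BINARY_EXTENSIONS


for candidate in ("js/jquery.min.js", "lib/app.jar", "src/Main.java"):
    print(candidate, is_excluded_file(candidate))
```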

#### Content Filtering
- **UTF-8 Validation**: Files must be valid UTF-8 encoded text
- **Binary Detection**: Files detected as binary by go-enry are excluded
- **Generated Files**: Files with generation markers in the first 500 bytes are excluded:
  - `generated by`, `do not edit`, `auto-generated`, `autogenerated`, `@generated`, `<auto-generated`
- **Empty Files**: Files that are empty or contain only whitespace are excluded
- **Long Lines**: Files with any line exceeding 1,000 characters are excluded
- **go-enry Filters**: Additional filtering using go-enry's `IsVendor()`, `IsImage()`, `IsDotFile()`, `IsTest()`, and `IsGenerated()` functions
- **Documentation-only Repos**: Repositories containing only documentation files (no actual code) are skipped
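
Several of these content checks can be sketched in plain Python. The function below is an illustration under stated assumptions, not the pipeline's implementation: it treats 2 MB as 2 × 1024 × 1024 bytes, searches for generation markers case-insensitively, and leaves out the go-enry based checks (binary detection, `IsVendor()`, and so on), which have no equivalent here.

```python
from typing import Optional

MAX_FILE_BYTES = 2 * 1024 * 1024  # "2 MB"; the exact byte value is an assumption
MAX_LINE_CHARS = 1_000
GENERATED_MARKERS = (b"generated by", b"do not edit", b"auto-generated",
                     b"autogenerated", b"@generated", b"<auto-generated")


def rejection_reason(raw: bytes) -> Optional[str]:
    """Return why a file would be dropped, or None if it passes these checks."""
    if len(raw) > MAX_FILE_BYTES:
        return "too large"
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return "not valid UTF-8"
    if not text.strip():
        return "empty"
    # Only the first 500 bytes are inspected for generation markers.
    head = raw[:500].lower()
    if any(marker in head for marker in GENERATED_MARKERS):
        return "generated"
    if any(len(line) > MAX_LINE_CHARS for line in text.splitlines()):
        return "long line"
    return None


print(rejection_reason(b"print('hi')\n"))
print(rejection_reason(b"// DO NOT EDIT\nint x;\n"))
```

Returning a reason string rather than a boolean makes it easy to count how many files each rule removes.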

### Source Data

All data originates from the [Google Code Archive](https://code.google.com/archive/), which preserves projects hosted on Google Code before its shutdown in January 2016.

## Considerations for Using the Data

### Historical Context

This dataset represents code from 2006-2016 and may contain:
- Outdated coding patterns and deprecated APIs
- Legacy library dependencies that are no longer maintained
- Security vulnerabilities that have since been discovered and patched
- Code written for older language versions (Python 2, older Java versions, etc.)

Users should be aware that this code reflects historical practices and may not represent modern best practices.

### Personal and Sensitive Information

The dataset may contain:
- Email addresses in code comments or configuration files
- API keys or credentials that were accidentally committed
- Personal information in comments or documentation

Users should exercise caution and implement appropriate filtering when using this data.
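
One simple form of such filtering is masking email addresses before further use. The regular expression below is a deliberately loose illustration and an assumption of this sketch. It will miss obfuscated addresses and does not detect credentials or API keys, which need dedicated secret-scanning tools.

```python
import re

# Loose pattern for typical addresses such as user@example.com.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def mask_emails(code: str, placeholder: str = "[EMAIL]") -> str:
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_RE.sub(placeholder, code)


print(mask_emails("// Author: Jane Doe <jane.doe@example.com>"))
```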

### Licensing Information

This dataset is a collection of source code from repositories with various licenses. Any use of all or part of the code gathered in this dataset must abide by the terms of the original licenses, including attribution clauses when relevant. The license field in each data point indicates the license of the source repository.},
keywords= {},
terms= {},
license= {},
superseded= {}
}
