Google Code Archive Dataset
nyuuzyou

folder main (72 files)
filedata/google_code_0000.parquet 734.06MB
filedata/google_code_0001.parquet 669.87MB
filedata/google_code_0002.parquet 649.12MB
filedata/google_code_0003.parquet 692.76MB
filedata/google_code_0004.parquet 712.58MB
filedata/google_code_0005.parquet 693.88MB
filedata/google_code_0006.parquet 712.86MB
filedata/google_code_0007.parquet 728.65MB
filedata/google_code_0008.parquet 723.45MB
filedata/google_code_0009.parquet 715.97MB
filedata/google_code_0010.parquet 706.24MB
filedata/google_code_0011.parquet 755.59MB
filedata/google_code_0012.parquet 733.36MB
filedata/google_code_0013.parquet 689.51MB
filedata/google_code_0014.parquet 726.90MB
filedata/google_code_0015.parquet 712.51MB
filedata/google_code_0016.parquet 679.11MB
filedata/google_code_0017.parquet 738.23MB
filedata/google_code_0018.parquet 692.97MB
filedata/google_code_0019.parquet 671.82MB
filedata/google_code_0020.parquet 714.01MB
filedata/google_code_0021.parquet 701.05MB
filedata/google_code_0022.parquet 671.48MB
filedata/google_code_0023.parquet 720.06MB
filedata/google_code_0024.parquet 737.45MB
filedata/google_code_0025.parquet 724.12MB
filedata/google_code_0026.parquet 708.04MB
filedata/google_code_0027.parquet 738.88MB
filedata/google_code_0028.parquet 702.19MB
filedata/google_code_0029.parquet 701.68MB
filedata/google_code_0030.parquet 671.92MB
filedata/google_code_0031.parquet 716.41MB
filedata/google_code_0032.parquet 738.84MB
filedata/google_code_0033.parquet 736.72MB
filedata/google_code_0034.parquet 685.81MB
filedata/google_code_0035.parquet 704.99MB
filedata/google_code_0036.parquet 689.84MB
filedata/google_code_0037.parquet 685.29MB
filedata/google_code_0038.parquet 721.99MB
filedata/google_code_0039.parquet 731.18MB
filedata/google_code_0040.parquet 758.89MB
filedata/google_code_0041.parquet 654.67MB
filedata/google_code_0042.parquet 693.49MB
filedata/google_code_0043.parquet 690.05MB
filedata/google_code_0044.parquet 710.00MB
filedata/google_code_0045.parquet 693.12MB
filedata/google_code_0046.parquet 715.06MB
filedata/google_code_0047.parquet 722.32MB
filedata/google_code_0048.parquet 695.32MB
Too many files! Click here to view them all.
Type: Dataset
Tags:

Bibtex:
@article{,
title= {Google Code Archive Dataset},
journal= {},
author= {nyuuzyou},
year= {},
url= {https://huggingface.co/datasets/nyuuzyou/google-code-archive},
abstract= {## Dataset Description

This dataset was compiled from the [Google Code Archive](https://code.google.com/archive/), a preserved snapshot of projects hosted on Google Code, Google's open-source project hosting service that operated from 2006 to 2016. Google Code was one of the major code hosting platforms of its era, hosting hundreds of thousands of open-source projects before its shutdown. The archive provides a unique historical record of open-source development during a formative period of modern software engineering.

### Dataset Summary

| Statistic | Value |
|-----------|-------|
| **Total Files** | 65,825,565 |
| **Total Repositories** | 488,618 |
| **Total Size** | 47 GB (compressed Parquet) |
| **Programming Languages** | 454 |
| **File Format** | Parquet with Zstd compression (71 files) |

### Key Features

- **Historical open-source corpus**: Contains code from over 488K repositories hosted on Google Code during 2006-2016
- **Diverse language coverage**: Spans 454 programming languages identified by [go-enry](https://github.com/go-enry/go-enry) (based on GitHub Linguist rules)
- **Rich metadata**: Includes repository name, file path, detected language, license information, and file size
- **Quality filtered**: Extensive filtering to remove vendor code, build artifacts, generated files, and low-quality content
- **Era-specific patterns**: Captures coding conventions and library usage from the pre-modern era of software development

### Languages

The dataset includes 454 programming languages. The top 30 languages by file count:

| Rank | Language | File Count |
|------|----------|------------|
| 1 | Java | 16,331,993 |
| 2 | PHP | 12,764,574 |
| 3 | HTML | 5,705,184 |
| 4 | C++ | 5,090,685 |
| 5 | JavaScript | 4,937,765 |
| 6 | C | 4,179,202 |
| 7 | C# | 3,872,245 |
| 8 | Python | 2,207,240 |
| 9 | CSS | 1,697,385 |
| 10 | Objective-C | 1,186,050 |
| 11 | Shell | 639,183 |
| 12 | Java Server Pages | 541,498 |
| 13 | ActionScript | 540,557 |
| 14 | Makefile | 481,563 |
| 15 | ASP.NET | 381,389 |
| 16 | Smarty | 339,555 |
| 17 | Ruby | 331,743 |
| 18 | Go | 316,427 |
| 19 | Perl | 307,960 |
| 20 | Vim Script | 216,236 |
| 21 | Lua | 215,226 |
| 22 | HTML+PHP | 150,781 |
| 23 | HTML+Razor | 149,131 |
| 24 | MATLAB | 145,686 |
| 25 | Batchfile | 138,523 |
| 26 | Pascal | 135,992 |
| 27 | Visual Basic .NET | 118,732 |
| 28 | TeX | 110,379 |
| 29 | Less | 98,221 |
| 30 | Unix Assembly | 94,758 |

### Licenses

The dataset includes files from repositories with various licenses as specified in the Google Code Archive:

| License | File Count |
|---------|------------|
| Apache License 2.0 (asf20) | 21,568,143 |
| GNU GPL v3 (gpl3) | 14,843,470 |
| GNU GPL v2 (gpl2) | 6,824,185 |
| Other Open Source (oos) | 5,433,436 |
| MIT License (mit) | 4,754,567 |
| GNU LGPL (lgpl) | 4,073,137 |
| BSD License (bsd) | 3,787,348 |
| Artistic License (art) | 1,910,047 |
| Eclipse Public License (epl) | 1,587,289 |
| Mozilla Public License 1.1 (mpl11) | 580,102 |
| Multiple Licenses (multiple) | 372,457 |
| Google Summer of Code (gsoc) | 63,292 |
| Public Domain (publicdomain) | 28,092 |

## Dataset Structure

### Data Fields

| Field | Type | Description |
|-------|------|-------------|
| `code` | string | Content of the source file (UTF-8 encoded) |
| `repo_name` | string | Name of the Google Code project |
| `path` | string | Path of the file within the repository (relative to repo root) |
| `language` | string | Programming language as identified by [go-enry](https://github.com/go-enry/go-enry) |
| `license` | string | License of the repository (Google Code license identifier) |
| `size` | int64 | Size of the source file in bytes |

### Data Format

- **Format**: Apache Parquet with Zstd compression
- **File Structure**: 71 files (`google_code_0000.parquet` to `google_code_0070.parquet`)

### Data Splits

All examples are in the train split. There is no validation or test split.

### Example Data Point

```
{
    'code': 'public class HundredIntegers {ntpublic static void main (String[] args) {nttfor (int i = 1; i<=100; i++) {ntttSystem.out.println(i);ntt}nt}n}',
    'repo_name': '100integers',
    'path': 'HundredIntegers.java',
    'language': 'Java',
    'license': 'epl',
    'size': 147
}
```

## Dataset Creation

### Pipeline Overview

The dataset was created through a multi-stage pipeline:

1. **Project Discovery**: Fetching project metadata from the Google Code Archive
2. **Source Filtering**: Selecting projects that have source code available (`hasSource: true`)
3. **Archive Downloading**: Downloading source archives from the Google Code Archive storage
4. **Content Extraction**: Extracting and filtering source code files
5. **Parquet Generation**: Writing filtered records to Parquet shards with Zstd compression

### Language Detection

Programming languages are detected using [go-enry](https://github.com/go-enry/go-enry), a Go port of GitHub's Linguist library. Only files classified as **Programming** or **Markup** language types are included (Data and Prose types are excluded).

### License Detection

Licenses are obtained directly from the Google Code Archive project metadata. The archive preserves the original license selection made by project owners when creating their repositories on Google Code.

### File Filtering

Extensive filtering is applied to ensure data quality:

#### Size Limits
| Limit | Value |
|-------|-------|
| Max repository archive size | 64 MB |
| Max single file size | 2 MB |
| Max line length | 1,000 characters |

#### Excluded Directories
- **Configuration**: `.git/`, `.github/`, `.gitlab/`, `.vscode/`, `.idea/`, `.vs/`, `.settings/`, `.eclipse/`, `.project/`, `.metadata/`
- **Vendor/Dependencies**: `node_modules/`, `bower_components/`, `jspm_packages/`, `vendor/`, `third_party/`, `3rdparty/`, `external/`, `packages/`, `deps/`, `lib/vendor/`, `target/dependency/`, `Pods/`
- **Build Output**: `build/`, `dist/`, `out/`, `bin/`, `target/`, `release/`, `debug/`, `.next/`, `.nuxt/`, `_site/`, `_build/`, `__pycache__/`, `.pytest_cache/`, `cmake-build-*`, `.gradle/`, `.maven/`

#### Excluded Files
- **Lock Files**: `package-lock.json`, `yarn.lock`, `pnpm-lock.yaml`, `Gemfile.lock`, `Cargo.lock`, `poetry.lock`, `Pipfile.lock`, `composer.lock`, `go.sum`, `mix.lock`
- **Minified Files**: Any file containing `.min.` in the name
- **Binary Files**: `.exe`, `.dll`, `.so`, `.dylib`, `.a`, `.lib`, `.o`, `.obj`, `.jar`, `.war`, `.ear`, `.class`, `.pyc`, `.pyo`, `.wasm`, `.bin`, `.dat`, `.pdf`, `.doc`, `.docx`, `.xls`, `.xlsx`, `.ppt`, `.pptx`, `.zip`, `.tar`, `.gz`, `.bz2`, `.7z`, `.rar`, `.jpg`, `.jpeg`, `.png`, `.gif`, `.bmp`, `.ico`, `.svg`, `.mp3`, `.mp4`, `.avi`, `.mov`, `.wav`, `.flac`, `.ttf`, `.otf`, `.woff`, `.woff2`, `.eot`
- **System Files**: `.DS_Store`, `thumbs.db`

#### Content Filtering
- **UTF-8 Validation**: Files must be valid UTF-8 encoded text
- **Binary Detection**: Files detected as binary by go-enry are excluded
- **Generated Files**: Files with generation markers in the first 500 bytes are excluded:
  - `generated by`, `do not edit`, `auto-generated`, `autogenerated`, `@generated`, `<auto-generated`
- **Empty Files**: Files that are empty or contain only whitespace are excluded
- **Long Lines**: Files with any line exceeding 1,000 characters are excluded
- **go-enry Filters**: Additional filtering using go-enry's `IsVendor()`, `IsImage()`, `IsDotFile()`, `IsTest()`, and `IsGenerated()` functions
- **Documentation-only Repos**: Repositories containing only documentation files (no actual code) are skipped

### Source Data

All data originates from the [Google Code Archive](https://code.google.com/archive/), which preserves projects hosted on Google Code before its shutdown in January 2016.

## Considerations for Using the Data

### Historical Context

This dataset represents code from 2006-2016 and may contain:
- Outdated coding patterns and deprecated APIs
- Legacy library dependencies that are no longer maintained
- Security vulnerabilities that have since been discovered and patched
- Code written for older language versions (Python 2, older Java versions, etc.)

Users should be aware that this code reflects historical practices and may not represent modern best practices.

### Personal and Sensitive Information

The dataset may contain:
- Email addresses in code comments or configuration files
- API keys or credentials that were accidentally committed
- Personal information in comments or documentation

Users should exercise caution and implement appropriate filtering when using this data.

### Licensing Information

This dataset is a collection of source code from repositories with various licenses. Any use of all or part of the code gathered in this dataset must abide by the terms of the original licenses, including attribution clauses when relevant. The license field in each data point indicates the license of the source repository.},
keywords= {},
terms= {},
license= {},
superseded= {}
}


Send Feedback