Crossref Event Data Archive
Crossref

Crossref Event Data Archive (2 files)
archive.tar.gz 107.59GB
README.md 9.08kB
Type: Dataset

Metadata:
@article{,
title= {Crossref Event Data Archive},
doi= {https://doi.org/10.13003/wjyr-rv9j},
journal= {},
author= {Crossref},
year= {},
url= {},
abstract= {# Crossref Event Data archive

This is an archive of all events collected by selected Crossref Event Data agents between the service's launch on 2017/02/17 and its deprecation on 2026/04/23.

The DOI of this dataset is https://doi.org/10.13003/wjyr-rv9j

## File format

The data are provided in [JSONL](https://jsonlines.org/) format.

Each data file has a `.jsonl` file extension and contains up to 5000 entries (lines).

Filenames follow this pattern: `agent-nnnn.jsonl`
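
Because every line of a data file is a standalone JSON document, a file can be read one event at a time without loading it whole. The following is a minimal sketch of that approach; the function name `iter_events` is hypothetical and not part of any Crossref tooling.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_events(path: Path) -> Iterator[dict]:
    """Yield one event dict per non-empty line of a .jsonl file."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines, e.g. at the end of the file
                yield json.loads(line)
```

A caller would loop over `iter_events(Path("crossref-0001.jsonl"))` and handle each dict in turn, which keeps memory use flat regardless of file size.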

## Structure of the archive

The data are hierarchically grouped in directories by agent, year, month, day and a 4-digit directory ID.

For example:

```shell
$ tree -L 6 data | head -n 18

data
├── crossref
│   ├── 2021
│   │   ├── 01
│   │   │   └── 01
│   │   │       └── 0000
│   │   │           └── crossref-0001.jsonl
│   │   ├── 04
│   │   │   ├── 28
│   │   │   │   └── 0000
│   │   │   │       └── crossref-0001.jsonl
│   │   │   ├── 29
│   │   │   │   └── 0000
│   │   │   │       ├── crossref-0001.jsonl
│   │   │   │       └── crossref-0002.jsonl
│   │   │   └── 30
│   │   │       └── 0000
│   │   │           └── crossref-0001.jsonl
```

Each daily directory contains one or more 4-digit directories, which in turn contain the JSONL data files.

Each 4-digit directory holds at most 1000 files.
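
The directory layout can be used to select files by agent or date without opening them. The sketch below walks an extracted tree and parses the path components; it assumes the archive was unpacked so that `root` points at the `data` directory shown above, and the names `DataFile` and `iter_data_files` are illustrative only.

```python
from pathlib import Path
from typing import Iterator, NamedTuple


class DataFile(NamedTuple):
    agent: str
    year: int
    month: int
    day: int
    dir_id: str  # the 4-digit directory ID, kept as text to preserve zeros
    path: Path


def iter_data_files(root: Path) -> Iterator[DataFile]:
    """Yield every data file under root with its path components parsed."""
    # Layout: <agent>/<year>/<month>/<day>/<4-digit id>/<agent>-nnnn.jsonl
    for path in sorted(root.glob("*/*/*/*/*/*.jsonl")):
        agent, year, month, day, dir_id = path.relative_to(root).parts[:5]
        yield DataFile(agent, int(year), int(month), int(day), dir_id, path)
```

Filtering then becomes a plain comparison, for example keeping only entries where `agent == "wikipedia"` and `year == 2021`.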

## Agents

The export contains data for the following agents:

| Agent name      | Description                                                                                                |
| --------------- | ---------------------------------------------------------------------------------------------------------- |
| crossref        | Relationships and references to datasets and DOI registration agencies other than Crossref (e.g. DataCite) |
| f1000           | Recommendations of research publications                                                                   |
| facultyopinions | Recommendations of research publications (formerly F1000)                                                  |
| hypothesis      | Annotations in Hypothes\.is                                                                                |
| newsfeed        | Discussed in blogs and media                                                                               |
| reddit          | Discussed on Reddit                                                                                        |
| reddit-links    | Discussed on sites linked to in subreddits                                                                 |
| stackexchange   | Discussed on StackExchange sites                                                                           |
| web             | Discussed on selected webpages                                                                             |
| wikipedia       | References on Wikipedia pages                                                                              |
| wordpressdotcom | Discussed on Wordpress\.com sites                                                                          |

## Event data structure

Each event captures a relationship between a subject and an object, expressed as a triplet of [subject, relationship, object].

The `event` data structure has several properties; the most important ones are listed below. Nested properties are written with a colon, so `subj:pid` refers to the `pid` key inside the `subj` object.

| Property              | Description                                                                                                                                            |
| --------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------ |
| license               | The license per event. Different agents may have a different license and that may also change over time. It is best to consider the license per event. |
| obj_id                | The id of the object                                                                                                                                   |
| subj_id               | The id of the subject                                                                                                                                  |
| occurred_at           | When was the event observed?                                                                                                                           |
| id                    | A unique id, can be helpful when processing the archive                                                                                                |
| subj:pid              | Same as subj_id                                                                                                                                        |
| obj:pid               | Same as obj_id                                                                                                                                         |
| subj:url              | The source of the event                                                                                                                                |
| obj:url               | Same as obj:pid                                                                                                                                        |
| subj:title            | Some agents (for example Wikipedia) capture a title for the subject's url                                                                              |
| subj/obj:work_type_id | The type of the identified pid                                                                                                                         |
| source_id             | The agent that captured this event                                                                                                                     |
| relation_type_id      | The relation type of the relationship between the subject and the object                                                                               |

Note that, depending on the agent, `subj:url` may or may not equal `subj:pid`, so `subj:url` should always be treated as an independent value. The same caution applies to `subj_id`: in the Wikipedia example below it matches `subj:api-url` rather than `subj:pid`.
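
To make the triplet idea concrete, the sketch below reduces a parsed event to its three core values using the top-level `subj_id`, `relation_type_id` and `obj_id` properties. The `Triple` type and `to_triple` function are hypothetical helpers, and the code assumes all three keys are present on every event, which this README does not explicitly guarantee.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    relation: str
    obj: str


def to_triple(event: dict) -> Triple:
    """Reduce an event to its (subject, relation, object) triple."""
    # A KeyError here would signal an event missing one of the core keys.
    return Triple(event["subj_id"], event["relation_type_id"], event["obj_id"])
```

Using the top-level ids rather than the nested `pid` values is a design choice for this sketch; callers who need the nested values can read `event["subj"]["pid"]` and `event["obj"]["pid"]` instead.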

An example of a Crossref agent event:

```json
{
  "license": "https://creativecommons.org/publicdomain/zero/1.0/",
  "obj_id": "https://doi.org/10.14383/cri.2017.12.2.149",
  "source_token": "36c35e23-8757-4a9d-aacf-345e9b7eb50d",
  "occurred_at": "2025-01-01T11:10:07.000Z",
  "subj_id": "https://doi.org/10.3390/vetsci11020090",
  "id": "9d5be3e3-3141-4965-836a-8831178726b2",
  "action": "add",
  "subj": {
    "pid": "https://doi.org/10.3390/vetsci11020090",
    "url": "https://doi.org/10.3390/vetsci11020090",
    "work_type_id": "journal-article"
  },
  "source_id": "crossref",
  "obj": {
    "pid": "https://doi.org/10.14383/cri.2017.12.2.149",
    "url": "https://doi.org/10.14383/cri.2017.12.2.149",
    "method": "doi-literal",
    "verification": "literal"
  },
  "relation_type_id": "references"
}
```

An example of a Wikipedia agent event:

```json
{
  "license": "https://creativecommons.org/publicdomain/zero/1.0/",
  "obj_id": "https://doi.org/10.2307/2128863",
  "source_token": "36c35e23-8757-4a9d-aacf-345e9b7eb50d",
  "occurred_at": "2025-01-01T17:02:25Z",
  "subj_id": "https://en.wikipedia.org/api/rest_v1/page/html/Mohammad_Reza_Pahlavi/1266654138",
  "id": "6618b137-da03-4f15-ac86-ca3283c28cb6",
  "action": "add",
  "subj": {
    "pid": "https://en.wikipedia.org/wiki/Mohammad_Reza_Pahlavi",
    "url": "https://en.wikipedia.org/w/index.php?title=Mohammad_Reza_Pahlavi&oldid=1266654138",
    "title": "Mohammad Reza Pahlavi",
    "work_type_id": "entry-encyclopedia",
    "api-url": "https://en.wikipedia.org/api/rest_v1/page/html/Mohammad_Reza_Pahlavi/1266654138"
  },
  "source_id": "wikipedia",
  "obj": {
    "pid": "https://doi.org/10.2307/2128863",
    "url": "https://doi.org/10.2307/2128863",
    "method": "doi-literal",
    "verification": "literal"
  },
  "relation_type_id": "references"
}
```
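
The two examples above write `occurred_at` slightly differently: one includes milliseconds (`.000Z`) and the other does not (`Z`). The sketch below normalizes both forms into timezone-aware UTC datetimes. It assumes every timestamp in the archive is an ISO 8601 string in one of these two shapes, which is inferred from the examples only; `parse_occurred_at` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def parse_occurred_at(value: str) -> datetime:
    """Parse an occurred_at string into a timezone-aware UTC datetime."""
    # Python versions before 3.11 reject a trailing "Z" in fromisoformat,
    # so spell the UTC offset out explicitly.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value).astimezone(timezone.utc)
```

Once parsed, timestamps from different agents compare and sort consistently.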

## Archive statistics

### Uncompressed size

On a typical `ext4` Linux partition, the reported uncompressed size of the data directory is about 1.4 terabytes.

```shell
$ du -h -d 1 data

27M     data/stackexchange
202M    data/reddit-links
20M     data/facultyopinions
963M    data/wordpressdotcom
804G    data/wikipedia
351M    data/web
565G    data/crossref
125M    data/f1000
1.5G    data/newsfeed
1.5G    data/hypothesis
172M    data/reddit
1.4T    data
```
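
Given the uncompressed size, it may be preferable to process `archive.tar.gz` as a stream instead of unpacking it. The sketch below counts events per agent that way with Python's standard `tarfile` module. It assumes that member names inside the tarball end in the documented `agent-nnnn.jsonl` pattern and that each non-empty line is one event; the function name `count_events_per_agent` is hypothetical.

```python
import tarfile
from collections import Counter


def count_events_per_agent(archive_path: str) -> Counter:
    """Count JSONL lines per agent while streaming the tarball."""
    counts: Counter = Counter()
    # "r|gz" reads the archive sequentially, without seeking or extracting.
    with tarfile.open(archive_path, mode="r|gz") as archive:
        for member in archive:
            if not (member.isfile() and member.name.endswith(".jsonl")):
                continue
            # Assumed filename pattern: <agent>-nnnn.jsonl
            filename = member.name.rsplit("/", 1)[-1]
            agent = filename.rsplit("-", 1)[0]
            handle = archive.extractfile(member)
            if handle is None:
                continue
            counts[agent] += sum(1 for line in handle if line.strip())
    return counts
```

A full pass still has to decompress the entire archive, so it is likely to take a long time, but it needs no additional disk space. The resulting counts can be compared with the per-agent totals listed in this README.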

### Events per agent

| Agent           | Total       |
| --------------- | ----------- |
| crossref        | 859,509,464 |
| f1000           | 202,406     |
| facultyopinions | 27,593      |
| hypothesis      | 1,118,798   |
| newsfeed        | 1,714,471   |
| reddit          | 165,388     |
| reddit-links    | 257,478     |
| stackexchange   | 19,395      |
| web             | 589,891     |
| wikipedia       | 962,980,572 |
| wordpressdotcom | 1,079,650   |},
keywords= {Crossref, Scholarly Metadata},
terms= {},
license= {},
superseded= {}
}

Citation:
Crossref. (2026). Crossref Event Data Archive [Data set]. Academic Torrents. https://academictorrents.com/details/16396475b640d8487a6b723eada8a440fb33d3ce