Crossref Event Data Archive (2 files)
archive.tar.gz | 107.59GB
README.md | 9.08kB
Type: Dataset
Metadata:
@article{,
title= {Crossref Event Data Archive},
doi= {https://doi.org/10.13003/wjyr-rv9j},
journal= {},
author= {Crossref},
year= {},
url= {},
abstract= {# Crossref Event Data archive
This is an archive of all events collected by selected Crossref Event Data agents between the service's launch on 2017/02/17 and its deprecation on 2026/04/23.
The DOI of this dataset is https://doi.org/10.13003/wjyr-rv9j
## File format
The data are provided in [JSONL](https://jsonlines.org/) format.
Each data file has a `.jsonl` file extension and contains up to 5000 entries (lines).
Filenames follow this pattern: `agent-nnnn.jsonl`
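As a minimal sketch of how one data file could be read, the following Python snippet parses a JSONL file line by line and recovers the agent name from the filename. The function names `read_events` and `agent_from_filename` are illustrative and not part of the archive or of any Crossref tooling.

```python
import json
from pathlib import Path
from typing import Iterator


def read_events(path: Path) -> Iterator[dict]:
    """Yield one event (a dict) per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines, e.g. at the end of a file
                yield json.loads(line)


def agent_from_filename(name: str) -> str:
    """Return the agent part of an `agent-nnnn.jsonl` filename."""
    stem = name[: -len(".jsonl")] if name.endswith(".jsonl") else name
    # Split on the last hyphen only: agent names such as
    # `reddit-links` contain hyphens themselves.
    agent, _, _counter = stem.rpartition("-")
    return agent
```

Reading lazily with a generator keeps memory use flat, which matters more across the whole archive than for a single file of at most 5000 lines.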
## Structure of the archive
The data are hierarchically grouped in directories by agent, year, month, day and a 4-digit directory ID.
For example:
```shell
$ tree -L 6 data | head -n 18
data
├── crossref
│ ├── 2021
│ │ ├── 01
│ │ │ └── 01
│ │ │ └── 0000
│ │ │ └── crossref-0001.jsonl
│ │ ├── 04
│ │ │ ├── 28
│ │ │ │ └── 0000
│ │ │ │ └── crossref-0001.jsonl
│ │ │ ├── 29
│ │ │ │ └── 0000
│ │ │ │ ├── crossref-0001.jsonl
│ │ │ │ └── crossref-0002.jsonl
│ │ │ └── 30
│ │ │ └── 0000
│ │ │ └── crossref-0001.jsonl
```
Each daily directory contains one or more 4-digit directories, each of which holds JSONL data files.
A 4-digit directory holds at most 1000 files.
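The layout above can be walked with a short Python sketch like the one below. It assumes the archive has been extracted so that a `data` directory exists locally, and that the year, month and day directories are always numeric, as in the example listing. `DataFile` and `iter_data_files` are hypothetical names introduced for illustration.

```python
from pathlib import Path
from typing import Iterator, NamedTuple


class DataFile(NamedTuple):
    agent: str
    year: int
    month: int
    day: int
    dir_id: str  # the 4-digit directory ID, kept as a string ("0000")
    path: Path


def iter_data_files(root: Path) -> Iterator[DataFile]:
    """Yield every JSONL file below `root` that matches the documented layout."""
    # agent / year / month / day / 4-digit directory / file
    for path in sorted(root.glob("*/*/*/*/*/*.jsonl")):
        agent, year, month, day, dir_id = path.relative_to(root).parts[:5]
        yield DataFile(agent, int(year), int(month), int(day), dir_id, path)
```

Filtering on the yielded fields (for example, `f.agent == "wikipedia" and f.year == 2021`) selects a subset without opening any file.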
## Agents
The export contains data for the following agents:
| Agent name | Description |
| --------------- | ---------------------------------------------------------------------------------------------------------- |
| crossref | Relationships and references to datasets and DOI registration agencies other than Crossref (e.g. DataCite) |
| f1000 | Recommendations of research publications |
| facultyopinions | Recommendations of research publications (formerly F1000) |
| hypothesis | Annotations in Hypothes\.is |
| newsfeed | Discussed in blogs and media |
| reddit | Discussed on Reddit |
| reddit-links | Discussed on sites linked to in subreddits |
| stackexchange | Discussed on StackExchange sites |
| web | Discussed on selected webpages |
| wikipedia | References on Wikipedia pages |
| wordpressdotcom | Discussed on WordPress\.com sites |
## Event data structure
The main purpose of each event is to capture a relationship between a subject and an object, expressed as a triple of [subject, relationship, object].
The `event` data structure has several properties; the most important are:
| Property | Description |
| --------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------ |
| license | The license per event. Different agents may have a different license and that may also change over time. It is best to consider the license per event. |
| obj_id | The id of the object |
| subj_id | The id of the subject |
| occurred_at | The time at which the event was observed |
| id | A unique id, which can be helpful when processing the archive |
| subj:pid | Often, but not always, the same as subj_id (the two differ in the Wikipedia example in this README) |
| obj:pid | Same as obj_id |
| subj:url | The source of the event |
| obj:url | Same as obj:pid |
| subj:title | Some agents (for example Wikipedia) capture a title for the subject's url |
| subj/obj:work_type_id | The type of the identified pid |
| source_id | The agent that captured this event |
| relation_type_id | The relation type of the relationship between the subject and the object |
Note that, depending on the agent, `subj:url` may or may not equal `subj:pid`. In either case, treat `subj:url` as an independent value.
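A small Python sketch of how the triple and the `subj:url` caveat might be handled in code. It assumes an event has already been parsed into a `dict` with the property names listed in the table; `as_triple` and `subject_url_differs` are hypothetical helper names.

```python
from typing import Tuple


def as_triple(event: dict) -> Tuple[str, str, str]:
    """Return the (subject, relationship, object) triple of an event."""
    return event["subj_id"], event["relation_type_id"], event["obj_id"]


def subject_url_differs(event: dict) -> bool:
    """True when `subj:url` is present and not equal to `subj:pid`."""
    subj = event.get("subj", {})
    url = subj.get("url")
    return url is not None and url != subj.get("pid")
```

Keeping the comparison in its own helper makes it easy to measure, per agent, how often the two values diverge before deciding which one to key on.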
An example of a Crossref agent event:
```json
{
"license": "https://creativecommons.org/publicdomain/zero/1.0/",
"obj_id": "https://doi.org/10.14383/cri.2017.12.2.149",
"source_token": "36c35e23-8757-4a9d-aacf-345e9b7eb50d",
"occurred_at": "2025-01-01T11:10:07.000Z",
"subj_id": "https://doi.org/10.3390/vetsci11020090",
"id": "9d5be3e3-3141-4965-836a-8831178726b2",
"action": "add",
"subj": {
"pid": "https://doi.org/10.3390/vetsci11020090",
"url": "https://doi.org/10.3390/vetsci11020090",
"work_type_id": "journal-article"
},
"source_id": "crossref",
"obj": {
"pid": "https://doi.org/10.14383/cri.2017.12.2.149",
"url": "https://doi.org/10.14383/cri.2017.12.2.149",
"method": "doi-literal",
"verification": "literal"
},
"relation_type_id": "references"
}
```
An example of a Wikipedia agent event:
```json
{
"license": "https://creativecommons.org/publicdomain/zero/1.0/",
"obj_id": "https://doi.org/10.2307/2128863",
"source_token": "36c35e23-8757-4a9d-aacf-345e9b7eb50d",
"occurred_at": "2025-01-01T17:02:25Z",
"subj_id": "https://en.wikipedia.org/api/rest_v1/page/html/Mohammad_Reza_Pahlavi/1266654138",
"id": "6618b137-da03-4f15-ac86-ca3283c28cb6",
"action": "add",
"subj": {
"pid": "https://en.wikipedia.org/wiki/Mohammad_Reza_Pahlavi",
"url": "https://en.wikipedia.org/w/index.php?title=Mohammad_Reza_Pahlavi&oldid=1266654138",
"title": "Mohammad Reza Pahlavi",
"work_type_id": "entry-encyclopedia",
"api-url": "https://en.wikipedia.org/api/rest_v1/page/html/Mohammad_Reza_Pahlavi/1266654138"
},
"source_id": "wikipedia",
"obj": {
"pid": "https://doi.org/10.2307/2128863",
"url": "https://doi.org/10.2307/2128863",
"method": "doi-literal",
"verification": "literal"
},
"relation_type_id": "references"
}
```
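The two examples write `occurred_at` slightly differently: one carries milliseconds (`.000Z`) and the other does not (`Z`). A hedged Python sketch that normalizes both forms follows. It assumes every value is an ISO 8601 timestamp ending in `Z`, which holds for the two samples but is not guaranteed for the whole archive; `parse_occurred_at` is an illustrative name.

```python
from datetime import datetime, timezone


def parse_occurred_at(value: str) -> datetime:
    """Parse an `occurred_at` string into a timezone-aware UTC datetime."""
    # Both samples end in "Z". Swapping it for an explicit offset keeps
    # `datetime.fromisoformat` usable on Python versions before 3.11.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    # Caveat: a value without any offset would be interpreted as local
    # time by `astimezone`; neither sample has that shape.
    return datetime.fromisoformat(value).astimezone(timezone.utc)
```

Parsing to aware datetimes allows events from different agents to be compared or sorted without string-format differences getting in the way.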
## Archive statistics
### Size uncompressed
On a typical `ext4` Linux partition, the reported uncompressed size of the data directory is 1.4 terabytes.
```shell
$ du -h -d 1 data
27M data/stackexchange
202M data/reddit-links
20M data/facultyopinions
963M data/wordpressdotcom
804G data/wikipedia
351M data/web
565G data/crossref
125M data/f1000
1.5G data/newsfeed
1.5G data/hypothesis
172M data/reddit
1.4T data
```
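Given the uncompressed size, it may be preferable to process the archive without extracting it. The Python sketch below streams through the tarball and tallies non-empty JSONL lines per agent. It relies only on the documented `agent-nnnn.jsonl` filename pattern and makes no assumption about the directory prefix inside the tarball; `count_events_per_agent` is a hypothetical name, and a full pass over the archive will take a long time.

```python
import tarfile
from collections import Counter


def count_events_per_agent(archive_path: str) -> Counter:
    """Count JSONL lines per agent while streaming through the tarball."""
    counts: Counter = Counter()
    # "r|gz" reads the gzip stream sequentially, without seeking and
    # without writing extracted files to disk.
    with tarfile.open(archive_path, mode="r|gz") as tar:
        for member in tar:
            if not (member.isfile() and member.name.endswith(".jsonl")):
                continue
            filename = member.name.rsplit("/", 1)[-1]
            # `agent-nnnn.jsonl`: split on the last hyphen, since agent
            # names such as `reddit-links` contain hyphens.
            agent = filename.rpartition("-")[0]
            handle = tar.extractfile(member)
            if handle is None:
                continue
            counts[agent] += sum(1 for line in handle if line.strip())
    return counts
```

In stream mode each member has to be read before moving on to the next one, which is why the counting happens inside the loop.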
### Events per agent
| Agent | Total |
| --------------- | ----------- |
| crossref | 859,509,464 |
| f1000 | 202,406 |
| facultyopinions | 27,593 |
| hypothesis | 1,118,798 |
| newsfeed | 1,714,471 |
| reddit | 165,388 |
| reddit-links | 257,478 |
| stackexchange | 19,395 |
| web | 589,891 |
| wikipedia | 962,980,572 |
| wordpressdotcom | 1,079,650 |},
keywords= {'Crossref', 'Scholarly Metadata'},
terms= {},
license= {},
superseded= {}
}
Citation:
Crossref. (2026). Crossref Event Data Archive [Data set]. Academic Torrents. https://academictorrents.com/details/16396475b640d8487a6b723eada8a440fb33d3ce