The Representativeness of Automated Web Crawls as a Surrogate for Human Browsing
Zeber, David and Bird, Sarah and Oliveira, Camila and Rudametkin, Walter and Segall, Ilana and Wolls\'{e}n, Fredrik and Lopatka, Martin

webconf-crawl-representativeness-data (54 files)
2019-07-jestr-vs-crawl/2019_08_09_main_crawl_7a.tar 13.53GB
2019-07-jestr-vs-crawl/2019_08_09_main_crawl_7b.tar 13.17GB
2019-07-jestr-vs-crawl/simulcrawl-gcp-1.tar 5.43GB
2019-07-jestr-vs-crawl/simulcrawl-gcp-4.tar 5.77GB
2019-07-jestr-vs-crawl/simulcrawl-gcp-5.tar 4.50GB
2019-07-jestr-vs-crawl/simulcrawl-platform-1.tar 1.52GB
2019-07-jestr-vs-crawl/simulcrawl-platform-2.tar 1.92GB
2019-07-jestr-vs-crawl/simulcrawl-platform-3.tar 1.76GB
2019-07-jestr-vs-crawl/simulcrawl-platform-4.tar 1.79GB
jestr-vs-crawl-n-variability/raw_data/alexa_top_1k/openwpm_crawl_0.tar 23.61MB
jestr-vs-crawl-n-variability/raw_data/alexa_top_1k/openwpm_crawl_1.tar 251.59MB
jestr-vs-crawl-n-variability/raw_data/alexa_top_1k/openwpm_crawl_10.tar 299.17MB
jestr-vs-crawl-n-variability/raw_data/alexa_top_1k/openwpm_crawl_11.tar 355.26MB
jestr-vs-crawl-n-variability/raw_data/alexa_top_1k/openwpm_crawl_12.tar 299.66MB
jestr-vs-crawl-n-variability/raw_data/alexa_top_1k/openwpm_crawl_13.tar 300.74MB
jestr-vs-crawl-n-variability/raw_data/alexa_top_1k/openwpm_crawl_14.tar 327.89MB
jestr-vs-crawl-n-variability/raw_data/alexa_top_1k/openwpm_crawl_15.tar 308.53MB
jestr-vs-crawl-n-variability/raw_data/alexa_top_1k/openwpm_crawl_16.tar 335.82MB
jestr-vs-crawl-n-variability/raw_data/alexa_top_1k/openwpm_crawl_17.tar 284.52MB
jestr-vs-crawl-n-variability/raw_data/alexa_top_1k/openwpm_crawl_18.tar 301.50MB
jestr-vs-crawl-n-variability/raw_data/alexa_top_1k/openwpm_crawl_19.tar 306.63MB
jestr-vs-crawl-n-variability/raw_data/alexa_top_1k/openwpm_crawl_2.tar 336.61MB
jestr-vs-crawl-n-variability/raw_data/alexa_top_1k/openwpm_crawl_20.tar 301.38MB
jestr-vs-crawl-n-variability/raw_data/alexa_top_1k/openwpm_crawl_21.tar 354.32MB
jestr-vs-crawl-n-variability/raw_data/alexa_top_1k/openwpm_crawl_22.tar 318.04MB
jestr-vs-crawl-n-variability/raw_data/alexa_top_1k/openwpm_crawl_23.tar 363.54MB
jestr-vs-crawl-n-variability/raw_data/alexa_top_1k/openwpm_crawl_24.tar 339.41MB
jestr-vs-crawl-n-variability/raw_data/alexa_top_1k/openwpm_crawl_25.tar 333.54MB
jestr-vs-crawl-n-variability/raw_data/alexa_top_1k/openwpm_crawl_26.tar 358.99MB
jestr-vs-crawl-n-variability/raw_data/alexa_top_1k/openwpm_crawl_27.tar 303.99MB
jestr-vs-crawl-n-variability/raw_data/alexa_top_1k/openwpm_crawl_28.tar 338.98MB
jestr-vs-crawl-n-variability/raw_data/alexa_top_1k/openwpm_crawl_29.tar 315.46MB
jestr-vs-crawl-n-variability/raw_data/alexa_top_1k/openwpm_crawl_3.tar 239.42MB
jestr-vs-crawl-n-variability/raw_data/alexa_top_1k/openwpm_crawl_30.tar 328.95MB
jestr-vs-crawl-n-variability/raw_data/alexa_top_1k/openwpm_crawl_31.tar 318.15MB
jestr-vs-crawl-n-variability/raw_data/alexa_top_1k/openwpm_crawl_32.tar 358.10MB
jestr-vs-crawl-n-variability/raw_data/alexa_top_1k/openwpm_crawl_33.tar 331.07MB
jestr-vs-crawl-n-variability/raw_data/alexa_top_1k/openwpm_crawl_34.tar 368.16MB
jestr-vs-crawl-n-variability/raw_data/alexa_top_1k/openwpm_crawl_35.tar 318.82MB
jestr-vs-crawl-n-variability/raw_data/alexa_top_1k/openwpm_crawl_36.tar 316.05MB
jestr-vs-crawl-n-variability/raw_data/alexa_top_1k/openwpm_crawl_37.tar 386.23MB
jestr-vs-crawl-n-variability/raw_data/alexa_top_1k/openwpm_crawl_38.tar 321.72MB
jestr-vs-crawl-n-variability/raw_data/alexa_top_1k/openwpm_crawl_39.tar 325.83MB
jestr-vs-crawl-n-variability/raw_data/alexa_top_1k/openwpm_crawl_4.tar 401.73MB
jestr-vs-crawl-n-variability/raw_data/alexa_top_1k/openwpm_crawl_40.tar 362.33MB
jestr-vs-crawl-n-variability/raw_data/alexa_top_1k/openwpm_crawl_41.tar 209.45MB
jestr-vs-crawl-n-variability/raw_data/alexa_top_1k/openwpm_crawl_42.tar 344.30MB
jestr-vs-crawl-n-variability/raw_data/alexa_top_1k/openwpm_crawl_43.tar 353.58MB
jestr-vs-crawl-n-variability/raw_data/alexa_top_1k/openwpm_crawl_44.tar 340.95MB
jestr-vs-crawl-n-variability/raw_data/alexa_top_1k/openwpm_crawl_5.tar 409.84MB
jestr-vs-crawl-n-variability/raw_data/alexa_top_1k/openwpm_crawl_6.tar 306.59MB
jestr-vs-crawl-n-variability/raw_data/alexa_top_1k/openwpm_crawl_7.tar 342.01MB
jestr-vs-crawl-n-variability/raw_data/alexa_top_1k/openwpm_crawl_8.tar 321.15MB
jestr-vs-crawl-n-variability/raw_data/alexa_top_1k/openwpm_crawl_9.tar 335.81MB
Type: Dataset
Tags: Tracking, Web Crawling, Online Privacy, Browser Fingerprinting, World Wide Web

Bibtex:
@inproceedings{10.1145/3366423.3380104,
author= {Zeber, David and Bird, Sarah and Oliveira, Camila and Rudametkin, Walter and Segall, Ilana and Wolls\'{e}n, Fredrik and Lopatka, Martin},
title= {The Representativeness of Automated Web Crawls as a Surrogate for Human Browsing},
year= {2020},
isbn= {9781450370233},
publisher= {Association for Computing Machinery},
address= {New York, NY, USA},
url= {https://doi.org/10.1145/3366423.3380104},
doi= {10.1145/3366423.3380104},
booktitle= {Proceedings of The Web Conference 2020},
pages= {167--178},
numpages= {12},
keywords= {Web Crawling, Online Privacy, Tracking, Browser Fingerprinting, World Wide Web},
location= {Taipei, Taiwan},
series= {WWW ’20},
abstract= {Large-scale Web crawls have emerged as the state of the art for studying characteristics of the Web. In particular, they are a core tool for online tracking research. Web crawling is an attractive approach to data collection, as crawls can be run at relatively low infrastructure cost and don’t require handling sensitive user data such as browsing histories. However, the biases introduced by using crawls as a proxy for human browsing data have not been well studied. Crawls may fail to capture the diversity of user environments, and the snapshot view of the Web presented by one-time crawls does not reflect its constantly evolving nature, which hinders reproducibility of crawl-based studies. In this paper, we quantify the repeatability and representativeness of Web crawls in terms of common tracking and fingerprinting metrics, considering both variation across crawls and divergence from human browser usage. We quantify baseline variation of simultaneous crawls, then isolate the effects of time, cloud IP address vs. residential, and operating system. This provides a foundation to assess the agreement between crawls visiting a standard list of high-traffic websites and actual browsing behaviour measured from an opt-in sample of over 50,000 users of the Firefox Web browser. Our analysis reveals differences between the treatment of stateless crawling infrastructure and generally stateful human browsing, showing, for example, that crawlers tend to experience higher rates of third-party activity than human browser users on loading pages from the same domains.},
terms= {},
license= {Mozilla Public License 2.0},
superseded= {}
}
