LAION-400-MILLION OPEN DATASET

folder laion400m-met-release (1265 files)
laion400m-embeddings/images/img_emb_268.npy 1.02GB
laion400m-embeddings/images/img_emb_267.npy 1.02GB
laion400m-embeddings/images/img_emb_266.npy 1.02GB
laion400m-embeddings/images/img_emb_265.npy 1.02GB
laion400m-embeddings/images/img_emb_264.npy 1.02GB
laion400m-embeddings/images/img_emb_263.npy 1.02GB
laion400m-embeddings/images/img_emb_262.npy 1.02GB
laion400m-embeddings/images/img_emb_261.npy 1.02GB
laion400m-embeddings/images/img_emb_260.npy 1.02GB
laion400m-embeddings/images/img_emb_26.npy 1.02GB
laion400m-embeddings/images/img_emb_259.npy 1.02GB
laion400m-embeddings/images/img_emb_258.npy 1.02GB
laion400m-embeddings/images/img_emb_257.npy 1.02GB
laion400m-embeddings/images/img_emb_256.npy 1.02GB
laion400m-embeddings/images/img_emb_255.npy 1.02GB
laion400m-embeddings/images/img_emb_254.npy 1.02GB
laion400m-embeddings/images/img_emb_253.npy 1.02GB
laion400m-embeddings/images/img_emb_252.npy 1.02GB
laion400m-embeddings/images/img_emb_251.npy 1.02GB
laion400m-embeddings/images/img_emb_250.npy 1.02GB
laion400m-embeddings/images/img_emb_25.npy 1.02GB
laion400m-embeddings/images/img_emb_249.npy 1.02GB
laion400m-embeddings/images/img_emb_248.npy 1.02GB
laion400m-embeddings/images/img_emb_247.npy 1.02GB
laion400m-embeddings/images/img_emb_246.npy 1.02GB
laion400m-embeddings/images/img_emb_245.npy 1.02GB
laion400m-embeddings/images/img_emb_244.npy 1.02GB
laion400m-embeddings/images/img_emb_243.npy 1.02GB
laion400m-embeddings/images/img_emb_242.npy 1.02GB
laion400m-embeddings/images/img_emb_241.npy 1.02GB
laion400m-embeddings/images/img_emb_240.npy 1.02GB
laion400m-embeddings/images/img_emb_24.npy 1.02GB
laion400m-embeddings/images/img_emb_239.npy 1.02GB
laion400m-embeddings/images/img_emb_238.npy 1.02GB
laion400m-embeddings/images/img_emb_237.npy 1.02GB
laion400m-embeddings/images/img_emb_236.npy 1.02GB
laion400m-embeddings/images/img_emb_235.npy 1.02GB
laion400m-embeddings/images/img_emb_234.npy 1.02GB
laion400m-embeddings/images/img_emb_233.npy 1.02GB
laion400m-embeddings/images/img_emb_232.npy 1.02GB
laion400m-embeddings/images/img_emb_231.npy 1.02GB
laion400m-embeddings/images/img_emb_230.npy 1.02GB
laion400m-embeddings/images/img_emb_23.npy 1.02GB
laion400m-embeddings/images/img_emb_229.npy 1.02GB
laion400m-embeddings/images/img_emb_228.npy 1.02GB
laion400m-embeddings/images/img_emb_227.npy 1.02GB
laion400m-embeddings/images/img_emb_226.npy 1.02GB
laion400m-embeddings/images/img_emb_225.npy 1.02GB
laion400m-embeddings/images/img_emb_224.npy 1.02GB
Type: Dataset

Bibtex:
@article{,
title= {LAION-400-MILLION OPEN DATASET},
journal= {},
author= {},
year= {},
url= {https://laion.ai/laion-400-open-dataset/},
abstract= {LAION-400M 
 
The world’s largest openly available image-text-pair dataset with 400 million samples.

# Concept and Content
The LAION-400M dataset is completely open and freely accessible.

All images and texts in the LAION-400M dataset have been filtered with OpenAI’s CLIP by calculating the cosine similarity between the text and image embeddings and dropping those pairs with a similarity below 0.3. The threshold of 0.3 was determined through human evaluations and seems to be a good heuristic for estimating semantic image-text-content matching.
The image-text-pairs have been extracted from the Common Crawl web data dump and are from random web pages crawled between 2014 and 2021.
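
The filtering rule above can be sketched as follows. This is a minimal illustration in plain Python, not the actual LAION pipeline: the 3-dimensional vectors are toy stand-ins for real CLIP embeddings, and the names `cosine_similarity`, `keep_pair`, and `THRESHOLD` are hypothetical.

```python
import math

THRESHOLD = 0.3  # similarity cutoff stated in the text


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def keep_pair(image_emb, text_emb, threshold=THRESHOLD):
    """True when the pair survives the filter (similarity not below threshold)."""
    return cosine_similarity(image_emb, text_emb) >= threshold


# Toy stand-ins for CLIP embeddings (assumption: real ones are much longer).
pairs = [
    ("matching caption", [1.0, 0.0, 0.0], [0.9, 0.1, 0.0]),
    ("unrelated caption", [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]),
]
kept = [name for name, img, txt in pairs if keep_pair(img, txt)]
```

In practice the same comparison would be vectorized over whole embedding arrays; the loop form is used here only to keep the sketch self-contained.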

# Download Information
You can find:

- The CLIP image embeddings (NumPy files)
- The parquet files
- KNN index of image embeddings

# LAION-400M Dataset Statistics
LAION-400M, like the even bigger datasets planned for the future, is in fact a dataset of datasets. For instance, it can be filtered by image size into smaller datasets like this:
```
Number of unique samples 413M
Number with height or width >= 1024 26M                                   
Number with height and width >= 1024 9.6M                                  
Number with height or width >= 512 112M                                  
Number with height and width >= 512 67M                                 
Number with height or width >= 256 268M                                  
Number with height and width >= 256 211M
```
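
The "or" and "and" rows in the table correspond to two different size predicates. A minimal sketch of how such counts could be derived from metadata rows is shown below; the rows are made up, and the field names `WIDTH` and `HEIGHT` are assumptions about the metadata schema rather than something stated here.

```python
# Hypothetical metadata rows; WIDTH/HEIGHT field names are assumptions.
rows = [
    {"URL": "https://example.com/a.jpg", "WIDTH": 1200, "HEIGHT": 800},
    {"URL": "https://example.com/b.jpg", "WIDTH": 1024, "HEIGHT": 1024},
    {"URL": "https://example.com/c.jpg", "WIDTH": 600, "HEIGHT": 300},
    {"URL": "https://example.com/d.jpg", "WIDTH": 256, "HEIGHT": 256},
    {"URL": "https://example.com/e.jpg", "WIDTH": 100, "HEIGHT": 90},
]


def either_side_at_least(row, size):
    """'height or width >= size': at least one side is large enough."""
    return row["WIDTH"] >= size or row["HEIGHT"] >= size


def both_sides_at_least(row, size):
    """'height and width >= size': both sides are large enough."""
    return row["WIDTH"] >= size and row["HEIGHT"] >= size


# Map each size to (count for "or", count for "and").
counts = {
    size: (
        sum(either_side_at_least(r, size) for r in rows),
        sum(both_sides_at_least(r, size) for r in rows),
    )
    for size in (1024, 512, 256)
}
```

The "and" count can never exceed the "or" count for the same size, which matches the pattern in the table.
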
By using the KNN index, specialized datasets can also be extracted by domain of interest. They are (or will be) large enough to train domain-specialized models.
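
A KNN index answers the question "which samples lie closest to this query embedding?". The sketch below performs the same lookup by brute force over a handful of toy vectors, which is a deliberate simplification and not the index shipped with the dataset. The function `nearest_neighbors`, the sample ids, and the query vector are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def nearest_neighbors(query, embeddings, k):
    """Return the ids of the k embeddings most similar to the query."""
    ranked = sorted(
        embeddings.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [sample_id for sample_id, _ in ranked[:k]]


# Toy image embeddings keyed by a made-up sample id.
embeddings = {
    "cat_photo": [0.9, 0.1, 0.0],
    "kitten_photo": [0.8, 0.2, 0.1],
    "car_photo": [0.0, 0.1, 0.9],
    "truck_photo": [0.1, 0.0, 0.8],
}
domain_query = [1.0, 0.0, 0.0]  # pretend embedding of a domain query
subset = nearest_neighbors(domain_query, embeddings, k=2)
```

Brute force scales linearly with the number of samples, so at 400 million embeddings a prebuilt index is what makes this kind of domain extraction practical.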

# Disclaimer & Content Warning
Our filtering protocol removed only those NSFW images that were detected as illegal, so the dataset still contains NSFW content, which is marked accordingly in the metadata. Please use the demo links with caution. You can extract a “safe” subset by filtering out samples marked as NSFW or by applying stricter CLIP filtering.
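
Extracting such a subset amounts to a simple predicate over the metadata. In the sketch below, the field name `NSFW` and the label values `UNLIKELY`, `UNSURE`, and `NSFW` are assumptions about how the metadata encodes the marking; check the parquet schema before relying on them.

```python
# Hypothetical metadata rows; the NSFW field and its labels are assumptions.
rows = [
    {"URL": "https://example.com/1.jpg", "TEXT": "a red bicycle", "NSFW": "UNLIKELY"},
    {"URL": "https://example.com/2.jpg", "TEXT": "beach photo", "NSFW": "UNSURE"},
    {"URL": "https://example.com/3.jpg", "TEXT": "explicit image", "NSFW": "NSFW"},
]


def safe_subset(rows, drop_labels=("NSFW",)):
    """Keep only rows whose NSFW label is not in drop_labels."""
    blocked = set(drop_labels)
    return [row for row in rows if row["NSFW"] not in blocked]


# Default: drop only samples explicitly marked NSFW.
default_safe = safe_subset(rows)

# Stricter variant: also drop samples the tagger was unsure about.
strict_safe = safe_subset(rows, drop_labels=("NSFW", "UNSURE"))
```

The stricter variant trades recall for safety; which labels to drop depends on the downstream use.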

 
There is a certain degree of duplication because we used URL+text as the deduplication criterion. The same image with the same caption may sit at different URLs, causing duplicates. The same image with different captions is not, however, considered duplicated.
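
The URL+text criterion and the residual duplication it leaves behind can be illustrated with a short sketch. The rows and the `URL` and `TEXT` field names are hypothetical, and `deduplicate` is an illustrative helper, not the code used to build the dataset.

```python
def deduplicate(rows):
    """Keep the first row seen for each (URL, TEXT) pair."""
    seen = set()
    unique = []
    for row in rows:
        key = (row["URL"], row["TEXT"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


rows = [
    # Exact repeat of URL and caption: the second copy is dropped.
    {"URL": "https://example.com/cat.jpg", "TEXT": "a cat"},
    {"URL": "https://example.com/cat.jpg", "TEXT": "a cat"},
    # Same image and caption mirrored at another URL: survives as a duplicate.
    {"URL": "https://mirror.example.org/cat.jpg", "TEXT": "a cat"},
    # Same URL with a different caption: not considered a duplicate.
    {"URL": "https://example.com/cat.jpg", "TEXT": "my pet"},
]
unique_rows = deduplicate(rows)
```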
 
Using KNN clustering should make it easy to further deduplicate by image content.

# License

We are distributing the metadata dataset (the parquet files) under the most open Creative Commons license, CC-BY 4.0. It poses no particular restriction. The images are under their own copyright.},
keywords= {},
terms= {},
license= {https://creativecommons.org/licenses/by/4.0/},
superseded= {}
}

