Illinois DOC labeled faces dataset
Illinois DOC

illinois_doc_dataset (8 files)
csv/marks.csv 8.61MB
csv/person.csv 9.97MB
csv/sentencing.csv 22.74MB
front.7z 3.43GB
htmltocsv.py 13.70kB
inmates.7z 12.81MB
readme 3.19kB
side.7z 2.88GB
Type: Dataset
Tags: machine learning, Dataset, images, prisoners

Bibtex:
@article{,
title= {Illinois DOC labeled faces dataset},
journal= {},
author= {Illinois DOC},
year= {},
url= {},
abstract= {This is a dataset of prisoner mugshots and associated data (height, weight, etc.). The copyright status is public domain, since it's produced by the government, the photographs do not have sufficient artistic merit, and a mere collection of facts isn't copyrightable.  
  
The source is the Illinois Dept. of Corrections. In total, there are 68149 entries, of which a few hundred have incomplete data (one or more fields marked "Unknown", "N/A", or "").  
  
It's useful for neural network training, since it has pictures from both front and side, and they're (manually) labeled with date of birth, name (useful for clustering), weight, height, hair color, eye color, sex, race, and additional fields such as sentence duration and whether the person is required to register as a sex offender.  
  
Here is the readme file:  
  
---BEGIN README---  
Scraped from the Illinois DOC.  
  
https://www.idoc.state.il.us/subsections/search/inms_print.asp?idoc=  
https://www.idoc.state.il.us/subsections/search/pub_showfront.asp?idoc=  
https://www.idoc.state.il.us/subsections/search/pub_showside.asp?idoc=  
  
paste <(cat ids.txt | sed 's/^/http:\/\/www.idoc.state.il.us\/subsections\/search\/pub_showside.asp\?idoc\=/g') <(cat ids.txt| sed 's/^/  out=/g' | sed 's/$/.jpg/g') -d '\n' > showside.txt  
paste <(cat ids.txt | sed 's/^/http:\/\/www.idoc.state.il.us\/subsections\/search\/pub_showfront.asp\?idoc\=/g') <(cat ids.txt| sed 's/^/  out=/g' | sed 's/$/.jpg/g') -d '\n' > showfront.txt  
paste <(cat ids.txt | sed 's/^/http:\/\/www.idoc.state.il.us\/subsections\/search\/inms_print.asp\?idoc\=/g') <(cat ids.txt| sed 's/^/  out=/g' | sed 's/$/.html/g') -d '\n' > inmates_print.txt  
  
aria2c -i ../inmates_print.txt -j4 -x4 -l ../log-$(pwd|rev|cut -d/ -f 1|rev)-$(date +%s).txt  
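  
The three `paste`/`sed` pipelines above each build an aria2c input file in which every URL line is followed by an indented `out=` line naming the output file. The same transformation can be sketched in Python. This is an illustration only, not part of the dataset's tooling: the function name `build_aria2_input` is hypothetical, and the sample IDs are made up.  

```python
def build_aria2_input(ids, base_url, extension):
    """Return aria2c input-file text: a URL line, then an indented out= line, per ID."""
    lines = []
    for inmate_id in ids:
        inmate_id = inmate_id.strip()
        if not inmate_id:
            continue  # skip blank lines such as a trailing newline in ids.txt
        lines.append(base_url + inmate_id)
        lines.append("  out=" + inmate_id + extension)
    return "".join(line + "\n" for line in lines)


# Base URL copied from the sed expression used for the side-view photos.
SIDE_URL = "http://www.idoc.state.il.us/subsections/search/pub_showside.asp?idoc="

# Made-up IDs, purely for illustration.
print(build_aria2_input(["A00001", "B00002"], SIDE_URL, ".jpg"), end="")
```

In practice the IDs would be read from `ids.txt` (one per line) and the result written to `showside.txt`, `showfront.txt`, or `inmates_print.txt`.  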
  
Then use htmltocsv.py to get the csv. Note that the script is very poorly written and may have errors. It also doesn't do anything with the warrant-related info, although there are some commented-out lines which may be relevant.  
Also note that it assumes all the HTML files are located in the inmates directory, and that it overwrites any CSV files already present in the csv directory.  
  
front.7z contains mugshots from the front  
side.7z contains mugshots from the side  
inmates.7z contains all the html files  
csv contains the html files converted to CSV  
  
The reason for packaging the images is that many torrent clients would otherwise crash if attempting to load the torrent.  
  
All CSV files contain headers describing the nature of the columns. For person.csv, the id is unique. For marks.csv and sentencing.csv, it is not.  
Note that the CSV files use semicolons as delimiters and also end with a trailing semicolon. If this is unsuitable, edit the arr2csvR function in htmltocsv.py.  
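  
For readers who would rather not edit htmltocsv.py, the semicolon-delimited layout with a trailing semicolon can be handled at read time. The following is a minimal sketch using Python's standard `csv` module. It assumes that field values do not contain unquoted semicolons, which has not been verified against the data. The helper name `read_idoc_csv` and the sample rows are hypothetical; the real column names come from each file's header row.  

```python
import csv
import io


def read_idoc_csv(handle):
    """Yield one dict per row from a semicolon-delimited file with a trailing semicolon."""
    reader = csv.reader(handle, delimiter=";")
    header = next(reader)
    # The trailing semicolon yields one empty field at the end of every line.
    if header and header[-1] == "":
        header = header[:-1]
    for row in reader:
        if not row:
            continue  # skip blank lines
        # zip() stops at the shorter header, dropping the extra trailing field.
        yield dict(zip(header, row))


# Made-up sample in the described layout.
sample = io.StringIO("id;hair;sex;\nX0001;Brown;Male;\nX0002;Black;Female;\n")
for record in read_idoc_csv(sample):
    print(record)
```

For a real file, pass a handle opened with `open("csv/person.csv", newline="")`. Because the id is unique only in person.csv, rows from marks.csv and sentencing.csv would need to be grouped by id before being joined to a person.  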
  
There are 68149 inmates in total, although some (a few hundred) are marked as "Unknown"/"N/A"/"" in one or more fields.  
  
The "height" column has been processed to contain the height in inches, rather than the height in feet and inches expressed as "X ft YY in."  
Some inmates were marked "Not Available"; this has been replaced with "N/A".  
Likewise, the "weight" column has been changed from "XXX lbs." to "XXX". Again, some are marked "N/A".  
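  
The height and weight conversions described above can be sketched as follows. This is not the code from htmltocsv.py. The regular expressions are an assumption based only on the "X ft YY in." and "XXX lbs." formats quoted in this readme, and mapping every non-matching value to "N/A" is a simplification.  

```python
import re

# Patterns inferred from the formats quoted in the readme (an assumption).
HEIGHT_RE = re.compile(r"^\s*(\d+)\s*ft\.?\s*(\d+)\s*in\.?\s*$")
WEIGHT_RE = re.compile(r"^\s*(\d+)\s*lbs\.?\s*$")


def height_to_inches(raw):
    """Convert "X ft YY in." to total inches as a string, or "N/A" if it does not match."""
    match = HEIGHT_RE.match(raw)
    if match is None:
        return "N/A"
    feet, inches = int(match.group(1)), int(match.group(2))
    return str(feet * 12 + inches)


def weight_to_pounds(raw):
    """Strip the "lbs." suffix, or return "N/A" if the value does not match."""
    match = WEIGHT_RE.match(raw)
    return match.group(1) if match else "N/A"


print(height_to_inches("5 ft 09 in."))   # 69
print(weight_to_pounds("185 lbs."))      # 185
print(height_to_inches("Not Available")) # N/A
```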
  
The "date of birth" column has some inmates marked as "Not Available" and others as "". There doesn't appear to be any pattern. It may be related to the institution they are kept in. Otherwise, the format is MM/DD/YYYY.  
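  
Since the date of birth is either MM/DD/YYYY or one of two placeholder values, a parser has to check for the placeholders first. A minimal sketch follows; the function name `parse_dob` is hypothetical, and returning `None` for missing values is a design choice rather than something the dataset prescribes.  

```python
from datetime import datetime

# The two placeholder values the readme reports for this column.
MISSING_DOB = {"", "Not Available"}


def parse_dob(raw):
    """Return a datetime.date for MM/DD/YYYY input, or None for a placeholder."""
    value = raw.strip()
    if value in MISSING_DOB:
        return None
    return datetime.strptime(value, "%m/%d/%Y").date()


print(parse_dob("01/31/1980"))     # 1980-01-31
print(parse_dob("Not Available"))  # None
```

Any other malformed value raises `ValueError`, which makes unexpected formats visible instead of silently dropping them.  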
  
The "weight" column is often rounded to the nearest 5 lbs.  
  
Statistics for hair:  
  43305 Black  
  17371 Brown  
   2887 Blonde or Strawberry  
   2539 Gray or Partially Gray  
    740 Red or Auburn  
    624 Bald  
    396 Not Available  
    209 Salt and Pepper  
     70 White  
      7 Sandy  
      1 Unknown  
  
Statistics for sex:  
  63409 Male  
   4740 Female  
  
Statistics for race:  
  37991 Black  
  20992 White  
   8637 Hispanic  
    235 Asian  
    104 Amer Indian  
     94 Unknown  
     92 Bi-Racial  
      4 "" (empty)  
  
Statistics for eyes:  
  51714 Brown  
   7808 Blue  
   4259 Hazel  
   2469 Green  
   1382 Black  
    420 Not Available  
     87 Gray  
      9 Maroon  
      1 Unknown  
---END README---  
  
Here is a formal summary:  
  
---BEGIN SUMMARY---  
 Documentation:  
  
1. Title: Illinois DOC dataset  
  
2. Source Information  
   -- Creators: Illinois DOC  
     -- Illinois Department of Corrections  
        1301 Concordia Court  
        P.O. Box 19277  
        Springfield, IL 62794-9277  
        (217) 558-2200 x 2008  
   -- Donor: Anonymous  
   -- Date: 2019  
  
3. Past Usage:  
   -- None  
  
4. Relevant Information:  
   -- All CSV files contain headers describing the nature of the columns. For person.csv, the id is unique. For marks.csv and sentencing.csv, it is not.  
   -- Note that the CSV files use semicolons as delimiters and also end with a trailing semicolon. If this is unsuitable, edit the arr2csvR function in htmltocsv.py.  
   -- The "height" column has been processed to contain the height in inches, rather than the height in feet and inches expressed as "X ft YY in."  
   -- Some inmates were marked "Not Available"; this has been replaced with "N/A".  
   -- Likewise, the "weight" column has been changed from "XXX lbs." to "XXX". Again, some are marked "N/A".  
   -- The "date of birth" column has some inmates marked as "Not Available" and others as "". There doesn't appear to be any pattern. It may be related to the institution they are kept in. Otherwise, the format is MM/DD/YYYY.  
   -- The "weight" column is often rounded to the nearest 5 lbs.  
  
5. Number of Instances: 68149  
  
6. Number of Attributes: 30 (some values are missing in some instances; treat these as unknown or undefined)  
  
7. Attribute Information:  
   1. ID: Alphanumeric internal ID (string)  
   2. mark: Human-readable string describing marks and scars. May have zero, one, or multiple entries for one ID. (string)  
   3. name: First and last name in format "SURNAME, GIVEN" - upper case. Redacted in the provided copy; the script must be run to regenerate this column. (string/void)  
   4. date_of_birth: Date of birth in format MM/DD/YYYY. Some inmates are marked as "Not Available" and some inmates are marked as "". There doesn't appear to be any pattern. It may be related to the institution they are kept in. (date OR enumeration)  
   5. weight: Physical weight in pounds OR "N/A". Often rounded to 5 lb increments. It may be related to the institution they are kept in. (integer OR void)  
   6. hair: Hair color. One of ("Black", "Brown", "Blonde or Strawberry", "Gray or Partially Gray", "Red or Auburn", "Bald", "Not Available", "Salt and Pepper", "White", "Sandy", "Unknown") (enumeration)  
   7. sex: Sex. One of ("Male", "Female") (enumeration)  
   8. height: Height in inches OR "N/A". (integer OR void)  
   9. race: Race. One of ("Black", "White", "Hispanic", "Asian", "Amer Indian", "Unknown", "Bi-Racial", "") (enumeration)  
  10. eyes: Eye color. One of ("Brown", "Blue", "Hazel", "Green", "Black", "Not Available", "Gray", "Maroon", "Unknown") (enumeration)  
  11. admission_date: Date of admission in format MM/DD/YYYY. (date)  
  12. projected_parole_date: Projected parole date in format MM/DD/YYYY OR one of ("TO BE DETERMINED", "Sexually D", "3yrs---Lif", "TO BE DETERMINED BY COMMITTING COURT") OR "" (if none projected) (date OR enumeration OR void)  
  13. last_paroled_date: Last paroled date in format MM/DD/YYYY OR "" (if not paroled). (date OR void)  
  14. projected_discharge_date: Projected discharge date in format MM/DD/YYYY OR one of ("TO BE DETERMINED", "3 YRS TO LIFE - TO BE DETERMINED", "INELIGIBLE", "SEXUALLY D", "TO BE DETERMINED BY COMMITTING COURT", "PENDING", "3 YRS TO L") OR "". (date OR enumeration OR void)  
  15. parole_date: Parole date in format MM/DD/YYYY OR "". (date OR void)  
  16. electronic_detention_date: Electronic detention date in format MM/DD/YYYY OR "". (date OR void)  
  17. discharge_date: Date of discharge from institution. Always "", since discharged offenders are not included in the data set. (void)  
  18. parent_institution: Institution at which offender is kept, or "PAROLE" if parole. One of ("STATEVILLE CORRECTIONAL CENTER", "SHERIDAN CORRECTIONAL CENTER", "PINCKNEYVILLE CORRECTIONAL CENTER", "MENARD CORRECTIONAL CENTER", "LOGAN CORRECTIONAL CENTER", "ILLINOIS RIVER CORRECTIONAL CENTER", "DIXON CORRECTIONAL CENTER", "VANDALIA CORRECTIONAL CENTER", "GRAHAM CORRECTIONAL CENTER", "LAWRENCE CORRECTIONAL CENTER", "EAST MOLINE CORRECTIONAL CENTER", "SHAWNEE CORRECTIONAL CENTER", "JACKSONVILLE CORRECTIONAL CENTER", "DANVILLE CORRECTIONAL CENTER", "VIENNA CORRECTIONAL CENTER", "HILL CORRECTIONAL CENTER", "BIG MUDDY CORRECTIONAL CENTER", "CENTRALIA CORRECTIONAL CENTER", "ROBINSON CORRECTIONAL CENTER", "WESTERN ILLINOIS CORRECTIONAL CENTER", "LINCOLN CORRECTIONAL CENTER", "TAYLORVILLE CORRECTIONAL CENTER", "SOUTHWESTERN CORRECTIONAL CENTER", "PONTIAC CORRECTIONAL CENTER", "CONCORDIA", "DECATUR CORRECTIONAL CENTER", "KEWANEE LIFE SKILLS RE-ENTRY CENTER", "JOLIET TREATMENT CENTER", "PAROLE") (enumeration)  
  19. offender_status: Status of offender. One of ("CUSTODY", "PAROLE", "ABSCONDER", "RECEPTION", "WORK RELEASE CUSTODY", "TEMP RESIDENT", "NON-IDOC CUSTODY", "WRIT", "BOND", "HOME CUSTODY", "DETAINER", "MEDICAL FURLOUGH", "ESCAPE") (enumeration)  
  20. location: Location. One of ("PAROLE DISTRICT 1", "PAROLE DISTRICT 2", "PAROLE DISTRICT 3", "MENARD", "INTERSTATE COMPACT", "PINCKNEYVILLE", "LAWRENCE CORRECTIONAL CENTER", "PAROLE DISTRICT 4", "ILLINOIS RIVER", "DANVILLE", "HILL", "SHAWNEE", "DIXON", "SHERIDAN", "BIG MUDDY RIVER", "LOGAN", "PAROLE", "GRAHAM", "CENTRALIA", "EAST MOLINE", "NORTHERN RECEPTION CENTER", "VANDALIA", "ROBINSON", "STATEVILLE", "WESTERN ILLINOIS", "VIENNA", "TAYLORVILLE", "LINCOLN", "JACKSONVILLE", "PAROLE DISTRICT 5", "PONTIAC", "DIXON CORRECTIONAL CENTER", "SOUTHWESTERN ILLINOIS", "DECATUR", "", "MENARD MEDIUM SECURITY UNIT", "PONTIAC MEDIUM SECURITY", "GRAHAM R&C", "CROSSROADS CCC", "KEWANEE", "ILL/OTH STATE/FED CONCURR", "PEORIA CCC", "NORTH LAWNDALE  ADULT TRANSITI", "STATEVILLE FARM", "GREENE COUNTY WORK CAMP", "COURT", "PITTSFIELD WORK CAMP", "FOX VALLEY CCC", "BOND", "SOUTHWESTERN IL WORK CAMP", "MENARD R&C", "ELECTRONIC DETENTION", "CLAYTON WORK CAMP", "DIXON SPRINGS BOOT", "DUQUOIN IMPACT INCARCERATION P", "DETAINER", "PAROLE DISTRICTS", "FURLOUGH", "ESCAPE", "DEPT. OF HUMAN SERVICES", "FED/STATE/TRANSFER OTH ST", "WOMENS TREATMENT CENTER", "JAIL", "CONCORDIA") (enumeration)  
  21. sex_offender_registry_required: Whether the offender is required to register as a sex offender. One of ("true", "") (boolean)  
  22. alias: Aliases, separated by pipe sign OR one of ("", "None Reported") (string OR enumeration)  
  23. mittimus: Mittimus ID (string)  
  24. class: Class of offender. One of ("4", "2", "3", "X", "1", "M", "U", "A", "B", "C") (enumeration)  
  25. count: Count of offenses (?) (integer)  
  26. offense: Offense. One of 1576 values. Appears to have been keyed in by hand. (enumeration/string)  
  27. custody_date: Date at which offender was taken into custody. (date)  
  28. sentence: Duration of sentence in format "X Years Y Months Z Days", where Y and Z may exceed 12 and 31 respectively OR one of ("DEATH", "LIFE", "SDP") (int[3] OR enumeration)  
  29. county: County or "out-of-state". One of ("COOK", "WILL", "WINNEBAGO", "KANE", "DUPAGE", "MADISON", "MACON", "LAKE", "PEORIA", "ST-CLAIR", "CHAMPAIGN", "MCLEAN", "SANGAMON", "KANKAKEE", "VERMILION", "LA SALLE", "TAZEWELL", "ADAMS", "LIVINGSTON", "STEPHENSON", "MCHENRY", "COLES", "WHITESIDE", "JEFFERSON", "MARION", "KENDALL", "ROCK-ISLAND", "KNOX", "HENRY", "DEKALB", "BOONE", "JACKSON", "MONTGOMERY", "MACOUPIN", "SALINE", "FRANKLIN", "LOGAN", "ROCK ISLAND", "CHRISTIAN", "FAYETTE", "CLINTON", "MORGAN", "WILLIAMSON", "JERSEY", "WHITE", "LEE", "MASON", "PIKE", "EDGAR", "RANDOLPH", "WOODFORD", "OGLE", "EFFINGHAM", "FULTON", "GRUNDY", "BOND", "IROQUOIS", "SHELBY", "UNION", "CRAWFORD", "LAWRENCE", "BUREAU", "CLAY", "MCDONOUGH", "DEWITT", "JOHNSON", "PERRY", "WAYNE", "MASSAC", "RICHLAND", "CLARK", "CASS", "HANCOCK", "ALEXANDER", "DOUGLAS", "WABASH", "HAMILTON", "GREENE", "WARREN", "FORD", "EDWARDS", "MONROE", "WASHINGTON", "MOULTRIE", "CUMBERLAND", "MERCER", "MENARD", "CARROLL", "GALLATIN", "SCHUYLER", "JASPER", "BROWN", "CALHOUN", "PIATT", "JO-DAVIESS", "POPE", "HARDIN", "PULASKI", "MARSHALL", "HENDERSON", "ST CLAIR", "PUTNAM", "SCOTT", "STARK", "OUT-OF-STATE", "OUT OF STATE", "JO DAVIESS") OR "" (enumeration or void)  
  30. sentence_discharged: Whether the sentence has been discharged. One of ("YES", "NO") (boolean)  
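  
The sentence attribute (item 28) mixes a structured duration with three special values, so it needs a small parser before it can be used numerically. A minimal sketch follows. It assumes the literal layout "X Years Y Months Z Days" given above; the exact spelling in the files (singular forms, capitalization) has not been checked, so the pattern is deliberately tolerant. The function names are hypothetical, and the 365-day year and 30-day month in `approx_days` are rough conventions, not values defined by the dataset.  

```python
import re

SPECIAL_SENTENCES = {"DEATH", "LIFE", "SDP"}
SENTENCE_RE = re.compile(
    r"^\s*(\d+)\s+Years?\s+(\d+)\s+Months?\s+(\d+)\s+Days?\s*$", re.IGNORECASE
)


def parse_sentence(raw):
    """Return (years, months, days), a special value such as "LIFE", or None."""
    value = raw.strip()
    if value.upper() in SPECIAL_SENTENCES:
        return value.upper()
    match = SENTENCE_RE.match(value)
    if match is None:
        return None
    return tuple(int(group) for group in match.groups())


def approx_days(parts):
    """Collapse (years, months, days) to a rough day count; months may exceed 12."""
    years, months, days = parts
    return years * 365 + months * 30 + days


print(parse_sentence("3 Years 18 Months 45 Days"))  # (3, 18, 45)
print(parse_sentence("LIFE"))                       # LIFE
```

Because months and days are not normalized in the source, comparing sentences is easier on a single unit such as the approximate day count.  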
  
8. Missing Attribute Values: See values marked "void" above.  
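  
Several different placeholder strings stand in for missing values across the attributes, so it can help to map them to a single sentinel when loading rows. The sketch below is one possible convention, not part of the dataset: the marker set is assembled from the values listed in this summary, and treating "Unknown" as missing is an assumption that may not suit every analysis.  

```python
# Placeholder strings mentioned in this documentation (an assembled list, not exhaustive).
MISSING_MARKERS = {"", "N/A", "Not Available", "Unknown"}


def clean_row(row):
    """Return a copy of the row with placeholder strings replaced by None."""
    return {
        key: (None if value.strip() in MISSING_MARKERS else value)
        for key, value in row.items()
    }


# Made-up row, purely for illustration.
print(clean_row({"id": "X0001", "hair": "Not Available", "weight": "N/A", "sex": "Male"}))
```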
  
9. Class Distribution:  
  
Statistics for hair:  
  43305 Black  
  17371 Brown  
   2887 Blonde or Strawberry  
   2539 Gray or Partially Gray  
    740 Red or Auburn  
    624 Bald  
    396 Not Available  
    209 Salt and Pepper  
     70 White  
      7 Sandy  
      1 Unknown  
  
Statistics for sex:  
  63409 Male  
   4740 Female  
  
Statistics for race:  
  37991 Black  
  20992 White  
   8637 Hispanic  
    235 Asian  
    104 Amer Indian  
     94 Unknown  
     92 Bi-Racial  
      4 "" (empty)  
  
Statistics for eyes:  
  51714 Brown  
   7808 Blue  
   4259 Hazel  
   2469 Green  
   1382 Black  
    420 Not Available  
     87 Gray  
      9 Maroon  
      1 Unknown  
  
Summary Statistics:  
         median  
weight:  185  
height:  69  
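  
A median such as those above has to skip the "N/A" entries in the weight and height columns. A minimal sketch of that calculation follows, using made-up values rather than the real data; the function name `column_median` is hypothetical.  

```python
from statistics import median


def column_median(values):
    """Median of the numeric entries in a column, ignoring non-numeric placeholders."""
    numbers = [int(value) for value in values if value.strip().isdigit()]
    return median(numbers) if numbers else None


# Made-up weights, purely for illustration.
print(column_median(["180", "N/A", "190", "185"]))  # 185
```
  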
---END SUMMARY---  

Image: ![](https://i.postimg.cc/D7pbKD0g/montage-0.jpg) https://i.postimg.cc/D7pbKD0g/montage-0.jpg},
keywords= {machine learning, Dataset, images, prisoners},
terms= {},
license= {Public Domain},
superseded= {}
}

