Skip to content

mattharrison/datasets

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

46 Commits
 
 
 
 

Repository files navigation

Social Security Names

from https://www.ssa.gov/oact/babynames/

Data at data/names-ss-1910-2022.csv.zip

Ames Housing Data

http://jse.amstat.org/v19n3/decock.pdf

Data at ./data/ames-housing-dataset.zip

Weather data for those years at ../data/asos-ames-2007-2010.txt

Stack Overflow 2019 Survey (developer_survey_2019.zip 18M)

https://insights.stackoverflow.com/survey/2019

License: ODbL

Automobile Fuel Economy 1984-2020 (vehicles.csv.zip)

https://www.fueleconomy.gov/feg/download.shtml

Presidents

From https://qrc.depaul.edu/Excel_File_Listing_Pages/Excellist.asp

Pokemon

From https://www.kaggle.com/rounakbanik/pokemon (CC0: Public Domain)

Alta

1980-2019

https://www.ncdc.noaa.gov/cdo-web/datasets/GHCND/stations/GHCND:USC00420072/detail

Data at ../data/snow-alta-1990-2017.csv

Ecommerce Store Sample Transaction

../data/transaction_data.xlsx

Documentation - https://www1.ncdc.noaa.gov/pub/data/cdo/documentation/GHCND_documentation.pdf

  • STATION_NAME (max 50 characters) is the (usually city/airport name). Optional output field.
  • STATION - 17 characters) is the station identification code. Please see http://www1.ncdc.noaa.gov/pub/data/ghcn/daily/ghcnd-stations.txt
  • NAME - name of the station
  • LATITUDE
  • LONGITUDE
  • ELEVATION - meters
  • DATE - YYYY-MM-DD
  • DAPR - Number of days included in the multiday precipitation total (MDPR)
  • DAPR_ATTRIBUTES
  • DASF - Number of days included in the multiday snowfall total (MDSF)
  • DASF_ATTRIBUTES
  • MDPR - Multiday precipitation total (mm or inches as per user preference; use with DAPR and DWPR, if available)
  • MDPR_ATTRIBUTES
  • MDSF - Multiday snowfall total (mm or inches as per user preference)
  • MDSF_ATTRIBUTES
  • PRCP - Precipitation (mm or inches as per user preference, inches to hundredths on Daily Form pdf file)
  • PRCP_ATTRIBUTES
  • SNOW - Snowfall (mm or inches as per user preference, inches to tenths on Daily Form pdf file)
  • SNOW_ATTRIBUTES
  • SNWD - Snow depth (mm or inches as per user preference, inches on Daily Form pdf file)
  • SNWD_ATTRIBUTES
  • TMAX - Maximum temperature (Fahrenheit or Celsius as per user preference, Fahrenheit to tenths on Daily Form pdf file
  • TMAX_ATTRIBUTES
  • TMIN - Minimum temperature (Fahrenheit or Celsius as per user preference, Fahrenheit to tenths on Daily Form pdf file
  • TMIN_ATTRIBUTES
  • TOBS - Temperature at the time of observation (Fahrenheit or Celsius as per user preference)
  • TOBS_ATTRIBUTES
  • WT01 - Fog, ice fog, or freezing fog (may include heavy fog)
  • WT01_ATTRIBUTES
  • WT03 - Thunder
  • WT03_ATTRIBUTES
  • WT04 - Ice pellets, sleet, snow pellets, or small hail
  • WT04_ATTRIBUTES
  • WT05 - Hail (may include small hail)
  • WT05_ATTRIBUTES
  • WT06 - Glaze or rime
  • WT06_ATTRIBUTES
  • WT11 - High or damaging winds
  • WT11_ATTRIBUTES

El Nino (tao-all2.dat.gz)

https://archive.ics.uci.edu/ml/datasets/El+Nino

zonal winds (west<0, east>0), meridional winds (south<0, north>0),

# Data transformation from previous notebook
# col names in tao-all2.col from website
names = '''obs
year
month
day
date
latitude
longitude
zon.winds
mer.winds
humidity
air temp.
s.s.temp.'''.split('\n')

nino = pd.read_csv('data/tao-all2.dat.gz', sep=' ', names=names, na_values='.', 
                   parse_dates=[[1,2,3]])

def clean_cols(val):
    return val.replace('.', '_').replace(' ', '_')

nino = (nino
  .rename(columns=clean_cols)
  .assign(air_temp_F=lambda df_: df_.air_temp_ * 9/5 + 32,
        zon_winds_mph=lambda df_: df_.zon_winds*2.237,
        mer_winds_mph=lambda df_: df_.mer_winds*2.237)
  .drop(columns='obs')
)

Olympic Medals

athlete_events.csv.zip From Kaggle. Medal stats to 2016

Nvidia Data

nvda_spy.csv SPY and Nvidia data to 2024 from yfinance

Alone Data

Gemini assisted generated data from Alone TV series

  1. Demographic & Background Data
  • season: (Integer) The installment number of the show (1–12).
  • location: (String) The geographical region (e.g., "Vancouver Island," "Great Slave Lake"). This acts as a proxy for ecosystem type.
  • contestant: (String) The name of the participant.
  • age: (Integer) The age of the participant at the time of drop-off.
  • sex: (Categorical: M/F) Biological sex.
  • starting_weight_kg: (Continuous) The weight of the participant at the final medical check before drop-off.
  • profession: (String) The participant's primary career. Useful for "Professional Bias" analysis (e.g., comparing "Survival Instructor" vs. "Electrician").
  • hometown: (String) Origin city/state. Used to calculate "Home Field Advantage" if the climate matches the show location.
  • exp_years: (Integer) Self-reported years of experience in primitive or survival skills.
  1. Skill & Performance Metrics (1–5 Scale)
  • hunt_skill: (Ordinal) 1 = Basic/No experience; 5 = Professional guide/Elite bowyer. Measures the ability to track and kill game.
  • fish_skill: (Ordinal) 1 = Recreational; 5 = Commercial fisherman/Expert angler. Often the strongest predictor of longevity.
  • shelter_quality: (Ordinal) 1 = Minimalist tarp/bivouac; 3 = Insulated lean-to; 5 = Permanent log cabin with stone hearth/fireplace.
  1. Survival Events & Physiological Data
  • food_events: (Integer) The count of successful "significant" harvests (e.g., a large fish, a squirrel, or big game). Smaller foraged items like berries are usually excluded.
  • injury_count: (Integer) Total number of minor or major physical injuries recorded.
  • med_flags: (Integer) Number of official "Medical Warnings" issued by the production team during weekly checks.
  • gill_net: (Boolean: Y/N) Whether the participant successfully constructed and deployed a gill net.
  1. Environmental Stressors
  • precip_mm: (Continuous) Average monthly rainfall/snowfall for that location. High values correlate with "Psychological" tap-outs due to gear failure and wetness.
  • daylight_hrs: (Continuous) The average hours of sunlight during the final third of the stay.
  • temp_low_c: (Continuous) The average nightly low temperature during the participant's final week.
  1. Outcome Variables (Target Data)
  • tapout_reason: (Categorical) The primary reason for leaving. Common values:

  • Winner: Last person standing.

  • Medical: Forced removal for BMI, injury, or organ failure.

  • Psychological: Loneliness, "missing family," or "journey complete."

  • Starvation: Voluntary tap due to lack of calories.

  • days_lasted: (Integer) The total number of 24-hour periods spent in the wilderness. This is usually your Dependent Variable for regression analysis.

About

Datasets for ML, Analysis, etc

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors