Commit fdcea52

Merge pull request #22 from advanced-computing/hanghai/update-utils
Lab 10: BigQuery-backed Streamlit dashboard
2 parents: 02c3f66 + f374cf1

8 files changed: 1122 additions & 679 deletions

LAB_10_WRITEUP.md

Lines changed: 79 additions & 0 deletions
# Lab 10 Writeup

## BigQuery Data Loading

This project now uses BigQuery for every dataset shown in the Streamlit app.

### Dataset 1: MTA Daily Ridership

- Source: `https://data.ny.gov/resource/vxuj-8kew`
- BigQuery table: `sipa-adv-c-bouncing-penguin.mta_data.daily_ridership`
- Loading type: batch full refresh
- Why: the dataset is small, updated on a daily cadence, and easy to keep consistent by reloading the full table instead of managing row-by-row updates.

### Dataset 2: NYC COVID-19 Daily Cases

- Source: `https://data.cityofnewyork.us/resource/rc75-m7u3`
- BigQuery table: `sipa-adv-c-bouncing-penguin.mta_data.nyc_covid_cases`
- Loading type: batch full refresh
- Why: this table is also small enough for a daily refresh, and full replacement keeps the historical series in sync without extra incremental-loading logic.
### Loader Script

The repository includes `load_data_to_bq.py`, which:

1. Authenticates with Google BigQuery
2. Creates the `mta_data` dataset if it does not already exist
3. Pulls source data from both Open Data APIs
4. Cleans date and numeric fields before upload
5. Replaces the target BigQuery tables
6. Verifies each upload with row counts and date ranges

Run it with:

```bash
python load_data_to_bq.py --dataset all
```

You can also load a single table:

```bash
python load_data_to_bq.py --dataset mta
python load_data_to_bq.py --dataset covid
```
## App Changes for BigQuery

The Streamlit app no longer reads API responses directly inside page files.

- `utils.py` now provides shared BigQuery helpers for both datasets
- `streamlit_app.py` reads MTA data from BigQuery
- `pages/1_MTA_Ridership.py` reads MTA data from BigQuery
- `pages/2_Second_Dataset.py` reads COVID data from BigQuery

This keeps all pages aligned with the lab requirement that every dataset come from BigQuery.
## Performance Work

To improve load time and make performance visible:

- Each page uses a custom `display_load_time()` context manager and shows total load time in the UI
- BigQuery results are cached with Streamlit caching
- Queries select only the columns used by the app instead of `SELECT *`
- Repeated client setup is cached with a shared BigQuery client helper
- Basic data cleaning is centralized in `utils.py` so pages do less work on every rerun
- The homepage dashboard is split into lighter sections so each view renders only the charts needed for that section
- Default chart selections were reduced to fewer transit modes so the initial render sends fewer Plotly traces

These changes improve both initial and subsequent page loads, while keeping the code easier to maintain.
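The `display_load_time()` context manager itself is not part of this diff. One plausible implementation, assuming it wraps `time.perf_counter` and `st.caption` (the `render` parameter is added here for testability and may not exist in the repo's version):

```python
import time
from contextlib import contextmanager


@contextmanager
def display_load_time(label: str = "Page load time", render=None):
    # Times everything executed inside the `with` block and reports
    # the elapsed wall-clock time in the UI.
    if render is None:
        # Imported lazily so the helper can be exercised without a
        # running Streamlit app.
        import streamlit as st

        render = st.caption
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        render(f"{label}: {elapsed:.2f} s")
```

A page would wrap its body in `with display_load_time():` so the caption shows the total page load time.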
## Local Verification Steps

1. Run `python load_data_to_bq.py --dataset all`
2. Run `streamlit run streamlit_app.py`
3. Open each page and confirm the caption shows the page load time
4. Record the screen while loading the main page and both sub-pages

## Assumption

I interpreted "repeat the middle steps from Part 5" as: load the datasets into BigQuery, point the app at BigQuery tables, and document the table-level setup in the repository.

README.md

Lines changed: 19 additions & 2 deletions
@@ -25,6 +25,11 @@ This project analyzes MTA daily ridership trends in New York City to understand
 - **NYC COVID-19 Daily Cases**
   https://data.cityofnewyork.us/Health/Coronavirus-Data/rc75-m7u3
 
+Both datasets are loaded into BigQuery for the Streamlit app:
+
+- `sipa-adv-c-bouncing-penguin.mta_data.daily_ridership`
+- `sipa-adv-c-bouncing-penguin.mta_data.nyc_covid_cases`
+
 ## Repository Structure
 
 - `streamlit_app.py` - homepage and project introduction
@@ -33,7 +38,8 @@ This project analyzes MTA daily ridership trends in New York City to understand
 - `utils.py` - helper functions for cleaning and plotting
 - `validation.py` - Pandera schema validation
 - `tests/` - unit tests for utility and validation code
-- `load_data_to_bq.py` - script for loading data into BigQuery
+- `load_data_to_bq.py` - script for loading both datasets into BigQuery
+- `LAB_10_WRITEUP.md` - Lab 10 notes on data loading and performance
 
 ## Setup
 
@@ -43,7 +49,18 @@ This project analyzes MTA daily ridership trends in New York City to understand
    - Mac/Linux: `source .venv/bin/activate`
    - Windows: `.venv\Scripts\activate`
 4. Install dependencies: `pip install -r requirements.txt`
+5. Load the BigQuery tables: `python load_data_to_bq.py --dataset all`
 
 ## Usage
 
-Open `mta_ridership_project.ipynb` in Jupyter Notebook or VS Code to run the analysis.
+Run the Streamlit app locally:
+
+```bash
+streamlit run streamlit_app.py
+```
+
+You can still open `mta_ridership_project.ipynb` in Jupyter Notebook or VS Code for notebook-based exploration.
+
+## Lab 10
+
+Lab 10 documentation lives in [LAB_10_WRITEUP.md](./LAB_10_WRITEUP.md).

load_data_to_bq.py

Lines changed: 168 additions & 25 deletions
@@ -1,20 +1,102 @@
-"""Load MTA ridership data from NYC Open Data API into BigQuery."""
+"""Load project datasets from NYC Open Data into BigQuery."""
 
+import argparse
+from dataclasses import dataclass
 import sys
 
 import pandas as pd
-import pydata_google_auth
-
 import pandas_gbq
+import pydata_google_auth
+import requests
+from google.cloud import bigquery
 
 PROJECT_ID = "sipa-adv-c-bouncing-penguin"
-DATASET_TABLE = "mta_data.daily_ridership"
+DATASET_ID = "mta_data"
 
 SCOPES = [
     "https://www.googleapis.com/auth/bigquery",
 ]
 
 
+@dataclass(frozen=True)
+class DataSource:
+    name: str
+    api_url: str
+    destination_table: str
+    order_column: str
+    date_columns: tuple[str, ...]
+    numeric_columns: tuple[str, ...]
+
+
+DATA_SOURCES = {
+    "mta": DataSource(
+        name="MTA ridership",
+        api_url="https://data.ny.gov/resource/vxuj-8kew.json",
+        destination_table=f"{DATASET_ID}.daily_ridership",
+        order_column="date",
+        date_columns=("date",),
+        numeric_columns=(
+            "subways_total_estimated_ridership",
+            "subways_pct_of_comparable_pre_pandemic_day",
+            "buses_total_estimated_ridership",
+            "buses_pct_of_comparable_pre_pandemic_day",
+            "lirr_total_estimated_ridership",
+            "lirr_pct_of_comparable_pre_pandemic_day",
+            "metro_north_total_estimated_ridership",
+            "metro_north_pct_of_comparable_pre_pandemic_day",
+            "access_a_ride_total_scheduled_trips",
+            "access_a_ride_pct_of_comparable_pre_pandemic_day",
+            "bridges_and_tunnels_total_traffic",
+            "bridges_and_tunnels_pct_of_comparable_pre_pandemic_day",
+            "staten_island_railway_total_estimated_ridership",
+            "staten_island_railway_pct_of_comparable_pre_pandemic_day",
+        ),
+    ),
+    "covid": DataSource(
+        name="NYC COVID cases",
+        api_url="https://data.cityofnewyork.us/resource/rc75-m7u3.json",
+        destination_table=f"{DATASET_ID}.nyc_covid_cases",
+        order_column="date_of_interest",
+        date_columns=("date_of_interest",),
+        numeric_columns=(
+            "case_count",
+            "probable_case_count",
+            "hospitalized_count",
+            "death_count",
+            "probable_death_count",
+            "bx_case_count",
+            "bk_case_count",
+            "mn_case_count",
+            "qn_case_count",
+            "si_case_count",
+        ),
+    ),
+}
+
+MTA_RENAME_MAP = {
+    "subways_of_comparable_pre_pandemic_day": "subways_pct_of_comparable_pre_pandemic_day",
+    "buses_of_comparable_pre_pandemic_day": "buses_pct_of_comparable_pre_pandemic_day",
+    "lirr_of_comparable_pre_pandemic_day": "lirr_pct_of_comparable_pre_pandemic_day",
+    "metro_north_of_comparable_pre_pandemic_day": "metro_north_pct_of_comparable_pre_pandemic_day",
+    "bridges_and_tunnels_of_comparable_pre_pandemic_day": "bridges_and_tunnels_pct_of_comparable_pre_pandemic_day",
+    "access_a_ride_of_comparable_pre_pandemic_day": "access_a_ride_pct_of_comparable_pre_pandemic_day",
+    "staten_island_railway_of_comparable_pre_pandemic_day": "staten_island_railway_pct_of_comparable_pre_pandemic_day",
+}
+
+
+def parse_args() -> argparse.Namespace:
+    parser = argparse.ArgumentParser(
+        description="Load project datasets from NYC Open Data into BigQuery."
+    )
+    parser.add_argument(
+        "--dataset",
+        choices=("all", "mta", "covid"),
+        default="all",
+        help="Which dataset to load. Defaults to all.",
+    )
+    return parser.parse_args()
+
+
 def get_credentials():
     """Get Google credentials with browser-based auth flow."""
     print("Authenticating with Google... A browser window should open.")
@@ -27,40 +109,101 @@ def get_credentials():
     return credentials
 
 
-def fetch_mta_data() -> pd.DataFrame:
-    """Pull MTA ridership data from NYC Open Data API."""
-    print("Fetching MTA data from NYC Open Data API...")
+def ensure_dataset_exists(credentials) -> None:
+    """Create the BigQuery dataset if it does not already exist."""
+    client = bigquery.Client(project=PROJECT_ID, credentials=credentials)
+    dataset = bigquery.Dataset(f"{PROJECT_ID}.{DATASET_ID}")
+    dataset.location = "US"
+    client.create_dataset(dataset, exists_ok=True)
+
+
+def fetch_source(source: DataSource) -> pd.DataFrame:
+    """Pull a dataset from an NYC Open Data endpoint."""
+    print(f"Fetching {source.name} from {source.api_url} ...")
     sys.stdout.flush()
-    url = "https://data.ny.gov/resource/vxuj-8kew.csv?$limit=50000"
-    df = pd.read_csv(url)
-    df["date"] = pd.to_datetime(df["date"])
-    print(f"Fetched {len(df)} rows (from {df['date'].min().date()} to {df['date'].max().date()})")
-    return df
+    response = requests.get(
+        source.api_url,
+        params={"$limit": 50000, "$order": source.order_column},
+        timeout=60,
+    )
+    response.raise_for_status()
 
+    df = pd.DataFrame(response.json())
+    if df.empty:
+        raise RuntimeError(f"{source.name} returned no rows.")
 
-def main():
-    # Step 1: Authenticate
-    credentials = get_credentials()
+    if source.destination_table.endswith("daily_ridership"):
+        df = df.rename(columns=MTA_RENAME_MAP)
+
+    for column in source.date_columns:
+        if column in df.columns:
+            df[column] = pd.to_datetime(df[column])
 
-    # Step 2: Fetch data
-    df = fetch_mta_data()
+    for column in source.numeric_columns:
+        if column in df.columns:
+            df[column] = pd.to_numeric(df[column], errors="coerce")
 
-    # Step 3: Upload to BigQuery
-    print(f"Uploading to BigQuery: {PROJECT_ID}.{DATASET_TABLE} ...")
+    date_column = source.date_columns[0]
+    print(
+        "Fetched "
+        f"{len(df)} rows "
+        f"({df[date_column].min().date()} to {df[date_column].max().date()})"
+    )
+    return df
+
+
+def upload_source(df: pd.DataFrame, source: DataSource, credentials) -> None:
+    """Upload a dataframe into its destination BigQuery table."""
+    print(f"Uploading to BigQuery: {PROJECT_ID}.{source.destination_table} ...")
     sys.stdout.flush()
     pandas_gbq.to_gbq(
         df,
-        destination_table=DATASET_TABLE,
+        destination_table=source.destination_table,
         project_id=PROJECT_ID,
         if_exists="replace",
         credentials=credentials,
     )
-    print("Done! Data loaded to BigQuery successfully.")
+    print("Upload complete.")
+
+
+def verify_source(source: DataSource, credentials) -> None:
+    """Print a quick verification summary for the target table."""
+    date_column = source.date_columns[0]
+    query = f"""
+        SELECT
+            COUNT(*) AS row_count,
+            MIN(`{date_column}`) AS min_date,
+            MAX(`{date_column}`) AS max_date
+        FROM `{PROJECT_ID}.{source.destination_table}`
+    """
+    result = pandas_gbq.read_gbq(
+        query,
+        project_id=PROJECT_ID,
+        credentials=credentials,
+    )
+    row = result.iloc[0]
+    print(
+        "Verification: "
+        f"{row['row_count']} rows "
+        f"({pd.Timestamp(row['min_date']).date()} to {pd.Timestamp(row['max_date']).date()})"
+    )
+
+
+def main() -> None:
+    args = parse_args()
+    selected_keys = list(DATA_SOURCES) if args.dataset == "all" else [args.dataset]
+
+    credentials = get_credentials()
+    ensure_dataset_exists(credentials)
+
+    for key in selected_keys:
+        source = DATA_SOURCES[key]
+        df = fetch_source(source)
+        upload_source(df, source, credentials)
+        verify_source(source, credentials)
+        print("")
 
-    # Step 4: Verify
-    query = f"SELECT COUNT(*) as row_count FROM `{PROJECT_ID}.{DATASET_TABLE}`"
-    result = pandas_gbq.read_gbq(query, project_id=PROJECT_ID, credentials=credentials)
-    print(f"Verification: {result['row_count'].iloc[0]} rows in BigQuery table.")
+    print("Done! BigQuery tables are ready for the Streamlit app.")
 
 
 if __name__ == "__main__":