- Why: the dataset is small, updated on a daily cadence, and easy to keep consistent by reloading the full table instead of managing row-by-row updates.
- Why: this table is also small enough for a daily refresh, and full replacement keeps the historical series in sync without extra incremental-loading logic.
### Loader Script
The repository includes `load_data_to_bq.py`, which:
1. Authenticates with Google BigQuery
2. Creates the `mta_data` dataset if it does not already exist
3. Pulls source data from both Open Data APIs
4. Cleans date and numeric fields before upload
5. Replaces the target BigQuery tables
6. Verifies each upload with row counts and date ranges
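The cleaning step (4) can be sketched as a small pandas helper. This is a sketch only: the function name and column arguments are hypothetical, not taken from `load_data_to_bq.py`:

```python
import pandas as pd

def clean_fields(df: pd.DataFrame, date_cols, numeric_cols) -> pd.DataFrame:
    """Hypothetical sketch of the date/numeric cleanup done before upload."""
    df = df.copy()
    for col in date_cols:
        # Coerce unparseable dates to NaT instead of raising
        df[col] = pd.to_datetime(df[col], errors="coerce")
    for col in numeric_cols:
        # Coerce non-numeric strings to NaN so uploads get numeric-typed columns
        df[col] = pd.to_numeric(df[col], errors="coerce")
    return df
```

Coercing rather than raising keeps a daily full-reload job from failing on a handful of malformed source rows.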
Run it with:
```bash
python load_data_to_bq.py --dataset all
```
You can also load a single table:
```bash
python load_data_to_bq.py --dataset mta
python load_data_to_bq.py --dataset covid
```
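The `--dataset` flag seen above could be parsed with a few lines of `argparse`; this is an illustrative sketch, not the script's actual implementation:

```python
import argparse

def parse_args(argv=None):
    """Hypothetical sketch of the loader's CLI: --dataset picks which table(s) to load."""
    parser = argparse.ArgumentParser(description="Load MTA/COVID data into BigQuery")
    parser.add_argument(
        "--dataset",
        choices=["all", "mta", "covid"],
        default="all",
        help="which table(s) to refresh",
    )
    return parser.parse_args(argv)
```

Restricting the flag to `choices` makes a typo like `--dataset mtaa` fail fast with a usage message instead of silently loading nothing.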
## App Changes for BigQuery
The Streamlit app no longer reads API responses directly inside page files.
- `utils.py` now provides shared BigQuery helpers for both datasets
- `streamlit_app.py` reads MTA data from BigQuery
- `pages/1_MTA_Ridership.py` reads MTA data from BigQuery
- `pages/2_Second_Dataset.py` reads COVID data from BigQuery
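A shared helper along these lines could live in `utils.py`. The function name, table path, and column names here are illustrative assumptions, not the repository's actual code:

```python
def build_query(table: str, columns: list[str]) -> str:
    """Build a column-selective query (avoids SELECT *); names are illustrative."""
    cols = ", ".join(columns)
    return f"SELECT {cols} FROM `{table}`"

# In utils.py this would be executed through a shared BigQuery client and
# wrapped in Streamlit caching, e.g. (sketch):
#   @st.cache_data
#   def load_table(table, columns):
#       return get_client().query(build_query(table, columns)).to_dataframe()
```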
This keeps all pages aligned with the lab requirement that every dataset come from BigQuery.
## Performance Work
To improve load time and make performance visible:
- Each page uses a custom `display_load_time()` context manager and shows total load time in the UI
- BigQuery results are cached with Streamlit caching
- Queries select only the columns used by the app instead of `SELECT *`
- Repeated client setup is cached with a shared BigQuery client helper
- Basic data cleaning is centralized in `utils.py` so pages do less work on every rerun
- The homepage dashboard is split into lighter sections so each view renders only the charts needed for that section
- Default chart selections were reduced to fewer transit modes so the initial render sends fewer Plotly traces
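A minimal version of the `display_load_time()` context manager might look like this. The `render` callback is an assumption for illustration; in the app it would be something like `st.caption`:

```python
import time
from contextlib import contextmanager

@contextmanager
def display_load_time(render=print):
    """Time the wrapped block and report the elapsed wall-clock time."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Report even if the page body raises, so slow failures are still visible
        elapsed = time.perf_counter() - start
        render(f"Page loaded in {elapsed:.2f} seconds")
```

Each page would wrap its body in `with display_load_time(): ...` so the total render time appears in the UI.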
These changes improve both initial and subsequent page loads, while keeping the code easier to maintain.
## Local Verification Steps
1. Run `python load_data_to_bq.py --dataset all`
2. Run `streamlit run streamlit_app.py`
3. Open each page and confirm the caption shows the page load time
4. Record the screen while loading the main page and both sub-pages
## Assumption
I interpreted "repeat the middle steps from Part 5" as: load the datasets into BigQuery, point the app at BigQuery tables, and document the table-level setup in the repository.