
Subject tables fix #1898

Open

HarishC727 wants to merge 1 commit into datacommonsorg:master from HarishC727:stable_ref

Conversation

@HarishC727 (Contributor)

No description provided.

@gemini-code-assist (Contributor)

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers get up to speed quickly!

This pull request enhances the data processing pipeline for US Census subject tables by improving the reliability of column name transformations and data downloading. The changes focus on making the column mapping more flexible and robust, while also adding error handling for common data ingestion issues, leading to a more stable and maintainable data processing workflow.

Highlights

  • Column Name Processing Refinement: The logic for finding and replacing column names in subject tables has been significantly refactored to handle long keys containing delimiters and multiple individual token replacements more robustly. This ensures that replacements occur in the correct order and cover various matching scenarios.
  • Improved Column Ignoring Logic: The mechanism for ignoring columns during statistical variable generation has been simplified and made more effective. It now checks for the presence of any specified ignore pattern within the column header, rather than relying on a complex token count.
  • Robust Data Consolidation: Error handling has been added to the data consolidation process to gracefully manage and skip empty CSV files, preventing crashes and improving the overall stability of the data downloader.
  • Import Statement Correction: Relative import statements in the census_api_data_downloader.py script were updated to absolute imports, resolving potential module resolution issues.
Changelog
  • scripts/us_census/acs5yr/subject_tables/common/generate_col_map.py
    • Refactored the _find_and_replace_column_names method to support robust replacement of long keys and multiple individual tokens by sorting keys by length and applying both substring and token-based replacements.
    • Updated the _generate_stat_vars_from_spec method to use a more straightforward and effective substring check for ignoring columns based on patterns in ignoreColumns.
  • scripts/us_census/api_utils/census_api_data_downloader.py
    • Corrected relative import statements to use absolute paths for download_utils and status_file_utils.
    • Added a debug print statement for url_list within the download_table function.
    • Implemented a try-except block around pd.read_csv in consolidate_files to catch and log EmptyDataError for empty CSV files.
    • Added debug print statements for year_list_int and FLAGS.summary_levels in the main function.
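The empty-file guard described in the changelog can be sketched roughly as follows. Note this is an illustrative helper, not the exact code in `consolidate_files`; the function name and warning message are assumptions:

```python
import pandas as pd


def read_csv_or_skip(file_path):
    """Read one downloaded CSV, returning None for empty files.

    A zero-byte response body from the Census API yields an empty CSV;
    pd.read_csv raises pandas.errors.EmptyDataError on such a file, so
    we catch it and let the caller skip the file instead of crashing.
    """
    try:
        return pd.read_csv(file_path)
    except pd.errors.EmptyDataError:
        print(f"Warning: skipping empty file {file_path}")
        return None
```

The caller can then filter out the `None` results before concatenating the remaining frames.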

@gemini-code-assist bot left a comment

Code Review

This pull request introduces several fixes and improvements. In generate_col_map.py, the logic for renaming and ignoring columns has been updated, but the new implementations have potential issues with correctness and could lead to unexpected behavior. In census_api_data_downloader.py, while robustness is improved by handling empty CSV files, some debug print statements have been left in the code, and an import style has been changed in a way that reduces modularity. My review includes suggestions to address these points.

Comment on lines 167 to +200

```diff
     def _find_and_replace_column_names(self, column):
         """
         if spec has find_and_replace defined, this function updates column names
+        Final robust version: Handles long keys containing delimiters
+        and multiple individual token replacements.
         """
-        if 'find_and_replace' in self.features['preprocess']:
-            find_and_replace_dict = self.features['preprocess'][
-                'find_and_replace']
-            # replace entire column name
-            if column in find_and_replace_dict:
-                return find_and_replace_dict[column]
-            # replace a token in the column name
-            else:
-                # TODO: Support the find_and_replace of more than one token
-                part_list = column.split(self.delimiter)
-                for idx, part in enumerate(part_list):
-                    if part in find_and_replace_dict:
-                        part_list[idx] = find_and_replace_dict[part]
-                return self.delimiter.join(part_list)
-        return column
+        if 'find_and_replace' not in self.features.get('preprocess', {}):
+            return column
+
+        find_and_replace_dict = self.features['preprocess']['find_and_replace']
+        new_column = column
+
+        # 1. Handle Long Keys/Partial Strings (Most likely fix for your JSON)
+        # We sort by length (longest first) so we don't accidentally replace
+        # a small part of a larger key.
+        sorted_keys = sorted(find_and_replace_dict.keys(), key=len, reverse=True)
+
+        for key in sorted_keys:
+            if key in new_column:
+                new_column = new_column.replace(key, find_and_replace_dict[key])
+
+        # 2. Token-based replacement (as a backup for exact token matches)
+        # This ensures that if 'INCOME' is a key, it only matches 'INCOME'
+        # and not 'INCOMES'
+        parts = new_column.split(self.delimiter)
+        modified_tokens = False
+        for idx, part in enumerate(parts):
+            if part in find_and_replace_dict:
+                parts[idx] = find_and_replace_dict[part]
+                modified_tokens = True
+
+        if modified_tokens:
+            new_column = self.delimiter.join(parts)
+
+        return new_column
```
Severity: high

The logic in _find_and_replace_column_names is flawed. It performs two sequential replacement passes: a greedy substring replacement followed by a token-based replacement on the already modified string. This can lead to incorrect substitutions (e.g., replacing "INCOME" within "INCOMES") and chained replacements (e.g., A -> B, then B -> C). This could corrupt column names and cause downstream processing errors. A single, more robust replacement mechanism that correctly handles both multi-token keys and single tokens without these side effects should be used.
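One way to get both behaviors without the two sequential passes is a single regex substitution over the original string: keys are matched only at delimiter boundaries and tried longest-first, so partial matches and chained replacements cannot occur. This is a minimal sketch of that idea, not code from the PR; the function name and `!!` default delimiter (common in ACS headers) are assumptions:

```python
import re


def find_and_replace_once(column, mapping, delimiter="!!"):
    """Apply all replacements in one left-to-right pass over `column`."""
    if not mapping:
        return column
    d = re.escape(delimiter)
    # Try longer keys first so a multi-token key like 'TOTAL!!INCOME'
    # wins over its suffix 'INCOME'.
    alternation = "|".join(
        re.escape(key) for key in sorted(mapping, key=len, reverse=True))
    # A key may only match a whole run of tokens: preceded by the start
    # of the string or a delimiter, and followed by the end or a delimiter.
    pattern = re.compile(
        "(?:^|(?<={d}))({alt})(?=$|{d})".format(d=d, alt=alternation))
    # re.sub scans once and never rescans replacement text, so a chain
    # like A -> B followed by B -> C cannot fire on the same token.
    return pattern.sub(lambda match: mapping[match.group(1)], column)
```

For example, with `{"INCOME": "Income"}` the token `INCOMES` is left untouched, and with `{"A": "B", "B": "C"}` an input `A!!B` becomes `B!!C`, never `C!!C`.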

Comment on lines +221 to +226

```python
# Check if any string in ignoreColumns exists within the current header
is_ignored = False
for ignore_pattern in self.features['ignoreColumns']:
    if ignore_pattern in col:
        is_ignored = True
        break
```
Severity: high

The logic for ignoring columns has been changed from exact token matching to substring matching (if ignore_pattern in col:). This is a significant change that could lead to incorrect behavior. For example, if an ignore pattern is "Total", it would match columns like "Total_Households", which might not be the intention. The previous implementation was more precise by checking for exact token matches. This change could cause valid data columns to be skipped from processing. Consider reverting to a more precise matching logic, like the one used previously, perhaps with performance improvements.
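An exact-token version of the check stays simple and runs in O(tokens) per column via a set lookup. The sketch below assumes headers are delimiter-joined token strings; the function name and `!!` delimiter are illustrative:

```python
def should_ignore_column(column, ignore_tokens, delimiter="!!"):
    """Skip a column only when one of its exact tokens is listed.

    Substring matching over-triggers: the pattern 'Total' would also
    hit 'SubTotal!!Households'. Comparing whole tokens avoids that
    while still ignoring any column whose token list contains 'Total'.
    """
    tokens = set(column.split(delimiter))
    return not tokens.isdisjoint(ignore_tokens)
```

So `Total!!Households` is still ignored by the pattern `Total`, but `SubTotal!!Households` is not.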

```python
url_list = get_table_url_list(dataset, table_id, q_variable, year_list,
                              output_path, api_key, s_level_list,
                              force_fetch_config, force_fetch_data)
print(url_list)
```
Severity: high

This print() statement appears to be for debugging purposes. Similar debug prints are present on lines 458 and 461. These should be removed before merging to avoid polluting the output with potentially large amounts of data.


```python
def main(argv):
    year_list_int = list(range(FLAGS.start_year, FLAGS.end_year + 1))
    print("#########################", year_list_int)
```
Severity: high

This print() statement appears to be for debugging purposes and should be removed before merging.

```python
    print("#########################", year_list_int)
    year_list = [str(y) for y in year_list_int]
    out_path = os.path.expanduser(FLAGS.output_path)
    print("#####", FLAGS.summary_levels)
```
Severity: high

This print() statement appears to be for debugging purposes and should be removed before merging.
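Rather than deleting the diagnostics outright, the same information could be kept behind the standard `logging` module so it only appears at debug verbosity. This is a suggestion sketch, not part of the PR; the logger name and helper function are assumptions:

```python
import logging

logger = logging.getLogger("census_api_data_downloader")


def report_download_plan(url_list, year_list_int, summary_levels):
    """Emit the values previously print()ed, but only at DEBUG level."""
    logger.debug("fetched %d table URLs: %s", len(url_list), url_list)
    logger.debug("years: %s summary levels: %s", year_list_int, summary_levels)
```

Normal runs stay quiet, while `logging.basicConfig(level=logging.DEBUG)` restores the full output when needed.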

Comment on lines +37 to +39

```diff
+from download_utils import download_url_list_iterations
 from tools.download_utils.requests_wrappers import request_url_json
-from .status_file_utils import sync_status_list
+from status_file_utils import sync_status_list
```
Severity: medium

The imports for download_utils and status_file_utils have been changed from relative to absolute. While this might make the script runnable directly, it reduces modularity and makes the code dependent on sys.path manipulation. It's best practice to use relative imports for modules within the same package to maintain encapsulation.

Suggested change

```diff
-from download_utils import download_url_list_iterations
-from tools.download_utils.requests_wrappers import request_url_json
-from .status_file_utils import sync_status_list
-from status_file_utils import sync_status_list
+from .download_utils import download_url_list_iterations
+from tools.download_utils.requests_wrappers import request_url_json
+from .status_file_utils import sync_status_list
```
