
Subject tables fix #1898

Open

HarishC727 wants to merge 1 commit into datacommonsorg:master from HarishC727:stable_ref

Conversation

@HarishC727 (Contributor)

No description provided.

@gemini-code-assist (Contributor)

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers get up to speed quickly!

This pull request enhances the data processing pipeline for US Census subject tables by improving the reliability of column name transformations and data downloading. The changes focus on making the column mapping more flexible and robust, while also adding error handling for common data ingestion issues, leading to a more stable and maintainable data processing workflow.

Highlights

  • Column Name Processing Refinement: The logic for finding and replacing column names in subject tables has been significantly refactored to handle long keys containing delimiters and multiple individual token replacements more robustly. This ensures that replacements occur in the correct order and cover various matching scenarios.
  • Improved Column Ignoring Logic: The mechanism for ignoring columns during statistical variable generation has been simplified and made more effective. It now checks for the presence of any specified ignore pattern within the column header, rather than relying on a complex token count.
  • Robust Data Consolidation: Error handling has been added to the data consolidation process to gracefully manage and skip empty CSV files, preventing crashes and improving the overall stability of the data downloader.
  • Import Statement Correction: Relative import statements in the census_api_data_downloader.py script were updated to absolute imports, resolving potential module resolution issues.
Changelog
  • scripts/us_census/acs5yr/subject_tables/common/generate_col_map.py
    • Refactored the _find_and_replace_column_names method to support robust replacement of long keys and multiple individual tokens by sorting keys by length and applying both substring and token-based replacements.
    • Updated the _generate_stat_vars_from_spec method to use a more straightforward and effective substring check for ignoring columns based on patterns in ignoreColumns.
  • scripts/us_census/api_utils/census_api_data_downloader.py
    • Corrected relative import statements to use absolute paths for download_utils and status_file_utils.
    • Added a debug print statement for url_list within the download_table function.
    • Implemented a try-except block around pd.read_csv in consolidate_files to catch and log EmptyDataError for empty CSV files.
    • Added debug print statements for year_list_int and FLAGS.summary_levels in the main function.
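The empty-file guard described in the changelog can be sketched roughly as follows. Note this is an illustrative helper, not the exact code in `consolidate_files`; the function name and warning message are assumptions:

```python
import pandas as pd


def read_csv_or_skip(file_path):
    """Read one downloaded CSV, returning None for empty files.

    A zero-byte response body from the Census API yields an empty CSV;
    pd.read_csv raises pandas.errors.EmptyDataError on such a file, so
    we catch it and let the caller skip the file instead of crashing.
    """
    try:
        return pd.read_csv(file_path)
    except pd.errors.EmptyDataError:
        print(f"Warning: skipping empty file {file_path}")
        return None
```

The caller can then filter out the `None` results before concatenating the remaining frames.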

@gemini-code-assist bot left a comment

Code Review

This pull request introduces several fixes and improvements. In generate_col_map.py, the logic for renaming and ignoring columns has been updated, but the new implementations have potential issues with correctness and could lead to unexpected behavior. In census_api_data_downloader.py, while robustness is improved by handling empty CSV files, some debug print statements have been left in the code, and an import style has been changed in a way that reduces modularity. My review includes suggestions to address these points.

Comment on lines 167 to +200

```diff
     def _find_and_replace_column_names(self, column):
         """
         if spec has find_and_replace defined, this function updates column names
+        Final robust version: Handles long keys containing delimiters
+        and multiple individual token replacements.
         """
-        if 'find_and_replace' in self.features['preprocess']:
-            find_and_replace_dict = self.features['preprocess'][
-                'find_and_replace']
-            # replace entire column name
-            if column in find_and_replace_dict:
-                return find_and_replace_dict[column]
-            # replace a token in the column name
-            else:
-                # TODO: Support the find_and_replace of more than one token
-                part_list = column.split(self.delimiter)
-                for idx, part in enumerate(part_list):
-                    if part in find_and_replace_dict:
-                        part_list[idx] = find_and_replace_dict[part]
-                return self.delimiter.join(part_list)
-        return column
+        if 'find_and_replace' not in self.features.get('preprocess', {}):
+            return column
+
+        find_and_replace_dict = self.features['preprocess']['find_and_replace']
+        new_column = column
+
+        # 1. Handle Long Keys/Partial Strings (Most likely fix for your JSON)
+        # We sort by length (longest first) so we don't accidentally replace
+        # a small part of a larger key.
+        sorted_keys = sorted(find_and_replace_dict.keys(), key=len, reverse=True)
+
+        for key in sorted_keys:
+            if key in new_column:
+                new_column = new_column.replace(key, find_and_replace_dict[key])
+
+        # 2. Token-based replacement (as a backup for exact token matches)
+        # This ensures that if 'INCOME' is a key, it only matches 'INCOME'
+        # and not 'INCOMES'
+        parts = new_column.split(self.delimiter)
+        modified_tokens = False
+        for idx, part in enumerate(parts):
+            if part in find_and_replace_dict:
+                parts[idx] = find_and_replace_dict[part]
+                modified_tokens = True
+
+        if modified_tokens:
+            new_column = self.delimiter.join(parts)
+
+        return new_column
```
Severity: high

The logic in _find_and_replace_column_names is flawed. It performs two sequential replacement passes: a greedy substring replacement followed by a token-based replacement on the already modified string. This can lead to incorrect substitutions (e.g., replacing "INCOME" within "INCOMES") and chained replacements (e.g., A -> B, then B -> C). This could corrupt column names and cause downstream processing errors. A single, more robust replacement mechanism that correctly handles both multi-token keys and single tokens without these side effects should be used.
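One way to get both behaviors without the two sequential passes is a single regex substitution over the original string: keys are matched only at delimiter boundaries and tried longest-first, so partial matches and chained replacements cannot occur. This is a minimal sketch of that idea, not code from the PR; the function name and `!!` default delimiter (common in ACS headers) are assumptions:

```python
import re


def find_and_replace_once(column, mapping, delimiter="!!"):
    """Apply all replacements in one left-to-right pass over `column`."""
    if not mapping:
        return column
    d = re.escape(delimiter)
    # Try longer keys first so a multi-token key like 'TOTAL!!INCOME'
    # wins over its suffix 'INCOME'.
    alternation = "|".join(
        re.escape(key) for key in sorted(mapping, key=len, reverse=True))
    # A key may only match a whole run of tokens: preceded by the start
    # of the string or a delimiter, and followed by the end or a delimiter.
    pattern = re.compile(
        "(?:^|(?<={d}))({alt})(?=$|{d})".format(d=d, alt=alternation))
    # re.sub scans once and never rescans replacement text, so a chain
    # like A -> B followed by B -> C cannot fire on the same token.
    return pattern.sub(lambda match: mapping[match.group(1)], column)
```

For example, with `{"INCOME": "Income"}` the token `INCOMES` is left untouched, and with `{"A": "B", "B": "C"}` an input `A!!B` becomes `B!!C`, never `C!!C`.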

Comment on lines +221 to +226

```python
# Check if any string in ignoreColumns exists within the current header
is_ignored = False
for ignore_pattern in self.features['ignoreColumns']:
    if ignore_pattern in col:
        is_ignored = True
        break
```
Severity: high

The logic for ignoring columns has been changed from exact token matching to substring matching (if ignore_pattern in col:). This is a significant change that could lead to incorrect behavior. For example, if an ignore pattern is "Total", it would match columns like "Total_Households", which might not be the intention. The previous implementation was more precise by checking for exact token matches. This change could cause valid data columns to be skipped from processing. Consider reverting to a more precise matching logic, like the one used previously, perhaps with performance improvements.
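An exact-token version of the check stays simple and runs in O(tokens) per column via a set lookup. The sketch below assumes headers are delimiter-joined token strings; the function name and `!!` delimiter are illustrative:

```python
def should_ignore_column(column, ignore_tokens, delimiter="!!"):
    """Skip a column only when one of its exact tokens is listed.

    Substring matching over-triggers: the pattern 'Total' would also
    hit 'SubTotal!!Households'. Comparing whole tokens avoids that
    while still ignoring any column whose token list contains 'Total'.
    """
    tokens = set(column.split(delimiter))
    return not tokens.isdisjoint(ignore_tokens)
```

So `Total!!Households` is still ignored by the pattern `Total`, but `SubTotal!!Households` is not.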

```python
url_list = get_table_url_list(dataset, table_id, q_variable, year_list,
                              output_path, api_key, s_level_list,
                              force_fetch_config, force_fetch_data)
print(url_list)
```
Severity: high

This print() statement appears to be for debugging purposes. Similar debug prints are present on lines 458 and 461. These should be removed before merging to avoid polluting the output with potentially large amounts of data.


```python
def main(argv):
    year_list_int = list(range(FLAGS.start_year, FLAGS.end_year + 1))
    print("#########################", year_list_int)
```
Severity: high

This print() statement appears to be for debugging purposes and should be removed before merging.

```python
    print("#########################", year_list_int)
    year_list = [str(y) for y in year_list_int]
    out_path = os.path.expanduser(FLAGS.output_path)
    print("#####", FLAGS.summary_levels)
```
Severity: high

This print() statement appears to be for debugging purposes and should be removed before merging.
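Rather than deleting the diagnostics outright, the same information could be kept behind the standard `logging` module so it only appears at debug verbosity. This is a suggestion sketch, not part of the PR; the logger name and helper function are assumptions:

```python
import logging

logger = logging.getLogger("census_api_data_downloader")


def report_download_plan(url_list, year_list_int, summary_levels):
    """Emit the values previously print()ed, but only at DEBUG level."""
    logger.debug("fetched %d table URLs: %s", len(url_list), url_list)
    logger.debug("years: %s summary levels: %s", year_list_int, summary_levels)
```

Normal runs stay quiet, while `logging.basicConfig(level=logging.DEBUG)` restores the full output when needed.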

Comment on lines +37 to +39

```diff
+from download_utils import download_url_list_iterations
 from tools.download_utils.requests_wrappers import request_url_json
-from .status_file_utils import sync_status_list
+from status_file_utils import sync_status_list
```
Severity: medium

The imports for download_utils and status_file_utils have been changed from relative to absolute. While this might make the script runnable directly, it reduces modularity and makes the code dependent on sys.path manipulation. It's best practice to use relative imports for modules within the same package to maintain encapsulation.

Suggested change

```diff
-from download_utils import download_url_list_iterations
-from tools.download_utils.requests_wrappers import request_url_json
-from .status_file_utils import sync_status_list
-from status_file_utils import sync_status_list
+from .download_utils import download_url_list_iterations
+from tools.download_utils.requests_wrappers import request_url_json
+from .status_file_utils import sync_status_list
```
