Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request enhances the data processing pipeline for US Census subject tables by improving the reliability of column-name transformations and data downloading. The changes make the column mapping more flexible and robust, and add error handling for common data-ingestion issues, leading to a more stable and maintainable data processing workflow.

Highlights
Changelog
Code Review
This pull request introduces several fixes and improvements. In generate_col_map.py, the logic for renaming and ignoring columns has been updated, but the new implementations have potential issues with correctness and could lead to unexpected behavior. In census_api_data_downloader.py, while robustness is improved by handling empty CSV files, some debug print statements have been left in the code, and an import style has been changed in a way that reduces modularity. My review includes suggestions to address these points.
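The empty-CSV handling mentioned above can be illustrated with a minimal sketch (the helper name is hypothetical and the PR's actual code may differ; the common failure mode is a zero-byte API response that makes `pandas.read_csv` raise `EmptyDataError`):

```python
from io import StringIO

import pandas as pd


def read_csv_or_none(source):
    """Return a DataFrame, or None when the source has no parseable content.

    Hypothetical helper showing one way to guard a download pipeline
    against empty CSV responses instead of crashing mid-run.
    """
    try:
        return pd.read_csv(source)
    except pd.errors.EmptyDataError:
        return None


# An empty response yields None; a normal response parses as usual.
print(read_csv_or_none(StringIO("")))          # None
print(read_csv_or_none(StringIO("a,b\n1,2")))  # 1-row DataFrame
```

The caller can then skip or retry the affected URL rather than aborting the whole iteration.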
```diff
 def _find_and_replace_column_names(self, column):
     """
     if spec has find_and_replace defined, this function updates column names
+    Final robust version: Handles long keys containing delimiters
+    and multiple individual token replacements.
     """
-    if 'find_and_replace' in self.features['preprocess']:
-        find_and_replace_dict = self.features['preprocess'][
-            'find_and_replace']
-        # replace entire column name
-        if column in find_and_replace_dict:
-            return find_and_replace_dict[column]
-        # replace a token in the column name
-        else:
-            # TODO: Support the find_and_replace of more than one token
-            part_list = column.split(self.delimiter)
-            for idx, part in enumerate(part_list):
-                if part in find_and_replace_dict:
-                    part_list[idx] = find_and_replace_dict[part]
-            return self.delimiter.join(part_list)
-    return column
+    if 'find_and_replace' not in self.features.get('preprocess', {}):
+        return column
+
+    find_and_replace_dict = self.features['preprocess']['find_and_replace']
+    new_column = column
+
+    # 1. Handle Long Keys/Partial Strings (Most likely fix for your JSON)
+    # We sort by length (longest first) so we don't accidentally replace
+    # a small part of a larger key.
+    sorted_keys = sorted(find_and_replace_dict.keys(), key=len, reverse=True)
+
+    for key in sorted_keys:
+        if key in new_column:
+            new_column = new_column.replace(key, find_and_replace_dict[key])
+
+    # 2. Token-based replacement (as a backup for exact token matches)
+    # This ensures that if 'INCOME' is a key, it only matches 'INCOME'
+    # and not 'INCOMES'
+    parts = new_column.split(self.delimiter)
+    modified_tokens = False
+    for idx, part in enumerate(parts):
+        if part in find_and_replace_dict:
+            parts[idx] = find_and_replace_dict[part]
+            modified_tokens = True
+
+    if modified_tokens:
+        new_column = self.delimiter.join(parts)
+
+    return new_column
```
The logic in _find_and_replace_column_names is flawed. It performs two sequential replacement passes: a greedy substring replacement followed by a token-based replacement on the already modified string. This can lead to incorrect substitutions (e.g., replacing "INCOME" within "INCOMES") and chained replacements (e.g., A -> B, then B -> C). This could corrupt column names and cause downstream processing errors. A single, more robust replacement mechanism that correctly handles both multi-token keys and single tokens without these side effects should be used.
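One way to get both behaviors without the side effects, sketched here rather than taken from the repository, is a single regex pass: build one alternation from the keys, longest first, and anchor every match to delimiter boundaries. Because `re.sub` scans the original string once and never rescans replaced text, chained A -> B -> C substitutions cannot occur. The function name and the `'!!'` default delimiter are assumptions for illustration:

```python
import re


def find_and_replace_column(column, mapping, delimiter='!!'):
    """Single-pass find-and-replace over a delimited column name (sketch).

    Keys may span several tokens (contain the delimiter) but are matched
    only at token boundaries, so 'INCOME' never matches inside 'INCOMES'.
    """
    if not mapping:
        return column
    d = re.escape(delimiter)
    # Longest keys first so 'Estimate!!Total' wins over plain 'Total'.
    alts = '|'.join(
        re.escape(k) for k in sorted(mapping, key=len, reverse=True))
    # A key must begin at the string start or just after a delimiter,
    # and end at a delimiter or the string end.
    pattern = re.compile(rf'(?:^|(?<={d}))(?:{alts})(?={d}|$)')
    # re.sub resolves all matches in one pass over the original string,
    # so a replacement's output is never itself replaced.
    return pattern.sub(lambda m: mapping[m.group(0)], column)
```

For example, with `{'INCOME': 'Income', 'Estimate!!Total': 'Total'}`, the column `Estimate!!Total!!INCOMES!!INCOME` becomes `Total!!INCOMES!!Income`: the multi-token key is replaced, `INCOMES` is left alone, and nothing is re-replaced.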
```python
# Check if any string in ignoreColumns exists within the current header
is_ignored = False
for ignore_pattern in self.features['ignoreColumns']:
    if ignore_pattern in col:
        is_ignored = True
        break
```
The logic for ignoring columns has been changed from exact token matching to substring matching (if ignore_pattern in col:). This is a significant change that could lead to incorrect behavior. For example, if an ignore pattern is "Total", it would match columns like "Total_Households", which might not be the intention. The previous implementation was more precise by checking for exact token matches. This change could cause valid data columns to be skipped from processing. Consider reverting to a more precise matching logic, like the one used previously, perhaps with performance improvements.
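The exact-token check the reviewer describes can be sketched as follows (hypothetical helper; the `'!!'` default delimiter is an assumption based on Census column naming, not taken from the repository):

```python
def is_ignored(col, ignore_tokens, delimiter='!!'):
    """Exact token matching for ignoreColumns (sketch).

    'Total' matches the token 'Total' but not 'Total_Households',
    unlike a plain substring check.
    """
    tokens = set(col.split(delimiter))
    # Set intersection keeps the check fast while staying exact.
    return not tokens.isdisjoint(ignore_tokens)


print(is_ignored('Estimate!!Total!!Male', {'Total'}))        # True
print(is_ignored('Estimate!!Total_Households', {'Total'}))   # False
```

Splitting the header into tokens once per column preserves the precision of the previous implementation while still being a single pass over the ignore list.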
```python
url_list = get_table_url_list(dataset, table_id, q_variable, year_list,
                              output_path, api_key, s_level_list,
                              force_fetch_config, force_fetch_data)
print(url_list)
```

```python
def main(argv):
    year_list_int = list(range(FLAGS.start_year, FLAGS.end_year + 1))
    print("#########################", year_list_int)
    year_list = [str(y) for y in year_list_int]
    out_path = os.path.expanduser(FLAGS.output_path)
    print("#####", FLAGS.summary_levels)
```
```diff
-from .download_utils import download_url_list_iterations
+from download_utils import download_url_list_iterations
 from tools.download_utils.requests_wrappers import request_url_json
-from .status_file_utils import sync_status_list
+from status_file_utils import sync_status_list
```
The imports for download_utils and status_file_utils have been changed from relative to absolute. While this might make the script runnable directly, it reduces modularity and makes the code dependent on sys.path manipulation. It's best practice to use relative imports for modules within the same package to maintain encapsulation.
Suggested change:

```python
from .download_utils import download_url_list_iterations
from tools.download_utils.requests_wrappers import request_url_json
from .status_file_utils import sync_status_list
```
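When a module genuinely has to work both as part of a package (run via `python -m`) and as a bare script, a common compromise is to attempt the package-relative import and fall back to the top-level one. This is a sketch using a hypothetical helper and stdlib module names, not the PR's actual modules:

```python
import importlib


def import_with_fallback(relative_name, absolute_name, package):
    """Try a package-relative import first, then a top-level one.

    Hypothetical helper illustrating the usual compromise between
    relative imports (modularity) and absolute imports (direct runs).
    """
    try:
        return importlib.import_module(relative_name, package)
    except ImportError:
        return importlib.import_module(absolute_name)


# Demonstrated with stdlib names: '.missing' does not exist inside
# 'collections', so the absolute 'json' import is used instead.
json_mod = import_with_fallback('.missing', 'json', package='collections')
```

In plain `import` form this is the familiar `try: from .status_file_utils import ... / except ImportError: from status_file_utils import ...` pattern, which keeps the package encapsulated without requiring `sys.path` manipulation for direct script runs.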
No description provided.