
Modified scripts for better fuzzy matching #34

Merged: 5 commits merged into main on Aug 8, 2024

Conversation

@pushpak-gote-mapup (Contributor) commented Aug 6, 2024

User description

Added scripts for computing fuzzy similarity scores from the new OSM columns and neighbouring state roads.
Added feature: create bridge statistics.


PR Type

enhancement, bug fix


Description

  • Added new functions to calculate similarity scores between multiple columns and a fixed column.
  • Introduced functionality to read specific columns from a large CSV file to optimize memory usage.
  • Enhanced error handling and logging across the scripts.
  • Refactored the main processing logic to integrate new similarity calculations and data merging.
  • Improved data filtering and output generation logic to ensure accurate results.
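
The selective-column CSV read described above can be sketched with pandas' usecols parameter. This is a minimal illustration with made-up column names, not the PR's actual code; usecols makes pandas parse and keep only the listed columns, so wide columns are never materialized in memory:

```python
import io

import pandas as pd

# Hypothetical sample standing in for the large exploded OSM data CSV.
csv_text = "osm_id,name,alt_name,geometry\n1,Main St,SR-1,wkt-blob\n"

# Only 'osm_id' and 'name' are parsed; 'alt_name' and 'geometry' are skipped.
df = pd.read_csv(io.StringIO(csv_text), usecols=["osm_id", "name"])
print(list(df.columns))  # ['osm_id', 'name']
```

The same pattern applies to a file path in place of the StringIO buffer.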

Changes walkthrough 📝

Relevant files: Enhancement

calculate_match_percentage.py (+77/-18): enhanced similarity calculation and CSV reading functions
hydrography-approach/processing_scripts/associate_data/calculate_match_percentage.py

  • Added a new function, calculate_similarity, to compute similarity scores.
  • Introduced read_exploded_osm_data_csv to read specific columns from a large CSV file.
  • Modified the run function to include the new similarity calculations and data merging.
  • Enhanced error handling and logging.

get_merged_association_output.py (+153/-231): refactored and enhanced similarity calculations and data merging
merge-approaches/get_merged_association_output.py

  • Refactored the similarity calculation functions for better error handling.
  • Added calculate_similarity_for_neighbouring_roads to handle neighbouring-road similarity.
  • Updated the main function to integrate the new similarity calculations and data merging.
  • Improved the data filtering and output generation logic.

💡 PR-Agent usage: comment /help on the PR to get a list of all available PR-Agent tools and their descriptions.

github-actions bot commented Aug 6, 2024

PR Reviewer Guide 🔍

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Key issues to review

Error Handling
The error handling in the calculate_similarity and read_exploded_osm_data_csv functions uses a generic Exception catch, which might obscure the underlying issue. It is recommended to catch more specific exceptions, or at least to log the stack trace along with the error message.
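
The stack-trace logging the reviewer asks for is what logging's exc_info flag provides. A generic sketch, not code from the PR (the loader name is hypothetical):

```python
import logging

logging.basicConfig(level=logging.ERROR)
logger = logging.getLogger(__name__)


def load_csv(path: str) -> None:
    # Hypothetical loader standing in for read_exploded_osm_data_csv.
    raise FileNotFoundError(path)


caught = None
try:
    load_csv("exploded_osm_data.csv")
except FileNotFoundError as e:
    # exc_info=True attaches the full traceback to the log record,
    # so the message alone no longer hides where the failure happened.
    logger.error("Failed to read CSV: %s", e, exc_info=True)
    caught = type(e).__name__
print(caught)  # FileNotFoundError
```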

Performance Concern
The function calculate_similarity iterates over columns and performs fuzzy matching within a loop, which can be inefficient with large datasets. Consider vectorizing this operation or using more efficient data structures, such as numpy arrays or pandas methods that are optimized for performance.
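
The suggested vectorization can be sketched with np.vectorize over a null mask. Here difflib is only a stdlib stand-in for the project's fuzzy scorer, and the road names are illustrative:

```python
import difflib

import numpy as np
import pandas as pd


def ratio(a: str, b: str) -> int:
    # Stand-in scorer on the same 0-100 scale as fuzz.token_sort_ratio.
    return int(difflib.SequenceMatcher(None, a, b).ratio() * 100)


s1 = pd.Series(["main street", "oak avenue", None])
s2 = pd.Series(["main st", "oak ave", "elm rd"])

# Score only the rows where both names are present; nulls default to 0.
mask = s1.notna() & s2.notna()
scores = pd.Series(0, index=s1.index)
scores[mask] = np.vectorize(ratio)(s1[mask].astype(str), s2[mask].astype(str))
print(scores.tolist())
```

np.vectorize is a convenience loop rather than true SIMD vectorization, but it removes the per-column Python bookkeeping from the hot path.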

Exception Handling
As in the first file, the use of generic Exception handling in functions like calculate_osm_similarity and read_geopackage_to_dataframe can be improved by specifying the expected exception types or by enhancing the logging mechanism to provide more context about the errors.

github-actions bot commented Aug 6, 2024

PR Code Suggestions ✨

Best practice

Improve error handling by specifying exception types

Replace the generic exception handling with specific exceptions to improve error handling and debugging. Catching general exceptions can mask other unexpected issues.

hydrography-approach/processing_scripts/associate_data/calculate_match_percentage.py [28-30]

    -except Exception as e:
    -    print(f"Error occurred while calculating similarity: {e}")
    -    raise Exception(f"Error occurred while calculating similarity: {e}")
    +except pd.errors.EmptyDataError as e:
    +    print(f"Empty data error occurred while calculating similarity: {e}")
    +    raise
    +except ValueError as e:
    +    print(f"Value error occurred while calculating similarity: {e}")
    +    raise
     
Suggestion importance [1-10]: 9

Why: This suggestion improves error handling by specifying exception types, which enhances debugging and prevents masking other unexpected issues. It is a significant improvement in terms of best practices.
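
A quick illustration of why the suggested bare raise matters (a generic sketch, not the PR's code): re-raising preserves the original exception type and traceback, while raise Exception(...) would replace both with a generic wrapper.

```python
# Hypothetical failing step standing in for the similarity calculation.
def compute_similarity():
    raise ValueError("unexpected column dtype")


outcome = None
try:
    try:
        compute_similarity()
    except Exception:
        # A bare raise re-raises the original exception unchanged.
        raise
except ValueError as e:
    # Callers can still catch the specific original error type.
    outcome = f"{type(e).__name__}: {e}"
print(outcome)
```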

Improve exception handling by using specific exceptions

Refactor the exception handling to provide more specific error messages and avoid raising generic exceptions, which can make debugging more difficult.

merge-approaches/get_merged_association_output.py [93-94]

    -except Exception as e:
    -    print(f"Error occurred while calculating similarity for neighbouring roads: {e}")
    -    raise Exception(f"Error occurred while calculating similarity for neighbouring roads: {e}")
    +except pd.errors.EmptyDataError as e:
    +    print(f"No data available for calculating similarity: {e}")
    +    raise
    +except KeyError as e:
    +    print(f"Key error in data columns for similarity calculation: {e}")
    +    raise
     
Suggestion importance [1-10]: 9

Why: This suggestion improves error handling by specifying exception types, which enhances debugging and prevents masking other unexpected issues. It is a significant improvement in terms of best practices.
Performance

Optimize the removal of an item from a list

Use a more efficient method for removing a specific item from a list, especially when the list can be large: list.remove() is more readable and direct than filtering with a list comprehension.

hydrography-approach/processing_scripts/associate_data/calculate_match_percentage.py [80]

    -available_osm_road_names = [col for col in available_osm_road_names if col != 'osm_id']
    +available_osm_road_names.remove('osm_id')
     
Suggestion importance [1-10]: 7

Why: Using list.remove() is more readable and direct than filtering with a list comprehension. This change slightly improves code readability and performance.

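
One caveat worth noting before applying this suggestion (an illustrative snippet, not project code): the two forms are only equivalent when the item occurs exactly once.

```python
cols = ["osm_id", "name", "osm_id"]

# The comprehension drops every occurrence and never raises.
filtered = [c for c in cols if c != "osm_id"]

# list.remove() drops only the first occurrence...
removed = list(cols)
removed.remove("osm_id")

# ...and raises ValueError when the item is absent.
missing_raises = False
try:
    [].remove("osm_id")
except ValueError:
    missing_raises = True

print(filtered, removed, missing_raises)
```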

@pushpak-gote-mapup (Contributor, Author) commented Aug 6, 2024

Addressing the PR-Agent's reviews.

@pushpak-gote-mapup (Contributor, Author) commented:

Error Handling
The error handling in the calculate_similarity and read_exploded_osm_data_csv functions uses a generic Exception catch, which might obscure the underlying issue. It's recommended to catch more specific exceptions or to at least log the stack trace along with the error message.

Solved. Added specific exceptions:

    import logging
    from typing import List, Tuple

    import pandas as pd

    logger = logging.getLogger(__name__)


    def read_exploded_osm_data_csv(
        exploded_osm_data_csv: str, osm_cols_for_road_names: List[str]
    ) -> Tuple[pd.DataFrame, List[str]]:
        """
        Reads the 'exploded_osm_data' CSV file and returns a DataFrame with the required columns.

        Parameters:
            exploded_osm_data_csv (str): The path to the CSV file.
            osm_cols_for_road_names (list): The list of column names to read from the CSV file.

        Returns:
            pandas.DataFrame: The DataFrame containing the required columns.
            list: The column names that were successfully read.
        """
        series_list = []
        available_osm_road_names = []
        for col in osm_cols_for_road_names:
            try:
                # Reading one column at a time keeps peak memory low for wide CSVs.
                series = pd.read_csv(exploded_osm_data_csv, usecols=[col])
                series_list.append(series)
                available_osm_road_names.append(col)
            except ValueError as e:
                logger.warning(f"Column {col} not found in CSV: {e}")
                raise
            except pd.errors.EmptyDataError as e:
                logger.error(f"Empty data error when reading CSV: {e}", exc_info=True)
                raise
            except Exception as e:
                logger.error(f"Unexpected error when reading CSV: {e}", exc_info=True)
                raise

        try:
            exploded_osm_data_df = pd.concat(series_list, axis=1)
            return exploded_osm_data_df, available_osm_road_names
        except ValueError as e:
            logger.error(f"ValueError when concatenating series: {e}", exc_info=True)
            raise
        except Exception as e:
            logger.error(f"Unexpected error when concatenating series: {e}", exc_info=True)
            raise

Performance Concern
The function calculate_similarity iterates over columns and performs fuzzy matching within a loop, which can be inefficient with large datasets. Consider vectorizing this operation or using more efficient data structures like numpy arrays or pandas methods that are optimized for performance.

Solved. Now using a vectorized implementation:

    import logging

    import numpy as np
    import pandas as pd
    from thefuzz import fuzz  # assumption: the project may import fuzz from fuzzywuzzy or rapidfuzz instead

    logger = logging.getLogger(__name__)


    def vectorized_fuzz(s1: pd.Series, s2: pd.Series) -> pd.Series:
        try:
            # Score only the rows where both sides are non-null; the rest stay 0.
            mask = s1.notna() & s2.notna()
            result = pd.Series(0, index=s1.index)
            if mask.any():
                result[mask] = np.vectorize(fuzz.token_sort_ratio, otypes=[np.int64])(
                    s1[mask].astype(str), s2[mask].astype(str)
                )
            return result
        except ValueError as e:
            logger.error(f"ValueError in vectorized_fuzz: {str(e)}", exc_info=True)
            raise
        except TypeError as e:
            logger.error(f"TypeError in vectorized_fuzz: {str(e)}", exc_info=True)
            raise
        except Exception as e:
            logger.error(f"Unexpected error in vectorized_fuzz: {str(e)}", exc_info=True)
            raise

Exception Handling
Similar to the first file, the use of generic Exception handling in functions like calculate_osm_similarity and read_geopackage_to_dataframe can be improved by specifying the type of exceptions expected or by enhancing the logging mechanism to provide more context about the errors.

Solved. Now using specific exceptions:

    import logging

    import geopandas as gpd

    logger = logging.getLogger(__name__)


    def read_geopackage_to_dataframe(filepath: str) -> gpd.GeoDataFrame:
        """
        Read a GeoPackage file into a GeoDataFrame.

        Args:
            filepath (str): Path to the GeoPackage file.

        Returns:
            gpd.GeoDataFrame: The read GeoDataFrame.
        """
        try:
            return gpd.read_file(filepath)
        except FileNotFoundError as e:
            logger.error(f"FileNotFoundError: GeoPackage file not found at {filepath}: {str(e)}", exc_info=True)
            raise
        except PermissionError as e:
            logger.error(f"PermissionError: Unable to access GeoPackage file at {filepath}: {str(e)}", exc_info=True)
            raise
        except gpd.io.file.DriverError as e:  # DriverError originates in fiona
            logger.error(f"DriverError: Unable to read GeoPackage file at {filepath}: {str(e)}", exc_info=True)
            raise
        except Exception as e:
            logger.error(f"Unexpected error while reading GeoPackage file at {filepath}: {str(e)}", exc_info=True)
            raise

In addition, the PR code suggestions have also been implemented.

@pushpak-gote-mapup merged commit e6cbcbf into main on Aug 8, 2024. 2 checks passed.