
Modified scripts for better fuzzy matching #34

Merged: 5 commits merged into main on Aug 8, 2024

Conversation

@pushpak-gote-mapup (Contributor) commented Aug 6, 2024

User description

Added scripts for computing fuzzy similarity scores from the new OSM columns and neighbouring state roads.
Added feature: create bridge statistics.


PR Type

enhancement, bug fix


Description

  • Added new functions to calculate similarity scores between multiple columns and a fixed column.
  • Introduced functionality to read specific columns from a large CSV file to optimize memory usage.
  • Enhanced error handling and logging across the scripts.
  • Refactored the main processing logic to integrate new similarity calculations and data merging.
  • Improved data filtering and output generation logic to ensure accurate results.
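
The selective-column CSV read described above can be sketched with pandas' usecols parameter. This is a minimal illustration with made-up column names, not the PR's actual code; usecols makes pandas parse and keep only the listed columns, so wide columns are never materialized in memory:

```python
import io

import pandas as pd

# Hypothetical sample standing in for the large exploded OSM data CSV.
csv_text = "osm_id,name,alt_name,geometry\n1,Main St,SR-1,wkt-blob\n"

# Only 'osm_id' and 'name' are parsed; 'alt_name' and 'geometry' are skipped.
df = pd.read_csv(io.StringIO(csv_text), usecols=["osm_id", "name"])
print(list(df.columns))  # ['osm_id', 'name']
```

The same pattern applies to a file path in place of the StringIO buffer.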

Changes walkthrough 📝

Relevant files: Enhancement

calculate_match_percentage.py (+77/-18): enhanced similarity calculation and CSV reading functions
hydrography-approach/processing_scripts/associate_data/calculate_match_percentage.py

  • Added a new function, calculate_similarity, to compute similarity scores.
  • Introduced read_exploded_osm_data_csv to read specific columns from a large CSV file.
  • Modified the run function to include the new similarity calculations and data merging.
  • Enhanced error handling and logging.

get_merged_association_output.py (+153/-231): refactored and enhanced similarity calculations and data merging
merge-approaches/get_merged_association_output.py

  • Refactored the similarity calculation functions for better error handling.
  • Added calculate_similarity_for_neighbouring_roads to handle neighbouring-road similarity.
  • Updated the main function to integrate the new similarity calculations and data merging.
  • Improved the data filtering and output generation logic.

💡 PR-Agent usage: comment /help on the PR to get a list of all available PR-Agent tools and their descriptions.

github-actions bot commented Aug 6, 2024

PR Reviewer Guide 🔍

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Key issues to review

Error Handling
The error handling in the calculate_similarity and read_exploded_osm_data_csv functions uses a generic Exception catch, which might obscure the underlying issue. It is recommended to catch more specific exceptions, or at least to log the stack trace along with the error message.
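
The stack-trace logging the reviewer asks for is what logging's exc_info flag provides. A generic sketch, not code from the PR (the loader name is hypothetical):

```python
import logging

logging.basicConfig(level=logging.ERROR)
logger = logging.getLogger(__name__)


def load_csv(path: str) -> None:
    # Hypothetical loader standing in for read_exploded_osm_data_csv.
    raise FileNotFoundError(path)


caught = None
try:
    load_csv("exploded_osm_data.csv")
except FileNotFoundError as e:
    # exc_info=True attaches the full traceback to the log record,
    # so the message alone no longer hides where the failure happened.
    logger.error("Failed to read CSV: %s", e, exc_info=True)
    caught = type(e).__name__
print(caught)  # FileNotFoundError
```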

Performance Concern
The function calculate_similarity iterates over columns and performs fuzzy matching within a loop, which can be inefficient with large datasets. Consider vectorizing this operation or using more efficient data structures, such as numpy arrays or pandas methods that are optimized for performance.
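
The suggested vectorization can be sketched with np.vectorize over a null mask. Here difflib is only a stdlib stand-in for the project's fuzzy scorer, and the road names are illustrative:

```python
import difflib

import numpy as np
import pandas as pd


def ratio(a: str, b: str) -> int:
    # Stand-in scorer on the same 0-100 scale as fuzz.token_sort_ratio.
    return int(difflib.SequenceMatcher(None, a, b).ratio() * 100)


s1 = pd.Series(["main street", "oak avenue", None])
s2 = pd.Series(["main st", "oak ave", "elm rd"])

# Score only the rows where both names are present; nulls default to 0.
mask = s1.notna() & s2.notna()
scores = pd.Series(0, index=s1.index)
scores[mask] = np.vectorize(ratio)(s1[mask].astype(str), s2[mask].astype(str))
print(scores.tolist())
```

np.vectorize is a convenience loop rather than true SIMD vectorization, but it removes the per-column Python bookkeeping from the hot path.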

Exception Handling
As in the first file, the use of generic Exception handling in functions like calculate_osm_similarity and read_geopackage_to_dataframe can be improved by specifying the expected exception types or by enhancing the logging mechanism to provide more context about the errors.

github-actions bot commented Aug 6, 2024

PR Code Suggestions ✨

Best practice

Improve error handling by specifying exception types

Replace the generic exception handling with specific exceptions to improve error handling and debugging. Catching general exceptions can mask other unexpected issues.

hydrography-approach/processing_scripts/associate_data/calculate_match_percentage.py [28-30]

    -except Exception as e:
    -    print(f"Error occurred while calculating similarity: {e}")
    -    raise Exception(f"Error occurred while calculating similarity: {e}")
    +except pd.errors.EmptyDataError as e:
    +    print(f"Empty data error occurred while calculating similarity: {e}")
    +    raise
    +except ValueError as e:
    +    print(f"Value error occurred while calculating similarity: {e}")
    +    raise
     
Suggestion importance [1-10]: 9

Why: This suggestion improves error handling by specifying exception types, which enhances debugging and prevents masking other unexpected issues. It is a significant improvement in terms of best practices.
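
A quick illustration of why the suggested bare raise matters (a generic sketch, not the PR's code): re-raising preserves the original exception type and traceback, while raise Exception(...) would replace both with a generic wrapper.

```python
# Hypothetical failing step standing in for the similarity calculation.
def compute_similarity():
    raise ValueError("unexpected column dtype")


outcome = None
try:
    try:
        compute_similarity()
    except Exception:
        # A bare raise re-raises the original exception unchanged.
        raise
except ValueError as e:
    # Callers can still catch the specific original error type.
    outcome = f"{type(e).__name__}: {e}"
print(outcome)
```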

Improve exception handling by using specific exceptions

Refactor the exception handling to provide more specific error messages and avoid raising generic exceptions, which can make debugging more difficult.

merge-approaches/get_merged_association_output.py [93-94]

    -except Exception as e:
    -    print(f"Error occurred while calculating similarity for neighbouring roads: {e}")
    -    raise Exception(f"Error occurred while calculating similarity for neighbouring roads: {e}")
    +except pd.errors.EmptyDataError as e:
    +    print(f"No data available for calculating similarity: {e}")
    +    raise
    +except KeyError as e:
    +    print(f"Key error in data columns for similarity calculation: {e}")
    +    raise
     
Suggestion importance [1-10]: 9

Why: This suggestion improves error handling by specifying exception types, which enhances debugging and prevents masking other unexpected issues. It is a significant improvement in terms of best practices.
Performance

Optimize the removal of an item from a list

Use a more efficient method for removing a specific item from a list, especially when the list can be large: list.remove() is more readable and direct than filtering with a list comprehension.

hydrography-approach/processing_scripts/associate_data/calculate_match_percentage.py [80]

    -available_osm_road_names = [col for col in available_osm_road_names if col != 'osm_id']
    +available_osm_road_names.remove('osm_id')
     
Suggestion importance [1-10]: 7

Why: Using list.remove() is more readable and direct than filtering with a list comprehension. This change slightly improves code readability and performance.

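
One caveat worth noting before applying this suggestion (an illustrative snippet, not project code): the two forms are only equivalent when the item occurs exactly once.

```python
cols = ["osm_id", "name", "osm_id"]

# The comprehension drops every occurrence and never raises.
filtered = [c for c in cols if c != "osm_id"]

# list.remove() drops only the first occurrence...
removed = list(cols)
removed.remove("osm_id")

# ...and raises ValueError when the item is absent.
missing_raises = False
try:
    [].remove("osm_id")
except ValueError:
    missing_raises = True

print(filtered, removed, missing_raises)
```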

@pushpak-gote-mapup (Contributor, Author) commented Aug 6, 2024

Addressing the PR-Agent's reviews.

@pushpak-gote-mapup (Contributor, Author) commented:

Error Handling
The error handling in the calculate_similarity and read_exploded_osm_data_csv functions uses a generic Exception catch, which might obscure the underlying issue. It's recommended to catch more specific exceptions or to at least log the stack trace along with the error message.

Solved. Added specific exceptions:

    import logging
    from typing import List, Tuple

    import pandas as pd

    logger = logging.getLogger(__name__)


    def read_exploded_osm_data_csv(
        exploded_osm_data_csv: str, osm_cols_for_road_names: List[str]
    ) -> Tuple[pd.DataFrame, List[str]]:
        """
        Reads the 'exploded_osm_data' CSV file and returns a DataFrame with the required columns.

        Parameters:
            exploded_osm_data_csv (str): The path to the CSV file.
            osm_cols_for_road_names (list): The list of column names to read from the CSV file.

        Returns:
            pandas.DataFrame: The DataFrame containing the required columns.
            list: The column names that were successfully read.
        """
        series_list = []
        available_osm_road_names = []
        for col in osm_cols_for_road_names:
            try:
                # Reading one column at a time keeps peak memory low for wide CSVs.
                series = pd.read_csv(exploded_osm_data_csv, usecols=[col])
                series_list.append(series)
                available_osm_road_names.append(col)
            except ValueError as e:
                logger.warning(f"Column {col} not found in CSV: {e}")
                raise
            except pd.errors.EmptyDataError as e:
                logger.error(f"Empty data error when reading CSV: {e}", exc_info=True)
                raise
            except Exception as e:
                logger.error(f"Unexpected error when reading CSV: {e}", exc_info=True)
                raise

        try:
            exploded_osm_data_df = pd.concat(series_list, axis=1)
            return exploded_osm_data_df, available_osm_road_names
        except ValueError as e:
            logger.error(f"ValueError when concatenating series: {e}", exc_info=True)
            raise
        except Exception as e:
            logger.error(f"Unexpected error when concatenating series: {e}", exc_info=True)
            raise

Performance Concern
The function calculate_similarity iterates over columns and performs fuzzy matching within a loop, which can be inefficient with large datasets. Consider vectorizing this operation or using more efficient data structures like numpy arrays or pandas methods that are optimized for performance.

Solved. Now using a vectorized implementation:

    import logging

    import numpy as np
    import pandas as pd
    from thefuzz import fuzz  # assumption: the project may import fuzz from fuzzywuzzy or rapidfuzz instead

    logger = logging.getLogger(__name__)


    def vectorized_fuzz(s1: pd.Series, s2: pd.Series) -> pd.Series:
        try:
            # Score only the rows where both sides are non-null; the rest stay 0.
            mask = s1.notna() & s2.notna()
            result = pd.Series(0, index=s1.index)
            if mask.any():
                result[mask] = np.vectorize(fuzz.token_sort_ratio, otypes=[np.int64])(
                    s1[mask].astype(str), s2[mask].astype(str)
                )
            return result
        except ValueError as e:
            logger.error(f"ValueError in vectorized_fuzz: {str(e)}", exc_info=True)
            raise
        except TypeError as e:
            logger.error(f"TypeError in vectorized_fuzz: {str(e)}", exc_info=True)
            raise
        except Exception as e:
            logger.error(f"Unexpected error in vectorized_fuzz: {str(e)}", exc_info=True)
            raise

Exception Handling
Similar to the first file, the use of generic Exception handling in functions like calculate_osm_similarity and read_geopackage_to_dataframe can be improved by specifying the type of exceptions expected or by enhancing the logging mechanism to provide more context about the errors.

Solved. Now using specific exceptions:

    import logging

    import geopandas as gpd

    logger = logging.getLogger(__name__)


    def read_geopackage_to_dataframe(filepath: str) -> gpd.GeoDataFrame:
        """
        Read a GeoPackage file into a GeoDataFrame.

        Args:
            filepath (str): Path to the GeoPackage file.

        Returns:
            gpd.GeoDataFrame: The read GeoDataFrame.
        """
        try:
            return gpd.read_file(filepath)
        except FileNotFoundError as e:
            logger.error(f"FileNotFoundError: GeoPackage file not found at {filepath}: {str(e)}", exc_info=True)
            raise
        except PermissionError as e:
            logger.error(f"PermissionError: Unable to access GeoPackage file at {filepath}: {str(e)}", exc_info=True)
            raise
        except gpd.io.file.DriverError as e:  # DriverError originates in fiona
            logger.error(f"DriverError: Unable to read GeoPackage file at {filepath}: {str(e)}", exc_info=True)
            raise
        except Exception as e:
            logger.error(f"Unexpected error while reading GeoPackage file at {filepath}: {str(e)}", exc_info=True)
            raise

In addition, the PR code suggestions have also been implemented.

@pushpak-gote-mapup merged commit e6cbcbf into main on Aug 8, 2024. 2 checks passed.