ISBN search imports DVDs from Amazon.com #9879

stopregionblocking · 2024-09-15T18:22:07Z

Problem

The feature which allows Amazon.com records to be imported into OpenLibrary via searching for ISBNs will, unfortunately, import DVDs with ISBNs from Amazon.com, even if they are clearly indicated as not being books.

This was confirmed using the example of https://www.amazon.com/gp/product/1621064298

Reproducing the bug

Search OpenLibrary for the ISBN of a DVD available on Amazon.com, but not included on OpenLibrary.
Search for the same ISBN again (to attempt an import of the record from Amazon.com).

Expected behavior: Nothing is imported, because there is no book with that ISBN.
Actual behavior: A DVD is imported, because it has an ISBN.

Context

Browser (Chrome, Safari, Firefox, etc):
OS (Windows, Mac, etc):
Logged in (Y/N): Y
Environment (prod, dev, local): prod

Breakdown

The solution will likely involve some minor modifications to https://github.com/internetarchive/openlibrary/blob/master/openlibrary/core/vendors.py so that DVDs don't return values to get_products().

For the above DVD/item, get_products() returns: [{'url': 'https://www.amazon.com/dp/1621064298/?tag=internetarchi-20', 'source_records': ['amazon:1621064298'], 'isbn_10': ['1621064298'], 'isbn_13': ['9781621064299'], 'price': '$15.19', 'price_amt': 1519, 'title': 'Homeland Insecurity: Films by Bill Brown', 'cover': 'https://m.media-amazon.com/images/I/41FuCUj3kUL._SL500_.jpg', 'authors': [{'name': 'Brown, Bill'}], 'publishers': ['Microcosm Publishing'], 'number_of_pages': None, 'edition_num': None, 'publish_date': 'Aug 01, 2007', 'product_group': 'DVD', 'physical_format': 'dvd'}] .

Of interest is product_group and physical_format. To complete this issue one would likely want to look at https://webservices.amazon.com/paapi5/documentation/ and determine why we should use product_group, physical_format, both, either, or something else to determine something is a DVD. Or maybe it's better to focus on what is allowed (e.g. books).

In any event, we'll likely want to want to modify serialize() or get_product() to filter out DVDs (or only allow whatever constitutes books, if the cases are clear).

Requirements Checklist

Read through https://webservices.amazon.com/paapi5/documentation/ (or other relevant documentation) and make a case for why we should filter based on product_group, physical_format, both, either, or something else to determine something is a DVD.
Filter using the above criteria (e.g. physical_format) such that get_product() or serialize() will no longer return metadata for DVDS (e.g. if modifying serialize(), it should return {} -- serialize() may be the better option).
Update the unit tests to include a case of a DVD being excluded and a book included in the results.

Related files

https://github.com/internetarchive/openlibrary/blob/master/openlibrary/core/vendors.py

openlibrary/openlibrary/core/vendors.py

Lines 127 to 167 in d3bb158

    
               def get_products( 
        
                   self, 
        
                   asins: list | str, 
        
                   serialize: bool = False, 
        
                   marketplace: str = 'www.amazon.com', 
        
                   resources: Any | None = None, 
        
                   **kwargs, 
        
               ) -> list | None: 
        
                   """ 
        
                   :param str asins: One or more ItemIds like ASIN that uniquely identify an item 
        
                   or product URL. (Max 10) Separated by comma or as a list. 
        
                   """ 
        
                   # Wait before doing the request 
        
                   wait_time = 1 / self.throttling - (time.time() - self.last_query_time) 
        
                   if wait_time > 0: 
        
                       time.sleep(wait_time) 
        
                   self.last_query_time = time.time() 
        
                   item_ids = asins if isinstance(asins, list) else [asins] 
        
                   _resources = self.RESOURCES[resources or 'import'] 
        
                   try: 
        
                       request = GetItemsRequest( 
        
                           partner_tag=self.tag, 
        
                           partner_type=PartnerType.ASSOCIATES, 
        
                           marketplace=marketplace, 
        
                           item_ids=item_ids, 
        
                           resources=_resources, 
        
                           **kwargs, 
        
                       ) 
        
                   except ApiException: 
        
                       logger.error( 
        
                           f"Amazon fetch failed for: {', '.join(item_ids)}", exc_info=True 
        
                       ) 
        
                       return None 
        
                   response = self.api.get_items(request) 
        
                   products = ( 
        
                       [p for p in response.items_result.items if p] 
        
                       if response.items_result 
        
                       else [] 
        
                   ) 
        
                   return products if not serialize else [self.serialize(p) for p in products]

openlibrary/openlibrary/core/vendors.py

Lines 123 to 125 in d3bb158

    
           def get_product(self, asin: str, serialize: bool = False, **kwargs): 
        
               if products := self.get_products([asin], **kwargs): 
        
                   return next(self.serialize(p) if serialize else p for p in products)

openlibrary/openlibrary/core/vendors.py

Lines 170 to 294 in d3bb158

    
               def serialize(product: Any) -> dict: 
        
                   """Takes a full Amazon product Advertising API returned AmazonProduct 
        
                   with multiple ResponseGroups, and extracts the data we are 
        
                   interested in. 
        
                   :param AmazonAPI product: 
        
                   :return: Amazon metadata for one product 
        
                   { 
        
                     'price': '$54.06', 
        
                     'price_amt': 5406, 
        
                     'physical_format': 'hardcover', 
        
                     'authors': [{'name': 'Guterson, David'}], 
        
                     'publish_date': 'Jan 21, 2020', 
        
                     #'dimensions': { 
        
                     #  'width': [1.7, 'Inches'], 
        
                     #  'length': [8.5, 'Inches'], 
        
                     #  'weight': [5.4, 'Pounds'], 
        
                     #  'height': [10.875, 'Inches'] 
        
                     # }, 
        
                     'publishers': ['Victory Belt Publishing'], 
        
                     'source_records': ['amazon:1628603976'], 
        
                     'title': 'Boundless: Upgrade Your Brain, Optimize Your Body & Defy Aging', 
        
                     'url': 'https://www.amazon.com/dp/1628603976/?tag=internetarchi-20', 
        
                     'number_of_pages': 640, 
        
                     'cover': 'https://m.media-amazon.com/images/I/51IT9MV3KqL._AC_.jpg', 
        
                     'languages': ['English'] 
        
                     'edition_num': '1' 
        
                   } 
        
                   """ 
        
                   if not product: 
        
                       return {}  # no match? 
        
                   item_info = getattr(product, 'item_info') 
        
                   images = getattr(product, 'images') 
        
                   edition_info = item_info and getattr(item_info, 'content_info') 
        
                   attribution = item_info and getattr(item_info, 'by_line_info') 
        
                   price = ( 
        
                       getattr(product, 'offers') 
        
                       and product.offers.listings 
        
                       and product.offers.listings[0].price 
        
                   ) 
        
                   brand = ( 
        
                       attribution 
        
                       and getattr(attribution, 'brand') 
        
                       and getattr(attribution.brand, 'display_value') 
        
                   ) 
        
                   manufacturer = ( 
        
                       item_info 
        
                       and getattr(item_info, 'by_line_info') 
        
                       and getattr(item_info.by_line_info, 'manufacturer') 
        
                       and item_info.by_line_info.manufacturer.display_value 
        
                   ) 
        
                   product_group = ( 
        
                       item_info 
        
                       and getattr( 
        
                           item_info, 
        
                           'classifications', 
        
                       ) 
        
                       and getattr(item_info.classifications, 'product_group') 
        
                       and item_info.classifications.product_group.display_value 
        
                   ) 
        
                   try: 
        
                       publish_date = ( 
        
                           edition_info 
        
                           and edition_info.publication_date 
        
                           and isoparser.parse( 
        
                               edition_info.publication_date.display_value 
        
                           ).strftime('%b %d, %Y') 
        
                       ) 
        
                   except Exception: 
        
                       logger.exception(f"serialize({product})") 
        
                       publish_date = None 
        
                   asin_is_isbn10 = not product.asin.startswith("B") 
        
                   isbn_13 = isbn_10_to_isbn_13(product.asin) if asin_is_isbn10 else None 
        
                   book = { 
        
                       'url': "https://www.amazon.com/dp/{}/?tag={}".format( 
        
                           product.asin, h.affiliate_id('amazon') 
        
                       ), 
        
                       'source_records': ['amazon:%s' % product.asin], 
        
                       'isbn_10': [product.asin] if asin_is_isbn10 else [], 
        
                       'isbn_13': [isbn_13] if isbn_13 else [], 
        
                       'price': price and price.display_amount, 
        
                       'price_amt': price and price.amount and int(100 * price.amount), 
        
                       'title': ( 
        
                           item_info 
        
                           and item_info.title 
        
                           and getattr(item_info.title, 'display_value') 
        
                       ), 
        
                       'cover': ( 
        
                           images.primary.large.url 
        
                           if images 
        
                           and images.primary 
        
                           and images.primary.large 
        
                           and images.primary.large.url 
        
                           and '/01RmK+J4pJL.' not in images.primary.large.url 
        
                           else None 
        
                       ), 
        
                       'authors': attribution 
        
                       and [{'name': contrib.name} for contrib in attribution.contributors or []], 
        
                       'publishers': list({p for p in (brand, manufacturer) if p}), 
        
                       'number_of_pages': ( 
        
                           edition_info 
        
                           and edition_info.pages_count 
        
                           and edition_info.pages_count.display_value 
        
                       ), 
        
                       'edition_num': ( 
        
                           edition_info 
        
                           and edition_info.edition 
        
                           and edition_info.edition.display_value 
        
                       ), 
        
                       'publish_date': publish_date, 
        
                       'product_group': product_group, 
        
                       'physical_format': ( 
        
                           item_info 
        
                           and item_info.classifications 
        
                           and getattr( 
        
                               item_info.classifications.binding, 'display_value', '' 
        
                           ).lower() 
        
                       ), 
        
                   } 
        
                   return book

https://github.com/internetarchive/openlibrary/blob/master/openlibrary/tests/core/test_vendors.py

Stakeholders

Instructions for Contributors

Please run these commands to ensure your repository is up to date before creating a new branch to work on this issue and each time after pushing code to Github, because the pre-commit bot may add commits to your PRs upstream.

The text was updated successfully, but these errors were encountered:

krthkmgndm · 2024-09-23T14:44:32Z

Hi, I'm interested in working on this issue. Can someone give me some pointers on how to get started?

scottbarnes · 2024-09-23T15:59:05Z

Hi, @krthkmgndm, it sounds as if this issue may be a bit too daunting, as the current steps outlined are the best I can do, unfortunately. It may make more sense to try a Good First Issue: https://github.com/internetarchive/openlibrary/issues?q=is%3Aissue+is%3Aopen+label%3A%22Good+First+Issue%22+-linked%3Apr.

It's also worth checking out https://github.com/internetarchive/openlibrary/tree/master/docker#welcome-to-the-docker-installation-guide-for-open-library-developers.

DebbieSan · 2024-09-23T16:03:54Z

Hi @scottbarnes. I would like to work on this. I have an idea on how to tackle this issue :) ty!

mekarpeles added Lead: @scottbarnes Issues overseen by Scott (Community Imports) and removed Needs: Lead labels Sep 16, 2024

scottbarnes assigned DebbieSan Sep 23, 2024

github-actions bot added the Needs: Response Issues which require feedback from lead label Sep 24, 2024

scottbarnes removed the Needs: Response Issues which require feedback from lead label Sep 24, 2024

DebbieSan mentioned this issue Sep 24, 2024

reject Amazon dvd imports #9905

Merged

scottbarnes closed this as completed in #9905 Sep 30, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ISBN search imports DVDs from Amazon.com #9879

ISBN search imports DVDs from Amazon.com #9879

stopregionblocking commented Sep 15, 2024 •

edited by scottbarnes

Loading

krthkmgndm commented Sep 23, 2024

scottbarnes commented Sep 23, 2024 •

edited

Loading

DebbieSan commented Sep 23, 2024

ISBN search imports DVDs from Amazon.com #9879

ISBN search imports DVDs from Amazon.com #9879

Comments

stopregionblocking commented Sep 15, 2024 • edited by scottbarnes Loading

Problem

Reproducing the bug

Context

Breakdown

Requirements Checklist

Related files

Stakeholders

Instructions for Contributors

krthkmgndm commented Sep 23, 2024

scottbarnes commented Sep 23, 2024 • edited Loading

DebbieSan commented Sep 23, 2024

stopregionblocking commented Sep 15, 2024 •

edited by scottbarnes

Loading

scottbarnes commented Sep 23, 2024 •

edited

Loading