Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ISBN search imports DVDs from Amazon.com #9879

Closed
3 tasks
stopregionblocking opened this issue Sep 15, 2024 · 3 comments · Fixed by #9905
Closed
3 tasks

ISBN search imports DVDs from Amazon.com #9879

stopregionblocking opened this issue Sep 15, 2024 · 3 comments · Fixed by #9905
Assignees
Labels
Lead: @scottbarnes Issues overseen by Scott (Community Imports) Module: Import Issues related to the configuration or use of importbot and other bulk import systems. [managed] Priority: 3 Issues that we can consider at our leisure. [managed] Type: Bug Something isn't working. [managed]

Comments

@stopregionblocking
Copy link

stopregionblocking commented Sep 15, 2024

Problem

The feature which allows Amazon.com records to be imported into OpenLibrary via searching for ISBNs will, unfortunately, import DVDs with ISBNs from Amazon.com, even if they are clearly indicated as not being books.

This was confirmed using the example of https://www.amazon.com/gp/product/1621064298

Reproducing the bug

  1. Search OpenLibrary for the ISBN of a DVD available on Amazon.com, but not included on OpenLibrary.
  2. Search for the same ISBN again (to attempt an import of the record from Amazon.com).
  • Expected behavior: Nothing is imported, because there is no book with that ISBN.
  • Actual behavior: A DVD is imported, because it has an ISBN.

Context

  • Browser (Chrome, Safari, Firefox, etc):
  • OS (Windows, Mac, etc):
  • Logged in (Y/N): Y
  • Environment (prod, dev, local): prod

Breakdown

The solution will likely involve some minor modifications to https://github.com/internetarchive/openlibrary/blob/master/openlibrary/core/vendors.py so that DVDs don't return values to get_products().

For the above DVD/item, get_products() returns: [{'url': 'https://www.amazon.com/dp/1621064298/?tag=internetarchi-20', 'source_records': ['amazon:1621064298'], 'isbn_10': ['1621064298'], 'isbn_13': ['9781621064299'], 'price': '$15.19', 'price_amt': 1519, 'title': 'Homeland Insecurity: Films by Bill Brown', 'cover': 'https://m.media-amazon.com/images/I/41FuCUj3kUL._SL500_.jpg', 'authors': [{'name': 'Brown, Bill'}], 'publishers': ['Microcosm Publishing'], 'number_of_pages': None, 'edition_num': None, 'publish_date': 'Aug 01, 2007', 'product_group': 'DVD', 'physical_format': 'dvd'}] .

Of interest is product_group and physical_format. To complete this issue one would likely want to look at https://webservices.amazon.com/paapi5/documentation/ and determine why we should use product_group, physical_format, both, either, or something else to determine something is a DVD. Or maybe it's better to focus on what is allowed (e.g. books).

In any event, we'll likely want to want to modify serialize() or get_product() to filter out DVDs (or only allow whatever constitutes books, if the cases are clear).

Requirements Checklist

  • Read through https://webservices.amazon.com/paapi5/documentation/ (or other relevant documentation) and make a case for why we should filter based on product_group, physical_format, both, either, or something else to determine something is a DVD.
  • Filter using the above criteria (e.g. physical_format) such that get_product() or serialize() will no longer return metadata for DVDS (e.g. if modifying serialize(), it should return {} -- serialize() may be the better option).
  • Update the unit tests to include a case of a DVD being excluded and a book included in the results.

Related files

  • https://github.com/internetarchive/openlibrary/blob/master/openlibrary/core/vendors.py
  • def get_products(
    self,
    asins: list | str,
    serialize: bool = False,
    marketplace: str = 'www.amazon.com',
    resources: Any | None = None,
    **kwargs,
    ) -> list | None:
    """
    :param str asins: One or more ItemIds like ASIN that uniquely identify an item
    or product URL. (Max 10) Separated by comma or as a list.
    """
    # Wait before doing the request
    wait_time = 1 / self.throttling - (time.time() - self.last_query_time)
    if wait_time > 0:
    time.sleep(wait_time)
    self.last_query_time = time.time()
    item_ids = asins if isinstance(asins, list) else [asins]
    _resources = self.RESOURCES[resources or 'import']
    try:
    request = GetItemsRequest(
    partner_tag=self.tag,
    partner_type=PartnerType.ASSOCIATES,
    marketplace=marketplace,
    item_ids=item_ids,
    resources=_resources,
    **kwargs,
    )
    except ApiException:
    logger.error(
    f"Amazon fetch failed for: {', '.join(item_ids)}", exc_info=True
    )
    return None
    response = self.api.get_items(request)
    products = (
    [p for p in response.items_result.items if p]
    if response.items_result
    else []
    )
    return products if not serialize else [self.serialize(p) for p in products]
  • def get_product(self, asin: str, serialize: bool = False, **kwargs):
    if products := self.get_products([asin], **kwargs):
    return next(self.serialize(p) if serialize else p for p in products)
  • def serialize(product: Any) -> dict:
    """Takes a full Amazon product Advertising API returned AmazonProduct
    with multiple ResponseGroups, and extracts the data we are
    interested in.
    :param AmazonAPI product:
    :return: Amazon metadata for one product
    {
    'price': '$54.06',
    'price_amt': 5406,
    'physical_format': 'hardcover',
    'authors': [{'name': 'Guterson, David'}],
    'publish_date': 'Jan 21, 2020',
    #'dimensions': {
    # 'width': [1.7, 'Inches'],
    # 'length': [8.5, 'Inches'],
    # 'weight': [5.4, 'Pounds'],
    # 'height': [10.875, 'Inches']
    # },
    'publishers': ['Victory Belt Publishing'],
    'source_records': ['amazon:1628603976'],
    'title': 'Boundless: Upgrade Your Brain, Optimize Your Body & Defy Aging',
    'url': 'https://www.amazon.com/dp/1628603976/?tag=internetarchi-20',
    'number_of_pages': 640,
    'cover': 'https://m.media-amazon.com/images/I/51IT9MV3KqL._AC_.jpg',
    'languages': ['English']
    'edition_num': '1'
    }
    """
    if not product:
    return {} # no match?
    item_info = getattr(product, 'item_info')
    images = getattr(product, 'images')
    edition_info = item_info and getattr(item_info, 'content_info')
    attribution = item_info and getattr(item_info, 'by_line_info')
    price = (
    getattr(product, 'offers')
    and product.offers.listings
    and product.offers.listings[0].price
    )
    brand = (
    attribution
    and getattr(attribution, 'brand')
    and getattr(attribution.brand, 'display_value')
    )
    manufacturer = (
    item_info
    and getattr(item_info, 'by_line_info')
    and getattr(item_info.by_line_info, 'manufacturer')
    and item_info.by_line_info.manufacturer.display_value
    )
    product_group = (
    item_info
    and getattr(
    item_info,
    'classifications',
    )
    and getattr(item_info.classifications, 'product_group')
    and item_info.classifications.product_group.display_value
    )
    try:
    publish_date = (
    edition_info
    and edition_info.publication_date
    and isoparser.parse(
    edition_info.publication_date.display_value
    ).strftime('%b %d, %Y')
    )
    except Exception:
    logger.exception(f"serialize({product})")
    publish_date = None
    asin_is_isbn10 = not product.asin.startswith("B")
    isbn_13 = isbn_10_to_isbn_13(product.asin) if asin_is_isbn10 else None
    book = {
    'url': "https://www.amazon.com/dp/{}/?tag={}".format(
    product.asin, h.affiliate_id('amazon')
    ),
    'source_records': ['amazon:%s' % product.asin],
    'isbn_10': [product.asin] if asin_is_isbn10 else [],
    'isbn_13': [isbn_13] if isbn_13 else [],
    'price': price and price.display_amount,
    'price_amt': price and price.amount and int(100 * price.amount),
    'title': (
    item_info
    and item_info.title
    and getattr(item_info.title, 'display_value')
    ),
    'cover': (
    images.primary.large.url
    if images
    and images.primary
    and images.primary.large
    and images.primary.large.url
    and '/01RmK+J4pJL.' not in images.primary.large.url
    else None
    ),
    'authors': attribution
    and [{'name': contrib.name} for contrib in attribution.contributors or []],
    'publishers': list({p for p in (brand, manufacturer) if p}),
    'number_of_pages': (
    edition_info
    and edition_info.pages_count
    and edition_info.pages_count.display_value
    ),
    'edition_num': (
    edition_info
    and edition_info.edition
    and edition_info.edition.display_value
    ),
    'publish_date': publish_date,
    'product_group': product_group,
    'physical_format': (
    item_info
    and item_info.classifications
    and getattr(
    item_info.classifications.binding, 'display_value', ''
    ).lower()
    ),
    }
    return book
  • https://github.com/internetarchive/openlibrary/blob/master/openlibrary/tests/core/test_vendors.py

Stakeholders


Instructions for Contributors

  • Please run these commands to ensure your repository is up to date before creating a new branch to work on this issue and each time after pushing code to Github, because the pre-commit bot may add commits to your PRs upstream.
@stopregionblocking stopregionblocking added Needs: Breakdown This big issue needs a checklist or subissues to describe a breakdown of work. [managed] Needs: Lead Needs: Triage This issue needs triage. The team needs to decide who should own it, what to do, by when. [managed] Type: Bug Something isn't working. [managed] labels Sep 15, 2024
@mekarpeles mekarpeles added Lead: @scottbarnes Issues overseen by Scott (Community Imports) and removed Needs: Lead labels Sep 16, 2024
@scottbarnes scottbarnes added Priority: 3 Issues that we can consider at our leisure. [managed] Module: Import Issues related to the configuration or use of importbot and other bulk import systems. [managed] Good First Issue Easy issue. Good for newcomers. [managed] and removed Needs: Breakdown This big issue needs a checklist or subissues to describe a breakdown of work. [managed] Needs: Triage This issue needs triage. The team needs to decide who should own it, what to do, by when. [managed] Good First Issue Easy issue. Good for newcomers. [managed] labels Sep 23, 2024
@krthkmgndm
Copy link

Hi, I'm interested in working on this issue. Can someone give me some pointers on how to get started?

@scottbarnes
Copy link
Collaborator

scottbarnes commented Sep 23, 2024

Hi, @krthkmgndm, it sounds as if this issue may be a bit too daunting, as the current steps outlined are the best I can do, unfortunately. It may make more sense to try a Good First Issue: https://github.com/internetarchive/openlibrary/issues?q=is%3Aissue+is%3Aopen+label%3A%22Good+First+Issue%22+-linked%3Apr.

It's also worth checking out https://github.com/internetarchive/openlibrary/tree/master/docker#welcome-to-the-docker-installation-guide-for-open-library-developers.

@DebbieSan
Copy link
Contributor

Hi @scottbarnes. I would like to work on this. I have an idea on how to tackle this issue :) ty!

@github-actions github-actions bot added the Needs: Response Issues which require feedback from lead label Sep 24, 2024
@scottbarnes scottbarnes removed the Needs: Response Issues which require feedback from lead label Sep 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Lead: @scottbarnes Issues overseen by Scott (Community Imports) Module: Import Issues related to the configuration or use of importbot and other bulk import systems. [managed] Priority: 3 Issues that we can consider at our leisure. [managed] Type: Bug Something isn't working. [managed]
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants