
Convert to package #1

Open
ahmed-shariff opened this issue Nov 19, 2021 · 7 comments

@ahmed-shariff
Hey there, awesome paper. And thank you so very much for making all of this publicly available!

I was wondering if there is a reason why this was not conceived as a Python package? Or do you have any plans on doing this in the future?

@arpitnarechania
Member

Thanks! We are in the middle of several upgrades, mostly pertaining to the front-end, e.g., allowing users to configure which publication venues to support and then only load those.

That said, I can certainly make the scraper available as a Python package, but that will take a few weeks. Can you suggest the kind of API you are expecting the Python package to support?

@ahmed-shariff
Author

I was thinking more along the lines of a CLI application. I had written something similar to this (as a bet XD), but it was a much more naive solution (going through the Crossref/DOI API) that was prone to getting my IP blocked. I am just looking to use the metadata you collect to quickly search with regex or any other extension.

This is the current implementation I have, which is still super janky: https://github.com/ahmed-shariff/acm-dl-searcher
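To make the ask concrete, here is the rough shape of the interface I am imagining. This is a minimal sketch; the program name, subcommands, and options are all made up by me for illustration, not anything that exists in this repo or in acm-dl-searcher:

```python
# Hypothetical CLI surface, sketched with argparse; all names are illustrative.
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="pubs-cli",
        description="Fetch and search locally cached publication metadata.")
    sub = parser.add_subparsers(dest="command", required=True)

    # "fetch": download or update the metadata for one venue.
    fetch = sub.add_parser("fetch", help="download/update metadata for a venue")
    fetch.add_argument("--venue", required=True, help="venue key, e.g. CHI or VIS")

    # "search": run a regex over the cached metadata, entirely offline.
    search = sub.add_parser("search", help="regex search over cached metadata")
    search.add_argument("pattern", help="regular expression matched against titles")

    return parser

if __name__ == "__main__":
    print(build_parser().parse_args())
```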

That being said, when I have some time, I can also help out with a few PRs if you have a roadmap or wishlist of what you want this repo to do.

@arpitnarechania
Member

I see, the CLI you implemented looks great; we can definitely do something like that for the scrapers here.

Also, while we definitely encourage contributions from the community going forward, to get that ball rolling let me first discuss the roadmap/milestones with my colleagues and prepare a plan. I will then get back to you here after Thanksgiving!

@ahmed-shariff
Author

@arpitnarechania any updates on this?

@arpitnarechania
Member

Hi @ahmed-shariff, apologies for not getting back to you earlier. We have prepared an internal timeline of multiple new features for the user interface as well as the scraper, many of which are currently in the pipeline. Unfortunately for this discussion, a Python package to scrape data from the command line was voted a low-priority item.

However, would you like to collaborate with me on it? I have a major deadline at the end of March but can use part of my weekends thereafter to work on it with you, at least on designing the CLI spec, commands, and documentation; I am of course happy to port the actual scraper-related aspects.

@ahmed-shariff
Author

It's no problem, I can certainly relate 😅

Having used your current implementation, I see why it would be voted down. It's quite memory-intensive, at least in the first few steps. The current implementation makes sense for running once in a while to update the back-end's database, but it's going to need some optimization to run as a stand-alone CLI (and offline GUI?) application.
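For example, one cheap win might be streaming records instead of loading everything at once. A minimal sketch, assuming a newline-delimited JSON dump; the file name and the "title" field are guesses on my part, not the repo's actual layout:

```python
# Sketch: stream one record at a time from an NDJSON dump instead of
# holding the whole metadata set in memory. File/field names are hypothetical.
import json
import re
from typing import Iterator

def search_records(path: str, pattern: str) -> Iterator[dict]:
    """Yield records whose title matches `pattern`, reading one line at a time."""
    regex = re.compile(pattern, re.IGNORECASE)
    with open(path, encoding="utf-8") as fh:
        for line in fh:                      # only one record in memory at a time
            record = json.loads(line)
            if regex.search(record.get("title", "")):
                yield record

for hit in search_records("metadata.jsonl", r"visual analytics"):
    print(hit["title"])
```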

I am also fighting a few deadlines; I'll set up what I have done so far as a PR and we can discuss the details there.

I'll also create a separate issue to discuss the possible optimizations.

@arpitnarechania
Member

I agree, it was designed for long-running batch updates; but that too can be made efficient -- I had plans to move to asynchronous requests or PySpark-based map-reduce operations. I will go through your WIP PR and get back to you.
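Roughly, the asynchronous idea looks like this. A minimal sketch using asyncio/aiohttp with bounded concurrency; the endpoint URL is a placeholder, not the scraper's actual target:

```python
# Sketch of bounded-concurrency asynchronous fetching; URLs are placeholders.
import asyncio
import aiohttp

async def fetch_json(session: aiohttp.ClientSession, url: str):
    async with session.get(url) as resp:
        resp.raise_for_status()
        return await resp.json()

async def fetch_all(urls, limit: int = 10):
    sem = asyncio.Semaphore(limit)           # cap the number of in-flight requests
    async with aiohttp.ClientSession() as session:
        async def bounded(url):
            async with sem:
                return await fetch_json(session, url)
        return await asyncio.gather(*(bounded(u) for u in urls))

if __name__ == "__main__":
    pages = [f"https://example.org/api/papers?page={i}" for i in range(5)]
    results = asyncio.run(fetch_all(pages))
    print(len(results), "pages fetched")
```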
