Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Headless mode #202

Open
wants to merge 67 commits into
base: master
Choose a base branch
from
Open

Conversation

furstenheim
Copy link

The pr is quite large so it would totally make sense not to merge.

That being said, the aim of this pr is to make the repo available in headless mode. That means that it is possible to load it as a npm module and scrape programmatically. Right now it uses jsdom to get the window, document and jquery. Next step would be to add another browser that runs in Chrome Headless.

The pr incorporates several stuff

  • Using standard.js (mostly habit but it was also good to find global undefined variables)
  • Using a bundler so it was easier to port to node. In particular the selectors are stored in Selectors instead of window
  • Use karma and gulp, this made easier to run the tests on changes.
  • Window, jquery and document are not accessed globally but instead passed on object creation. That way we can pass to the content scraper the fake window and jquery when that is the case.
  • Added a jsdom browser, this is similar to chrome popup browser but running in node
  • Added a web browser, this runs jsdom in the devtools in a webworker thus avoiding having a popup to scrape.

@jwillmer
Copy link

Nice work! Long term it might make sense to merge your changes but that would also mean that all other open pull requests (9) could not be merged. I would like to support the headless mode but it is just to much work for me to merge it into my fork

@jwillmer
Copy link

How about implementing the open pull requests into your version?

@furstenheim
Copy link
Author

Yes, I think that would be feasible. The most delicate part will be the tests because I'd rather have no jquery on them so that they are more agnostic.

@furstenheim
Copy link
Author

Actually it is not using jasmine any longer, the tests used an old version of jasmine which wasn't very nice to run async tests (all that runFor, waitFor... was a bit hacky), so I took the time to move to mocha which is easier for asynchronous tests.

I also changed the test runner. It was a bit odd, having the specRunner.html to load all the dependencies. It was alright for the extension but it wasn't perfect to integrate with node and run automatically on changes so I moved it to karma.

@furstenheim
Copy link
Author

furstenheim commented May 23, 2017

Right now the tests work as follows, first JSDOMSpec or browserSpec runs to load window, document and jquery, they store this variables in globals. Then for each test that variable is loaded and passed to the classes that require it

@grinono
Copy link

grinono commented Sep 12, 2017

I automated the client side(extension) scraping with an graphql link to a server. But headless is a much better solution. Is this fork working? If so, is their Any documentation about how to get it working?
Chrome headless implementation is indeed a good next step..

@furstenheim
Copy link
Author

@grinono yes, it is available in npm and it is easy to use

@grinono
Copy link

grinono commented Sep 15, 2017

this, i start testing with it.. but i got lots of errors.

Just changed to right package name
const webscraper = require('webscraper-headless')
to
import webscraper from 'web-scraper-headless'

but then running the example in NodeJS, it returns errors. when fixing these i encounter more and more errors. looks like Babal es6 compiler errors. how do you run this package server side?
I'm using meteor > nodejs

@furstenheim
Copy link
Author

@grinono what kind of errors are you getting? Have you tried requiring the package instead of importing?

@jwillmer
Copy link

this, i start testing with it.. but i got lots of errors.

Can you move this discussion to an issue? So this thread stays on track.

@grinono
Copy link

grinono commented Sep 18, 2017

The first errors i get are regarding the default values declared in the Functions as shown below.
function scrapeJSDOM (sitemapInfo, options = {})
The option = {} is somehow not allowed. But this should be a totally fine declaration in es6.

i can start a new issue, but this code is not officially supported.

@furstenheim
Copy link
Author

You can start an issue in the fork

@grinono
Copy link

grinono commented Sep 19, 2017

i checked that before, but it's not possible to start an issue in the fork...
Image of fork

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants