Headless mode #202

furstenheim · 2017-05-19T14:48:54Z

The PR is quite large, so it would totally make sense not to merge it.

That being said, the aim of this PR is to make the repo usable in headless mode. That means it is possible to load it as an npm module and scrape programmatically. Right now it uses jsdom to get the window, document and jquery. The next step would be to add another browser that runs in Chrome Headless.

The PR incorporates several changes:

Using standard.js (mostly habit, but it was also good for finding undefined global variables).
Using a bundler, so it was easier to port to node. In particular, the selectors are stored in Selectors instead of window.
Using karma and gulp, which made it easier to run the tests on changes.
Window, jquery and document are no longer accessed globally but are passed in on object creation. That way we can pass the fake window and jquery to the content scraper when needed.
Added a jsdom browser. This is similar to the chrome popup browser but runs in node.
Added a web browser. This runs jsdom in the devtools in a web worker, thus avoiding a popup for scraping.
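
To illustrate the point about passing window, document and jquery on object creation, here is a minimal sketch. The class name ContentScraper, the method getTitles and the fake objects are all hypothetical and are not taken from the fork's code; the fake $ only mimics a selector lookup and is not jquery.

```javascript
// Hypothetical sketch of constructor injection; not the fork's actual classes.
class ContentScraper {
  // The environment is handed in instead of being read from globals.
  constructor ({ window, document, $ }) {
    this.window = window
    this.document = document
    this.$ = $
  }

  // Returns the trimmed text of every element matching the selector.
  getTitles (selector) {
    return this.$(selector).map(el => el.textContent.trim())
  }
}

// Fake environment standing in for what jsdom (or a real browser) would provide.
const fakeDocument = {
  nodes: { h1: [{ textContent: ' First ' }, { textContent: 'Second' }] },
  querySelectorAll (selector) {
    return this.nodes[selector] || []
  }
}
const fakeWindow = { document: fakeDocument }
// Simplified stand-in for jquery: returns a plain array of matching nodes.
const fake$ = selector => Array.from(fakeDocument.querySelectorAll(selector))

const scraper = new ContentScraper({ window: fakeWindow, document: fakeDocument, $: fake$ })
const titles = scraper.getTitles('h1')
```

Because the scraper only touches what it was given, the same class can presumably run against a jsdom window in node or the real window in the extension.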

jwillmer · 2017-05-22T12:50:37Z

Nice work! Long term it might make sense to merge your changes, but that would also mean that none of the other open pull requests (9) could be merged. I would like to support headless mode, but it is just too much work for me to merge it into my fork.

jwillmer · 2017-05-23T08:46:19Z

How about implementing the open pull requests into your version?

furstenheim · 2017-05-23T08:57:24Z

Yes, I think that would be feasible. The most delicate part will be the tests, because I'd rather have no jquery in them so that they are more agnostic.

jwillmer · 2017-05-23T12:45:33Z

How about this? https://stackoverflow.com/questions/22810786/add-fixtures-to-jasmine-without-using-jquery-jasmine-is-it-possible

furstenheim · 2017-05-23T13:05:33Z

Actually it is not using jasmine any longer. The tests used an old version of jasmine, which wasn't very nice for running async tests (all that runFor, waitFor... was a bit hacky), so I took the time to move to mocha, which is easier for asynchronous tests.

I also changed the test runner. It was a bit odd having specRunner.html load all the dependencies. It was alright for the extension, but it wasn't ideal for integrating with node and running automatically on changes, so I moved it to karma.

furstenheim · 2017-05-23T13:08:30Z

Right now the tests work as follows: first JSDOMSpec or browserSpec runs to load window, document and jquery, and stores these variables in globals. Then each test loads those variables and passes them to the classes that require them.
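
Roughly, that flow could be sketched like this. It is a simplified illustration without mocha or karma; testGlobals, loadEnvironment, PageReader and pageReaderTest are hypothetical names, not the ones used in the specs.

```javascript
// Shared holder, standing in for the globals that the environment spec fills in.
const testGlobals = {}

// Hypothetical stand-in for JSDOMSpec / browserSpec: runs first and stores the environment.
function loadEnvironment () {
  const document = { title: 'fixture page' }
  const window = { document }
  const $ = selector => ({ selector, length: 0 }) // placeholder, not real jquery
  Object.assign(testGlobals, { window, document, $ })
}

// Hypothetical class under test that needs the environment passed in.
class PageReader {
  constructor ({ window, document, $ }) {
    this.window = window
    this.document = document
    this.$ = $
  }

  title () {
    return this.document.title
  }
}

// Each test reads the stored variables and passes them to the class it exercises.
function pageReaderTest () {
  const { window, document, $ } = testGlobals
  const reader = new PageReader({ window, document, $ })
  return reader.title()
}

loadEnvironment()
const result = pageReaderTest()
```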

grinono · 2017-09-12T21:02:33Z

I automated the client-side (extension) scraping with a GraphQL link to a server, but headless is a much better solution. Is this fork working? If so, is there any documentation about how to get it working?
A Chrome Headless implementation is indeed a good next step.

furstenheim · 2017-09-13T13:24:14Z

@grinono yes, it is available on npm and it is easy to use.

grinono · 2017-09-15T11:43:51Z

I started testing with it, but I got lots of errors.

I just changed to the right package name, from
const webscraper = require('webscraper-headless')
to
import webscraper from 'web-scraper-headless'

But then running the example in Node.js returns errors. When I fix these, I encounter more and more errors. They look like Babel ES6 compiler errors. How do you run this package server side?
I'm using Meteor > Node.js.

furstenheim · 2017-09-15T13:53:30Z

@grinono what kind of errors are you getting? Have you tried requiring the package instead of importing?

jwillmer · 2017-09-15T15:37:53Z

I started testing with it, but I got lots of errors.

Can you move this discussion to an issue? So this thread stays on track.

grinono · 2017-09-18T10:06:36Z

The first errors I get concern the default values declared in function signatures, as shown below.
function scrapeJSDOM (sitemapInfo, options = {})
The options = {} is somehow not allowed, but this should be a totally fine declaration in ES6.

I can start a new issue, but this code is not officially supported.
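
For reference, a small sketch of what that default parameter does, next to an equivalent that avoids the ES6 syntax. This assumes the error comes from a runtime or compile step that does not understand default parameters, which is only a guess. The function names and bodies are hypothetical and only show the signature behaviour, not the package's scrapeJSDOM.

```javascript
// ES6 form: options falls back to {} when the argument is omitted or undefined.
function withDefault (sitemapInfo, options = {}) {
  return { sitemapInfo, options }
}

// Equivalent without default parameter syntax, for runtimes that reject it.
function withoutDefaultSyntax (sitemapInfo, options) {
  if (options === undefined) options = {}
  return { sitemapInfo, options }
}

const a = withDefault('sitemap')
const b = withoutDefaultSyntax('sitemap')
```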

furstenheim · 2017-09-18T10:14:43Z

You can start an issue in the fork

grinono · 2017-09-19T12:44:07Z

I checked that before, but it's not possible to start an issue in the fork...

furstenheim-geoblink and others added 30 commits March 2, 2017 11:31

Add Cssslector

d877e8f

Restrict spec to scraper

8bd2156

ChromeHeadlessBrowser, chrome-remote-interface

2758421

Make bundle

547acc4

Adapt test to node

3c4fbda

Load all dependencies into open web

51fe9c8

Adapt tests to mocha

55ecb6c

Replace chrome runtime by extension

a004a1d

Use deferred from library

179d2be

Set pretty print

c6d514c

Close tab on end

783a227

Require selectors

4291fac

Move several dependencies to require

987d806

Add browserify

3b53ae7

Standarize

2872899

Standarize several files

c506f29

Use standard fix

0ff02b3

Standarize background script

bf65c68

Bundle background script

f55028c

Bundle devtools

4302fe8

Fix some exports

f1a6c9c

Modularize base and deferred

c83da5f

Fix requires

ff363c4

Run standard fix on tests

d418d79

Run first test with karma

ffd632f

Adapt scraper tests

4bb27da

Adapt queue specs

996327b

Adapt selector list spec

b6b17a4

Adapt selector test

cbff04d

Adapt sitemap

23ec26f

furstenheim-geoblink added 11 commits May 18, 2017 14:43

Ignore popup link tests

5d4c012

Remove some $

863586d

Fix listener

4c9714c

Allow headless browsing

6e45a35

Use semi colon separator

1713c74

Add main entry

1b22f47

Add description to README

0d6e611

Remove unnecesary token

d9b338e

Generate builds without sources

b8e4bc6

Add missing tests

532e482

Remove chrome headless

002ea6c

furstenheim-geoblink added 2 commits June 1, 2017 08:14

Update package json

a9a597d

Fix repository

ea6036d

furstenheim-geoblink and others added 2 commits November 28, 2017 14:33

Remove generated bundles

a7da692

Remove console log

984f212

tripu mentioned this pull request Dec 13, 2017

Fix errors in documentation / examples geoblink/web-scraper-chrome-extension#2

Closed
