Headless mode #202

Open: wants to merge 67 commits into base: master
67 commits
d877e8f  Add Cssslector  (furstenheim-geoblink, Mar 2, 2017)
8bd2156  Restrict spec to scraper  (furstenheim-geoblink, Mar 2, 2017)
2758421  ChromeHeadlessBrowser, chrome-remote-interface  (Mar 3, 2017)
547acc4  Make bundle  (furstenheim-geoblink, Mar 3, 2017)
3c4fbda  Adapt test to node  (furstenheim-geoblink, Mar 3, 2017)
51fe9c8  Load all dependencies into open web  (furstenheim-geoblink, Mar 3, 2017)
55ecb6c  Adapt tests to mocha  (furstenheim-geoblink, Mar 3, 2017)
a004a1d  Replace chrome runtime by extension  (furstenheim-geoblink, Mar 3, 2017)
179d2be  Use deferred from library  (furstenheim-geoblink, Mar 3, 2017)
c6d514c  Set pretty print  (furstenheim-geoblink, Mar 3, 2017)
783a227  Close tab on end  (furstenheim-geoblink, Mar 3, 2017)
4291fac  Require selectors  (furstenheim-geoblink, Mar 24, 2017)
987d806  Move several dependencies to require  (furstenheim-geoblink, Mar 24, 2017)
3b53ae7  Add browserify  (furstenheim-geoblink, Mar 24, 2017)
2872899  Standarize  (furstenheim-geoblink, Mar 24, 2017)
c506f29  Standarize several files  (furstenheim-geoblink, Mar 24, 2017)
0ff02b3  Use standard fix  (furstenheim-geoblink, Apr 12, 2017)
bf65c68  Standarize background script  (furstenheim-geoblink, Apr 12, 2017)
f55028c  Bundle background script  (furstenheim-geoblink, Apr 12, 2017)
4302fe8  Bundle devtools  (furstenheim-geoblink, Apr 17, 2017)
f1a6c9c  Fix some exports  (furstenheim-geoblink, Apr 17, 2017)
c83da5f  Modularize base and deferred  (furstenheim-geoblink, Apr 17, 2017)
ff363c4  Fix requires  (furstenheim-geoblink, Apr 17, 2017)
d418d79  Run standard fix on tests  (furstenheim-geoblink, Apr 17, 2017)
ffd632f  Run first test with karma  (furstenheim-geoblink, Apr 18, 2017)
4bb27da  Adapt scraper tests  (furstenheim-geoblink, Apr 25, 2017)
996327b  Adapt queue specs  (furstenheim-geoblink, Apr 25, 2017)
b6b17a4  Adapt selector list spec  (furstenheim-geoblink, Apr 26, 2017)
cbff04d  Adapt selector test  (furstenheim-geoblink, Apr 26, 2017)
23ec26f  Adapt sitemap  (furstenheim-geoblink, Apr 26, 2017)
7716a29  Adapt UniqueElementList  (furstenheim-geoblink, Apr 26, 2017)
700762a  Adapt tests  (furstenheim-geoblink, Apr 27, 2017)
c1823e0  Adapt Element query  (furstenheim-geoblink, Apr 27, 2017)
39cf2ca  Adapt global tests  (furstenheim-geoblink, Apr 27, 2017)
b8aef9d  Adapt element click and element attribute  (furstenheim-geoblink, Apr 27, 2017)
230f299  WIP: Adapt selector tests  (furstenheim-geoblink, Apr 28, 2017)
c4dc80f  Finish adapting the tests for webscraper  (furstenheim-geoblink, Apr 28, 2017)
5d10a68  Remove unnecesary files  (furstenheim-geoblink, Apr 28, 2017)
8eeadef  Rebundle extension  (furstenheim-geoblink, Apr 28, 2017)
cb91a34  Load css selector as a package  (furstenheim-geoblink, May 3, 2017)
2e36812  WIP get rid of global jquery  (furstenheim-geoblink, May 3, 2017)
1bbb1a0  Revert "WIP get rid of global jquery"  (furstenheim-geoblink, May 16, 2017)
2d20e35  Reload tests in gulp  (furstenheim-geoblink, May 16, 2017)
d2f9e5f  Remove global jquery  (furstenheim-geoblink, May 3, 2017)
61758fd  Fix small issues  (furstenheim-geoblink, May 17, 2017)
fb8d7d6  Require document to be local  (furstenheim-geoblink, May 17, 2017)
40e27dd  Throw error if missing document  (furstenheim-geoblink, May 17, 2017)
1adda3f  Create objects with document and window  (furstenheim-geoblink, May 17, 2017)
cd7aa01  Get local version of document and window  (furstenheim-geoblink, May 17, 2017)
04c4409  Adapt all tests except scraper and popup  (furstenheim-geoblink, May 18, 2017)
408aca1  Move save image to browser  (furstenheim-geoblink, May 18, 2017)
a715b91  Adapt scraper test  (furstenheim-geoblink, May 18, 2017)
5d4c012  Ignore popup link tests  (furstenheim-geoblink, May 18, 2017)
863586d  Remove some $  (furstenheim-geoblink, May 19, 2017)
4c9714c  Fix listener  (furstenheim-geoblink, May 19, 2017)
6e45a35  Allow headless browsing  (furstenheim-geoblink, May 19, 2017)
1713c74  Use semi colon separator  (furstenheim-geoblink, May 19, 2017)
1b22f47  Add main entry  (furstenheim-geoblink, May 19, 2017)
0d6e611  Add description to README  (furstenheim-geoblink, May 19, 2017)
d9b338e  Remove unnecesary token  (furstenheim-geoblink, May 19, 2017)
b8e4bc6  Generate builds without sources  (furstenheim-geoblink, May 19, 2017)
532e482  Add missing tests  (furstenheim-geoblink, May 19, 2017)
002ea6c  Remove chrome headless  (furstenheim-geoblink, May 19, 2017)
a9a597d  Update package json  (furstenheim-geoblink, Jun 1, 2017)
ea6036d  Fix repository  (furstenheim-geoblink, Jun 1, 2017)
a7da692  Remove generated bundles  (furstenheim-geoblink, Nov 28, 2017)
984f212  Remove console log  (furstenheim, Dec 7, 2017)
3 changes: 3 additions & 0 deletions .babelrc
@@ -0,0 +1,3 @@
+{
+  "plugins": ["meaningful-logs"]
+}
16 changes: 16 additions & 0 deletions .eslintrc
@@ -0,0 +1,16 @@
+{ "env": {
+    "node": true
+  },
+  "globals": {
+    "d3": true,
+    "$": true,
+    "chrome": true,
+    "jQuery": true,
+    "describe": true,
+    "it": true,
+    "beforeEach": true,
+    "afterEach": true,
+    "after": true,
+    "before": true
+  },
+  "extends": ["standard"]}
3 changes: 2 additions & 1 deletion .gitignore
@@ -1,4 +1,5 @@
.idea
projectFilesBackup
extension.zip

node_modules
npm-debug.log
31 changes: 29 additions & 2 deletions README.md
@@ -1,11 +1,13 @@
 # Web Scraper
-Web Scraper is a chrome browser extension built for data extraction from web
+Web Scraper is a chrome browser extension and a library built for data extraction from web
 pages. Using this extension you can create a plan (sitemap) how a web site
 should be traversed and what should be extracted. Using these sitemaps the
 Web Scraper will navigate the site accordingly and extract all data. Scraped
 data later can be exported as CSV.

-Install the extension from [Chrome store] [chrome-store]
+To use it as an extension install it from [Chrome store] [chrome-store]
+
+To use it as a library do `npm i web-scraper-headless`

 ### Features

@@ -26,6 +28,31 @@ Install the extension from [Chrome store] [chrome-store]

 Submit bugs and suggest features on [bug tracker] [github-issues]

+#### Headless mode
+To use it as a library you need a sitemap, for example exported from the app.
+
+    const webscraper = require('webscraper-headless')
+    const sitemap = {
+      id: 'test',
+      startUrl: 'http://test.lv/',
+      selectors: [
+        {
+          'id': 'a',
+          'selector': '#scraper-test-one-page a',
+          'multiple': false,
+          type: 'SelectorText',
+          'parentSelectors': [
+            '_root'
+          ]
+        }
+      ]
+    }
+    const options = {} // optional delay and pageLoadDelay
+    webscraper(sitemap, options)
+      .then(function (scraped) {
+        // This is your scraped info
+      })
+
#### Bugs
When submitting a bug please attach an exported sitemap if possible.
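
Review note on the Headless mode example: a sitemap can be sanity-checked before it is handed to the library. The sketch below is not part of this PR. `validateSitemap` is a hypothetical helper, and the rule that every `parentSelectors` entry must be `_root` or the `id` of another selector is an assumption drawn from the README example. It uses only built-ins, so it runs without the package installed.

```javascript
// Hypothetical helper (not in the PR): checks the fields that the README's
// Headless mode example relies on and returns a list of problems found.
function validateSitemap (sitemap) {
  const errors = []
  if (typeof sitemap.id !== 'string' || sitemap.id === '') {
    errors.push('missing id')
  }
  if (typeof sitemap.startUrl !== 'string' || sitemap.startUrl === '') {
    errors.push('missing startUrl')
  }
  const selectors = Array.isArray(sitemap.selectors) ? sitemap.selectors : []
  // Assumption: '_root' is always a valid parent, as in the README example.
  const knownIds = new Set(['_root'])
  selectors.forEach(function (selector) {
    knownIds.add(selector.id)
  })
  selectors.forEach(function (selector) {
    const parents = Array.isArray(selector.parentSelectors) ? selector.parentSelectors : []
    if (parents.length === 0) {
      errors.push('selector ' + selector.id + ' has no parentSelectors')
    }
    parents.forEach(function (parent) {
      if (!knownIds.has(parent)) {
        errors.push('selector ' + selector.id + ' has unknown parent ' + parent)
      }
    })
  })
  return errors
}

// The sitemap from the README example.
const sitemap = {
  id: 'test',
  startUrl: 'http://test.lv/',
  selectors: [
    {
      id: 'a',
      selector: '#scraper-test-one-page a',
      multiple: false,
      type: 'SelectorText',
      parentSelectors: ['_root']
    }
  ]
}

console.log(validateSitemap(sitemap)) // prints []
```

One thing to double-check in the README itself: the install line names the package `web-scraper-headless`, while the snippet requires `webscraper-headless`. Whichever name actually resolves after installation is the one to pass to `require`.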

53 changes: 27 additions & 26 deletions extension/assets/base64.js
@@ -1,36 +1,37 @@
+var jquery = require('jquery-deferred')
 /**
  * @url http://jsperf.com/blob-base64-conversion
  * @type {{blobToBase64: blobToBase64, base64ToBlob: base64ToBlob}}
  */
 var Base64 = {

-	blobToBase64: function(blob) {
+  blobToBase64: function (blob) {
+    var deferredResponse = jquery.Deferred()
+    var reader = new FileReader()
+    reader.onload = function () {
+      var dataUrl = reader.result
+      var base64 = dataUrl.split(',')[1]
+      deferredResponse.resolve(base64)
+    }
+    reader.readAsDataURL(blob)

-		var deferredResponse = $.Deferred();
-		var reader = new FileReader();
-		reader.onload = function() {
-			var dataUrl = reader.result;
-			var base64 = dataUrl.split(',')[1];
-			deferredResponse.resolve(base64);
-		};
-		reader.readAsDataURL(blob);
+    return deferredResponse.promise()
+  },

-		return deferredResponse.promise();
-	},
+  base64ToBlob: function (base64, mimeType) {
+    var deferredResponse = jquery.Deferred()
+    var binary = atob(base64)
+    var len = binary.length
+    var buffer = new ArrayBuffer(len)
+    var view = new Uint8Array(buffer)
+    for (var i = 0; i < len; i++) {
+      view[i] = binary.charCodeAt(i)
+    }
+    var blob = new Blob([view], {type: mimeType})
+    deferredResponse.resolve(blob)

-	base64ToBlob: function(base64, mimeType) {
+    return deferredResponse.promise()
+  }
+}

-		var deferredResponse = $.Deferred();
-		var binary = atob(base64);
-		var len = binary.length;
-		var buffer = new ArrayBuffer(len);
-		var view = new Uint8Array(buffer);
-		for (var i = 0; i < len; i++) {
-			view[i] = binary.charCodeAt(i);
-		}
-		var blob = new Blob([view], {type: mimeType});
-		deferredResponse.resolve(blob);
-
-		return deferredResponse.promise();
-	}
-};
+module.exports = Base64
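
Review note: the byte-level step inside `base64ToBlob` (decode with `atob`, then copy character codes into a `Uint8Array`) can be tried in isolation. The sketch below is an illustration, not code from this PR. `base64ToBytes` and `bytesToBase64` are hypothetical names, and it assumes an environment where `atob` and `btoa` are globals (browsers, or a recent Node.js release).

```javascript
// Hypothetical helpers (not in the PR) mirroring the loop in base64ToBlob.
function base64ToBytes (base64) {
  const binary = atob(base64) // decoded string, one character per byte
  const view = new Uint8Array(binary.length)
  for (let i = 0; i < binary.length; i++) {
    view[i] = binary.charCodeAt(i)
  }
  return view
}

// Inverse direction: build the binary string back and encode it.
function bytesToBase64 (bytes) {
  let binary = ''
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i])
  }
  return btoa(binary)
}

const bytes = base64ToBytes('aGVsbG8=') // base64 of the ASCII text "hello"
console.log(Array.from(bytes)) // prints [ 104, 101, 108, 108, 111 ]
console.log(bytesToBase64(bytes)) // prints aGVsbG8=
```

In the diff, the resulting view is then wrapped in `new Blob([view], {type: mimeType})`. `blobToBase64` goes the other way through `FileReader.readAsDataURL`, which is a browser API, so that half is left out of this sketch.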
1 change: 0 additions & 1 deletion extension/assets/css-selector
Submodule css-selector deleted from d9c204
66 changes: 33 additions & 33 deletions extension/assets/jquery.whencallsequentially.js
@@ -1,48 +1,48 @@
+var jquery = require('jquery-deferred')
 /**
  * @author Martins Balodis
  *
  * An alternative version of $.when which can be used to execute asynchronous
  * calls sequentially one after another.
  *
- * @returns $.Deferred().promise()
+ * @returns jqueryDeferred().promise()
  */
-$.whenCallSequentially = function (functionCalls) {
-
-	var deferredResonse = $.Deferred();
-	var resultData = new Array();
+module.exports = function whenCallSequentially (functionCalls) {
+  var deferredResonse = jquery.Deferred()
+  var resultData = []

 	// nothing to do
-	if (functionCalls.length === 0) {
-		return deferredResonse.resolve(resultData).promise();
-	}
+  if (functionCalls.length === 0) {
+    return deferredResonse.resolve(resultData).promise()
+  }

-	var currentDeferred = functionCalls.shift()();
+  var currentDeferred = functionCalls.shift()()
 	// execute synchronous calls synchronously
-	while (currentDeferred.state() === 'resolved') {
-		currentDeferred.done(function (data) {
-			resultData.push(data);
-		});
-		if (functionCalls.length === 0) {
-			return deferredResonse.resolve(resultData).promise();
-		}
-		currentDeferred = functionCalls.shift()();
-	}
+  while (currentDeferred.state() === 'resolved') {
+    currentDeferred.done(function (data) {
+      resultData.push(data)
+    })
+    if (functionCalls.length === 0) {
+      return deferredResonse.resolve(resultData).promise()
+    }
+    currentDeferred = functionCalls.shift()()
+  }

 	// handle async calls
-	var interval = setInterval(function () {
+  var interval = setInterval(function () {
 		// handle mixed sync calls
-		while (currentDeferred.state() === 'resolved') {
-			currentDeferred.done(function (data) {
-				resultData.push(data);
-			});
-			if (functionCalls.length === 0) {
-				clearInterval(interval);
-				deferredResonse.resolve(resultData);
-				break;
-			}
-			currentDeferred = functionCalls.shift()();
-		}
-	}, 10);
+    while (currentDeferred.state() === 'resolved') {
+      currentDeferred.done(function (data) {
+        resultData.push(data)
+      })
+      if (functionCalls.length === 0) {
+        clearInterval(interval)
+        deferredResonse.resolve(resultData)
+        break
+      }
+      currentDeferred = functionCalls.shift()()
+    }
+  }, 10)

-	return deferredResonse.promise();
-};
+  return deferredResonse.promise()
+}
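
Review note: for comparison, the same "run one call after another and collect the results" behaviour can be sketched with native promises. This is an illustration under stated assumptions, not a proposed replacement. `callSequentially` and `delayed` are hypothetical names, and each entry in `functionCalls` is assumed to return either a plain value or a thenable.

```javascript
// Hypothetical promise-based counterpart of whenCallSequentially (not in the PR).
// Each call starts only after the previous one has settled successfully.
function callSequentially (functionCalls) {
  const results = []
  let chain = Promise.resolve()
  functionCalls.forEach(function (call) {
    chain = chain
      .then(function () { return call() })
      .then(function (data) { results.push(data) })
  })
  return chain.then(function () { return results })
}

// Helper for the demo: returns a function that resolves `value` after `ms`.
function delayed (value, ms) {
  return function () {
    return new Promise(function (resolve) {
      setTimeout(function () { resolve(value) }, ms)
    })
  }
}

callSequentially([delayed('a', 20), delayed('b', 5), function () { return 'c' }])
  .then(function (results) {
    console.log(results) // prints [ 'a', 'b', 'c' ]
  })
```

Three differences from the code in the diff are worth keeping in mind. The sketch does not consume the input array with `shift()`. It always resolves asynchronously, whereas the original runs already-resolved calls synchronously. And it rejects as soon as one call rejects, whereas the original, as the diff reads, only ever checks for the `'resolved'` state and would keep polling every 10 ms if a deferred were rejected.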