Home
tribbloid edited this page Dec 23, 2014 · 10 revisions
SpookyStuff is a scalable query engine for web scraping, data mashup, and acceptance QA. The goal is to allow the Web to be queried and ETL'ed like a relational database.
SpookyStuff is the fastest big data collection engine in history, with a speed record of querying 330404 dynamic pages per hour on 300 cores.
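To put the quoted record in per-core terms, here is a minimal arithmetic sketch. It uses only the two figures stated above (330404 pages per hour, 300 cores); the variable names are illustrative, and the even split across cores is an assumption, not a measured result.

```python
# Figures quoted in the text above; everything else is plain division.
PAGES_PER_HOUR = 330_404
CORES = 300
SECONDS_PER_HOUR = 3600

# Assumption: work is spread evenly across all cores.
pages_per_core_hour = PAGES_PER_HOUR / CORES
pages_per_second = PAGES_PER_HOUR / SECONDS_PER_HOUR
seconds_per_page_per_core = SECONDS_PER_HOUR / pages_per_core_hour

print(f"{pages_per_core_hour:.1f} pages per core per hour")
print(f"{pages_per_second:.1f} pages per second across the cluster")
print(f"{seconds_per_page_per_core:.2f} seconds per page on one core")
```

Under that even-split assumption, the record works out to roughly 1,100 pages per core per hour, or a little over three seconds per dynamic page on a single core.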
- Apache Spark
- Selenium
- GhostDriver/PhantomJS
- JSoup
- Apache Tika
- (built by) Apache Maven
- Scala/ScalaTest plugins
- (deployed by) Ansible
- The current implementation is influenced by Spark SQL and Mahout Sparkbinding.
Copyright © 2014 by Peng Cheng @tribbloid, Sandeep Singh @techaddict, Terry Lin @ithinkicancode, Long Yao @l2yao and contributors.
Published under the ASF License; see LICENSE.