A very simple alternative to "Apache Nutch":http://nutch.apache.org/, built with Grails.

It can be used independently of Apache Solr, so crawled data can be stored in files, saved to a database, and so on. It uses "Apache Camel":http://camel.apache.org/ as its integration framework and "Apache ActiveMQ":http://activemq.apache.org/ as its open source messaging and integration patterns server.

Installation

Before installing, make sure you have installed the grails-routing-1.1.4.zip plugin (you can find it in the root of the source tree).

    cd gnutch
    grails war

Then copy the generated WAR file to your $YOUR_WEBSERVER_WEBAPPS directory.

Configuration

Settings go in $GNUTCH_HOME/grails-app/conf/Config.groovy:

 gnutch {
   // Input route definition 
   inputRoute = 'file:///tmp/gnutch-input'
   // Text file with regular expressions that include or exclude URLs from crawling
   regexUrlFilter = 'regex-urlfilter.txt'


   crawl {
     // Crawling thread pool
     threads = 40
   }

   transformations = [
     '^http://www.ziprealty.com/property/[\\d\\w\\-\\_]*/\\d*/detail$':'ziprealty.xsl'
   ]

   http {
     // User-Agent string. Ideally it includes the email address of the person responsible
     // for crawling, so that site owners can contact that person directly
     userAgent = 'ziprealty.com crawling bot ([email protected])'
     // Maximum number of connections per host
     defaultMaxConnectionsPerHost = 40
     // Maximum total number of connections
     maxTotalConnections = 40
   }

   solr {
     // URL to Solr server (may reside on master server)
     serverUrl = 'http://localhost:8983/solr'
   }


   activemq {
     // URL to message broker (may reside on master server)
     brokerURL = 'vm://localhost'
     //conf = 'classpath:activemq.xml'
   } 
 }
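
The transformations setting above maps a URL regular expression to an XSL stylesheet. The following is only a sketch of how such a lookup could work, not gnutch's actual code: it assumes the keys are java.util.regex patterns and that the first matching entry wins. The class name TransformationLookup, the helper stylesheetFor and the sample property URL are hypothetical and used for illustration only.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Pattern;

public class TransformationLookup {

    // Hypothetical helper: returns the stylesheet of the first entry whose
    // pattern is found in the URL, or an empty Optional if none is found.
    static Optional<String> stylesheetFor(Map<String, String> transformations, String url) {
        for (Map.Entry<String, String> entry : transformations.entrySet()) {
            if (Pattern.compile(entry.getKey()).matcher(url).find()) {
                return Optional.of(entry.getValue());
            }
        }
        return Optional.empty();
    }

    // The same single entry as in the Config.groovy example.
    static Map<String, String> exampleTransformations() {
        Map<String, String> transformations = new LinkedHashMap<>();
        transformations.put(
            "^http://www.ziprealty.com/property/[\\d\\w\\-\\_]*/\\d*/detail$",
            "ziprealty.xsl");
        return transformations;
    }

    public static void main(String[] args) {
        Map<String, String> transformations = exampleTransformations();
        // Made-up URLs: the first has the /property/<name>/<digits>/detail shape,
        // the second does not.
        String detailPage = "http://www.ziprealty.com/property/123-Main-St/456789/detail";
        String otherPage = "http://www.ziprealty.com/about";
        System.out.println(stylesheetFor(transformations, detailPage).orElse("no stylesheet"));
        System.out.println(stylesheetFor(transformations, otherPage).orElse("no stylesheet"));
    }
}
```

Because the pattern is anchored with ^ and $, only URLs that have the full /property/.../detail shape would be handed to ziprealty.xsl. Under these assumptions any other page would get no stylesheet.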