DRAT Statistics
Chris Mattmann edited this page Jul 20, 2017 · 3 revisions
What
This is a simple utility, written in Python, that uses DRAT to scan multiple code repositories sequentially, collects statistics for each one, and writes them to both Apache Solr (the "statistics" core) and a user-defined directory.
What Statistics
- Crawl start time
- Crawl end time
- Index start time
- Index end time
- Mapper start time
- Mapper end time
- Reducer start time
- Reducer end time
- Notes (count from RatAggregator)
- Binaries (count from RatAggregator)
- Archives (count from RatAggregator)
- Standards (count from RatAggregator)
- Apache (count from RatAggregator)
- Generated (count from RatAggregator)
- Unknown (count from RatAggregator)
- Mimetypes (count from "drat" core by doing a facet on "mimetype")
All license types are stored as "license_*" and mimetypes as "mime_*"
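As a rough illustration of the mimetype counts, the sketch below builds a standard Solr facet query on the "mimetype" field of the "drat" core and converts a facet result into "mime_*" entries. The function names are hypothetical, the `/select` handler path and the exact shape of the "mime_*" names are assumptions, and this is not necessarily how the utility itself does it.

```python
from urllib.parse import urlencode


def mimetype_facet_url(solr_url, core="drat"):
    """Build a select URL that facets on the "mimetype" field.

    Assumes the core exposes the usual /select request handler.
    """
    params = [
        ("q", "*:*"),          # match every document
        ("rows", 0),           # only the facet counts are needed
        ("facet", "true"),
        ("facet.field", "mimetype"),
        ("wt", "json"),
    ]
    return f"{solr_url.rstrip('/')}/{core}/select?{urlencode(params)}"


def mime_fields(flat_facets):
    """Convert Solr's flat [value, count, value, count, ...] facet list
    into a dict keyed by "mime_<value>" (key format is an assumption)."""
    values = flat_facets[0::2]
    counts = flat_facets[1::2]
    return {f"mime_{value}": count for value, count in zip(values, counts)}
```

With Solr's default JSON output, the flat list is found under `facet_counts` → `facet_fields` → `mimetype` in the response.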
Why
DRAT runs on a single code repository and generates its output. But what if a large number of repositories need to be scanned and their individual statistics recorded? This utility can be used for such large-scale tasks. Storing the statistics in a Solr core also makes them easy to explore and visualize through function and facet queries.
How To Use
- Set the following environment variables:
- DRAT_HOME - (eg: ~/drat/deploy)
- JAVA_HOME - (where your Java installation resides; the same value you use for your DRAT installation)
- OPSUI_URL - (eg: http://localhost:8080/opsui)
- SOLR_URL - (eg: http://localhost:8080/solr)
- WORKFLOW_URL - (eg: http://localhost:9001)
- Run the script as below:
python dratstats.py <path to list of repository URLs> <path to output directory>
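Before launching the script, it can help to confirm that the environment variables listed above are set. The following is a minimal sketch of such a check; `REQUIRED_VARS` and `missing_env_vars` are hypothetical names and are not part of dratstats.py.

```python
import os

# Variable names taken from the list above.
REQUIRED_VARS = ("DRAT_HOME", "JAVA_HOME", "OPSUI_URL", "SOLR_URL", "WORKFLOW_URL")


def missing_env_vars(environ=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling `missing_env_vars()` with no arguments inspects the current process environment; an empty list means every variable is present.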
The two arguments to dratstats.py are:
- Path to a flat file containing a list of repositories to traverse. Each line in the file gives the absolute path to one source code repository; in the examples below, which reference the Apache Tika and Apache Nutch codebases on a local file system, the path is followed by a name, a repository URL, and a short description.
/apacheSvn/tika ApacheTika http://github.com/apache/tika.git The digital babel fish.
/apacheSvn/nutch ApacheNutch http://github.com/apache/nutch.git The open source web crawler.
A sample repos.txt file is available.
- Path to the output directory into which the contents of ${DRAT_HOME}/data will be copied for each repository. Each folder in the output directory follows a standard naming convention:
- The first character, ‘/’, is removed.
- Every remaining ‘/’ is replaced with ‘_’.
- The current timestamp is appended. Example: the output folder for the ‘/apacheSvn/tika’ repository would be named apacheSvn_tika_2016-01-15T23:14:39Z
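The two inputs described above can be sketched in Python as follows. Both function names are hypothetical. The field order in `parse_repo_line` is inferred from the example entries, and the use of UTC for the timestamp is an assumption based on the trailing "Z" in the example folder name.

```python
from datetime import datetime, timezone


def parse_repo_line(line):
    """Split one repository entry into its four fields.

    Assumes whitespace-separated fields in the order shown in the
    examples: local path, name, repository URL, free-text description.
    """
    path, name, url, description = line.strip().split(None, 3)
    return {"path": path, "name": name, "url": url, "description": description}


def output_folder_name(repo_path, when=None):
    """Derive the output folder name for a repository path."""
    when = when or datetime.now(timezone.utc)  # UTC is an assumption
    stamp = when.strftime("%Y-%m-%dT%H:%M:%SZ")
    # Drop the first character ('/'), replace the remaining '/' with '_',
    # then append the timestamp.
    return repo_path[1:].replace("/", "_") + "_" + stamp
```

For instance, passing `/apacheSvn/tika` with a timestamp of 2016-01-15 23:14:39 reproduces the folder name given in the example above.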