Potential Cleanup Refactorings
- merge small Util classes to fewer larger ones?
- do we need both an org.archive.util.IoUtils and an org.archive.crawler.util.IoUtils?
- do we need separate IoUtils and FileUtils?
- should the package be shrunk or split to subpackages?
- move blooms to subpackage?
- move SURT classes to URI-related package?
CrawlURIs need a flexible, generic, serializable data structure for arbitrary markup by loosely-coordinated extensions. This has been AList, but our discomfort with it has caused us to deprecate still-useful direct accessors.
Should we try another structure constrained to have a convenient serialization format, perhaps JSON? JSON+Objects? YAML?
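As a rough illustration of the JSON option, here is a minimal sketch of a key/value bag that only accepts JSON-representable values and can therefore always be serialized. The class name `AnnotationBag` and its methods are hypothetical and are not part of Heritrix; a real implementation would more likely delegate to an existing JSON library than hand-roll the writer.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical annotation bag for illustration; not an existing Heritrix class. */
final class AnnotationBag {
    // Insertion-ordered so the serialized form is stable.
    private final Map<String, Object> data = new LinkedHashMap<>();

    /** Stores a value after checking that it is JSON-representable. */
    void put(String key, Object value) {
        checkJsonable(value);
        data.put(key, value);
    }

    Object get(String key) {
        return data.get(key);
    }

    String toJson() {
        StringBuilder sb = new StringBuilder();
        write(sb, data);
        return sb.toString();
    }

    // Accepts null, String, Boolean, Integer, Long, and Lists/Maps of those.
    private static void checkJsonable(Object v) {
        if (v == null || v instanceof String || v instanceof Boolean
                || v instanceof Integer || v instanceof Long) {
            return;
        }
        if (v instanceof List) {
            for (Object o : (List<?>) v) checkJsonable(o);
            return;
        }
        if (v instanceof Map) {
            for (Map.Entry<?, ?> e : ((Map<?, ?>) v).entrySet()) {
                if (!(e.getKey() instanceof String)) {
                    throw new IllegalArgumentException("non-string key");
                }
                checkJsonable(e.getValue());
            }
            return;
        }
        throw new IllegalArgumentException("not JSON-representable: " + v.getClass());
    }

    private static void write(StringBuilder sb, Object v) {
        if (v == null || v instanceof Boolean || v instanceof Integer || v instanceof Long) {
            sb.append(v); // a null reference is appended as the text "null"
        } else if (v instanceof String) {
            quote(sb, (String) v);
        } else if (v instanceof List) {
            sb.append('[');
            String sep = "";
            for (Object o : (List<?>) v) {
                sb.append(sep);
                write(sb, o);
                sep = ",";
            }
            sb.append(']');
        } else {
            sb.append('{');
            String sep = "";
            for (Map.Entry<?, ?> e : ((Map<?, ?>) v).entrySet()) {
                sb.append(sep);
                quote(sb, (String) e.getKey());
                sb.append(':');
                write(sb, e.getValue());
                sep = ",";
            }
            sb.append('}');
        }
    }

    // Writes a JSON string literal with the mandatory escapes.
    private static void quote(StringBuilder sb, String s) {
        sb.append('"');
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                    else sb.append(c);
            }
        }
        sb.append('"');
    }
}
```

The sketch validates values on `put`, so a list or map mutated after insertion could still hold unsupported values; copying on insert would close that gap at some cost.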
- evaluate for refactoring all classes >1000 lines?
- deprecate SupplementaryLinksScoper – subsumed by mappers, main LinksScoper's logging?
- reconcile dash-underscore_CamelCase conventions?
In places (specifically around the UURI/CrawlURI classes) we've overridden Object.toString() to return a more 'naked' representation of an object, and then relied on that toString() for functionality.
Unfortunately, given toString()'s special role as the default string-ification of any object and its use throughout debugging logs/interfaces, this hides useful info – like the class of anything that reports a URI from toString().
I would prefer any meaningful string-version of an object to be accessed via other methods, retaining toString() in its default implementation or some other rich, debugging-centric rendering that can be adjusted fearlessly without impacting application functionality.
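A minimal sketch of that separation follows. The class `ExampleUri` and the accessor name `toUriString()` are hypothetical stand-ins chosen for illustration, not the actual UURI/CrawlURI API.

```java
import java.util.Objects;

/** Hypothetical stand-in for a URI-holding class; not Heritrix's actual UURI. */
final class ExampleUri {
    private final String uri;

    ExampleUri(String uri) {
        this.uri = Objects.requireNonNull(uri, "uri");
    }

    /** Stable string form that application logic may depend on. */
    String toUriString() {
        return uri;
    }

    /** Debugging-centric rendering; free to change without affecting functionality. */
    @Override
    public String toString() {
        return getClass().getSimpleName() + "[" + uri + "]";
    }
}
```

With this split, log output shows the class of the object as well as the URI, while code that needs the bare URI calls the dedicated accessor.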
It is OK to use JS, but JS should never be necessary for the UI to be usable.
allow tallies (and rates?) of generic named quantities – not only the static set of values defined in interface methods
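One possible shape for such tallies is a thread-safe map from name to counter, sketched below. `NamedTallies` and its methods are hypothetical names; nothing here reflects an existing Heritrix statistics interface, and rates are left out.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical tally holder keyed by arbitrary names; not an existing Heritrix class. */
final class NamedTallies {
    private final ConcurrentHashMap<String, LongAdder> tallies = new ConcurrentHashMap<>();

    /** Adds delta to the named tally, creating the tally on first use. */
    void add(String name, long delta) {
        tallies.computeIfAbsent(name, k -> new LongAdder()).add(delta);
    }

    /** Returns the current value, or 0 for a name never tallied. */
    long get(String name) {
        LongAdder adder = tallies.get(name);
        return adder == null ? 0L : adder.sum();
    }

    /** Returns a name-sorted copy of all tallies, suitable for reporting. */
    Map<String, Long> snapshot() {
        Map<String, Long> out = new TreeMap<>();
        tallies.forEach((name, adder) -> out.put(name, adder.sum()));
        return out;
    }
}
```

Any extension could then record a quantity such as `tallies.add("robots-denied", 1)` without a new interface method being defined for it.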