HOWTO Ship a Heritrix Release
The Heritrix issue tracker is at: https://github.com/internetarchive/heritrix3/issues
The GitHub project is at: https://github.com/internetarchive/heritrix3
The project homepage is at: http://crawler.archive.org
The Docker Hub images are at: https://hub.docker.com/r/iipc/heritrix
And of course, this wiki's entry page is: https://github.com/internetarchive/heritrix3/wiki
In a release number X.Y.Z, X is the 'major' release number, Y is the 'minor' release number, and Z is the 'micro' release number. Interim releases may also have an additional -SUFFIX. For more details, see Version Numbering.
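As a minimal sketch of the numbering scheme above, the following shell function splits a release number into its parts. The function name `parse_version` is hypothetical and not part of the Heritrix tooling; it assumes the suffix, when present, starts at the first `-`.

```shell
# Split a release number of the form X.Y.Z[-SUFFIX] into its parts.
# parse_version is an illustrative helper, not an existing Heritrix script.
parse_version() {
  version="$1"
  core="${version%%-*}"              # text before the first '-'
  suffix=""
  case "$version" in
    *-*) suffix="${version#*-}" ;;   # text after the first '-', if any
  esac
  major="${core%%.*}"                # X
  rest="${core#*.}"                  # Y.Z
  minor="${rest%%.*}"                # Y
  micro="${rest#*.}"                 # Z
  echo "major=$major minor=$minor micro=$micro suffix=$suffix"
}

parse_version "3.4.0-20190205"
```

For the release number used as an example later on this page, this reports major 3, minor 4, micro 0 and the suffix 20190205.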
Before any release, verify that:
- all tracked issues targeted for that release are resolved or rescheduled for a later release
- the continuous build box builds successfully, and all automatic unit tests pass, both on a local developer box and the build box
- the lead developer agrees the code is ready for release and has reviewed recent commit logs for areas of concern
- committers have known that a release is upcoming for a reasonable period (days for micro releases; a week or more for minor releases) and have refrained from making destabilizing changes
(For 'minor' and 'major' releases, other production-scale test crawling should have already occurred, and an announced 'code freeze' on the relevant trunk may have been in effect for a week or more.)
Using a previous Release Notes wiki page as a template, create a skeleton Release Notes wiki page for the planned version. Mark the place where the release date is declared with a 'planned' or 'TK'/'TBD' ('to come' or 'to be determined') notation.
Add notes there on significant changes that anyone upgrading should be aware of, with links to other wiki pages or JIRA issues with more info.
Use the dynamic-inclusion links to pull in a live copy of the 'release notes' issue list from JIRA.
Add acknowledgement of any new or outside contributors to this release.
Make a commit to the trunk that sets the official release version number and links the in-distribution 'release notes' to the full wiki release notes.
Verify all expected artifacts (.tar.gz, .zip, -src.tar.gz, -src.zip) were created and have their official distribution names.
Download each of these to a remote directory and confirm that they expand without error and create the expected directory trees.
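The expansion check can be scripted. The sketch below is one possible approach, not an existing project script: `check_archive` is a hypothetical helper that expands a single archive into a scratch directory and lists the top level so the directory tree can be inspected. It assumes `tar` and `unzip` are installed.

```shell
# Expand one archive into a fresh scratch directory and list its top level.
# check_archive is an illustrative helper; it only handles .tar.gz and .zip.
check_archive() {
  archive="$1"
  scratch="$(mktemp -d)"
  case "$archive" in
    *.tar.gz) tar -xzf "$archive" -C "$scratch" || return 1 ;;
    *.zip)    unzip -q "$archive" -d "$scratch" || return 1 ;;
    *)        echo "unrecognised archive type: $archive" >&2; return 2 ;;
  esac
  echo "OK: $archive expanded to:"
  ls "$scratch"
}

# Example use, run in the directory holding the downloaded artifacts:
#   for f in *.tar.gz *.zip; do check_archive "$f" || echo "FAILED: $f"; done
```

A non-zero return value means the archive failed to expand or had an unexpected extension.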
For at least the .tar.gz, launch the crawler with a webui. Connect to the webui and verify visible version identifiers are as expected.
Using the default profile, configure a minimal test crawl of a small site of several pages (>1 page, <100). Launch the crawl, then verify the expected output in crawl.log and that the crawl terminates normally when finished.
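A quick way to scan crawl.log after the test crawl is to count lines per fetch status code. This is a rough sketch: `summarize_crawl_log` is a hypothetical helper, and it assumes that the second whitespace-separated field of each crawl.log line is the fetch status code. Check that assumption against your own log before relying on the numbers.

```shell
# Count crawl.log lines per fetch status code.
# Assumption: field 2 of each line is the status code.
summarize_crawl_log() {
  awk '{ count[$2]++ } END { for (code in count) print code, count[code] }' "$1" | sort
}

# Example use (the path is illustrative and depends on your job directory):
#   summarize_crawl_log path/to/crawl.log
```

A small test crawl of a healthy site would normally be dominated by successful fetches, so a large share of other codes is a prompt to look more closely.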
The main project POM has an ossrh build profile, intended to be used to submit Maven artefacts to Maven Central, as per the OSSRH Guide.
To use it, you'll need an OSSRH account, and you'll need to request access by getting a current user who has the rights to push to org.archive to comment here with a request to add you to the account.
Then, you'll need to add your username and password to your Maven ~/.m2/settings.xml file, using the sonatype-nexus-staging and sonatype-nexus-snapshots IDs, like this:
<servers>
  <server>
    <id>sonatype-nexus-snapshots</id>
    <username>anjackson</username>
    <password>********</password>
  </server>
  <server>
    <id>sonatype-nexus-staging</id>
    <username>anjackson</username>
    <password>********</password>
  </server>
</servers>
and set up GPG as outlined in the OSSRH guide.
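Before attempting a deploy, it can help to confirm that both server IDs are actually present in the settings file. The following is a minimal sketch: `check_server_ids` is a hypothetical helper that does a plain text search with `grep` rather than parsing the XML, so it only confirms that the `<id>` elements exist, not that the credentials are valid.

```shell
# Report whether each server ID named above appears in a Maven settings file.
# check_server_ids is an illustrative helper; it defaults to ~/.m2/settings.xml.
check_server_ids() {
  settings="${1:-$HOME/.m2/settings.xml}"
  missing=0
  for id in sonatype-nexus-snapshots sonatype-nexus-staging; do
    if grep -q "<id>$id</id>" "$settings"; then
      echo "found: $id"
    else
      echo "missing: $id"
      missing=1
    fi
  done
  return "$missing"
}

# Example use:
#   check_server_ids            # checks ~/.m2/settings.xml
#   check_server_ids ./my-settings.xml
```

The function returns non-zero if either ID is missing, so it can gate a release script.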
Then, you should be able to deploy snapshots with
mvn -Possrh clean deploy
and for releases:
mvn -Possrh release:clean release:prepare
mvn -Possrh release:perform
Note that there may be problems GPG-signing things unless you set a GPG_TTY=$(tty) environment variable; see this for more details.
If there is a problem, you can try mvn release:rollback, but sometimes you'll have to delete the local tag (if it's been created) or reset your git repository.
To get the change log right, we need to generate it after the release so that the changes are associated with the new release tag.
We can update the change log via github-changelog-generator. You'll need a suitable token, then you can use:
export CHANGELOG_GITHUB_TOKEN="«your-40-digit-github-token»"
github_changelog_generator -u internetarchive -p heritrix3 --release-branch master
Then commit the updated CHANGELOG.md to the master branch.
Go to https://github.com/internetarchive/heritrix3/releases and create a release from the release tag. Add a brief summary and include links to the dist TAR and ZIP files hosted on Maven Central (see e.g. https://oss.sonatype.org/content/repositories/releases/org/archive/heritrix/heritrix/3.4.0-20190205/).
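The links for the release summary can be built from the version number. In the sketch below, `release_dir_url` follows the directory pattern of the example URL above. `dist_urls` additionally assumes the files follow the usual Maven `artifactId-version-classifier.ext` naming with a `dist` classifier. Both function names are hypothetical, and the generated file links should be checked in a browser before publishing.

```shell
# Build the release directory URL, following the example URL pattern above.
release_dir_url() {
  version="$1"
  echo "https://oss.sonatype.org/content/repositories/releases/org/archive/heritrix/heritrix/$version/"
}

# Build candidate links to the dist TAR and ZIP files.
# Assumption: files are named heritrix-<version>-dist.<ext>.
dist_urls() {
  version="$1"
  base="$(release_dir_url "$version")"
  for ext in tar.gz zip; do
    echo "${base}heritrix-${version}-dist.${ext}"
  done
}

dist_urls "3.4.0-20190205"
```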
Update the wiki release notes with the actual release date.
Update the project wiki front page to list the new release as the latest, and adjust other wording about upcoming releases accordingly.
Send email to the archive-crawler project list announcing the release, with links to the release notes and download area.
Commit a change to the 'xdocs/index.xml' file in heritrix trunk which auto-generates the http://crawler.archive.org home page, to include a news item in the appropriate place announcing the latest release. (BROKEN NEEDS FIXING: Currently the auto-builds are not uploading the changed website automatically to crawler.archive.org.)
See Docker about building current images.
Build images for the current release number, tag them with <user>/heritrix[:<label>] (<user> being iipc, with an optional label consisting of the release number, contrib for contribution builds, and jre for Java JRE), and push them to Docker Hub.