Offline-pages lets you save an entire website to a file, along with all the media required to view the pages offline. It's like your browser's "Save Page As" feature, except that it isn't limited to a single page: it can handle entire websites. All internal links point within the archive (i.e. it is fully self-contained), and the archive is easy to browse offline.
The goals of offline-pages are:
- capture entire sites, instead of single pages
- simple method for viewing the archive with a standard web browser
- store external media and links into a self-contained file
- convert all external references so the archive has no online dependencies
git clone git://github.com/iandennismiller/offline-pages.git
cd offline-pages
sudo make install
Behind the scenes, `make install` will use setuptools to install a Python library and scripts. Tested on OS X Mountain Lion; other *NIXes are likely to work as well.
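If you prefer not to install system-wide with sudo, a user-level install may also work. This is just a sketch, run from inside the cloned repository, and it assumes the project ships a standard setuptools setup.py (the officially supported route is the sudo make install shown above):

python setup.py install --user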
Let's say you want to mirror the wikipedia article for "Webarchive".
First, write the URL to a file (here, urls.txt). This file can contain as many URLs as you want; they will all be added to the same archive.
echo http://en.wikipedia.org/wiki/Webarchive > urls.txt
offline-create ./urls.txt wikipage
offline-browse wikipage.archive.tgz
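Because the URLs file drives the whole archive, capturing several pages at once is just a matter of adding more lines. Here is a hypothetical example that reuses Wikipedia URLs mentioned later in this document; the archive name web-formats is made up:

echo http://en.wikipedia.org/wiki/Webarchive > urls.txt
echo http://en.wikipedia.org/wiki/MHTML >> urls.txt
offline-create ./urls.txt web-formats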
Suppose there is a forum thread consisting of hundreds of posts spanning dozens of pages. Offline-pages can create a fully self-contained mirror of all of these pages, such that the offline version can be navigated much like the online one. In this case, just create a URLs file containing each of the pages you want to include in the archive.
For the purpose of this example, we will look at a vBulletin forum. The base URL for a vBulletin forum thread might look something like this:
http://www.example.com/vb/threads/1234-this-thread
Subsequent pages of the forum thread simply append "/pageX" to the URL, like this:
http://www.example.com/vb/threads/1234-this-thread/page2
Begin with the stable portion of the URL and write all URLs to a file at once:
export BASE_URL=http://www.example.com/vb/threads/1234-this-thread/page
for i in $(seq 1 40); do echo ${BASE_URL}${i}; done > urls.txt
If you look at the file, everything looks right except the first URL. Since the for loop appended a page number to every entry, the first entry needs to be fixed by hand. You will notice that the first line currently says:
http://www.example.com/vb/threads/1234-this-thread/page1
We don't want "page1" there (because that's not how vBulletin works), so edit urls.txt
and change the first line back to the real URL:
http://www.example.com/vb/threads/1234-this-thread
However, that's not going to work right either. The URLs file now tells the archiver to save the first page as a file called 1234-this-thread, and also to create a directory called 1234-this-thread to hold page2 and the later pages. A file and a directory cannot share the same name, so the archival process would be unable to save those later pages. To fix this, modify the first line once more to include a harmless query parameter:
http://www.example.com/vb/threads/1234-this-thread?s=1
You will notice I added ?s=1 to the end of the URL. vBulletin will ignore it, but it gives the first page a distinct filename in the archive, so it no longer collides with the 1234-this-thread directory.
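If you would rather skip the manual editing, the same result can be obtained non-interactively with sed, going straight from the generated page1 line to the ?s=1 form. This is only a sketch; the -i.bak form should work with both GNU and BSD sed:

sed -i.bak '1s|/page1$|?s=1|' urls.txt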
Now you are ready to archive the thread.
offline-create ./urls.txt vbulletin-thread
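As with the Wikipedia example above, you can then open the finished archive in your browser (the file name follows the same name.archive.tgz pattern):

offline-browse vbulletin-thread.archive.tgz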
http://en.wikipedia.org/wiki/Web_ARChive
The Internet Archive created an awesome file format called WARC for storing their web crawls. This seems to offer a great toolkit for very serious work, but it's not easy to use at all. I never figured out how to simply "browse" a web archive using my browser, so even though it might be able to get the job done, it is overly complex for simple tasks.
https://secure.wikimedia.org/wikipedia/en/wiki/Httrack
HTTrack looks like a full-featured and relatively simple site mirroring tool. Unfortunately, it does not compile under OS X Mountain Lion, so I was unable to evaluate how easy its offline archive file format is to use.
http://en.wikipedia.org/wiki/MHTML
MHTML can save a single web page and its assets (images, CSS, and so on) into one .html file, much the way email messages can include images and attachments. The drawback is that it captures a single page rather than multiple pages or an entire website, so if you click around, you get kicked out of the archive and back onto the live Internet.
http://en.wikipedia.org/wiki/Mozilla_Archive_Format
Mozilla Archive Format might have provisions for saving multiple files into an archive. It appears to be integrated with Firefox, but I haven't played with it enough to test its capabilities.
http://en.wikipedia.org/wiki/Webarchive
Safari can save pages for offline use, but the end product behaves a lot like a PDF: it is essentially a static snapshot of the page. Webarchive has the same single-page drawbacks as MHTML.