-
Notifications
You must be signed in to change notification settings - Fork 275
/
plan.txt
85 lines (68 loc) · 3.11 KB
/
plan.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
== NGWM NEXT STEPS ==
0.2 (week of Nov 28 - Dec 2)
- fix windows file-pipeline bug (move pipeline state into bdb?)
- better install/admin instructions (in file and in UI)
- bundle small default dataset (sample ARC) for immediate use
- remove extraneous files from build distro bundle
- polish UI to clarify local operation/collection/etc.
- release via SF 'file release' mechanism, announce to lists
- demo for team next Tuesday (Dec 6) for UI/feature feedback
0.4 (by first week of January)
- verified performance at typical scale of contract collections
- retrieve/cache/respect freshest robots.txt
- nice, flexible results-list UI (classic calendar or other)
0.6 (July 14, 2006)
- administrative/manual exclusions
- timeline UI ala WERA with in-page banner
0.8 (January 2007)
- multiple collections within single installation? (currently possible with multiple webapps)
- CDX resource index
- ArcProxy allowing single locationDB to be used by multiple front-end webapps
1.0 (March 2007)
- static, manually maintained alphabetic division of index for larger scale installations
- retrieval on demand from live web
- automatically maintained distributed index
- basis for production Wayback Machine replacement
- internationalization support
1.2 (Jan 2008)
- changed LocalArcResourceStore => LocalResourceStore
* supports compressed/uncompressed ARC + WARC format
* requires property arcDir => dataDir
* major refactoring of Replay system, allowing mixing and matching of now-
separable replay elements
* server-side URL rewriting in archival URL mode for JavaScript-less client
replay browsing
* numerous bug-fixes
* pluggable URL canonicalization module
* initial support for de-duplicated WARC records
== BRAINSTORMING ==
Error-handling:
- when URL not available
- if auto-fetch, make sure an in-page floaty announces that it's
a fresh retrieve; log to side for QA purposes
- if not auto-fetch, offer link to broader search (all local collections)
or remote collections (such as public wayback machine)
Entry pages:
- when scanning ARCs, offer option to add /root pages to 'entry pages' list
- also allow admin to add entry pages
- display recommended entry pages below main search box (or on separate page)
Search/Admin UI:
- make host, collection very clear in UI (unless suppressed)
- highlight especially 'localhost' connections (different color background?)
- ape google/search-engine style to maximum extent practical
Replay UI:
- in-page presence
- collapsable uri/date/collection indicator
- mouseover indicators?
- n/a graphic for n/a images
- stayback firefox extension
- would look for special indicator inside page that it's a wayback session
- if found, would refuse to load inline resources or follow clicks without
patching URI back to wayback machine (assuming intercept/callback is
possible)
Heritrix integration ideas:
- bundled? just installed alongside?
- simultaneous update during crawl?
- 'not here yet but scheduled' error/placeholder IMGs?
- schedule-when-requested-via-WM?
- linkify all logs/reports to WM?