Web Verse enables deep-linking into HTML text, without requiring specific coöperation from the content (such as adding `id` attributes everywhere). It can be used to generate locator keys for content inside of a page that are reasonably resilient to markup modifications as well as to edits to the text itself. As such, it can be used to build an annotation system for text that is likely to be edited over time. It is not infallible, but it offers good enough resilience to be used in production systems.
It was inspired by Emphasis by Michael Donohoe and by Ted Nelson's parallel markup, but leverages the `Range` interface and the selection object.
We do not provide direct support for, say, mapping a URL's hash containing a Web Verse key onto a specific paragraph. Rather, the expectation is that you can very easily build your own preferred annotation system (or, more generally, a deep, resilient linking system) on top of Web Verse.
We fingerprint a block-level element (e.g. a paragraph) by:
- Normalising the text to abstract away from markup and formatting differences.
- Breaking the text into sentences. We attempt to be smart about handling full stops: we ignore things like "Dr. Who" and a number of similar cases. This is generally enough to avoid getting single-word nonsense for our sentences.
- Taking the first and last sentences. It's OK if the first and last sentences are the same; the key is still meaningful.
- Taking the first character from the first three words of each sentence. Words are defined as tokens composed of a run of non-white-space characters.
These fingerprints have been shown to provide reasonable uniqueness for reasonably-sized documents. Since it's deterministic yet not dependent on all the content, this method is tolerant to smaller changes in the content. Furthermore, finding keys can take edit-distance into account, which enables additional resilience to change.
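
The steps above can be sketched in a few lines of plain JavaScript. This is only an illustration of the idea, working on a string rather than on a DOM element; it is not Web Verse's actual implementation. The function names and the short abbreviation list are hypothetical, and the real sentence splitter handles more cases.

```javascript
// Illustrative sketch only: not Web Verse's actual implementation.
// A tiny, hypothetical list of abbreviations that should not end a sentence.
const ABBREVIATIONS = ['Dr.', 'Mr.', 'Mrs.', 'Ms.', 'Prof.'];

// Collapse white space so that formatting differences do not matter.
function sketchNormalise(text) {
  return text.trim().replace(/\s+/g, ' ');
}

// Naive sentence splitter: a word ending in '.', '!' or '?' closes a
// sentence, unless that word is a known abbreviation.
function sketchSentences(text) {
  const sentences = [];
  let current = [];
  for (const word of sketchNormalise(text).split(' ')) {
    if (word === '') continue; // only happens for empty input
    current.push(word);
    if (/[.!?]$/.test(word) && !ABBREVIATIONS.includes(word)) {
      sentences.push(current.join(' '));
      current = [];
    }
  }
  if (current.length > 0) sentences.push(current.join(' '));
  return sentences;
}

// First character of each of the first three words of a sentence.
function sketchInitials(sentence) {
  return sentence.split(' ').slice(0, 3).map((word) => word[0]).join('');
}

// Key = initials of the first sentence followed by initials of the last one.
function sketchKey(text) {
  const sentences = sketchSentences(text);
  if (sentences.length === 0) return '';
  return sketchInitials(sentences[0]) + sketchInitials(sentences[sentences.length - 1]);
}

console.log(sketchKey('I am a paragraph with 2 sentences. I  am the second sentence.'));
// prints "IaaIat"
```

Because only six characters from two sentences feed into the key, edits elsewhere in the block leave it unchanged, which is where the tolerance to small changes comes from.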
Regions of text more specific than a block-level element can be referenced from within a block using character ranges. For instance, in the following paragraph:
`I` `a`m `a` paragraph with 2 **sentences**.
`I` `a`m `t`he second sentence.
We can refer to the word `sentences` in the first sentence by using the range `24-33`. Together with the paragraph's fingerprint, this gives us an address selecting just that word: `IaaIat:24-33`. (Note that the text offsets are zero-based, and apply to normalised text.)
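
As a rough illustration of how such an address can be assembled and taken apart again, here is a small sketch. The helpers `buildAddress()` and `parseAddress()` are hypothetical and are not part of Web Verse; the key is simply hard-coded to the one from the example paragraph.

```javascript
// Hypothetical helpers for the "key:start-end" address format described above.
function buildAddress(key, startOffset, endOffset) {
  return key + ':' + startOffset + '-' + endOffset;
}

// Returns undefined when the string does not look like an address.
function parseAddress(address) {
  const match = /^([^:]+):(\d+)-(\d+)$/.exec(address);
  if (!match) return undefined;
  return {
    key: match[1],
    startOffset: Number(match[2]),
    endOffset: Number(match[3])
  };
}

// Normalised text of the example paragraph.
const exampleText = 'I am a paragraph with 2 sentences. I am the second sentence.';
const wordStart = exampleText.indexOf('sentences'); // zero-based start offset
const wordEnd = wordStart + 'sentences'.length;     // end offset (exclusive)

const exampleAddress = buildAddress('IaaIat', wordStart, wordEnd);
console.log(exampleAddress); // prints "IaaIat:24-33"

const parsedAddress = parseAddress(exampleAddress);
console.log(exampleText.slice(parsedAddress.startOffset, parsedAddress.endOffset));
// prints "sentences"
```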
npm install web-verse
This is primarily a client-side library (~7k minified and gzipped); just include the `web-verse.min.js` script that comes with the distribution.
Web Verse works with Node, but you have to bring your own DOM. Currently, the best option is likely to be `jsdom`, but it has limitations because it does not support `Range`s.
The following subset of methods works with Node and `jsdom`:
- `createKey()`
- `createHash()`
- `getScope()`, but only with a `node` argument
- `serializeNode()`
- `findKey()`
- `getChildOffsets()`
- `normalizeText()`
- `normalizeOffset()`
- `denormalizeOffset()`
These should normally be more than enough to carry out the sort of operations that you are likely to want to do on the server (as opposed to, say, getting the user's selection and producing a link from it).
When loaded in a Web context, Web Verse exposes itself as a global `WebVerse` object, on which the following methods are available.
Given an element, returns a 6-char key that summarises it for the purposes of deep, resilient linking.
Given a key that is being searched for, and a list of candidate keys (for instance, all the keys for block elements in the document), this will return the best match it can find.
The returned object has fields for `index` (the index in `candidateKeys` that best matched), `value` (the value that actually matched, which may differ slightly from the `targetKey`), and `lev` (an indication of the Levenshtein edit distance of the match). If no match was found, all of those fields will be `undefined`.
The match works by first attempting an exact match, then by choosing the candidate with the smallest edit distance. A candidate whose edit distance is 3 or more is never accepted as a match.
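
The matching strategy described above can be sketched as follows. This is an independent illustration, not the library's code: the names `levenshtein()` and `sketchFindKey()` are hypothetical, and reporting `lev` as `0` for an exact match is an assumption.

```javascript
// Classic dynamic-programming Levenshtein distance between two strings.
function levenshtein(a, b) {
  let previous = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const current = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      current[j] = Math.min(
        previous[j] + 1,       // deletion
        current[j - 1] + 1,    // insertion
        previous[j - 1] + cost // substitution (or match)
      );
    }
    previous = current;
  }
  return previous[b.length];
}

// Exact match first, then the closest candidate with a distance below 3.
function sketchFindKey(targetKey, candidateKeys, maxDistance = 3) {
  const exact = candidateKeys.indexOf(targetKey);
  if (exact !== -1) return { index: exact, value: targetKey, lev: 0 };

  let best = { index: undefined, value: undefined, lev: undefined };
  candidateKeys.forEach((candidate, index) => {
    const distance = levenshtein(targetKey, candidate);
    if (distance < maxDistance && (best.lev === undefined || distance < best.lev)) {
      best = { index, value: candidate, lev: distance };
    }
  });
  return best; // all fields stay undefined when nothing is close enough
}

console.log(sketchFindKey('IaaIat', ['TqbTld', 'IaaIas']));
// prints { index: 1, value: 'IaaIas', lev: 1 }
```

Capping the accepted distance keeps a six-character key from "matching" a block that merely happens to share a few letters with it.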
Given an element, it will return a hash for it that is invariant to numerous markup changes inside of it, looking only at its normalised text content. Such a hash can also be used to generate resilient identifiers.
Given a range and optionally a scoping element (which defaults to `getScope(range)`), it will return the details one needs in order to create a resilient pointer to that range. The returned object contains:
- `$scope`: The scoping element (which was used for key and hash generation).
- `hash`: The hash of the scoping element; can be used as an ID that is resilient to markup and white space changes but not to text edits.
- `key`: The key for the scoping element; can also be used as an ID. It is resilient to markup and white space changes, as well as to a certain amount of text editing; but it is less unique than the `hash`.
- `startOffset`, `endOffset`: The normalised offsets into the text for that range.
So if you wish to use the key+offsets fingerprint discussed in this README's introduction to obtain a resilient pointer to what a given range captures, you would:
var details = WebVerse.serializeRange(range);
var fingerprint = details.key + ':' + details.startOffset + '-' + details.endOffset;
Returns the same details as `serializeRange()` but for the current selection. If there is no selection (or if it is collapsed) it returns `undefined`.
The same as `serializeRange()` but instead of a `Range` it uses a node, using the extent of that node's own text content as the offsets into the given scope. If no scoping `$el` is given, it will use `getScope($node)`.
Given a scope and normalised start/end offsets (that you may have stored in a fingerprint), returns a range object suitable to use directly on the DOM (i.e. applying to the raw content).
If you start with a fingerprint such as the `IaaIat:24-33` example, you would use the `IaaIat` part to find the `$scope` (typically with `findKey()`) and then call this method with the scope and the offsets. It returns a `Range` that you could wrap to highlight, etc.
Given an element to scope the search in, and a string, it will find all instances of that string (in a normalised, white-space-invariant manner) inside the textual content of that element, and return an array of `Range` objects pointing to the matches.
This can be used to find and highlight a specific string. Or, for instance, if a user is creating a link around a given string in a text, this can offer the option of linking all other occurrences of the same string.
Since it returns `Range`s, it can be easily used with `Range.surroundContents`.
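
To illustrate what a white-space-invariant search means here, the sketch below works on plain strings and returns normalised offset pairs instead of `Range` objects, so it runs without a DOM. The function names are hypothetical, the search is case-sensitive, and reporting only non-overlapping matches is an assumption rather than documented behaviour.

```javascript
// Collapse runs of white space, as described for normalised text.
function normaliseWhitespace(text) {
  return text.trim().replace(/\s+/g, ' ');
}

// Find every non-overlapping occurrence of `needle` in `scopeText`,
// comparing normalised text so that line breaks and extra spaces are ignored.
// Offsets in the result apply to the normalised scope text.
function sketchFindOccurrences(scopeText, needle) {
  const haystack = normaliseWhitespace(scopeText);
  const target = normaliseWhitespace(needle);
  const matches = [];
  if (target === '') return matches;

  let from = haystack.indexOf(target);
  while (from !== -1) {
    matches.push({ startOffset: from, endOffset: from + target.length });
    from = haystack.indexOf(target, from + target.length);
  }
  return matches;
}

console.log(sketchFindOccurrences('The  cat saw the\ncat', 'the   cat'));
// prints [ { startOffset: 12, endOffset: 19 } ]
```

Offsets of this kind are in the same normalised form as the `startOffset` and `endOffset` used elsewhere in this README.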
An array of element `tagName`s (i.e. uppercase) that are considered acceptable scopes (block-level elements). You can modify this to alter Web Verse's behaviour.
Given a string, returns a version normalised according to Web Verse's internal normalising algorithm. This is essentially `str.trim().replace(/\s+/g, ' ')` but with its behaviour made resistant to browser vagaries.
Web Verse hides away a lot of the complexity involved in working with normalised text internally while having to manipulate a DOM that has raw, unnormalised text content (without changing the DOM, obviously).
This method returns the offset in the normalised text equivalent to the given raw offset into the unadulterated text. So calling it with `4, ' a  b'` (one leading space and two spaces between the letters, which puts the offset right before the `b`) will return `2`, since the normalised text is `a b`.
This may seem cryptic, and in many ways it is. You should only need this if you are trying to manipulate the text in the same manner as Web Verse does, for instance to extend its functionality.
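
For readers who want to see the idea spelled out, here is a minimal sketch of raw-to-normalised offset mapping on plain strings. It is not the library's implementation: the function names are hypothetical, and it uses the simple `trim()`/`replace()` normalisation quoted above without any of the browser workarounds. The example string has one leading space and two spaces between `a` and `b`.

```javascript
// The simple normalisation quoted in this README.
function sketchNormalizeText(str) {
  return str.trim().replace(/\s+/g, ' ');
}

// Map an offset into the raw text onto the matching offset in the
// normalised text: normalise everything before the offset (dropping
// leading white space, collapsing runs) and measure what is left.
function sketchNormalizeOffset(rawOffset, rawText) {
  const prefix = rawText
    .slice(0, rawOffset)
    .trimStart()
    .replace(/\s+/g, ' ');
  // Trailing white space is trimmed from the normalised text, so an offset
  // that falls inside it is clamped to the end of the normalised text.
  return Math.min(prefix.length, sketchNormalizeText(rawText).length);
}

console.log(sketchNormalizeText(' a  b'));      // prints "a b"
console.log(sketchNormalizeOffset(4, ' a  b')); // prints 2
```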
Does the reverse of the previous one: given a normalised offset and the raw text, it will return the matching raw offset.
Given a range or a `$node`, will return the closest enclosing element that may scope it (i.e. a block-level element from `citeable`). This can be the range's `commonAncestorContainer` or any of its parents. If it goes up the tree without finding a valid candidate, it will return `undefined`.
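
The walk up the tree can be pictured with the following sketch. It is an illustration under stated assumptions, not the library's code: `sketchGetScope()` is a hypothetical name, the tag list is an invented stand-in for the real `citeable` array, and plain objects stand in for DOM nodes so that the example runs without a DOM. For a `Range`, you would start from its `commonAncestorContainer`.

```javascript
// Hypothetical stand-in for the real `citeable` list of uppercase tag names.
const SKETCH_CITEABLE = ['P', 'LI', 'BLOCKQUOTE'];

// Walk from the node up through its parents and return the first element
// whose tag name is an acceptable scope, or undefined if there is none.
function sketchGetScope(node, citeable = SKETCH_CITEABLE) {
  for (let current = node; current; current = current.parentNode) {
    const isElement = current.nodeType === 1; // 1 is Node.ELEMENT_NODE
    if (isElement && citeable.includes(current.tagName)) return current;
  }
  return undefined;
}

// Minimal fake tree: <body><p><em>text</em></p></body>
const fakeBody = { nodeType: 1, tagName: 'BODY', parentNode: null };
const fakeP = { nodeType: 1, tagName: 'P', parentNode: fakeBody };
const fakeEm = { nodeType: 1, tagName: 'EM', parentNode: fakeP };
const fakeText = { nodeType: 3, parentNode: fakeEm }; // 3 is Node.TEXT_NODE

console.log(sketchGetScope(fakeText) === fakeP); // prints true
console.log(sketchGetScope(fakeBody));           // prints undefined
```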
Given a range and an element scope, return an object with `startOffset` and `endOffset` that are the offsets into the normalised text equivalent to that range, for that scope. Mostly of internal use.
Same as `getOffsets()` but uses a `$child` text node (or a `$child` element containing text) to determine the offsets inside a `$parent` element. Returns an object with `startOffset` and `endOffset` fields, which are the offsets normalised to the content of the `$parent`.
The best thing when developing is to `npm run watch`. This will build both Node and browser versions continuously. It is also a good idea to `npm run test-local`, which will keep the Karma tests running (just in Chrome, so as not to be too invasive) whenever you make changes.