Parabible API (proposal)
This is a WIP. There is no code that necessarily corresponds to this functionality at this point, nor is the data wrangled into a format that would support these queries (yet). The purpose of this design doc is (1) to help solidify objectives so that the data wrangling has clearer goals in mind, (2) to help guide the implementation of new server code, and (3) to step towards having documentation (which https://parabible.com woefully lacks right now).
The idea is to follow RESTful principles so that the API is sensible and easy to reason about. With that in mind, I think this document should begin with the resources that can be queried.
Resource | Content |
---|---|
word | The data that drives https://parabible.com is morphologically tagged Greek and Hebrew words. |
text | Text is versioned and aligned to parallels. Where it contains strings of words, it includes word IDs. It should be treated as sanitized HTML (even text without word IDs can have italics etc.). It will not include verse numbers. It may include Expansion Symbols*. |
verse | Because nodes use arbitrary integer IDs (and are collections of word IDs), the verse endpoint provides an interface to more familiar units that have more meaningful names. |
query | Search queries are sufficiently complex that we need to consider them a type of resource. |
Obviously, apart from `Read`, the usual CRUD operations don't make sense with most of these resources because they're static. So we'll move quickly over the first three and spend a bit more time on queries.
If there is no id set, the API will return an error (TODO: really?):
{
"error": "UNSET_ID",
"message": "Resource id not set. Expected `/word/:id`"
}
This query will return word data in an array of key/value pairs:
{
"wid": 2002762,
"result": [
{ "key": "pos", "value": "verb" },
{ "key": "lexeme", "value": "οἶδα" },
{ "key": "gloss", "value": "I know, remember" },
{ "key": "person", "value": "3" },
{ "key": "tense", "value": "perf" },
{ "key": "voice", "value": "act" },
{ "key": "mood", "value": "ind" },
{ "key": "number", "value": "sg" }
]
}
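Since the attributes arrive as an array of key/value pairs, a client will probably want to flatten them into a lookup table. Here is a minimal sketch of that, assuming the response shape shown above; the function name `word_attributes` is made up for illustration and is not part of any existing client.

```python
from typing import Any


def word_attributes(response: dict[str, Any]) -> dict[str, str]:
    """Flatten the `result` array of key/value pairs into a plain dict."""
    return {pair["key"]: pair["value"] for pair in response["result"]}


# Abbreviated version of the example response above.
sample = {
    "wid": 2002762,
    "result": [
        {"key": "pos", "value": "verb"},
        {"key": "lexeme", "value": "οἶδα"},
        {"key": "tense", "value": "perf"},
    ],
}

attrs = word_attributes(sample)
print(attrs["lexeme"], attrs["tense"])
```

Note that if a key were ever repeated in the array, this flattening would silently keep only the last value, which may be one reason to keep the array form on the wire.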
For an unset id, see `/word` above...
The `id` of a text node is a reference to a parallel node. That is, the primary key is an `id`-`version` composite, so requesting multiple versions of the same verse requires an id and each `version` requested. Parallel data is limited in some respects, which sometimes makes verse-to-verse equivalencies suboptimal. But it is good enough that we're using it for the foreseeable future (until I have more time to do some awesome analysis and generate better parallelisation), so: a text node always contains a single full verse. As above, then, it should be treated as sanitized HTML (even text without word IDs can have italics etc.).
Note: IDs are not necessarily ordered, so you cannot simply increment them to request the next verse. That is why the next and previous IDs are supplied. These are ordered according to the first `version`'s versification.
If there is no `version` set, the API will return an error (TODO: really? this seems like something that could have a sensible default):
{
"error": "UNSET_VERSIONS",
"message": "Text versions not set for {id}. Expected `/text/:id?versions=v1,v2,v3`"
}
Assuming `versions` is set, however (`versions=sblgnt,net`):
{
"text_id": 104582,
"versions": ["sblgnt", "net"],
"result": [{
"version": "sblgnt",
"verse_id": 39002001,
"text": "<word id=123>Greek</word> <word id=124>words</word> <word id=125>everywhere</word>."
}, {
"version": "net",
"verse_id": 39002001,
"text": "The NET is just a bunch of strings with <i>occasional</i> formatting."
}],
"previous_id": 104581,
"next_id": 104583
}
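To make the "text includes word IDs" idea concrete, here is a rough sketch of how a client might pull word IDs out of a `text` string. It assumes the unquoted `<word id=...>` form shown in the example above and uses a regular expression only because that form is so simple; a real client would more likely hand the string to an HTML parser. The name `extract_words` is hypothetical.

```python
import re

# Matches the markup shape used in the example response: <word id=123>...</word>
WORD_TAG = re.compile(r"<word id=(\d+)>(.*?)</word>")


def extract_words(text: str) -> list[tuple[int, str]]:
    """Return (word_id, surface_text) pairs in document order."""
    return [(int(wid), surface) for wid, surface in WORD_TAG.findall(text)]


tagged = (
    "<word id=123>Greek</word> <word id=124>words</word> "
    "<word id=125>everywhere</word>."
)
untagged = "Plain text with <i>italics</i> but no word IDs."

print(extract_words(tagged))
print(extract_words(untagged))
```

Versions without word IDs simply yield an empty list, so the same code path can handle both kinds of text.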
As above, but this time `ids` is a comma-separated list of ids: `/texts/1,2,3?versions=bhs,net`. The return is an array of texts.
For an unset id, see `/word` above...
Instead of requesting a parallel text node, a verse can be requested by using a reference string. Reference strings will be best-guess parsed (like "1 Joh 3", "ezek 20", "ex23.12", etc.).
- If the verse number/range is omitted, the default is the whole chapter.
- If the chapter number is omitted, the default is 1.
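The two defaults above can be sketched as follows. This is only an illustration of the best-guess idea, not the parser the server would use: the regular expression, the `Reference` type, and `parse_reference` are all assumptions, book names are left unresolved (no mapping from "Joh" to a canonical book), and ranges that cross chapters are not handled.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Optional leading book number, a book name, an optional chapter,
# and an optional verse or verse range after "." or ":".
REFERENCE = re.compile(
    r"^\s*((?:[1-3]\s*)?[A-Za-z]+)\s*(\d+)?(?:[.:](\d+)(?:-(\d+))?)?\s*$"
)


@dataclass
class Reference:
    book: str  # raw book string as typed; not resolved to a canonical name
    chapter: int
    verses: Optional[tuple[int, int]]  # None means the whole chapter


def parse_reference(ref: str) -> Reference:
    match = REFERENCE.match(ref)
    if match is None:
        raise ValueError(f"Unparseable reference: {ref!r}")
    book, chapter, first, last = match.groups()
    verses = None
    if first is not None:
        # A single verse is treated as a one-verse range.
        verses = (int(first), int(last or first))
    # An omitted chapter defaults to 1.
    return Reference(book, int(chapter) if chapter else 1, verses)


for example in ["1 Joh 3", "ezek 20", "ex23.12", "Jude"]:
    print(parse_reference(example))
```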
Note that because verse references can differ depending on the version, it's important to include the `versions` variable. The first version listed will be treated as the primary text (even if, for example, the versification is non-standard). If a verse is not present in the primary version, it will be omitted from the result set (e.g., if a full chapter is requested). If the `versions` variable is omitted, TODO?
Considering queries as resources is one of the most significant divergences from parabible-1. It's also not definite and probably the key element that I'm interested in thinking through in this draft.
Here are the problems we're trying to solve:
- Queries are not well captured by RESTful URLs: one would expect them to be `GET` requests but they require so much detail that the parameter would be `?q=json_garbage`. JSON garbage is bad.
- Queries are also not shareable as a result of the fact that we use `POST` to send them (we use `POST` to avoid the JSON garbage in a `GET`).
The idea to RESTify the queries is to use CRUD operations to read and create them. They would be deleted after a fixed period (e.g. 24 hours) to avoid the risk of being spammed. I would consider lengthening that period for a small subscription fee (perhaps: this may have commercial/non-commercial implications, which matters given the licensing restrictions on the data, so it would need to be investigated).
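As a thinking aid, here is what "create, read, and expire" could look like in the smallest possible form. Everything here is an assumption for illustration: the `QueryStore` class, the in-memory dict, and the lazy deletion on read. A real server would presumably keep queries in a database and purge them on a schedule.

```python
import itertools
import time
from typing import Any, Optional

QUERY_TTL_SECONDS = 24 * 60 * 60  # the "e.g. 24 hours" mentioned above


class QueryStore:
    """In-memory stand-in for wherever queries would actually be stored."""

    def __init__(self, ttl: float = QUERY_TTL_SECONDS) -> None:
        self._ttl = ttl
        self._ids = itertools.count(1)
        self._queries: dict[int, tuple[float, dict[str, Any]]] = {}

    def create(self, payload: dict[str, Any], now: Optional[float] = None) -> int:
        """Store a query payload and return its new id."""
        now = time.time() if now is None else now
        query_id = next(self._ids)
        self._queries[query_id] = (now + self._ttl, payload)
        return query_id

    def read(self, query_id: int, now: Optional[float] = None) -> Optional[dict[str, Any]]:
        """Return the payload, or None if the id is unknown or has expired."""
        now = time.time() if now is None else now
        entry = self._queries.get(query_id)
        if entry is None:
            return None
        expires_at, payload = entry
        if now >= expires_at:
            del self._queries[query_id]  # expired queries are dropped lazily
            return None
        return payload


store = QueryStore(ttl=10)
qid = store.create({"search_type": "collocation"}, now=0)
print(store.read(qid, now=5))
print(store.read(qid, now=10))
```

Because the id is all a client needs to read the query back, the id (rather than a JSON blob) is what would end up in a shareable URL.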
For an unset id, see /word above...
Creates a new query, expects a payload with the necessary parameters to run the query:
key | value |
---|---|
search_type | The type of search to run on the search_terms. |
search_terms | An array of search_term objects (see below). |
filter | Array of verse ranges that are allowed in results. This can be in the form of books: `["Genesis", "Exodus"]`, or chapter/verse reference ranges: `["Genesis 1:1-2:3", "Exodus 25-40", "Lev 19:2"]`. These references will be parsed first to create a filter that will be passed into the query. TODO: figure out how to handle weird versification edge cases... |
<params> | Additional parameters expected depending on `search_type`. |
search_type | meaning |
---|---|
collocation | This is the standard type of search that finds syntactical nodes where search terms are "collocated". The syntactical range is an expected parameter, `syntax_range`, and may be `phrase`, `clause`, `sentence`, or `verse` (note that only `verse` is supported across translation versions). |
sequential† | Find the search terms in the precise sequence in which they appear in the array. |
within_range† | Find the search terms in a given range of words. Expects the parameter `within_range` and an integer. Note that implementation-wise the integer will probably need to be capped somehow (perhaps depending on how many terms there are). Actually, I can't imagine this search being performant... |
significant_neighbors† | Using something like MI weights for analysis, find the most important collocated words. Might need to be sure that we have few enough terms or many enough results? |
† search_types marked with a dagger are only ideas. The critical type of search that I want implemented is the collocation search.
key | value |
---|---|
inverted | Boolean flag. If true, the presence of this term negates a match. |
attributes | Array of key/value pairs in the form `{ "key": "key1", "value": "value1" }, { "key": "key2", "value": "value2" }` |
{
"search_type": "collocation",
"syntax_range": "clause",
"search_terms": [{
"inverted": false,
"attributes": [
{ "key": "lexeme", "value": "קדשׁ" },
{ "key": "tense", "value": "impf" }
]
}, {
"inverted": false,
"attributes": [
{ "key": "lexeme", "value": "על" }
]
}]
}
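For what it's worth, here is a sketch of how a client might assemble that payload rather than writing the JSON by hand. The helper names `term` and `collocation_query` are invented for this example, and the validation only covers what the tables above state (the allowed `syntax_range` values); it says nothing about how the server would validate.

```python
import json
from typing import Any

# The syntax ranges listed for the collocation search type.
SYNTAX_RANGES = {"phrase", "clause", "sentence", "verse"}


def term(attributes: dict[str, str], inverted: bool = False) -> dict[str, Any]:
    """Build one search_term object from a plain attribute mapping."""
    return {
        "inverted": inverted,
        "attributes": [{"key": k, "value": v} for k, v in attributes.items()],
    }


def collocation_query(terms: list[dict[str, Any]], syntax_range: str) -> dict[str, Any]:
    """Build the payload for a collocation search."""
    if syntax_range not in SYNTAX_RANGES:
        raise ValueError(f"Unknown syntax_range: {syntax_range!r}")
    if not terms:
        raise ValueError("At least one search term is required")
    return {
        "search_type": "collocation",
        "syntax_range": syntax_range,
        "search_terms": terms,
    }


payload = collocation_query(
    [term({"lexeme": "קדשׁ", "tense": "impf"}), term({"lexeme": "על"})],
    syntax_range="clause",
)
print(json.dumps(payload, ensure_ascii=False, indent=2))
```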
The return is the new query (result) which is also returned if you query by id.
Every query is assigned an ID (maybe named queries are worth considering). The results are in the form of an array of matching text nodes. These may be paginated with a default `limit` of 500 and `offset` of 0.
A matching text node has the following form: (TODO: come up with a better name than "matching text node" because this is ambiguous)
{
"text_ids": [104582],
"matching_words": [
[3248597, 3248598], [3248598, 3248599]
],
"word_ids": [3248597, 3248598, 3248599, 3248600]
}
Note that `text_ids` is an array because some matches (viz. "sentences" in the BHS) span multiple verses. We, therefore, need to capture all relevant verses in a result. The `matching_words` array contains arrays of matching words for each search term in order. It would be normal for these arrays to have only one element. The `word_ids` array contains every `word_id` that matches the parent text node (so that we can "warm" up results with matching words highlighted and words in the same node lowlighted). We cannot include a `verse_id` because versification is version dependent and `text_id` is a parallel (version-agnostic) field.
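The highlight/lowlight distinction can be sketched like this, assuming the matching text node shape shown above. The function name `classify_words` and the `"highlight"`/`"lowlight"` labels are placeholders; how a client actually styles words is up to the client.

```python
from typing import Any


def classify_words(match: dict[str, Any]) -> dict[int, str]:
    """Label every word in the matching node as highlighted or lowlighted."""
    # A word is highlighted if it matched any search term.
    matched = {wid for group in match["matching_words"] for wid in group}
    return {
        wid: "highlight" if wid in matched else "lowlight"
        for wid in match["word_ids"]
    }


# The example matching text node from above.
node = {
    "text_ids": [104582],
    "matching_words": [[3248597, 3248598], [3248598, 3248599]],
    "word_ids": [3248597, 3248598, 3248599, 3248600],
}

labels = classify_words(node)
print(labels)
```

Words in the displayed text that appear in neither list would be left unstyled.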
The query result has the form:
{
"query_id": 1234,
"query": "__original query__",
"limit": 500,
"offset": 0,
"result_total": 1300,
"results": [{
"text_ids": [104582],
"matching_words": [
[3248597, 3248598], [3248598, 3248599]
],
"word_ids": [3248597, 3248598, 3248599, 3248600]
}, {
"text_ids": [104587],
"matching_words": [
[8293749], [8293751]
],
"word_ids": [8293749, 8293750, 8293751, 8293752, 8293753, 8293754]
}, {
"_comment": "Another 1298 times..."
}]
}
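With `result_total`, `limit`, and `offset` in the response, a client can work out which further pages to ask for. A minimal sketch, assuming the default `limit` of 500; the name `page_offsets` is made up, and how `limit` and `offset` are actually sent to the server (query string or otherwise) is not decided in this draft, so the code only computes the offsets.

```python
def page_offsets(result_total: int, limit: int = 500) -> list[int]:
    """Return the offset of every page needed to cover `result_total` results."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return list(range(0, result_total, limit))


# The example result above reports 1300 results with the default limit.
offsets = page_offsets(1300)
print(offsets)
```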
Expansion symbols may be numbered footnotes or they may be other symbols (asterisk, dagger, etc.) that provide access to additional content that is specific to a location within a particular version of the text. Examples of these are NET Notes and critical apparatus (e.g. in the SBL GNT). Expansion symbols have the following format:
<symbol id=4029482 type="note" word_ids[]="123,124,125">*</symbol>
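To see what a client would have to do with this format, here is a rough parsing sketch. It assumes exactly the attribute order and quoting shown in the example above, which is itself still undecided, so treat the regular expression and the `parse_symbols` name as throwaway illustrations.

```python
import re
from typing import Any

# Mirrors the draft format: <symbol id=... type="..." word_ids[]="...">...</symbol>
SYMBOL = re.compile(
    r'<symbol id=(\d+) type="([^"]*)" word_ids\[\]="([\d,]*)">(.*?)</symbol>'
)


def parse_symbols(text: str) -> list[dict[str, Any]]:
    """Return one dict per expansion symbol found in a text string."""
    symbols = []
    for sid, kind, word_ids, marker in SYMBOL.findall(text):
        symbols.append({
            "id": int(sid),
            "type": kind,
            "word_ids": [int(w) for w in word_ids.split(",") if w],
            "marker": marker,
        })
    return symbols


sample_text = 'Some text<symbol id=4029482 type="note" word_ids[]="123,124,125">*</symbol> continues.'
print(parse_symbols(sample_text))
```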
TODO: Think about the properties here. I like `type` because it could indicate how to handle it in the client (note|popup) but it might be unnecessary. I like `word_ids` because it allows me to associate a text-critical note with a series of actual words in the text (but maybe it should be something else that is more dom-id related so that I'm not dependent on these having word ids [like in the NET]). `id` is necessary so that we can look up the content on the backend. I like html-ifying it so that we can optionally hide them. But we need another endpoint to look up the content for these... `/note/:id`?
TODO: "Expansion Symbol" is a dumb name. It could be a footnote in the ESV. It could be a text-critical symbol in the SBLGNT. It could be an extensive comment from the NET. (expansion|information|extra-content|more|note)-(symbol|indicator|marker)...