Referencing pieces of text #307
Comments
This seems to be a direct continuation of #270, which I think everybody should at least skim before commenting on the present issue. I think we can all live without machine-actionable references to gridlike milestones, which would complicate the reference system very much (see e.g. here). We never planned to number @michaelnmmeyer , what I don't understand here is the purpose for which you want to define the terms/units to use in references, such as "part" and "line". My best guess is that you are talking about generating display for encoded references, so that e.g. an encoded reference to a particular point in an edition would be displayed using these terms. Is that right? If not, or this is not the only thing you have in mind, then please clarify: why do we need a rigorous set of terms and what is wrong with e.g. referring to something displayed as "Frontal Face" using a reference involving "item A"? Are you for example talking about free-text queries that would be parsed by machine to jump to the requested point? I think that would be overkill. My comments on the rest of your post may be partly off because of the above lack of my understanding of the purpose. For textpart divs without As I said above, I don't think we should introduce numbering for p and ab elements, so there will be no referencing of such elements. For stanzas, displaying Roman numerals is fine. I'm also perfectly happy with "verse", but @arlogriffiths is likely to object to this, since by his definition the English term "verse" can also mean "one line of poetry" and rigorously uses "stanza" for "a cohesive group of lines". So perhaps use "stanza" instead - though since I don't think I have ever seen "verse" used in the sense of "line" in Indological literature, while "verse" in the meaning "stanza" is ubiquitous in the same literature (and far more common than "stanza"), we might perhaps reconsider that. In referring to verse lines, I agree that "verse line" is cumbersome. 
If we are only talking about the display of encoded references, then I think displaying nothing (just the For pagelike milestones, the solution should be the same as that for textparts, with For For uniqueness across I'm also OK with explicitly forbidding "paragraph", "verse" (or "stanza"), "page", "line" and "verse line" (if we keep it) in I agree with using commas to separate the elements of the hierarchy (except possibly for verse lines, see above), and with verbose display using "to" instead of hyphens. |
The This whole referencing thing stems from Amandine's project. She wants a way to refer to verses with hyperlinks, so that you can navigate between her table of verses and occurrences of these verses in our inscriptions. I am trying to make this more generic to allow hyperlinks not just to verses, but also to other pieces of text (divisions, lines, etc.). This requires some kind of URI-like notation. There is a standard one that would work, namely XPath, but I doubt people will want to use that. For instance, "face B, line 5" could be encoded as I believe it would be convenient if this notation looked like a "normal" reference, but this does not really matter from my side. It could as well look like |
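To give an idea of what the XPath option could look like in practice, here is a minimal sketch using Python's standard library. The sample XML, the attribute values and the path itself are assumptions made up for illustration; they are not the notation proposed in this thread, and real files would additionally require handling the TEI namespace.

```python
import xml.etree.ElementTree as ET

# Hypothetical, namespace-free sample; actual editions use the TEI namespace.
SAMPLE = """
<div type="edition">
  <div type="textpart" subtype="face" n="A"><ab><lb n="5"/>text of A5</ab></div>
  <div type="textpart" subtype="face" n="B"><ab><lb n="4"/>text of B4 <lb n="5"/>text of B5</ab></div>
</div>
"""

# One possible XPath-style path for "face B, line 5" (illustration only).
FACE_B_LINE_5 = ".//div[@type='textpart'][@subtype='face'][@n='B']//lb[@n='5']"

root = ET.fromstring(SAMPLE)
matches = root.findall(FACE_B_LINE_5)

# An empty <lb/> carries no text of its own; the text following it is its tail.
line_text = matches[0].tail if matches else None
```

The path is precise but hardly something one would want to type by hand, which is the motivation for a more natural-looking notation.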
OK, so if I understand you correctly, we're basically talking about creating a new "code language" for references, that will have to be formulated rigorously by the person who uses it, and will be parsed by the machine. If devising this system does not take up an inordinate amount of time, does not require an inordinate degree of revision in the established encoding practice and the existing files, and is likely to be used by others beyond Amandine, then I'm all right with that. But if any of these three conditions is not met, then I think we (and especially the PIs) should seriously consider whether we really need to go there instead of, as you say, using XPath, which requires extra effort and learning on the part of the person(s) who will be encoding such references, but does not affect anyone else. As for numbering paragraphs, I'm of course OK with doing so in critical editions and had not been aware that this was done there. I'm likewise OK with referring to both I still do not think we should introduce paragraph numbering for inscription editions. It is not clear to me if anyone has suggested that we do so, perhaps for Amandine to be able to refer to paragraphs from her table. If this is the case, then I think that instead of changing established projectwide practice for the sake of an individual project, it is the latter that could adapt and use what is already there, for instance line numbers, perhaps in pairs, pointing to the line in which a segment relevant to her begins, and also pointing to the line in which that segment ends. This would actually be a more accurate reference system than referring to paragraphs, since the prose passages of interest to her will in many cases be small parts of long paragraphs. I am also slightly worried that if she (or anyone) starts referring to specific parts of our editions, i.e. essentially applying standoff markup to them, then what happens if our editions change? 
What if next month I revise one of my editions by splitting a previous long paragraph into two shorter ones? What if I get access to better visual documentation than before, and realise that a poorly legible passage originally encoded as prose was in fact verse, so that stanza numbering has to change in the file from that point onward? The only solution to this problem that I can see is for Amandine to create a fork of the repositories for her references, or to include some method of versioning in those references. Neither is ideal, and we can perhaps just live with the risk that such changes may happen (after all, we do the same when referring to stanza X of an inscription in a print publication). But someone needs to consider such eventualities. |
As for paragraph numbering in diplomatic editions, I do not think anyone suggested it. I wrote a self-reminder to talk about it, but I do not remember the reason. You point out a major issue in the last paragraph. For Amandine's case, I planned to save, for each reference, a commit hash plus the verse referred to, so as to detect potential modifications in numbering. This is not great, to say the least, but the alternative is to store multiple revisions of our texts in the database, which would bring too much complication. I did that in the beginning, but abandoned the idea soon after. In any case, maintaining working hyperlinks across revisions is a lot of (manual) work. I never do that in my own projects, precisely because the extra work does not seem worth the effort. If the original text is cited, finding its location does not take much time. Maybe it would not hurt much to just refer to the inscription id? To discuss with Amandine. |
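The "commit hash plus verse" idea could be sketched roughly as follows. This is only an illustration under assumptions: the record layout, the field names and the whitespace normalization are hypothetical, and the sample identifier and commit value are made up.

```python
from dataclasses import dataclass


def normalize(text: str) -> str:
    """Collapse whitespace so that mere re-indentation is not reported as a change."""
    return " ".join(text.split())


@dataclass(frozen=True)
class StoredReference:
    inscription_id: str  # hypothetical identifier of the edited text
    reference: str       # e.g. "verse II"
    commit: str          # commit at which the reference was recorded
    text: str            # normalized text the reference pointed to at that commit


def make_reference(inscription_id: str, reference: str, commit: str, text: str) -> StoredReference:
    return StoredReference(inscription_id, reference, commit, normalize(text))


def is_stale(stored: StoredReference, current_text: str) -> bool:
    """True if the text now found under the same reference differs from the stored one."""
    return stored.text != normalize(current_text)


ref = make_reference("INSExample00001", "verse II", "abc1234", "first line\n  second line")
unchanged = is_stale(ref, "first line second line")        # False: only whitespace differs
renumbered = is_stale(ref, "text of a different stanza")   # True: numbering has shifted
```

This only detects that something changed; it does not repair the link, which would remain manual work.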
I concur with Daniel. |
sorry, overwhelmingly busy. I concur with Manu's response and ask for patience so I can try to respond in due course to matters on which an explicit reaction has been requested from me. |
I think this discussion hasn't been quite sorted out. At any rate, I have one additional observation here: the above statements about the need to make page, pagelike milestone and line numbers unique within any division need to be qualified. When textpart divs are present, the uniqueness is a requirement within each textpart, but not within the div containing the textparts. I noticed when I opened https://dharmalekha.info/texts/INSVengiCalukya00091 that I now get a warning that pb and pagelike milestone elements do not have unique @n within this division (the edition div). The structure of this text is as follows:
The identical pb numbers in textparts B and C need to be permitted, since they are in fact the very same pages (on which traces of an earlier inscription can be made out). But identical page or milestone numbering in textparts may be desirable in other cases too, so uniqueness must be enforced on the textpart level only. |
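A check that enforces uniqueness on the textpart level only might look like the sketch below. It is not the validation code actually used for the site: the function name, the namespace-free sample and the decision to key duplicates by unit plus @n are assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def is_textpart(elem):
    return elem.tag == "div" and elem.get("type") == "textpart"


def own_pages(scope):
    """Collect "unit n" labels of <pb> and pagelike milestones in this scope,
    without descending into nested textpart divisions."""
    found = []
    for child in scope:
        if is_textpart(child):
            continue  # a nested textpart is a uniqueness scope of its own
        if child.tag == "pb":
            found.append(f"page {child.get('n')}")
        elif child.tag == "milestone" and child.get("type") == "pagelike":
            found.append(f"{child.get('unit')} {child.get('n')}")
        found.extend(own_pages(child))
    return found


def duplicate_pages(edition):
    """Map each scope (edition div or textpart) to its duplicated labels."""
    scopes = [edition] + [d for d in edition.iter("div") if is_textpart(d)]
    report = {}
    for scope in scopes:
        counts = Counter(own_pages(scope))
        duplicates = sorted(label for label, count in counts.items() if count > 1)
        if duplicates:
            report[scope.get("n") or scope.get("type")] = duplicates
    return report


SAMPLE = ET.fromstring("""
<div type="edition">
  <div type="textpart" n="B"><ab><pb n="1r"/>x<pb n="1v"/>y</ab></div>
  <div type="textpart" n="C"><ab><pb n="1r"/>x<pb n="1v"/>y</ab></div>
  <div type="textpart" n="D"><ab><pb n="2r"/>x<pb n="2r"/>y</ab></div>
</div>
""")
report = duplicate_pages(SAMPLE)
```

With this sample, the identical page numbers in textparts B and C are accepted, while the repeated number inside textpart D is reported.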
This was intentional, addressed in 4c29818. |
Did you mean unintentional? Anyway, I'm still getting the same error message on 00091. |
I will have to write something more complicated. I expected |
|
yes, we certainly have such cases in our corpus. here's an example, though I am not sure the lost medial plates have been correctly encoded: https://dharmalekha.info/texts/INSIDENKTerep_II. |
This is a discussion of the referencing issue I alluded to in our mail exchanges about the new release of the EGD.
The purpose is to define a machine-actionable reference system for pieces of text: verses, lines, etc. Given a reference in some defined format (e.g. "line 5" or "face A, line 5 to line 6"), the machine should be able to locate the corresponding text in the XML file, and, optionally, to extract it.
I assume we only need this reference feature for the edition division. In this context, the following elements might bear a @n:
Referring to gridlike <milestone>s does not seem useful, thus I ignore this case. The EGD does not prescribe numbering <ab> and <p>, but I include them in the discussion nonetheless, in case people need that; currently, we have fewer than a dozen cases where they bear a @n.
We need a notation that can be processed by a machine, but I assume we want it to look fairly natural nonetheless (as in the examples above: "face B, line 5 to line 7", etc.). Since the format of @n is not restricted ("A" can represent a textpart division or a milestone, for instance), it is necessary to make each unit explicit, as in "face A, line 1" instead of "A1", "line A1", etc.
Let us review each element and describe what references to it would look like. The difficulty is to find unit names that look natural enough but that also unambiguously refer to a given XML element with a given set of attributes.
<div type="textpart">
If the <div> has a @subtype, use it as unit. Otherwise, use a unit named "part". Thus:
Now, textpart divisions might have a heading (declared with <head>), which is supposed to be displayed instead of the one that would have been generated otherwise. Thus, ... would result in "Frontal Face" instead of "Item A". Since "Item A" is not displayed, it is not possible for the reader to tell, without looking at the XML, what corresponds to the reference "item A".
To address this, we can either display "item A" in some way, or use the heading as reference, and generate references like "Frontal Face, line 1", etc. The first solution seems preferable to me, because, for the second to work, headings need to be unique, and this is not prescribed by the EGD.
<p> and <ab>
Use "paragraph" as unit:
This assumes that p/@n and ab/@n are unique among all <p> and <ab> elements in a given division.
<lg>
The @n of verses is displayed in Roman numerals, thus it seems natural to do the same in references. We would have:
<l>
We cannot use "line" as unit, because this would better fit <lb>. We cannot use "pāda" either, since many <l> represent hemistiches. We could use "verse line" as in:
This is verbose, however.
<milestone type="pagelike"> and <pb>
For milestones, use the provided @unit as unit; for <pb>, use the unit "page". Thus:
This assumes that <pb> is equivalent to <milestone type="pagelike" unit="page">.
<lb>
Use "line" as unit.
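The per-element rules reviewed above can be summarized in a short sketch. The function names are hypothetical, the sample elements are namespace-free for brevity, and the Roman-numeral rendering for <lg> assumes that @n is a plain positive integer.

```python
import xml.etree.ElementTree as ET

# Units that do not depend on attributes, as listed in the sections above.
FIXED_UNITS = {"p": "paragraph", "ab": "paragraph", "lg": "verse",
               "l": "verse line", "pb": "page", "lb": "line"}

ROMAN = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"), (100, "C"), (90, "XC"),
         (50, "L"), (40, "XL"), (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]


def to_roman(number):
    """Convert a positive integer to a Roman numeral."""
    parts = []
    for value, numeral in ROMAN:
        count, number = divmod(number, value)
        parts.append(numeral * count)
    return "".join(parts)


def reference_unit(elem):
    """Return the unit name for an element, or None if it is not referenceable."""
    if elem.tag == "div" and elem.get("type") == "textpart":
        return elem.get("subtype") or "part"
    if elem.tag == "milestone":
        # Gridlike milestones are deliberately ignored.
        return elem.get("unit") if elem.get("type") == "pagelike" else None
    return FIXED_UNITS.get(elem.tag)


def reference_label(elem):
    """Return e.g. "face A" or "verse IV", or None if unit or @n is missing."""
    unit, n = reference_unit(elem), elem.get("n")
    if unit is None or n is None:
        return None
    if elem.tag == "lg" and n.isdigit() and int(n) > 0:
        n = to_roman(int(n))
    return f"{unit} {n}"


face = reference_label(ET.fromstring('<div type="textpart" subtype="face" n="A"/>'))
verse = reference_label(ET.fromstring('<lg n="4"/>'))
```

Here `face` is "face A" and `verse` is "verse IV"; a textpart without @subtype falls back to "part".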
Conclusion
For the above to work, we need to make sure that the values of div/@subtype and milestone/@unit do not coincide in the same inscription; otherwise, the reference would be ambiguous. This is not enforced currently.
Likewise, div/@subtype and milestone/@unit should not have a value that is used as unit elsewhere; more specifically, they should not have the value "paragraph", "verse", "page", "line", "verse line".
For representing hierarchies, I propose we use a comma (e.g. "part 5, line 2"), as in normal references. For representing ranges, however, I do not think we can get away with using a dash (e.g. "line 2-5"), because this might mess with the format of @n. I thus suggest we use the explicit format "$unit $n1 to $unit $n2", as in "line 2 to line 5". I think this can be parsed without ambiguity.
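To check that the comma and "to" conventions can indeed be parsed without ambiguity, here is a minimal parser sketch. It rests on assumptions not settled in this issue: that a @n value contains neither whitespace nor commas, that no unit name contains " to ", and that the end of a range may omit the leading levels it shares with the start (as in "face A, line 5 to line 6"). The function names are hypothetical.

```python
from typing import List, Optional, Tuple

Step = Tuple[str, str]  # (unit, n), e.g. ("verse line", "2")


def parse_point(text: str) -> List[Step]:
    """Parse "unit n, unit n, ..." into a list of (unit, n) steps."""
    steps = []
    for chunk in text.split(","):
        chunk = chunk.strip()
        # The last space separates a possibly multi-word unit from @n.
        unit, sep, n = chunk.rpartition(" ")
        if not sep or not unit.strip():
            raise ValueError(f"expected '<unit> <n>', got {chunk!r}")
        steps.append((unit.strip(), n))
    return steps


def parse_reference(text: str) -> Tuple[List[Step], Optional[List[Step]]]:
    """Parse a single point or a range "point to point" into (start, end)."""
    start_text, sep, end_text = text.partition(" to ")
    start = parse_point(start_text)
    if not sep:
        return start, None
    end = parse_point(end_text)
    if len(end) < len(start):
        # Inherit the omitted leading levels from the start of the range.
        end = start[: len(start) - len(end)] + end
    return start, end


single = parse_reference("line 5")
ranged = parse_reference("face A, line 5 to line 6")
```

A bare "A1" is rejected with a ValueError, which matches the requirement that each unit be made explicit.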