<citedRange unit="entry"><foreign>...</foreign></citedRange> #310

Open
arlogriffiths opened this issue May 23, 2024 · 14 comments

@arlogriffiths
Collaborator

  • code: <bibl><ptr target="bib:Goris1954_01"/><citedRange unit="volume">2</citedRange><citedRange unit="page">319</citedRange><citedRange unit="entry"><foreign>tanggung</foreign></citedRange></bibl>

  • display: [screenshot: Capture d’écran 2024-05-23 à 09 06 29]

  • issue: The use of <foreign> in the above context does not yet have the desired effect in display.

@arlogriffiths added the "invalid" (This doesn't seem right) label on May 23, 2024
@michaelnmmeyer
Member

I would rather not allow the use of XML elements except for <citedRange unit="mixed">, because I am doing pattern matching on the contents of citedRange (for replacing "-" with en dashes, for detecting whether several items are cited, etc.). Doing that on an XML tree is a real mess, and we would not gain much from it.

So, if @danbalogh is OK with that, I would suggest we only allow plain text in <citedRange>, except when @unit is mixed. In the latter case, the element's contents are left unchanged, so there is no issue.
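
To illustrate the proposed restriction, here is a minimal sketch in Python of a check that flags any <citedRange> containing child elements when @unit is not "mixed". It is only an illustration, not the project's actual code: the function name is hypothetical, and it assumes un-namespaced elements, whereas real TEI files declare a namespace.

```python
import xml.etree.ElementTree as ET
from typing import List


def cited_ranges_with_markup(bibl_xml: str) -> List[str]:
    """Return the @unit of every <citedRange> that contains child
    elements although its @unit is not "mixed" (hypothetical helper)."""
    root = ET.fromstring(bibl_xml)
    offenders = []
    for cited_range in root.iter("citedRange"):
        has_children = len(cited_range) > 0
        if has_children and cited_range.get("unit") != "mixed":
            offenders.append(cited_range.get("unit", ""))
    return offenders
```

Applied to the encoding in the original report, this would return `["entry"]`; a <citedRange unit="mixed"> holding a <foreign> element would not be flagged.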

@arlogriffiths
Collaborator Author

So it seems you are suggesting that I should encode as follows:

<bibl><ptr target="bib:Goris1954_01"/><citedRange unit="mixed">vol. 2, p. 319, s.v. <foreign>tanggung</foreign></citedRange></bibl>

That could work, though I'd prefer a solution that doesn't force me to diverge from the usual pattern only because I want to see italics.

I imagine that people will only want to see italics in cases where the @unit of <citedRange> is "entry". Does this limitation help at all to keep the trouble with pattern searching in check?

@michaelnmmeyer
Member

@arlogriffiths @danbalogh @manufrancis

This can be made to work. But we first need to decide what the criteria are for determining whether the contents of citedRange refer to a single item or to many, so that the proper form of @unit (singular or plural) is displayed. The current solution does not really work.

We should at least have a way to specify unambiguously whether there is a single item or many. To remove ambiguities, I propose that we use the plural form of @unit ("pages", "entries", etc.) to indicate that there are many items, and the singular one ("page", "entry", etc.) only when there is a single item. This requires transformations on the existing files, which can be automated.

@danbalogh
Collaborator

danbalogh commented May 27, 2024

My number one comment on this, you can probably guess, is that this is a fine detail to which we should not devote a lot of time and effort.

Number two. @michaelnmmeyer, I'm completely OK with not permitting XML elements within citedRange at all; I'm also OK with permitting them in <citedRange unit="mixed">. It has in fact never occurred to me to use any further elements within this one.

Three. Why not make the display of <citedRange unit="entry"> italic by default? That would give the display Arlo wants without the need to use an XML element for formatting, and without the need to switch to mixed unit. Are there any circumstances where the italic display of an "entry" is so undesirable as to rule this solution out?

Four. I don't know in what way the current solution for determining singular/plural unit does not work. I don't recall the details, but if it doesn't work as expected, couldn't the display transformation be tweaked further? I would very much dislike a further complication of our already hellishly complicated reference encoding with the introduction of units like "pages" etc. In addition to the increased (practically doubled) complexity, I have the following concerns with Michaël's [edited] note that conversion to this in existing files can be automated. The smaller one: OK, so conversion can be automated, and we or Michaël make the change in all existing files on date X. Can we realistically expect all encoders to switch to the new system consistently from date X onward, or would the auto-conversion have to be repeated regularly? The bigger one: if conversion can indeed be automated, then why can't the same algorithm that makes the conversion in the files be used in the display transformation to achieve the desired display without altering and complicating the code?

@michaelnmmeyer
Member

@danbalogh

For 3). We have a few cases where several entries are given (as in <foreign>word1</foreign> and <foreign>word2</foreign>); italics are not desirable in these cases. There are about a dozen <citedRange unit="entry"> elements that contain <foreign> elements. Plain text is used everywhere else (except for a single instance in one of Manu's inscriptions).

For 4). The problem is that the format of references is unrestricted, but that the app is still supposed to guess whether they refer to a single item or to multiple ones, and thus often produces "wrong" results. There is no way to fix this besides encoding the reference explicitly with @unit="mixed", but people (Manu and Arlo so far) apparently do not want to do that.

@danbalogh
Collaborator

Anything you and the PIs can agree on will be acceptable for me, so there is no need to go along with my wishes here. However,
For 3) fair enough, italics are not desirable for "and" between entries. But I'm not at all sure that "and" is desirable between entries; if there are only a few cases of this, then I think the straightforward solution would be to replace the "and" in those entries with a comma. It would not be a problem if the comma were displayed in italics, and at the same time, the presence of the comma would be a flag for the algorithm that this is a plural. Although in the EGD we had written that the contents of <citedRange unit="entry"> will not be italicised by default, I now think that it would actually be better to do so for consistent display, and to explicitly forbid using anything in the contents of that element other than the actual entries and commas.

For 4) you have not answered my main concern: how can it be possible to automate replacing the value of @unit with plurals in the code, if it is not possible to automate doing so in display? Apart from that, I of course understand that the reason the display of plurals doesn't work is that the format of the references is too lax. What I do not know is precisely what laxities prevent this from happening. My feeling is that inconsistent laxities should be eliminated from our encoding, while consistent laxities should be formulated as supplementary rules for determining when a plural is needed. As best I recall, the main problem was that appendix names could include hyphens and perhaps commas as well. In my opinion, before you decide to further complicate the entire already complex system of reference encoding for the sake of meticulous display in the case of a small minority (1%? 5%?) of all our references, we should be clear on the exact cases where the PIs don't want to use @unit="mixed" and see if those cases could be catered for.

But I repeat: anything is acceptable to me.

@michaelnmmeyer
Member

For 4), manual corrections would indeed be needed.

I would rather simplify the current encoding than complicate it. My position is that we would be much happier if we just abandoned all the citedRange and @unit stuff, and used plain references like

<bibl><ptr target="bib:Goris1954_01"/>, vol. 2, p. 319</bibl>

everywhere.

@danbalogh
Collaborator

Since I don't think we ever want to make those references machine-actionable to the level of <citedRange>, I think your suggestion should be seriously considered. It is certainly acceptable to me. I think the main reason why Arlo and I introduced <citedRange> and the various units to begin with was that this would enforce some level of consistency (in reference structure and display) across the project. But given that we now have a proliferation of units and, apparently, a number of people dissatisfied with what can be done within the system as well as some "hacked" usage to achieve citations for which the system was not intended, we may indeed be better off abandoning all the complication. Or reducing it greatly, e.g. keeping two permitted values of @unit, namely "page" (to which the existing citedRanges without unit would be converted) and "free" or "mixed", to cover everything else. Anyway, that is for the PIs to decide. If we made this leap, I'm quite sure that a fair amount of manual checking and revision would be necessary to convert the existing encoded references to this system, even though much of the conversion could be handled automatically using the existing algorithm for display transformation to hard-code the citation into the "free" part. But if any other solution also needs manual revision, then this may not be too bad.

@arlogriffiths
Collaborator Author

Indeed, the main reason why we introduced <citedRange> and the various units was that this would enforce some level of consistency (in reference structure and display) across the project. I still believe this is important, given the very broad range of bibliographic cultures active in our project and the equally broad range of diligence in matters bibliographic on the part of our team members. I don't think we should let the (in my impression rather minor) rough edges of the system that we have in place lead us to any radical revision.

I don't understand what people could be dissatisfied about now that we have the option @unit="mixed", which gives complete freedom, doesn't it? Any "hacked" usage is probably due to people being unaware of the option @unit="mixed".

I am flexible about any mix of the variables presented so far, as long as we leave the basics of the present system intact.

Notably, I am willing to play along with the proposal to introduce explicit encoding of plural in the values of @unit and the partially automated, partially manual path to implementing the change that Michaël proposes. But I am also able to accept sticking to singular values only in exchange for some loss of flexibility elsewhere in order for the machine to be able to tell whether sg. or pl. is intended.

@michaelnmmeyer
Member

To be noted that Dan's proposal is close to LaTeX's behaviour: you have a special case for citing pages (e.g. \cite[43-45]{MyBook}, which produces "MyBook, pp. 43–45"), but everything else (volumes, etc.), has to be encoded manually. About one half of our citedRanges are basic page numbers (ranges or sequences of digits).
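
As a rough illustration of what "basic page numbers" could mean in practice, here is a minimal Python sketch that tests whether a citedRange value consists only of digits, ranges and comma-separated sequences. The pattern and the function name are assumptions made for this example, not the app's actual logic.

```python
import re

# One page number, optionally extended to a range with a hyphen or en dash,
# optionally followed by further comma-separated numbers or ranges.
_BASIC_PAGES = re.compile(
    r"\d+(?:\s*[-\u2013]\s*\d+)?(?:\s*,\s*\d+(?:\s*[-\u2013]\s*\d+)?)*"
)


def is_basic_page_reference(text: str) -> bool:
    """Return True if `text` is only digits, ranges and sequences of digits
    (hypothetical helper)."""
    return _BASIC_PAGES.fullmatch(text.strip()) is not None
```

Values such as `319`, `43-45` or `12, 15-17` would count as basic page references under this pattern, whereas `A/1962-63` or a headword would not.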

@arlogriffiths
Collaborator Author

arlogriffiths commented May 27, 2024

Even if it's only 50% of our references that we're talking about, I insist that we need a structuring mechanism such as the one we have in place.

@danbalogh
Collaborator

Fair enough, let's forget about discarding the existing units. This takes us back to the point where we need solutions for the following details:

  1. Correct display, wherever feasible, of plural units; and
  2. the original issue: italic display for headwords when the unit is "entry", preferably without using XML elements within <citedRange>.

Anything I missed?

For 1, my preferred solution would be to stick to the present units, and let the display transformation algorithm take care of plural display. Since this does not work perfectly in all circumstances, we need information about the cases where it does not (or is not expected to) work correctly, and assess whether any of those cases are systematic. For the systematic cases, it may be possible to add sub-rules for the transformation algorithm. For the non-systematic cases, we would then have to change the problematic citations to @unit="mixed", or live with the inaccurate display.

For 2, I think the best solution is to prescribe that <citedRange> must never contain any further XML elements (contrary to the earlier permission to use <foreign> in entries), except when @unit="mixed", where certain elements (only <foreign>? or also something else?) would be permitted. Next, always display the contents of <citedRange unit="entry"> in italics. And finally, instruct encoders not to put anything within <citedRange unit="entry"> other than headwords and, where applicable, commas; or, where a more complex citation is needed, to use @unit="mixed" instead.

In addition, regarding what I anticipate to be systematic cases in 1, I think it would make sense to prescribe in the EGD and EGC that the contents of <citedRange> with a @unit other than "mixed" must never include a comma or a hyphen unless a plural display is desired. With this rule, any reference where the thing itself contains a hyphen or comma would thus have to be encoded as @unit="mixed", with the singular or plural form written by hand as applicable. One (slightly more complex) alternative to this would be to stipulate the above rule (no hyphens or commas unless plural is intended) only for @unit="page", and for any other unit (i.e. other than "mixed" or "page"), stipulate that hyphens will not result in plural display, while commas will. My impression is that there exist a number of appendices, figures, plates, etc. with hyphens in their numbers, but very few or practically none with commas in their numbers; and conversely, that we may sometimes want to refer to several appendices, figures, plates, etc., but only very rarely want to refer to ranges of these units. With this setup, @unit="mixed" would have to be used for page numbers containing a hyphen, page numbers containing a comma, non-page numbers containing a comma, and ranges of non-page numbers. Since this is already getting a bit complex, the introduction of special plural units can also remain on the table.
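
The alternative rule described above (hyphen or comma means plural for pages; only a comma means plural for other units) could be sketched roughly as follows in Python. This is an illustration under stated assumptions only: the function name is hypothetical, and returning `None` for "mixed" simply stands for "the encoder writes the singular or plural by hand".

```python
from typing import Optional


def wants_plural(unit: str, text: str) -> Optional[bool]:
    """Decide whether the unit label should be displayed in the plural.

    Returns None for unit="mixed", where nothing is inferred.
    """
    if unit == "mixed":
        return None
    if unit == "page":
        # Pages: both ranges (hyphen) and sequences (comma) are plural.
        return "-" in text or "," in text
    # Other units: hyphens are taken to be part of the number itself,
    # so only a comma signals several items.
    return "," in text
```

Under this sketch, `wants_plural("page", "43-45")` is true, while `wants_plural("appendix", "A/1962-63")` is false and `wants_plural("figure", "3, 5")` is true.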

@michaelnmmeyer
Member

I think it is important to make transformation rules simple enough and easy to remember, so that people can predict what the output will look like and so that they have a chance to remember them. Perhaps more importantly, they should not change (even for "improvements"), because this would inevitably introduce mistakes in existing entries.

So, I propose to stick to the core of Dan's comment. We would have:

  1. A citedRange that contains a comma or a hyphen (or some other dash) is considered to refer to several items, otherwise to a single one.
  2. <citedRange unit="entry"> is rendered in italics, as if it were wrapped in <foreign>.
  3. Hyphens are replaced with en dashes in <citedRange unit="page">. (I do not like this special case, but doing that everywhere might create problems.)
  4. If the displayed result is incorrect, or if some special format is needed, <citedRange unit="mixed"> should be used. XML elements are allowed only for @unit="mixed".
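
The four rules above can be sketched in Python as follows. This is a minimal illustration, not the app's actual transformation: the label table (including plural forms such as "s.vv." and "vols."), the `<i>` output markup and the function name are all assumptions made for the example.

```python
import re

# Hypothetical display labels: (singular, plural) per @unit.
LABELS = {
    "page": ("p.", "pp."),
    "volume": ("vol.", "vols."),
    "entry": ("s.v.", "s.vv."),
    "appendix": ("appendix", "appendices"),
}

# Rule 1: a comma, a hyphen or another dash (U+2010 to U+2015) means "several".
SEVERAL = re.compile(r"[,\-\u2010-\u2015]")


def render_cited_range(unit: str, text: str) -> str:
    """Render the plain-text contents of one <citedRange> for display."""
    if unit == "mixed":
        return text  # Rule 4: contents are left unchanged.
    several = SEVERAL.search(text) is not None  # Rule 1
    if unit == "page":
        text = text.replace("-", "\u2013")  # Rule 3: en dash in page ranges.
    if unit == "entry":
        text = "<i>" + text + "</i>"  # Rule 2: italics for headwords.
    singular, plural = LABELS.get(unit, (unit, unit + "s"))
    return (plural if several else singular) + " " + text
```

With these assumed labels, `render_cited_range("page", "43-45")` gives "pp. 43–45" and `render_cited_range("entry", "tanggung")` gives "s.v. <i>tanggung</i>". Note that under rule 1 a value containing a hyphen as part of its name, such as `A/1962-63`, is treated as several items.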

@danbalogh
Collaborator

All of this is acceptable to me, provided that the PIs are happy with it. The one thing that worries me is that, at least for the Indian subcontinent, there is a huge number of references to ARIE appendices for which we had specifically required the format <bibl><ptr target="bib:ARIE1962-1963"/><citedRange unit="page">49</citedRange><citedRange unit="appendix">A/1962-63</citedRange><citedRange unit="item">19</citedRange></bibl> (EGD Example 10.4.5.F). There may be similar cases (i.e. a citation type that is both numerous and includes a hyphen) in other corpora as well.
If we stick to the above, then we'll need a solution for these. Ideally, I would prefer if they did not have to be changed to @unit="mixed", because consistency is very difficult to maintain that way.
@michaelnmmeyer, would it be possible to A) auto-replace all hyphens with en-dashes in any <citedRange unit="appendix"> whose sibling <ptr> has a @target starting with bib:ARIE, and B) make sure that the algorithm for identifying plurals is sensitive only to hyphens, and not to en-dashes?
If this solution is feasible, then we could also instruct encoders that in any context, if in the future they want a hyphen in <citedRange> other than mixed, but they don't want it displayed with a plural unit, then they can use an en-dash in place of the hyphen.
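
Steps A and B could look roughly like this in Python. This is a sketch under stated assumptions, not a tested conversion script: the function names are hypothetical, elements are assumed to be un-namespaced, and treating a comma as a plural marker in step B is carried over from the rule discussed earlier in this thread.

```python
import xml.etree.ElementTree as ET

EN_DASH = "\u2013"


def normalise_arie_appendix(bibl_xml: str) -> str:
    """Step A: in a <bibl> whose <ptr> targets bib:ARIE..., replace hyphens
    with en dashes inside <citedRange unit="appendix">."""
    bibl = ET.fromstring(bibl_xml)
    ptr = bibl.find("ptr")
    if ptr is not None and ptr.get("target", "").startswith("bib:ARIE"):
        for cited_range in bibl.findall("citedRange"):
            if cited_range.get("unit") == "appendix" and cited_range.text:
                cited_range.text = cited_range.text.replace("-", EN_DASH)
    return ET.tostring(bibl, encoding="unicode")


def refers_to_several(text: str) -> bool:
    """Step B: only a hyphen (or a comma) counts as a plural marker;
    an en dash does not."""
    return "-" in text or "," in text
```

After step A, the appendix in EGD Example 10.4.5.F would read `A/1962–63` with an en dash, which `refers_to_several` no longer treats as a plural, while the hyphen in the @target of <ptr> is left untouched.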
