kosha texts mangled upon export #503

Open
vvasuki opened this issue Apr 6, 2023 · 3 comments

Comments

@vvasuki
Contributor

vvasuki commented Apr 6, 2023

Observe how the proofreader has marked headwords in this dict:

image

Neither the txt nor the TEI-XML dump has that. Instead we get an amorphous blob like:

<p>अद्वैतम्, अद्वैत द्वाभ्यां प्रकाराभ्यामितं ज्ञानं द्वीतम्, द्वीतमेव द्वैतं न द्वैतम् अद्वैतम् । तच्च द्वैताभावोपलक्षितं ब्रह्म । यथा- द्विधेतं द्वीतमित्याहुस्तद्भावो द्वैतमुच्यते (बृ० उप० भा० वा० ४।३।१८०७) । न विद्यते द्वैतं द्विधाभावो यत्र तत् । (सि० बि० १०) यथा च स्थूलादिवाक्यैः परिहृतद्वैतप्रपञ्चम् । (सं० शा० १ । २६६ सु० टी०) यथा च - तुच्छत्वम् असत्त्वम् अनृतत्वम् आद्यसहनवपुः सत्यज्ञानानन्तानन्दसद्रूपं परमं ...

cc @suhasm

@shreevatsa
Contributor

AFAIK line breaks (without an empty line in between) are not considered significant, and the proofreader instructions (probably/hopefully) also say that. So the behaviour (line breaks ignored, and everything wrapped into a paragraph) is kind of expected. (Consider: we don't want the other line breaks to be reflected in the text.)

So IMO the bug here is not quite “kosha texts mangled upon export” but rather something like “a particular proofreader came up with an ad-hoc convention in which the first line break in each paragraph is significant, because it indicates the headword in this text, and the rest of the line breaks are not, and the backend was not aware of this convention”.
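
To make that convention concrete, here is a minimal sketch of how a backend could interpret it. Everything in it is an assumption drawn from this thread rather than existing project code: the function name `split_entries` is hypothetical, entries are assumed to be separated by blank lines, and the first line of each entry is assumed to hold the headword(s).

```python
import re


def split_entries(raw: str) -> list[tuple[str, str]]:
    """Split proofread text into (headword_line, body) pairs.

    Assumed convention (hypothetical, inferred from this thread):
    blank lines separate entries, the first line of each entry is the
    headword line, and all remaining line breaks are insignificant.
    """
    entries = []
    for block in re.split(r"\n\s*\n", raw.strip()):
        # Drop empty lines and surrounding whitespace inside the block.
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        if not lines:
            continue
        headword = lines[0]
        # Later line breaks are joined with spaces, as in a normal paragraph.
        body = " ".join(lines[1:])
        entries.append((headword, body))
    return entries
```

With this reading, only the first line break of each block carries meaning; an entry that consists of a single line simply gets an empty body.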

But going even further, rather than saying this is user error / blaming it on the user, IMO the fix is for the proofreader UI to have a “rendered” (“preview”) mode (maybe even shown by default), so that any proofreader, as they work through a text, will always see the effect of their conventions: how the text they prepared will be seen by readers eventually.

(Of course I think of this issue as another point in favour of the rich-editor / ProseMirror idea :P, though I admit that this could be hacked around even without it….)

@vvasuki
Contributor Author

vvasuki commented Apr 6, 2023

> So IMO the bug here is not quite “kosha texts mangled upon export” but rather something like “a particular proofreader came up with an ad-hoc convention in which the first line break in each paragraph is significant, because it indicates the headword in this text, and the rest of the line breaks are not, and the backend was not aware of this convention”.

I bet that the proofreader just followed @suhasm's wise instruction (given his experience with similar dict files and the need for headwords without vibhakti-pratyaya)!

So, the real fix would be to come up with better support (conventions and markup) for dictionary books given the importance @suhasm (rightly) gives to domain specific dicts.

@shreevatsa
Contributor

Right, that too :)

The issues I see:

  1. We don't already have a convention for these headwords (either markup or, my preference, editor support),
  2. The convention that has been developed/used for this text is not interpreted by the backend (this we can probably manually override for this text),
  3. It is not obvious that problem (2) (or (1)) exists until much later, when someone tries to export. IMO the issue should show up while proofreading itself, which is what I suggested with the preview / rich editor.
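
For point 2, a rough sketch of what a per-text override could emit once a headword and its body are known. This is only an illustration, not the project's export code: `entry_to_tei` is a hypothetical helper, and the choice of the TEI dictionary elements `entry`, `form`, `orth`, `sense` and `def` is an assumption about what the export should look like.

```python
import xml.etree.ElementTree as ET


def entry_to_tei(headword: str, body: str) -> str:
    """Serialize one dictionary entry as a TEI-style <entry> element.

    Hypothetical helper: the element layout is an assumed target,
    not what the current export produces.
    """
    entry = ET.Element("entry")
    form = ET.SubElement(entry, "form")
    orth = ET.SubElement(form, "orth")
    orth.text = headword  # the headword line, kept separate from the gloss
    sense = ET.SubElement(entry, "sense")
    definition = ET.SubElement(sense, "def")
    definition.text = body  # the rest of the entry as running text
    return ET.tostring(entry, encoding="unicode")
```

The same (headword, body) pair could just as well drive a preview in the proofreading UI, which would surface problem (3) while the text is being proofread rather than at export time.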
