Bugs in text splitting for DXF output #986

Open
4 of 5 tasks
vagran opened this issue Jun 18, 2024 · 2 comments
Labels: blocking, bug

Comments

@vagran

vagran commented Jun 18, 2024

dwg2dxf produces an invalid DXF file when converting DWG to DXF. Long text (usually in group 1) is split incorrectly: the continuation line has no preceding group code. In addition, Unicode code points are split across lines. Here is an example fragment of such output (sorry, I cannot share the full source file, it is proprietary):

XRECORD
  5
E331
102
{ACAD_REACTORS
330
E2CB
102
}
330
E2CB
100
AcDbXrecord
280
     1
 40
401321167
  1
{\fCalibri|b1|i0|c0|p34;\L3\fCalibri|b1|i0|c161|p34;.     \fCalibri|b1|i0|c0|p34;ΟΡΟΙ ΔΟΜΗΣΗΣ\P\fCalibri|b0|i0|c0|p34;\l \P  \fCalibri|b1|i0|c0|p34; 3.1 ΞΕΝΟΔΟΧΕΙΑΚΕΣ ΕΓΚΑΤΑΣΤΑΣΕΙΣ - ΕΚΤΟΣ ΣΧΕΔΙΟΥ\fCalibri|
b1|i0|c161|p34;\L\P\pxi-20.507,l23,t5.4931,2  # <<<< continuation without tag!
  1
3,24;\fCalibri|b0|i0|c161|p34;\l1.	ΘΕΣΗ ΓΗΠΕΔΟΥ		\fCalibri|b1|i0|c161|p34;\L\P\fCalibri|b0|i0|c161|p34;\l2.	ΝΟΜΟΙ & ΔΙΑΤΑΓΜΑΤΑ          \P\pi0,l0,tz;\P\P\P\P\P\P\pi-20.507,l23,t5.4931,23,24;3.	ΑΡΤΙΟΤΗΤΑ\P4.	ΣΥΝΤΕΛ?
?ΣΤΗΣ ΔΟΜΗΣΗΣ \fCalibri|b0|i0|c0|p34; # <<<< continuation without tag! unicode symbol broken, part is left on the  previous line.

Looking into the code, I suspect several problems:

while (len > 0)

  • The continuation group code is written only if the remaining length is greater than 255. This is probably a typo; it should check for greater than 0, like the block above it.
  • Unicode is not handled at all. A code point may be split at an arbitrary byte.
  • According to the DXF specification, the text length limit is 250 characters, not 255.
  • According to the DXF specification, if a text field (group 1) is split, every partial fragment, starting from the first, should use group 3, and the sequence should be terminated by a last fragment with group 1. There seems to be no place for such logic in the current implementation.
  • This 1024-byte limit looks very bad. Shouldn't it be lifted by allocating the buffer dynamically?
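
To make the points above concrete, here is a minimal sketch of the splitting order described in the list: group 3 for every fragment except the last, group 1 for the last, and no cut inside a UTF-8 sequence. It is not the project's code. The names `DXF_CHUNK_MAX`, `next_chunk_len` and `write_split_text` are hypothetical, and the 250 limit is applied to bytes here as a conservative assumption (a fragment of at most 250 bytes never holds more than 250 characters).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Assumed fragment limit, applied to bytes (hypothetical name). */
#define DXF_CHUNK_MAX 250

/* True if c is a UTF-8 continuation byte (bit pattern 10xxxxxx). */
static int is_utf8_cont(unsigned char c) { return (c & 0xC0) == 0x80; }

/* Length in bytes of the next fragment of s (len bytes remaining):
   at most max bytes, and never ending inside a multi-byte sequence. */
static size_t next_chunk_len(const char *s, size_t len, size_t max)
{
  size_t n;
  assert(max >= 4); /* the longest UTF-8 sequence is 4 bytes */
  if (len <= max)
    return len;
  n = max;
  /* s[n] would start the next fragment; back up while it is a
     continuation byte so the whole code point moves to that fragment. */
  while (n > 0 && is_utf8_cont((unsigned char)s[n]))
    n--;
  return n > 0 ? n : max; /* malformed input: fall back to a hard cut */
}

/* Write text as DXF group pairs: group 3 for every fragment except
   the last, then final_code (1 in the case discussed here) for the
   last one. Returns the number of pairs written. */
static int write_split_text(FILE *out, int final_code, const char *text)
{
  size_t len = strlen(text);
  int pairs = 0;
  while (len > DXF_CHUNK_MAX) {
    size_t n = next_chunk_len(text, len, DXF_CHUNK_MAX);
    fprintf(out, "%3d\n%.*s\n", 3, (int)n, text);
    text += n;
    len -= n;
    pairs++;
  }
  fprintf(out, "%3d\n%s\n", final_code, text);
  return pairs + 1;
}
```

Because the function streams fragments directly from the input string, it needs no fixed-size intermediate buffer, which also sidesteps the 1024-byte concern in this sketch.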
@rurban
Contributor

rurban commented Jun 20, 2024

Yep, you nailed it. The splitter is very naive.

@rurban rurban self-assigned this Jul 12, 2024
@rurban rurban added bug Something isn't working blocking labels Jul 12, 2024
rurban added a commit that referenced this issue Oct 4, 2024
Fixes most parts of GH #986
Remaining is proper utf8-len splitting. not 250 bytes but runes.
@rurban
Contributor

rurban commented Oct 4, 2024

Fixed 1, 3, 4 and 5 so far. Proper Unicode rune splitting apparently needs to be implemented by converting to UCS-2 and then back to UTF-8.
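
For comparison, runes can also be counted directly on the UTF-8 bytes, without a round trip through UCS-2. The sketch below is an alternative illustration, not the approach taken in the referenced commits. The helper name `utf8_prefix_bytes` is hypothetical. It assumes well-formed, NUL-terminated UTF-8 and counts code points, not UTF-16 units or grapheme clusters.

```c
#include <assert.h>
#include <stddef.h>

/* Number of bytes occupied by the first max_runes code points of the
   NUL-terminated UTF-8 string s, or by the whole string if it holds
   fewer code points than that. */
static size_t utf8_prefix_bytes(const char *s, size_t max_runes)
{
  size_t bytes = 0, runes = 0;
  assert(s != NULL);
  while (s[bytes] != '\0') {
    /* Any byte that is not a continuation byte (10xxxxxx) starts a
       new code point. */
    if (((unsigned char)s[bytes] & 0xC0) != 0x80) {
      if (runes == max_runes)
        break; /* the next code point would exceed the limit */
      runes++;
    }
    bytes++;
  }
  return bytes;
}
```

Calling `utf8_prefix_bytes(text, 250)` repeatedly on the remaining text would yield fragment lengths of at most 250 code points each, always ending on a code point boundary.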

rurban added a commit that referenced this issue Oct 4, 2024
Fixes most parts of GH #986
Remaining is proper utf8-len splitting. not 250 bytes but runes.
This needs to be done by converting overlong strings to UCS-2,
split them at 250
and then output them as UTF-8.
rurban added a commit that referenced this issue Oct 4, 2024
Fixes most parts of GH #986 (thanks to @vagran).

Remaining is proper utf8-len splitting. not 250 bytes but runes.
This needs to be done by converting overlong strings to UCS-2,
split them at 250
and then output them as UTF-8.
rurban added a commit that referenced this issue Oct 4, 2024
Fixes most parts of GH #986 (thanks to @vagran/Artyom Lebedev).

Remaining is proper utf8-len splitting. not 250 bytes but runes.
This needs to be done by converting overlong strings to UCS-2,
split them at 250
and then output them as UTF-8.
rurban added a commit that referenced this issue Oct 6, 2024
rurban added a commit that referenced this issue Oct 14, 2024
rurban added a commit that referenced this issue Oct 15, 2024