Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rework iso9660 view action #199

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

slowpeek
Copy link
Contributor

@slowpeek slowpeek commented May 27, 2024

  • use xorriso -> isoinfo -> 7z fallback chain
  • ignore the Joliet tree with isoinfo
  • improve error reporting
  • dev notes: src/vfs/extfs/helpers/README.iso9660

Sample iso images to try it on:

mkdir utf8
for x in latin cyrillic-{абв,а,б,в}; do echo "contents of $x.txt" > utf8/"$x".txt; done

# rock ridge
xorriso -as mkisofs -r -o utf8-r.iso utf8
# joliet
xorriso -rockridge off -joliet on -as mkisofs -r -o utf8-j.iso utf8
# both
xorriso -joliet on -as mkisofs -r -o utf8-rj.iso utf8
# none
xorriso -rockridge off -as mkisofs -r -o ecma.iso utf8

Example how to inspect those for Rock/Joliet support:

> isoinfo -d -i utf8-r.iso | grep -E '^(NO )?(Rock|Joliet)'
NO Joliet present
Rock Ridge signatures version 1 found

@zyv
Copy link
Member

zyv commented May 27, 2024

Apparently this would fix https://midnight-commander.org/ticket/2851

@zyv
Copy link
Member

zyv commented May 27, 2024

Oh, I wanted to ask for tests again, but I realized that this is only for view action, and there is not even a framework for it at the moment, if I'm correct...

Also, it means that the linked ticket is only related, but not exactly the same issue. Can you confirm that the copy-out works though, or if not also have a look at fixing it?

@zyv
Copy link
Member

zyv commented May 27, 2024

Also this one is indirectly related: https://midnight-commander.org/ticket/4488

@slowpeek
Copy link
Contributor Author

Dont yet proceed on merge/review, I'm writing some big text on isoinfo to post here

@slowpeek
Copy link
Contributor Author

slowpeek commented May 28, 2024

After looking deeper into the whole isoinfo stuff, I've found something bad.

Below I'll use a sample Rock Ridge+Joliet utf8-rj.iso image created like this (the terminal is utf8):

mkdir utf8
for x in abcde абвгд; do echo "contents of $x.txt" > utf8/"$x".txt; done
xorriso -joliet on -as mkisofs -o utf8-rj.iso utf8

Rock Ridge doesnt feature a "charset" concept for filenames. By default iso9660 tools print the names as-is and it is not a big problem these days, since most likely the names are utf-8 encoded and the terminals are utf-8 as well. xorriso since 2009 supports -auto_charset option to save/load the charset from the isofs.cs xattr on the root dir. It is likely a xorriso-only thing. Also, there is -in_charset option to set the source charset when opening an existing iso.

isoinfo is a simple tool, it always prints RR names raw, which is fine:

> isoinfo -i utf8-rj.iso -l -R

Directory listing of /
drwxr-xr-x   1 1750 1750            2048 May 28 2024 [     19 02]  .
drwxr-xr-x   1 1750 1750            2048 May 28 2024 [     19 02]  ..
-rw-r--r--   1 1750 1750              22 May 28 2024 [     33 00]  abcde.txt
-rw-r--r--   1 1750 1750              27 May 28 2024 [     34 00]  абвгд.txt

Joliet filenames are UCS-2 encoded, it is the standard. When iso9660 tools create images, they convert from whatever input charset is to UCS-2. When they list some image's content, they convert from UCS-2 for the local charset. It sounds much better than the RR case, but there is a problem: isoinfo cant convert to utf-8. It can only convert to a selection of 1-byte charsets, the conversion tables are under cdrkit-1.1.11/libunls/. Among the tables there is the almighty nls_iconv.c, but it is only used by mkisofs. When isoinfo cant convert some char in a Joliet name to the current charset, it prints an underscore:

> isoinfo -i utf8-rj.iso -l -J

Directory listing of /
d---------   0    0    0            2048 May 28 2024 [     23 02]  .
d---------   0    0    0            2048 May 28 2024 [     23 02]  ..
----------   0    0    0              22 May 28 2024 [     33 00]  abcde.txt
----------   0    0    0              27 May 28 2024 [     34 00]  _____.txt

The underscored name can be used to extract the file:

> isoinfo -i utf8-rj.iso -J -x /_____.txt
contents of абвгд.txt

It already looks bad, but there is more to it. Let's create another sample iso:

mkdir multi
for x in а б в; do echo "contents of $x.txt" > multi/"$x".txt; done
xorriso -joliet on -as mkisofs -o multi.iso multi

and list it:

> isoinfo -i multi.iso -l -J

Directory listing of /
d---------   0    0    0            2048 May 28 2024 [     23 02]  .
d---------   0    0    0            2048 May 28 2024 [     23 02]  ..
----------   0    0    0              19 May 28 2024 [     33 00]  _.txt
----------   0    0    0              19 May 28 2024 [     34 00]  _.txt
----------   0    0    0              19 May 28 2024 [     35 00]  _.txt

All the files look the same, as expected. Let's see what whould it extract by /_.txt path:

> isoinfo -i multi.iso -J -x /_.txt
contents of а.txt
contents of б.txt
contents of в.txt

It printed contents of ALL the files which got their names squashed into _.txt due to the incapable charset converter.

It is possible to produce the correct listing with isoinfo:

> isoinfo -i multi.iso -l -J -j cp1251 | iconv -f cp1251

Directory listing of /
d---------   0    0    0            2048 May 28 2024 [     23 02]  .
d---------   0    0    0            2048 May 28 2024 [     23 02]  ..
----------   0    0    0              19 May 28 2024 [     33 00]  а.txt
----------   0    0    0              19 May 28 2024 [     34 00]  б.txt
----------   0    0    0              19 May 28 2024 [     35 00]  в.txt

but it only works because we know ahead symbols used in filenames can be converted to cp1251 without issues. This trick can be used with extraction as well:

> isoinfo -i multi.iso -J -j cp1251 -x /"$(echo б.txt | iconv -t cp1251)"
contents of б.txt

To summarize, Joliet support in isoinfo is inadequate. It only works well for latin characters. It cant convert non-latin filenames to utf-8, which is the must these days. I might suggest to at least ignore the Joliet tree when using isoinfo. In the case, it would be always isoinfo -R which states for RR with ECMA-119 fallback.

@slowpeek
Copy link
Contributor Author

In case we decided to ignore the Joliet tree with isoinfo, this whole PR would become a one-line change: isoinfo -l -R -i "${MC_EXT_FILENAME}" instead of isoinfo -l -R -J -i "${MC_EXT_FILENAME}"

@zyv
Copy link
Member

zyv commented May 28, 2024

Sounds pretty brutal to build a complex fallback chain with Joliet problem in-between. Maybe just ignore Joliet with isoinfo (a reference to this PR can be added in the code) and if you want to make it better rather use xorriso if available and only then fallback to isoinfo?

@slowpeek
Copy link
Contributor Author

slowpeek commented May 29, 2024

Locally I've remade it into xorriso + isoinfo (no Joliet).

The following text is about the outdated p7zip 16.02 still present in ubuntu 22.04

Here is another question. In the current view action code for iso9660 there is a fallback to 7za l. First of all, it likely never worked, because it is 7z that supports iso9660, not 7za.

Next, 7z prefers Joliet over Rock Ridge, there is no cli option to change that [1]. When Joliet is present, 7z l correctly converts filenames to the current locale from Joliet's UCS2:

> 7z l utf8-rj.iso | sed -n '/^----/,/^----/p'
------------------- ----- ------------ ------------  ------------------------
2024-05-28 01:29:57 .....            0            0  abcde.txt
2024-05-28 01:20:23 .....            0            0  абвгд.txt
------------------- ----- ------------ ------------  ------------------------

But when there is only Rock Ridge, it assumes the filenames are encoded in some 1-byte encoding (idk what CP_OEMCP stands for in linux enviroment) and converts it to the current locale from that. In my case, RR names are utf-8 encoded, current locale is utf-8 as well. 7z l prints it as double utf-8 encoded:

> 7z l utf8-r.iso | sed -n '/^----/,/^----/p'
------------------- ----- ------------ ------------  ------------------------
2024-05-28 01:29:57 .....            0            0  abcde.txt
2024-05-28 01:20:23 .....            0            0  абвгд.txt
------------------- ----- ------------ ------------  ------------------------

It could be tricked to print the names raw:

> LC_CTYPE=C 7z l utf8-r.iso | sed -n '/^----/,/^----/p'
------------------- ----- ------------ ------------  ------------------------
2024-05-28 01:29:57 .....            0            0  abcde.txt
2024-05-28 01:20:23 .....            0            0  абвгд.txt
------------------- ----- ------------ ------------  ------------------------

But the same trick breaks it for Joliet images:

> LC_CTYPE=C 7z l utf8-rj.iso | sed -n '/^----/,/^----/p'
------------------- ----- ------------ ------------  ------------------------
2024-05-28 01:29:57 .....            0            0  abcde.txt
2024-05-28 01:20:23 .....            0            0  ?????.txt
------------------- ----- ------------ ------------  ------------------------

So, to correctly list some iso with 7z l we need to detect if it contains Joliet or RR only and apply the trick for RR only. And to detect that, we need some other tool, for example such shell function (source: me):

is_joliet() {
    local skip=16 mark

    # Loop through the volume descriptor set
    # https://en.wikipedia.org/wiki/ISO_9660#Volume_descriptor_set
    while true; do
        mark=$(od -j$((2048*skip)) -N6 -An -tx1 <"$1" 2>/dev/null | tr -d ' ')

        case "$mark" in
            ??4344303031) # Type (1 byte) + CD001
                case "$mark" in
                    ff*) return 1 ;; # Terminator
                    02*) return 0 ;; # Joliet
                esac ;;
            *)
                return 1 ;;
        esac

        skip=$((skip+1))
    done
}

To summarize: 7z l has its problems and dealing with it looks like an overkill for such a humble script as misc/ext.d/misc.sh.in.

What do you think, should 7z l be used as a fallback when xorriso and isoinfo are not available, or let it be just those two?

[1] https://github.com/p7zip-project/p7zip/blob/36f6b74222/CPP/7zip/Archive/Iso/IsoHandler.cpp#L203-L206

@zyv
Copy link
Member

zyv commented May 29, 2024

Well, I don't think it hurts too much to keep 7z as a fallback. In the worst case the filenames are unreadable, but at least you've got something at all.

I was thinking that maybe it makes sense to report a bug against 7z, but it's kind of difficult. What should they do? I think the only thing that would help a bit would be to validate the filenames as utf-8 stream and in this case as a special handling just output them raw. But somehow I'm skeptical they would want to invest much into it.

@slowpeek
Copy link
Contributor Author

slowpeek commented May 29, 2024

Well, I don't think it hurts too much to keep 7z as a fallback. In the worst case the filenames are unreadable, but at least you've got something at all.

You mean let it be just 7z l, without my is_joliet function and the LC_CTYPE trick?

Also, I'd like to have my findings about isoinfo and 7z added somewhere to the repo, like README.iso9660, so that I could refer to it in comments. Which dir can I put it into?

I was thinking that maybe it makes sense to report a bug against 7z, but it's kind of difficult. What should they do?

I'll create an issue to make them aware of the problem. They could take the other established tools' default path: treat RR names raw, as if the iso was created on a system with the current locale. There is no such problem in upstream builds and builds from https://github.com/p7zip-project/p7zip used by archlinux.

@zyv
Copy link
Member

zyv commented May 29, 2024

You mean let it be just 7z l, without my is_joliet function and the LC_CTYPE trick?

Yep.

Which dir can I put it into?

I think this would be a good place:

https://github.com/MidnightCommander/mc/tree/master/src/vfs/extfs/helpers

I don't think anyone would be looking for it / find it where the scripts are:

https://github.com/MidnightCommander/mc/tree/master/misc/ext.d

@slowpeek
Copy link
Contributor Author

More details on 7z. I've figured out there are two flavours of 7z on my ubuntu 22.04 system.

One is named p7zip (p7zip-full package, 7z and 7za binaries). 7-zip.org quote:

p7zip is the command line version of 7-Zip for Linux / Unix, made by an independent developer

Its most recent version is 16.02. Ubuntu shipped that version since 16.10 till 23.10. In ubuntu 24.04 it became a transitional package to 7zip.

The other 7z flavour is 7zip (7zip package, 7zz binary), available since ubuntu 22.04. It is built from 7-zip.org sources. This one is actively maintained. And it has no problem with RR names I noticed in p7zip-full.

Since ubuntu 24.04, 7zip package provides 7z, 7za and 7zr binaries. And 7zip-standalone package provides 7zz binary.

Interestingly, archlinux atm ships p7zip 17.05-2 from this fork https://github.com/p7zip-project/p7zip. It does not have the RR problem as well. There are builds from upstream in AUR as well (1 2)

So, a rule of thumb with iso9660 listing with 7z: try 7zz first, fallback to 7z.

But it still sucks in preferring Joliet over RR.

@slowpeek slowpeek changed the title Fix isoinfo usage in the iso9660 view action Rework iso9660 view action May 30, 2024
@slowpeek
Copy link
Contributor Author

Ready to merge now.

I'll look into those tickets, you mentioned above, later.

- use xorriso -> isoinfo -> 7z fallback chain
- ignore the Joliet tree with isoinfo
- improve error reporting
- dev notes: src/vfs/extfs/helpers/README.iso9660
@zyv zyv force-pushed the master branch 2 times, most recently from e9e72d9 to de7d72c Compare October 17, 2024 19:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants