Extra BOM in CSV file, hledger reports an error #2189

PSLLSP · 2024-03-29T17:15:30Z

hledger 1.32.3, linux

I have a CSV file in UTF-8 format; it starts with a BOM <feff>

When I join several such files into one with cat test-bom-*.csv > test-bom.csv, the resulting file contains several BOM characters.
hledger doesn't like those extra BOM characters and reports an error:

$ hledger -f test-bom.csv print
hledger: error: could not parse "2024-01-02" as a date using date format "%Y-%m-%d"
the CSV record is:  "\65279\&2024-01-02", "0.2", "test 2"
the date rule is:   %1
the date-format is: %Y-%m-%d
you may need to change your date rule, change your date-format rule, or add a skip rule
for m/d/y or d/m/y dates, use date-format %-m/%-d/%Y or date-format %-d/%-m/%Y

I am not sure, but I don't think it is wrong for a UTF-8 file to contain several BOM codes; I tried other utilities and they did not fail with an error. In theory, the encoding of the file could change in the middle, e.g. from UTF-8 to UTF-16LE...

How to replicate: prepare test data, several simple CSV files with and without a BOM:

for I in 1 2 3; do echo -e "2024-01-0${I},0.${I},test ${I}" > "test-nobom-${I}.csv";  echo -e "\xef\xbb\xbf2024-01-0${I},0.${I},test ${I}" > "test-bom-${I}.csv"; done
cat test-nobom-[123].csv > test-nobom.csv 
cat test-bom-[123].csv > test-bom.csv

The files test-bom.csv and test-nobom.csv look the same but differ in size:

$ cat test-bom.csv
2024-01-01,0.1,test 1
2024-01-02,0.2,test 2
2024-01-03,0.3,test 3

$ ls -l test-nobom.csv test-bom.csv
-rw-rw-r-- 1 user user 75 Mar 29 17:59 test-bom.csv
-rw-rw-r-- 1 user user 66 Mar 29 17:59 test-nobom.csv
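
The 9-byte difference corresponds to three BOMs of three bytes each (EF BB BF). A minimal sketch for confirming that, assuming a grep that supports -a and -o (GNU and BSD grep do); the temporary directory and variable names are only for illustration:

```shell
# Recreate the three-BOM file in a temporary directory so the sketch is self-contained.
tmp=$(mktemp -d)
for i in 1 2 3; do
  printf '\357\273\2772024-01-0%s,0.%s,test %s\n' "$i" "$i" "$i"
done > "$tmp/test-bom.csv"

# The UTF-8 encoding of U+FEFF, written as octal escapes for portability.
bom=$(printf '\357\273\277')

# -a: treat the file as text; -o: print every match on its own line, so wc -l counts BOMs.
bom_count=$(LC_ALL=C grep -a -o "$bom" "$tmp/test-bom.csv" | wc -l)
echo "BOMs found: $bom_count"

# Dump the raw bytes of the first line; the leading "ef bb bf" is the BOM.
head -n 1 "$tmp/test-bom.csv" | od -An -tx1
```

Running it with LC_ALL=C makes grep match raw bytes, which sidesteps any locale-dependent handling of U+FEFF.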

grep is "confused" by the BOM:

$ grep ^2024 test-nobom.csv
2024-01-01,0.1,test 1
2024-01-02,0.2,test 2
2024-01-03,0.3,test 3

$ grep ^2024 test-bom.csv

$ grep ^.2024 test-bom.csv
2024-01-01,0.1,test 1
2024-01-02,0.2,test 2
2024-01-03,0.3,test 3

Create the import rules; they are identical. I created test-bom.csv.rules and then ran ln -s test-bom.csv.rules test-nobom.csv.rules and ln -s test-bom.csv.rules test-bom-1.csv.rules:

$ cat test-bom.csv.rules 

fields      date,amount,description
date-format %Y-%m-%d

$ cat test-nobom.csv.rules 

fields      date,amount,description
date-format %Y-%m-%d

$ cat test-bom-1.csv.rules 

fields      date,amount,description
date-format %Y-%m-%d

TEST

hledger can import a CSV file with a single BOM, and a file without a BOM:

$ hledger -f test-bom-1.csv bal
                -0.1  income:unknown
                 0.1  unknown
--------------------
                   0

$ hledger -f test-nobom.csv bal
                -0.6  income:unknown
                 0.6  unknown
--------------------
                   0

hledger doesn't like a file with several BOMs:

$ hledger -f test-bom.csv bal
hledger: error: could not parse "2024-01-02" as a date using date format "%Y-%m-%d"
the CSV record is:  "\65279\&2024-01-02", "0.2", "test 2"
the date rule is:   %1
the date-format is: %Y-%m-%d
you may need to change your date rule, change your date-format rule, or add a skip rule
for m/d/y or d/m/y dates, use date-format %-m/%-d/%Y or date-format %-d/%-m/%Y


simonmichael · 2024-03-30T00:12:05Z

That's very clear! Thank you.

I also found:

https://www.unicode.org/faq/utf_bom.html#BOM
https://learn.microsoft.com/en-us/windows/win32/intl/using-byte-order-marks suggests windows apps will ignore a BOM in the middle of a file
https://googlesamples.github.io/android-custom-lint-rules/checks/ByteOrderMark.md.html suggests android apps consider it an error

We do want hledger to just work on real world data where possible, so we should be permissive where it doesn't add complications. But I'm not sure if we need to go as far as ignoring BOMs appearing anywhere in the input. It seems like an unusual niche case, and one that's easy to solve with preprocessing. Is it really valid for files to change encoding in the middle? I can't imagine many tools that would handle that properly.
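
One way the preprocessing could look, as a rough sketch rather than a recommended recipe: strip every UTF-8 BOM byte sequence with sed before handing the file to hledger. The file names under the temporary directory are made up for the example, and note that this also removes any U+FEFF that was meant as content (ZWNBSP) inside a field.

```shell
# Recreate the concatenated file with three BOMs (same data as in the report).
tmp=$(mktemp -d)
for i in 1 2 3; do
  printf '\357\273\2772024-01-0%s,0.%s,test %s\n' "$i" "$i" "$i"
done > "$tmp/test-bom.csv"

# The UTF-8 encoding of U+FEFF, written as octal escapes for portability.
bom=$(printf '\357\273\277')

# LC_ALL=C makes sed operate on raw bytes; the g flag removes every occurrence per line.
LC_ALL=C sed "s/$bom//g" "$tmp/test-bom.csv" > "$tmp/test-clean.csv"

# The cleaned file should now be the same size as test-nobom.csv (66 bytes).
clean_size=$(wc -c < "$tmp/test-clean.csv")
echo "cleaned size: $clean_size bytes"
```

The cleaned file can then be read as usual, for example with hledger -f "$tmp/test-clean.csv" --rules-file test-bom.csv.rules print, assuming your hledger version accepts --rules-file.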

simonmichael · 2024-03-30T00:14:21Z

Our BOM handling should be mentioned at https://hledger.org/dev/hledger.html#text-encoding .

simonmichael · 2024-03-30T00:48:15Z

Related, https://www.unicode.org/faq/utf_bom.html#BOM says:

Q: What should I do with U+FEFF in the middle of a file?

In the absence of a protocol supporting its use as a BOM and when not at the beginning of a text stream, U+FEFF should normally not occur.
For backwards compatibility it should be treated as ZERO WIDTH NON-BREAKING SPACE (ZWNBSP), and is then part of the content of the file or string.
When designing a markup language or data protocol, the use of U+FEFF can be restricted to that of Byte Order Mark. In that case, any U+FEFF occurring in the middle of a file can be treated as an unsupported character.

PSLLSP · 2024-03-30T13:01:09Z

BOM is a troublemaker... ;-) We use extended ASCII, and banks produced CSV files in CP-1250 in the past. Some of them upgraded their software and moved to UTF-8, and I believe that is why they produce UTF-8 files with a BOM: to clearly signal that the CSV file is in UTF-8, not CP-1250.

It is possible to create a file that starts with the UTF-8 BOM and has a UTF-16LE BOM in the middle: just join a UTF-8 file with a UTF-16LE file. But the result will be invalid, because a BOM is just one code point (U+FEFF) expressed differently in each UTF variant. I thought it might be possible to start with UTF-8 and use a BOM in the middle of the file to switch the encoding to UTF-16LE, but it is not, because the UTF-16LE BOM is an invalid sequence in UTF-8... Well, it could be possible, but the software would have to examine why there is an error in the data and test whether the offending bytes could be a BOM for another UTF variant... The good news is that UTF-16LE files are rare; UTF-8 is used in most cases.
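
The point about the UTF-16LE BOM being invalid in UTF-8 can be checked with iconv, assuming it is installed. This is only an illustrative sketch: it asks iconv to validate each byte sequence as UTF-8 and records whether it succeeded. The variable names are invented for the example.

```shell
# FF FE is U+FEFF encoded as UTF-16LE; EF BB BF is the same code point encoded as UTF-8.

# Try to read the UTF-16LE BOM bytes as UTF-8; iconv exits non-zero on invalid input.
if printf '\377\376' | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
  utf16le_bom_as_utf8=valid
else
  utf16le_bom_as_utf8=invalid
fi

# The UTF-8 BOM bytes, by contrast, are a well-formed UTF-8 sequence.
if printf '\357\273\277' | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
  utf8_bom_as_utf8=valid
else
  utf8_bom_as_utf8=invalid
fi

echo "FF FE read as UTF-8: $utf16le_bom_as_utf8"
echo "EF BB BF read as UTF-8: $utf8_bom_as_utf8"
```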

simonmichael · 2024-03-30T18:27:04Z

So all we need to do is document our BOM requirements at https://hledger.org/dev/hledger.html#text-encoding, as I've done?

PSLLSP · 2024-04-02T12:14:33Z

What about ignoring ZWNBSP characters during CSV import? I do not see how these invisible troublemakers could be useful in a hledger journal... Another way to handle them is to treat them as EOL; this would help when a CSV file does not end with EOL... An exception could be when ZWNBSP is used as the field separator. I do not know if there is a way to define the invisible ZWNBSP as a field separator, maybe separator \uFEFF or separator ZWNBSP. I do not know of any such CSV file...

Or maybe this could be addressed by adding a new command that maps one character to another, like the UNIX command tr. I could use it to translate a CSV file from CP-1250 to UTF-8 by defining a translation table in the hledger import rules: a new command would map an input code to a new code, and several such commands could appear in the rules file, one mapping per line. The problem here is that hledger reads the input file as UTF-8, and extended ASCII characters are invalid codes when the file is read as a UTF-8 stream (hledger reports the error invalid byte sequence). To address this, a new command to disable UTF-8 parsing should be added too, maybe encoding utf-8 (the default) and encoding binary (to parse the CSV in 8-bit mode).
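
Until something like that exists, the CP-1250 case can be handled outside hledger. A minimal sketch, assuming iconv is available and knows the CP1250 charset name; the file names and the sample record are hypothetical:

```shell
# Build a one-line sample in CP-1250: byte 0x9A (octal 232) is "š" in that code page.
tmp=$(mktemp -d)
printf '2024-01-01,0.1,ko\232\n' > "$tmp/bank-cp1250.csv"

# Convert to UTF-8 before the file reaches hledger.
iconv -f CP1250 -t UTF-8 "$tmp/bank-cp1250.csv" > "$tmp/bank-utf8.csv"

# Inspect the last three bytes: "š" becomes the two-byte sequence c5 a1, followed by the newline 0a.
tail_hex=$(tail -c 3 "$tmp/bank-utf8.csv" | od -An -tx1 | tr -d ' \n')
echo "last bytes after conversion: $tail_hex"
```

The converted file is plain UTF-8 without a BOM, so it avoids both the invalid byte sequence error and the extra-BOM problem when several such files are concatenated afterwards.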

simonmichael added the A-WISH, csv and i18n labels Mar 29, 2024

simonmichael added the docs Documentation-related. label Mar 30, 2024

simonmichael added a commit that referenced this issue Mar 30, 2024

;doc: text encoding: mention BOM support [#2189]

89d6f4a
