
Multiple codepages in the same file #631

Open
gergelyts opened this issue Jun 13, 2023 · 9 comments
Labels: question (Further information is requested)

Comments

@gergelyts

Background [Optional]

Hi,

In our case, one file contains several code pages: different lines are encoded with different code pages. There is a column that acts as a country code, and it is this country code that tells us which code page to use. A file typically uses 4-5 code pages. Currently, we would have to read the file once per code page, reading only the appropriate lines (based on the country_code column) with the appropriate code page.

This is quite slow.

Question

Do you see any possibility of using multiple code pages in the driver at the same time, with a single read?

Thanks!

@gergelyts added the question label Jun 13, 2023
@yruslan
Collaborator

yruslan commented Jul 7, 2023

Currently this is not supported, and I am not sure at this point how to add such support in a general way.

But there is a possible workaround. You can make the text field binary by specifying 'USAGE COMP' on that field. It will be converted to the binary Spark format. Then you can use Spark with a UDF to convert the binary field to a string based on the country code.
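
Something along these lines (just a sketch, not tested: the field names field1 and country_code and the code-page map are placeholders, and JVM charset names such as IBM037 must be supported by your JDK):

    import org.apache.spark.sql.functions.{col, udf}

    // Placeholder mapping from country_code values to JVM charset names.
    val codePageFor = Map("usa" -> "IBM037", "deu" -> "IBM273").withDefaultValue("IBM037")

    // Decode the raw EBCDIC bytes of the binary field using the per-record code page.
    val decodeByCountry = udf { (bytes: Array[Byte], countryCode: String) =>
      if (bytes == null) null else new String(bytes, codePageFor(countryCode))
    }

    val df = spark.read.format("cobol")
      .option("copybook", "/path/to/copybook.cpy") // with the text field marked USAGE COMP
      .load("/path/to/files")
      .withColumn("field1_text", decodeByCountry(col("field1"), col("country_code")))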

@gergelyts
Author

Do you have example code for this? Thanks

@gergelyts
Author

gergelyts commented Oct 9, 2023

Hi, your workaround does not work; I got the following error:

: scala.MatchError: AlphaNumeric(X(4026),4026,None,Some(EBCDIC),Some(X(4026))) (of class za.co.absa.cobrix.cobol.parser.ast.datatype.AlphaNumeric)

This is the copybook:

           20  I69-105-CLTX-TEXT             PIC X(4026) USAGE COMP.

@yruslan
Collaborator

yruslan commented Oct 9, 2023

Hi,

you can find an example here:

    val data = Array(0x12.toByte, 0x34.toByte, 0x56.toByte, 0x78.toByte)

Please use the latest version of Cobrix (2.6.8), as this feature was introduced only recently.

@Beno922

Beno922 commented Oct 19, 2023

Hi,

I'm working with the OP (@gergelyts) on this. We've managed to get the field into binary form; our problem now is that we need it as a (PySpark) string.

Could you give an example of how to do that (one that is not a forced cast)?

Also, we think it would be easier to somehow set the output encoding (we are thinking of UTF-8, to support the languages of all the countries).

Thanks for the help in advance!

@yruslan
Collaborator

yruslan commented Oct 23, 2023

Hi,

Now the binary field needs to be decoded into Unicode text (probably UTF-8 encoded). But I realized that the decoders we have in Cobrix are not available from Python, only from Scala.

I'm going to think about a solution. But at first glance, it might need direct support from Cobrix.
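
One possible stopgap (a sketch only, under assumptions: it requires running a bit of Scala in the same Spark session, e.g. in a mixed Scala/Python notebook, the function name decode_ebcdic is made up, and the charset must be one the JVM knows):

    // Scala side: register a decoder as a SQL function once per session.
    import java.nio.charset.Charset

    spark.udf.register("decode_ebcdic", (bytes: Array[Byte], charsetName: String) =>
      if (bytes == null) null else new String(bytes, Charset.forName(charsetName))
    )

    // PySpark side (same session) could then call it by name:
    //   df.selectExpr("decode_ebcdic(field1, 'IBM037') AS field1_text")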

1. Are all strings in each record encoded the same way, or only particular fields?
2. Can you give an example of some country code to code page mapping?
3. What is the list of code pages that can be encountered in your files? (This is to check whether Cobrix supports them.)

@Beno922

Beno922 commented Oct 25, 2023

No; out of 10 columns, 7 use the default encoding, and the remaining 3 are decided by the country code (both 1-byte and 2-byte code pages).

In the source application, the country code mappings look like this: ebcidic_codepage_mapping (1).txt (attached)

For the third question, see the prior reply.

These prior requests are related to this topic: #574, #539

If you need more information regarding the source application, feel free to contact @BenceBenedek.

@yruslan
Collaborator

yruslan commented Oct 25, 2023

Yes, I see. Thank you for the context!

So ideally, you would like a mapping like this:

[
  {
    "code_field": "country_code",
    "target_fields": ["field1", "field2", "field3"],
    "code_mapping": {
      "jpn": "cp300",
      "chn": "cp1388"
    }
  }
]

so that the encoding of field1, field2, and field3 is determined by the column country_code: when country_code=jpn, cp300 is used, right?

If multiple country code fields are defined for each record, it can be split like this:

[
  {
    "code_field": "country_code1",
    "target_fields": ["field1", "field2"],
    "code_mapping": {
      "jpn": "cp300",
      "chn": "cp1388"
    }
  },
  {
    "code_field": "country_code2",
    "target_fields": ["field3"],
    "code_mapping": {
      "japan": "cp300",
      "china": "cp1388"
    }
  }
]

So ideally, you want to be able to pass such a mapping to Cobrix, and it should figure things out, right?

@yruslan
Collaborator

yruslan commented Oct 25, 2023

Now that I think about it, a workaround is possible even now, though it is not very efficient: the file is read once per code page.

    import org.apache.spark.sql.functions.col

    val df1 = spark.read.format("cobol")
      .option("ebcdic_code_page", "cp037")
      .option("field_code_page:cp300", "field1")
      .load("/path/to/files")
      .filter(col("country_code") === "jpn")

    val df2 = spark.read.format("cobol")
      .option("ebcdic_code_page", "cp037")
      .option("field_code_page:cp1388", "field1")
      .load("/path/to/files")
      .filter(col("country_code") === "chn")

    val df = df1.union(df2)
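
If there are more than two code pages, the same pattern can be generalized, still at the cost of one pass over the data per code page (a sketch; the mapping below is illustrative):

    // One filtered read per (country, code page), then union them all.
    val mapping = Map("jpn" -> "cp300", "chn" -> "cp1388")

    val dfAll = mapping
      .map { case (country, codePage) =>
        spark.read.format("cobol")
          .option("ebcdic_code_page", "cp037")
          .option(s"field_code_page:$codePage", "field1")
          .load("/path/to/files")
          .filter(col("country_code") === country)
      }
      .reduce(_ union _)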
