CsvSource type conversion with custom schema #370

Open
tszolar opened this issue Feb 8, 2018 · 3 comments
@tszolar

tszolar commented Feb 8, 2018

From the CSV source section of the project README, I understood that type conversion for loaded CSV data should be performed according to the specified schema.

But if I define a custom schema for a CsvSource with columns of types other than String (Int, for example), the values in those columns are still returned as String.

Is this intended behaviour, a bug, or has it just not been implemented yet?

Runnable example:

import java.io.ByteArrayInputStream
import java.nio.charset.StandardCharsets
import io.eels.component.csv.CsvSource
import io.eels.schema._

object CsvSourceTypeConversionTest extends App {

  val exampleCsvString =
    """A,B,C,D
      |1,2.2,3,foo
      |4,5.5,6,bar
    """.stripMargin

  // def rather than val: `stream _` needs a method, and each call then yields a fresh stream
  def stream = new ByteArrayInputStream(exampleCsvString.getBytes(StandardCharsets.UTF_8))
  val schema = new StructType(Vector(
    Field("A", IntType.Signed),
    Field("B", DoubleType),
    Field("C", IntType.Signed),
    Field("D", StringType)
  ))
  val ds = new CsvSource(stream _, Some(schema)).toDataStream()
  val firstRow = ds.iterator.toIterable.head
  val firstRowA = firstRow.get("A")
  println(firstRowA) // prints 1 as expected
  println(firstRowA.getClass.getTypeName) // prints java.lang.String
  assert(firstRowA == 1) // this assertion will fail because firstRowA is not an Int
}
@hannesmiller
Contributor

@flexik Have you considered using the SchemaInferrer?

val inferrer = SchemaInferrer(SchemaType.String, SchemaRule("qty", SchemaType.Int, false), SchemaRule(".*_id", SchemaType.Int))
CsvSource("myfile").withSchemaInferrer(inferrer)

I take your point, though, that if you explicitly pass in the schema, it should perhaps use that schema under the hood. We will be looking into this.

@tszolar
Author

tszolar commented Feb 14, 2018

@hannesmiller Using SchemaInferrer yields exactly the same result as using the schema directly. No type conversion happens. All Row fields are still Strings.

Updated example with SchemaInferrer:

import java.io.ByteArrayInputStream
import java.nio.charset.StandardCharsets

import io.eels.{DataTypeRule, SchemaInferrer}
import io.eels.component.csv.CsvSource
import io.eels.schema._

object CsvSourceTypeConversionTest extends App {

  val exampleCsvString =
    """A,B,C,D
      |1,2.2,3,foo
      |4,5.5,6,bar
    """.stripMargin

  def stream = new ByteArrayInputStream(exampleCsvString.getBytes(StandardCharsets.UTF_8))
  val inferrer = SchemaInferrer(
    StringType,
    DataTypeRule("A", IntType.Signed),
    DataTypeRule("B", DoubleType),
    DataTypeRule("C", IntType.Signed),
    DataTypeRule("D", StringType)
  )
  val ds = new CsvSource(stream _).withSchemaInferrer(inferrer).toDataStream()
  val firstRow = ds.iterator.toIterable.head
  val firstRowA = firstRow.get("A")
  println(firstRowA) // prints 1 as expected
  println(firstRowA.getClass.getTypeName) // prints java.lang.String
  assert(firstRowA == 1) // this assertion will fail because firstRowA is not an Int
}
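
As a stopgap until the source applies the schema itself, the string values can be converted by hand after reading. The sketch below is deliberately independent of eels: `ColType`, `IntCol`, `DoubleCol`, `StringCol`, `convert` and `convertRow` are hypothetical names introduced for illustration, standing in for the schema's field types, and the input is assumed to be the raw string cells of one row in column order.

```scala
// Hypothetical stand-ins for the schema's column types; not part of the eels API.
sealed trait ColType
case object IntCol extends ColType
case object DoubleCol extends ColType
case object StringCol extends ColType

object ManualConversion {
  // Convert one raw CSV cell to the JVM type its column declares.
  // Throws NumberFormatException if the cell does not parse.
  def convert(raw: String, tpe: ColType): Any = tpe match {
    case IntCol    => raw.trim.toInt
    case DoubleCol => raw.trim.toDouble
    case StringCol => raw
  }

  // Convert a whole row, pairing each cell with its column type by position.
  def convertRow(cells: Seq[String], types: Seq[ColType]): Seq[Any] = {
    require(cells.length == types.length, "row width must match schema width")
    cells.zip(types).map { case (cell, tpe) => convert(cell, tpe) }
  }
}

object ManualConversionDemo extends App {
  // Column types mirroring the A, B, C, D schema from the example above.
  val types = Seq(IntCol, DoubleCol, IntCol, StringCol)
  val row = ManualConversion.convertRow(Seq("1", "2.2", "3", "foo"), types)
  println(row.head.getClass.getTypeName) // a boxed integer type rather than String
  assert(row.head == 1) // passes once the value has been converted
}
```

This only covers the three types used in the example; a real workaround would need a case for every type in the schema and a decision on how to treat empty or malformed cells.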

@hannesmiller
Contributor

@flexik OK, this may be a bug that was introduced between versions. Nevertheless, I agree this should match the supplied schema. We will make this a priority for the next release, which is looking like the early part of March.

Will keep you posted if we manage to get this resolved in an alpha release beforehand.

Regards
Hannes
