Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Apache Hive (or extensible JDBC) support #807

Open
christopherfrieler opened this issue Aug 3, 2024 · 1 comment
Open

Apache Hive (or extensible JDBC) support #807

christopherfrieler opened this issue Aug 3, 2024 · 1 comment
Assignees
Labels
databases JDBC related issues enhancement New feature or request
Milestone

Comments

@christopherfrieler
Copy link

christopherfrieler commented Aug 3, 2024

I tried to connect to Apache Hive via JDBC and faced a few issues, which I would like to share here. Feel free to discuss my ideas or split them into multiple issues.

  1. When I tried it, first I ran into the IllegalArgumentExceptions from https://github.com/Kotlin/dataframe/blob/master/dataframe-jdbc/src/main/kotlin/org/jetbrains/kotlinx/dataframe/io/db/util.kt telling me it's not possible. -> Hive support would be nice.
  2. I tried to add Hive support myself (in my code), but, although I could create a Hive instance of DbType, the when-statements in https://github.com/Kotlin/dataframe/blob/master/dataframe-jdbc/src/main/kotlin/org/jetbrains/kotlinx/dataframe/io/db/util.kt checking the DbType are not extensible. -> The implementation should be easier extensible, as new database technologies arrise and supporting them in Kotlin Dataframe takes a while or even never happens, if it's a rather exotic one. I found the issues Add Apache Pinot® as supported database #637 and Redshift not supported #549, that also ask for further databases to be supported. Maybe there could be something like a "DbTypeRegistry", where users can add custom DbTypes at runtime instead of these static when-statements.
  3. I found a workaround (see below) to connect to Hive and read data from it. However, I had to deal with a lot of SQLFeatureNotSupportedExceptions from the Hive JDBC library, because they did not implement some (optional? I'm not a JDBC expert, so I'm not sure how "optional" these things are, i.e. if they should implement it or client code should be able to deal with the missing SQL features.) stuff, especially regarding metadata. -> Maybe there could be some kind of fallback implementation that might not be as performant, but allows to work with such incomplete JDBC driver implementations.

Finally, my workaround to connect to Hive:

import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.io.*
import org.jetbrains.kotlinx.dataframe.io.db.DbType
import org.jetbrains.kotlinx.dataframe.schema.ColumnSchema
import java.sql.DriverManager
import java.sql.ResultSet
import java.sql.ResultSetMetaData
import kotlin.reflect.KType

private object Hive: DbType("hive") {
    override val driverClassName: String
        get() = "org.apache.hive.jdbc.HiveDriver"

    override fun buildTableMetadata(tables: ResultSet): TableMetadata {
        TODO("Not yet implemented")
    }

    override fun convertSqlTypeToColumnSchemaValue(tableColumnMetadata: TableColumnMetadata): ColumnSchema? = null

    override fun convertSqlTypeToKType(tableColumnMetadata: TableColumnMetadata): KType? = null

    override fun isSystemTable(tableMetadata: TableMetadata): Boolean = false
}

private class HiveResultSet(private val original: ResultSet) : ResultSet by original {
    private class HiveResultSetMetaData(val original: ResultSetMetaData) : ResultSetMetaData by original {
        override fun getTableName(column: Int): String = ""

        override fun getColumnName(column: Int): String {
            return original.getColumnName(column).substringAfter(".")
        }
    }

    override fun getMetaData(): ResultSetMetaData {
        return HiveResultSetMetaData(original.metaData)
    }
}

public fun DataFrame.Companion.readHiveSqlQuery(
    query: String,
    dbConfig: DatabaseConfiguration,
): DataFrame<*> {
    DriverManager.getConnection(dbConfig.url, dbConfig.user, dbConfig.password).use { connection ->
        connection.createStatement().use { statement ->
            statement.executeQuery(query).use { rs ->
                return readResultSet(HiveResultSet(rs), Hive)
            }
        }
    }
}

This way I managed to get my own Hive DbType into readResultSet(). It definitely has limitations and I'm very unsure about my minimal implementations of the methods of the DbType. But at least it seems to work so far.

@zaleslaw
Copy link
Collaborator

zaleslaw commented Aug 5, 2024

Thanks for sharing! There are a few user-requests in the Issues about possiblity to register custom SQL dialects and I believe it will possible not earlier than 0.15 release (now we are finishing with 0.14)

But we have some bottlenecks in our plugin for schema generation and this is a reason why we closed the hierarchy of DB classes. Hope to solve or suggest workaround for this problem and be open for new data sources

@zaleslaw zaleslaw self-assigned this Aug 5, 2024
@zaleslaw zaleslaw added databases JDBC related issues enhancement New feature or request labels Aug 5, 2024
@zaleslaw zaleslaw added this to the 0.15.0 milestone Aug 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
databases JDBC related issues enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants