Apache Hive (or extensible JDBC) support #807

christopherfrieler · 2024-08-03T18:14:31Z

I tried to connect to Apache Hive via JDBC and faced a few issues, which I would like to share here. Feel free to discuss my ideas or split them into multiple issues.

When I tried it, first I ran into the IllegalArgumentExceptions from https://github.com/Kotlin/dataframe/blob/master/dataframe-jdbc/src/main/kotlin/org/jetbrains/kotlinx/dataframe/io/db/util.kt telling me it's not possible. -> Hive support would be nice.
I tried to add Hive support myself (in my code), but, although I could create a Hive instance of DbType, the when-statements in https://github.com/Kotlin/dataframe/blob/master/dataframe-jdbc/src/main/kotlin/org/jetbrains/kotlinx/dataframe/io/db/util.kt checking the DbType are not extensible. -> The implementation should be easier extensible, as new database technologies arrise and supporting them in Kotlin Dataframe takes a while or even never happens, if it's a rather exotic one. I found the issues Add Apache Pinot® as supported database #637 and Redshift not supported #549, that also ask for further databases to be supported. Maybe there could be something like a "DbTypeRegistry", where users can add custom DbTypes at runtime instead of these static when-statements.
I found a workaround (see below) to connect to Hive and read data from it. However, I had to deal with a lot of SQLFeatureNotSupportedExceptions from the Hive JDBC library, because they did not implement some (optional? I'm not a JDBC expert, so I'm not sure how "optional" these things are, i.e. if they should implement it or client code should be able to deal with the missing SQL features.) stuff, especially regarding metadata. -> Maybe there could be some kind of fallback implementation that might not be as performant, but allows to work with such incomplete JDBC driver implementations.

Finally, my workaround to connect to Hive:

import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.io.*
import org.jetbrains.kotlinx.dataframe.io.db.DbType
import org.jetbrains.kotlinx.dataframe.schema.ColumnSchema
import java.sql.DriverManager
import java.sql.ResultSet
import java.sql.ResultSetMetaData
import kotlin.reflect.KType

private object Hive: DbType("hive") {
    override val driverClassName: String
        get() = "org.apache.hive.jdbc.HiveDriver"

    override fun buildTableMetadata(tables: ResultSet): TableMetadata {
        TODO("Not yet implemented")
    }

    override fun convertSqlTypeToColumnSchemaValue(tableColumnMetadata: TableColumnMetadata): ColumnSchema? = null

    override fun convertSqlTypeToKType(tableColumnMetadata: TableColumnMetadata): KType? = null

    override fun isSystemTable(tableMetadata: TableMetadata): Boolean = false
}

private class HiveResultSet(private val original: ResultSet) : ResultSet by original {
    private class HiveResultSetMetaData(val original: ResultSetMetaData) : ResultSetMetaData by original {
        override fun getTableName(column: Int): String = ""

        override fun getColumnName(column: Int): String {
            return original.getColumnName(column).substringAfter(".")
        }
    }

    override fun getMetaData(): ResultSetMetaData {
        return HiveResultSetMetaData(original.metaData)
    }
}

public fun DataFrame.Companion.readHiveSqlQuery(
    query: String,
    dbConfig: DatabaseConfiguration,
): DataFrame<*> {
    DriverManager.getConnection(dbConfig.url, dbConfig.user, dbConfig.password).use { connection ->
        connection.createStatement().use { statement ->
            statement.executeQuery(query).use { rs ->
                return readResultSet(HiveResultSet(rs), Hive)
            }
        }
    }
}

This way I managed to get my own Hive DbType into readResultSet(). It definitely has limitations and I'm very unsure about my minimal implementations of the methods of the DbType. But at least it seems to work so far.

The text was updated successfully, but these errors were encountered:

zaleslaw · 2024-08-05T13:01:09Z

Thanks for sharing! There are a few user-requests in the Issues about possiblity to register custom SQL dialects and I believe it will possible not earlier than 0.15 release (now we are finishing with 0.14)

But we have some bottlenecks in our plugin for schema generation and this is a reason why we closed the hierarchy of DB classes. Hope to solve or suggest workaround for this problem and be open for new data sources

zaleslaw self-assigned this Aug 5, 2024

zaleslaw added databases JDBC related issues enhancement New feature or request labels Aug 5, 2024

zaleslaw added this to the 0.15.0 milestone Aug 5, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Apache Hive (or extensible JDBC) support #807

Apache Hive (or extensible JDBC) support #807

christopherfrieler commented Aug 3, 2024 •

edited

Loading

zaleslaw commented Aug 5, 2024

Apache Hive (or extensible JDBC) support #807

Apache Hive (or extensible JDBC) support #807

Comments

christopherfrieler commented Aug 3, 2024 • edited Loading

zaleslaw commented Aug 5, 2024

christopherfrieler commented Aug 3, 2024 •

edited

Loading