
[KYUUBI #6832] Impl Spark DSv2 YARN Connector that supports reading YARN aggregation logs #6856

Open · wants to merge 41 commits into base: master
Conversation

@naive-zhang (Contributor) commented Dec 20, 2024

see #6832

Why are the changes needed?

Implement a Spark DSv2 YARN connector that supports reading YARN aggregated logs and YARN applications.

How was this patch tested?

Added tests: YarnAppQuerySuite, YarnCatalogSuite, and YarnLogQuerySuite.

Was this patch authored or co-authored using generative AI tooling?


@codecov-commenter commented Dec 20, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 0.00%. Comparing base (b265ccb) to head (1d4d45d).

Additional details and impacted files
@@          Coverage Diff           @@
##           master   #6856   +/-   ##
======================================
  Coverage    0.00%   0.00%           
======================================
  Files         687     687           
  Lines       42463   42463           
  Branches     5796    5796           
======================================
  Misses      42463   42463           


import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.util.CaseInsensitiveStringMap

class YarnCatalog extends TableCatalog with SupportsNamespaces with Logging {
Member:

it does not support namespace, right?

Contributor Author:

Yes, I've removed the relevant code.

structType: StructType,
transforms: Array[Transform],
map: util.Map[String, String]): Table = {
throw new UnsupportedOperationException("Create table is not supported")
Member:

let's canonicalize all error messages to `The tables in catalog ${catalogName} does not support ALTER TABLE.`

Contributor Author:

Thanks for your review, I've updated it.
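
The canonicalized message could be centralized in a small helper so every unsupported operation reuses it. A sketch only; the helper name is an assumption, and `catalogName` is assumed to be the name the catalog was initialized with:

```scala
// Sketch: one helper carries the canonical message suggested above.
// `catalogName` is assumed to be captured in initialize(name, options).
private def unsupportedOperationError(operation: String): Nothing =
  throw new UnsupportedOperationException(
    s"The tables in catalog $catalogName does not support $operation.")

override def createTable(
    ident: Identifier,
    schema: StructType,
    partitions: Array[Transform],
    properties: java.util.Map[String, String]): Table =
  unsupportedOperationError("CREATE TABLE")
```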

class YarnLogTable extends Table with SupportsRead {
override def name(): String = "app_logs"

override def schema(): StructType =
Member:

Spark also has SupportsMetadataColumns, maybe we should consider converting some cols to metadata col
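
A minimal sketch of the `SupportsMetadataColumns` suggestion. The `file_path` column name is a hypothetical example, not necessarily part of this PR:

```scala
import org.apache.spark.sql.connector.catalog.{MetadataColumn, SupportsMetadataColumns}
import org.apache.spark.sql.types.{DataType, StringType}

// Sketch: a metadata column only appears when selected explicitly,
// keeping the default schema of app_logs lean.
trait YarnLogMetadataColumns extends SupportsMetadataColumns {
  override def metadataColumns(): Array[MetadataColumn] = Array(
    new MetadataColumn {
      override def name(): String = "file_path" // hypothetical column
      override def dataType(): DataType = StringType
      override def comment(): String = "Path of the aggregated log file"
    })
}
```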

import org.apache.spark.sql.connector.read._
import org.apache.spark.sql.sources.Filter

trait BasicScanBuilder
Member:

I don't think this abstract layer is really helpful

Contributor Author:

Yes, it seems unnecessary, and I've removed it.

// fetch apps
val applicationReports: java.util.List[ApplicationReport] =
yarnAppPartition.filters match {
case filters if filters.isEmpty => yarnClient.getApplications
Member:

will it retrieve all apps into memory, or stream them?
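
For context: `YarnClient.getApplications` returns a fully materialized `java.util.List`; there is no streaming variant. The overload with state filters at least bounds what the ResourceManager sends back. A sketch, assuming a `yarnClient` in scope:

```scala
import java.util.EnumSet
import org.apache.hadoop.yarn.api.records.YarnApplicationState

// The whole result list still lives in memory, but state filters
// narrow the set the ResourceManager returns in the first place.
val finishedApps = yarnClient.getApplications(
  EnumSet.of(YarnApplicationState.FINISHED, YarnApplicationState.KILLED))
```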

// type => in (a,b,c), batch query
case filters =>
filters.collectFirst {
case EqualTo("id", appId: String) => java.util.Collections.singletonList(
Member:

one application may have multiple attempts
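
On the multiple-attempts point: `YarnClient` can enumerate attempts for a given application id, and each attempt has its own containers and logs. A sketch with a hypothetical application id:

```scala
import org.apache.hadoop.yarn.api.records.ApplicationId

// One application id can map to several attempts; a per-id lookup may
// also need to surface these. The id below is purely illustrative.
val appId = ApplicationId.fromString("application_1700000000000_0001")
val attempts = yarnClient.getApplicationAttempts(appId) // java.util.List[ApplicationAttemptReport]
```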

StructField("start_time", LongType, nullable = false),
StructField("finish_time", LongType, nullable = false),
StructField("tracking_url", StringType, nullable = false),
StructField("original_tracking_url", StringType, nullable = false)))
Member:

I'm not sure the tracking_url is always present

extensions/spark/kyuubi-spark-connector-yarn/pom.xml (outdated; resolved)

private val remoteAppLogDir = {
val dir = SparkSession.active.sparkContext
.getConf.getOption(remoteAppLogDirKey) match {
Member:

this is incorrect.

use SparkSession.conf instead of SparkContext.getConf in SQL cases, because the latter returns the global static conf, which is immutable after Spark is launched.

and I don't think it is worth adding an additional Spark conf key; just use SparkSession.sessionState.newHadoopConf to create a Configuration and read the Hadoop conf directly.
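
The suggestion above could look like this sketch, reading the standard YARN key from the session's Hadoop configuration instead of introducing a new Spark conf:

```scala
import org.apache.hadoop.yarn.conf.YarnConfiguration
import org.apache.spark.sql.SparkSession

// Resolve the aggregated-log root from the Hadoop configuration,
// honoring per-session overrides via newHadoopConf().
private val remoteAppLogDir: String = {
  val hadoopConf = SparkSession.active.sessionState.newHadoopConf()
  hadoopConf.get(
    YarnConfiguration.NM_REMOTE_APP_LOG_DIR, // "yarn.nodemanager.remote-app-log-dir"
    YarnConfiguration.DEFAULT_NM_REMOTE_APP_LOG_DIR)
}
```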

private def tryPushDownPredicates(): mutable.Seq[FileStatus] = {
filters match {
case pushed if pushed.isEmpty => listFiles(remoteAppLogDir)
case pushed => pushed.collectFirst {
Member:

it should support multiple predicates ...
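
One way to honor several pushed filters at once is to intersect the candidate files each supported filter yields, instead of acting only on the first match. A sketch; `listFilesFor` is a hypothetical helper mapping a single `Filter` to its matching files:

```scala
import org.apache.hadoop.fs.FileStatus

// Sketch: evaluate every pushed filter and keep only files that
// satisfy all of them (intersection by path).
private def tryPushDownPredicates(): Seq[FileStatus] =
  if (filters.isEmpty) {
    listFiles(remoteAppLogDir)
  } else {
    filters.map(listFilesFor).reduce { (left, right) =>
      val rightPaths = right.map(_.getPath).toSet
      left.filter(f => rightPaths.contains(f.getPath))
    }
  }
```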

case EqualTo("container_id", containerId: String) =>
listFiles(s"${remoteAppLogDir}/*/*/*/*/${containerId}") ++
// compatible for hadoop2
listFiles(s"${remoteAppLogDir}/*/*/*/${containerId}")
Member:

could you leave some comments to explain the directory structure and Hadoop code/JIRA reference?
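
A hedged sketch of the comment the reviewer asks for. The layouts below are my reading of YARN log aggregation (YARN-6929 introduced an extra bucket level in Hadoop 3) and should be verified against the target Hadoop releases:

```scala
// Aggregated log layout (verify against the target Hadoop version):
//   Hadoop 3 (after YARN-6929 added a bucket level):
//     {remote-app-log-dir}/{user}/{suffix}/{bucket}/{appId}/{containerId}
//   Hadoop 2:
//     {remote-app-log-dir}/{user}/{suffix}/{appId}/{containerId}
case EqualTo("container_id", containerId: String) =>
  listFiles(s"$remoteAppLogDir/*/*/*/*/$containerId") ++
    listFiles(s"$remoteAppLogDir/*/*/*/$containerId")
```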

val fileIterator = fs.listFiles(status.getPath, true)
while (fileIterator.hasNext) {
val fileStatus = fileIterator.next()
if (fileStatus.isFile) logFiles += fileStatus
Member:

what if dir?
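
For context on the question above: `FileSystem.listFiles(path, recursive = true)` already descends into subdirectories and its iterator yields files only, so the `isFile` guard in the excerpt is defensive rather than required. A sketch:

```scala
import scala.collection.mutable
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

// listFiles with recursive = true walks into directories itself and
// returns only files, so directories never reach this loop.
private def collectLogFiles(fs: FileSystem, root: Path): Seq[FileStatus] = {
  val files = mutable.ArrayBuffer.empty[FileStatus]
  val it = fs.listFiles(root, true)
  while (it.hasNext) {
    files += it.next()
  }
  files.toSeq
}
```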

@pan3793 (Member) commented Dec 23, 2024

@naive-zhang it may take some time to merge the whole feature into master. To speed up the process, you may want to split it into several PRs: for example, the first PR includes just the YarnCatalog, and then each following PR focuses on one table.
