This repository has been archived by the owner on Jul 16, 2020. It is now read-only.

"Frame length should be positive" problem in XGBoost with CPU (Mortgage-large) #71

peizhaoliu opened this issue Jan 21, 2020 · 4 comments

@peizhaoliu
Dear author,

I followed the guide at https://github.com/rapidsai/spark-examples/blob/master/getting-started-guides/on-prem-cluster/standalone-scala.md.
When I launch distributed training without GPUs (tree method hist), I set the parameters as follows: "--num-executors 1 --executor-cores 19 --conf spark.cores.max=19 --conf spark.task.cpus=1 --class ai.rapids.spark.examples.mortgage.CPUMain -numWorkers=19 -treeMethod=hist"
However, the tasks of the stage "foreachPartition at XGBoost.scala:703" always stay stuck in "running". A few hours after submitting the job, we got this feedback:
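For context, a full launch command with the parameters above would look roughly like this; the master URL, jar name, and any dataset arguments are placeholders I added, not details from the original report:

```shell
# Hypothetical spark-submit layout for the CPU (hist) Mortgage run.
# Master URL, jar name, and data paths are placeholders.
spark-submit \
  --master spark://<master-host>:7077 \
  --num-executors 1 \
  --executor-cores 19 \
  --conf spark.cores.max=19 \
  --conf spark.task.cpus=1 \
  --class ai.rapids.spark.examples.mortgage.CPUMain \
  sample_xgboost_apps-<version>-jar-with-dependencies.jar \
  -numWorkers=19 \
  -treeMethod=hist
```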
java.lang.IllegalArgumentException: Frame length should be positive: -9223371863126827765
	at org.spark_project.guava.base.Preconditions.checkArgument(Preconditions.java:119)
	at org.apache.spark.network.util.TransportFrameDecoder.decodeNext(TransportFrameDecoder.java:134)
	at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:81)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1359)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:935)
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:138)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
	at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
	at java.lang.Thread.run(Thread.java:748)

Could you please offer some tips on this issue? Thanks.

Sincerely

@peizhaoliu peizhaoliu changed the title "Frame length should be positive" problem in XGBoost with CPU Mortgage-large "Frame length should be positive" problem in XGBoost with CPU (Mortgage-large) Jan 21, 2020
@wjxiz1992
Member

Hi,
Did you set "nthread" to 1? "XGBoost4J-Spark requires that all of nthread * numWorkers cores should be available before the training runs."
You can add "-nthread=1" directly to the end of your command.

You could also try another parameter set, such as:

"--num-executors 1 --executor-cores 1 --conf spark.task.cpus=1 -numWorkers=19 -nthread=1 -treeMethod=hist"
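Assembled into a full command, that suggestion would look roughly like this; the master URL and jar name are placeholders of mine, not values from the thread:

```shell
# Hypothetical single-core-per-task layout with nthread pinned to 1,
# so nthread * numWorkers (1 * 19) matches the cores requested.
# Master URL and jar name are placeholders.
spark-submit \
  --master spark://<master-host>:7077 \
  --num-executors 1 \
  --executor-cores 1 \
  --conf spark.task.cpus=1 \
  --class ai.rapids.spark.examples.mortgage.CPUMain \
  sample_xgboost_apps-<version>-jar-with-dependencies.jar \
  -numWorkers=19 \
  -nthread=1 \
  -treeMethod=hist
```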

For the hanging problem, there is a "timeout_request_workers" parameter that may help (but not always). It can reduce the hang time when your app cannot get enough resources from Spark. There are also other reasons why the program might hang.

To see where it hangs, you can open Spark's web UI and go to "Executors" to inspect the "Thread Dump".

@peizhaoliu
Author

Hi,
When launching a GPU Mortgage example on Spark standalone, we get the following error:
2020-02-25 15:48:08 ERROR NativeDepsLoader:55 - Could not load cudf jni library...
java.lang.UnsatisfiedLinkError: /tmp/rmm4687696644621164964.so: libnvToolsExt.so.1: cannot open shared object file: No such file or directory
	at java.lang.ClassLoader$NativeLibrary.load(Native Method)
	at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1941)
	at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1824)
	at java.lang.Runtime.load0(Runtime.java:809)
	at java.lang.System.load(System.java:1086)
	at ai.rapids.cudf.NativeDepsLoader.loadDep(NativeDepsLoader.java:83)
	at ai.rapids.cudf.NativeDepsLoader.loadNativeDeps(NativeDepsLoader.java:51)
	at ai.rapids.cudf.Table.<clinit>(Table.java:31)
	at ml.dmlc.xgboost4j.scala.spark.rapids.CSVPartitionReader.readToTable(GpuCSVScan.scala:214)
	at ml.dmlc.xgboost4j.scala.spark.rapids.CSVPartitionReader.readBatch(GpuCSVScan.scala:194)
	at ml.dmlc.xgboost4j.scala.spark.rapids.CSVPartitionReader.next(GpuCSVScan.scala:230)

I explored the dependency jar and found "libxgboost4j.so" and "librmm.so" inside. So why can't it load the cudf JNI library? Could you please give me some tips to solve this problem?

@wjxiz1992
Member

Hi, I guess you probably used the wrong version of the cudf jar. You should choose the version matching your CUDA version, e.g. mvn package -Dcuda.classifier=cuda10 if your CUDA is 10.0. You can check your CUDA version with "cat /usr/local/cuda/version.txt".
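Put together, the check-then-rebuild steps would be something like the following; the classifier value shown is just the CUDA 10.0 example from above, and the version.txt path assumes the default CUDA install location:

```shell
# Check the locally installed CUDA toolkit version
# (assumes the default /usr/local/cuda install path).
cat /usr/local/cuda/version.txt

# Rebuild the cudf jar with the classifier matching that version,
# e.g. cuda10 for CUDA 10.0.
mvn package -Dcuda.classifier=cuda10
```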

@peizhaoliu
Author

peizhaoliu commented Mar 3, 2020

Thanks for your tip!
Inspired by the previous suggestions, we adopted the '-nthread' parameter in XGBoost4J-Spark without GPUs. The results show that it works: '-nthread' helps with optimization. However, when launching distributed training with GPUs, changing "-nthread=1" to "-nthread=6" seems to have no effect. The full parameter set is "--num-executors 1 --executor-cores 6 --conf spark.task.cpus=6 -numWorkers=1 -nthread=6 -treeMethod=gpu_hist".
What causes this?
Sincerely


4 participants