This repository has been archived by the owner on Jul 16, 2020. It is now read-only.

"Frame length should be positive" problem in XGBoost with CPU (Mortgage-large) #71

peizhaoliu opened this issue Jan 21, 2020 · 4 comments

@peizhaoliu
Dear author,

I followed the guide at https://github.com/rapidsai/spark-examples/blob/master/getting-started-guides/on-prem-cluster/standalone-scala.md.
When I launch distributed training without GPUs (tree method hist), I set the parameters as follows: "--num-executors 1 --executor-cores 19 --conf spark.cores.max=19 --conf spark.task.cpus=1 --class ai.rapids.spark.examples.mortgage.CPUMain -numWorkers=19 -treeMethod=hist"
However, the tasks of the stage "foreachPartition at XGBoost.scala:703" always stay stuck in "running". A few hours after submitting the job, we got this feedback:
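For context, a full launch command with the parameters above would look roughly like this; the master URL, jar name, and any dataset arguments are placeholders I added, not details from the original report:

```shell
# Hypothetical spark-submit layout for the CPU (hist) Mortgage run.
# Master URL, jar name, and data paths are placeholders.
spark-submit \
  --master spark://<master-host>:7077 \
  --num-executors 1 \
  --executor-cores 19 \
  --conf spark.cores.max=19 \
  --conf spark.task.cpus=1 \
  --class ai.rapids.spark.examples.mortgage.CPUMain \
  sample_xgboost_apps-<version>-jar-with-dependencies.jar \
  -numWorkers=19 \
  -treeMethod=hist
```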
java.lang.IllegalArgumentException: Frame length should be positive: -9223371863126827765
	at org.spark_project.guava.base.Preconditions.checkArgument(Preconditions.java:119)
	at org.apache.spark.network.util.TransportFrameDecoder.decodeNext(TransportFrameDecoder.java:134)
	at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:81)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1359)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:935)
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:138)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
	at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
	at java.lang.Thread.run(Thread.java:748)

Could you please offer some tips on this issue? Thanks.

Sincerely

@peizhaoliu peizhaoliu changed the title "Frame length should be positive" problem in XGBoost with CPU Mortgage-large "Frame length should be positive" problem in XGBoost with CPU (Mortgage-large) Jan 21, 2020
@wjxiz1992
Member

Hi,
Did you set "nthread" to 1? "XGBoost4J-Spark requires that all of nthread * numWorkers cores should be available before the training runs."
You can add "-nthread=1" directly to the end of your command.

You could also try another parameter set, such as:

"--num-executors 1 --executor-cores 1 --conf spark.task.cpus=1 -numWorkers=19 -nthread=1 -treeMethod=hist"
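Assembled into a full command, that suggestion would look roughly like this; the master URL and jar name are placeholders of mine, not values from the thread:

```shell
# Hypothetical single-core-per-task layout with nthread pinned to 1,
# so nthread * numWorkers (1 * 19) matches the cores requested.
# Master URL and jar name are placeholders.
spark-submit \
  --master spark://<master-host>:7077 \
  --num-executors 1 \
  --executor-cores 1 \
  --conf spark.task.cpus=1 \
  --class ai.rapids.spark.examples.mortgage.CPUMain \
  sample_xgboost_apps-<version>-jar-with-dependencies.jar \
  -numWorkers=19 \
  -nthread=1 \
  -treeMethod=hist
```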

For the hanging problem, there is a "timeout_request_workers" parameter that may help (but not always). It can reduce the hang time when your app cannot get enough resources from Spark. There are also other reasons why the program might hang.

To see where it hangs, you can open Spark's web UI and go to "Executors" to inspect the "Thread Dump".

@peizhaoliu
Author

Hi,
When launching a GPU Mortgage example on Spark standalone, we get the following error:
2020-02-25 15:48:08 ERROR NativeDepsLoader:55 - Could not load cudf jni library...
java.lang.UnsatisfiedLinkError: /tmp/rmm4687696644621164964.so: libnvToolsExt.so.1: cannot open shared object file: No such file or directory
	at java.lang.ClassLoader$NativeLibrary.load(Native Method)
	at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1941)
	at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1824)
	at java.lang.Runtime.load0(Runtime.java:809)
	at java.lang.System.load(System.java:1086)
	at ai.rapids.cudf.NativeDepsLoader.loadDep(NativeDepsLoader.java:83)
	at ai.rapids.cudf.NativeDepsLoader.loadNativeDeps(NativeDepsLoader.java:51)
	at ai.rapids.cudf.Table.<clinit>(Table.java:31)
	at ml.dmlc.xgboost4j.scala.spark.rapids.CSVPartitionReader.readToTable(GpuCSVScan.scala:214)
	at ml.dmlc.xgboost4j.scala.spark.rapids.CSVPartitionReader.readBatch(GpuCSVScan.scala:194)
	at ml.dmlc.xgboost4j.scala.spark.rapids.CSVPartitionReader.next(GpuCSVScan.scala:230)

I explored the dependency jar and found "libxgboost4j.so" and "librmm.so" inside. So why can't it load the cudf JNI library? Could you please give me some tips to solve this problem?

@wjxiz1992
Member

Hi, I guess you probably used the wrong version of the cudf jar. You should choose the version matching your CUDA version, e.g. mvn package -Dcuda.classifier=cuda10 if your CUDA is 10.0. You can check your CUDA version with "cat /usr/local/cuda/version.txt".
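Put together, the check-then-rebuild steps would be something like the following; the classifier value shown is just the CUDA 10.0 example from above, and the version.txt path assumes the default CUDA install location:

```shell
# Check the locally installed CUDA toolkit version
# (assumes the default /usr/local/cuda install path).
cat /usr/local/cuda/version.txt

# Rebuild the cudf jar with the classifier matching that version,
# e.g. cuda10 for CUDA 10.0.
mvn package -Dcuda.classifier=cuda10
```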

@peizhaoliu
Author

peizhaoliu commented Mar 3, 2020

Thanks for your tip!
Inspired by the previous suggestions, we adopted the '-nthread' parameter in XGBoost4J-Spark without GPUs. The results show that it works: '-nthread' helps with optimization. However, when launching distributed training with GPUs, changing "-nthread=1" to "-nthread=6" seems to have no effect. The full parameter set is "--num-executors 1 --executor-cores 6 --conf spark.task.cpus=6 -numWorkers=1 -nthread=6 -treeMethod=gpu_hist".
What causes this?
Sincerely


4 participants