
[Bug] waymo dataset to kitti conversion stuck at ] 0/158081, elapsed: 0s, ETA: #2796

Open · 3 tasks done
s95huang opened this issue Oct 27, 2023 · 4 comments

@s95huang
Prerequisite

Task

I'm using the official example scripts/configs for the officially supported tasks/models/datasets.

Branch

main branch https://github.com/open-mmlab/mmdetection3d

Environment

sys.platform: linux
Python: 3.8.18 | packaged by conda-forge | (default, Oct 10 2023, 15:44:36) [GCC 12.3.0]
CUDA available: True
numpy_random_seed: 2147483648
GPU 0: NVIDIA GeForce RTX 3090
CUDA_HOME: /usr/local/cuda-11.4
NVCC: Cuda compilation tools, release 11.4, V11.4.152
GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
PyTorch: 1.12.0+cu116
PyTorch compiling details: PyTorch built with:

  • GCC 9.3
  • C++ Version: 201402
  • Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  • Intel(R) MKL-DNN v2.6.0 (Git Hash 52b5f107dd9cf10910aaa19cb47f3abf9b349815)
  • OpenMP 201511 (a.k.a. OpenMP 4.5)
  • LAPACK is enabled (usually provided by MKL)
  • NNPACK is enabled
  • CPU capability usage: AVX2
  • CUDA Runtime 11.6
  • NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
  • CuDNN 8.3.2 (built against CUDA 11.5)
  • Magma 2.6.1
  • Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.6, CUDNN_VERSION=8.3.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.12.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,

TorchVision: 0.13.0+cu116
OpenCV: 4.5.5
MMEngine: 0.9.0
MMDetection: 3.0.0
MMDetection3D: 1.1.0+4ff1361
spconv2.0: True

Reproduces the problem - code sample

```
python tools/create_data.py waymo --root-path ./data/waymo/ --out-dir ./data/waymo/ --workers 0 --extra-tag waymo
```

Reproduces the problem - command or script

```
python tools/create_data.py waymo --root-path ./data/waymo/ --out-dir ./data/waymo/ --workers 0 --extra-tag waymo
```

Reproduces the problem - error message

With `--workers 0`, I get:

Start converting ...
Traceback (most recent call last):
  File "tools/create_data.py", line 327, in <module>
    waymo_data_prep(
  File "tools/create_data.py", line 204, in waymo_data_prep
    converter.convert()
  File "/mnt/0c39e9c4-f324-420d-a1e9-f20a41d147a8/personal_repos/LoopX/mmdetection3d/tools/dataset_converters/waymo_converter.py", line 112, in convert
    mmengine.track_parallel_progress(self.convert_one, range(len(self)),
  File "/home/s95huang/anaconda3/envs/openmmlab/lib/python3.8/site-packages/mmengine/utils/progressbar.py", line 191, in track_parallel_progress
    pool = init_pool(nproc, initializer, initargs)
  File "/home/s95huang/anaconda3/envs/openmmlab/lib/python3.8/site-packages/mmengine/utils/progressbar.py", line 133, in init_pool
    return Pool(process_num)
  File "/home/s95huang/anaconda3/envs/openmmlab/lib/python3.8/multiprocessing/context.py", line 119, in Pool
    return Pool(processes, initializer, initargs, maxtasksperchild,
  File "/home/s95huang/anaconda3/envs/openmmlab/lib/python3.8/multiprocessing/pool.py", line 205, in __init__
    raise ValueError("Number of processes must be at least 1")
ValueError: Number of processes must be at least 1
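
For context, this ValueError comes from Python's multiprocessing rather than from the converter itself: mmengine.track_parallel_progress hands the --workers value straight to multiprocessing.Pool, which requires at least one process. Below is a minimal, self-contained sketch of the failure mode and an obvious guard; the `work` function and the fallback to the serial tracker are illustrative assumptions, not the upstream code, and in practice simply passing --workers 1 or higher avoids this error.

```python
import mmengine

def work(i):
    # stand-in for the converter's convert_one(); the real function reads one
    # Waymo TFRecord segment and writes KITTI-format files
    return i * i

num_tasks, workers = 10, 0  # --workers 0 reproduces the ValueError

if workers < 1:
    # multiprocessing.Pool(0) raises "Number of processes must be at least 1",
    # so run serially instead of creating an empty worker pool.
    results = mmengine.track_progress(work, range(num_tasks))
else:
    results = mmengine.track_parallel_progress(work, range(num_tasks), workers)
```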

With workers set to 1/8/10/16 on this i7-13700, or 16/32 on an AMD Threadripper 2950, using mmdetection3d v1.1, v1.3, and the main dev-1.x branch,

the Waymo-to-KITTI conversion gets stuck at

[>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>] 150/150, 0.0 task/s, elapsed: 4830s, ETA:     0s

Finished ...
Start converting ...
completed: 0, elapsed: 0s

Finished ...
created txt files indicating what to collect in  ['training', 'validation', 'testing', 'testing_3d_camera_only_detection']
Generate info. this may take several minutes.
[>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>] 158081/158081, 75.6 task/s, elapsed: 2090s, ETA:     0s
[                                                  ] 0/158081, elapsed: 0s, ETA:

for days, and the system monitor shows no disk read/write activity and very low CPU usage.
It appears the code is stuck in the call to

mmengine.dump

(screenshot of the stuck call attached)
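
One way to narrow this down (a hypothetical standalone check, not part of mmdetection3d) is to time mmengine.dump on a payload roughly shaped like the generated info dicts. If a dump of a large synthetic list finishes in seconds, the hang is more likely upstream of the dump call, e.g. in gathering the infos across workers, than in the serialization itself. The field names below are made up for illustration.

```python
import time
import numpy as np
import mmengine

# Hypothetical payload, loosely shaped like per-frame info dicts; the real
# list is assembled by the info generator before mmengine.dump is called.
infos = [{'sample_idx': i,
          'lidar_points': {'num_pts_feats': 6},
          'gt_boxes': np.zeros((50, 7), dtype=np.float32)}
         for i in range(100000)]

start = time.time()
mmengine.dump(infos, '/tmp/waymo_infos_debug.pkl')  # .pkl dispatches to pickle
print(f'dumped {len(infos)} entries in {time.time() - start:.1f}s')
```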

Additional information

If needed:
the Waymo dataset is v1.4.1
the Waymo toolkit is 2.6, 1.4.9

I also tried the solutions in #2371 and #2705, and the error is still there.

The solution in #2364 does not work either, since the GPU runs out of memory.

I am currently testing the conversion on AWS and hope it works.

@s95huang
Author

s95huang commented Oct 29, 2023

Update:

I used an AWS EC2 G5 instance with 32 CPUs.
The same problem occurs, and the SSH connection is dropped after 10 minutes due to inactivity.
I changed the limit to 10 hours, and monitoring shows very low usage for the full 10 hours.

@ammaryasirnaich

@s95huang, how much RAM are you using for it?

@s95huang
Author

@s95huang, how much RAM are you using for it?

For AWS, the RAM is 128 GB or 256 GB.

My local 13700K machine has 64 GB; the Docker setup has 32 GB.
The Threadripper machine has 64 GB as well.

@ammaryasirnaich

Hmm, can you try this and see if it works for you: https://github.com/DYZhang09/SAM3D/issues/5
