MXIndexedRecordIO doesn't work with HDFS #10052

sethah · 2018-03-09T17:02:46Z

sethah
Mar 9, 2018

Description

MXIndexedRecordIO doesn't work with HDFS. This is because the indexed file is assumed to be a local file, and it is opened and parsed in python using open(self.idx_path, self.flag) which won't work when and HDFS path is provided.

One consequence of this is that in gluon, you cannot use the RecordFileDataset since it requires MXIndexedRecordIO.

Environment info (Required)

----------Python Info----------
('Version      :', '2.7.13')
('Compiler     :', 'GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)')
('Build        :', ('default', 'Dec 20 2016 23:05:08'))
('Arch         :', ('64bit', ''))
------------Pip Info-----------
('Version      :', '9.0.1')
('Directory    :', '/Users/shendrickson/anaconda2/lib/python2.7/site-packages/pip')
----------MXNet Info-----------
('Version      :', '0.12.0')
('Directory    :', '/Users/shendrickson/.local/lib/python2.7/site-packages/mxnet-0.12.0-py2.7.egg/mxnet')
Hashtag not found. Not installed from pre-built package.
----------System Info----------
('Platform     :', 'Darwin-16.6.0-x86_64-i386-64bit')
('system       :', 'Darwin')
('node         :', 'shendrickson-MBP.local')
('release      :', '16.6.0')
('version      :', 'Darwin Kernel Version 16.6.0: Fri Apr 14 16:21:16 PDT 2017; root:xnu-3789.60.24~6/RELEASE_X86_64')
----------Hardware Info----------
('machine      :', 'x86_64')
('processor    :', 'i386')
machdep.cpu.extfeatures: SYSCALL XD 1GBPAGE EM64T LAHF LZCNT PREFETCHW RDTSCP TSCI
machdep.cpu.leaf7_features: SMEP ERMS RDWRFSGS TSC_THREAD_OFFSET BMI1 HLE AVX2 BMI2 INVPCID RTM SMAP RDSEED ADX IPT SGX FPU_CSDS MPX CLFSOPT
machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C
machdep.cpu.brand_string: Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz

Package used (Python/R/Scala/Julia):
Python

Build info (Required if built from source)

Compiler (gcc/clang/mingw/visual studio):

MXNet commit hash:
ccb08fb

Minimum reproducible example

conda create -n mx_dev python=3.6
source activate mx_dev
conda install -c anaconda openblas
conda install -c anaconda opencv
conda install ipython

# in incubator-mxnet/config.mk
# the additional link flags you want to add
ADD_LDFLAGS =-L$(CONDA_ENV)/lib -L/usr/local/Cellar/hadoop/2.8.1/lib/native/

# the additional compile flags you want to add
ADD_CFLAGS =-g -O0 -I$(CONDA_ENV)/include -I/usr/local/Cellar/hadoop/2.8.1/include/

export CONDA_ENV=~/anaconda2/envs/mx_dev/ 
make -j6 USE_OPENCV=1 USE_BLAS=openblas USE_HDFS=1 LIBJVM=/Library/Java/JavaVirtualMachines/jdk1.8.0_162.jdk/Contents/Home/jre/lib/server/

cd python
python setup.py install

# start HDFS
/usr/local/Cellar/hadoop/2.8.1/sbin/start-dfs.sh

export CLASSPATH=$(hadoop classpath --glob)
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_ENV/lib/
python -c "import mxnet as mx; record = mx.recordio.MXIndexedRecordIO('hdfs:///tmp/data.idx', 'hdfs:///tmp/data.rec', 'w')"

stack trace:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/Users/shendrickson/anaconda2/envs/mx_dev/lib/python3.6/site-packages/mxnet-1.1.0-py3.6.egg/mxnet/recordio.py", line 199, in __init__
    super(MXIndexedRecordIO, self).__init__(uri, flag)
  File "/Users/shendrickson/anaconda2/envs/mx_dev/lib/python3.6/site-packages/mxnet-1.1.0-py3.6.egg/mxnet/recordio.py", line 69, in __init__
    self.open()
  File "/Users/shendrickson/anaconda2/envs/mx_dev/lib/python3.6/site-packages/mxnet-1.1.0-py3.6.egg/mxnet/recordio.py", line 205, in open
    self.fidx = open(self.idx_path, self.flag)
FileNotFoundError: [Errno 2] No such file or directory: 'hdfs:///tmp/data.idx'

Note that: python -c "import mxnet as mx; record = mx.recordio.MXRecordIO('hdfs:///tmp/data.rec', 'w')" works fine.

sethah · 2018-03-09T17:07:55Z

sethah
Mar 9, 2018
Author

BTW, I have a rough version of a patch to fix this. Basically, instead of opening the file using Python, we just open the file and store the hashmap in C++, using dmlc::Stream::Create since it supports HDFS. The problem with this solution is that I'm not sure how to make the key support arbitrary types (is that really necessary?) - since the current implementation in Python allows you to use any hashable Python type as the key.

Anyway, the reason I think this is a problem is that I'm not sure how to train on data sitting in HDFS otherwise. I had an example use case doing transfer learning, where the featurized output would be saved as indexed recordio, so that you can use the gluon data loader for training the final layers. It would be great if these records could be stored on HDFS.

0 replies

byronyi · 2020-03-08T10:22:06Z

byronyi
Mar 8, 2020

Same question there. Seems MXNet has abandoned HDFS support in one way or another.

0 replies

leezu · 2020-03-09T16:36:04Z

leezu
Mar 9, 2020
Collaborator

cc @zhreshold

0 replies

zhreshold · 2020-03-09T20:34:28Z

zhreshold
Mar 9, 2020
Collaborator

HDFS seems to be optionally enabled: https://github.com/apache/incubator-mxnet/blob/master/make/config.mk#L175

Have you tested the build with USE_HDFS = 1 ?

1 reply

Xiaoxiong-Liu Apr 8, 2021

hello, I install mxnet by pip without build.
So if I want HDFS supported, I have to build from source with USE_HDFS = 1?
thanks.
btw, the link you post is invalid now.

byronyi · 2020-03-09T22:56:20Z

byronyi
Mar 9, 2020

I did, that’s why RecordIO works with HDFS. But this issue is concerned about IndexRecordIO.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

MXIndexedRecordIO doesn't work with HDFS #10052

{{title}}

Replies: 5 comments 1 reply

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

MXIndexedRecordIO doesn't work with HDFS #10052

sethah Mar 9, 2018

Description

Environment info (Required)

Build info (Required if built from source)

Minimum reproducible example

Replies: 5 comments · 1 reply

sethah Mar 9, 2018 Author

byronyi Mar 8, 2020

leezu Mar 9, 2020 Collaborator

zhreshold Mar 9, 2020 Collaborator

Xiaoxiong-Liu Apr 8, 2021

byronyi Mar 9, 2020

sethah
Mar 9, 2018

Replies: 5 comments 1 reply

sethah
Mar 9, 2018
Author

byronyi
Mar 8, 2020

leezu
Mar 9, 2020
Collaborator

zhreshold
Mar 9, 2020
Collaborator

byronyi
Mar 9, 2020