install tcmalloc #4564

LvHang · 2021-06-13T16:39:01Z

This is a PR for installing the ``tcmalloc'' library to speed up the decoding-like codes.

@huangruizhe , hi ruizhe, when you are free, could you please try to run the PR on the grid of CLSP. You can run the tools/extras/install_tcmalloc.sh to install it and include the --enable-tcmalloc'' option when you execute configure'' command. It's ok for me when I test on my local grid.

After that, may be you can use ``decoder/lattice-simple-decoder or simple-decoder'' (which use the build-in data structure such as unordered_map rather than kaldi's hash-lish) to test the speedup on any dataset. From my experience, it will speed up the former on a lot and the latter a little bit. Thanks.

francisr · 2021-06-14T22:18:21Z

Do you expect this to have any impact of the lattice-faster-decoder?

danpovey · 2021-06-15T04:14:04Z

I think it has more effect when you use lattice-decode-faster-parallel, with multiple threads I think regular malloc is slower.

rickychanhoyin · 2021-06-15T08:04:42Z

libunwind is installed with the script but why not enable during the gperftools installation ?

jtrmal · 2021-06-15T08:08:54Z

In my opinion adding a specific allocator into the mainline just complicate things. For researchers (especially beginners) it will be more of an hassle than use. For deployment, people might need to use _their own_ allocator, so again no benefit. So tl;dr; -- I don't see a substantial user group benefiting strongly from this. y.

…

On Tue, Jun 15, 2021 at 10:05 AM rickychanhoyin ***@***.***> wrote: libunwind is installed with the script but why not enable during the gperftools installation ? — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#4564 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ACUKYX6DGNJ4UTXRRKOYW2TTS4CTVANCNFSM46T4W7QQ> .

danpovey · 2021-06-15T12:34:05Z

We could perhaps test this as part of that optimization thingy that was discussed on kaldi-help, without actually merging it.

kkm000 · 2021-06-15T21:12:30Z

@danpovey, @jtrmal, IMO, tcmalloc grew from a simple library into a monstrosity. It requires C++17 to function and Bazel to build. I remember the issue, and we once compared tbbmalloc and tcmalloc. Both provided a comparable and non-negligible performance boost, but tcmalloc was much simpler to integrate. The latter part no longer holds.

Of course, we have an option to get a 2-year-old version that we've tested back then. It just works.

But right now, I'd rather go with tbbmalloc. The obvious advantage is you do not need to build it. TBB is installed the same way as MKL, and you do not need to build it. Also, TBB includes nice things like a work-stealing scheduler (what cudadecoder/threadpool-lite.h does). I was surprised to learn that TBB is used extensively by the LHCb CMS experiment (uxlfoundation/oneTBB#313). Anyway, tbbmalloc is a separate library from the rest; we do not need to depend on the full TBB. And it of course can be optional. MKL is currently optional but almost non-optional. As long as we install one, it's trivial to install both. The simplicity of use is a no less important factor than performance.

Tangentially:

When I'm thinking about the decoder code, it seems to me that it would make sense to make it more vectorization-friendly.
Speaking of MKL: the last version that can be installed by our script is 2019.something, or maybe 2020.1 Intel unified the perf libraries under oneAPI label. The installation procedure is same (the same APT/RPM feeds, only different package names), but there are quite a few advantages with the new model, I won't go into that. The install directory for MKL has changed tho.

kkm000 · 2021-06-15T21:16:10Z

RE CMS, if anyone is curious: https://github.com/cms-sw/cmssw. That's quite a lot of code, to say the least.

kkm000 · 2021-06-17T04:58:15Z

@danpovey, @jtrmal, @LvHang: Have you ever gave a thought to the arena allocator? The idea is you grab a chunk of contiguous pages (uncommitted), and have a single pointer pointing to the start of the first page range. When you malloc, see if you have enough committed pages ahead of this pointer, considering the remaining space in the page currently under pointer, commit more if necessary, return the pointer from malloc and also advance the stored pointer by the requested amount. free() does nothing. After the decode is complete, the whole arena is freed. The problem is we allocate a helluva lot small objects. The downside is that arena memory is never reused, but in applications where it's ok (I think decoding of a single lattice qualifies), it's the fastest allocator there is, and with zero memory overhead.

danpovey · 2021-06-17T11:40:01Z

In decoding we actually free a lot of tokens; that type of allocator might use up too much memory-- but it should be possible to add them to a linked list by some overloading of the token structure itself, so they can be reused with in the same decoder object.

…

On Thu, Jun 17, 2021 at 12:58 PM kkm000 ***@***.***> wrote: @danpovey <https://github.com/danpovey>, @jtrmal <https://github.com/jtrmal>, @LvHang <https://github.com/LvHang>: Have you ever gave a thought to the arena allocator? The idea is you grab a chunk of contiguous pages (uncommitted), and have a single pointer pointing to the start of the first page range. When you malloc, see if you have enough committed pages ahead of this pointer, considering the remaining space in the page currently under pointer, commit more if necessary, return the pointer from malloc and also advance the stored pointer by the requested amount. free() does nothing. After the decode is complete, the whole arena is freed. The problem is we allocate a helluva lot small objects. The downside is that arena memory is never reused, but in applications where it's ok (I think decoding of a single lattice qualifies), it's the fastest allocator there is, and with zero memory overhead. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#4564 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAZFLO5X75XUVCPDVVDR6HTTTF6HFANCNFSM46T4W7QQ> .

rickychanhoyin · 2021-06-17T12:22:23Z

I only saw several % speed up with tcmalloc in my "slim version of lattice-faster-decoder.cc (slim means simplified fst data structure and without backpointer data structure" running in single thread, so not really see big speeding effect at all. In 64bits linux, you may want to build the gperftools with --enable-frame-pointers option (also the INSTALL file suggst to use libunwind-0.99-beta.tar.gz instead of other versions).
. The linking was actually simple, just with -ltcmalloc_minimal or with the libtcmalloc_minimal.a (when static).

kkm000 · 2021-06-17T14:23:49Z

@danpovey, if they are all of a fixed size, they would certainly benefit from the reuse. Same page trick, with each page chopped into fixed blanks, plus a free/used bitmap. I think TBB even had a memory_resource like this. The only problem is the memory_resource is a c++17 thing (std::pmr namespace) :(

Is it fair to say that this struct Token the most frequently allocated/freed object?

I want to profile it. Ugh, if I only had time!

LvHang · 2021-06-18T17:37:21Z

@rickychanhoyin , I think you are right. The speedup is limited for the decoders of Kaldi which use ''HashList''. It will be much useful for some decoders which use build-in ``unordered_map''. IIRC, Dan wants to replace the ''HashList'' data strucutre so that the code is easier to understand for the users.

From my knowledge, the ''HashList'' data structure uses the similar pool allocating method. It likes the arena allocator with a similar ''same page chopped trick'' method as @kkm000 mentioned above. The basic elements for a decoder are tokens so the memory page is chopped into its size directly. In ''HashList'', it has a pointer ''freed_head_'' to manage the used/could-reused elements in a linked list (i.e. free-list). The ``HashList'' object always allocates a block of elements/chunks (e.g. 1024) together and makes the ''freed_head_'' points to the head. During the decoding, when we apply a token, we will pick one from the 'free-list'; and when we delete one token, we will return it back to the 'free-list'. When a block is exhausted, it will allocate another block once. At last, until the decoding process is completed, all blocks will be freed. I think the ''HashList'' is the main reason why the speedup is limited in your test.

From my opinion, I think linking the tcmalloc is simple, after installing the gperf, adding -ltcmalloc, -ltcmalloc_minimal, -ltcmalloc_and_profile(if the profile is needed) to makefile is enough.
From my searching resutls , the libunwind is mainly used to reslove the trace-back conflict when we profile with gperf. The newest libunwind is version 1.5, and the 0.99 version is released 10 years ago, so I used the former. Frankly, I have not encounter the related problems. I'm not expert in this, maybe @rickychanhoyin is right or others know more.

stale · 2021-08-17T17:59:32Z

This issue has been automatically marked as stale by a bot solely because it has not had recent activity. Please add any comment (simply 'ping' is enough) to prevent the issue from being closed for 60 more days if you believe it should be kept open.

kkm000 · 2021-09-06T17:45:55Z

tools/extras/install_tcmalloc.sh

+./configure --prefix=$PWD/install || exit 1;
+make || exit 1;
+make install || exit 1;
+)


exit exits the subshell here, i.e. the script continues past the closing ). Consider set -e, then you can get rid of most || exit checks. Otherwise, do this:

Suggested change

./configure --prefix=$PWD/install || exit 1;

make || exit 1;

make install || exit 1;

)

./configure --prefix=$PWD/install &&

make &&

make install

) || exit

All pipelining operators at end of line (&&, ||, |, |&) continue the pipe to next line without a need for the \.
|| exit preserves the non-zero exit code, explicit 1 is unnecessary.

kkm000 · 2021-09-06T17:58:33Z

tools/extras/install_tcmalloc.sh

+  echo "$0: Assuming gperftools is already installed Please delete the directory"
+  echo "./gperftools if you need to reinstall."
+  mv gproftools gproftools_backup


The message does not say what is really going on. Also gperftools != gproftools. You create the former below. Consider

Suggested change

echo "$0: Assuming gperftools is already installed Please delete the directory"

echo "./gperftools if you need to reinstall."

mv gproftools gproftools_backup

echo "$0: existing 'gperftools' subdirectory is renamed 'gperftools.bak'"

mv -f gperftools gperftools.bak || exit

mv -f needed in case another gperftools.bak exists.

|| exit handles mv's failure. set -e sometimes works better, but then you need to take care not to return non-zero from any pipeline.

The extension .bak should be already ignored in .gitignore, either at the root level or in tools/. If not, it's a good idea to have added it (as *.bak; this will ignore both matching files and directories). In any case, make sure that all downloaded files and created directories are ignored by Git.

kkm000 · 2021-09-06T18:20:41Z

tools/extras/install_tcmalloc.sh

+(
+  wd=$PWD
+  echo export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$wd/gperftools/lib
+  echo export PATH=$PATH:$wd/gperftools/bin
+) >> env.sh


$LD_LIBRARY_PATH and $PATH will be expanded before writing to env.sh; you probably want to keep the $ literal in env.sh, not capture the current PATH.

A subshell is unnecessary; you can redirect output of a block.

From "man ld.so" on LD_LIBRARY_PATH: A zero-length directory name indicates the current working directory. This is not what you want if LD_LIBRARY_PATH is not set. An extra check for empty/unset variable is needed not to add a :. :+ in a variable substitution makes it expand to empty string, otherwise to the expasnion of an expression after the :+. PATH is usually set, so it's safe to assume $PATH: does not expand to : (the rule for PATH is the same: an empty string after splitting at every : means current directory).

Suggested change

(

wd=$PWD

echo export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$wd/gperftools/lib

echo export PATH=$PATH:$wd/gperftools/bin

) >> env.sh

{

echo export LD_LIBRARY_PATH=\${LD_LIBRARY_PATH:+\${LD_LIBRARY_PATH}:}$PWD/gperftools/lib

echo export PATH=\$PATH:$PWD/gperftools/bin

} >> env.sh

Also:

gperftools/bin will not be built if we use minimal tcmalloc.

Why do we need LD_LIBRARY_PATH? We're building with -ltcmalloc[_minimal], so the path to the .so is baked into a binary anyway. I think it's needed only for libunwind?

kkm000 · 2021-09-06T19:02:36Z

tools/extras/install_tcmalloc.sh

+git clone https://github.com/gperftools/gperftools.git gperftools
+
+# install tcmalloc
+wd=$PWD
+cd gperftools
+bash autogen.sh
+./configure --prefix=$PWD || exit 1;
+make || exit 1;
+make install || exit 1;
+cd ..


We did not have a dependency on Autoconf/Automake before. Better, get a pre-autoconf'ed distribution tarball from https://github.com/gperftools/gperftools/releases/download/gperftools-2.9.1/gperftools-2.9.1.tar.gz.

Use the same subshell pattern to keep PWD?

Do we need the whole thing, with the heap profiler and heap checker? I think that for the default user build, the default should be -ltcmalloc_minimal, which skips all the stuff. I think 5 people at most need it. :) This also gets rid of the dependency on libunwind. Users who want and know how to profile are certainly qualified to reconfigure and build as much of the toolset as they want (and can install libunwind with a package manager). If you do this, change the library name in 'configure' too.

Suggested change

git clone https://github.com/gperftools/gperftools.git gperftools

# install tcmalloc

wd=$PWD

cd gperftools

bash autogen.sh

./configure --prefix=$PWD || exit 1;

make || exit 1;

make install || exit 1;

cd ..

wget https://github.com/gperftools/gperftools/releases/download/gperftools-2.9.1/gperftools-2.9.1.tar.gz &&

tar xzf gperftools-2.9.1.tar.gz &&

mv gperftools-2.9.1 gperftools || exit

# install tcmalloc

(

cd gperftools &&

./configure --prefix=$PWD --enable-minimal --disable-debugalloc --disable-static &&

make &&

make install

) || exit

kkm000 · 2021-09-06T19:14:50Z

tools/extras/install_tcmalloc.sh

+  echo "You must call this script from the tools/ directory" && exit 1;
+
+# prepare libunwind
+echo "****() Installing libunwind"


libtcmalloc_minimal does not depend on libunwind, see comments below. We can shed this dependency.

libunwind comes with most distros and can be installed with apt (libunwind-dev) or yum (libunwid-devel), unless the version 1.5-rc is strictly required. Also, clang comes with its own version; I do not know how compatible they are. I'd left the choice to the user.

kkm000 · 2021-09-06T19:17:30Z

tools/extras/install_tcmalloc.sh

+# install libunwind
+# The installation directory can be changed. But be careful, set the prefix as 
+#`pwd` will cause recursive problem.
+tar vxzf libunwind-1.5-rc1.tar.gz


tar is extremely noisy with v. Up to you, but the listing of unpacked files is hardly useful to the end user.

kkm000 · 2021-09-06T19:30:32Z

src/configure

+    else
+      failure "Can't find libtcmalloc.so in ../tools/gperftools,
+               refer to ../tools/extras/install_tcmalloc.sh for instruction."
+    fi
+  else
+    failure "--enable-tcmalloc switched on, but can't find tcmalloc in 
+             ../tools/gperftools, refer to ../tools/extras/install_tcmalloc.sh 
+             for instruction."
+  fi


I would merge both cases into one. "Can't find tcmalloc in tools/gperftools. Run 'extras/install_tcmalloc.sh' in 'tools/' to install." is good enough for both. install_tcmalloc.sh will print an error if the user tries to do what message says, i.e run ../tools/extras/install_tcmalloc.sh. "Refer for instructions" is a lie, the script installs stuff without giving any :)

If you split lines, unindent the continuations, it will look weird when printed. failure takes any number of arguments and prints them in a single string with a space between, so if a message is concise enough, this will work to split First half and second half into two code lines:

failure "First half and" \ "second half"

Just FYI, you may stick $'\n' into a string to break lines, outside double quotes, but w/o space, or else Line 2 will start with a space, e.g.: "Line 1"$'\n'"Line 2" or $'Line 1\nLine 2' do the same thing. You can also do \' for ' inside a $-prefixed string.

kkm000 · 2021-09-06T19:51:59Z

tools/extras/install_tcmalloc.sh

+  echo export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$wd/gperftools/lib
+  echo export PATH=$PATH:$wd/gperftools/bin
+) >> env.sh
+


Remove empty line

LvHang · 2021-09-08T13:51:26Z

@kkm000, thank you for your careful review and patient guidance. I tried to update the files according to your comments.

I removed the part about 'libunwind' and left a few comments. I left the 'libunwind' part before since I used it:). I agree that the users who need and know how to profile are qualified to reconfigure and build as they want. Getting rid of it and using 'tcmalloc_minimal' will be more compact.
I use set -e to replace the redundant exits, since it seems we need to conduct further actions according to the different return values.
I haven't found the .bak extension in .gitignore so I added.
Now, we use the -ltcmalloc_minimal and I prefer to bake the path with -Wl, -rpath into the binary as I think it can help to avoid some LD_LIBRARY_PATH asynchronous issues in the grid. Adding the increment of PATH and LD_LIBRARY_PATH to env.sh is needn't now.
I didn't take care of the 'failure' function before. Thank you for your information. As there isn't a -e option in the echo command of the function, the 'dollar-sign single quote' is helpful, thanks.

I tested the updated files and it seems ok on my machine.

kkm000 · 2021-09-08T20:46:29Z

LGTM, thanks, especially for updating the commentary in the file! Merging.

arlofaria · 2021-09-15T16:49:56Z

tools/extras/install_tcmalloc.sh

+# prepare tcmalloc
+if [ -d gperftools ]; then
+  echo "$0: existing 'gperftools' subdirectory is renamed 'gperftools.bak'"
+  mv -f gpreftools gpreftools.bak


Thanks！ #4629

install tcmalloc

422126a

replace backticks

b55292f

kkm000 mentioned this pull request Jun 18, 2021

On the fly contextual adaptation, #4571

Open

stale bot added the stale Stale bot on the loose label Aug 17, 2021

Update install_tcmalloc.sh

a2644b9

stale bot removed the stale Stale bot on the loose label Sep 4, 2021

kkm000 requested changes Sep 6, 2021

View reviewed changes

update the tcmalloc installtion and get rid of libunwind part

659a50a

kkm000 approved these changes Sep 8, 2021

View reviewed changes

kkm000 merged commit 47ea7ae into kaldi-asr:master Sep 8, 2021

arlofaria reviewed Sep 15, 2021

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

install tcmalloc #4564

install tcmalloc #4564

LvHang commented Jun 13, 2021

francisr commented Jun 14, 2021

danpovey commented Jun 15, 2021

rickychanhoyin commented Jun 15, 2021

jtrmal commented Jun 15, 2021 via email

danpovey commented Jun 15, 2021

kkm000 commented Jun 15, 2021

kkm000 commented Jun 15, 2021

kkm000 commented Jun 17, 2021

danpovey commented Jun 17, 2021 via email

rickychanhoyin commented Jun 17, 2021

kkm000 commented Jun 17, 2021

LvHang commented Jun 18, 2021 •

edited

Loading

stale bot commented Aug 17, 2021

kkm000 Sep 6, 2021

kkm000 Sep 6, 2021

kkm000 Sep 6, 2021

kkm000 Sep 6, 2021

kkm000 Sep 6, 2021

kkm000 Sep 6, 2021

kkm000 Sep 6, 2021

kkm000 Sep 6, 2021

LvHang commented Sep 8, 2021

kkm000 commented Sep 8, 2021

arlofaria Sep 15, 2021

LvHang Sep 16, 2021

install tcmalloc #4564

install tcmalloc #4564

Conversation

LvHang commented Jun 13, 2021

francisr commented Jun 14, 2021

danpovey commented Jun 15, 2021

rickychanhoyin commented Jun 15, 2021

jtrmal commented Jun 15, 2021 via email

danpovey commented Jun 15, 2021

kkm000 commented Jun 15, 2021

kkm000 commented Jun 15, 2021

kkm000 commented Jun 17, 2021

danpovey commented Jun 17, 2021 via email

rickychanhoyin commented Jun 17, 2021

kkm000 commented Jun 17, 2021

LvHang commented Jun 18, 2021 • edited Loading

stale bot commented Aug 17, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

LvHang commented Sep 8, 2021

kkm000 commented Sep 8, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

LvHang commented Jun 18, 2021 •

edited

Loading