-
Notifications
You must be signed in to change notification settings - Fork 5.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
install tcmalloc #4564
install tcmalloc #4564
Conversation
Do you expect this to have any impact of the lattice-faster-decoder? |
I think it has more effect when you use lattice-decode-faster-parallel, with multiple threads I think regular malloc is slower. |
libunwind is installed with the script but why not enable during the gperftools installation ? |
In my opinion adding a specific allocator into the mainline just complicate
things. For researchers (especially beginners) it will be more of an hassle
than use. For deployment, people might need to use _their own_ allocator,
so again no benefit.
So tl;dr; -- I don't see a substantial user group benefiting strongly from
this.
y.
…On Tue, Jun 15, 2021 at 10:05 AM rickychanhoyin ***@***.***> wrote:
libunwind is installed with the script but why not enable during the
gperftools installation ?
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#4564 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/ACUKYX6DGNJ4UTXRRKOYW2TTS4CTVANCNFSM46T4W7QQ>
.
|
We could perhaps test this as part of that optimization thingy that was discussed on kaldi-help, without actually merging it. |
@danpovey, @jtrmal, IMO, tcmalloc grew from a simple library into a monstrosity. It requires C++17 to function and Bazel to build. I remember the issue, and we once compared tbbmalloc and tcmalloc. Both provided a comparable and non-negligible performance boost, but tcmalloc was much simpler to integrate. The latter part no longer holds. Of course, we have an option to get a 2-year-old version that we've tested back then. It just works. But right now, I'd rather go with tbbmalloc. The obvious advantage is you do not need to build it. TBB is installed the same way as MKL, and you do not need to build it. Also, TBB includes nice things like a work-stealing scheduler (what cudadecoder/threadpool-lite.h does). I was surprised to learn that TBB is used extensively by the LHCb CMS experiment (uxlfoundation/oneTBB#313). Anyway, tbbmalloc is a separate library from the rest; we do not need to depend on the full TBB. And it of course can be optional. MKL is currently optional but almost non-optional. As long as we install one, it's trivial to install both. The simplicity of use is a no less important factor than performance. Tangentially:
|
RE CMS, if anyone is curious: https://github.com/cms-sw/cmssw. That's quite a lot of code, to say the least. |
@danpovey, @jtrmal, @LvHang: Have you ever gave a thought to the arena allocator? The idea is you grab a chunk of contiguous pages (uncommitted), and have a single pointer pointing to the start of the first page range. When you malloc, see if you have enough committed pages ahead of this pointer, considering the remaining space in the page currently under pointer, commit more if necessary, return the pointer from malloc and also advance the stored pointer by the requested amount. free() does nothing. After the decode is complete, the whole arena is freed. The problem is we allocate a helluva lot small objects. The downside is that arena memory is never reused, but in applications where it's ok (I think decoding of a single lattice qualifies), it's the fastest allocator there is, and with zero memory overhead. |
In decoding we actually free a lot of tokens; that type of allocator might
use up too much memory-- but it should be possible to add them to a linked
list by some overloading of the token structure itself, so they
can be reused with in the same decoder object.
…On Thu, Jun 17, 2021 at 12:58 PM kkm000 ***@***.***> wrote:
@danpovey <https://github.com/danpovey>, @jtrmal
<https://github.com/jtrmal>, @LvHang <https://github.com/LvHang>: Have
you ever gave a thought to the arena allocator? The idea is you grab a
chunk of contiguous pages (uncommitted), and have a single pointer pointing
to the start of the first page range. When you malloc, see if you have
enough committed pages ahead of this pointer, considering the remaining
space in the page currently under pointer, commit more if necessary, return
the pointer from malloc and also advance the stored pointer by the
requested amount. free() does nothing. After the decode is complete, the
whole arena is freed. The problem is we allocate a helluva lot small
objects. The downside is that arena memory is never reused, but in
applications where it's ok (I think decoding of a single lattice
qualifies), it's the fastest allocator there is, and with zero memory
overhead.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#4564 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAZFLO5X75XUVCPDVVDR6HTTTF6HFANCNFSM46T4W7QQ>
.
|
I only saw several % speed up with tcmalloc in my "slim version of lattice-faster-decoder.cc (slim means simplified fst data structure and without backpointer data structure" running in single thread, so not really see big speeding effect at all. In 64bits linux, you may want to build the gperftools with --enable-frame-pointers option (also the INSTALL file suggst to use libunwind-0.99-beta.tar.gz instead of other versions). |
@danpovey, if they are all of a fixed size, they would certainly benefit from the reuse. Same page trick, with each page chopped into fixed blanks, plus a free/used bitmap. I think TBB even had a memory_resource like this. The only problem is the memory_resource is a c++17 thing (std::pmr namespace) :( Is it fair to say that this struct Token the most frequently allocated/freed object? I want to profile it. Ugh, if I only had time! |
@rickychanhoyin , I think you are right. The speedup is limited for the decoders of Kaldi which use ''HashList''. It will be much useful for some decoders which use build-in ``unordered_map''. IIRC, Dan wants to replace the ''HashList'' data strucutre so that the code is easier to understand for the users. From my knowledge, the ''HashList'' data structure uses the similar pool allocating method. It likes the arena allocator with a similar ''same page chopped trick'' method as @kkm000 mentioned above. The basic elements for a decoder are tokens so the memory page is chopped into its size directly. In ''HashList'', it has a pointer ''freed_head_'' to manage the used/could-reused elements in a linked list (i.e. free-list). The ``HashList'' object always allocates a block of elements/chunks (e.g. 1024) together and makes the ''freed_head_'' points to the head. During the decoding, when we apply a token, we will pick one from the 'free-list'; and when we delete one token, we will return it back to the 'free-list'. When a block is exhausted, it will allocate another block once. At last, until the decoding process is completed, all blocks will be freed. I think the ''HashList'' is the main reason why the speedup is limited in your test. From my opinion, I think linking the tcmalloc is simple, after installing the gperf, adding -ltcmalloc, -ltcmalloc_minimal, -ltcmalloc_and_profile(if the profile is needed) to makefile is enough. |
This issue has been automatically marked as stale by a bot solely because it has not had recent activity. Please add any comment (simply 'ping' is enough) to prevent the issue from being closed for 60 more days if you believe it should be kept open. |
tools/extras/install_tcmalloc.sh
Outdated
./configure --prefix=$PWD/install || exit 1; | ||
make || exit 1; | ||
make install || exit 1; | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
exit
exits the subshell here, i.e. the script continues past the closing )
. Consider set -e
, then you can get rid of most || exit
checks. Otherwise, do this:
./configure --prefix=$PWD/install || exit 1; | |
make || exit 1; | |
make install || exit 1; | |
) | |
./configure --prefix=$PWD/install && | |
make && | |
make install | |
) || exit |
All pipelining operators at end of line (&&
, ||
, |
, |&
) continue the pipe to next line without a need for the \
.
|| exit
preserves the non-zero exit code, explicit 1
is unnecessary.
tools/extras/install_tcmalloc.sh
Outdated
echo "$0: Assuming gperftools is already installed Please delete the directory" | ||
echo "./gperftools if you need to reinstall." | ||
mv gproftools gproftools_backup |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The message does not say what is really going on. Also gperftools != gproftools. You create the former below. Consider
echo "$0: Assuming gperftools is already installed Please delete the directory" | |
echo "./gperftools if you need to reinstall." | |
mv gproftools gproftools_backup | |
echo "$0: existing 'gperftools' subdirectory is renamed 'gperftools.bak'" | |
mv -f gperftools gperftools.bak || exit |
mv -f
needed in case another gperftools.bak
exists.
|| exit
handles mv's failure. set -e
sometimes works better, but then you need to take care not to return non-zero from any pipeline.
The extension .bak
should be already ignored in .gitignore
, either at the root level or in tools/. If not, it's a good idea to have added it (as *.bak
; this will ignore both matching files and directories). In any case, make sure that all downloaded files and created directories are ignored by Git.
tools/extras/install_tcmalloc.sh
Outdated
( | ||
wd=$PWD | ||
echo export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$wd/gperftools/lib | ||
echo export PATH=$PATH:$wd/gperftools/bin | ||
) >> env.sh |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
$LD_LIBRARY_PATH
and$PATH
will be expanded before writing to env.sh; you probably want to keep the$
literal in env.sh, not capture the current PATH.- A subshell is unnecessary; you can redirect output of a block.
- From "man ld.so" on
LD_LIBRARY_PATH
: A zero-length directory name indicates the current working directory. This is not what you want if LD_LIBRARY_PATH is not set. An extra check for empty/unset variable is needed not to add a:
.:+
in a variable substitution makes it expand to empty string, otherwise to the expasnion of an expression after the:+
. PATH is usually set, so it's safe to assume$PATH:
does not expand to:
(the rule for PATH is the same: an empty string after splitting at every:
means current directory).
( | |
wd=$PWD | |
echo export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$wd/gperftools/lib | |
echo export PATH=$PATH:$wd/gperftools/bin | |
) >> env.sh | |
{ | |
echo export LD_LIBRARY_PATH=\${LD_LIBRARY_PATH:+\${LD_LIBRARY_PATH}:}$PWD/gperftools/lib | |
echo export PATH=\$PATH:$PWD/gperftools/bin | |
} >> env.sh |
Also:
gperftools/bin
will not be built if we use minimal tcmalloc.- Why do we need LD_LIBRARY_PATH? We're building with -ltcmalloc[_minimal], so the path to the .so is baked into a binary anyway. I think it's needed only for libunwind?
tools/extras/install_tcmalloc.sh
Outdated
git clone https://github.com/gperftools/gperftools.git gperftools | ||
|
||
# install tcmalloc | ||
wd=$PWD | ||
cd gperftools | ||
bash autogen.sh | ||
./configure --prefix=$PWD || exit 1; | ||
make || exit 1; | ||
make install || exit 1; | ||
cd .. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- We did not have a dependency on Autoconf/Automake before. Better, get a pre-autoconf'ed distribution tarball from https://github.com/gperftools/gperftools/releases/download/gperftools-2.9.1/gperftools-2.9.1.tar.gz.
- Use the same subshell pattern to keep PWD?
- Do we need the whole thing, with the heap profiler and heap checker? I think that for the default user build, the default should be -ltcmalloc_minimal, which skips all the stuff. I think 5 people at most need it. :) This also gets rid of the dependency on libunwind. Users who want and know how to profile are certainly qualified to reconfigure and build as much of the toolset as they want (and can install libunwind with a package manager). If you do this, change the library name in 'configure' too.
git clone https://github.com/gperftools/gperftools.git gperftools | |
# install tcmalloc | |
wd=$PWD | |
cd gperftools | |
bash autogen.sh | |
./configure --prefix=$PWD || exit 1; | |
make || exit 1; | |
make install || exit 1; | |
cd .. | |
wget https://github.com/gperftools/gperftools/releases/download/gperftools-2.9.1/gperftools-2.9.1.tar.gz && | |
tar xzf gperftools-2.9.1.tar.gz && | |
mv gperftools-2.9.1 gperftools || exit | |
# install tcmalloc | |
( | |
cd gperftools && | |
./configure --prefix=$PWD --enable-minimal --disable-debugalloc --disable-static && | |
make && | |
make install | |
) || exit |
tools/extras/install_tcmalloc.sh
Outdated
echo "You must call this script from the tools/ directory" && exit 1; | ||
|
||
# prepare libunwind | ||
echo "****() Installing libunwind" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- libtcmalloc_minimal does not depend on libunwind, see comments below. We can shed this dependency.
- libunwind comes with most distros and can be installed with apt (libunwind-dev) or yum (libunwid-devel), unless the version 1.5-rc is strictly required. Also, clang comes with its own version; I do not know how compatible they are. I'd left the choice to the user.
tools/extras/install_tcmalloc.sh
Outdated
# install libunwind | ||
# The installation directory can be changed. But be careful, set the prefix as | ||
#`pwd` will cause recursive problem. | ||
tar vxzf libunwind-1.5-rc1.tar.gz |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tar is extremely noisy with v
. Up to you, but the listing of unpacked files is hardly useful to the end user.
src/configure
Outdated
else | ||
failure "Can't find libtcmalloc.so in ../tools/gperftools, | ||
refer to ../tools/extras/install_tcmalloc.sh for instruction." | ||
fi | ||
else | ||
failure "--enable-tcmalloc switched on, but can't find tcmalloc in | ||
../tools/gperftools, refer to ../tools/extras/install_tcmalloc.sh | ||
for instruction." | ||
fi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would merge both cases into one. "Can't find tcmalloc in tools/gperftools. Run 'extras/install_tcmalloc.sh' in 'tools/' to install." is good enough for both. install_tcmalloc.sh will print an error if the user tries to do what message says, i.e run ../tools/extras/install_tcmalloc.sh
. "Refer for instructions" is a lie, the script installs stuff without giving any :)
If you split lines, unindent the continuations, it will look weird when printed. failure
takes any number of arguments and prints them in a single string with a space between, so if a message is concise enough, this will work to split First half and second half
into two code lines:
failure "First half and" \
"second half"
Just FYI, you may stick $'\n'
into a string to break lines, outside double quotes, but w/o space, or else Line 2 will start with a space, e.g.: "Line 1"$'\n'"Line 2"
or $'Line 1\nLine 2'
do the same thing. You can also do \'
for '
inside a $
-prefixed string.
tools/extras/install_tcmalloc.sh
Outdated
echo export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$wd/gperftools/lib | ||
echo export PATH=$PATH:$wd/gperftools/bin | ||
) >> env.sh | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove empty line
@kkm000, thank you for your careful review and patient guidance. I tried to update the files according to your comments.
I tested the updated files and it seems ok on my machine. |
LGTM, thanks, especially for updating the commentary in the file! Merging. |
# prepare tcmalloc | ||
if [ -d gperftools ]; then | ||
echo "$0: existing 'gperftools' subdirectory is renamed 'gperftools.bak'" | ||
mv -f gpreftools gpreftools.bak |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! #4629
This is a PR for installing the ``tcmalloc'' library to speed up the decoding-like codes.
@huangruizhe , hi ruizhe, when you are free, could you please try to run the PR on the grid of CLSP. You can run the tools/extras/install_tcmalloc.sh to install it and include the
--enable-tcmalloc'' option when you execute
configure'' command. It's ok for me when I test on my local grid.After that, may be you can use ``decoder/lattice-simple-decoder or simple-decoder'' (which use the build-in data structure such as unordered_map rather than kaldi's hash-lish) to test the speedup on any dataset. From my experience, it will speed up the former on a lot and the latter a little bit. Thanks.