Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

install tcmalloc #4564

Merged
merged 4 commits into from
Sep 8, 2021
Merged

install tcmalloc #4564

merged 4 commits into from
Sep 8, 2021

Conversation

LvHang
Copy link
Contributor

@LvHang LvHang commented Jun 13, 2021

This is a PR for installing the ``tcmalloc'' library to speed up the decoding-like codes.

@huangruizhe , hi ruizhe, when you are free, could you please try to run the PR on the grid of CLSP. You can run the tools/extras/install_tcmalloc.sh to install it and include the --enable-tcmalloc'' option when you execute configure'' command. It's ok for me when I test on my local grid.

After that, may be you can use ``decoder/lattice-simple-decoder or simple-decoder'' (which use the build-in data structure such as unordered_map rather than kaldi's hash-lish) to test the speedup on any dataset. From my experience, it will speed up the former on a lot and the latter a little bit. Thanks.

@francisr
Copy link
Contributor

Do you expect this to have any impact of the lattice-faster-decoder?

@danpovey
Copy link
Contributor

I think it has more effect when you use lattice-decode-faster-parallel, with multiple threads I think regular malloc is slower.

@rickychanhoyin
Copy link
Contributor

libunwind is installed with the script but why not enable during the gperftools installation ?

@jtrmal
Copy link
Contributor

jtrmal commented Jun 15, 2021 via email

@danpovey
Copy link
Contributor

We could perhaps test this as part of that optimization thingy that was discussed on kaldi-help, without actually merging it.

@kkm000
Copy link
Contributor

kkm000 commented Jun 15, 2021

@danpovey, @jtrmal, IMO, tcmalloc grew from a simple library into a monstrosity. It requires C++17 to function and Bazel to build. I remember the issue, and we once compared tbbmalloc and tcmalloc. Both provided a comparable and non-negligible performance boost, but tcmalloc was much simpler to integrate. The latter part no longer holds.

Of course, we have an option to get a 2-year-old version that we've tested back then. It just works.

But right now, I'd rather go with tbbmalloc. The obvious advantage is you do not need to build it. TBB is installed the same way as MKL, and you do not need to build it. Also, TBB includes nice things like a work-stealing scheduler (what cudadecoder/threadpool-lite.h does). I was surprised to learn that TBB is used extensively by the LHCb CMS experiment (uxlfoundation/oneTBB#313). Anyway, tbbmalloc is a separate library from the rest; we do not need to depend on the full TBB. And it of course can be optional. MKL is currently optional but almost non-optional. As long as we install one, it's trivial to install both. The simplicity of use is a no less important factor than performance.

Tangentially:

  1. When I'm thinking about the decoder code, it seems to me that it would make sense to make it more vectorization-friendly.

  2. Speaking of MKL: the last version that can be installed by our script is 2019.something, or maybe 2020.1 Intel unified the perf libraries under oneAPI label. The installation procedure is same (the same APT/RPM feeds, only different package names), but there are quite a few advantages with the new model, I won't go into that. The install directory for MKL has changed tho.

@kkm000
Copy link
Contributor

kkm000 commented Jun 15, 2021

RE CMS, if anyone is curious: https://github.com/cms-sw/cmssw. That's quite a lot of code, to say the least.

@kkm000
Copy link
Contributor

kkm000 commented Jun 17, 2021

@danpovey, @jtrmal, @LvHang: Have you ever gave a thought to the arena allocator? The idea is you grab a chunk of contiguous pages (uncommitted), and have a single pointer pointing to the start of the first page range. When you malloc, see if you have enough committed pages ahead of this pointer, considering the remaining space in the page currently under pointer, commit more if necessary, return the pointer from malloc and also advance the stored pointer by the requested amount. free() does nothing. After the decode is complete, the whole arena is freed. The problem is we allocate a helluva lot small objects. The downside is that arena memory is never reused, but in applications where it's ok (I think decoding of a single lattice qualifies), it's the fastest allocator there is, and with zero memory overhead.

@danpovey
Copy link
Contributor

danpovey commented Jun 17, 2021 via email

@rickychanhoyin
Copy link
Contributor

I only saw several % speed up with tcmalloc in my "slim version of lattice-faster-decoder.cc (slim means simplified fst data structure and without backpointer data structure" running in single thread, so not really see big speeding effect at all. In 64bits linux, you may want to build the gperftools with --enable-frame-pointers option (also the INSTALL file suggst to use libunwind-0.99-beta.tar.gz instead of other versions).
. The linking was actually simple, just with -ltcmalloc_minimal or with the libtcmalloc_minimal.a (when static).

@kkm000
Copy link
Contributor

kkm000 commented Jun 17, 2021

@danpovey, if they are all of a fixed size, they would certainly benefit from the reuse. Same page trick, with each page chopped into fixed blanks, plus a free/used bitmap. I think TBB even had a memory_resource like this. The only problem is the memory_resource is a c++17 thing (std::pmr namespace) :(

Is it fair to say that this struct Token the most frequently allocated/freed object?

I want to profile it. Ugh, if I only had time!

@LvHang
Copy link
Contributor Author

LvHang commented Jun 18, 2021

@rickychanhoyin , I think you are right. The speedup is limited for the decoders of Kaldi which use ''HashList''. It will be much useful for some decoders which use build-in ``unordered_map''. IIRC, Dan wants to replace the ''HashList'' data strucutre so that the code is easier to understand for the users.

From my knowledge, the ''HashList'' data structure uses the similar pool allocating method. It likes the arena allocator with a similar ''same page chopped trick'' method as @kkm000 mentioned above. The basic elements for a decoder are tokens so the memory page is chopped into its size directly. In ''HashList'', it has a pointer ''freed_head_'' to manage the used/could-reused elements in a linked list (i.e. free-list). The ``HashList'' object always allocates a block of elements/chunks (e.g. 1024) together and makes the ''freed_head_'' points to the head. During the decoding, when we apply a token, we will pick one from the 'free-list'; and when we delete one token, we will return it back to the 'free-list'. When a block is exhausted, it will allocate another block once. At last, until the decoding process is completed, all blocks will be freed. I think the ''HashList'' is the main reason why the speedup is limited in your test.

From my opinion, I think linking the tcmalloc is simple, after installing the gperf, adding -ltcmalloc, -ltcmalloc_minimal, -ltcmalloc_and_profile(if the profile is needed) to makefile is enough.
From my searching resutls , the libunwind is mainly used to reslove the trace-back conflict when we profile with gperf. The newest libunwind is version 1.5, and the 0.99 version is released 10 years ago, so I used the former. Frankly, I have not encounter the related problems. I'm not expert in this, maybe @rickychanhoyin is right or others know more.

@stale
Copy link

stale bot commented Aug 17, 2021

This issue has been automatically marked as stale by a bot solely because it has not had recent activity. Please add any comment (simply 'ping' is enough) to prevent the issue from being closed for 60 more days if you believe it should be kept open.

@stale stale bot added the stale Stale bot on the loose label Aug 17, 2021
@stale stale bot removed the stale Stale bot on the loose label Sep 4, 2021
Comment on lines 59 to 62
./configure --prefix=$PWD/install || exit 1;
make || exit 1;
make install || exit 1;
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exit exits the subshell here, i.e. the script continues past the closing ). Consider set -e, then you can get rid of most || exit checks. Otherwise, do this:

Suggested change
./configure --prefix=$PWD/install || exit 1;
make || exit 1;
make install || exit 1;
)
./configure --prefix=$PWD/install &&
make &&
make install
) || exit

All pipelining operators at end of line (&&, ||, |, |&) continue the pipe to next line without a need for the \.
|| exit preserves the non-zero exit code, explicit 1 is unnecessary.

Comment on lines 66 to 68
echo "$0: Assuming gperftools is already installed Please delete the directory"
echo "./gperftools if you need to reinstall."
mv gproftools gproftools_backup
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The message does not say what is really going on. Also gperftools != gproftools. You create the former below. Consider

Suggested change
echo "$0: Assuming gperftools is already installed Please delete the directory"
echo "./gperftools if you need to reinstall."
mv gproftools gproftools_backup
echo "$0: existing 'gperftools' subdirectory is renamed 'gperftools.bak'"
mv -f gperftools gperftools.bak || exit

mv -f needed in case another gperftools.bak exists.

|| exit handles mv's failure. set -e sometimes works better, but then you need to take care not to return non-zero from any pipeline.

The extension .bak should be already ignored in .gitignore, either at the root level or in tools/. If not, it's a good idea to have added it (as *.bak; this will ignore both matching files and directories). In any case, make sure that all downloaded files and created directories are ignored by Git.

Comment on lines 82 to 86
(
wd=$PWD
echo export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$wd/gperftools/lib
echo export PATH=$PATH:$wd/gperftools/bin
) >> env.sh
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • $LD_LIBRARY_PATH and $PATH will be expanded before writing to env.sh; you probably want to keep the $ literal in env.sh, not capture the current PATH.
  • A subshell is unnecessary; you can redirect output of a block.
  • From "man ld.so" on LD_LIBRARY_PATH: A zero-length directory name indicates the current working directory. This is not what you want if LD_LIBRARY_PATH is not set. An extra check for empty/unset variable is needed not to add a :. :+ in a variable substitution makes it expand to empty string, otherwise to the expasnion of an expression after the :+. PATH is usually set, so it's safe to assume $PATH: does not expand to : (the rule for PATH is the same: an empty string after splitting at every : means current directory).
Suggested change
(
wd=$PWD
echo export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$wd/gperftools/lib
echo export PATH=$PATH:$wd/gperftools/bin
) >> env.sh
{
echo export LD_LIBRARY_PATH=\${LD_LIBRARY_PATH:+\${LD_LIBRARY_PATH}:}$PWD/gperftools/lib
echo export PATH=\$PATH:$PWD/gperftools/bin
} >> env.sh

Also:

  1. gperftools/bin will not be built if we use minimal tcmalloc.
  2. Why do we need LD_LIBRARY_PATH? We're building with -ltcmalloc[_minimal], so the path to the .so is baked into a binary anyway. I think it's needed only for libunwind?

Comment on lines 70 to 79
git clone https://github.com/gperftools/gperftools.git gperftools

# install tcmalloc
wd=$PWD
cd gperftools
bash autogen.sh
./configure --prefix=$PWD || exit 1;
make || exit 1;
make install || exit 1;
cd ..
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • We did not have a dependency on Autoconf/Automake before. Better, get a pre-autoconf'ed distribution tarball from https://github.com/gperftools/gperftools/releases/download/gperftools-2.9.1/gperftools-2.9.1.tar.gz.
  • Use the same subshell pattern to keep PWD?
  • Do we need the whole thing, with the heap profiler and heap checker? I think that for the default user build, the default should be -ltcmalloc_minimal, which skips all the stuff. I think 5 people at most need it. :) This also gets rid of the dependency on libunwind. Users who want and know how to profile are certainly qualified to reconfigure and build as much of the toolset as they want (and can install libunwind with a package manager). If you do this, change the library name in 'configure' too.
Suggested change
git clone https://github.com/gperftools/gperftools.git gperftools
# install tcmalloc
wd=$PWD
cd gperftools
bash autogen.sh
./configure --prefix=$PWD || exit 1;
make || exit 1;
make install || exit 1;
cd ..
wget https://github.com/gperftools/gperftools/releases/download/gperftools-2.9.1/gperftools-2.9.1.tar.gz &&
tar xzf gperftools-2.9.1.tar.gz &&
mv gperftools-2.9.1 gperftools || exit
# install tcmalloc
(
cd gperftools &&
./configure --prefix=$PWD --enable-minimal --disable-debugalloc --disable-static &&
make &&
make install
) || exit

echo "You must call this script from the tools/ directory" && exit 1;

# prepare libunwind
echo "****() Installing libunwind"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • libtcmalloc_minimal does not depend on libunwind, see comments below. We can shed this dependency.
  • libunwind comes with most distros and can be installed with apt (libunwind-dev) or yum (libunwid-devel), unless the version 1.5-rc is strictly required. Also, clang comes with its own version; I do not know how compatible they are. I'd left the choice to the user.

# install libunwind
# The installation directory can be changed. But be careful, set the prefix as
#`pwd` will cause recursive problem.
tar vxzf libunwind-1.5-rc1.tar.gz
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tar is extremely noisy with v. Up to you, but the listing of unpacked files is hardly useful to the end user.

src/configure Outdated
Comment on lines 1307 to 1315
else
failure "Can't find libtcmalloc.so in ../tools/gperftools,
refer to ../tools/extras/install_tcmalloc.sh for instruction."
fi
else
failure "--enable-tcmalloc switched on, but can't find tcmalloc in
../tools/gperftools, refer to ../tools/extras/install_tcmalloc.sh
for instruction."
fi
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would merge both cases into one. "Can't find tcmalloc in tools/gperftools. Run 'extras/install_tcmalloc.sh' in 'tools/' to install." is good enough for both. install_tcmalloc.sh will print an error if the user tries to do what message says, i.e run ../tools/extras/install_tcmalloc.sh. "Refer for instructions" is a lie, the script installs stuff without giving any :)

If you split lines, unindent the continuations, it will look weird when printed. failure takes any number of arguments and prints them in a single string with a space between, so if a message is concise enough, this will work to split First half and second half into two code lines:

    failure "First half and" \
            "second half"

Just FYI, you may stick $'\n' into a string to break lines, outside double quotes, but w/o space, or else Line 2 will start with a space, e.g.: "Line 1"$'\n'"Line 2" or $'Line 1\nLine 2' do the same thing. You can also do \' for ' inside a $-prefixed string.

echo export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$wd/gperftools/lib
echo export PATH=$PATH:$wd/gperftools/bin
) >> env.sh

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove empty line

@LvHang
Copy link
Contributor Author

LvHang commented Sep 8, 2021

@kkm000, thank you for your careful review and patient guidance. I tried to update the files according to your comments.

  1. I removed the part about 'libunwind' and left a few comments. I left the 'libunwind' part before since I used it:). I agree that the users who need and know how to profile are qualified to reconfigure and build as they want. Getting rid of it and using 'tcmalloc_minimal' will be more compact.
  2. I use set -e to replace the redundant exits, since it seems we need to conduct further actions according to the different return values.
  3. I haven't found the .bak extension in .gitignore so I added.
  4. Now, we use the -ltcmalloc_minimal and I prefer to bake the path with -Wl, -rpath into the binary as I think it can help to avoid some LD_LIBRARY_PATH asynchronous issues in the grid. Adding the increment of PATH and LD_LIBRARY_PATH to env.sh is needn't now.
  5. I didn't take care of the 'failure' function before. Thank you for your information. As there isn't a -e option in the echo command of the function, the 'dollar-sign single quote' is helpful, thanks.

I tested the updated files and it seems ok on my machine.

@kkm000
Copy link
Contributor

kkm000 commented Sep 8, 2021

LGTM, thanks, especially for updating the commentary in the file! Merging.

@kkm000 kkm000 merged commit 47ea7ae into kaldi-asr:master Sep 8, 2021
# prepare tcmalloc
if [ -d gperftools ]; then
echo "$0: existing 'gperftools' subdirectory is renamed 'gperftools.bak'"
mv -f gpreftools gpreftools.bak

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! #4629

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants