Fix broken NeMo dependencies #372

Merged 16 commits on Nov 15, 2024

Conversation

sarahyurick
Collaborator

Signed-off-by: Sarah Yurick <[email protected]>
@sarahyurick sarahyurick changed the title from "Add packaging module" to "Fix broken NeMo dependencies" Nov 14, 2024
@sarahyurick sarahyurick requested a review from ko3n1g November 15, 2024 02:54
@sarahyurick sarahyurick mentioned this pull request Nov 15, 2024
3 tasks
@sarahyurick sarahyurick added the gpuci (Run GPU CI/CD on PR) label Nov 15, 2024
Comment on lines 105 to 108
try:
    from nemo.collections.common.tokenizers import SentencePieceTokenizer
except (ImportError, ModuleNotFoundError):
    from .sentencepiece_tokenizer import SentencePieceTokenizer
Collaborator Author

@sarahyurick sarahyurick Nov 15, 2024

What do we think about this?

ModuleNotFoundError: No module named 'nemo'
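
For context, a minimal sketch of the fallback-import pattern used in the snippet above, written against stand-in module names so that it runs without NeMo installed. `import_first_available` is a hypothetical helper, not part of NeMo Curator. Because `ModuleNotFoundError` is a subclass of `ImportError`, catching `ImportError` alone covers both cases.

```python
import importlib


def import_first_available(candidates):
    """Return the first module in `candidates` that can be imported.

    Hypothetical helper for illustration only.
    """
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError:
            # ModuleNotFoundError is a subclass of ImportError,
            # so a missing package is caught here as well.
            continue
    raise ImportError(f"None of these modules could be imported: {candidates}")


# The first name stands in for an optional dependency that is not installed;
# "json" stands in for the local fallback implementation.
module = import_first_available(["a_module_that_does_not_exist", "json"])
```

The trade-off with this pattern is that behavior depends on what happens to be installed, which can make failures harder to reproduce across environments.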

Collaborator

Based on our discussions on Slack, I think we can just transform this class into something like this:

import sentencepiece


class TokenizerFertilityFilter(DocumentFilter):

    def __init__(self, path_to_tokenizer=None, min_char_to_token_ratio=2.5):
        if path_to_tokenizer is None:
            raise ValueError(
                "Must provide a valid path to a SentencePiece tokenizer"
            )
        # Load the SentencePiece model directly instead of going through NeMo.
        self._tokenizer = sentencepiece.SentencePieceProcessor()
        self._tokenizer.Load(path_to_tokenizer)
        self._threshold = min_char_to_token_ratio

        self._name = "tokenizer_fertility"

    def score_document(self, source):
        # Fertility score: number of characters per token.
        tokens = self._tokenizer.encode_as_pieces(source)
        num_chars = len(source)
        num_tokens = len(tokens)
        if num_tokens == 0:
            # Empty tokenization: return a sentinel score instead of dividing by zero.
            return -1
        return num_chars / num_tokens

    def keep_document(self, score):
        return score >= self._threshold

Then we can just delete the one file you copied over. Let me know what you think.
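
A rough, self-contained sketch of the scoring logic in the suggestion above. To keep it runnable without a model file, `WhitespaceTokenizer` is a hypothetical stand-in for `sentencepiece.SentencePieceProcessor`, and `FertilityScorer` is an illustrative name rather than the class in this PR.

```python
class WhitespaceTokenizer:
    """Stand-in tokenizer (assumption): splits on whitespace instead of
    using a trained SentencePiece model."""

    def encode_as_pieces(self, text):
        return text.split()


class FertilityScorer:
    """Illustrative version of the characters-per-token filter."""

    def __init__(self, tokenizer, min_char_to_token_ratio=2.5):
        self._tokenizer = tokenizer
        self._threshold = min_char_to_token_ratio

    def score_document(self, source):
        tokens = self._tokenizer.encode_as_pieces(source)
        if not tokens:
            # Sentinel score for documents that produce no tokens.
            return -1
        return len(source) / len(tokens)

    def keep_document(self, score):
        return score >= self._threshold


scorer = FertilityScorer(WhitespaceTokenizer())
long_words = scorer.score_document("hello world")  # 11 chars / 2 tokens
short_words = scorer.score_document("a b c d")     # 7 chars / 4 tokens
```

With the default threshold of 2.5, the first document would be kept and the second dropped; a real SentencePiece model would of course produce different token counts.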

Collaborator

We should probably run this as a batch rather than piece by piece per file, and return a single file. We could also use crossfit for it, if we want to.

Collaborator

@VibhuJawa VibhuJawa Nov 15, 2024

This is what that will look like

 cf.op.Tokenizer(model, cols=["text"], tokenizer_type="sentencepiece")

That said, we might have to ensure this works in a CPU environment too so there might be some complexity here we need to fix.
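
To illustrate the batching idea in plain Python (this is not the crossfit API; `batched`, `score_in_batches`, and `whitespace_encode_batch` are hypothetical names), here is a sketch that calls the tokenizer once per batch rather than once per document:

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def score_in_batches(
    docs: Sequence[str],
    encode_batch: Callable[[List[str]], List[List[str]]],
    batch_size: int = 2,
) -> List[float]:
    """Compute characters-per-token scores, tokenizing one batch at a time."""
    scores: List[float] = []
    for batch in batched(docs, batch_size):
        token_lists = encode_batch(list(batch))  # one tokenizer call per batch
        for text, tokens in zip(batch, token_lists):
            scores.append(len(text) / len(tokens) if tokens else -1)
    return scores


def whitespace_encode_batch(texts: List[str]) -> List[List[str]]:
    """Stand-in batch tokenizer (assumption): whitespace splitting."""
    return [text.split() for text in texts]


docs = ["hello world", "", "a b c d"]
scores = score_in_batches(docs, whitespace_encode_batch, batch_size=2)
```

Passing the tokenizer in as a callable keeps the scoring code independent of whether the batch is tokenized on CPU or GPU.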

Collaborator Author

Thanks @VibhuJawa ! I have opened #377 to track this.

import torch


class SentencePieceTokenizer:
Collaborator Author

@@ -54,14 +55,14 @@ dependencies = [
"lxml_html_clean",
"mecab-python3",
"mwparserfromhell==0.6.5",
"nemo_toolkit[nlp]>=1.23.0",
Collaborator Author

Opened #376.

Collaborator

@ryantwolf ryantwolf left a comment

We also probably need to add torch as a dependency now. We inherited that from NeMo. Though, not sure if the HF libraries pick that up automatically.
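
One way to check this locally, sketched with the standard library only: `importlib.metadata.requires` lists the requirements an installed distribution declares, which shows whether a package such as torch is a hard requirement or only pulled in through an extra. The helper names and the marker check (a simple substring test for `extra`) are assumptions made for illustration.

```python
import re
from importlib import metadata

_NAME_RE = re.compile(r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)")


def normalize(name):
    """Normalize a distribution name (lowercase, runs of -, _, . become -)."""
    return re.sub(r"[-_.]+", "-", name).lower()


def parse_requirement(req):
    """Return (normalized_name, is_extra_only) for one requirement string.

    Heuristic: a requirement counts as extra-only if its environment
    marker (the part after ';') mentions `extra`.
    """
    match = _NAME_RE.match(req)
    if match is None:
        raise ValueError(f"Cannot parse requirement: {req!r}")
    marker = req.partition(";")[2]
    return normalize(match.group(1)), "extra" in marker


def dependency_status(dist_name, dep_name):
    """Return 'required', 'extra-only', 'absent', or 'not-installed'."""
    try:
        reqs = metadata.requires(dist_name) or []
    except metadata.PackageNotFoundError:
        return "not-installed"
    wanted = normalize(dep_name)
    found = [extra for name, extra in map(parse_requirement, reqs) if name == wanted]
    if not found:
        return "absent"
    return "extra-only" if all(found) else "required"
```

Calling `dependency_status("transformers", "torch")` in the target environment would report how the installed version of that package declares torch; the answer depends on the installed version, so it is not shown here.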

Collaborator Author

sarahyurick commented Nov 15, 2024

@ryantwolf
Collaborator

@sarahyurick I think there's a place in the user guide under images/gettingstarted.rst that has the cython install instructions too.

@sarahyurick
Collaborator Author

> @sarahyurick I think there's a place in the user guide under images/gettingstarted.rst that has the cython install instructions too.

Yes I have updated docs/user-guide/image/gettingstarted.rst - let me know if there's another somewhere.

Collaborator

@ryantwolf ryantwolf left a comment

oops sorry I was blind. I'm used to the user guide being the first thing in the side bar. Looks good.

@sarahyurick sarahyurick merged commit 363a66b into NVIDIA:main Nov 15, 2024
3 checks passed
davzoku pushed a commit to davzoku/NeMo-Curator that referenced this pull request Nov 19, 2024
* add packaging

Signed-off-by: Sarah Yurick <[email protected]>

* move to requires

Signed-off-by: Sarah Yurick <[email protected]>

* move to github ci file

Signed-off-by: Sarah Yurick <[email protected]>

* add pin

Signed-off-by: Sarah Yurick <[email protected]>

* add torch

Signed-off-by: Sarah Yurick <[email protected]>

* add suggestion from mamba readme

Signed-off-by: Sarah Yurick <[email protected]>

* try github install

Signed-off-by: Sarah Yurick <[email protected]>

* add comma

Signed-off-by: Sarah Yurick <[email protected]>

* another attempt

Signed-off-by: Sarah Yurick <[email protected]>

* remove nemo toolkit

Signed-off-by: Sarah Yurick <[email protected]>

* add datasets

Signed-off-by: Sarah Yurick <[email protected]>

* try removing cython

Signed-off-by: Sarah Yurick <[email protected]>

* remove cython

Signed-off-by: Sarah Yurick <[email protected]>

* sentencepiece

Signed-off-by: Sarah Yurick <[email protected]>

* run black

Signed-off-by: Sarah Yurick <[email protected]>

* apply ryan's suggestion

Signed-off-by: Sarah Yurick <[email protected]>

---------

Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: Walter Teng <[email protected]>
VibhuJawa pushed a commit that referenced this pull request Nov 19, 2024
* update obsolete flag

Signed-off-by: Walter Teng <[email protected]>

* build: Improve caching (#352)

Signed-off-by: Oliver Koenig <[email protected]>
Signed-off-by: Walter Teng <[email protected]>

* ci: Run on main (#354)

* ci: Run gpuci on main
* fix checkout

Signed-off-by: Oliver Koenig <[email protected]>
Signed-off-by: Walter Teng <[email protected]>

* ci: Run on merge commit (#355)

Signed-off-by: Oliver Koenig <[email protected]>
Signed-off-by: Walter Teng <[email protected]>

* build: Add conda env to `$PATH` (#357)

* build: Add conda env to `$PATH`

Signed-off-by: Oliver Koenig <[email protected]>

* test

Signed-off-by: Oliver Koenig <[email protected]>

* add newline

Signed-off-by: Oliver Koenig <[email protected]>

* run cleanup always

Signed-off-by: Oliver Koenig <[email protected]>

---------

Signed-off-by: Oliver Koenig <[email protected]>
Signed-off-by: Walter Teng <[email protected]>

* Add `build-test-publish-wheel` CI file (#356)

* Create build-test-publish-wheel.yml

Signed-off-by: Sarah Yurick <[email protected]>

* Create package_info.py

Signed-off-by: Sarah Yurick <[email protected]>

* run black

Signed-off-by: Sarah Yurick <[email protected]>

* Update __init__.py

Signed-off-by: Sarah Yurick <[email protected]>

* Update package_info.py

Signed-off-by: Sarah Yurick <[email protected]>

* Update .github/workflows/build-test-publish-wheel.yml

Signed-off-by: Sarah Yurick <[email protected]>

* remove extra version string

Signed-off-by: Sarah Yurick <[email protected]>

* Update __init__.py

Signed-off-by: Sarah Yurick <[email protected]>

* add `__all__`

Signed-off-by: Sarah Yurick <[email protected]>

* Fix version

Signed-off-by: oliver könig <[email protected]>

* Update .github/workflows/build-test-publish-wheel.yml

Signed-off-by: oliver könig <[email protected]>

* Ko3n1g/sarahyurick/ci/build test publish wheel (#358)

* fix

* fix

Signed-off-by: Oliver Koenig <[email protected]>

* fix

Signed-off-by: Oliver Koenig <[email protected]>

* fix

Signed-off-by: Oliver Koenig <[email protected]>

* fix

Signed-off-by: Oliver Koenig <[email protected]>

* fix

Signed-off-by: Oliver Koenig <[email protected]>

* fix

* fix

Signed-off-by: Oliver Koenig <[email protected]>

* fix

* fix

---------

Signed-off-by: Oliver Koenig <[email protected]>

* run black

Signed-off-by: Sarah Yurick <[email protected]>

* run isort

Signed-off-by: Sarah Yurick <[email protected]>

* Update __init__.py

Signed-off-by: Sarah Yurick <[email protected]>

* Update pyproject.toml

Signed-off-by: Sarah Yurick <[email protected]>

---------

Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: oliver könig <[email protected]>
Signed-off-by: Oliver Koenig <[email protected]>
Co-authored-by: oliver könig <[email protected]>
Signed-off-by: Walter Teng <[email protected]>

* Fix broken TestPyPi builder (#362)

* Update build-test-publish-wheel.yml

Signed-off-by: Sarah Yurick <[email protected]>

* Update Dockerfile

Signed-off-by: Sarah Yurick <[email protected]>

* Update build-test-publish-wheel.yml

Signed-off-by: Sarah Yurick <[email protected]>

---------

Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: Walter Teng <[email protected]>

* chore: Add `CHANGELOG.md` file (#359)

* chore: Add `CHANGELOG.md` file

* fix

* add end of line

Signed-off-by: Oliver Koenig <[email protected]>

---------

Signed-off-by: Oliver Koenig <[email protected]>
Signed-off-by: Walter Teng <[email protected]>

* Release workflow (#360)

* add file

Signed-off-by: Sarah Yurick <[email protected]>

* trailing whitespace

Signed-off-by: Sarah Yurick <[email protected]>

---------

Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: Walter Teng <[email protected]>

* ci: Bump release workflow to allow of `devN` semver (#366)

* ci: Bump release workflow for `devN`

Signed-off-by: Oliver Koenig <[email protected]>

* fix

Signed-off-by: Oliver Koenig <[email protected]>

* fix

Signed-off-by: Oliver Koenig <[email protected]>

* fix

Signed-off-by: Oliver Koenig <[email protected]>

---------

Signed-off-by: Oliver Koenig <[email protected]>
Signed-off-by: Walter Teng <[email protected]>

* ci: Add code-freeze workflow (#367)

Signed-off-by: Oliver Koenig <[email protected]>
Signed-off-by: Walter Teng <[email protected]>

* ci: Add cherry pick workflow (#368)

* ci: Add cherry pick workflow

Signed-off-by: Oliver Koenig <[email protected]>

* fix

Signed-off-by: Oliver Koenig <[email protected]>

---------

Signed-off-by: Oliver Koenig <[email protected]>
Signed-off-by: Walter Teng <[email protected]>

* Fix broken NeMo dependencies (#372)

* add packaging

Signed-off-by: Sarah Yurick <[email protected]>

* move to requires

Signed-off-by: Sarah Yurick <[email protected]>

* move to github ci file

Signed-off-by: Sarah Yurick <[email protected]>

* add pin

Signed-off-by: Sarah Yurick <[email protected]>

* add torch

Signed-off-by: Sarah Yurick <[email protected]>

* add suggestion from mamba readme

Signed-off-by: Sarah Yurick <[email protected]>

* try github install

Signed-off-by: Sarah Yurick <[email protected]>

* add comma

Signed-off-by: Sarah Yurick <[email protected]>

* another attempt

Signed-off-by: Sarah Yurick <[email protected]>

* remove nemo toolkit

Signed-off-by: Sarah Yurick <[email protected]>

* add datasets

Signed-off-by: Sarah Yurick <[email protected]>

* try removing cython

Signed-off-by: Sarah Yurick <[email protected]>

* remove cython

Signed-off-by: Sarah Yurick <[email protected]>

* sentencepiece

Signed-off-by: Sarah Yurick <[email protected]>

* run black

Signed-off-by: Sarah Yurick <[email protected]>

* apply ryan's suggestion

Signed-off-by: Sarah Yurick <[email protected]>

---------

Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: Walter Teng <[email protected]>

* ci: Bump release workflow (#373)

Signed-off-by: Oliver Koenig <[email protected]>
Signed-off-by: Walter Teng <[email protected]>

* Skip reading files with incorrect extension (#318)

* filter_files_by_extension function

Signed-off-by: Sarah Yurick <[email protected]>

* add type checking

Signed-off-by: Sarah Yurick <[email protected]>

* add filter_by param to get_all_files_paths_under

Signed-off-by: Sarah Yurick <[email protected]>

* isort

Signed-off-by: Sarah Yurick <[email protected]>

* address ayush's comments

Signed-off-by: Sarah Yurick <[email protected]>

* run black

Signed-off-by: Sarah Yurick <[email protected]>

* trailing whitespace

Signed-off-by: Sarah Yurick <[email protected]>

* more whitespace

Signed-off-by: Sarah Yurick <[email protected]>

* address praateek's review

Signed-off-by: Sarah Yurick <[email protected]>

* praateek's review

Signed-off-by: Sarah Yurick <[email protected]>

---------

Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: Walter Teng <[email protected]>

* remove deprecated convert_str_ids args  from ConnectedComponents

Signed-off-by: Walter Teng <[email protected]>

---------

Signed-off-by: Walter Teng <[email protected]>
Signed-off-by: Oliver Koenig <[email protected]>
Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: oliver könig <[email protected]>
Signed-off-by: Sarah Yurick <[email protected]>
Co-authored-by: oliver könig <[email protected]>
Co-authored-by: Sarah Yurick <[email protected]>
vinay-raman pushed a commit to vinay-raman/NeMo-Curator that referenced this pull request Nov 26, 2024
vinay-raman pushed a commit to vinay-raman/NeMo-Curator that referenced this pull request Nov 26, 2024
ruchaa-apte pushed a commit to ruchaa-apte/NeMo-Curator that referenced this pull request Dec 13, 2024
ruchaa-apte pushed a commit to ruchaa-apte/NeMo-Curator that referenced this pull request Dec 13, 2024
Labels
gpuci Run GPU CI/CD on PR
4 participants