Fix broken NeMo dependencies #372
Signed-off-by: Sarah Yurick <[email protected]>
nemo_curator/filters/code.py
    try:
        from nemo.collections.common.tokenizers import SentencePieceTokenizer
    except (ImportError, ModuleNotFoundError):
        from .sentencepiece_tokenizer import SentencePieceTokenizer
What do we think about this?
ModuleNotFoundError: No module named 'nemo'
Based on our discussions on Slack, I think we can just transform this class to be something like this:
    import sentencepiece


    class TokenizerFertilityFilter(DocumentFilter):
        def __init__(self, path_to_tokenizer=None, min_char_to_token_ratio=2.5):
            if path_to_tokenizer is None:
                raise ValueError(
                    "Must provide a valid path to a SentencePiece tokenizer"
                )
            # Load the SentencePiece model directly instead of going through NeMo.
            self._tokenizer = sentencepiece.SentencePieceProcessor()
            self._tokenizer.Load(path_to_tokenizer)
            self._threshold = min_char_to_token_ratio
            self._name = "tokenizer_fertility"

        def score_document(self, source):
            tokens = self._tokenizer.encode_as_pieces(source)
            num_chars = len(source)
            num_tokens = len(tokens)
            # An empty tokenization gets a sentinel score below any threshold.
            if num_tokens == 0:
                return -1
            return num_chars / num_tokens

        def keep_document(self, score):
            return score >= self._threshold
Then we can just delete the one file you copied over. Lmk what you think.
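The score/keep logic in the suggestion above can be exercised without a trained SentencePiece model. Below is a minimal sketch under stated assumptions: `WhitespaceTokenizer` is a hypothetical stand-in that mimics only the `encode_as_pieces` method, and `FertilityFilterSketch` is an illustrative name, not the class in `nemo_curator`.

```python
class WhitespaceTokenizer:
    """Hypothetical stand-in: exposes only encode_as_pieces(), splitting on whitespace."""

    def encode_as_pieces(self, text):
        return text.split()


class FertilityFilterSketch:
    """Illustrative copy of the proposed score/keep logic (not the real filter class)."""

    def __init__(self, tokenizer, min_char_to_token_ratio=2.5):
        self._tokenizer = tokenizer
        self._threshold = min_char_to_token_ratio

    def score_document(self, source):
        tokens = self._tokenizer.encode_as_pieces(source)
        if len(tokens) == 0:
            # Sentinel score: it is below any positive threshold, so the document is dropped.
            return -1
        return len(source) / len(tokens)

    def keep_document(self, score):
        return score >= self._threshold


flt = FertilityFilterSketch(WhitespaceTokenizer())
long_words = flt.score_document("tokenizer fertility")  # 19 characters, 2 tokens
short_words = flt.score_document("a b c d")             # 7 characters, 4 tokens
empty = flt.score_document("")                          # no tokens at all
```

With the default threshold of 2.5, the first document is kept and the other two are dropped. A real tokenizer would produce subword pieces, so the actual ratios would differ.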
We should probably run this as a batch rather than on individual pieces per file, and return a single file. We can also probably use crossfit for it (if we want to).
This is what that will look like:
cf.op.Tokenizer(model, cols=["text"], tokenizer_type="sentencepiece")
That said, we might have to ensure this works in a CPU environment too, so there might be some complexity here we need to fix.
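Independently of crossfit, the batching idea can be sketched in plain Python that also runs on CPU. This is only an illustration: `batched`, `fertility_scores`, and `whitespace_batch_tokenize` are hypothetical names, and the `batch_tokenize` callable (list of strings in, list of token lists out) is an assumption standing in for whatever batched tokenizer is eventually used.

```python
from typing import Callable, Iterable, Iterator, List


def batched(items: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most batch_size items."""
    batch: List[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def fertility_scores(
    docs: Iterable[str],
    batch_tokenize: Callable[[List[str]], List[List[str]]],
    batch_size: int = 2,
) -> List[float]:
    """Compute characters-per-token for every document, one tokenizer call per batch."""
    scores: List[float] = []
    for batch in batched(docs, batch_size):
        for text, tokens in zip(batch, batch_tokenize(batch)):
            # Same convention as the per-document filter: -1 when there are no tokens.
            scores.append(len(text) / len(tokens) if tokens else -1)
    return scores


def whitespace_batch_tokenize(texts: List[str]) -> List[List[str]]:
    """Hypothetical stand-in for a real batched tokenizer."""
    return [t.split() for t in texts]


docs = ["hello world", "", "a b c d", "tokenizer"]
scores = fertility_scores(docs, whitespace_batch_tokenize, batch_size=2)
```

The design point is that the tokenizer is called once per batch rather than once per document, which is where a GPU-backed tokenizer would be expected to help.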
Thanks @VibhuJawa ! I have opened #377 to track this.
    import torch


    class SentencePieceTokenizer:
    @@ -54,14 +55,14 @@ dependencies = [
        "lxml_html_clean",
        "mecab-python3",
        "mwparserfromhell==0.6.5",
        "nemo_toolkit[nlp]>=1.23.0",
Opened #376.
We also probably need to add torch as a dependency now. We inherited that from NeMo. Though, not sure if the HF libraries pick that up automatically.
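One way to check this in a given environment is to ask the installed package metadata. The following is a minimal sketch using only the standard library. The helper names `is_importable` and `matching_requirements` are hypothetical, and `transformers` is used purely as an example of an HF library. Requirements that carry an `extra == ...` marker are optional and are not installed by default, so the returned strings should be read with that in mind.

```python
import re
from importlib import metadata, util


def is_importable(module_name: str) -> bool:
    """Return True if a top-level module can be found, without importing it."""
    return util.find_spec(module_name) is not None


def matching_requirements(dist_name: str, wanted: str) -> list:
    """Return the requirement strings of an installed distribution that name `wanted`."""
    try:
        reqs = metadata.requires(dist_name) or []
    except metadata.PackageNotFoundError:
        return []  # the distribution is not installed in this environment
    matches = []
    for req in reqs:
        # The project name is the leading run of name characters in the requirement string.
        m = re.match(r"[A-Za-z0-9._-]+", req.strip())
        if m and m.group(0).lower() == wanted.lower():
            matches.append(req)
    return matches


print("torch importable:", is_importable("torch"))
print("torch requirements declared by transformers:",
      matching_requirements("transformers", "torch"))
```

If the second list is empty or contains only entries with an `extra` marker, torch would not be pulled in automatically and would need to be listed explicitly.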
Looking at the logs, it looks like
@sarahyurick I think there's a place in the user guide under
Yes, I have updated docs/user-guide/image/gettingstarted.rst - let me know if there's another somewhere.
oops sorry I was blind. I'm used to the user guide being the first thing in the side bar. Looks good.
* add packaging
* move to requires
* move to github ci file
* add pin
* add torch
* add suggestion from mamba readme
* try github install
* add comma
* another attempt
* remove nemo toolkit
* add datasets
* try removing cython
* remove cython
* sentencepiece
* run black
* apply ryan's suggestion

Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: Walter Teng <[email protected]>
* update obsolete flag
* build: Improve caching (#352)
* ci: Run on main (#354)
  * ci: Run gpuci on main
  * fix checkout
* ci: Run on merge commit (#355)
* build: Add conda env to `$PATH` (#357)
  * build: Add conda env to `$PATH`
  * test
  * add newline
  * run cleanup always
* Add `build-test-publish-wheel` CI file (#356)
  * Create build-test-publish-wheel.yml
  * Create package_info.py
  * run black
  * Update __init__.py
  * Update package_info.py
  * Update .github/workflows/build-test-publish-wheel.yml
  * remove extra version string
  * Update __init__.py
  * add `__all__`
  * Fix version
  * Update .github/workflows/build-test-publish-wheel.yml
  * Ko3n1g/sarahyurick/ci/build test publish wheel (#358)
    * fix (10 commits)
  * run black
  * run isort
  * Update __init__.py
  * Update pyproject.toml
* Fix broken TestPyPi builder (#362)
  * Update build-test-publish-wheel.yml
  * Update Dockerfile
  * Update build-test-publish-wheel.yml
* chore: Add `CHANGELOG.md` file (#359)
  * chore: Add `CHANGELOG.md` file
  * fix
  * add end of line
* Release workflow (#360)
  * add file
  * trailing whitespace
* ci: Bump release workflow to allow of `devN` semver (#366)
  * ci: Bump release workflow for `devN`
  * fix (3 commits)
* ci: Add code-freeze workflow (#367)
* ci: Add cherry pick workflow (#368)
  * ci: Add cherry pick workflow
  * fix
* Fix broken NeMo dependencies (#372)
  * add packaging
  * move to requires
  * move to github ci file
  * add pin
  * add torch
  * add suggestion from mamba readme
  * try github install
  * add comma
  * another attempt
  * remove nemo toolkit
  * add datasets
  * try removing cython
  * remove cython
  * sentencepiece
  * run black
  * apply ryan's suggestion
* ci: Bump release workflow (#373)
* Skip reading files with incorrect extension (#318)
  * filter_files_by_extension function
  * add type checking
  * add filter_by param to get_all_files_paths_under
  * isort
  * address ayush's comments
  * run black
  * trailing whitespace
  * more whitespace
  * address praateek's review
  * praateek's review
* remove deprecated convert_str_ids args from ConnectedComponents

Signed-off-by: Walter Teng <[email protected]>
Signed-off-by: Oliver Koenig <[email protected]>
Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: oliver könig <[email protected]>
Co-authored-by: oliver könig <[email protected]>
Co-authored-by: Sarah Yurick <[email protected]>
See example failure: https://github.com/NVIDIA/NeMo-Curator/actions/runs/11844241061/job/33007266564?pr=318