Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce stop_at_shortest in sample and round_robin #76

Merged
merged 35 commits into from
Oct 5, 2023
Merged

Conversation

najielhachem
Copy link
Contributor

@najielhachem najielhachem commented Oct 2, 2023

What does this PR do? Please describe:

Following this PR review we decided to do the following:

  • Extract common parts of sample and round_robin into separate class
  • Introduce circular logic for sample operator useful for up_sampling
  • Add stop_at_shortest flag to both operator to give user more flexibility on how the operator works.

Does your PR introduce any breaking changes? If yes, please list them:
List of all backwards-incompatible changes.

Check list:

  • Was the content of this PR discussed and approved via a GitHub issue? (no need for typos or documentation improvements)
  • Did you read the contributor guideline?
  • Did you make sure that your PR does only one thing instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests?
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (no need for typos, documentation, or minor internal changes)

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Oct 2, 2023
@najielhachem najielhachem changed the title Up sampling Introduce stop_at_shortest in sample and round_robin Oct 3, 2023
@najielhachem najielhachem marked this pull request as ready for review October 3, 2023 09:18
@najielhachem najielhachem requested a review from gwenzek October 3, 2023 09:19
Copy link
Contributor

@gwenzek gwenzek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

If I had to nit, I'd says that I feed it weird to have the logic spread out in 6 files.
sampled_data_source and round_robin_data_source, may not need to be their own class anymore.
But it's also consistent with the rest of the codebase, where we have two files per operator.

tests/unit/data/data_pipeline/test_sample.py Show resolved Hide resolved
@@ -76,6 +77,28 @@ def test_op_works_when_pipelines_have_different_lengths(self) -> None:

pipeline.reset()

@pytest.mark.skipif(
python_devel_only(),
reason="New fairseq2n API in Python-only installation. Skipping till v0.2.",
Copy link
Contributor

@gwenzek gwenzek Oct 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not for this PR:

I'm not fond of the behavior of this python devel only
First it seems to be about not disturbing Python only developer that didn't build fairseq2n locally. So fairseq2n_dev_only would be more clear.
Second I'd prefer if it was annotation @fairseq2n_dev_only with a standard reason.
Third when are we supposed to removed those annotations ? Will we have to do this manually ?

I don't really see who would want to run our unit tests from github main branch but can't build fairseq2n.
Can someone explain me the use case ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First it seems to be about not disturbing Python only developer that didn't build fairseq2n locally. So fairseq2n_dev_only would be more clear.

I think if you convert it to a standalone annotation, a name like @fairseq2n_dev_only makes sense, but as a parameter to skip_if, fairseq2n_dev_only does not sound right, because it is just the opposite: you want to skip the test if this is a python-only installation. I am open to having a standalone annotation though.

Third when are we supposed to removed those annotations ? Will we have to do this manually ?

There are two ways to resolve this issue. We can either have a version range check, like @skipif(fairseq2n.__version__< 0.2) which would not require us to remove the annotation in the future. The disadvantage is, those checks would not be relevant after a stable release and will start to pile up with time for no good reason. The other alternative is what we have right now. Just checking whether the versions do not match @skipif(fairseq2.__version__ != fairseq2n.__version__). The advantage is that this forces you to remove the annotations before a stable release (otherwise the tests won't run), the disadvantage is that it is a manual work that needs to be done as part of the version bump. Having said that I think we won't need this annotation that often in the future once the native library becomes more mature. The manual removal is not ideal, but we might add some extra measures like forcibly fail any test that has this annotation and if the latest version is already past the one in the annotation.

Can someone explain me the use case ?

Pretty much anyone who wants to contribute to the Python code base of fairseq2, but does not want to deal (or have experience) with C++. The whole reason why we ended up with fairseq2n in last minute was because people had quite a bit of trouble with building the library (e.g. while working on seq-gen and UnitY upstreaming). The original feedback was to completely separate the C++ parts to a separate repo, but we then decided to keep it here and offer pre-built binaries so that people who want to contribute to fairseq2 won't have to build the native parts if they don't want to. By the way, I got this approach pretty much from jax where they similarly separate the library into two parts: jax (python only) and jaxlib (pre-built).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, just gave it a try in #83. Let me know what you think.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, let's follow on that PR, and merge this one.

Copy link
Contributor

@cbalioglu cbalioglu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solid work! Just left a few nit comments. There is only one line where we should leverage move-semantics for performance, but otherwise most of the my comments are optional suggestions. Let me know once you at least address that performance related issue, and I can stamp it right away.

fairseq2n/src/fairseq2n/data/multi_data_source.cc Outdated Show resolved Hide resolved
fairseq2n/src/fairseq2n/data/data_pipeline.h Outdated Show resolved Hide resolved
fairseq2n/src/fairseq2n/data/multi_data_source.cc Outdated Show resolved Hide resolved
fairseq2n/src/fairseq2n/data/round_robin_data_source.cc Outdated Show resolved Hide resolved
tests/unit/data/data_pipeline/test_round_robin.py Outdated Show resolved Hide resolved
fairseq2n/src/fairseq2n/data/sample_data_source.cc Outdated Show resolved Hide resolved
fairseq2n/src/fairseq2n/data/multi_data_source.h Outdated Show resolved Hide resolved
fairseq2n/src/fairseq2n/data/multi_data_source.cc Outdated Show resolved Hide resolved
fairseq2n/src/fairseq2n/data/multi_data_source.cc Outdated Show resolved Hide resolved
fairseq2n/src/fairseq2n/data/multi_data_source.cc Outdated Show resolved Hide resolved
@najielhachem
Copy link
Contributor Author

najielhachem commented Oct 5, 2023

@gwenzek I agree that there is too many files now but we must have a class for round_robin and sample since we have different logic for the save/reload and reset methods.

@najielhachem
Copy link
Contributor Author

@cbalioglu Thanks for the detailed review! I did all the changes requested and the PR should be ready by now.

@najielhachem najielhachem requested a review from cbalioglu October 5, 2023 09:29
Copy link
Contributor

@cbalioglu cbalioglu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left two very nit comments. Thanks again for this work! The implementation has become much cleaner! Feel free to merge it whenever you want.

src/fairseq2/data/data_pipeline.py Outdated Show resolved Hide resolved
src/fairseq2/data/data_pipeline.py Outdated Show resolved Hide resolved
fairseq2n/src/fairseq2n/data/composite_data_source.cc Outdated Show resolved Hide resolved
@najielhachem najielhachem merged commit 52b05c7 into main Oct 5, 2023
@najielhachem najielhachem deleted the up_sampling branch October 5, 2023 14:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants