
Failed to lower some tensor manipulation ops from aten to ttnn #12853

Open
jdh8 opened this issue Aug 22, 2024 · 8 comments

Comments

@jdh8
Contributor

jdh8 commented Aug 22, 2024

The following ops refuse to lower to ttnn and stay in aten:

============================================================================== short test summary info ===============================================================================
FAILED tests/lowering/tensor_manipulation/test_concat.py::test_concat[input_shapes0-1] - AssertionError: assert 0 == 1
FAILED tests/lowering/tensor_manipulation/test_expand.py::test_expand[input_shape0-new_shape0] - AssertionError: assert 0 == 1
FAILED tests/lowering/tensor_manipulation/test_expand.py::test_expand_after_op[input_shape0-new_shape0] - AssertionError: assert 0 == 1
FAILED tests/lowering/tensor_manipulation/test_expand.py::test_expand_before_op[input_shape0-new_shape0] - AssertionError: assert 0 == 1
FAILED tests/lowering/tensor_manipulation/test_expand.py::test_expand_between_ops[input_shape0-new_shape0] - AssertionError: assert 0 == 1
FAILED tests/lowering/tensor_manipulation/test_repeat.py::test_repeat[input_shape0-sizes0] - AssertionError: assert 0 == 1
FAILED tests/lowering/tensor_manipulation/test_reshape.py::test_reshape[input_shape0-new_shape0] - AssertionError: assert 0 == 1
============================================================================ 7 failed, 9 passed in 8.42s =============================================================================
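These failures come from the lowering tests asserting that at least one ttnn op appears in the compiled graph; `assert 0 == 1` means zero were found. A minimal pure-Python sketch of that counting check (the dict-shaped nodes and the helper name are hypothetical illustrations, not the actual test code):

```python
def count_ttnn_ops(nodes):
    """Count call nodes whose target lives in the ttnn namespace."""
    return sum(1 for n in nodes
               if n["op"] == "call_function"
               and str(n["target"]).startswith("ttnn."))

# A graph where the op stayed in aten (no lowering happened):
unlowered = [
    {"op": "call_function", "target": "aten.view.default"},
    {"op": "output", "target": "output"},
]
assert count_ttnn_ops(unlowered) == 0  # reproduces the "assert 0 == 1" shape

# What a passing test expects after lowering:
lowered = [
    {"op": "call_function", "target": "ttnn.reshape"},
    {"op": "output", "target": "output"},
]
assert count_ttnn_ops(lowered) == 1
```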
@ayerofieiev-tt
Member

ayerofieiev-tt commented Aug 22, 2024

@jdh8 need more details so we can pass the info to the Op team.

Can you please specify individual problems with each op?
Ideally we should have a single ticket per issue, with each issue categorized:

Op Category:

  • TM (tensor manipulation/data movement),
  • MatMul,
  • Conv,
  • Eltwise,
  • Pool,
  • Reduction,
  • Misc

Issue Category:

  • Arguments order mismatch (lower priority)
  • Arguments mismatch (no way to pass configuration to the op)
  • Unsupported case (op rejects the input)
  • Crash
  • Wrong answer
  • Low PCC (close to expected value, but something is off)

Specific input: [params].
Link to an xfail test.

github-merge-queue bot referenced this issue in tenstorrent/pytorch2.0_ttnn Aug 22, 2024
* Change the tracer type to a decorator; its API changes too

* torch_ttnn.backend apply tracer config

* mv json dump out of parse_fx_graph

* Wrap leaf func with function form instead of decorator

* Rename TorchTtnnOption to TenstorrentBackendOption

* Revert "Use subgraph_rewriter"

This reverts commit fc09080.

* Extract mock_ttnn to a standalone package

* Register ttnn backend

* Add setup.py and pyproject.toml

* Update README.md

* mv torch_ttnn/tracer.py tracer/tracer.py

* Add model_sow2_list1 in tools/stat_models

* Fix counter bug

* fix try except

* detr_resnet50 retain_graph=True

* Update README.md for package building

* Update test case for ttnn interface change

- ttnn.open(0) -> ttnn.open_device(device_id=0)
- ttnn.close(d) -> ttnn.close_device(d)

* Convert 14 pointwise binary operations

Add conversion and unit test cases

- add
- eq
- gt
- logical_and
- logical_or
- logical_xor
- lt
- maximum
- minimum
- mul
- ne
- pow
- sub
- xlogy

* Convert 35 pointwise unary operations

- aten.abs
- aten.acos
- aten.acosh
- aten.asin
- aten.asinh
- aten.atan
- aten.atan2  # binary
- aten.atanh
- aten.clone
- aten.cos
- aten.cosh
- aten.erf
- aten.exp
- aten.expm1
- aten.gelu
- aten.hardtanh
- aten.isinf
- aten.isnan
- aten.leaky_relu
- aten.log
- aten.log10
- aten.log1p
- aten.log2
- aten.logical_not
- aten.neg
- aten.reciprocal
- aten.relu
- aten.rsqrt
- aten.sigmoid
- aten.sign
- aten.sin
- aten.sinh
- aten.sqrt
- aten.tan
- aten.tanh

* Convert 3 pointwise trinary operations

- addcdiv
- addcmul
- where

* Convert 2 matmul operations

Also use fx.subgraph_rewriter

- matmul
- linear

* Simplify op conversion

* Fix wrap for ttnn & update test

- ttnn.add (and other ops) don't have __name__, so torch.compile will fail. We hard-patch the op with __name__
- Now ttnn needs a to_layout before computation

* Fix more ops unit test

* Simplify pass insertion for gen_graphviz

* Update test cases for to_layout

* Fix add_data_move support kwargs

* Support linear without bias

* Don't gen graphviz for pointwise unary test

* Fix three ops, and verify some ops

3 ops are fixed
- clone
- hardtanh
- leaky_relu

The following ops are verified and won't be fixed
- atan2: ttnn bug when x or y is 0
- pow: exponent doesn't support tensor type
- xlogy: y == 1 should be 0

* Simplify binary test, Add scalar support

* Support addcmul & addcdiv

* Fix support of addcdiv and where

- addcdiv with option is a special case in torch and needs special pattern matching
- ttnn.where(c, x, y): c cannot be bool, needs a cast

* Convert and test repeat op

* Add silu conversion

* Update new test cases according to torch.compile interface change

* Simplify unary test cases, reuse impl code

* Update test_stat import statement

* Add group_norm conversion

* Try converting layer_norm, but the result is different

* Support repeat

* Update trinary tests

* Support concat

* Support split

* Update format

* [wip] group_norm

* group_norm use customized replacement

The transformer method doesn't work because it replaces native_group_norm->getitem with ttnn.group_norm.
Pattern replacement doesn't work because it uses symbolic tracing, which operates on proxies, not values;
  however, we need to do the value-dependent conversion.

* Add more test_group_norm

* Add more test_layer_norm

* Remove unused patterns/norm.py

* refactor, rename, add comment

* Support x.t()

* move layer_norm impl to customized_replace

* Update for ttnn version 3023ec0f7

* Refactor test case

Don't know why, but if I reuse some code in a test case by wrapping it in a function, it fails when vscode runs more than 8 cases at the same time. It is really strange.

I currently have no idea how this happens,
so I refactored the cases to duplicate the code instead.

* Merge .gitignore

* Resolve conflicts in README.md

* Use incoming test_fall_back.py

* Use our tests/tools/test_stats.py

* Resolve generate_report.py

* Preserve intermediate node meta for composite ops and add checks in respective tests (#50)

Co-authored-by: Artem Yerofieiev <[email protected]>

* Resolve torch_ttnn/__init__.py

* Resolve conflict

* Remove duplicate entries from .gitignore

* Update metadata in setup.py

* Correct the name of the test module in test_datamove.py

* Remove duplicate test for softmax

* Fix merge errors, now binary test passed.

* Remove test_if.py as `if` is already tested by lowering/misc/test_if.py

* Remove test_only_add_matmul.py, superseded by lowering/matmul/test_only_add_matmul.py

* Test group_norm

* Test torch.matmul -> ttnn.matmul

* Test compiling torch.nn.functional.linear

* Refactor test for CSE

* Remove test_norm.py as we've done its migration

* This test no longer tests falling back to torch op but division, which should be handled in lowering/eltwise/binary/test_div.py instead

* Convert tests for unary eltwise ops

* Fix loading pytest arguments

* Convert tests for binary eltwise ops

* Fix precision test for bivariate ops

* Fix precision test for univariate ops

* Remove test_pointwise_trinary.py for flexibility

Trivariate functions differ too much to share the same testing protocol

* Test torch.reshape -> ttnn.reshape

* Test compiling aten.repeat

* Test compiling torch.cat

* Remove test_datamove.py because all its tests have been moved to lowering/tensor_manipulation/

* Remove test already covered by test_resnet.py

* Use more descriptive names in torch_ttnn/patterns/add.py

Co-authored-by: Artem Yerofieiev <[email protected]>

* Use more descriptive names in torch_ttnn/patterns/addcdiv.py

Co-authored-by: Artem Yerofieiev <[email protected]>

* Simpler path to resolve ttnn.addcdiv

Co-authored-by: Artem Yerofieiev <[email protected]>

* Make test names unique for easier batch testing with pytest

* Fix import target_wrappers

* Move code in add_coreops_pass.py to add_data_move_pass.py first to help refactoring

* Refactor lowering univariate functions

* Simplify control flow in lowering

* Refactor lowering bivariate functions

* Sort ops to lower

* Lower to ttnn.atan2

* Lower to ttnn.leaky_relu

* Lower default hardtanh to ttnn.clip for now

* Lower addcdiv and addcmul

* Remove migrated lowering code

* Fix names of ttnn ops

* Remove duplicate entries in the op list

* Remove unused pattern file

* Remove the file that is already merged into add_data_move_pass.py

* Test broadcasting for bivariate ops

* Test broadcasting for all bivariate ops

* Remove intermediate test for `div` so we can test broadcasting

* Regroup ops based on API docs

https://docs.tenstorrent.com/tt-metal/latest/ttnn/ttnn/api.html

* Remove ops not working:

- Comparison ops
- Logical ops
- aten.pow.Tensor_Tensor

* Apply @kevinwuTT's patch on tt.repeat at model teardown

* Format the code with `black` for a consistent style

* Reformat with the specified config

* Mark tests xfail based on #64

* Remove test for unsupported pow(tensor, tensor)

* Mark broadcasting issues (#64) with atan2 and xlogy

* Reformat the code

* Mark broadcasting issues (#64) with (min|max)imum

* Mark broadcasting issues (#64) with subtraction

* Mark numerical issues with atan2

* Try setting realistic tolerance for low precision math ops

* Tolerate more with pointwise unary math ops

* Reflect that we convert torch.hardtanh to ttnn.clip for now

* Remove test for unsupported group norm

* Mark conversion failure with `linear`

* Fix test_clone.py for the patch

* Mark argument mismatch in `arange` (#65)

* Link #66 to test_linear.py

* Mark lowering issues with tensor manipulation (#67)

* Reciprocal needs an offset because it has a pole at 0

* More tolerance for matmul for its accumulated error

* Symmetrically mark xfail for (min|max)imum

* Merge human-made docs from README.md to docs/README.md.in

* Do not use braces for shell variables to avoid clashing with .format

* Generate README.md with new metrics

* Mark xfail for xlogy involving broadcasting

xlogy asserts the same size for inputs for now

---------

Co-authored-by: swimdi <[email protected]>
Co-authored-by: yoco <[email protected]>
Co-authored-by: Zahid Wakeel <[email protected]>
Co-authored-by: Artem Yerofieiev <[email protected]>
Co-authored-by: yoco <[email protected]>
@ayerofieiev-tt
Member

@jdh8 please provide more details so we can file individual tickets with the Op lead

@boris-drazic
Contributor

@jdh8 For reshape, your test (tests/lowering/tensor_manipulation/test_reshape.py) is not actually dispatching aten.reshape to the compiler, but rather aten.view.

Your test module:

def forward(self, x, new_shape):
    return torch.reshape(x, new_shape)

is passed to compiler as

def forward(self, arg0_1):
    view = torch.ops.aten.view.default(arg0_1, [21, 5]);  arg0_1 = None
    return (view,)

which will yield the same result, but without using reshape.
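In other words, a lowering (or a test) keyed only on aten.reshape will never match what the compiler actually sees. A small illustrative sketch of that aliasing (the table and helper are hypothetical, but the torch.reshape -> aten.view.default mapping is exactly what the traced graph above shows):

```python
# Torch-level APIs and the aten op they typically arrive as after tracing.
TRACED_AS = {
    "torch.reshape": "aten.view.default",   # what the compiler actually sees
    "torch.cat": "aten.cat.default",
    "torch.Tensor.repeat": "aten.repeat.default",
}

def expected_aten_target(api):
    """Return the aten target a lowering pass must match for this API."""
    return TRACED_AS.get(api, api)

assert expected_aten_target("torch.reshape") == "aten.view.default"
```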

@boris-drazic
Contributor

I don't see any code for handling the aten ops reshape, concat, and repeat in torch_ttnn/passes/lowering/to_tt_pass.py. That would explain why they are not lowered to ttnn during compilation.
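For context, a lowering pass of this kind rewrites call targets it has entries for and leaves everything else in aten, which is the failure mode reported here. A minimal pure-Python sketch (the real pass operates on torch.fx graphs; the table contents and dict-shaped nodes are placeholders, not the project's actual code):

```python
# Hypothetical lowering table; missing entries mean the op stays in aten.
LOWERING_TABLE = {
    "aten.view.default": "ttnn.reshape",
    "aten.cat.default": "ttnn.concat",
}

def to_tt_pass(nodes, table=LOWERING_TABLE):
    """Rewrite call targets listed in `table`; unlisted ops stay in aten."""
    for n in nodes:
        if n["op"] == "call_function" and n["target"] in table:
            n["target"] = table[n["target"]]
    return nodes

graph = [
    {"op": "call_function", "target": "aten.view.default"},
    {"op": "call_function", "target": "aten.expand.default"},  # no entry
]
to_tt_pass(graph)
assert graph[0]["target"] == "ttnn.reshape"          # lowered
assert graph[1]["target"] == "aten.expand.default"   # stayed in aten
```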

@jdh8
Contributor Author

jdh8 commented Sep 12, 2024

@boris-drazic, I removed the conversion for these ops to keep tenstorrent/pytorch2.0_ttnn#54 small enough to review. Now I'm making a separate PR for each op.

@ayerofieiev-tt ayerofieiev-tt transferred this issue from tenstorrent/pytorch2.0_ttnn Sep 18, 2024
@ntarafdar
Contributor

@jdh8 for each sub-issue, can you give us a ttnn unit test that's failing? CC @ayerofieiev-tt

@jdh8
Contributor Author

jdh8 commented Sep 19, 2024

@tarafdarTT, I've made each sub-issue a PR, so we can experiment on tests without affecting other ops. In each PR, I 'unlocked' some tests by removing @pytest.mark.xfail, and these are the failing tests.
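For readers unfamiliar with the pattern: the known-failing lowerings are kept in the suite but marked expected-to-fail, and a PR "unlocks" a test by deleting the marker. A minimal sketch, assuming pytest (the test body and reason string are illustrative, not the repository's actual test):

```python
import pytest

@pytest.mark.xfail(reason="aten.reshape stays in aten, not yet lowered to ttnn")
def test_reshape():
    # Placeholder body; the real test compiles a module with the ttnn
    # backend and asserts on the ops in the resulting graph.
    assert False
```

Removing the `@pytest.mark.xfail` line turns the expected failure back into a hard failure, which is what surfaces in the summary at the top of this issue.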

@ntarafdar
Contributor

ntarafdar commented Sep 19, 2024

@jdh8 oh okay, do you need anything from me (or the TM team) right now, or is that after you lower to ttnn?
