Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ttnn.full and ttnn.ones operations fail by assert valid_page_size on some dimensions #8633

Closed
nemanjagrujic opened this issue May 20, 2024 · 15 comments
Assignees
Labels
bug Something isn't working GS op_cat: eltwise WH

Comments

@nemanjagrujic
Copy link
Contributor

nemanjagrujic commented May 20, 2024

When testing ttnn ops with TILE layout and random shapes like [4, 7, 21, 133] most of the operations work correctly.

But ttnn.full and ttnn.ones operations fail with assertion like:

TT_FATAL @ ../tt_metal/impl/buffers/buffer.cpp:31: valid_page_size
info:
For valid non-interleaved buffers page size 2048 must equal buffer size 156408. For interleaved-buffers page size should be divisible by buffer size

which is followed by crash:

corrupted size vs. prev_size
Fatal Python error: Aborted

or sometimes:

malloc(): invalid size (unsorted)
Fatal Python error: Aborted

Note that similar operation ttnn.zeros works in same conditions!

Problem is observed on both GS and WH cards.

To Reproduce
Steps to reproduce the behavior:

  1. Checkout branch ngrujic/op_bug_unit_tests (soon to be merged into main).
  2. Run unit test test_eltwise_full_and_ones.py using this command:
pytest tests/ttnn/python_api_testing/non_working_unit_tests/wormhole/test_eltwise_full_and_ones.py

Expected behavior
There are few test cases presented in the unit test, which are failing with:

TT_FATAL @ ../tt_metal/impl/buffers/buffer.cpp:31: valid_page_size
info:
For valid non-interleaved buffers page size 2048 must equal buffer size 156408. For interleaved-buffers page size should be divisible by buffer size
backtrace:
 --- tt::tt_metal::Buffer::Buffer(tt::tt_metal::Device*, unsigned long, unsigned long, tt::tt_metal::BufferType, tt::tt_metal::TensorMemoryLayout, std::optional<tt::tt_metal::ShardSpecBuffer>, bool)
 --- tt::tt_metal::tensor_impl::detail::allocate_interleaved_buffer_on_device(unsigned int, tt::tt_metal::Device*, tt::tt_metal::Shape const&, tt::tt_metal::DataType, tt::tt_metal::Layout, tt::tt_metal::MemoryConfig const&)
 --- tt::tt_metal::tensor_impl::allocate_buffer_on_device(unsigned int, tt::tt_metal::Device*, tt::tt_metal::Shape const&, tt::tt_metal::DataType, tt::tt_metal::Layout, tt::tt_metal::MemoryConfig const&, std::optional<tt::tt_metal::ShardSpecBuffer>)
 --- /home/ubuntu/tt-metal/build/lib/libtt_eager.so(+0x6dab76) [0x7f6b51f6ab76]
 --- /home/ubuntu/tt-metal/build/lib/libtt_eager.so(_ZNSt8__detail9__variant17__gen_vtable_implILb1ENS0_12_Multi_arrayIPFSt10shared_ptrIN2tt8tt_metal6BufferEEOZNS5_11tensor_impl16to_device_bufferI8bfloat16EES7_RKSt7variantIJNS5_12OwnedStorageENS5_13DeviceStorageENS5_15BorrowedStorageENS5_22MultiDeviceHostStorageENS5_18MultiDeviceStorageEEEPNS5_6DeviceERKNS5_5ShapeENS5_8DataTypeENS5_6LayoutERKNS5_12MemoryConfigESt8optionalINS5_15ShardSpecBufferEESU_ISt17reference_wrapperINS5_12CommandQueueEEEEUlOT_E_SJ_EJEEESt5tupleIJSJ_EESt16integer_sequenceImJLm0EEEE14__visit_invokeES14_SJ_+0x22) [0x7f6b51f6b0c2]
 --- /home/ubuntu/tt-metal/build/lib/libtt_eager.so(+0x6a8140) [0x7f6b51f38140]
 --- std::_Function_handler<tt::tt_metal::Tensor (tt::tt_metal::Tensor const&, tt::tt_metal::Device*, tt::tt_metal::MemoryConfig const&, std::optional<std::reference_wrapper<tt::tt_metal::CommandQueue> >), tt::tt_metal::Tensor (*)(tt::tt_metal::Tensor const&, tt::tt_metal::Device*, tt::tt_metal::MemoryConfig const&, std::optional<std::reference_wrapper<tt::tt_metal::CommandQueue> >)>::_M_invoke(std::_Any_data const&, tt::tt_metal::Tensor const&, tt::tt_metal::Device*&&, tt::tt_metal::MemoryConfig const&, std::optional<std::reference_wrapper<tt::tt_metal::CommandQueue> >&&)
 --- tt::tt_metal::tensor_impl::to_device_wrapper(tt::tt_metal::Tensor const&, tt::tt_metal::Device*, tt::tt_metal::MemoryConfig const&, std::optional<std::reference_wrapper<tt::tt_metal::CommandQueue> >)
 --- /home/ubuntu/tt-metal/build/lib/libtt_eager.so(+0x764d86) [0x7f6b51ff4d86]
 --- tt::tt_metal::Device::push_work(std::function<void ()>&&, bool)
 --- tt::tt_metal::Tensor::to(tt::tt_metal::Device*, tt::tt_metal::MemoryConfig const&) const
 --- /home/ubuntu/tt-metal/build/lib/libtt_eager.so(+0x562a17) [0x7f6b51df2a17]
 --- tt::tt_metal::full(tt::tt_metal::Shape, float, tt::tt_metal::DataType, tt::tt_metal::Layout, tt::tt_metal::Device*, tt::tt_metal::MemoryConfig const&)
 --- /home/ubuntu/tt-metal/tt_eager/tt_lib/_C.so(+0x1d13f9) [0x7f6b523683f9]
 --- /home/ubuntu/tt-metal/tt_eager/tt_lib/_C.so(+0x1038e1) [0x7f6b5229a8e1]
 --- python(PyCFunction_Call+0x59) [0x5d5499]
 --- python(_PyEval_EvalFrameDefault+0x6eb0) [0x54d9f0]
 --- python(_PyEval_EvalCodeWithName+0x26a) [0x54552a]
 --- python(_PyFunction_Vectorcall+0x393) [0x5d5a23]
 --- python(PyObject_Call+0x62) [0x5d4c12]
 --- python(_PyEval_EvalFrameDefault+0x1f26) [0x548a66]
 --- python(_PyEval_EvalCodeWithName+0x26a) [0x54552a]
 --- python(_PyFunction_Vectorcall+0x393) [0x5d5a23]
 --- python(PyObject_Call+0x62) [0x5d4c12]
 --- python(_PyEval_EvalFrameDefault+0x1f26) [0x548a66]
 --- python(_PyEval_EvalCodeWithName+0x26a) [0x54552a]
 --- python(_PyFunction_Vectorcall+0x393) [0x5d5a23]
 --- python() [0x57a4af]
 --- python(_PyObject_MakeTpCall+0x296) [0x5d6066]
 --- python(_PyEval_EvalFrameDefault+0x690a) [0x54d44a]
 --- python(_PyEval_EvalCodeWithName+0x26a) [0x54552a]
 --- python(_PyFunction_Vectorcall+0x393) [0x5d5a23]
 --- python(PyObject_Call+0x62) [0x5d4c12]
 --- python(_PyEval_EvalFrameDefault+0x1f26) [0x548a66]
 --- python(_PyEval_EvalCodeWithName+0x26a) [0x54552a]
 --- python(_PyFunction_Vectorcall+0x393) [0x5d5a23]
 --- python(PyObject_Call+0x62) [0x5d4c12]
 --- python(_PyEval_EvalFrameDefault+0x1f26) [0x548a66]
 --- python(_PyEval_EvalCodeWithName+0x26a) [0x54552a]
 --- python(_PyFunction_Vectorcall+0x393) [0x5d5a23]
 --- python(PyObject_Call+0x62) [0x5d4c12]
 --- python(_PyEval_EvalFrameDefault+0x1f26) [0x548a66]
 --- python(_PyEval_EvalCodeWithName+0x26a) [0x54552a]
 --- python(_PyFunction_Vectorcall+0x393) [0x5d5a23]
 --- python() [0x57a4af]
 --- python(_PyObject_MakeTpCall+0x296) [0x5d6066]
 --- python(_PyEval_EvalFrameDefault+0x690a) [0x54d44a]
 --- python(_PyEval_EvalCodeWithName+0x26a) [0x54552a]
 --- python(_PyFunction_Vectorcall+0x393) [0x5d5a23]
 --- python(PyObject_Call+0x62) [0x5d4c12]
 --- python(_PyEval_EvalFrameDefault+0x1f26) [0x548a66]
 --- python(_PyEval_EvalCodeWithName+0x26a) [0x54552a]
 --- python(_PyFunction_Vectorcall+0x393) [0x5d5a23]
 --- python(PyObject_Call+0x62) [0x5d4c12]
 --- python(_PyEval_EvalFrameDefault+0x1f26) [0x548a66]
 --- python(_PyEval_EvalCodeWithName+0x26a) [0x54552a]
 --- python(_PyFunction_Vectorcall+0x393) [0x5d5a23]
 --- python(PyObject_Call+0x62) [0x5d4c12]
 --- python(_PyEval_EvalFrameDefault+0x1f26) [0x548a66]
 --- python(_PyEval_EvalCodeWithName+0x26a) [0x54552a]
 --- python(_PyFunction_Vectorcall+0x393) [0x5d5a23]
 --- python(_PyEval_EvalFrameDefault+0x725) [0x547265]
 --- python(_PyEval_EvalCodeWithName+0x26a) [0x54552a]
 --- python(_PyFunction_Vectorcall+0x393) [0x5d5a23]
 --- python(_PyEval_EvalFrameDefault+0x725) [0x547265]
 --- python(_PyFunction_Vectorcall+0x1b6) [0x5d5846]
 --- python(_PyEval_EvalFrameDefault+0x725) [0x547265]
 --- python(_PyEval_EvalCodeWithName+0x26a) [0x54552a]
 --- python(PyEval_EvalCode+0x27) [0x684327]
 --- python() [0x673a41]
 --- python() [0x673abb]
 --- python() [0x673b61]
 --- python(PyRun_SimpleFileExFlags+0x197) [0x6747e7]
 --- python(Py_RunMain+0x212) [0x6b4072]
 --- python(Py_BytesMain+0x2d) [0x6b43fd]
 --- /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7f6b915b9083]
 --- python(_start+0x2e) [0x5da67e]

which is followed by crash:

corrupted size vs. prev_size
Fatal Python error: Aborted

Current thread 0x00007f1d5c860740 (most recent call first):
  File "/home/ubuntu/tt-metal/tests/tt_eager/python_api_testing/sweep_tests/common.py", line 50 in run_tt_lib_test
  File "/home/ubuntu/tt-metal/tests/tt_eager/python_api_testing/sweep_tests/common.py", line 74 in _try_except_wrapper
  File "/home/ubuntu/tt-metal/tests/tt_eager/python_api_testing/sweep_tests/common.py", line 93 in run_test_and_save_results
  File "/home/ubuntu/tt-metal/tests/tt_eager/python_api_testing/sweep_tests/run_pytorch_test.py", line 241 in run_sweep_test
  File "/home/ubuntu/tt-metal/tests/tt_eager/python_api_testing/sweep_tests/run_pytorch_test.py", line 301 in run_sweep_tests
  File "/home/ubuntu/tt-metal/tests/ttnn/python_api_testing/sweep_tests/run_sweep_test.py", line 27 in test_run_sweep
  File "/home/ubuntu/tt-metal/python_env/lib/python3.8/site-packages/_pytest/python.py", line 195 in pytest_pyfunc_call
  File "/home/ubuntu/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_callers.py", line 103 in _multicall
  File "/home/ubuntu/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/home/ubuntu/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_hooks.py", line 513 in __call__
  File "/home/ubuntu/tt-metal/python_env/lib/python3.8/site-packages/_pytest/python.py", line 1789 in runtest
  File "/home/ubuntu/tt-metal/python_env/lib/python3.8/site-packages/_pytest/runner.py", line 167 in pytest_runtest_call
  File "/home/ubuntu/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_callers.py", line 103 in _multicall
  File "/home/ubuntu/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/home/ubuntu/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_hooks.py", line 513 in __call__
  File "/home/ubuntu/tt-metal/python_env/lib/python3.8/site-packages/_pytest/runner.py", line 260 in <lambda>
  File "/home/ubuntu/tt-metal/python_env/lib/python3.8/site-packages/_pytest/runner.py", line 339 in from_call
  File "/home/ubuntu/tt-metal/python_env/lib/python3.8/site-packages/_pytest/runner.py", line 259 in call_runtest_hook
  File "/home/ubuntu/tt-metal/python_env/lib/python3.8/site-packages/_pytest/runner.py", line 220 in call_and_report
  File "/home/ubuntu/tt-metal/python_env/lib/python3.8/site-packages/_pytest/runner.py", line 131 in runtestprotocol
  File "/home/ubuntu/tt-metal/python_env/lib/python3.8/site-packages/_pytest/runner.py", line 112 in pytest_runtest_protocol
  File "/home/ubuntu/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_callers.py", line 103 in _multicall
  File "/home/ubuntu/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/home/ubuntu/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_hooks.py", line 513 in __call__
  File "/home/ubuntu/tt-metal/python_env/lib/python3.8/site-packages/_pytest/main.py", line 349 in pytest_runtestloop
  File "/home/ubuntu/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_callers.py", line 103 in _multicall
  File "/home/ubuntu/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/home/ubuntu/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_hooks.py", line 513 in __call__
  File "/home/ubuntu/tt-metal/python_env/lib/python3.8/site-packages/_pytest/main.py", line 324 in _main
  File "/home/ubuntu/tt-metal/python_env/lib/python3.8/site-packages/_pytest/main.py", line 270 in wrap_session
  File "/home/ubuntu/tt-metal/python_env/lib/python3.8/site-packages/_pytest/main.py", line 317 in pytest_cmdline_main
  File "/home/ubuntu/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_callers.py", line 103 in _multicall
  File "/home/ubuntu/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/home/ubuntu/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_hooks.py", line 513 in __call__
  File "/home/ubuntu/tt-metal/python_env/lib/python3.8/site-packages/_pytest/config/__init__.py", line 167 in main
  File "/home/ubuntu/tt-metal/python_env/lib/python3.8/site-packages/_pytest/config/__init__.py", line 190 in console_main
  File "/home/ubuntu/tt-metal/python_env/bin/pytest", line 8 in <module>
Aborted (core dumped)

Note to observe this crash (and not only assertion fail), you must run sweep:

Running sweeps
To get additional information and results for different combinations of input shapes, types, layouts and memory configs for which this operation was tested you can also run locally sweeps and check the results. To do this you should:

  1. Run non working sweep by using pytest tests/ttnn/python_api_testing/sweep_tests/run_sweep_test.py --input-path tests/ttnn/python_api_testing/sweep_tests/test_configs/ci_sweep_tests_broken/wormhole/ttnn_eltwise_full_test.yaml --input-method cli --cli-input results_ttnn_full
  2. After the run is completed all test sweeps results should be available inside specified output directory.

There are more sweeps which you can try by changing the above command to target files:

tests/ttnn/python_api_testing/sweep_tests/test_configs/ci_sweep_tests_broken/wormhole/ttnn_eltwise_full_test.yaml
tests/ttnn/python_api_testing/sweep_tests/test_configs/ci_sweep_tests_broken/wormhole/ttnn_eltwise_ones_test.yaml
tests/ttnn/python_api_testing/sweep_tests/test_configs/ci_sweep_tests_broken/grayskull/ttnn_eltwise_full_test.yaml
tests/ttnn/python_api_testing/sweep_tests/test_configs/ci_sweep_tests_broken/grayskull/ttnn_eltwise_ones_test.yaml
@ruthreshx
Copy link
Contributor

Hi @nemanjagrujic ,

I have ran the test the issue is reproducible. I have a question here.
Since the input is Tile_Layout, the shape should be tile right?
I have ran the test with random tile shapes (4, 7, 288, 3072) of H & W was multiple of 32. The test was passing.

@eyonland any comment here

@nemanjagrujic
Copy link
Contributor Author

nemanjagrujic commented Sep 17, 2024

Hello @ruthreshx, most of ttnn ops work correctly with tile layout even if shape is not tile (almost all ops). There is some automatic handling of that case.

@ruthreshx
Copy link
Contributor

Hi @nemanjagrujic ,
Not sure how the test is passing even without the shape being tile, since the layout is Tile_one.
Somehow it has been padded, and unpad has happened during such cases.
Can I have any sample test that has been passing since the layout is Tile_one and the shape is not Tile_one?

So that I'm able to track the function calls over there to compare such issues between those ops.
Thank you!

@nemanjagrujic
Copy link
Contributor Author

nemanjagrujic commented Sep 17, 2024

@ruthreshx ,

Not sure what do you mean by Tile_one. But you can checkout branch ngrujic/tile_layout_example where I added test to test_eltwise_full_and_ones.py file where ttnn.abs is passing with exactly same shape and layout where ttnn.full and ttnn.ones are failing.

You can try:

pytest tests/ttnn/python_api_testing/non_working_unit_tests/wormhole/test_eltwise_full_and_ones.py

@ruthreshx
Copy link
Contributor

ruthreshx commented Sep 17, 2024

Hi @nemanjagrujic , @eyonland ,

Yes, Reg the abs, it should pass, but Full Op has a restriction from the kernel side.
When the layout is TILE, our flow gets inside, and it is acquired by the assert 132 % 32 = 5, as a result 5 = 0.
It was supposed to throw an error, but somehow it has been redirected to a valid_page issue. 

if (layout == Layout::TILE) {

        std::cout<<" ENTERED INTO Layout OP "<<shape[-1]<<" "<< tt::constants::TILE_WIDTH<<" "<<shape[-1] % 
        tt::constants::TILE_WIDTH<<std::endl;
        if (shape.rank() < 2) {
            TT_THROW("TILE layout requires rank >= 2");
        }
        TT_ASSERT(
            shape[-1] % tt::constants::TILE_WIDTH == 0,
            "TILE layout requires width dimension to be multiple of {}",
            tt::constants::TILE_WIDTH);
        TT_ASSERT(
            shape[-2] % tt::constants::TILE_HEIGHT == 0,
            "TILE layout requires height dimension to be multiple of {}",
            tt::constants::TILE_HEIGHT);
    }

Please find the full op implementation, where it has an TT assert link

And, ones op has also using the full op implementation.
Untitled

@nemanjagrujic
Copy link
Contributor Author

nemanjagrujic commented Sep 18, 2024

@ruthreshx, @eyonland Well, that seems like an opportunity for improvement then. If only few ops have that restriction it can (unnecessarly) complicate user workflow.

For instance even ttnn.zeros, which is very similar op, works in such cases.

@ruthreshx
Copy link
Contributor

ruthreshx commented Sep 20, 2024

Hi @nemanjagrujic , @eyonland ,
From the debug mode we found when the layout is TILE, we have a restriction for the full op and ones op.
Where H & W has to multiples of 32.

Either we need to change the shape to Tile or we need to redirect this to the respective author who implemented this op to add such support.

Thank you!

@nemanjagrujic
Copy link
Contributor Author

@ruthreshx, @eyonland, well I recommend redirecting..

@ruthreshx
Copy link
Contributor

Hi @nemanjagrujic , @eyonland ,
Case 1:
As I mentioned in the above comment, we found when the layout is TILE, we have a restriction for the full op and ones op.

Case 2:
It is supporting the row_major order, but there is an restriction that the width should be multiples of 2.

I have raised the PR for TT_Fatal to avoid the issue been capturing in debug mode.
PR link: 12921

ruthreshx added a commit that referenced this issue Sep 25, 2024
#8633: Add tt_fatal for full and ones op
@eyonland eyonland assigned umadevimcw and unassigned ruthreshx Sep 30, 2024
@VirdhatchaniKN
Copy link
Contributor

Merged PR #12921 to main. can we close this issue @nemanjagrujic

@nemanjagrujic
Copy link
Contributor Author

nemanjagrujic commented Oct 15, 2024

Merged PR #12921 to main. can we close this issue @nemanjagrujic

@VirdhatchaniKN Have in mind that the limitation is not added to docs:

https://docs.tenstorrent.com/tt-metal/latest/ttnn/ttnn/api/ttnn.full.html

and, example given in docs does not work since we have this limitation.

@VirdhatchaniKN
Copy link
Contributor

VirdhatchaniKN commented Oct 15, 2024

@VirdhatchaniKN Have in mind that the limitation is not added to docs:
https://docs.tenstorrent.com/tt-metal/latest/ttnn/ttnn/api/ttnn.full.html
and, example given in docs does not work since we have this limitation.

Hi @nemanjagrujic ,
I've updated the docs in PR #13808

@VirdhatchaniKN
Copy link
Contributor

Hi @nemanjagrujic , I've updated the docs like shown below in #13808
Screenshot 2024-10-15 at 7 35 13 PM

@VirdhatchaniKN
Copy link
Contributor

Merged #13808

@VirdhatchaniKN
Copy link
Contributor

Closing Issue

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working GS op_cat: eltwise WH
Projects
None yet
Development

No branches or pull requests

4 participants