Ray Cluster Error at tokenizing documents #667

Open
Ivan-Zhou opened this issue Jul 22, 2024 · 6 comments

@Ivan-Zhou
Contributor

  • The test file tests/test_tokenized_document_cache.py fails at test_doc_cache_reproduces_data_one_batch_per_shard (see below);
  • The same error happens on training jobs for any dataset without an existing document cache, regardless of the number of files, the config, or the job.

Below is the log from the unit test:

(ChunkCacheBuilder pid=69299) 2024-07-21 17:11:04,538 - levanter.data.shard_cache.builder::tmpaadrd2z7/cache - INFO - Starting cache build for 10 shards
(ChunkCacheBroker pid=69263) 2024-07-21 17:11:01,152 - levanter.data.shard_cache - INFO - Finalizing cache /var/folders/jm/jpc4s6kn3t98gt3rtmrtjmjm0000gn/T/tmpbwjrn1_1/cache...
(ChunkCacheBuilder pid=69270) 2024-07-21 17:11:01,146 - levanter.data.shard_cache - INFO - Shard 0 finished
2024-07-21 17:11:11,335	ERROR worker.py:406 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::ChunkCacheBroker.finished_sentinel() (pid=69291, ip=127.0.0.1, actor_id=11422f45e28b6d442798cc5b01000000, repr=<levanter.data.shard_cache.ChunkCacheBroker object at 0x13726c490>)
  File "/Users/ivan/dev/miniconda3/envs/levanter/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/Users/ivan/dev/miniconda3/envs/levanter/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/Users/ivan/dev/levanter/src/levanter/data/shard_cache.py", line 1076, in finished_sentinel
    await self._finished_sentinel
ray.exceptions.OwnerDiedError: Failed to retrieve object 3bc604631e19bb66ffffffffffffffffffffffff0100000001000000. To see information about where this ObjectRef was created in Python, set the environment variable RAY_record_ref_creation_sites=1 during `ray start` and `ray.init()`.

The object's owner has exited. This is the Python worker that first created the ObjectRef via `.remote()` or `ray.put()`. Check cluster logs (`/tmp/ray/session_latest/logs`) for more information about the Python worker failure.
2024-07-21 17:11:11,341	ERROR worker.py:406 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::ChunkCacheBroker.finished_sentinel() (pid=69291, ip=127.0.0.1, actor_id=11422f45e28b6d442798cc5b01000000, repr=<levanter.data.shard_cache.ChunkCacheBroker object at 0x13726c490>)
  File "/Users/ivan/dev/miniconda3/envs/levanter/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/Users/ivan/dev/miniconda3/envs/levanter/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/Users/ivan/dev/miniconda3/envs/levanter/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/Users/ivan/dev/miniconda3/envs/levanter/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/Users/ivan/dev/levanter/src/levanter/data/shard_cache.py", line 1076, in finished_sentinel
    await self._finished_sentinel
ray.exceptions.OwnerDiedError: Failed to retrieve object 3bc604631e19bb66ffffffffffffffffffffffff0100000001000000. To see information about where this ObjectRef was created in Python, set the environment variable RAY_record_ref_creation_sites=1 during `ray start` and `ray.init()`.

The object's owner has exited. This is the Python worker that first created the ObjectRef via `.remote()` or `ray.put()`. Check cluster logs (`/tmp/ray/session_latest/logs`) for more information about the Python worker failure.
============================================================================================================ warnings summary ============================================================================================================
tests/test_tokenized_document_cache.py::test_doc_cache_reproduces_data_multi_docs_per_batch_sharded[1]
tests/test_tokenized_document_cache.py::test_doc_cache_reproduces_data_multi_docs_per_batch_sharded[2]
tests/test_tokenized_document_cache.py::test_doc_cache_reproduces_data_multi_docs_per_batch_sharded[3]
tests/test_tokenized_document_cache.py::test_doc_cache_reproduces_data_multi_docs_per_batch_sharded[8]
tests/test_tokenized_document_cache.py::test_doc_cache_sharding
  /Users/ivan/dev/miniconda3/envs/levanter/lib/python3.10/site-packages/dataclasses_json/core.py:189: RuntimeWarning: 'NoneType' object value of non-optional type metadata detected when decoding CacheLedger.
    warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
======================================================================================================== short test summary info =========================================================================================================
FAILED test_tokenized_document_cache.py::test_doc_cache_reproduces_data_one_batch_per_shard - ray.exceptions.RayTaskError(OwnerDiedError): ray::ChunkCacheBroker.finished_sentinel() (pid=69291, ip=127.0.0.1, actor_id=11422f45e28b6d442798cc5b01000000, repr=<levanter.data.shard_cache.ChunkCacheBroker object at 0x13726c490>)
=========================================================================================== 1 failed, 8 passed, 5 warnings in 73.16s (0:01:13) ============================================================
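For what it's worth, the error message suggests setting RAY_record_ref_creation_sites=1 to see where the dead ObjectRef was created. A minimal sketch of doing that for a local test run (the placement is hypothetical; for a standing cluster the variable would instead be set in the environment of `ray start`):

import os

# Must be set before Ray starts so that worker processes inherit it.
os.environ["RAY_record_ref_creation_sites"] = "1"

import ray

ray.init()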
@abhinavg4
Contributor

A few other notes:

  1. Branches to try this issue on: fineweb_data and dclm
  2. The issue might be external to Levanter, since I had processed a dataset (fineweb_md) earlier with the same code.

@dlwh
Member

dlwh commented Jul 22, 2024

I'm really confused. Tests pass locally and I was able to run a job to completion. Can you download /tmp/ray/session_latest/logs/?

@abhinavg4
Contributor

I am using dlwh/fineweb_llama_txt and it fails when the number of files is very high (for fineweb).

Also, I'm using a shuffle buffer of 100,000. The error I'm getting is in the raylet log below.

These links might be helpful:

  1. https://docs.ray.io/en/latest/ray-core/objects/object-spilling.html

Maybe we can just spill to GCS or some other store instead of the TPU hosts' local disk?
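
A minimal sketch of what that could look like, assuming Ray's experimental smart_open spilling backend and a hypothetical bucket gs://my-bucket/ray-spill (untested here; requires smart_open with GCS support installed on every node):

import json
import ray

# Hypothetical config: spill objects to remote storage via smart_open instead of local disk.
ray.init(
    _system_config={
        "max_io_workers": 4,  # more parallel IO workers for spilling/restoring
        "object_spilling_config": json.dumps(
            {
                "type": "smart_open",
                "params": {"uri": "gs://my-bucket/ray-spill"},
                "buffer_size": 100 * 1024 * 1024,  # 100 MiB write buffer
            }
        ),
    },
)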

[2024-07-26 19:42:17,659 I 2062 2062] (raylet) local_object_manager.cc:245: :info_message:Spilled 1530 MiB, 1999 objects, write throughput 586 MiB/s.
[2024-07-26 19:42:17,664 I 2062 2062] (raylet) local_resource_manager.cc:287: Object store memory is not idle.
[2024-07-26 19:42:17,764 I 2062 2062] (raylet) local_resource_manager.cc:287: Object store memory is not idle.
[2024-07-26 19:42:17,783 I 2062 2062] (raylet) node_manager.cc:525: [state-dump] NodeManager:
[state-dump] Node ID: 896e4bb4773427857862284c837792ebc366061252134ea14bf51abe
[state-dump] Node name: 10.130.3.18
[state-dump] InitialConfigResources: {accelerator_type:TPU-V4: 10000, abhinavg-fw-txt-preempt-v4-256: 10000, node:__internal_head__: 10000, memory: 3816690937850000, node:10.130.3.18: 10000, TPU-v4-256-head: 10000, object_store_memory: 326417475580000, TPU: 40000, CPU: 2400000}
[state-dump] ClusterTaskManager:
[state-dump] ========== Node: 896e4bb4773427857862284c837792ebc366061252134ea14bf51abe =================
[state-dump] Infeasible queue length: 0
[state-dump] Schedule queue length: 0
[state-dump] Dispatch queue length: 0
[state-dump] num_waiting_for_resource: 0
[state-dump] num_waiting_for_plasma_memory: 0
[state-dump] num_waiting_for_remote_node_resources: 0
[state-dump] num_worker_not_started_by_job_config_not_exist: 0
[state-dump] num_worker_not_started_by_registration_timeout: 0
[state-dump] num_tasks_waiting_for_workers: 0
[state-dump] num_cancelled_tasks: 0
[state-dump] cluster_resource_scheduler state:
[state-dump] Local id: 7369852251879098477 Local resources: {"total":{memory: [3816690937850000], object_store_memory: [326417475580000], node:10.130.3.18: [10000], node:__internal_head__: [10000], abhinavg-fw-txt-preempt-v4-256: [10000], accelerator_type:TPU-V4: [10000], TPU: [10000, 10000, 10000, 10000], TPU-v4-256-head: [10000], CPU: [2400000]}}, "available": {memory: [3816690937850000], object_store_memory: [80228517160000], node:10.130.3.18: [10000], node:__internal_head__: [10000], abhinavg-fw-txt-preempt-v4-256: [10000], accelerator_type:TPU-V4: [10000], TPU: [10000, 10000, 10000, 10000], TPU-v4-256-head: [10000], CPU: [1915000]}}, "labels":{"ray.io/node_id":"896e4bb4773427857862284c837792ebc366061252134ea14bf51abe",} is_draining: 0 is_idle: 0 Cluster resources:
node id: 2657070934577218762{"total":{TPU: 40000, CPU: 2400000, object_store_memory: 326417475580000, abhinavg-fw-txt-preempt-v4-256: 10000, memory: 3925625952250000, accelerator_type:TPU-V4: 10000, node:10.130.3.34: 10000}}, "available": {accelerator_type:TPU-V4: 10000, memory: 3925625952250000, object_store_memory: 293687982380000, CPU: 1795000, TPU: 40000, node:10.130.3.34: 10000, abhinavg-fw-txt-preempt-v4-256: 10000}}, "labels":{"ray.io/node_id":"d6c1845c76582eee006857ae81b0ef895deabaaf0b3e2b122207d24e",}, "is_draining": 0, "draining_deadline_timestamp_ms": -1}
node id: 305753936840358228{"total":{accelerator_type:TPU-V4: 10000, node:10.130.3.26: 10000, abhinavg-fw-txt-preempt-v4-256: 10000, object_store_memory: 326417475580000, CPU: 2400000, TPU: 40000, memory: 3925650733050000}}, "available": {node:10.130.3.26: 10000, CPU: 2275000, TPU: 40000, object_store_memory: 97949744960000, memory: 3925650733050000, accelerator_type:TPU-V4: 10000, abhinavg-fw-txt-preempt-v4-256: 10000}}, "labels":{"ray.io/node_id":"3eec0bb7d611e7aad2861a1efdf26a37a9c3d6e960425f200761470d",}, "is_draining": 0, "draining_deadline_timestamp_ms": -1}
node id: 1294823379791528053{"total":{TPU: 40000, object_store_memory: 326417475580000, node:10.130.2.207: 10000, memory: 3925623003130000, abhinavg-fw-txt-preempt-v4-256: 10000, accelerator_type:TPU-V4: 10000, CPU: 2400000}}, "available": {memory: 3925623003130000, accelerator_type:TPU-V4: 10000, object_store_memory: 298065983560000, node:10.130.2.207: 10000, TPU: 40000, CPU: 1795000, abhinavg-fw-txt-preempt-v4-256: 10000}}, "labels":{"ray.io/node_id":"2f831786398540a40003d66664592345bd68e714da3e718e1cef1df5",}, "is_draining": 0, "draining_deadline_timestamp_ms": -1}
node id: 8300664650656660378{"total":{node:10.130.3.30: 10000, memory: 3925644548090000, abhinavg-fw-txt-preempt-v4-256: 10000, object_store_memory: 326417475580000, CPU: 2400000, TPU: 40000, accelerator_type:TPU-V4: 10000}}, "available": {CPU: 2275000, TPU: 40000, memory: 3925644548090000, accelerator_type:TPU-V4: 10000, abhinavg-fw-txt-preempt-v4-256: 10000, node:10.130.3.30: 10000, object_store_memory: 78596266190000}}, "labels":{"ray.io/node_id":"6389a432307b5cb2dd0deca0dbac5b6695695486218e47b86d678dce",}, "is_draining": 0, "draining_deadline_timestamp_ms": -1}
node id: 7369852251879098477{"total":{memory: 3816690937850000, node:10.130.3.18: 10000, object_store_memory: 326417475580000, TPU-v4-256-head: 10000, CPU: 2400000, TPU: 40000, accelerator_type:TPU-V4: 10000, abhinavg-fw-txt-preempt-v4-256: 10000, node:__internal_head__: 10000}}, "available": {object_store_memory: 80228517160000, TPU-v4-256-head: 10000, CPU: 1915000, node:10.130.3.18: 10000, accelerator_type:TPU-V4: 10000, TPU: 40000, abhinavg-fw-txt-preempt-v4-256: 10000, memory: 3816690937850000, node:__internal_head__: 10000}}, "labels":{"ray.io/node_id":"896e4bb4773427857862284c837792ebc366061252134ea14bf51abe",}, "is_draining": 0, "draining_deadline_timestamp_ms": -1}
node id: 6093590469938625960{"total":{memory: 3925636888570000, node:10.130.3.42: 10000, abhinavg-fw-txt-preempt-v4-256: 10000, object_store_memory: 326417475580000, CPU: 2400000, TPU: 40000, accelerator_type:TPU-V4: 10000}}, "available": {accelerator_type:TPU-V4: 10000, abhinavg-fw-txt-preempt-v4-256: 10000, node:10.130.3.42: 10000, TPU: 40000, memory: 3925636888570000, object_store_memory: 192151657400000, CPU: 2035000}}, "labels":{"ray.io/node_id":"d890a7e9a2827c5a73f42f3cbf5c463c528912f9a00d7b2b2790dcda",}, "is_draining": 0, "draining_deadline_timestamp_ms": -1}
node id: -5257004634468912003{"total":{TPU: 40000, object_store_memory: 326417475580000, node:10.130.3.46: 10000, abhinavg-fw-txt-preempt-v4-256: 10000, memory: 3925643810810000, accelerator_type:TPU-V4: 10000, CPU: 2400000}}, "available": {TPU: 40000, CPU: 2395000, node:10.130.3.46: 10000, abhinavg-fw-txt-preempt-v4-256: 1000
...skipping...
[2024-07-26 19:42:19,652 E 2062 2062] (raylet) local_object_manager.cc:243: :info_message:Spilled 7641 MiB, 9998 objects, write throughput 1660 MiB/s. Set RAY_verbose_spill_logs=0 to disable this message.

@dlwh
Member

dlwh commented Jul 26, 2024

Sure, spilling to GCS sounds good. Want to try that?

@dlwh
Member

dlwh commented Jul 26, 2024

Kiloshard probably would have fixed this, actually.

@dlwh
Member

dlwh commented Jul 26, 2024

(really we just need better back pressure)
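
A generic sketch of one way to add back pressure with ray.wait, bounding the number of in-flight tasks (tokenize_shard, the shard list, and the limit of 16 are purely illustrative, not the actual shard_cache builder logic):

import ray

ray.init(ignore_reinit_error=True)

@ray.remote
def tokenize_shard(shard_id):
    # Stand-in for the real per-shard tokenization work.
    return f"tokenized-{shard_id}"

MAX_IN_FLIGHT = 16  # illustrative bound
shards = list(range(1000))

in_flight, results = [], []
for shard in shards:
    if len(in_flight) >= MAX_IN_FLIGHT:
        # Wait for at least one task to finish before submitting more,
        # so pending results don't pile up in the object store and spill.
        done, in_flight = ray.wait(in_flight, num_returns=1)
        results.extend(ray.get(done))
    in_flight.append(tokenize_shard.remote(shard))
results.extend(ray.get(in_flight))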
