Ray Cluster Error at tokenizing documents #667
A few other notes:
I'm really confused. Tests pass locally, and I was able to run a job to completion. Can you download /tmp/ray/session_latest/logs/?
I am using … Also, I'm using a shuffle buffer of 100,000. This is the error I'm getting: … These links might be helpful: … Maybe we can just spill to GCS or some other store instead of using the TPUs?
Sure, spilling to GCS sounds good. Want to try that?
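For reference, a minimal sketch of pointing Ray's object spilling at remote storage, using the `smart_open` spilling backend from Ray's object-spilling docs. The `gs://` bucket URI and buffer sizes below are placeholders, and spilling to GCS assumes `smart_open` (with GCS support) is installed on the cluster:

```python
import json

import ray

# Sketch: configure Ray to spill objects to remote storage instead of
# local disk, via the smart_open spilling type documented by Ray.
# The bucket URI is a placeholder, not this project's bucket.
ray.init(
    _system_config={
        "max_io_workers": 4,  # more IO workers help remote-storage throughput
        "min_spilling_size": 100 * 1024 * 1024,  # spill in >=100 MB chunks
        "object_spilling_config": json.dumps(
            {
                "type": "smart_open",
                "params": {"uri": "gs://my-bucket/ray-spill"},  # placeholder
                "buffer_size": 100 * 1024 * 1024,  # 100 MB write buffer
            }
        ),
    },
)
```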
Kiloshard would probably have fixed this, actually.
(Really, we just need better backpressure.)
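On the backpressure point, one common Ray pattern (from Ray's design-pattern docs, not this repo's code; `tokenize_shard` is a hypothetical stand-in) is to cap the number of in-flight tasks with `ray.wait`, so unconsumed outputs can't pile up in the object store:

```python
import ray

ray.init()

@ray.remote
def tokenize_shard(shard_id: int) -> int:
    # Hypothetical stand-in for per-shard tokenization work.
    return shard_id

MAX_IN_FLIGHT = 8  # cap on concurrently pending tasks
in_flight, results = [], []

for shard_id in range(1000):
    if len(in_flight) >= MAX_IN_FLIGHT:
        # Block until one task finishes before submitting another, so
        # results are consumed at roughly the rate they are produced.
        done, in_flight = ray.wait(in_flight, num_returns=1)
        results.extend(ray.get(done))
    in_flight.append(tokenize_shard.remote(shard_id))

results.extend(ray.get(in_flight))
```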
tests/test_tokenized_document_cache.py fails at test_doc_cache_reproduces_data_one_batch_per_shard (see below). Below is the log from the unit test:
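To reproduce just that failure, the single test can be selected with pytest's `-k` filter (standard pytest flags, nothing project-specific); a minimal sketch:

```python
import pytest

# Equivalent to running on the command line:
#   pytest tests/test_tokenized_document_cache.py -k test_doc_cache_reproduces_data_one_batch_per_shard
pytest.main([
    "tests/test_tokenized_document_cache.py",
    "-k", "test_doc_cache_reproduces_data_one_batch_per_shard",
    "-x",  # stop at the first failure
    "-v",  # print the individual test outcome
])
```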