There is a bug when trying to tokenize a new dataset with Levanter on a large TPU with many workers.
For example, if I try to re-tokenize the Dolma dataset on a v4-256, I get the following unrelated error, which claims validation files are missing even though they never existed (the files should be recreated during tokenization after nuking the prior tokenized dataset). A sketch of the commit pattern involved follows the trace.
PRIOR LOG TRACE:
Traceback (most recent call last):
File "/home/ahmed/levanter/src/levanter/main/train_lm.py", line 223, in<module>levanter.config.main(main)()
File "/home/ahmed/levanter/src/levanter/config.py", line 84, in wrapper_inner
response = fn(cfg, *args, **kwargs)
File "/home/ahmed/levanter/src/levanter/main/train_lm.py", line 119, in main
tagged_eval_datasets: list = config.data.tagged_eval_sets(Pos.size)
File "/home/ahmed/levanter/src/levanter/data/text.py", line 584, in tagged_eval_sets
eval_sets = self.validation_sets(seq_len, monitors)
File "/home/ahmed/levanter/src/levanter/data/text.py", line 789, in validation_sets
doc_caches = self.build_caches("validation", monitors=monitors)
File "/home/ahmed/levanter/src/levanter/data/text.py", line 827, in build_caches
cache.await_finished()
File "/home/ahmed/levanter/src/levanter/store/cache.py", line 1081, in await_finished
x = ray.get(self.finished_sentinel(), timeout=timeout)
File "/home/ahmed/venv310/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
return fn(*args, **kwargs)
File "/home/ahmed/venv310/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
File "/home/ahmed/venv310/lib/python3.10/site-packages/ray/_private/worker.py", line 2664, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
File "/home/ahmed/venv310/lib/python3.10/site-packages/ray/_private/worker.py", line 871, in get_objects
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(FileNotFoundError): ray::_TreeStoreCacheBuilder.finished_sentinel() (pid=38797, ip=10.130.15.218, actor_id=0335e6394f02e4081441cf7f01000000, repr=<levanter.store.cache._TreeStoreCacheBuilder object at 0x7fafa840d570>)
File "/usr/lib/python3.10/concurrent/futures/_base.py", line 451, in result
return self.__get_result()
File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
raise self._exception
File "/home/ahmed/levanter/src/levanter/store/cache.py", line 784, in finished_sentinel
await self._finished_promise
File "/home/ahmed/levanter/src/levanter/utils/ray_utils.py", line 91, in log_failures_to
yield
File "/home/ahmed/levanter/src/levanter/store/cache.py", line 288, in batch_finished
self._attempt_to_write_batches()
File "/home/ahmed/levanter/src/levanter/store/cache.py", line 353, in _attempt_to_write_batches
_serialize_json_and_commit(os.path.join(self.cache_dir, LEDGER_FILE_NAME), self._ledger)
File "/home/ahmed/levanter/src/levanter/store/cache.py", line 630, in _serialize_json_and_commit
fs.rename(f"{path}.tmp", path)
File "/home/ahmed/venv310/lib/python3.10/site-packages/fsspec/spec.py", line 1613, in rename
return self.mv(path1, path2, **kwargs)
File "/home/ahmed/venv310/lib/python3.10/site-packages/fsspec/spec.py", line 1186, in mv
self.copy(
File "/home/ahmed/venv310/lib/python3.10/site-packages/fsspec/asyn.py", line 118, in wrapper
return sync(self.loop, func, *args, **kwargs)
File "/home/ahmed/venv310/lib/python3.10/site-packages/fsspec/asyn.py", line 103, in sync
raise return_result
File "/home/ahmed/venv310/lib/python3.10/site-packages/fsspec/asyn.py", line 56, in _runner
result[0] = await coro
File "/home/ahmed/venv310/lib/python3.10/site-packages/fsspec/asyn.py", line 405, in _copy
raise ex
File "/home/ahmed/venv310/lib/python3.10/site-packages/fsspec/asyn.py", line 245, in _run_coro
return await asyncio.wait_for(coro, timeout=timeout), i
File "/usr/lib/python3.10/asyncio/tasks.py", line 408, in wait_for
return await fut
File "/home/ahmed/venv310/lib/python3.10/site-packages/gcsfs/core.py", line 1161, in _cp_file
out = await self._call(
File "/home/ahmed/venv310/lib/python3.10/site-packages/gcsfs/core.py", line 447, in _call
status, headers, info, contents = await self._request(
File "/home/ahmed/venv310/lib/python3.10/site-packages/decorator.py", line 221, in fun
return await caller(func, *(extras + args), **kw)
File "/home/ahmed/venv310/lib/python3.10/site-packages/gcsfs/retry.py", line 126, in retry_request
return await func(*args, **kwargs)
File "/home/ahmed/venv310/lib/python3.10/site-packages/gcsfs/core.py", line 440, in _request
validate_response(status, contents, path, args)
File "/home/ahmed/venv310/lib/python3.10/site-packages/gcsfs/retry.py", line 95, in validate_response
raise FileNotFoundError(path)
FileNotFoundError: b/marin-data/o/tokenized%2FOLMo-1B%2Fdolma-v1.7%2Fpaloma%2F4chan%2Fvalidation%2Fshard_ledger.json.tmp/rewriteTo/b/marin-data/o/tokenized%2FOLMo-1B%2Fdolma-v1.7%2Fpaloma%2F4chan%2Fvalidation%2Fshard_ledger.json
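For context, the failing frame is the ledger commit, which writes the ledger to "<path>.tmp" and then renames it into place (the _serialize_json_and_commit call in the trace). Below is a minimal, illustrative sketch of that write-then-rename pattern using fsspec; it is not Levanter's actual code, and the comment about concurrent committers is an assumption about why the .tmp object might already be gone on a many-worker run, not a confirmed diagnosis.

```python
# Illustrative sketch of the commit pattern seen in the trace above; not Levanter's code.
import json

from fsspec.core import url_to_fs


def serialize_json_and_commit(path: str, obj) -> None:
    fs, _ = url_to_fs(path)
    tmp = f"{path}.tmp"
    # Write the full JSON payload to a temporary object first...
    with fs.open(tmp, "w") as f:
        f.write(json.dumps(obj))
    # ...then rename it over the final path. On GCS, rename is implemented as
    # copy + delete, not an atomic move. Assumption: if more than one process
    # commits the same ledger at once (e.g. many TPU workers), the .tmp object
    # can be deleted out from under this call, and gcsfs surfaces the
    # FileNotFoundError shown above.
    fs.rename(tmp, path)
```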
Note that the trace above complains about a validation file not being found. However, if you run on a single-worker v4-8, the actual error is clear:
(_TreeStoreCacheBuilder pid=105678) 2024-09-20 19:18:37,438 - levanter.store.cache - INFO - Finalizing cache gs://marin-data/tokenized/OLMo-1B/dolma-v1.7/paloma/dolma_100_subreddits/validation...
Traceback (most recent call last):
File "/home/ahmed/levanter/src/levanter/main/train_lm.py", line 223, in<module>levanter.config.main(main)()
File "/home/ahmed/levanter/src/levanter/config.py", line 84, in wrapper_inner
response = fn(cfg, *args, **kwargs)
File "/home/ahmed/levanter/src/levanter/main/train_lm.py", line 121, in main
config.data.train_set(Pos.size, key=data_key), Pos, KeyPos, ignore_index=config.data.ignore_token_id
File "/home/ahmed/levanter/src/levanter/data/text.py", line 744, in train_set
doc_caches = self.build_caches("train", monitors=monitors)
File "/home/ahmed/levanter/src/levanter/data/text.py", line 816, in build_caches
cache = dataset.build_or_load_cache(split, monitors)
File "/home/ahmed/levanter/src/levanter/data/text.py", line 659, in build_or_load_cache
source = self.get_shard_source(split)
File "/home/ahmed/levanter/src/levanter/data/text.py", line 504, in get_shard_source
return TextUrlDataSource(split_urls, self.text_key)
File "/home/ahmed/levanter/src/levanter/data/sharded_datasource.py", line 209, in __init__
self._shard_name_to_url_mapping = _mk_shard_name_mapping(urls)
File "/home/ahmed/levanter/src/levanter/data/sharded_datasource.py", line 449, in _mk_shard_name_mapping
raise FileNotFoundError(f"Could not find the following urls:\n - {missing_urls_str}")
FileNotFoundError: Could not find the following urls:
- gs://marin-data/raw/dolma/dolma-v1.7/algebraic-stack-train-0000.json.gz
- gs://marin-data/raw/dolma/dolma-v1.7/algebraic-stack-train-0001.json.gz
- gs://marin-data/raw/dolma/dolma-v1.7/algebraic-stack-train-0002.json.gz
- gs://marin-data/raw/dolma/dolma-v1.7/algebraic-stack-train-0003.json.gz
- gs://marin-data/raw/dolma/dolma-v1.7/algebraic-stack-train-0004.json.gz
- gs://marin-data/raw/dolma/dolma-v1.7/algebraic-stack-train-0005.json.gz
- gs://marin-data/raw/dolma/dolma-v1.7/algebraic-stack-train-0006.json.gz
- gs://marin-data/raw/dolma/dolma-v1.7/algebraic-stack-train-0007.json.gz
- gs://marin-data/raw/dolma/dolma-v1.7/algebraic-stack-train-0008.json.gz
- gs://marin-data/raw/dolma/dolma-v1.7/algebraic-stack-train-0009.json.gz
- gs://marin-data/raw/dolma/dolma-v1.7/algebraic-stack-train-0010.json.gz
- gs://marin-data/raw/dolma/dolma-v1.7/algebraic-stack-train-0011.json.gz
- gs://marin-data/raw/dolma/dolma-v1.7/algebraic-stack-train-0012.json.gz
- gs://marin-data/raw/dolma/dolma-v1.7/algebraic-stack-train-0013.json.gz
- gs://marin-data/raw/dolma/dolma-v1.7/algebraic-stack-train-0014.json.gz
- gs://marin-data/raw/dolma/dolma-v1.7/algebraic-stack-train-0015.json.gz
The above error occurred because the files were not at the expected location at all, so there was nothing to tokenize!
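As a quick, illustrative sanity check (assuming gcsfs is installed), something like this verifies that the raw shard URLs from the log actually exist before launching tokenization:

```python
# Check that the raw Dolma shards listed in the trace above exist in GCS.
# The URL pattern and the 0000-0015 range are taken from the missing-file list.
import fsspec

urls = [
    f"gs://marin-data/raw/dolma/dolma-v1.7/algebraic-stack-train-{i:04d}.json.gz"
    for i in range(16)
]
fs = fsspec.filesystem("gs")

missing = [u for u in urls if not fs.exists(u)]
for u in missing:
    print(f"missing: {u}")
```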
Update: this issue persists even when the tokenized dataset is fine, so training runs and works on a v4-8 but fails on a v4-256. Not sure at what scale this becomes a problem.