
New Tokenization Pipeline Fails Silently #736

Open
ahmeda14960 opened this issue Sep 20, 2024 · 1 comment
@ahmeda14960
Contributor

There is a bug that occurs when we try to tokenize a new dataset using levanter on a large TPU slice with many workers: the actual failure is masked by an unrelated error.

For example, if I try to re-tokenize the dolma dataset on a v4-256, I get the following unrelated error claiming that validation cache files are missing, even though they never existed (those files should be re-created during tokenization after nuking the prior tokenized dataset).

Log trace from the v4-256 run:

Traceback (most recent call last):
  File "/home/ahmed/levanter/src/levanter/main/train_lm.py", line 223, in <module>
    levanter.config.main(main)()
  File "/home/ahmed/levanter/src/levanter/config.py", line 84, in wrapper_inner
    response = fn(cfg, *args, **kwargs)
  File "/home/ahmed/levanter/src/levanter/main/train_lm.py", line 119, in main
    tagged_eval_datasets: list = config.data.tagged_eval_sets(Pos.size)
  File "/home/ahmed/levanter/src/levanter/data/text.py", line 584, in tagged_eval_sets
    eval_sets = self.validation_sets(seq_len, monitors)
  File "/home/ahmed/levanter/src/levanter/data/text.py", line 789, in validation_sets
    doc_caches = self.build_caches("validation", monitors=monitors)
  File "/home/ahmed/levanter/src/levanter/data/text.py", line 827, in build_caches
    cache.await_finished()
  File "/home/ahmed/levanter/src/levanter/store/cache.py", line 1081, in await_finished
    x = ray.get(self.finished_sentinel(), timeout=timeout)
  File "/home/ahmed/venv310/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/home/ahmed/venv310/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/home/ahmed/venv310/lib/python3.10/site-packages/ray/_private/worker.py", line 2664, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/home/ahmed/venv310/lib/python3.10/site-packages/ray/_private/worker.py", line 871, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(FileNotFoundError): [36mray::_TreeStoreCacheBuilder.finished_sentinel()[39m (pid=38797, ip=10.130.15.218, actor_id=0335e6394f02e4081441cf7f01000000, repr=<levanter.store.cache._TreeStoreCacheBuilder object at 0x7fafa840d570>)
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/home/ahmed/levanter/src/levanter/store/cache.py", line 784, in finished_sentinel
    await self._finished_promise
  File "/home/ahmed/levanter/src/levanter/utils/ray_utils.py", line 91, in log_failures_to
    yield
  File "/home/ahmed/levanter/src/levanter/store/cache.py", line 288, in batch_finished
    self._attempt_to_write_batches()
  File "/home/ahmed/levanter/src/levanter/store/cache.py", line 353, in _attempt_to_write_batches
    _serialize_json_and_commit(os.path.join(self.cache_dir, LEDGER_FILE_NAME), self._ledger)
  File "/home/ahmed/levanter/src/levanter/store/cache.py", line 630, in _serialize_json_and_commit
    fs.rename(f"{path}.tmp", path)
  File "/home/ahmed/venv310/lib/python3.10/site-packages/fsspec/spec.py", line 1613, in rename
    return self.mv(path1, path2, **kwargs)
  File "/home/ahmed/venv310/lib/python3.10/site-packages/fsspec/spec.py", line 1186, in mv
    self.copy(
  File "/home/ahmed/venv310/lib/python3.10/site-packages/fsspec/asyn.py", line 118, in wrapper
    return sync(self.loop, func, *args, **kwargs)
  File "/home/ahmed/venv310/lib/python3.10/site-packages/fsspec/asyn.py", line 103, in sync
    raise return_result
  File "/home/ahmed/venv310/lib/python3.10/site-packages/fsspec/asyn.py", line 56, in _runner
    result[0] = await coro
  File "/home/ahmed/venv310/lib/python3.10/site-packages/fsspec/asyn.py", line 405, in _copy
    raise ex
  File "/home/ahmed/venv310/lib/python3.10/site-packages/fsspec/asyn.py", line 245, in _run_coro
    return await asyncio.wait_for(coro, timeout=timeout), i
  File "/usr/lib/python3.10/asyncio/tasks.py", line 408, in wait_for
    return await fut
  File "/home/ahmed/venv310/lib/python3.10/site-packages/gcsfs/core.py", line 1161, in _cp_file
    out = await self._call(
  File "/home/ahmed/venv310/lib/python3.10/site-packages/gcsfs/core.py", line 447, in _call
    status, headers, info, contents = await self._request(
  File "/home/ahmed/venv310/lib/python3.10/site-packages/decorator.py", line 221, in fun
    return await caller(func, *(extras + args), **kw)
  File "/home/ahmed/venv310/lib/python3.10/site-packages/gcsfs/retry.py", line 126, in retry_request
    return await func(*args, **kwargs)
  File "/home/ahmed/venv310/lib/python3.10/site-packages/gcsfs/core.py", line 440, in _request
    validate_response(status, contents, path, args)
  File "/home/ahmed/venv310/lib/python3.10/site-packages/gcsfs/retry.py", line 95, in validate_response
    raise FileNotFoundError(path)
FileNotFoundError: b/marin-data/o/tokenized%2FOLMo-1B%2Fdolma-v1.7%2Fpaloma%2F4chan%2Fvalidation%2Fshard_ledger.json.tmp/rewriteTo/b/marin-data/o/tokenized%2FOLMo-1B%2Fdolma-v1.7%2Fpaloma%2F4chan%2Fvalidation%2Fshard_ledger.json

Note that the trace above complains about a validation file not being found. However, if you run on a single-worker v4-8, the error is actually clear:

TreeStoreCacheBuilder pid=105678) 2024-09-20 19:18:37,438 - levanter.store.cache - INFO - Finalizing cache gs://marin-data/tokenized/OLMo-1B/dolma-v1.7/paloma/dolma_100_subreddits/validation...
Traceback (most recent call last):
  File "/home/ahmed/levanter/src/levanter/main/train_lm.py", line 223, in <module>
    levanter.config.main(main)()
  File "/home/ahmed/levanter/src/levanter/config.py", line 84, in wrapper_inner
    response = fn(cfg, *args, **kwargs)
  File "/home/ahmed/levanter/src/levanter/main/train_lm.py", line 121, in main
    config.data.train_set(Pos.size, key=data_key), Pos, KeyPos, ignore_index=config.data.ignore_token_id
  File "/home/ahmed/levanter/src/levanter/data/text.py", line 744, in train_set
    doc_caches = self.build_caches("train", monitors=monitors)
  File "/home/ahmed/levanter/src/levanter/data/text.py", line 816, in build_caches
    cache = dataset.build_or_load_cache(split, monitors)
  File "/home/ahmed/levanter/src/levanter/data/text.py", line 659, in build_or_load_cache
    source = self.get_shard_source(split)
  File "/home/ahmed/levanter/src/levanter/data/text.py", line 504, in get_shard_source
    return TextUrlDataSource(split_urls, self.text_key)
  File "/home/ahmed/levanter/src/levanter/data/sharded_datasource.py", line 209, in __init__
    self._shard_name_to_url_mapping = _mk_shard_name_mapping(urls)
  File "/home/ahmed/levanter/src/levanter/data/sharded_datasource.py", line 449, in _mk_shard_name_mapping
    raise FileNotFoundError(f"Could not find the following urls:\n  - {missing_urls_str}")
FileNotFoundError: Could not find the following urls:
  - gs://marin-data/raw/dolma/dolma-v1.7/algebraic-stack-train-0000.json.gz
  - gs://marin-data/raw/dolma/dolma-v1.7/algebraic-stack-train-0001.json.gz
  - gs://marin-data/raw/dolma/dolma-v1.7/algebraic-stack-train-0002.json.gz
  - gs://marin-data/raw/dolma/dolma-v1.7/algebraic-stack-train-0003.json.gz
  - gs://marin-data/raw/dolma/dolma-v1.7/algebraic-stack-train-0004.json.gz
  - gs://marin-data/raw/dolma/dolma-v1.7/algebraic-stack-train-0005.json.gz
  - gs://marin-data/raw/dolma/dolma-v1.7/algebraic-stack-train-0006.json.gz
  - gs://marin-data/raw/dolma/dolma-v1.7/algebraic-stack-train-0007.json.gz
  - gs://marin-data/raw/dolma/dolma-v1.7/algebraic-stack-train-0008.json.gz
  - gs://marin-data/raw/dolma/dolma-v1.7/algebraic-stack-train-0009.json.gz
  - gs://marin-data/raw/dolma/dolma-v1.7/algebraic-stack-train-0010.json.gz
  - gs://marin-data/raw/dolma/dolma-v1.7/algebraic-stack-train-0011.json.gz
  - gs://marin-data/raw/dolma/dolma-v1.7/algebraic-stack-train-0012.json.gz
  - gs://marin-data/raw/dolma/dolma-v1.7/algebraic-stack-train-0013.json.gz
  - gs://marin-data/raw/dolma/dolma-v1.7/algebraic-stack-train-0014.json.gz
  - gs://marin-data/raw/dolma/dolma-v1.7/algebraic-stack-train-0015.json.gz

The above error occurred because the files to be tokenized were not at the expected location at all!
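One way to surface this failure earlier would be to verify that every shard URL resolves before kicking off tokenization, so the run fails fast with the real list of missing files instead of a misleading ledger error. A minimal sketch (this is an illustration, not levanter's actual `_mk_shard_name_mapping`; `check_urls_exist` is a hypothetical helper):

```python
import fsspec


def check_urls_exist(urls):
    """Raise FileNotFoundError listing every missing shard URL, if any."""
    missing = []
    for url in urls:
        # fsspec picks the right filesystem (gs://, s3://, local, ...) from the scheme
        fs, path = fsspec.core.url_to_fs(url)
        if not fs.exists(path):
            missing.append(url)
    if missing:
        missing_str = "\n  - ".join(missing)
        raise FileNotFoundError(f"Could not find the following urls:\n  - {missing_str}")
```

Calling something like this up front on the driver, before any Ray actors are spawned, would make the v4-256 run fail with the same clear message the v4-8 run produces.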

@ahmeda14960
Contributor Author

Update: this issue persists even when the tokenized dataset is fine. Training runs and works on a v4-8 but fails on a v4-256; I'm not sure at which scale this becomes a problem.
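A plausible culprit on many-worker slices is several workers racing to commit the same ledger: one worker renames the shared `<path>.tmp`, and the others then hit FileNotFoundError on a file that was just moved out from under them. A hedged sketch of a commit that sidesteps the shared tmp name (an illustration only, assuming fsspec semantics; this is not levanter's actual `_serialize_json_and_commit`):

```python
import json
import uuid

import fsspec


def serialize_json_and_commit(path: str, obj) -> None:
    """Atomically publish `obj` as JSON at `path`.

    Each caller writes to a unique temp name, so concurrent workers
    never rename the same .tmp file out from under each other.
    """
    fs, real_path = fsspec.core.url_to_fs(path)
    tmp_path = f"{real_path}.{uuid.uuid4().hex}.tmp"
    with fs.open(tmp_path, "w") as f:
        json.dump(obj, f)
    try:
        fs.rename(tmp_path, real_path)
    except FileNotFoundError:
        # If the rename still races (e.g. GCS copy+delete under the hood),
        # treat it as a benign lost race as long as someone committed.
        if not fs.exists(real_path):
            raise
```

This doesn't fix the underlying missing-input problem, but it would at least keep a ledger-commit race from masquerading as the top-level error.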
