[DataPipe] extract keys #406
@@ -0,0 +1,59 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree.

from fnmatch import fnmatch
from typing import Dict, Iterator, Tuple

from torchdata.datapipes import functional_datapipe
from torchdata.datapipes.iter import IterDataPipe


@functional_datapipe("extract_keys")
class ExtractKeysIterDataPipe(IterDataPipe[Dict]):
Review comment: Can we rename this to […]? We can still keep […]
    r"""
    Given a stream of dictionaries, return a stream of tuples by selecting keys using glob patterns.
Review comment: Suggested change

    Args:
        source_datapipe: a DataPipe yielding a stream of dictionaries.
        duplicate_is_error: it is an error if the same key is selected twice (True)
        ignore_missing: skip any dictionaries where one or more patterns don't match (False)
Review comment on lines +21 to +22: Suggested change. Duplicate lines of descriptions.
        *args: glob patterns or lists of glob patterns

    Returns:
        a DataPipe yielding a stream of tuples

    Examples:
        >>> dp = FileLister(...).load_from_tar().webdataset().decode(...).extract_keys(["*.jpg", "*.png"], "*.gt.txt")
    """
Review comment on lines +32 to +33: In addition to the one example with webdataset, please add an example with sample outputs here. Copying from the test cases is totally fine with me.
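To illustrate the kind of example with sample outputs requested above, here is a minimal sketch. It assumes a plain-Python generator, `extract_keys`, that mirrors the `__iter__` logic shown in this diff without depending on torchdata. The function and the sample dictionaries are hypothetical stand-ins, not the PR's actual test cases.

```python
from fnmatch import fnmatch


def extract_keys(samples, *patterns, duplicate_is_error=True, ignore_missing=False):
    """Hypothetical plain-Python stand-in mirroring the __iter__ logic in this diff."""
    for sample in samples:
        result = []
        for pattern in patterns:
            # Each positional argument is one glob pattern or a list of alternatives.
            group = pattern if isinstance(pattern, (list, tuple)) else [pattern]
            matches = [k for k in sample if any(fnmatch(k, p) for p in group)]
            if not matches:
                if ignore_missing:
                    continue
                raise ValueError(f"Cannot find {group} in sample keys {list(sample)}.")
            if len(matches) > 1 and duplicate_is_error:
                raise ValueError(f"Multiple sample keys {list(sample)} match {group}.")
            result.append(sample[matches[0]])
        yield tuple(result)


# Made-up decoded samples; keys and values are for illustration only.
samples = [
    {"__key__": "a", "a.jpg": "image-a", "a.gt.txt": "label-a"},
    {"__key__": "b", "b.png": "image-b", "b.gt.txt": "label-b"},
]

extracted = list(extract_keys(samples, ["*.jpg", "*.png"], "*.gt.txt"))
print(extracted)
# [('image-a', 'label-a'), ('image-b', 'label-b')]
```

Each positional argument contributes one element to the output tuple, so the first slot holds whichever image key matched and the second holds the ground-truth text.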
    def __init__(
        self, source_datapipe: IterDataPipe[Dict], *args, duplicate_is_error=True, ignore_missing=False
    ) -> None:
        super().__init__()
        self.source_datapipe: IterDataPipe[Dict] = source_datapipe
        self.duplicate_is_error = duplicate_is_error
        self.patterns = args
        self.ignore_missing = ignore_missing

    def __iter__(self) -> Iterator[Tuple]:
        for sample in self.source_datapipe:
            result = []
            for pattern in self.patterns:
                pattern = [pattern] if not isinstance(pattern, (list, tuple)) else pattern
                matches = [x for x in sample.keys() if any(fnmatch(x, p) for p in pattern)]
                if len(matches) == 0:
                    if self.ignore_missing:
                        continue
                    else:
                        raise ValueError(f"Cannot find {pattern} in sample keys {sample.keys()}.")
                if len(matches) > 1 and self.duplicate_is_error:
tmbdev marked this conversation as resolved.
                    raise ValueError(f"Multiple sample keys {sample.keys()} match {pattern}.")
                value = sample[matches[0]]
                result.append(value)
            yield tuple(result)

    def __len__(self) -> int:
        return len(self.source_datapipe)
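As a rough illustration of the matching step in `__iter__` above, the sketch below applies `fnmatch` to a set of made-up keys. `matching_keys` is a hypothetical helper that reproduces only the list comprehension from the diff. It shows why a broad pattern such as `*.txt` can select more than one key, which is the case `duplicate_is_error` guards against.

```python
from fnmatch import fnmatch

# Made-up sample keys for illustration.
keys = ["__key__", "a.jpg", "a.gt.txt", "a.txt"]


def matching_keys(keys, pattern):
    """Hypothetical helper: keys matched by one pattern or a list of alternatives."""
    group = pattern if isinstance(pattern, (list, tuple)) else [pattern]
    return [k for k in keys if any(fnmatch(k, p) for p in group)]


jpg_or_png = matching_keys(keys, ["*.jpg", "*.png"])
any_txt = matching_keys(keys, "*.txt")

print(jpg_or_png)  # ['a.jpg']
print(any_txt)     # ['a.gt.txt', 'a.txt']
```

With two matches for `*.txt`, the code in the diff raises when `duplicate_is_error` is true and otherwise takes the first match in the sample's key order.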
Review comment on lines +72 to +73: Question: A sample will always be yielded even if nothing matches, right?
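Reading the code as shown in this diff, the answer appears to be yes when `ignore_missing=True`: an unmatched pattern is skipped, but the sample itself still produces a tuple, possibly an empty one, so `__len__` stays equal to the source length. The sketch below checks this with a hypothetical `extract` function that mirrors the per-sample loop (duplicate checks omitted for brevity) and made-up data.

```python
from fnmatch import fnmatch


def extract(sample, patterns, ignore_missing):
    """Hypothetical mirror of the per-sample loop in the diff (no duplicate check)."""
    result = []
    for pattern in patterns:
        group = pattern if isinstance(pattern, (list, tuple)) else [pattern]
        matches = [k for k in sample if any(fnmatch(k, p) for p in group)]
        if not matches:
            if ignore_missing:
                continue  # drops this pattern only; the sample is still emitted
            raise ValueError(f"Cannot find {group} in sample keys {list(sample)}.")
        result.append(sample[matches[0]])
    return tuple(result)


sample = {"a.jpg": "image-a"}  # made-up sample with no ground-truth key

partial = extract(sample, ["*.jpg", "*.gt.txt"], ignore_missing=True)
empty = extract(sample, ["*.png", "*.gt.txt"], ignore_missing=True)

print(partial)  # ('image-a',)
print(empty)    # ()
```

Note that this differs from the docstring's wording for `ignore_missing`, which describes skipping the whole dictionary; in the code as written, only the unmatched pattern is skipped and the resulting tuple is shorter.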
Review comment: nit: We used to have a different extractor