Add BalancedRandomGeoSampler balancing positives and negatives #1883
Comments
A few comments on this.
Thanks for your feedback. Indeed, my initial thoughts have a reduced scope to reduce complexity while figuring it out. I do think a general approach for 2+ classes and even … Also, as I currently have a use case for …, I agree that we should draw inspiration from their API and get to a similar usage.

There is one loop like before, where my initial points 1 and 2 would be inside the loop for …

Here we have a misunderstanding. No more pixels (raster data) will be loaded than today. My suggested implementation is based on spatial operations on vector data (not rasterised masks), so the rasters will be loaded the same way as today, and only after the choice of the sample. I will try to illustrate it in the example below.

It is possible. But for sparse positives, e.g. when the label vector file represents a diagonal river across the image, the IntersectionDataset …

Here are the main parts of our implementation. It is tested and works, but I did not include the changes in the init methods of the Datasets. This version has one difference from my initial description: points 2 and 4 are not implemented. Instead of creating a grid and assigning positive and negative contents to its cells, it randomly generates bounds for the new sample, checks whether they contain labels (a vector operation), and tries again if they do not (like in #1881). Whether to yield a positive or negative sample is determined by keeping track of how many of each has been yielded so far and comparing it with the user-specified pos_neg_frac.

class IntersectionDataset(GeoDataset):
    def _merge_dataset_indices(self) -> None:
        """Create a new R-tree out of the individual indices from two datasets."""
        i = 0
        ds1, ds2 = self.datasets
        # Assuming ds1 is the RasterDataset
        for hit1 in ds1.index.intersection(ds1.index.bounds, objects=True):
            # Assuming ds2 is the (positive) label dataset (VectorDataset or RasterDataset)
            for hit2 in ds2.index.intersection(hit1.bounds, objects=True):
                box1 = BoundingBox(*hit1.bounds)
                box2 = BoundingBox(*hit2.bounds)
                box_intersection = box1 & box2
                # Create a vector geometry of these intersection bounds
                bounds = shapely.geometry.box(
                    minx=box_intersection.minx,
                    miny=box_intersection.miny,
                    maxx=box_intersection.maxx,
                    maxy=box_intersection.maxy,
                )
                # Raster files may contain nodata. `valid_footprint` has been calculated
                # for each file and added to the rtree index of the RasterDataset.
                # See PR #1881 for discarding nodata samples.
                # __getitem__ already reads across all files overlapping the chosen
                # sample, so we need to collect valid footprints for all of them.
                # For these bounds, gather the valid_footprints of all (raster) files,
                # as the current hit's file may overlap coverage from other files.
                ds1_footprints = [
                    other.object["valid_footprint"]
                    for other in ds1.index.intersection(
                        tuple(box_intersection), objects=True
                    )
                ]
                # Merge these footprints into one vector feature (MultiPolygon)
                ds1_footprint = shapely.unary_union(ds1_footprints)
                # Crop it to the bounds of the rtree-index hit
                ds1_footprint_in_bounds = shapely.intersection(ds1_footprint, bounds)
                # Do the same for the other dataset.
                # For a VectorDataset these are the actual vector features (not
                # rasterized) representing the class coverage. Let's call this
                # valid_footprint too. It has likewise been added to the rtree index
                # in the VectorDataset init. As above, multiple files may be covered
                # by these bounds, so we need to merge their footprints.
                ds2_footprints = [
                    other.object["valid_footprint"]
                    for other in ds2.index.intersection(
                        tuple(box_intersection), objects=True
                    )
                ]
                ds2_footprint = shapely.unary_union(ds2_footprints)
                ds2_footprint_in_bounds = shapely.intersection(ds2_footprint, bounds)
                # We made the choice to only include bounds where both datasets have
                # coverage, effectively disregarding negative-only rtree entries.
                if ds1_footprint_in_bounds.is_empty or ds2_footprint_in_bounds.is_empty:
                    continue
                # The intersection of the datasets' footprints gives us the polygon
                # of the positive regions
                positive_polygon = shapely.intersection(
                    ds1_footprint_in_bounds, ds2_footprint_in_bounds
                )
                # The difference gives us the polygon of the negative regions,
                # assuming ds1 is the raster and ds2 the labels.
                negative_polygon = ds1_footprint_in_bounds - positive_polygon
                # Add these to the new index
                self.index.insert(
                    i,
                    tuple(box_intersection),
                    {
                        "positive_polygon": positive_polygon,
                        "negative_polygon": negative_polygon,
                    },
                )
                i += 1
        if i == 0:
            raise RuntimeError("Datasets have no spatiotemporal intersection")
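The footprint arithmetic in _merge_dataset_indices can be checked in isolation. A minimal sketch with plain shapely (the box coordinates are invented; they stand in for a raster's valid footprint and a label footprint):

```python
import shapely
from shapely.geometry import box

# Hypothetical valid footprint of a raster tile (after nodata removal)
raster_footprint = box(0, 0, 10, 10)
# Hypothetical label coverage from the vector dataset
label_footprint = box(5, 5, 15, 15)

# Positive region: where raster coverage and labels overlap
positive_polygon = shapely.intersection(raster_footprint, label_footprint)
# Negative region: valid raster pixels with no label coverage
negative_polygon = raster_footprint - positive_polygon

print(positive_polygon.area)  # 25.0
print(negative_polygon.area)  # 75.0
```

The two polygons tile the valid raster footprint exactly, which is what lets the sampler later target one or the other without reading any pixel data.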
class RandomGeoSampler(GeoSampler):
    def __init__(
        self,
        dataset: GeoDataset,
        size: Union[tuple[float, float], float],
        length: Optional[int],
        roi: Optional[BoundingBox] = None,
        units: Units = Units.PIXELS,
        exclude_nodata_samples: bool = False,
        max_retries: int = 50_000,
        pos_neg_frac: float = 0.5,
    ) -> None:
        """
        Args:
            exclude_nodata_samples: ensures that samples do not fall outside the
                footprint of valid pixels. No-data regions may occur due to
                re-projection or inherent no-data regions in rasters.
            max_retries: used when exclude_nodata_samples is True; a safeguard
                against infinite loops in case the nodata mask of the raster is wrong.
            pos_neg_frac: fraction of positive samples
        """
        super().__init__(dataset, roi)
        self.size = _to_tuple(size)
        self.exclude_nodata_samples = exclude_nodata_samples
        self.max_retries = max_retries
        self.pos_neg_frac = pos_neg_frac
        self.pos_sampled = 0
        self.neg_sampled = 0
        # Rest of __init__ as-is

    def __iter__(self) -> Iterator[BoundingBox]:
        """Return the index of a dataset.

        Returns:
            (minx, maxx, miny, maxy, mint, maxt) coordinates to index a dataset
        """
        for _ in range(len(self)):
            # Choose a random tile, weighted by area
            idx = torch.multinomial(self.areas, 1)
            hit = self.hits[idx]
            bounds = BoundingBox(*hit.bounds)
            # New code, balancing positive and negative samples
            # by checking that they overlap with the corresponding polygons.
            # Counts how many of each have been sampled already,
            # and decides what should be sampled next.
            if (
                self.pos_sampled == 0
                or (self.pos_sampled / (self.pos_sampled + self.neg_sampled))
                < self.pos_neg_frac
            ):
                footprint = hit.object["positive_polygon"]
                spatial_operator = shapely.overlaps  # ensures some positives
                self.pos_sampled += 1
            else:
                footprint = hit.object["negative_polygon"]
                spatial_operator = shapely.within  # ensures no positives
                self.neg_sampled += 1
            # Method implemented in #1881
            # https://github.com/microsoft/torchgeo/blob/bbb21d5adb91c5e9fabf1efc83507515532f94b9/torchgeo/samplers/utils.py#L81
            yield get_random_bounding_box_check_valid_overlap(
                bounds=bounds,
                size=self.size,
                res=self.res,
                valid_footprint=footprint,
                max_retries=self.max_retries,
                spatial_operator=spatial_operator,
            )
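The yield-balancing rule in __iter__ can be factored out and checked on its own. A sketch in plain Python (BalanceCounter is my name for it, not part of the proposed PR):

```python
class BalanceCounter:
    """Tracks yielded samples and decides if the next one should be positive."""

    def __init__(self, pos_neg_frac: float = 0.5) -> None:
        self.pos_neg_frac = pos_neg_frac
        self.pos_sampled = 0
        self.neg_sampled = 0

    def next_is_positive(self) -> bool:
        # Same rule as in __iter__: always start with a positive, then draw a
        # positive whenever the observed fraction has fallen below pos_neg_frac.
        positive = (
            self.pos_sampled == 0
            or self.pos_sampled / (self.pos_sampled + self.neg_sampled)
            < self.pos_neg_frac
        )
        if positive:
            self.pos_sampled += 1
        else:
            self.neg_sampled += 1
        return positive


counter = BalanceCounter(pos_neg_frac=0.25)
draws = [counter.next_is_positive() for _ in range(100)]
print(sum(draws))  # 25 positives out of 100 draws
```

Because each decision compares against the running fraction, the positive share self-corrects towards pos_neg_frac regardless of how long the sampler runs.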
Summary
Lets the user balance positives and negatives, sampled from an IntersectionDataset of a RasterDataset and a VectorDataset.
Rationale
May help reduce bias and increase generalisation. The RandomGeoSampler currently does not support this feature, and you may end up with heavily unbalanced datasets.
Implementation
In def __iter__(), for each of the length samples: … Should somehow cache the resulting grids for the next time the same hit is randomly chosen, or pre-compute these.
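The caching mentioned above could be as simple as memoising the grid computation per hit, e.g. with functools.lru_cache. compute_grid and its hit_id key are hypothetical placeholders, not existing torchgeo code:

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def compute_grid(hit_id: int) -> tuple[str, int]:
    # Placeholder for the expensive part: splitting the hit's bounds into
    # patch-sized cells and labelling each cell positive or negative.
    return ("grid-for-hit", hit_id)


compute_grid(0)  # computed on first use
compute_grid(0)  # second call is served from the cache
```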
One thought is to extend the rtree index with these grid cells, even extending the index dimensions with the label (class = positive or negative). The query could then become
(minx, maxx, miny, maxy, mint, maxt, class)
in some order. (But this would require the GeoDataset to know the desired patch_size though...)

Alternatives
One good alternative/evolution that I see is replacing the rtree index created in IntersectionDataset. Instead of the current rtree index, add the vectorized raster footprints and/or vector data (actual features/shapes from e.g. shapefiles) to GeoPandas GeoDataFrames, and use their rtree implementation to rapidly find areas where they intersect. The temporal dimension would not be part of the index (not supported), but can be added as a filter once spatial matches are found. At the end of this article they split their polygons [raster footprints] into small grid cells [of patch size] and can rapidly retrieve cells [samples] where there is overlap with the other dataset [labels], or no overlap. The underlying rtree indices in RasterDatasets and VectorDatasets would remain the same and still be used to read the data upon sampling.
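The grid-cell splitting described at the end of that paragraph can be sketched with shapely's own STRtree (GeoPandas' spatial index wraps the same structure). This assumes shapely >= 2.0, where query returns integer indices; the river geometry and cell size are invented for illustration:

```python
from shapely import STRtree
from shapely.geometry import LineString, box

# Hypothetical sparse label: a thin diagonal river across a 10 x 10 tile
river = LineString([(0, 0), (10, 10)]).buffer(0.3)

# Split the tile footprint into patch-sized grid cells
cell = 2
cells = [
    box(x, y, x + cell, y + cell)
    for x in range(0, 10, cell)
    for y in range(0, 10, cell)
]

# One bulk rtree query finds candidate cells; a real intersection test
# confirms them (the tree itself only compares bounding boxes).
tree = STRtree(cells)
candidates = tree.query(river)
positive = {int(i) for i in candidates if cells[i].intersects(river)}
negative = set(range(len(cells))) - positive

print(len(positive), len(negative))
```

A balanced sampler could then draw cell indices from positive or negative directly, instead of rejection-sampling random bounding boxes.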
Additional information
Probably depends on #1881 being merged.
EDIT: add alternative