Spark: add property to disable client-side purging in spark #11317
base: main
Conversation
I think the intent here is good, but I'm not sure the implementation is quite correct. My first issue is that the property probably needs a Spark-specific prefix or naming somewhere, but the bigger issue is that this flag does not actually guarantee that a server-side purge will occur when client-side purging is disabled. For many of our catalog implementations, the purge is still executed locally on the driver (see HiveCatalog). At best this would be a "request catalog purge" flag, which in some cases would be handled by the catalog and in other cases would not.
Also, this is placed within "Catalog Properties" but it is checked as a Table Property. I think Catalog is probably the correct place to put it, because it's really a client decision and not really intrinsic to the table.
I don't think these are deal breakers, though, and I think with proper naming we can include this.
Hi @RussellSpitzer, thanks for the quick feedback! I changed the property to be read from catalog properties. Regarding your other concern, I'm not sure I follow: in our use case, being a REST catalog, we'd like to set this flag to keep clients from deleting stuff, so we'd like it to not be a client decision. If the server sets this flag, it tells the client: I guarantee the purge. Does this perspective conflict with other catalogs? I thought that with the current proposal it would be up to the specific catalog to decide whether it or the respective client performs the deletes. As for the names, I'm happy to change them; names are hard. Kind regards, Tobias
The problem is that table properties will only be respected by clients which know how to use them, so although you may set this property, you have no guarantee clients will follow it. We can keep this a table property, but then it goes in the TableProperties file. The implementation problem is that we already have several catalog implementations which are essentially client-only and do not support a remote purge, for example HiveCatalog, GlueCatalog, and HadoopCatalog.
So this property is really about whether to use the "Catalog Purge" implementation, which may or may not be on the client.
That is clear; we can prevent non-conforming clients from deleting stuff by not signing the respective requests / handing out downscoped tokens without delete permissions. This would cause client-side errors, so eventually clients would have to conform if they want to talk to a catalog server with the UNDROP feature.
Sure, I made it a catalog property since I agree with you: it's a property of the catalog and will likely not change on a per-table basis. I had originally put it under table property since that's where the
Could we prefix the property with something like rest-catalog to make clear to any client that it won't have an effect against the other catalogs for which this flag has no meaning?
@@ -365,24 +368,35 @@ public boolean purgeTable(Identifier ident) {
    String metadataFileLocation =
        ((HasTableOperations) table).operations().current().metadataFileLocation();
I think it's probably simpler here to just do
// Remote purge and early exit
if (catalog.instanceOf[Rest...] && property == true) {
  remote drop with purge
  return
}
// Original code path
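A slightly more concrete (and purely illustrative) version of that early-exit idea; the delegatePurge field, the clientSidePurge helper, and the exact placement are assumptions, not the PR's actual code:
// Sketch only: assumes a 'delegatePurge' flag read from catalog properties during
// initialize(), and SparkCatalog's existing 'icebergCatalog' delegate.
@Override
public boolean purgeTable(Identifier ident) {
  // Remote purge and early exit: forward purge=true and let the REST catalog handle it.
  if (delegatePurge && icebergCatalog instanceof RESTCatalog) {
    return icebergCatalog.dropTable(Spark3Util.identifierToTableIdentifier(ident), true /* purge */);
  }

  // Original code path: drop the table and delete its files from the client side.
  return clientSidePurge(ident); // hypothetical helper standing in for the existing logic
}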
 *
 * <p>When set to false, the client will not
 */
public static final String IO_CLIENT_SIDE_PURGE_ENABLED = "io.client-side.purge-enabled";
possibly rename
/**
 * Controls whether engines using a REST Catalog should delegate drop-table purge requests to the Catalog.
 * Defaults to false, allowing the engine to use its own implementation for purging.
 */
public static final String REST_PURGE = "rest-purge";
public static final boolean REST_PURGE_DEFAULT = false;
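If this ends up as a catalog property, reading it would presumably happen once in SparkCatalog.initialize(); a minimal sketch, assuming the REST_PURGE names above and a hypothetical delegatePurge field (PropertyUtil.propertyAsBoolean is Iceberg's existing helper):
@Override
public void initialize(String name, CaseInsensitiveStringMap options) {
  // ... existing initialization ...
  // read the proposed catalog property once; 'delegatePurge' is a hypothetical field
  this.delegatePurge = PropertyUtil.propertyAsBoolean(options, REST_PURGE, REST_PURGE_DEFAULT);
}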
import org.junit.jupiter.api.TestTemplate;
import org.junit.jupiter.api.extension.ExtendWith;

@ExtendWith(ParameterizedTestExtension.class)
I think this should probably still be a part of the other test classes, or a new file called TestSparkCatalogREST
On a higher level, looking at this, I think we should just add a new method to SparkCatalog:
/**
 * Reset the delegate Iceberg Catalog to a test mock Catalog object.
 */
@VisibleForTesting
void injectCatalog(Catalog icebergCatalog) {
  this.icebergCatalog = icebergCatalog;
}
Which we can use to just inject a mock Catalog into the Spark environment, so the test code would look something like this.
Pseudocode follows:
// Spark config here
RESTCatalog mockCatalog = mock(RESTCatalog.class);
Spark3Util.loadCatalog("name").injectCatalog(mockCatalog);
sql("DROP TABLE ... PURGE");
verify(mockCatalog).dropTable(ident, /* purge */ true);
@nastra and @amogh-jahagirdar, I know y'all have written a bunch of REST testing; how does that sound to you?
I am wondering whether we should add a parameter to the normal test code that uses Hive and Hadoop, and add in a REST variant that injects like ^ ... but that may be overkill.
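A slightly fleshed-out version of that pseudocode, assuming the hypothetical injectCatalog hook above plus Mockito; the catalog name and lookup helper are also just illustrative:
// 'rest' catalog name, lookupSparkCatalog(), and injectCatalog() are assumptions for this sketch.
RESTCatalog mockCatalog = Mockito.mock(RESTCatalog.class);
Mockito.when(mockCatalog.dropTable(Mockito.any(TableIdentifier.class), Mockito.anyBoolean()))
    .thenReturn(true);

SparkCatalog sparkCatalog = lookupSparkCatalog("rest"); // hypothetical lookup of the plugin
sparkCatalog.injectCatalog(mockCatalog);

sql("DROP TABLE rest.db.tbl PURGE");

// The purge flag should have been forwarded to the catalog rather than handled client-side.
Mockito.verify(mockCatalog).dropTable(Mockito.any(TableIdentifier.class), Mockito.eq(true));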
I don't think we need any of the injection logic. We should rather wait for #11093 to get in, which would enable much easier testing with Spark + REST catalog
@@ -60,6 +60,10 @@ public static Catalog wrap(
    return new CachingCatalog(catalog, caseSensitive, expirationIntervalMillis);
  }

  public boolean wrapped_is_instance(Class<?> cls) {
this can be removed once #11093 gets in
This is used for checking the type of the catalog impl that CachingCatalog is wrapping, so that we can correctly respect the purge property at https://github.com/apache/iceberg/pull/11317/files#diff-bd61838d4e3a9aef52a670696750b30deac10b90f526435e32beacb0107eea24R376. Unless CachingCatalog is only used in tests, I think we need to keep this method?
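For context, the helper appears to exist so that a check along these lines still works when the delegate is wrapped; a hypothetical reconstruction, not the PR's exact code:
// Does the SparkCatalog's delegate ultimately sit on top of a RESTCatalog,
// even when it is wrapped in a CachingCatalog?
boolean restDelegate =
    icebergCatalog instanceof RESTCatalog
        || (icebergCatalog instanceof CachingCatalog
            && ((CachingCatalog) icebergCatalog).wrapped_is_instance(RESTCatalog.class));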
closes #11023
This PR adds a new table property `io.client-side.purge-enabled`. It allows a catalog to control Spark's behavior on `DROP .. PURGE`. This, in turn, enables REST catalogs to offer an UNDROP feature for all storage providers. Currently, this is only possible with S3 due to the `s3.delete-enabled` property.
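For reference, one way the flag might be wired up from the engine side; the catalog name, URI, and final property name/placement here are assumptions, not something the PR prescribes:
// Sketch only: assumes a REST catalog registered under the name "rest".
SparkSession spark =
    SparkSession.builder()
        .master("local[2]")
        .config("spark.sql.catalog.rest", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.rest.type", "rest")
        .config("spark.sql.catalog.rest.uri", "https://example.com/api/catalog")
        // when false, Spark would skip its own purge and delegate PURGE to the catalog
        .config("spark.sql.catalog.rest.io.client-side.purge-enabled", "false")
        .getOrCreate();

// The catalog, not the driver, would then be responsible for honoring PURGE,
// which is what makes a server-side UNDROP feature possible.
spark.sql("DROP TABLE rest.db.tbl PURGE");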