[GOBBLIN-2159] Adding support for partition level copy in Iceberg distcp #4058

Open
wants to merge 17 commits into
base: master

Contributor

@Blazer-007 Blazer-007 commented Sep 22, 2024

Dear Gobblin maintainers,

Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!

JIRA

Description

  • [✅] Here are some details about my PR, including screenshots (if applicable):
    • Currently, Iceberg distcp cannot restrict a copy to specific partitions. This PR adds support for partition-level copy in Iceberg distcp.
    • It supports partition copy between two different Iceberg tables, that is, tables with different UUIDs.

Tests

  • [✅] My PR adds the following unit tests OR does not need testing for this extremely good reason:
    • IcebergPartitionDatasetTest
    • IcebergOverwritePartitionsStepTest
    • IcebergTableTest [ Updated ]
      - testGetPartitionSpecificDataFiles()
      - testReplacePartitions()
    • IcebergMatchesAnyPropNamePartitionFilterPredicateTest
    • IcebergPartitionFilterPredicateUtilTest

Commits

  • [✅] My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

@Blazer-007 Blazer-007 force-pushed the iceberg_distcp_partition_copy_0 branch from b4f6369 to d8356e1 on September 24, 2024 12:32
Contributor

@phet phet left a comment

this is a great start! mostly suggestions to leverage a bit more of the existing classes (rather than creating near clones) and also to simplify some interfaces (esp. for the partition filter predicates) to take in specific params, rather than Properties. given the latter may hold just about anything, the API "contract" they define is weaker than we'd want.
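
To make the point about API contracts concrete, here is a minimal sketch (not the PR's code) of a partition filter that takes explicit constructor parameters, with `Properties` parsed once at the edge. Every name in it is an assumption for illustration: `PartitionValueFilter`, `fromProperties`, and both property keys are hypothetical, and a plain `Map` of column name to value stands in for Iceberg's partition data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Properties;
import java.util.function.Predicate;

public class PartitionFilterSketch {

  /** Hypothetical filter: accepts a partition (column name -> value) whose named column holds an accepted value. */
  public static final class PartitionValueFilter implements Predicate<Map<String, String>> {
    private final String partitionColName;
    private final List<String> acceptedValues;

    // Explicit parameters: the signature states exactly what the filter depends on.
    public PartitionValueFilter(String partitionColName, List<String> acceptedValues) {
      this.partitionColName = Objects.requireNonNull(partitionColName, "partitionColName");
      this.acceptedValues = Collections.unmodifiableList(new ArrayList<>(acceptedValues));
    }

    @Override
    public boolean test(Map<String, String> partition) {
      String value = partition.get(partitionColName);
      return value != null && acceptedValues.contains(value);
    }
  }

  // Hypothetical property keys, used only by this sketch.
  public static final String COL_NAME_KEY = "sketch.partition.colName";
  public static final String VALUES_KEY = "sketch.partition.values";

  /** Parses Properties in one place and fails fast when a required key is missing. */
  public static PartitionValueFilter fromProperties(Properties props) {
    String colName = props.getProperty(COL_NAME_KEY);
    String values = props.getProperty(VALUES_KEY);
    if (colName == null || values == null) {
      throw new IllegalArgumentException("both " + COL_NAME_KEY + " and " + VALUES_KEY + " are required");
    }
    // Split a comma-separated list, tolerating whitespace around the commas.
    return new PartitionValueFilter(colName, Arrays.asList(values.trim().split("\\s*,\\s*")));
  }

  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty(COL_NAME_KEY, "datepartition");
    props.setProperty(VALUES_KEY, "2024-09-22, 2024-09-23");
    PartitionValueFilter filter = fromProperties(props);
    System.out.println(filter.test(Collections.singletonMap("datepartition", "2024-09-22")));
    System.out.println(filter.test(Collections.singletonMap("datepartition", "2024-09-24")));
  }
}
```

With this shape, a missing or misspelled key surfaces where the configuration is read, and the predicate itself can be unit-tested without building a `Properties` object.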

Comment on lines +119 to +123
    CopyableFile fileEntity = CopyableFile.fromOriginAndDestination(
            actualSourceFs, srcFileStatus, targetFs.makeQualified(destPath), copyConfig)
        .fileSet(fileSet)
        .datasetOutputPath(targetFs.getUri().getPath())
        .build();
Contributor

you skip doing this first, as IcebergDataset does:

      // preserving ancestor permissions till root path's child between src and dest
      List<OwnerAndPermission> ancestorOwnerAndPermissionList =
          CopyableFile.resolveReplicatedOwnerAndPermissionsRecursively(actualSourceFs,
              srcPath.getParent(), greatestAncestorPath, copyConfig);

is that intentional? do you feel it's not necessary or actually contra-indicated?

Contributor Author

In IcebergDataset, the table paths are exactly the same on source and destination because the table UUIDs are the same. Here they can differ, so I believe copying permissions is not necessary, at least in the first draft.

Even if it is needed, we would have to make sure the ancestor path and parent path are the ones we want, which is why I have removed it for now.

Comment on lines 130 to 133
    // Adding this check to avoid adding post publish step when there are no files to copy.
    if (CollectionUtils.isNotEmpty(destDataFiles)) {
      copyEntities.add(createPostPublishStep(destDataFiles));
    }
Contributor

I agree this is one difference with IcebergDataset::generateCopyEntities, which always wants to add its post-publish step. (but it shouldn't be hard to refactor to isolate this difference)
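
One way such a refactor could look, as a rough sketch under stated assumptions: the base class keeps the shared flow and exposes a small hook that decides whether a post-publish step is added. The class names, the `shouldAddPostPublishStep` hook, and the use of plain strings in place of `CopyEntity` are all hypothetical and are not taken from Gobblin.

```java
import java.util.ArrayList;
import java.util.List;

public class PostPublishHookSketch {

  /** Stand-in for the base dataset: shared flow, always adds its post-publish step. */
  public static class BaseDataset {
    public List<String> generateCopyEntities(List<String> filesToCopy) {
      List<String> copyEntities = new ArrayList<>(filesToCopy);
      // The only point where the subclasses differ is isolated behind two hooks.
      if (shouldAddPostPublishStep(filesToCopy)) {
        copyEntities.add(createPostPublishStep(filesToCopy));
      }
      return copyEntities;
    }

    /** Hypothetical hook: the base behavior is to always add the step. */
    protected boolean shouldAddPostPublishStep(List<String> filesToCopy) {
      return true;
    }

    protected String createPostPublishStep(List<String> filesToCopy) {
      return "register-step";
    }
  }

  /** Stand-in for the partition dataset: skips the step when there is nothing to copy. */
  public static class PartitionDataset extends BaseDataset {
    @Override
    protected boolean shouldAddPostPublishStep(List<String> filesToCopy) {
      return !filesToCopy.isEmpty();
    }

    @Override
    protected String createPostPublishStep(List<String> filesToCopy) {
      return "overwrite-partitions-step(" + filesToCopy.size() + " files)";
    }
  }

  public static void main(String[] args) {
    List<String> none = new ArrayList<>();
    System.out.println(new BaseDataset().generateCopyEntities(none));
    System.out.println(new PartitionDataset().generateCopyEntities(none));
  }
}
```

The derived class then overrides only the two hooks, and the loop that builds the copy entities lives in one place.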

   * @throws IOException if an I/O error occurs
   */
  @Override
  Collection<CopyEntity> generateCopyEntities(FileSystem targetFs, CopyConfiguration copyConfig) throws IOException {
Contributor

@phet phet Sep 24, 2024

this impl is really, really similar to the one it's based on in its base class. deriving from a class and then overriding methods w/ only small changes is pretty nearly cut-and-paste code. sometimes it's inevitable, but let's avoid when we can. in this case, could we NOT override this method, but only GetFilePathsToFileStatusResult getFilePathsToFileStatus(...) so this derived class's version runs the new code instead:

    IcebergTable srcIcebergTable = getSrcIcebergTable();
    List<DataFile> srcDataFiles = srcIcebergTable.getPartitionSpecificDataFiles(this.partitionFilterPredicate);
    List<DataFile> destDataFiles = getDestDataFiles(srcDataFiles);
    Configuration defaultHadoopConfiguration = new Configuration();

    for (FilePathsWithStatus filePathsWithStatus : getFilePathsStatus(srcDataFiles, destDataFiles, this.sourceFs)) {
...

Contributor Author

I will list my reasons here:

  1. The IcebergDataset implementation assumes that srcPath and destPath are the same, which is not the case here. Where that code uses srcPath and srcFileStatus, this one needs destPath and srcFileStatus, both for readability and for maintainability.
  2. Currently I have added only ReplacePartitionStep as a post-publish step, but IcebergRegisterStep also needs to be added for the schema validation scenario. I will raise that as a separate PR, because it needs proper validation so that we do not corrupt data files on the dest table.
  3. I am not fully convinced that copying ancestor permissions is even required. I did try to make it work by changing the ancestor path and parent path, but it wasn't working, so removing it is a must for now.
  4. If I override only GetFilePathsToFileStatusResult getFilePathsToFileStatus(...), then we need to override the data class GetFilePathsToFileStatusResult too, since we need the data files along with destPath and srcFileStatus.

To conclude:

  • the reader should be able to tell whether it is actually srcPath or destPath when the copyable file is created
  • a replace-partition commit step needs to be added along with the register step (based on a condition)
  • copying permissions is removed for now
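
For readers following point 4, the trade-off can be sketched as below: a result type extended so that data files travel with the path-to-status mapping. This is only an illustration under assumptions. The real fields of GetFilePathsToFileStatusResult are not shown in this conversation, so `PathsToStatus`, `PartitionPathsToStatus`, and `countFilesToCopy` are hypothetical names, and strings stand in for paths, file statuses, and data files.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FileStatusResultSketch {

  /** Stand-in for the base result type: maps a path to its source file status. */
  public static class PathsToStatus {
    private final Map<String, String> pathToSrcFileStatus;

    public PathsToStatus(Map<String, String> pathToSrcFileStatus) {
      this.pathToSrcFileStatus = Collections.unmodifiableMap(new LinkedHashMap<>(pathToSrcFileStatus));
    }

    public Map<String, String> getPathToSrcFileStatus() {
      return pathToSrcFileStatus;
    }
  }

  /** Hypothetical extension: keys are destination paths, and the dest data files ride along. */
  public static class PartitionPathsToStatus extends PathsToStatus {
    private final List<String> destDataFiles;

    public PartitionPathsToStatus(Map<String, String> destPathToSrcFileStatus, List<String> destDataFiles) {
      super(destPathToSrcFileStatus);
      this.destDataFiles = Collections.unmodifiableList(new ArrayList<>(destDataFiles));
    }

    public List<String> getDestDataFiles() {
      return destDataFiles;
    }
  }

  /** Code written against the base type keeps working with the extended result. */
  public static int countFilesToCopy(PathsToStatus result) {
    return result.getPathToSrcFileStatus().size();
  }

  public static void main(String[] args) {
    Map<String, String> mapping = new LinkedHashMap<>();
    mapping.put("/dest/table/data/a.parquet", "status-of:/src/table/data/a.parquet");
    PartitionPathsToStatus result =
        new PartitionPathsToStatus(mapping, Collections.singletonList("/dest/table/data/a.parquet"));
    System.out.println(countFilesToCopy(result));
    System.out.println(result.getDestDataFiles());
  }
}
```

The cost is an extra type plus a cast or a second accessor wherever the data files are needed, which is the overhead point 4 describes.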
