
Data pipeline failed without retry #154

Open
zuston opened this issue Nov 6, 2024 · 17 comments

Comments

@zuston
Contributor

zuston commented Nov 6, 2024

[screenshot of the error]

When using the hdfs-native crate, I encountered a data pipeline failure. If this was caused by a problematic datanode, do we need to request a new block location?

Could you help look into this problem? @Kimahriman

@zuston
Contributor Author

zuston commented Nov 6, 2024

After digging into the vanilla HDFS code, it looks like it will replace the bad datanode if the pipeline fails.

[screenshot]

@Kimahriman
Owner

Yeah the write path is the least resilient part of the library right now. I need to take a deeper look to understand exactly how the Java writer handles lost data nodes, and then figure out how I could write a test for something like that.

If you're not already, I'd currently recommend retrying the whole write from scratch if you still have access to all the data you are trying to write.
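
For example, a retry-from-scratch wrapper could look roughly like this (a minimal sketch, not part of hdfs-native; `write_once` is a hypothetical caller-supplied closure that deletes any partial file, re-creates it, and writes the full payload again with the crate):

```rust
use std::future::Future;

/// Minimal sketch: retry the entire write from scratch up to `max_attempts`
/// times. `write_once` is a hypothetical caller-supplied closure that deletes
/// any partial file, re-creates it, and writes the full payload again.
async fn write_with_retry<F, Fut, E>(mut write_once: F, max_attempts: usize) -> Result<(), E>
where
    F: FnMut() -> Fut,
    Fut: Future<Output = Result<(), E>>,
    E: std::fmt::Debug,
{
    let mut last_err = None;
    for attempt in 1..=max_attempts {
        match write_once().await {
            Ok(()) => return Ok(()),
            Err(e) => {
                eprintln!("write attempt {attempt} failed: {e:?}");
                last_err = Some(e);
            }
        }
    }
    Err(last_err.expect("max_attempts must be at least 1"))
}
```

This obviously only works if the full payload is still available to the caller, which is the assumption above.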

@Kimahriman
Owner

Based on a quick look, it seems like the logic is generally:

  • If a DataNode fails, either the one it is talking to or any of the others being replicated to, simply remove that DataNode from the block being written and continue on
  • If certain conditions are true, add a new DataNode to the block. Specifically the default conditions are in this comment:
DEFAULT condition:
  Let r be the replication number.
  Let n be the number of existing datanodes.
  Add a new datanode only if r >= 3 and either
  (1) floor(r/2) >= n or (2) the block is hflushed/appended.

So for example with replication of 3, a new DataNode would only get added if two DataNodes fail during writing that block.

The first one is simpler and I could look into trying to add that, though I'm still not exactly sure how I would test it. That should make things more resilient than the current behavior, which is to simply fail on the first DataNode error.

The second is a bit more complex and requires copying the already written data to a new replica before continuing, and probably wouldn't be something I could tackle anytime soon.
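
To make the quoted DEFAULT condition concrete, it boils down to a small standalone check like this (a sketch only; the function and parameter names are hypothetical, not from hdfs-native or Hadoop):

```rust
/// Sketch of the DEFAULT ReplaceDatanodeOnFailure condition quoted above.
/// `replication` is r, `remaining_datanodes` is n (nodes still alive in the pipeline).
fn should_add_replacement_datanode(
    replication: u32,
    remaining_datanodes: u32,
    is_hflushed_or_appended: bool,
) -> bool {
    // Add a new datanode only if r >= 3 and either
    // (1) floor(r/2) >= n, or (2) the block is hflushed/appended.
    replication >= 3 && (replication / 2 >= remaining_datanodes || is_hflushed_or_appended)
}
```

With replication 3 and no hflush/append, this only becomes true once the pipeline is down to a single surviving datanode, matching the example above.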

@zuston
Contributor Author

zuston commented Nov 6, 2024

Thanks for your quick reply. I found another error status when appending a file. Please see this:

[screenshot of the error]

And I think the first solution is good for my append case. @Kimahriman

@zuston
Contributor Author

zuston commented Nov 6, 2024

Yeah the write path is the least resilient part of the library right now.

Yes, it looks unstable, especially in a busy HDFS cluster.

If you have any improvement patch, I'm happy to test it. I can reproduce this in most of my cases.

@Kimahriman
Owner

Thanks for your quick reply. I found another error status when appending a file. Please see this:

This would be when a DataNode the block is replicating to fails. The connection drop would be when the one you are talking to dies, so I think they're effectively the same issue.

@zuston
Contributor Author

zuston commented Nov 7, 2024

After simply looking this article https://blog.cloudera.com/understanding-hdfs-recovery-processes-part-2/, I think the bad data node replacement will not trigger the original data resend. Once client found the bad data node, it will build the new pipeline through the namenode, and then resend the data start from last acked packet. @Kimahriman
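
The recovery flow described in that article boils down to tracking which packets have been acked and re-queuing the rest when the pipeline is rebuilt. A rough sketch of that bookkeeping (hypothetical types, not the hdfs-native implementation):

```rust
use std::collections::VecDeque;

struct Packet {
    seqno: u64,
    data: Vec<u8>,
}

/// Packets stay in an "ack pending" queue until the whole pipeline acks them,
/// so on a pipeline failure the client can resend everything after the last
/// acked packet once a new pipeline is set up.
struct PipelineState {
    // Sent but not yet acknowledged by the whole pipeline.
    ack_pending: VecDeque<Packet>,
    // Waiting to be sent.
    to_send: VecDeque<Packet>,
}

impl PipelineState {
    /// Called when an ack for `seqno` arrives from the pipeline.
    fn on_ack(&mut self, seqno: u64) {
        while matches!(self.ack_pending.front(), Some(p) if p.seqno <= seqno) {
            self.ack_pending.pop_front();
        }
    }

    /// Called when the pipeline fails: move all unacked packets back to the
    /// front of the send queue before rebuilding the pipeline through the
    /// namenode and resuming from the first unacked packet.
    fn on_pipeline_failure(&mut self) {
        while let Some(p) = self.ack_pending.pop_back() {
            self.to_send.push_front(p);
        }
    }
}
```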

@Kimahriman
Owner

Very helpful article! I hoped that might be the case for node replacement but then found https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java#L1550 while tracing through the code. The data is re-replicated by the client on node replacement. It can only keep going from where it left off if the node isn't replaced.

@zuston
Contributor Author

zuston commented Nov 7, 2024

Thanks for the code reference. Let me take a deeper look.

@zuston
Contributor Author

zuston commented Nov 8, 2024

The data is re-replicated by the client on node replacement.

From digging into DataStreamer.java, the replica transfer goes from a healthy datanode to the replacement datanodes. The client only triggers the recovery pipeline creation to backfill the data from a healthy node to the replacement node.

BTW, if we ignore the bad nodes and do nothing, missing blocks could result. So if we want a resilient write process, datanode replacement may need to be supported.

@Kimahriman
Owner

From digging into DataStreamer.java, the replica transfer goes from a healthy datanode to the replacement datanodes. The client only triggers the recovery pipeline creation to backfill the data from a healthy node to the replacement node.

Yeah, it looks like it just makes an RPC call to trigger the replication, so it's just a little bit of extra complexity on top. I think that would still be follow-on work after getting the write to continue with fewer DataNodes on failures.

BTW, if we ignore the bad nodes and do nothing, missing blocks could result. So if we want a resilient write process, datanode replacement may need to be supported.

My understanding is that even if a single node (or multiple, but not all) in a pipeline is lost, as long as you successfully write the block to at least one DataNode, HDFS will re-replicate it after the fact since it will see it is under-replicated. I think the main reason for replicating as part of the pipeline recovery is just to make that a little more resilient, i.e. if in the end all but one DataNode fails in your pipeline and you "successfully" finish your write, but then that one DataNode immediately dies, your write was successful but now you are missing that block.

@zuston
Contributor Author

zuston commented Nov 8, 2024

My understanding is that even if a single node (or multiple, but not all) in a pipeline is lost, as long as you successfully write the block to at least one DataNode, HDFS will re-replicate it after the fact since it will see it is under-replicated.

Got it. If so, simply ignoring the bad datanode in the pipeline is acceptable. We just need to recognize the bad node from the ack response's reply and flag fields and then ignore it.
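
A sketch of that recognition step (hypothetical types; in the DataTransfer protocol the pipeline ack carries one status per downstream datanode in pipeline order, so the first non-success entry identifies the bad node):

```rust
/// Hypothetical subset of the DataTransfer ack statuses.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum AckStatus {
    Success,
    Error,
    ErrorChecksum,
}

/// Returns the pipeline index of the first datanode that did not report
/// success, or None if every datanode acked the packet successfully.
fn first_bad_datanode(replies: &[AckStatus]) -> Option<usize> {
    replies.iter().position(|s| *s != AckStatus::Success)
}
```

The writer could then drop that datanode from its pipeline and keep writing to the remaining ones, as discussed above.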

@zuston
Contributor Author

zuston commented Nov 13, 2024

Another possible error message:

[screenshot]

@zuston
Contributor Author

zuston commented Nov 13, 2024

[screenshot]

@Kimahriman
Owner

Hmm, that's very odd; it might be unrelated to writing data. What version of HDFS are you running? Also, if you can replicate those, having some debug logs would be helpful to see what's going on.

@zuston
Contributor Author

zuston commented Nov 14, 2024

Hmm, that's very odd; it might be unrelated to writing data. What version of HDFS are you running? Also, if you can replicate those, having some debug logs would be helpful to see what's going on.

HDFS version 3.2.2.

Also, if you can replicate those, having some debug logs would be helpful to see what's going on.

OK, I will attach more detailed logs from the namenode or datanodes if I get them.
