Description
Here is my use case:
I have 4 GPU nodes on AWS for training (which includes computing tensors).
I want to save the pre-computed tensors to deeplake (Dataset/database/vectorstore) so that the next training run can skip recomputing them and save a lot of time.
I use accelerate as my distributed parallel framework.
So my workflow looks like this:
import deeplake
import torch

# one deeplake dataset per process; current_process_index comes from
# accelerate (e.g. Accelerator().process_index)
deeplake_path = 'dataset_{}'.format(current_process_index)
ds = deeplake.dataset(deeplake_path, overwrite=False)

for index, data_dict in enumerate(my_pytorch_dataloader):
    with torch.no_grad():
        a = net_a_frozen(data_dict['a'])
        b = net_b_frozen(data_dict['b'])
        # loss = net_c_training(a, b)
        # the loss is only used in training.
    save_dict = {'data_dict': data_dict,
                 'a': a.detach().cpu().numpy(),
                 'b': b.detach().cpu().numpy()}
    append_to_deeplake(deeplake_path, save_dict)   # my own helper that appends save_dict
    if index % 100 == 0:
        commit_to_deeplake(deeplake_path)          # my own helper that commits the dataset
Note that after the deeplake dataset is constructed, I can read the tensors I need from deeplake in the next training run instead of computing them again.
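For the next training run I imagine the read side looking roughly like this (a minimal sketch, assuming the dataset ends up with tensors named 'a' and 'b', and that deeplake.load() plus ds.pytorch() is the intended way to read them back):

import deeplake

ds = deeplake.load(deeplake_path)
loader = ds.pytorch(batch_size=32, num_workers=4, shuffle=True)
for batch in loader:
    # 'a' and 'b' are the pre-computed tensors saved above, so the two
    # frozen forward passes are skipped entirely
    loss = net_c_training(batch['a'].cuda(), batch['b'].cuda())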
The problems include:
1. I have to assign a different deeplake dataset to each process, but afterwards I need to merge them into a single dataset (as in the merge sketch after this list).
2. I need to design a proper for-loop/parallel workflow for deeplake dataset construction.
3. The frequent append and commit calls take a lot of time.
4. The detach() and cpu() calls take a lot of time.
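For problem 1, this is roughly the merge loop I would otherwise have to write by hand (a minimal sketch, assuming 4 per-process datasets named dataset_0 ... dataset_3 and that re-appending each sample as numpy arrays is a reasonable way to combine datasets):

import deeplake

merged = deeplake.empty('dataset_merged', overwrite=True)
with merged:  # batch the writes instead of flushing on every append
    for rank in range(4):
        worker_ds = deeplake.load('dataset_{}'.format(rank))
        # create matching tensors in the merged dataset on first encounter
        for name in worker_ds.tensors:
            if name not in merged.tensors:
                merged.create_tensor(name)
        # copy every sample over
        for sample in worker_ds:
            merged.append({name: sample[name].numpy() for name in worker_ds.tensors})
merged.commit('merge per-process datasets')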
So, is there any feature to transform a custom dataset into a deeplake dataset?
If there were a function that works like the following, it would solve my problem:
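(A purely hypothetical signature, just to illustrate the feature I am asking for; as far as I know, no such function exists in deeplake today.)

# hypothetical API -- the function name and arguments are made up
ds = deeplake.from_pytorch(
    my_pytorch_dataset,                 # my existing custom dataset
    path='dataset_merged',
    transform=compute_frozen_features,  # would run net_a_frozen / net_b_frozen per sample
    num_workers=4,                      # one worker per process / GPU node
)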
Or could you give me a standard workflow to solve this?
I don't know which method is best for this scenario.
The documentation does not cover this problem, and #2596 also describes it.
Use Cases
Distributed parallel computing and saving to deeplake.
Hey @ChawDoe! Thanks for opening the issue. Let us look into whether any of our current workflows will satisfy your use case and we'll get back to you in a few days.
Thanks! I hope that I have explained my use case clearly.
Maybe I need a function like this:
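(Again purely hypothetical; none of these names exist in deeplake as far as I know. The idea is that per-process dataset creation, appending, committing, and the final merge would all be handled for me.)

# hypothetical API -- names and arguments are made up
deeplake.parallel_write(
    path='dataset_merged',
    data_iter=my_pytorch_dataloader,
    compute_fn=lambda d: {'a': net_a_frozen(d['a']),
                          'b': net_b_frozen(d['b'])},
    process_index=current_process_index,   # from accelerate
    num_processes=4,
    commit_every=100,
)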