I understand that CUDA operators can be wrapped with PyTorch's cpp_extension and called directly from Python. Why would one need to rebuild the entire network in C++?
Thanks for raising this question. For now, we rebuilt the entire network in C++ to maximize efficiency and minimize PyTorch overhead. We are also considering wrapping our core kernels as torchao extensions.
Does using torch extensions lead to performance degradation compared to rebuilding the entire network in C++?
```cpp
JointTransformerBlock::JointTransformerBlock(int dim, int num_attention_heads, int attention_head_dim, bool context_pre_only, Tensor::ScalarType dtype, Device device) :
    dim(dim),
    dim_head(attention_head_dim / num_attention_heads),
    num_heads(num_attention_heads),
    context_pre_only(context_pre_only),
    norm1(dim, false, dtype, device),
    norm1_context(dim, context_pre_only, dtype, device),
    qkv_proj(dim, dim * 3, true, dtype, device),
    qkv_proj_context(dim, dim * 3, true, dtype, device),
    norm_q(dim_head, 1e-6, false, dtype, device),
    norm_k(dim_head, 1e-6, false, dtype, device),
    norm_added_q(dim_head, 1e-6, false, dtype, device),
    norm_added_k(dim_head, 1e-6, false, dtype, device),
    attn(num_attention_heads, attention_head_dim / num_attention_heads, device),
    out_proj(dim, dim, true, dtype, device),
    out_proj_context(dim, dim, true, dtype, device),
    norm2(dim, 1e-6, false, dtype, device),
    norm2_context(dim, 1e-6, false, dtype, device),
    mlp_fc1(dim, dim * 4, true, dtype, device),
    mlp_fc2(dim * 4, dim, true, dtype, device),
    mlp_context_fc1(dim, dim * 4, true, dtype, device),
    mlp_context_fc2(dim * 4, dim, true, dtype, device)
{
    registerChildren
        (norm1, "norm1")
        (norm1_context, "norm1_context")
        (qkv_proj, "qkv_proj")
        (qkv_proj_context, "qkv_proj_context")
        (norm_q, "norm_q")
        (norm_k, "norm_k")
        (norm_added_q, "norm_added_q")
        (norm_added_k, "norm_added_k")
        (attn, "attn")
        (out_proj, "out_proj")
        (out_proj_context, "out_proj_context")
        (norm2, "norm2")
        (norm2_context, "norm2_context")
        (mlp_fc1, "mlp_fc1")
        (mlp_fc2, "mlp_fc2")
        (mlp_context_fc1, "mlp_context_fc1")
        (mlp_context_fc2, "mlp_context_fc2")
    ;
}
```