-
Notifications
You must be signed in to change notification settings - Fork 223
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Local execution tests #1418
Local execution tests #1418
Conversation
… tests, other review changes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you double check and make sure all of the std::cout
s are removed? I left comments on all I found but it's not impossible that I missed one or two
Reviewed 48 of 67 files at r8, 1 of 1 files at r10, 34 of 34 files at r11, all commit messages.
Reviewable status: all files reviewed, 51 unresolved discussions (waiting on @reyna-abhyankar)
lib/kernels/include/kernels/accessor.h
line 172 at r11 (raw file):
""); std::string format_as(std::vector<GenericTensorAccessorR> const &x);
Why do we have this instead of just having fmt
support for GenericTensorAccessorR
and using the default utils/fmt/vector.h
fmt
support for std::vector
?
lib/kernels/include/kernels/device.h
line 102 at r4 (raw file):
std::stringstream _error; \ if (status != 0) { \ _error << "CUDA failure: " << cudaGetErrorString(status) << " (" \
This should be kept, why was this removed?
lib/kernels/src/ff_handle.cc
line 6 at r11 (raw file):
std::string format_as(PerDeviceFFHandle const &x) { return fmt::format("PerDeviceFFHandle");
Why not include some fields in the fmt description?
lib/local-execution/include/local-execution/cost_estimate.h
line 21 at r11 (raw file):
std::vector<ParallelTensorAttrs> const &outputs, MachineView const &mv) const = 0; virtual CostDetails
Remove this second overload and just ignore MachineView
where it's not necessary. Overall CostEstimator
should include network modelling, etc. rather than just being a thin interface onto kernel runtimes
lib/local-execution/include/local-execution/device_specific_device_states.variant.toml
line 2 at r11 (raw file):
namespace = "FlexFlow" name = "DeviceSpecificDeviceStates"
What is the purpose of this variant? What's the reasoning for having a variant<DeviceSpecific<T>, ...>
vs a DeviceSpecific<variant<T, ...>>
?
lib/local-execution/include/local-execution/device_states.variant.toml
line 2 at r11 (raw file):
namespace = "FlexFlow" name = "DeviceStates"
Suggestion:
name = "PerDeviceState"
lib/local-execution/include/local-execution/local_task_argument_accessor.h
line 47 at r11 (raw file):
CHECK_RC_COPY_VIRTUAL_COMPLIANT(LocalTaskArgumentAccessor); std::string format_as(std::unordered_map<slot_id_t, ConcreteArgSpec> const &x);
utils/fmt/unordered_map.h
lib/local-execution/include/local-execution/local_training_backing.h
line 17 at r11 (raw file):
TensorBackingMap const &, RuntimeArgConfig const &); ~LocalTrainingBacking() = default;
lib/local-execution/include/local-execution/local_training_backing.h
line 38 at r11 (raw file):
ComputationGraph computation_graph; TaskRegistry task_registry; LocalSlotsBacking local_slots_backing;
Suggestion:
private:
Allocator allocator;
ComputationGraph computation_graph;
TaskRegistry task_registry;
LocalSlotsBacking local_slots_backing;
lib/local-execution/include/local-execution/op_task_invocation.h
line 102 at r11 (raw file):
std::unordered_map<slot_id_t, OpArgSpec> arg_bindings; std::tuple<decltype(tensor_bindings) const &, decltype(arg_bindings) const &> tie() const;
Having a separate section for member variables can improve readability
Suggestion:
private:
std::unordered_map<SlotGradId, OpTensorSpec> tensor_bindings;
std::unordered_map<slot_id_t, OpArgSpec> arg_bindings;
private:
void insert_arg_spec(slot_id_t name, OpArgSpec const &arg_spec);
std::tuple<decltype(tensor_bindings) const &, decltype(arg_bindings) const &>
tie() const;
lib/local-execution/include/local-execution/task_argument_accessor.h
line 13 at r11 (raw file):
template <typename T> T const &get_argument(slot_id_t slot) const { if constexpr (DeviceStates::IsPartOfDeviceStates_v<T>) {
Is doing this based on type the best way of doing this, or should this be done based on whether a value was passed as device-specific in the invocation?
lib/local-execution/include/local-execution/task_impl_function.variant.toml
line 12 at r7 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
I think here the
key
is expressive enough for the type anyway?
key
isn't really used for much except for json serialization iirc
lib/local-execution/include/local-execution/task_registry.h
line 12 at r11 (raw file):
namespace FlexFlow { struct TaskRegistry {
Move TaskRegistry
over to dtgen
?
lib/local-execution/include/local-execution/task_registry.h
line 26 at r11 (raw file):
bool operator!=(TaskRegistry const &other) const; private:
Suggestion:
bool operator==(TaskRegistry const &other) const;
bool operator!=(TaskRegistry const &other) const;
private:
std::unordered_map<layer_guid_t, std::optional<task_id_t>> init_task_ids;
std::unordered_map<layer_guid_t, std::optional<task_id_t>> forward_task_ids;
std::unordered_map<layer_guid_t, std::optional<task_id_t>> backward_task_ids;
std::unordered_map<task_id_t, TaskSignatureAndImpl> task_mapping;
private:
lib/local-execution/include/local-execution/task_registry.h
line 35 at r11 (raw file):
std::string format_as( std::unordered_map<layer_guid_t, std::optional<task_id_t>> const &x);
Why not use the default utils/fmt/unordered_map.h
and utils/fmt/optional.h
?
lib/local-execution/include/local-execution/task_signature_impl.h
line 19 at r7 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
The
TaskImplFunction
doesn't haveeq
because I don't think that works forstd::function
, so for this I think it's fine? since the==
forTaskSignatureAndImpl
isn't actually comparing all values, just the signatures.
In that case it would be better to have a separate function rather than an operator==
--I'd recommend only having operator==
if the type does actually support equality
lib/local-execution/src/concrete_arg.cc
line 19 at r11 (raw file):
std::string format_as(ConcreteArgSpec const &x) { std::ostringstream oss; oss << "<ConcreteArgSpec";
Would probably be good to print the value too since it's concrete?
lib/local-execution/src/device_states.cc
line 11 at r11 (raw file):
[device_idx](DeviceSpecific<MHAPerDeviceState> const &device_specific_device_state) { std::cout << "im";
Delete
lib/local-execution/src/device_states.cc
line 14 at r11 (raw file):
return DeviceStates{*(device_specific_device_state.get(device_idx))}; }, // [device_idx](DeviceSpecific<BatchNormPerDeviceState> const
What is going on here?
lib/local-execution/src/local_training_backing.cc
line 21 at r6 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
Why can't its output tensor guid just point to the same region?
Probably? It's fine for now, this can always be addressed if necessary when the legion backend is added
lib/local-execution/src/local_training_backing.cc
line 33 at r7 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
Shouldn't every operator have
std::nullopt
by default? Some ops don't have init task for example, but if we just moved this insertion into the else case, then those operators won't have a key in theinit_task_ids
In that case you'd add it to the list of attrs not to register tasks for along with WeightAttrs
, InputAttrs
, and NoopAttrs
--that at least seems to be the current concept behind how the code works, so having two different methods for not registering a task seems odd to me
lib/local-execution/src/local_training_backing.cc
line 60 at r11 (raw file):
DeviceSpecificDeviceStates device_state = this->call_init_task_impl(invocation.task_id, accessor); std::cout << "post init";
Delete
lib/local-execution/src/op_task_signature.cc
line 120 at r11 (raw file):
OpTaskSignature infer_bwd_signature(OpTaskSignature const &fwd) { OpTaskSignature bwd(fwd);
Suggestion:
OpTaskSignature bwd = fwd;
lib/local-execution/src/task_registry.cc
line 38 at r11 (raw file):
this->task_mapping.insert({task_id, task_signature_impl}); } }
It seems like I'd like to know if something I expected to register a task doesn't--otherwise the optional
doesn't have much of a point--yes this will complain if something isn't registered, but that seems like the point
Suggestion:
if (!(attrs.has<WeightAttrs>() || attrs.has<InputAttrs>() ||
attrs.has<NoopAttrs>())) {
// register tasks
std::vector<task_id_t> task_ids = get_task_ids(attrs);
for (task_id_t task_id : task_ids) {
TaskSignatureAndImpl task_signature_impl = get_task_sig_impl(task_id);
switch (task_signature_impl.task_signature.type) {
case OpTaskType::INIT:
assert(is_invocation_valid(task_signature_impl.task_signature,
init(attrs)));
this->init_task_ids[op_id] = task_id;
break;
case OpTaskType::FWD:
assert(is_invocation_valid(task_signature_impl.task_signature,
forward(attrs)));
this->forward_task_ids[op_id] = task_id;
break;
case OpTaskType::BWD:
assert(is_invocation_valid(task_signature_impl.task_signature,
backward(attrs)));
this->backward_task_ids[op_id] = task_id;
break;
default:
throw mk_runtime_error("Invalid OpTaskType");
}
this->task_mapping.insert({task_id, task_signature_impl});
} else {
this->init_task_ids.insert({op_id, std::nullopt});
this->forward_task_ids.insert({op_id, std::nullopt});
this->backward_task_ids.insert({op_id, std::nullopt});
}
}
lib/local-execution/src/task_registry.cc
line 61 at r11 (raw file):
std::string format_as( std::unordered_map<layer_guid_t, std::optional<task_id_t>> const &x) {
utils/fmt/unordered_map.h
lib/local-execution/src/task_registry.cc
line 72 at r11 (raw file):
std::string format_as(std::unordered_map<task_id_t, TaskSignatureAndImpl> const &x) {
utils/fmt/unordered_map.h
lib/local-execution/src/task_registry.cc
line 82 at r11 (raw file):
} std::string format_as(TaskRegistry const &x) {
Fix (if you move TaskRegistry
over to dtgen
, this should be automatically generated for you)
lib/local-execution/src/task_signature_impl.cc
line 314 at r11 (raw file):
std::string format_as(TaskSignatureAndImpl const &x) { return fmt::format("TaskSignatureAndImpl");
Fix (ideally through dtgen
)
lib/local-execution/src/tasks.cc
line 6 at r11 (raw file):
namespace FlexFlow { std::string format_as(task_id_t const &x) {
Probably best to just make task_id_t
a dtgen
enum
, and then you get format_as
for free. We'll need a bit of conversion code when registering things in Legion, but it should be pretty trivial
lib/local-execution/src/ops/attention.cc
line 134 at r11 (raw file):
kvSeqLength, attrs.add_bias_kv); std::cout << "pre fn return";
Remove
lib/local-execution/src/ops/conv_2d.cc
line 54 at r11 (raw file):
} static DeviceSpecificDeviceStates
Why did this return type get changed? Why not just return a DeviceSpecific<Conv2DPerDeviceState>
?
Code quote:
DeviceSpecificDeviceStates
lib/local-execution/test/src/test_local_slots_backing.cc
line 12 at r11 (raw file):
TEST_SUITE(FF_TEST_SUITE) { TEST_CASE("Local Slots Backing -- Attention Op") {
Suggestion:
TEST_CASE("LocalSlotsBacking -- Attention Op") {
lib/local-execution/test/src/test_local_slots_backing.cc
line 31 at r11 (raw file):
GenericTensorAccessorW key = allocator.allocate_tensor(input_tensor_shape); GenericTensorAccessorW value = allocator.allocate_tensor(input_tensor_shape);
Suggestion:
TensorShape input_tensor_shape = TensorShape{
TensorDims{FFOrdered<size_t>{batch_size, seq_len, feature_size}},
DataType::FLOAT,
};
TensorShape query_shape = input_tensor_shape;
TensorShape key_shape = input_tensor_shape;
TensorShape value_shape = input_tensor_shape;
GenericTensorAccessorW query =
allocator.allocate_tensor(query_shape);
GenericTensorAccessorW key = allocator.allocate_tensor(key_shape);
GenericTensorAccessorW value =
allocator.allocate_tensor(value_shape);
lib/local-execution/test/src/test_local_slots_backing.cc
line 77 at r11 (raw file):
for (layer_guid_t const &node : topological_ordering(cg_builder.computation_graph)) { local_slots_backing.allocate_tensors(
If we're testing allocate_tensors
, while are we calling allocate_tensors
so many times (i.e., in a loop)? Seems like it would be easier to call it for one input and then check the output
lib/local-execution/test/src/test_local_slots_backing.cc
line 81 at r11 (raw file):
} SUBCASE("Allocate and insert new tensors into slots") { SUBCASE("Tensor allocation") {
Suggestion:
SUBCASE("Tensor allocation") {
auto get_result_shape_and_dtype_for_tensor_guid = [&](tensor_guid_t t) -> std::pair<ArrayShape, DataType> {
GenericTensorAccessorW accessor = local_slots_backing.gradient_tensor_backing.at(query_guid);
return get_shape_and_datatype(query_grad);
});
lib/local-execution/test/src/test_local_slots_backing.cc
line 90 at r11 (raw file):
get_shape_and_datatype(query_grad); CHECK(result == correct_input); }
Suggestion:
SUBCASE("Query grad") {
std::pair<ArrayShape, DataType> result = [] {
GenericTensorAccessorW query_grad =
local_slots_backing.gradient_tensor_mapping.at(query_guid);
return get_shape_and_datatype(query_grad);
}();
std::pair<ArrayShape, DataType> correct = {ArrayShape{query_shape}, dtype};
CHECK(result == correct);
}
lib/local-execution/test/src/test_local_slots_backing.cc
line 124 at r11 (raw file):
CHECK(result == correct_output); } }
Suggestion:
SUBCASE("LocalSlotsBacking::allocate_tensors") {
for (layer_guid_t const &node :
topological_ordering(cg_builder.computation_graph)) {
local_slots_backing.allocate_tensors(
node, cg_builder.computation_graph, allocator);
}
SUBCASE("Tensor allocation") {
std::pair<ArrayShape, DataType> correct_input =
std::make_pair(ArrayShape{input_tensor_shape}, dtype);
SUBCASE("Query grad") {
GenericTensorAccessorW query_grad =
local_slots_backing.gradient_tensor_mapping.at(query_guid);
std::pair<ArrayShape, DataType> result =
get_shape_and_datatype(query_grad);
CHECK(result == correct_input);
}
SUBCASE("Key grad") {
GenericTensorAccessorW key_grad =
local_slots_backing.gradient_tensor_mapping.at(key_guid);
std::pair<ArrayShape, DataType> result =
get_shape_and_datatype(key_grad);
CHECK(result == correct_input);
}
SUBCASE("Value grad") {
GenericTensorAccessorW value_grad =
local_slots_backing.gradient_tensor_mapping.at(value_guid);
std::pair<ArrayShape, DataType> result =
get_shape_and_datatype(value_grad);
CHECK(result == correct_input);
}
TensorAttrs output_attrs =
get_tensor_attrs(cg_builder.computation_graph, output_guid);
std::pair<ArrayShape, DataType> correct_output =
std::make_pair(ArrayShape{output_attrs.shape}, dtype);
SUBCASE("Output") {
GenericTensorAccessorW output =
local_slots_backing.tensor_mapping.at(output_guid);
std::pair<ArrayShape, DataType> result =
get_shape_and_datatype(output);
CHECK(result == correct_output);
}
SUBCASE("Output grad") {
GenericTensorAccessorW output_grad =
local_slots_backing.tensor_mapping.at(output_guid);
std::pair<ArrayShape, DataType> result =
get_shape_and_datatype(output_grad);
CHECK(result == correct_output);
}
}
lib/local-execution/test/src/test_local_slots_backing.cc
line 176 at r11 (raw file):
}(); SUBCASE("Tensor Slots Backing") {
Suggestion:
SUBCASE("LocalSlotsBacking::construct_tensor_slots_backing") {
lib/local-execution/test/src/test_local_slots_backing.cc
line 195 at r11 (raw file):
{SlotGradId{slot_id_t{WEIGHTS}, IsGrad::NO}, weights}, {SlotGradId{slot_id_t{OUTPUT}, IsGrad::NO}, output}, {SlotGradId{slot_id_t{QUERY}, IsGrad::YES}, query}};
Suggestion:
TensorSlotsBacking correct = [&] {
TensorShape weights_shape = throw_if_unexpected(get_weights_shape(
attrs, input_tensor_shape, input_tensor_shape, input_tensor_shape));
GenericTensorAccessorW weights =
allocator.allocate_tensor(weights_shape);
TensorAttrs output_attrs =
get_tensor_attrs(cg_builder.computation_graph, output_guid);
GenericTensorAccessorW output =
allocator.allocate_tensor(output_attrs.shape);
return TensorSlotsBacking{
{SlotGradId{slot_id_t{QUERY}, IsGrad::NO}, query},
{SlotGradId{slot_id_t{KEY}, IsGrad::NO}, key},
{SlotGradId{slot_id_t{VALUE}, IsGrad::NO}, value},
{SlotGradId{slot_id_t{WEIGHTS}, IsGrad::NO}, weights},
{SlotGradId{slot_id_t{OUTPUT}, IsGrad::NO}, output},
{SlotGradId{slot_id_t{QUERY}, IsGrad::YES}, query},
};
}();
lib/local-execution/test/src/test_local_slots_backing.cc
line 197 at r11 (raw file):
{SlotGradId{slot_id_t{QUERY}, IsGrad::YES}, query}}; CHECK(are_slots_backings_equivalent_up_to_tensor_allocation_addresses(
Maybe instead have a get_slot_backings_without_tensor_allocation_addresses
and compare them as equal?
lib/local-execution/test/src/test_local_slots_backing.cc
line 200 at r11 (raw file):
correct, result)); } SUBCASE("Arg Slots Backing") {
Suggestion:
SUBCASE("ArgSlotsBacking::construct_arg_slots_backing")
lib/local-execution/test/src/test_local_slots_backing.cc
line 216 at r11 (raw file):
{slot_id_t{PROFILING}, ConcreteArgSpec::create(runtime_arg_config.profiling_settings)}, {slot_id_t{HANDLE}, ConcreteArgSpec::create(handle)}};
Suggestion:
ArgSlotsBacking correct = [&] {
ParallelTensorShape query_parallel_tensor_shape =
lift_to_parallel(input_tensor_shape);
return ArgSlotsBacking{
{slot_id_t{QPROJSIZE},
ConcreteArgSpec::create(get_qProjSize(attrs))},
{slot_id_t{ATTRS}, ConcreteArgSpec::create(attrs)},
{slot_id_t{QUERY_PARALLEL_TENSOR_SHAPE},
ConcreteArgSpec::create(query_parallel_tensor_shape)},
{slot_id_t{PROFILING},
ConcreteArgSpec::create(runtime_arg_config.profiling_settings)},
{slot_id_t{HANDLE}, ConcreteArgSpec::create(handle)}};
}();
lib/local-execution/test/src/test_local_slots_backing.cc
line 220 at r11 (raw file):
CHECK(result == correct); SUBCASE("FF Handle") {
This seems like a check for resolve_runtime_arg_ref_spec
, not construct_arg_slots_backing
?
lib/local-execution/test/src/test_local_slots_backing.cc
line 230 at r11 (raw file):
} SUBCASE("Device Specific Device States") {
What function is being tested here?
Code quote:
SUBCASE("Device Specific Device States") {
lib/local-execution/test/src/test_local_task_arg_accessor.cc
line 9 at r11 (raw file):
TEST_SUITE(FF_TEST_SUITE) { TEST_CASE("Local Task Argument Accessor") {
Suggestion:
TEST_CASE("LocalTaskArgumentAccessor") {
lib/local-execution/test/src/test_local_task_arg_accessor.cc
line 49 at r11 (raw file):
SUBCASE("get_tensor") { SUBCASE("Read-only input tensor") {
Suggestion:
SUBCASE("get_tensor(slot_id_t, Permissions::RO, IsGrad::NO)") {
lib/local-execution/test/src/test_local_task_arg_accessor.cc
line 54 at r11 (raw file):
GenericTensorAccessorR result = std::get<GenericTensorAccessorR>( acc.get_tensor(slot_id_t{INPUT}, Permissions::RO, IsGrad::NO)); CHECK(correct == result);
Suggestion:
GenericTensorAccessor correct = GenericTensorAccessor{
read_only_accessor_from_write_accessor(input),
};
GenericTensorAccessor result = acc.get_tensor(slot_id_t{INPUT}, Permissions::RO, IsGrad::NO);
CHECK(correct == result);
lib/local-execution/test/src/test_local_task_arg_accessor.cc
line 81 at r11 (raw file):
GenericTensorAccessorW result = std::get<GenericTensorAccessorW>( acc.get_tensor(slot_id_t{INPUT}, Permissions::RW, IsGrad::YES)); CHECK(input_grad == result);
Suggestion:
GenericTensorAccessor result =
acc.get_tensor(slot_id_t{INPUT}, Permissions::RW, IsGrad::YES);
GenericTensorAccessor correct = input_grad;
CHECK(result == correct);
lib/local-execution/test/src/test_local_task_arg_accessor.cc
line 93 at r11 (raw file):
std::get<std::vector<GenericTensorAccessorR>>( acc.get_variadic_tensor( slot_id_t{VARIADIC_TENSORS}, Permissions::RO, IsGrad::NO));
Suggestion:
VariadicGenericTensorAccessor result =
acc.get_variadic_tensor(
slot_id_t{VARIADIC_TENSORS}, Permissions::RO, IsGrad::NO));
lib/local-execution/test/src/test_task_registry.cc
line 30 at r11 (raw file):
}}; SUBCASE("register single layer") {
Why not just do a single ==
comparison on the TaskRegistry
, rather than having all of these subcases?
lib/local-execution/test/src/test_task_registry.cc
line 64 at r11 (raw file):
SUBCASE("multiple layers same task") { layer_guid_t layer_1 = layer_guid_t{Node{1}}; layer_guid_t layer_2 = layer_guid_t{Node{2}};
Suggestion:
layer_guid_t other_layer_guid = layer_guid_t{Node{1}};
lib/local-execution/test/src/test_task_registry.cc
line 67 at r11 (raw file):
task_registry.register_tasks_for_layer(layer_guid, attrs); task_registry.register_tasks_for_layer(layer_1, attrs); task_registry.register_tasks_for_layer(layer_2, attrs);
Suggestion:
task_registry.register_tasks_for_layer(layer_guid, attrs);
task_registry.register_tasks_for_layer(other_layer_guid, attrs);
lib/local-execution/src/runtime_arg_ref.cc
line 10 at r8 (raw file):
} RuntimeArgRef<DeviceSpecific<PerDeviceFFHandle>> ff_handle() {
What's the argument for this being a DeviceSpecific<PerDeviceFFHandle>
instead of just a PerDeviceFFHandle
?
Code quote:
RuntimeArgRef<DeviceSpecific<PerDeviceFFHandle>>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 2 files at r12, all commit messages.
Reviewable status: all files reviewed, 51 unresolved discussions (waiting on @reyna-abhyankar)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Reviewable status: 104 of 177 files reviewed, 51 unresolved discussions (waiting on @lockshaw and @reyna-abhyankar)
lib/kernels/include/kernels/accessor.h
line 172 at r11 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why do we have this instead of just having
fmt
support forGenericTensorAccessorR
and using the defaultutils/fmt/vector.h
fmt
support forstd::vector
?
Done.
lib/kernels/include/kernels/device.h
line 102 at r4 (raw file):
Previously, lockshaw (Colin Unger) wrote…
This should be kept, why was this removed?
Temporarily removed it because I was getting a linker error when running proj test
for local-execution-tests
: undefined reference to 'cudaGetErrorString'
lib/kernels/src/ff_handle.cc
line 6 at r11 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why not include some fields in the fmt description?
Done.
lib/local-execution/include/local-execution/device_specific_device_states.variant.toml
line 2 at r11 (raw file):
Previously, lockshaw (Colin Unger) wrote…
What is the purpose of this variant? What's the reasoning for having a
variant<DeviceSpecific<T>, ...>
vs aDeviceSpecific<variant<T, ...>>
?
We want the device specific to hold the actual device state type, and then the init functions for all operators can return one of these device specifics.
lib/local-execution/include/local-execution/local_task_argument_accessor.h
line 47 at r11 (raw file):
Previously, lockshaw (Colin Unger) wrote…
utils/fmt/unordered_map.h
Done.
lib/local-execution/include/local-execution/local_training_backing.h
line 17 at r11 (raw file):
TensorBackingMap const &, RuntimeArgConfig const &); ~LocalTrainingBacking() = default;
Done.
lib/local-execution/include/local-execution/local_training_backing.h
line 38 at r11 (raw file):
ComputationGraph computation_graph; TaskRegistry task_registry; LocalSlotsBacking local_slots_backing;
Done.
lib/local-execution/include/local-execution/op_task_invocation.h
line 102 at r11 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Having a separate section for member variables can improve readability
Done.
lib/local-execution/include/local-execution/task_argument_accessor.h
line 13 at r11 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Is doing this based on type the best way of doing this, or should this be done based on whether a value was passed as device-specific in the invocation?
I'm going to make an issue for this because it's related to actually adding support for device specific bindings. #1462
lib/local-execution/include/local-execution/task_registry.h
line 12 at r11 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Move
TaskRegistry
over todtgen
?
We modify it on the fly while constructing the local backing, so I think we should keep as is.
lib/local-execution/include/local-execution/task_registry.h
line 26 at r11 (raw file):
bool operator!=(TaskRegistry const &other) const; private:
Why private these?
lib/local-execution/include/local-execution/task_registry.h
line 35 at r11 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why not use the default
utils/fmt/unordered_map.h
andutils/fmt/optional.h
?
Done.
lib/local-execution/include/local-execution/task_signature_impl.h
line 19 at r7 (raw file):
Previously, lockshaw (Colin Unger) wrote…
In that case it would be better to have a separate function rather than an
operator==
--I'd recommend only havingoperator==
if the type does actually support equality
Done.
lib/local-execution/src/concrete_arg.cc
line 19 at r11 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Would probably be good to print the value too since it's concrete?
the actual value is a template function .get<T>
and we don't know the type
lib/local-execution/src/local_training_backing.cc
line 21 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Probably? It's fine for now, this can always be addressed if necessary when the legion backend is added
Ok.
lib/local-execution/src/local_training_backing.cc
line 33 at r7 (raw file):
Previously, lockshaw (Colin Unger) wrote…
In that case you'd add it to the list of attrs not to register tasks for along with
WeightAttrs
,InputAttrs
, andNoopAttrs
--that at least seems to be the current concept behind how the code works, so having two different methods for not registering a task seems odd to me
Changed this so that the get_task_ids
function handles the individual attr types instead. I think this cleans up the function
lib/local-execution/src/local_training_backing.cc
line 60 at r11 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Delete
Done.
lib/local-execution/src/op_task_signature.cc
line 120 at r11 (raw file):
OpTaskSignature infer_bwd_signature(OpTaskSignature const &fwd) { OpTaskSignature bwd(fwd);
Done.
lib/local-execution/src/task_registry.cc
line 38 at r11 (raw file):
Previously, lockshaw (Colin Unger) wrote…
It seems like I'd like to know if something I expected to register a task doesn't--otherwise the
optional
doesn't have much of a point--yes this will complain if something isn't registered, but that seems like the point
The purpose of the optional is because not every operator has all tasks. So when we are doing init, for example, an operator without init can be skipped. This solution will throw an error when something you expect NOT to register a task correctly doesn't. Are you saying these maps shouldn't even contain the operators that don't have that task? (in which case why have the optional at all / why even insert weight/noop/input into the map)
lib/local-execution/src/task_registry.cc
line 61 at r11 (raw file):
Previously, lockshaw (Colin Unger) wrote…
utils/fmt/unordered_map.h
Done.
lib/local-execution/src/task_registry.cc
line 72 at r11 (raw file):
Previously, lockshaw (Colin Unger) wrote…
utils/fmt/unordered_map.h
Done.
lib/local-execution/src/task_registry.cc
line 82 at r11 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Fix (if you move
TaskRegistry
over todtgen
, this should be automatically generated for you)
Made it more descriptive, see other comment about moving it to dtgen
lib/local-execution/src/task_signature_impl.cc
line 314 at r11 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Fix (ideally through
dtgen
)
Done.
lib/local-execution/src/ops/attention.cc
line 134 at r11 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Remove
Done.
lib/local-execution/src/ops/conv_2d.cc
line 54 at r11 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why did this return type get changed? Why not just return a
DeviceSpecific<Conv2DPerDeviceState>
?
previously it was still a variant, just the DeviceSpecific. This is for consistency as mentioned above.
lib/local-execution/test/src/test_local_slots_backing.cc
line 12 at r11 (raw file):
TEST_SUITE(FF_TEST_SUITE) { TEST_CASE("Local Slots Backing -- Attention Op") {
Done.
lib/local-execution/test/src/test_local_slots_backing.cc
line 31 at r11 (raw file):
GenericTensorAccessorW key = allocator.allocate_tensor(input_tensor_shape); GenericTensorAccessorW value = allocator.allocate_tensor(input_tensor_shape);
Done.
lib/local-execution/test/src/test_local_slots_backing.cc
line 77 at r11 (raw file):
Previously, lockshaw (Colin Unger) wrote…
If we're testing
allocate_tensors
, while are we callingallocate_tensors
so many times (i.e., in a loop)? Seems like it would be easier to call it for one input and then check the output
The interface for allocate tensors allocates all outgoing tensors for a layer (because the assumption is that all incoming tensors have been allocated by the previous call). So it's actually much easier this way unless we want to change that interface to make it more granular (i.e. a function that takes a shape, allocator, and tensor guid and then makes the appropriate allocator call). We could do that, there'd just be a bunch of smaller functions when the current allocate_tensors
is already only ~30 lines
lib/local-execution/test/src/test_local_slots_backing.cc
line 81 at r11 (raw file):
} SUBCASE("Allocate and insert new tensors into slots") { SUBCASE("Tensor allocation") {
Done.
lib/local-execution/test/src/test_local_slots_backing.cc
line 90 at r11 (raw file):
get_shape_and_datatype(query_grad); CHECK(result == correct_input); }
Done.
lib/local-execution/test/src/test_local_slots_backing.cc
line 124 at r11 (raw file):
CHECK(result == correct_output); } }
The later subcase also relies on the allocation so it should be pulled out.
lib/local-execution/test/src/test_local_slots_backing.cc
line 176 at r11 (raw file):
}(); SUBCASE("Tensor Slots Backing") {
Done.
lib/local-execution/test/src/test_local_slots_backing.cc
line 195 at r11 (raw file):
{SlotGradId{slot_id_t{WEIGHTS}, IsGrad::NO}, weights}, {SlotGradId{slot_id_t{OUTPUT}, IsGrad::NO}, output}, {SlotGradId{slot_id_t{QUERY}, IsGrad::YES}, query}};
Done.
lib/local-execution/test/src/test_local_slots_backing.cc
line 197 at r11 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Maybe instead have a
get_slot_backings_without_tensor_allocation_addresses
and compare them as equal?
Done.
lib/local-execution/test/src/test_local_slots_backing.cc
line 200 at r11 (raw file):
correct, result)); } SUBCASE("Arg Slots Backing") {
Done.
lib/local-execution/test/src/test_local_slots_backing.cc
line 216 at r11 (raw file):
{slot_id_t{PROFILING}, ConcreteArgSpec::create(runtime_arg_config.profiling_settings)}, {slot_id_t{HANDLE}, ConcreteArgSpec::create(handle)}};
Done.
lib/local-execution/test/src/test_local_slots_backing.cc
line 220 at r11 (raw file):
Previously, lockshaw (Colin Unger) wrote…
This seems like a check for
resolve_runtime_arg_ref_spec
, notconstruct_arg_slots_backing
?
Done.
lib/local-execution/test/src/test_local_slots_backing.cc
line 230 at r11 (raw file):
Previously, lockshaw (Colin Unger) wrote…
What function is being tested here?
Here, I was just testing device specific unwrapping, but I don't think this test is necessary anymore.
lib/local-execution/test/src/test_local_task_arg_accessor.cc
line 9 at r11 (raw file):
TEST_SUITE(FF_TEST_SUITE) { TEST_CASE("Local Task Argument Accessor") {
Done.
lib/local-execution/test/src/test_local_task_arg_accessor.cc
line 49 at r11 (raw file):
SUBCASE("get_tensor") { SUBCASE("Read-only input tensor") {
Done.
lib/local-execution/test/src/test_local_task_arg_accessor.cc
line 54 at r11 (raw file):
GenericTensorAccessorR result = std::get<GenericTensorAccessorR>( acc.get_tensor(slot_id_t{INPUT}, Permissions::RO, IsGrad::NO)); CHECK(correct == result);
Done.
lib/local-execution/test/src/test_local_task_arg_accessor.cc
line 81 at r11 (raw file):
GenericTensorAccessorW result = std::get<GenericTensorAccessorW>( acc.get_tensor(slot_id_t{INPUT}, Permissions::RW, IsGrad::YES)); CHECK(input_grad == result);
Done.
lib/local-execution/test/src/test_local_task_arg_accessor.cc
line 93 at r11 (raw file):
std::get<std::vector<GenericTensorAccessorR>>( acc.get_variadic_tensor( slot_id_t{VARIADIC_TENSORS}, Permissions::RO, IsGrad::NO));
Done.
lib/local-execution/test/src/test_task_registry.cc
line 30 at r11 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why not just do a single
==
comparison on theTaskRegistry
, rather than having all of these subcases?
Done.
lib/local-execution/test/src/test_task_registry.cc
line 64 at r11 (raw file):
SUBCASE("multiple layers same task") { layer_guid_t layer_1 = layer_guid_t{Node{1}}; layer_guid_t layer_2 = layer_guid_t{Node{2}};
Done.
lib/local-execution/test/src/test_task_registry.cc
line 67 at r11 (raw file):
task_registry.register_tasks_for_layer(layer_guid, attrs); task_registry.register_tasks_for_layer(layer_1, attrs); task_registry.register_tasks_for_layer(layer_2, attrs);
Done.
lib/local-execution/include/local-execution/cost_estimate.h
line 21 at r11 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Remove this second overload and just ignore
MachineView
where it's not necessary. OverallCostEstimator
should include network modelling, etc. rather than just being a thin interface onto kernel runtimes
Done.
lib/local-execution/src/device_states.cc
line 11 at r11 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Delete
Done.
lib/local-execution/src/device_states.cc
line 14 at r11 (raw file):
Previously, lockshaw (Colin Unger) wrote…
What is going on here?
Removed, just made it auto.
lib/local-execution/src/runtime_arg_ref.cc
line 10 at r8 (raw file):
Previously, lockshaw (Colin Unger) wrote…
What's the argument for this being a
DeviceSpecific<PerDeviceFFHandle>
instead of just aPerDeviceFFHandle
?
I'm not too knowledgeable about how handles work/or are passed around, but I'd assume we need to know what device it's referring to? So device specific would take care of that.
lib/local-execution/src/tasks.cc
line 6 at r11 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Probably best to just make
task_id_t
adtgen
enum
, and then you getformat_as
for free. We'll need a bit of conversion code when registering things in Legion, but it should be pretty trivial
Done.
lib/local-execution/include/local-execution/device_states.variant.toml
line 2 at r11 (raw file):
namespace = "FlexFlow" name = "DeviceStates"
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 76 of 76 files at r13, all commit messages.
Reviewable status: all files reviewed, 27 unresolved discussions (waiting on @reyna-abhyankar)
lib/kernels/include/kernels/array_shape.h
line 58 at r13 (raw file):
TensorShape get_tensor_shape(ArrayShape const &, DataType); std::string format_as(std::pair<ArrayShape, DataType> const &);
utils/fmt/pair.h
?
lib/kernels/include/kernels/device.h
line 102 at r4 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
Temporarily removed it because I was getting a linker error when running
proj test
forlocal-execution-tests
:undefined reference to 'cudaGetErrorString'
It should get added back in, the key is just only to call it from non-template code in kernels
so the linker does the right thing. But we really do want the string version of those error messages. Of course if you wanted you could also pull parts of it out of macros and into constant functions to reduce the restrictions--for example if you just wrapped cudaGetErrorString
in kernels this wouldn't be an issue
lib/kernels/src/ff_handle.cc
line 9 at r13 (raw file):
oss << "<PerDeviceFFHandle"; oss << " workSpaceSize=" << x.workSpaceSize; oss << " allowTensorOpMathConversion=" << x.allowTensorOpMathConversion;
You're missing a closing >
. You can also always use fmt
syntax if you prefer, i.e., `return fmt::format("", x.workSpaceSize, x.allowTensorOpMathConversion);
lib/local-execution/include/local-execution/fwd_bwd_task_impl_function.h
line 1 at r13 (raw file):
#include "local-execution/task_argument_accessor.h"
Header guard?
lib/local-execution/include/local-execution/fwd_bwd_task_impl_function.h
line 7 at r13 (raw file):
struct FwdBwdTaskImplFunction { std::optional<float> (*fwd_bwd_task_impl_function)(TaskArgumentAccessor const &);
Suggestion:
std::optional<float> (*function_ptr)(TaskArgumentAccessor const &);
lib/local-execution/include/local-execution/fwd_bwd_task_impl_function.h
line 11 at r13 (raw file):
bool operator==(FwdBwdTaskImplFunction const &) const; bool operator!=(FwdBwdTaskImplFunction const &) const;
Ideally also add operator<
lib/local-execution/include/local-execution/init_task_impl_function.h
line 1 at r13 (raw file):
#include "local-execution/device_specific_device_states.dtg.h"
Header guard?
lib/local-execution/include/local-execution/init_task_impl_function.h
line 8 at r13 (raw file):
struct InitTaskImplFunction { DeviceSpecificDeviceStates (*init_task_impl_function)(TaskArgumentAccessor const &);
Suggestion:
DeviceSpecificDeviceStates (*function_ptr)(TaskArgumentAccessor const &);
lib/local-execution/include/local-execution/init_task_impl_function.h
line 12 at r13 (raw file):
bool operator==(InitTaskImplFunction const &) const; bool operator!=(InitTaskImplFunction const &) const;
Ideally also add operator<
lib/local-execution/include/local-execution/local_task_argument_accessor.h
line 42 at r13 (raw file):
}; using TensorSlotsBackingWithoutAddresses = std::unordered_map<
dtgen
wrapper would be preferable
lib/local-execution/include/local-execution/op_task_signature.h
line 95 at r13 (raw file):
std::unordered_set<OpTensorSlotSpec> op_tensor_slots; }; FF_VISITABLE_STRUCT_NONSTANDARD_CONSTRUCTION(
Eventually this should be migrated on from visitable
, but doesn't have to happen now
lib/local-execution/include/local-execution/per_device_state.h
line 1 at r13 (raw file):
#include "local-execution/device_specific_device_states.dtg.h"
Header guard?
lib/local-execution/include/local-execution/per_device_state.variant.toml
line 2 at r13 (raw file):
namespace = "FlexFlow" name = "PerDeviceState"
These are all for operators, so PerDeviceOpState
seems clearer--PerDeviceState
makes it seem there is one per device and it has nothing to do with operators
Code quote:
"PerDeviceState"
lib/local-execution/include/local-execution/task_impl_function.variant.toml
line 4 at r13 (raw file):
name = "TaskImplFunction" features = [ "eq",
Would be good to have hash
. ord
too if it's possible (currently there's no ord
for unordered_set
and unordered_map
, so it's not always possible)
lib/local-execution/include/local-execution/task_impl_function.variant.toml
line 13 at r13 (raw file):
"local-execution/init_task_impl_function.h", "local-execution/fwd_bwd_task_impl_function.h", ]
Suggestion:
includes = [
"local-execution/init_task_impl_function.h",
"local-execution/fwd_bwd_task_impl_function.h",
]
lib/local-execution/include/local-execution/task_registry.h
line 12 at r11 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
We modify it on the fly while constructing the local backing, so I think we should keep as is.
dtgen
types can be modified after creation, they just don't normally have default constructors. You could define a register_tasks_for_layer
function instead of the method. It's not a huge deal, just might automate some operations you currently have to maintain
lib/local-execution/include/local-execution/task_registry.h
line 26 at r11 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
Why private these?
Oh, I meant just pull them into a separate section, didn't think of the fact that they're not currently private
, so it would just be public:
instead
lib/local-execution/include/local-execution/task_signature_impl.h
line 5 at r13 (raw file):
// #include "local-execution/op_task_invocation.h" // #include "local-execution/task_impl_function.dtg.h"
Delete
Code quote:
// #include "local-execution/op_task_invocation.h"
// #include "local-execution/task_impl_function.dtg.h"
lib/local-execution/include/local-execution/task_signature_impl.struct.toml
line 4 at r13 (raw file):
name = "TaskSignatureAndImpl" features = [ "eq",
Ideally also add at least hash
(ord
if possible)
lib/local-execution/include/local-execution/task_signature_impl.struct.toml
line 10 at r13 (raw file):
includes = [ "local-execution/task_impl_function.dtg.h", "local-execution/op_task_invocation.h",
Suggestion:
"local-execution/op_task_signature.h",
lib/local-execution/src/concrete_arg.cc
line 19 at r11 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
the actual value is a template function
.get<T>
and we don't know the type
Fair enough
lib/local-execution/src/fwd_bwd_task_impl_function.cc
line 17 at r13 (raw file):
std::ostringstream oss; oss << "<FwdBwdTaskImplFunction"; oss << " fwd_bwd_task_impl_function=" << x.fwd_bwd_task_impl_function;
Suggestion:
oss << " function_ptr=" << x.fwd_bwd_task_impl_function;
lib/local-execution/src/fwd_bwd_task_impl_function.cc
line 17 at r13 (raw file):
std::ostringstream oss; oss << "<FwdBwdTaskImplFunction"; oss << " fwd_bwd_task_impl_function=" << x.fwd_bwd_task_impl_function;
How does this currently render x.fwd_bwd_task_impl_function
? As a pointer? A number?
lib/local-execution/src/init_task_impl_function.cc
line 17 at r13 (raw file):
std::ostringstream oss; oss << "<InitTaskImplFunction"; oss << " init_task_impl_function=" << x.init_task_impl_function;
Suggestion:
oss << " function_ptr=" << x.init_task_impl_function;
lib/local-execution/src/op_task_signature.cc
line 152 at r13 (raw file):
std::string format_as(OpTaskSignature const &x) { std::ostringstream oss; oss << "<OpTaskSignature";
Remember you can also use fmt
syntax if you prefer 🙂
lib/local-execution/src/task_registry.cc
line 38 at r11 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
The purpose of the optional is because not every operator has all tasks. So when we are doing init, for example, an operator without init can be skipped. This solution will throw an error when something you expect NOT to register a task correctly doesn't. Are you saying these maps shouldn't even contain the operators that don't have that task? (in which case why have the optional at all / why even insert weight/noop/input into the map)
Good point, forgot that at least not everything has an init
task, so adding optional
everywhere makes sense. In that case is there a reason to handle Input
, Weight
, etc. separately vs just having get_task_ids
return an empty list?
lib/local-execution/src/ops/input.h
line 0 at r13 (raw file):
Why an empty file?
lib/local-execution/src/ops/noop.h
line 11 at r13 (raw file):
namespace FlexFlow { std::vector<task_id_t> get_task_ids(InputAttrs const &);
Ideally put these in their own headers, unlike in master
their attr
types are distinct
lib/local-execution/test/src/test_local_slots_backing.cc
line 77 at r11 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
The interface for allocate tensors allocates all outgoing tensors for a layer (because the assumption is that all incoming tensors have been allocated by the previous call). So it's actually much easier this way unless we want to change that interface to make it more granular (i.e. a function that takes a shape, allocator, and tensor guid and then makes the appropriate allocator call). We could do that, there'd just be a bunch of smaller functions when the current
allocate_tensors
is already only ~30 lines
My point was more--why not just call it once and check that the outgoing tensor for that layer are allocated, rather than calling it a bunch of times and then checking for a bunch of operators?
lib/local-execution/src/runtime_arg_ref.cc
line 10 at r8 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
I'm not too knowledgeable about how handles work/or are passed around, but I'd assume we need to know what device it's referring to? So device specific would take care of that.
Makes sense, I convinced myself 😛
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 155 of 186 files reviewed, 27 unresolved discussions (waiting on @lockshaw and @reyna-abhyankar)
lib/kernels/include/kernels/array_shape.h
line 58 at r13 (raw file):
Previously, lockshaw (Colin Unger) wrote…
utils/fmt/pair.h
?
Added a format fn for ArrayShape
(then pair takes care of the rest)
lib/kernels/include/kernels/device.h
line 102 at r4 (raw file):
Previously, lockshaw (Colin Unger) wrote…
It should get added back in, the key is just only to call it from non-template code in
kernels
so the linker does the right thing. But we really do want the string version of those error messages. Of course if you wanted you could also pull parts of it out of macros and into constant functions to reduce the restrictions--for example if you just wrappedcudaGetErrorString
in kernels this wouldn't be an issue
Done.
lib/kernels/src/ff_handle.cc
line 9 at r13 (raw file):
Previously, lockshaw (Colin Unger) wrote…
You're missing a closing
>
. You can also always usefmt
syntax if you prefer, i.e., `return fmt::format("", x.workSpaceSize, x.allowTensorOpMathConversion);
Just added the `>.
lib/local-execution/include/local-execution/fwd_bwd_task_impl_function.h
line 1 at r13 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Header guard?
Done.
lib/local-execution/include/local-execution/fwd_bwd_task_impl_function.h
line 7 at r13 (raw file):
struct FwdBwdTaskImplFunction { std::optional<float> (*fwd_bwd_task_impl_function)(TaskArgumentAccessor const &);
Done.
lib/local-execution/include/local-execution/fwd_bwd_task_impl_function.h
line 11 at r13 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Ideally also add
operator<
Why do we need ordering? Even though we technically can order the function ptrs, there isn't a real order to the functions themselves (which is what we are actually comparing)
lib/local-execution/include/local-execution/init_task_impl_function.h
line 1 at r13 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Header guard?
Done.
lib/local-execution/include/local-execution/init_task_impl_function.h
line 8 at r13 (raw file):
struct InitTaskImplFunction { DeviceSpecificDeviceStates (*init_task_impl_function)(TaskArgumentAccessor const &);
Done.
lib/local-execution/include/local-execution/init_task_impl_function.h
line 12 at r13 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Ideally also add
operator<
See https://reviewable.io/reviews/flexflow/FlexFlow/1418#-O3txKxiEga5ouuIbrgb
lib/local-execution/include/local-execution/local_task_argument_accessor.h
line 42 at r13 (raw file):
Previously, lockshaw (Colin Unger) wrote…
dtgen
wrapper would be preferable
It's only used in this one spot, I feel like dtgen for it is kind of overkill.
lib/local-execution/include/local-execution/op_task_signature.h
line 95 at r13 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Eventually this should be migrated on from
visitable
, but doesn't have to happen now
Created an issue for this: #1465
lib/local-execution/include/local-execution/task_impl_function.variant.toml
line 4 at r13 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Would be good to have
hash
.ord
too if it's possible (currently there's noord
forunordered_set
andunordered_map
, so it's not always possible)
Added hash
, for similar discussion of ord
see https://reviewable.io/reviews/flexflow/FlexFlow/1418#-O3txKxiEga5ouuIbrgb
lib/local-execution/include/local-execution/task_impl_function.variant.toml
line 13 at r13 (raw file):
"local-execution/init_task_impl_function.h", "local-execution/fwd_bwd_task_impl_function.h", ]
Done.
lib/local-execution/include/local-execution/task_registry.h
line 12 at r11 (raw file):
Previously, lockshaw (Colin Unger) wrote…
dtgen
types can be modified after creation, they just don't normally have default constructors. You could define aregister_tasks_for_layer
function instead of the method. It's not a huge deal, just might automate some operations you currently have to maintain
Don't the xxx_task_ids
maps have to be const? register_tasks_for_layer
has to insert into them
lib/local-execution/include/local-execution/task_registry.h
line 26 at r11 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Oh, I meant just pull them into a separate section, didn't think of the fact that they're not currently
private
, so it would just bepublic:
instead
Done.
lib/local-execution/include/local-execution/task_signature_impl.h
line 5 at r13 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Delete
Done.
lib/local-execution/include/local-execution/task_signature_impl.struct.toml
line 4 at r13 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Ideally also add at least
hash
(ord
if possible)
Currently, OpTaskSignature
is not hashable, I think we can solve that in #1465
For ord
, see https://reviewable.io/reviews/flexflow/FlexFlow/1418#-O3txKxiEga5ouuIbrgb
lib/local-execution/include/local-execution/task_signature_impl.struct.toml
line 10 at r13 (raw file):
includes = [ "local-execution/task_impl_function.dtg.h", "local-execution/op_task_invocation.h",
Done.
lib/local-execution/src/fwd_bwd_task_impl_function.cc
line 17 at r13 (raw file):
std::ostringstream oss; oss << "<FwdBwdTaskImplFunction"; oss << " fwd_bwd_task_impl_function=" << x.fwd_bwd_task_impl_function;
Done.
lib/local-execution/src/fwd_bwd_task_impl_function.cc
line 17 at r13 (raw file):
Previously, lockshaw (Colin Unger) wrote…
How does this currently render
x.fwd_bwd_task_impl_function
? As a pointer? A number?
Currently, it is always the number 1
. Should it be (void*)x.function_ptr
? That actually prints an address.
lib/local-execution/src/init_task_impl_function.cc
line 17 at r13 (raw file):
std::ostringstream oss; oss << "<InitTaskImplFunction"; oss << " init_task_impl_function=" << x.init_task_impl_function;
Done.
lib/local-execution/src/op_task_signature.cc
line 152 at r13 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Remember you can also use
fmt
syntax if you prefer 🙂
I think I just saw this format in one of the files lol. I like the readability, might just stick with it. It's not really that much of a pain.
lib/local-execution/src/task_registry.cc
line 38 at r11 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Good point, forgot that at least not everything has an
init
task, so addingoptional
everywhere makes sense. In that case is there a reason to handleInput
,Weight
, etc. separately vs just havingget_task_ids
return an empty list?
Yep, there's no more separate handling anymore. If you look at local_training_backing.cc
and task_signature_impl.cc
you'll see a function for input, noop, and weights.
lib/local-execution/src/ops/input.h
line at r13 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why an empty file?
Had initially decided to split the get_task_ids
functions for input and weight separately. Then just clubbed it with noop. Now it's back to separate again (so instead of empty this is a filled file).
lib/local-execution/src/ops/noop.h
line 11 at r13 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Ideally put these in their own headers, unlike in
master
theirattr
types are distinct
Done. Though it does feel a little weird to have weight.h
in op, but I guess it's the same as input
being an operator too and they're all graph nodes. So it's fine
lib/local-execution/test/src/test_local_slots_backing.cc
line 77 at r11 (raw file):
Previously, lockshaw (Colin Unger) wrote…
My point was more--why not just call it once and check that the outgoing tensor for that layer are allocated, rather than calling it a bunch of times and then checking for a bunch of operators?
Ah, actually here we are only checking for a single operator. It's just all incoming tensors for that node have to have been allocated first since we are checking both incoming and outgoing tensors exist for this operator.
lib/local-execution/include/local-execution/per_device_state.h
line 1 at r13 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Header guard?
Done.
lib/local-execution/include/local-execution/per_device_state.variant.toml
line 2 at r13 (raw file):
Previously, lockshaw (Colin Unger) wrote…
These are all for operators, so
PerDeviceOpState
seems clearer--PerDeviceState
makes it seem there is one per device and it has nothing to do with operators
Yup that makes sense. Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 30 of 30 files at r14, all commit messages.
Reviewable status: all files reviewed, 13 unresolved discussions (waiting on @lockshaw and @reyna-abhyankar)
lib/local-execution/include/local-execution/fwd_bwd_task_impl_function.h
line 11 at r13 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
Why do we need ordering? Even though we technically can order the function ptrs, there isn't a real order to the functions themselves (which is what we are actually comparing)
Usually to allow the thing to also be stored in a std::map
or a std::set
which require operator<
lib/local-execution/include/local-execution/local_task_argument_accessor.h
line 42 at r13 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
It's only used in this one spot, I feel like dtgen for it is kind of overkill.
The reason would be more to wrap the type so the additional utils/fmt/variant.h
, utils/fmt/vector.h
, etc. includes aren't necessary, and just to avoid the potential confusion/weirdness of a type alias. I'd lean toward dtgen
here as the overhead is pretty small and it avoids these issues, but I don't think it's worth holding up the PR over
lib/local-execution/include/local-execution/task_registry.h
line 12 at r11 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
Don't the
xxx_task_ids
maps have to be const?register_tasks_for_layer
has to insert into them
You'd take it as TaskRegistry &
rather than TaskRegistry const &
and then you'd be allowed to modify the std::unordered_map
s
lib/local-execution/src/fwd_bwd_task_impl_function.cc
line 18 at r14 (raw file):
std::ostringstream oss; oss << "<FwdBwdTaskImplFunction"; oss << " function_ptr=" << (void *)x.function_ptr;
Suggestion:
static_cast<void *>(x.function_ptr);
lib/local-execution/src/init_task_impl_function.cc
line 16 at r14 (raw file):
std::ostringstream oss; oss << "<InitTaskImplFunction"; oss << " function_ptr=" << (void *)x.function_ptr;
Suggestion:
oss << " function_ptr=" << static_cast<void *>(x.function_ptr);
lib/local-execution/src/ops/input.cc
line 2 at r14 (raw file):
#include "input.h" #include "local-execution/op_task_invocation.h"
Why is op_task_invocation.h
needed?
lib/local-execution/src/ops/input.cc
line 3 at r14 (raw file):
#include "input.h" #include "local-execution/op_task_invocation.h" #include "utils/hash-utils.h"
Why is hash-utils.h
needed?
lib/local-execution/src/ops/noop.cc
line 17 at r14 (raw file):
#include "noop.h" #include "utils/hash-utils.h"
Why is hash-utils.h
necessary?
lib/local-execution/src/ops/weight.cc
line 2 at r14 (raw file):
#include "weight.h" #include "local-execution/op_task_invocation.h"
Why is op_task_invocation.h
necessary?
lib/local-execution/src/ops/weight.cc
line 3 at r14 (raw file):
#include "weight.h" #include "local-execution/op_task_invocation.h" #include "utils/hash-utils.h"
Why is hash-utils.h
necessary?
lib/local-execution/test/src/test_local_slots_backing.cc
line 77 at r11 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
Ah, actually here we are only checking for a single operator. It's just all incoming tensors for that node have to have been allocated first since we are checking both incoming and outgoing tensors exist for this operator.
Sorry, I meant "bunch of tensors" instead of "bunch of operators"--this just seems to be checking much more than what allocate_tensors
does? I'm not sure why we need to check that all incoming and outgoing tensors exist to confirm that allocate_tensors
works?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 13 unresolved discussions (waiting on @lockshaw)
lib/local-execution/include/local-execution/fwd_bwd_task_impl_function.h
line 11 at r13 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Usually to allow the thing to also be stored in a
std::map
or astd::set
which requireoperator<
Ah ok. Done
lib/local-execution/include/local-execution/local_task_argument_accessor.h
line 42 at r13 (raw file):
Previously, lockshaw (Colin Unger) wrote…
The reason would be more to wrap the type so the additional
utils/fmt/variant.h
,utils/fmt/vector.h
, etc. includes aren't necessary, and just to avoid the potential confusion/weirdness of a type alias. I'd lean towarddtgen
here as the overhead is pretty small and it avoids these issues, but I don't think it's worth holding up the PR over
I'll make an issue: #1468
lib/local-execution/include/local-execution/task_registry.h
line 12 at r11 (raw file):
Previously, lockshaw (Colin Unger) wrote…
You'd take it as
TaskRegistry &
rather thanTaskRegistry const &
and then you'd be allowed to modify thestd::unordered_map
s
but when constructing TaskRegistry
with dtgen, it requires const maps. Maybe I'm misunderstanding this part.
lib/local-execution/src/fwd_bwd_task_impl_function.cc
line 18 at r14 (raw file):
std::ostringstream oss; oss << "<FwdBwdTaskImplFunction"; oss << " function_ptr=" << (void *)x.function_ptr;
Done. Actually needed to be reinterpret_cast
lib/local-execution/src/init_task_impl_function.cc
line 16 at r14 (raw file):
std::ostringstream oss; oss << "<InitTaskImplFunction"; oss << " function_ptr=" << (void *)x.function_ptr;
Done. Actually needed to be reinterpret_cast
lib/local-execution/src/ops/input.cc
line 2 at r14 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why is
op_task_invocation.h
needed?
Done.
lib/local-execution/src/ops/input.cc
line 3 at r14 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why is
hash-utils.h
needed?
Done.
lib/local-execution/src/ops/noop.cc
line 17 at r14 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why is
hash-utils.h
necessary?
Done.
lib/local-execution/src/ops/weight.cc
line 2 at r14 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why is
op_task_invocation.h
necessary?
Done.
lib/local-execution/src/ops/weight.cc
line 3 at r14 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why is
hash-utils.h
necessary?
Done.
lib/local-execution/test/src/test_local_slots_backing.cc
line 77 at r11 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Sorry, I meant "bunch of tensors" instead of "bunch of operators"--this just seems to be checking much more than what
allocate_tensors
does? I'm not sure why we need to check that all incoming and outgoing tensors exist to confirm thatallocate_tensors
works?
allocate_tensors
will allocate all outgoing tensors for a given operator. You're right that we don't have to check all incoming tensors, but seeing as those should be allocated using allocate_tensors
(or allocator.allocate_tensor
if it's an input tensor) then I think it's ok? Again, the alternative would be to make allocate_tensors
more granular OR we only check the output tensors are allocated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 6 of 6 files at r15, all commit messages.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @reyna-abhyankar)
lib/local-execution/include/local-execution/task_impl_function.variant.toml
line 4 at r13 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
Added
hash
, for similar discussion oford
see https://reviewable.io/reviews/flexflow/FlexFlow/1418#-O3txKxiEga5ouuIbrgb
Can we also add ord
here since FwdBwdTaskImplFunction
now supports ord
?
lib/local-execution/include/local-execution/task_registry.h
line 12 at r11 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
but when constructing
TaskRegistry
with dtgen, it requires const maps. Maybe I'm misunderstanding this part.
I don't see why it requires const
maps? It's totally fine to do TaskRegistry r = TaskRegistry{{}, {}, {}, {}}; r.init_task_ids.insert({...});
. dtgen
doesn't make the fields const (of course unless you declare a TaskRegistry const r
, but then it's const
because you explicitly told it to be)
lib/local-execution/src/fwd_bwd_task_impl_function.cc
line 18 at r14 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
Done. Actually needed to be
reinterpret_cast
Yeah, that's because apparently this operation (and printing a function pointer in general), is really not a well-defined operation C++-wise. It's fine here because this is really just for debugging, but in general if something is requiring reinterpret_cast
remember that's essentially just reinterpreting the bytes in memory and so you lose (at least a lot if not all) portability and correctness guarantees
lib/local-execution/src/init_task_impl_function.cc
line 16 at r14 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
Done. Actually needed to be
reinterpret_cast
See https://reviewable.io/reviews/flexflow/FlexFlow/1418#-O4r72igEzyz6-5OdpSc
lib/local-execution/test/src/test_local_slots_backing.cc
line 77 at r11 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
allocate_tensors
will allocate all outgoing tensors for a given operator. You're right that we don't have to check all incoming tensors, but seeing as those should be allocated usingallocate_tensors
(orallocator.allocate_tensor
if it's an input tensor) then I think it's ok? Again, the alternative would be to makeallocate_tensors
more granular OR we only check the output tensors are allocated.
The second option is what I would recommend--I have no issue with some other test checking that all of the input tensors are allocated (this actually probably is something that should be tested), but I think that test is outside the scope of the semantics of allocate_tensors
, which is only required to allocate the output tensors (also, in that case it should probably be renamed to allocate_output_tensors
?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @lockshaw)
lib/local-execution/include/local-execution/task_impl_function.variant.toml
line 4 at r13 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Can we also add
ord
here sinceFwdBwdTaskImplFunction
now supportsord
?
Done
lib/local-execution/include/local-execution/task_registry.h
line 12 at r11 (raw file):
Previously, lockshaw (Colin Unger) wrote…
I don't see why it requires
const
maps? It's totally fine to doTaskRegistry r = TaskRegistry{{}, {}, {}, {}}; r.init_task_ids.insert({...});
.dtgen
doesn't make the fields const (of course unless you declare aTaskRegistry const r
, but then it'sconst
because you explicitly told it to be)
That makes complete sense. Changed.
lib/local-execution/test/src/test_local_slots_backing.cc
line 77 at r11 (raw file):
Previously, lockshaw (Colin Unger) wrote…
The second option is what I would recommend--I have no issue with some other test checking that all of the input tensors are allocated (this actually probably is something that should be tested), but I think that test is outside the scope of the semantics of
allocate_tensors
, which is only required to allocate the output tensors (also, in that case it should probably be renamed toallocate_output_tensors
?)
Ok, I've fixed this and moved some of the allocation around in the test to reflect only allocating inputs and then separately only allocating outputs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 14 of 14 files at r16, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @reyna-abhyankar)
lib/local-execution/include/local-execution/task_registry.struct.toml
line 9 at r16 (raw file):
includes = [ "local-execution/task_signature_impl.h",
This file doesn't seem to have any datatype declarations? Should this be a different file? Did you mean task_signature_impl.dtg.h
?
lib/local-execution/include/local-execution/task_registry.struct.toml
line 12 at r16 (raw file):
"local-execution/task_id_t.dtg.h", "pcg/layer_guid_t.dtg.h", ]
Suggestion:
features = [
"eq",
"hash",
"fmt",
]
includes = [
"local-execution/task_signature_impl.h",
"local-execution/task_id_t.dtg.h",
"pcg/layer_guid_t.dtg.h",
]
src_includes = [
"utils/hash/unordered_map.h",
"utils/fmt/unordered_map.h",
]
lib/local-execution/test/src/test_task_registry.cc
line 99 at r16 (raw file):
SUBCASE("equality") { TaskRegistry other_task_registry = TaskRegistry{{}, {}, {}, {}};
If you want you can also create a helper function like TaskRegistry empty_task_registry()
in task_registry.h
that just returns a TaskRegistry{{}, {}, {}, {}}
to avoid duplicating that code everywhere, but not a big deal either way
Code quote:
TaskRegistry other_task_registry = TaskRegistry{{}, {}, {}, {}}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @lockshaw)
lib/local-execution/include/local-execution/task_registry.struct.toml
line 9 at r16 (raw file):
Previously, lockshaw (Colin Unger) wrote…
This file doesn't seem to have any datatype declarations? Should this be a different file? Did you mean
task_signature_impl.dtg.h
?
Yes, changed.
lib/local-execution/include/local-execution/task_registry.struct.toml
line 12 at r16 (raw file):
"local-execution/task_id_t.dtg.h", "pcg/layer_guid_t.dtg.h", ]
Done.
lib/local-execution/test/src/test_task_registry.cc
line 99 at r16 (raw file):
Previously, lockshaw (Colin Unger) wrote…
If you want you can also create a helper function like
TaskRegistry empty_task_registry()
intask_registry.h
that just returns aTaskRegistry{{}, {}, {}, {}}
to avoid duplicating that code everywhere, but not a big deal either way
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 6 of 6 files at r17, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @reyna-abhyankar)
lib/local-execution/include/local-execution/task_registry.h
line 6 at r17 (raw file):
#include "local-execution/task_registry.dtg.h" #include "local-execution/task_signature_impl.h"
Why was this added?
lib/local-execution/include/local-execution/task_signature_impl.struct.toml
line 15 at r17 (raw file):
src_includes = [ "utils/hash/unordered_map.h",
Why are these necessary? I don't see any unordered_map
s or unordered_set
s in the members?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @lockshaw)
lib/local-execution/include/local-execution/task_registry.h
line 6 at r17 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why was this added?
oops, added to the wrong file. changed (it's in local_training_backing.cc
and task_registry.cc
now)
lib/local-execution/include/local-execution/task_signature_impl.struct.toml
line 15 at r17 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why are these necessary? I don't see any
unordered_map
s orunordered_set
s in the members?
It's needed to be able to hash OpTaskSignature
. Removed the fmt
includes though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 4 of 4 files at r18, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @reyna-abhyankar)
lib/local-execution/include/local-execution/task_signature_impl.struct.toml
line 15 at r17 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
It's needed to be able to hash
OpTaskSignature
. Removed thefmt
includes though.
Ah, because that's still using visitable
. Let's put the includes in op_task_signature.h
then rather than here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 2 files at r19, all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @reyna-abhyankar)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @reyna-abhyankar)
lib/local-execution/include/local-execution/task_signature_impl.struct.toml
line 15 at r17 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Ah, because that's still using
visitable
. Let's put the includes inop_task_signature.h
then rather than here
Done.
Description of changes:
Tests for local backing workflow
Related Issues:
Linked Issues:
Issues closed by this PR:
This change is