Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Local execution e2e training #1472

Open
wants to merge 24 commits into
base: master
Choose a base branch
from

Conversation

reyna-abhyankar
Copy link
Collaborator

@reyna-abhyankar reyna-abhyankar commented Aug 25, 2024

Description of changes:

Implement end-to-end training loop in local execution.

Related Issues:

Linked Issues:

  • Issue #

Issues closed by this PR:


This change is Reviewable

@reyna-abhyankar reyna-abhyankar self-assigned this Aug 25, 2024
@reyna-abhyankar reyna-abhyankar changed the title temporary weight adjust index Local execution e2e training Aug 25, 2024
Copy link

codecov bot commented Aug 25, 2024

Codecov Report

Attention: Patch coverage is 8.09595% with 613 lines in your changes missing coverage. Please review.

Project coverage is 71.58%. Comparing base (cf96db6) to head (1d3cc94).
Report is 4 commits behind head on repo-refactor.

Files with missing lines Patch % Lines
lib/local-execution/src/optimizer.cc 0.00% 108 Missing ⚠️
lib/local-execution/test/src/test_loss_e2e.cc 1.19% 83 Missing ⚠️
lib/local-execution/test/src/test_update_e2e.cc 1.19% 83 Missing ⚠️
lib/local-execution/src/local_training_backing.cc 0.00% 57 Missing ⚠️
lib/local-execution/src/local_slots_backing.cc 46.57% 39 Missing ⚠️
lib/local-execution/src/loss_functions.cc 0.00% 30 Missing ⚠️
.../local-execution/src/generic_task_impl_function.cc 0.00% 23 Missing ⚠️
...cal-execution/src/fwd_bwd_op_task_impl_function.cc 8.69% 21 Missing ⚠️
.../local-execution/src/init_op_task_impl_function.cc 8.69% 21 Missing ⚠️
lib/local-execution/src/task_binding.cc 0.00% 21 Missing ⚠️
... and 33 more
Additional details and impacted files
@@                Coverage Diff                @@
##           repo-refactor    #1472      +/-   ##
=================================================
- Coverage          73.22%   71.58%   -1.64%     
=================================================
  Files                731      741      +10     
  Lines              23398    23950     +552     
  Branches             690      699       +9     
=================================================
+ Hits               17133    17145      +12     
- Misses              6265     6805     +540     
Flag Coverage Δ
unittests 71.58% <8.09%> (-1.64%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
lib/local-execution/src/ops/attention.cc 24.64% <100.00%> (ø)
...al-execution/test/src/test_local_cost_estimator.cc 2.63% <ø> (ø)
...cal-execution/test/src/test_local_slots_backing.cc 98.40% <100.00%> (+0.04%) ⬆️
lib/local-execution/test/src/test_task_registry.cc 100.00% <ø> (ø)
lib/local-execution/src/ops/batch_matmul.cc 0.00% <0.00%> (ø)
lib/local-execution/src/ops/cast.cc 0.00% <0.00%> (ø)
lib/local-execution/src/ops/combine.cc 0.00% <0.00%> (ø)
lib/local-execution/src/ops/concat.cc 0.00% <0.00%> (ø)
lib/local-execution/src/ops/flat.cc 0.00% <0.00%> (ø)
lib/local-execution/src/ops/reduction.cc 0.00% <0.00%> (ø)
... and 37 more

... and 4 files with indirect coverage changes

@reyna-abhyankar reyna-abhyankar marked this pull request as ready for review August 27, 2024 21:03
@lockshaw lockshaw self-requested a review August 27, 2024 21:37
Copy link
Collaborator

@lockshaw lockshaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 2 files at r1, 49 of 50 files at r2, 1 of 1 files at r3, all commit messages.
Reviewable status: all files reviewed, 47 unresolved discussions (waiting on @reyna-abhyankar)


lib/kernels/CMakeLists.txt line 10 at r2 (raw file):

     LIST_DIRECTORIES False
     src/*.cc
     src/cuda/cuda_helper.cu

Why is this not just src/cuda/*.cu?


lib/kernels/include/kernels/array_shape.h line 45 at r2 (raw file):

  std::optional<std::size_t> at_maybe(ff_dim_t) const;

  ArrayShape sub_shape(legion_dim_t start, ff_dim_t end) const;

legion_dim_t for start and ff_dim_t for end? Is this signature intended? It's not illogical, just a bit weird and wanted to check it's not a typo 🙂


lib/kernels/src/array_shape.cc line 59 at r3 (raw file):

ArrayShape ArrayShape::sub_shape(std::optional<ff_dim_t> start,
                                 std::optional<ff_dim_t> end) const {
  std::vector<size_t> new_shape;

Can probably be replaced by slice(DimOrdered, ...): https://github.com/flexflow/FlexFlow/blob/repo-refactor/lib/op-attrs/include/op-attrs/dim_ordered/slice.h#L9


lib/kernels/src/device.h line 74 at r3 (raw file):

__global__ void scale_kernel(float *ptr, size_t size, float a, float b);
__global__ void scale_kernel(float *ptr, unsigned long size, float a, float b);

Why?


lib/local-execution/include/local-execution/local_training_backing.h line 18 at r3 (raw file):

                       TensorBackingMap const &,
                       RuntimeArgConfig const &,
                       std::optional<ModelTrainingInstance> &);

Why is this passed as mutable?


lib/local-execution/include/local-execution/model_training_instance.struct.toml line 2 at r3 (raw file):

namespace = "FlexFlow"
name = "ModelTrainingInstance"

Why isn't the model part of ModelTrainingInstance? The members seem to have changed non-trivially since the old version:

struct ModelTrainingInstance {
  ComputationGraph computation_graph;
  Optimizer optimizer;
  EnableProfiling enable_profiling;
  TrainingPCG training_pcg;
  TensorMapping tensor_map;
  LegionBacking legion_backing;
};

lib/local-execution/include/local-execution/model_training_instance.struct.toml line 15 at r3 (raw file):

  "pcg/tensor_guid_t.dtg.h",
  "pcg/optimizer_attrs.dtg.h",
]

Suggestion:

includes = [
  "op-attrs/ops/loss_attrs.dtg.h",
  "pcg/tensor_guid_t.dtg.h",
  "pcg/optimizer_attrs.dtg.h",
]

lib/local-execution/include/local-execution/optimizer.h line 16 at r3 (raw file):

TaskInvocation get_update_invocation(OptimizerAttrs const &,
                                     tensor_guid_t const &weight,
                                     std::vector<tensor_guid_t> const &);

Some parameter names would be helpful here

Code quote:

                                     std::vector<tensor_guid_t> const &);

lib/local-execution/include/local-execution/optimizer.h line 22 at r3 (raw file):

TaskInvocation sgd_update(SGDOptimizerAttrs const &,
                          tensor_guid_t const &weight,
                          tensor_guid_t const &);

Some parameter names would be helpful here


lib/local-execution/include/local-execution/optimizer.h line 29 at r3 (raw file):

                           tensor_guid_t const &weight,
                           tensor_guid_t const &,
                           tensor_guid_t const &);

Some parameter names would be helpful here

Code quote:

                           tensor_guid_t const &,
                           tensor_guid_t const &);

lib/local-execution/include/local-execution/task_impl_function.variant.toml line 22 at r3 (raw file):

[[values]]
type = "::FlexFlow::FwdBwdTaskImplFunction"
key = "fwd_bwd_task_impl_function"

Would these be better renamed InitOpTaskImplFunction and FwdBwdOpTaskImplFunction, or do they make sense outside of a op context?

Code quote:

[[values]]
type = "::FlexFlow::InitTaskImplFunction"
key = "init_task_impl_function"

[[values]]
type = "::FlexFlow::FwdBwdTaskImplFunction"
key = "fwd_bwd_task_impl_function"

lib/local-execution/include/local-execution/task_invocation.h line 65 at r3 (raw file):

  task_id_t task_id;
  TaskBinding binding;
};

Can be moved over to dtgen

Code quote:

struct TaskInvocation {
public:
  TaskInvocation() = delete;
  TaskInvocation(task_id_t task_id, TaskBinding const &binding)
      : task_id(task_id), binding(binding) {}

public:
  task_id_t task_id;
  TaskBinding binding;
};

lib/local-execution/include/local-execution/task_signature.h line 4 at r3 (raw file):

#define _FLEXFLOW_LOCAL_EXECUTION_TASK_SIGNATURE_H

// #include "local-execution/tensor_guid_slot_spec.dtg.h"

Delete?


lib/local-execution/include/local-execution/task_signature.h line 41 at r3 (raw file):

}

// adds arg_slot without checking is_serializable, used for arguments that are

Change to proper doxygen docstring?


lib/local-execution/include/local-execution/task_signature.h line 49 at r3 (raw file):

// adds arg_slot without checking is_serializable, used for arguments that are
// deviceSpecific

Change to proper doxygen docstring?


lib/local-execution/include/local-execution/task_signature.struct.toml line 5 at r3 (raw file):

features = [
  "eq",
  "fmt",

Why not "hash"?


lib/local-execution/include/local-execution/task_signature.struct.toml line 10 at r3 (raw file):

includes = [
  "local-execution/tensor_guid_slot_spec.dtg.h",
  "utils/type_index.h",

Suggestion:

  "<typeindex>",

lib/local-execution/include/local-execution/task_signature.struct.toml line 12 at r3 (raw file):

  "utils/type_index.h",
  "utils/optional.h"
]

Suggestion:

includes = [
  "local-execution/tensor_guid_slot_spec.dtg.h",
  "utils/type_index.h",
  "<optional>",
]

lib/local-execution/include/local-execution/task_signature.struct.toml line 29 at r3 (raw file):

[[fields]]
name = "tensor_guid_slots"
type = "std::unordered_set<::FlexFlow::TensorGuidSlotSpec>"

Why not a map of slot_id_t -> something?

Code quote:

type = "std::unordered_set<::FlexFlow::TensorGuidSlotSpec>"

lib/local-execution/include/local-execution/tensor_guid_spec.struct.toml line 2 at r3 (raw file):

namespace = "FlexFlow"
name = "TensorGuidSpec"

This name seems a bit ambiguous? Maybe something clearer?


lib/local-execution/src/local_cost_estimator.cc line 80 at r3 (raw file):

                                     tensor_backing_map,
                                     this->runtime_arg_config,
                                     model_training_instance);

Would it make more sense to disaggregate LocalTrainingBacking rather than having an optional field for the training instance?

Code quote:

  std::optional<ModelTrainingInstance> model_training_instance = std::nullopt;
  LocalTrainingBacking local_backing(allocator,
                                     cg_builder.computation_graph,
                                     tensor_backing_map,
                                     this->runtime_arg_config,
                                     model_training_instance);

lib/local-execution/src/local_slots_backing.cc line 61 at r3 (raw file):

      sig.tensor_guid_slots.size() - 2; // ignore 2 (weight and weight_grad)
  std::vector<tensor_guid_t> buffer_tensors =
      get_new_tensor_guids_for_layer_without_graph_insertion(

Not a fan of this, come up with some new tensor id type


lib/local-execution/src/local_slots_backing.cc line 63 at r3 (raw file):

      get_new_tensor_guids_for_layer_without_graph_insertion(
          cg, weight_layer, num_buffer_tensors);
  for (auto const &tensor_guid : buffer_tensors) {

Suggestion:

  for (tensor_guid_t const &tensor_guid : buffer_tensors)

lib/local-execution/src/local_slots_backing.cc line 100 at r3 (raw file):

    OpTaskBinding const &binding, layer_guid_t const &op_guid) const {
  TensorSlotsBacking mapping;
  int num_inputs = 0;

Either use filter or add a count_if function to containers or something like it


lib/local-execution/src/local_slots_backing.cc line 126 at r3 (raw file):

      default:
        throw mk_runtime_error(
            fmt::format("Invalid TensorRole")); // inserting role yields

It shouldn't, can you try again?


lib/local-execution/src/local_slots_backing.cc line 132 at r3 (raw file):

    IsGrad is_grad = slot_grad_id.is_grad;
    GenericTensorAccessorW tensor_backing = this->get_tensor_backing(
        tensor_guids.at(weight_adjusted_idx + tensor_spec.idx), is_grad);

What's going on here with weight_adjusted_idx?


lib/local-execution/src/local_slots_backing.cc line 189 at r3 (raw file):

                        },
                        [](ConcreteArgSpec const &s) { return s; },
                    })});

Isn't this just map_values?

Code quote:

  for (auto const &arg_binding : binding.get_arg_bindings()) {
    slot_id_t arg_slot = arg_binding.first;
    TaskArgSpec task_arg_spec = arg_binding.second;

    mapping.insert({arg_slot,
                    task_arg_spec.visit<ConcreteArgSpec>(overload{
                        [&](RuntimeArgRefSpec const &s) {
                          return this->resolve_runtime_arg_ref_spec(s);
                        },
                        [](ConcreteArgSpec const &s) { return s; },
                    })});

lib/local-execution/src/local_training_backing.cc line 26 at r3 (raw file):

  for (layer_guid_t const &node :
       topological_ordering(this->computation_graph)) {

Much appreciated 🙂


lib/local-execution/src/local_training_backing.cc line 165 at r3 (raw file):

      std::vector<tensor_guid_t> buffer_tensors =
          this->local_slots_backing.weight_optimizer_tensor_guids.at(
              weight_tensor);

Why not key this by the layer_guid_t of the WeightAttrs?


lib/local-execution/src/local_training_backing.cc line 179 at r3 (raw file):

  }

  this->training_instance = next(this->training_instance.value());

I'm not convinced that the next() abstraction is the cleanest way to handle this? Idk, the ModelTrainingInstance changing during training is a bit conceptually weird. I don't have a super clear alternative idea, but I suppose it would be something along the lines of moving the mutable aspects (I think in this case some optimizer state?) into some other data structure so that ModelTrainingInstance stays the same


lib/local-execution/src/loss_functions.cc line 29 at r2 (raw file):

  add_slot(sig, LOGIT, IsGrad::NO);
  add_slot(sig, LABEL, IsGrad::NO);
  add_slot(sig, LOGIT, IsGrad::YES);

Why is LOGIT added twice? Does this mean pass both the gradient and the non-gradient into the function?


lib/local-execution/src/loss_functions.cc line 40 at r2 (raw file):

  b.bind(LOGIT, TensorGuidSpec{logit, IsGrad::NO});
  b.bind(LABEL, TensorGuidSpec{label, IsGrad::NO});
  b.bind(LOGIT, TensorGuidSpec{logit, IsGrad::YES});

Same here, why is LOGIT bound twice?


lib/local-execution/src/loss_functions.cc line 64 at r2 (raw file):

  if (loss_type == LossFunction::SPARSE_CATEGORICAL_CROSSENTROPY) {
    // label shape is [parallel dim(?), batch dim, 1]

Label shape is an ArrayShape, so it shouldn't have any parallel dims, right?

Code quote:

    // label shape is [parallel dim(?), batch dim, 1]

lib/local-execution/src/model_training_instance.cc line 9 at r2 (raw file):

    AdamOptimizerAttrs old =
        old_training_instance.optimizer_attrs.get<AdamOptimizerAttrs>();
    double new_beta1_t = old.beta_t * old.beta1;

Should probably get pulled out of the function at least. Also, does it make sense to actually modifying the ModelTrainingInstance or would we want a separate optimizer state object/field to hold these mutable values?


lib/local-execution/src/model_training_instance.cc line 26 at r2 (raw file):

                                 new_attrs};
  }
  return old_training_instance;

Might be shorted to just copy the old_training_instance?

Suggestion:

  if (old_training_instance.optimizer_attrs.has<AdamOptimizerAttrs>()) {
    old_training_instance.optimizer.attrs.get<AdamOptimizerAttrs>() = AdamOptimizerAttrs{
      ...
    };
  } else {
    return old_training_instance;
}

lib/local-execution/src/optimizer.cc line 39 at r2 (raw file):

    return {task_id_t::SGD_UPD_NCCL_TASK_ID, b};
  }
  return {task_id_t::SGD_UPD_PS_TASK_ID, b};

Bit more readable

Suggestion:

  if (CHOSEN_SYNC_TYPE == ParamSync::NCCL) {
    b.bind_arg(HANDLE, ff_handle());
    return {task_id_t::SGD_UPD_NCCL_TASK_ID, b};
  } else {
    return {task_id_t::SGD_UPD_PS_TASK_ID, b};
  }

lib/local-execution/src/optimizer.cc line 127 at r2 (raw file):

    return {task_id_t::ADAM_UPD_NCCL_TASK_ID, b};
  }
  return {task_id_t::ADAM_UPD_PS_TASK_ID, b};

Suggestion:

  if (CHOSEN_SYNC_TYPE == ParamSync::NCCL) {
    b.bind_arg(HANDLE, ff_handle());
    return {task_id_t::ADAM_UPD_NCCL_TASK_ID, b};
  } else {
    return {task_id_t::ADAM_UPD_PS_TASK_ID, b};
  }

lib/local-execution/src/optimizer.cc line 189 at r2 (raw file):

      [&](AdamOptimizerAttrs const &s) {
        return get_adam_update_signature();
      }});

Suggestion:

TaskSignature get_update_signature(OptimizerAttrs const &attrs) {
  return attrs.visit<TaskSignature>(overload{
      [&](SGDOptimizerAttrs const &) { return get_sgd_update_signature(); },
      [&](AdamOptimizerAttrs const &) {
        return get_adam_update_signature();
      }});

lib/local-execution/src/optimizer.cc line 195 at r2 (raw file):

    get_update_invocation(OptimizerAttrs const &attrs,
                          tensor_guid_t const &weight,
                          std::vector<tensor_guid_t> const &buffer_tensors) {

Is buffer usually the word for these?

Code quote:

 &buffer_tensors)

lib/local-execution/src/optimizer.cc line 211 at r2 (raw file):

      [&](AdamOptimizerAttrs const &s) {
        return get_adam_update_task_impl();
      }});

Suggestion:

TaskImplFunction get_update_task_impl(OptimizerAttrs const &attrs) {
  return attrs.visit<TaskImplFunction>(overload{
      [&](SGDOptimizerAttrs const &) { return get_sgd_update_task_impl(); },
      [&](AdamOptimizerAttrs const &) {
        return get_adam_update_task_impl();
      }});

lib/local-execution/src/task_invocation.cc line 46 at r2 (raw file):

bool is_invocation_valid(TaskSignature const &sig, TaskInvocation const &inv) {
  // TODO: implement signature checking
  return true;

Suggestion:

  NOT_IMPLEMENTED();

lib/local-execution/test/src/test_loss_e2e.cc line 45 at r3 (raw file):

    tensor_backing_map.insert({input_tensor, input_backing});

    SUBCASE("SparseCategoricalCrossEntropyLossAttrs") {

The scoping of these tests seems a bit off--I get having one test to check that local execution works with a loss function, but it seems like the code should be decomposed such that the loss functions can be tested by themselves and then all this is need is the one integration test to make sure loss functions in general work with local execution.

It's fine if these are just temporary placeholders, but in the final draft it would be nice to have a slightly cleaner approach to testing than "spin up the whole local execution training backend and make it run full passes". If this is in fact the best way to test this then I'm happy to be convinced as some tests do just need a whole bunch of surrounding state to make sense, but more often than not that surrounding state can be minimized by having cleaner abstraction boundaries


lib/local-execution/test/src/test_update_e2e.cc line 0 at r3 (raw file):
https://reviewable.io/reviews/flexflow/FlexFlow/1472#-O5tZe9t2K_AvSQBCzOY applies to this file as well


lib/op-attrs/include/op-attrs/ops/loss_functions.h line 8 at r3 (raw file):

#include "loss_function.dtg.h"
#include "other_loss_attrs.dtg.h"
#include "sparse_categorical_ce_loss_attrs.dtg.h"

Suggestion:

#include "op-attrs/ops/core.h"
#include "op-attrs/ops/loss_attrs.dtg.h"
#include "op-attrs/ops/loss_function.dtg.h"
#include "op-attrs/ops/other_loss_attrs.dtg.h"
#include "op-attrs/ops/sparse_categorical_ce_loss_attrs.dtg.h"

lib/op-attrs/include/op-attrs/ops/other_loss_attrs.struct.toml line 2 at r3 (raw file):

namespace = "FlexFlow"
name = "OtherLossAttrs"

Maybe NonconfigurableLossAttrs or NonparameterizedLossAttrs? OtherLossAttrs is just a bit ambiguous

Code quote:

name = "OtherLossAttrs"

lib/pcg/include/pcg/computation_graph.h line 36 at r3 (raw file):

std::vector<tensor_guid_t>
    get_new_tensor_guids_for_layer_without_graph_insertion(

Remove


lib/pcg/include/pcg/optimizer_attrs.h line 8 at r3 (raw file):

namespace FlexFlow {

OptimizerAttrs make_empty_sgd_attrs();

What is an "empty" SGD? I'm not entirely convinced these are values that make sense to have their own functions.


lib/local-execution/include/local-execution/loss_functions.h line 29 at r3 (raw file):

TaskSignature get_loss_bwd_signature();
TaskInvocation
    backward(LossAttrs const &, tensor_guid_t logit, tensor_guid_t label);

Why change the signature of backward?

Copy link
Collaborator Author

@reyna-abhyankar reyna-abhyankar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 47 unresolved discussions (waiting on @lockshaw)


lib/kernels/CMakeLists.txt line 10 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why is this not just src/cuda/*.cu?

There are a couple of files in that folder that haven't been refactored/don't build: metrics_kernels.cu and embedding_kernels.cu. I think maybe Dylan was going to tackle those?


lib/kernels/include/kernels/array_shape.h line 45 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

legion_dim_t for start and ff_dim_t for end? Is this signature intended? It's not illogical, just a bit weird and wanted to check it's not a typo 🙂

It's called like this in concat_kernels.cu:
blk_size = shape.sub_shape(legion_dim_t{0}, axis).num_elements(); where axis is an ff_dim_t

I've left it as not implemented for now because I thought it was weird too.


lib/kernels/src/device.h line 74 at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why?

Was getting a linker error, but I realized it was for a different reason. This is removed and fixed in cuda_helper.cu


lib/local-execution/include/local-execution/model_training_instance.struct.toml line 15 at r3 (raw file):

  "pcg/tensor_guid_t.dtg.h",
  "pcg/optimizer_attrs.dtg.h",
]

Done.


lib/local-execution/include/local-execution/optimizer.h line 16 at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Some parameter names would be helpful here

Done.


lib/local-execution/include/local-execution/optimizer.h line 22 at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Some parameter names would be helpful here

Done.


lib/local-execution/include/local-execution/optimizer.h line 29 at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Some parameter names would be helpful here

Done.


lib/local-execution/include/local-execution/task_impl_function.variant.toml line 22 at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Would these be better renamed InitOpTaskImplFunction and FwdBwdOpTaskImplFunction, or do they make sense outside of a op context?

Done.


lib/local-execution/include/local-execution/task_signature.h line 4 at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Delete?

Done.


lib/local-execution/include/local-execution/task_signature.h line 41 at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Change to proper doxygen docstring?

Done.


lib/local-execution/include/local-execution/task_signature.h line 49 at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Change to proper doxygen docstring?

Done.


lib/local-execution/include/local-execution/task_signature.struct.toml line 5 at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why not "hash"?

Done.


lib/local-execution/include/local-execution/task_signature.struct.toml line 10 at r3 (raw file):

includes = [
  "local-execution/tensor_guid_slot_spec.dtg.h",
  "utils/type_index.h",

This is fine, but I have to move the deleted part to the src_includes since the fmt is defined in that file


lib/local-execution/include/local-execution/task_signature.struct.toml line 12 at r3 (raw file):

  "utils/type_index.h",
  "utils/optional.h"
]

This is fine, but I have to move the deleted part to the src_includes since the fmt is defined in that file


lib/local-execution/src/local_slots_backing.cc line 63 at r3 (raw file):

      get_new_tensor_guids_for_layer_without_graph_insertion(
          cg, weight_layer, num_buffer_tensors);
  for (auto const &tensor_guid : buffer_tensors) {

Done.


lib/local-execution/src/local_slots_backing.cc line 126 at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

It shouldn't, can you try again?

Done.


lib/local-execution/src/local_slots_backing.cc line 132 at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

What's going on here with weight_adjusted_idx?

This is the support I had previously for getting the right weight tensor. But this will be superceded by the new weights/inputs reps PR so this will get removed.


lib/local-execution/src/local_slots_backing.cc line 189 at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Isn't this just map_values?

That's right. Fixed. Didn't even know that existed.


lib/local-execution/src/loss_functions.cc line 29 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why is LOGIT added twice? Does this mean pass both the gradient and the non-gradient into the function?

Yes.


lib/local-execution/src/loss_functions.cc line 40 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Same here, why is LOGIT bound twice?

Same as above, the function needs both gradient and non-gradient tensor. It's the same as when we use bind_grad for op tasks, but I didn't feel like that specification was necessary for non-op tasks.


lib/local-execution/src/model_training_instance.cc line 26 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Might be shorted to just copy the old_training_instance?

Will be superceded by changes to the next and ModelTrainingInstance abstraction


lib/local-execution/src/optimizer.cc line 39 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Bit more readable

Done.


lib/local-execution/src/optimizer.cc line 127 at r2 (raw file):

    return {task_id_t::ADAM_UPD_NCCL_TASK_ID, b};
  }
  return {task_id_t::ADAM_UPD_PS_TASK_ID, b};

Done.


lib/local-execution/src/optimizer.cc line 189 at r2 (raw file):

      [&](AdamOptimizerAttrs const &s) {
        return get_adam_update_signature();
      }});

Done.


lib/local-execution/src/optimizer.cc line 195 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Is buffer usually the word for these?

I think so? https://pytorch.org/docs/stable/generated/torch.optim.SGD.html uses momentum_buffer though I guess buffer itself is broad. I can make this grad_buffer_tensors for more clarity.


lib/local-execution/src/optimizer.cc line 211 at r2 (raw file):

      [&](AdamOptimizerAttrs const &s) {
        return get_adam_update_task_impl();
      }});

Done.


lib/local-execution/src/task_invocation.cc line 46 at r2 (raw file):

bool is_invocation_valid(TaskSignature const &sig, TaskInvocation const &inv) {
  // TODO: implement signature checking
  return true;

Done.


lib/op-attrs/include/op-attrs/ops/loss_functions.h line 8 at r3 (raw file):

#include "loss_function.dtg.h"
#include "other_loss_attrs.dtg.h"
#include "sparse_categorical_ce_loss_attrs.dtg.h"

Done.


lib/op-attrs/include/op-attrs/ops/other_loss_attrs.struct.toml line 2 at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Maybe NonconfigurableLossAttrs or NonparameterizedLossAttrs? OtherLossAttrs is just a bit ambiguous

Done.


lib/pcg/include/pcg/optimizer_attrs.h line 8 at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

What is an "empty" SGD? I'm not entirely convinced these are values that make sense to have their own functions.

That's fair. Removed.


lib/local-execution/include/local-execution/loss_functions.h line 29 at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why change the signature of backward?

I thought local execution won't deal with tensor_guid_t

Copy link
Collaborator

@lockshaw lockshaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 49 of 49 files at r4, all commit messages.
Reviewable status: all files reviewed, 24 unresolved discussions (waiting on @oOTigger and @reyna-abhyankar)


lib/kernels/CMakeLists.txt line 10 at r2 (raw file):

Previously, reyna-abhyankar (Reyna Abhyankar) wrote…

There are a couple of files in that folder that haven't been refactored/don't build: metrics_kernels.cu and embedding_kernels.cu. I think maybe Dylan was going to tackle those?

Sounds good, tracked in #1502 and hopefully will get addressed in a week or two once @oOTigger gets back


lib/kernels/include/kernels/array_shape.h line 45 at r2 (raw file):

Previously, reyna-abhyankar (Reyna Abhyankar) wrote…

It's called like this in concat_kernels.cu:
blk_size = shape.sub_shape(legion_dim_t{0}, axis).num_elements(); where axis is an ff_dim_t

I've left it as not implemented for now because I thought it was weird too.

Probably better to instead use the function that converts between an ff_dim_t and a legion_dim_t using the number of tensor dims (which iirc was defined...somewhere, maybe in kernels where the LegionDimOrdered alias is defined?) and then just use the sub_shape(legion_dim_t, legion_dim_t) overload


lib/local-execution/include/local-execution/task_signature.struct.toml line 10 at r3 (raw file):

Previously, reyna-abhyankar (Reyna Abhyankar) wrote…

This is fine, but I have to move the deleted part to the src_includes since the fmt is defined in that file

That's good, the more things we can put in src_includes instead of includes the better for build times as things in src_includes won't get propagated through #includes as they're only in the .dtg.cc file and not the .dtg.h file


lib/local-execution/src/local_slots_backing.cc line 132 at r3 (raw file):

Previously, reyna-abhyankar (Reyna Abhyankar) wrote…

This is the support I had previously for getting the right weight tensor. But this will be superceded by the new weights/inputs reps PR so this will get removed.

So is the plan to merge this and then have a separate PR with the fixed representation, or to replace this PR with a new fixed PR, or something else entirely? Just so I know the context in which to review this PR


lib/local-execution/src/local_slots_backing.cc line 189 at r3 (raw file):

Previously, reyna-abhyankar (Reyna Abhyankar) wrote…

That's right. Fixed. Didn't even know that existed.

🙂


lib/local-execution/src/local_slots_backing.cc line 157 at r4 (raw file):

ArgSlotsBacking LocalSlotsBacking::construct_arg_slots_backing(
    OpTaskBinding const &binding, layer_guid_t const &op_guid) const {
  return map_values(binding.get_arg_bindings(), [&](auto const &arg_binding) {

Prefer explicit type names over auto

Suggestion:

  return map_values(binding.get_arg_bindings(), [&](TaskArgSpec const &task_arg_spec) {

lib/local-execution/src/local_slots_backing.cc line 171 at r4 (raw file):

ArgSlotsBacking LocalSlotsBacking::construct_arg_slots_backing(
    TaskBinding const &binding) const {
  return map_values(binding.get_arg_bindings(), [&](auto const &arg_binding) {

Prefer explicit type names over auto

Suggestion:

  return map_values(binding.get_arg_bindings(), [&](TaskArgSpec const &task_arg_spec) {

lib/local-execution/src/loss_functions.cc line 29 at r2 (raw file):

Previously, reyna-abhyankar (Reyna Abhyankar) wrote…

Yes.

Hmm, fair enough. Feels a bit unintuitive to allow binding to a slot twice, might be better to have a way to define that slot as taking both grad and non-grad? Or have separate slots for the grad and non-grad? Doesn't need to be fixed immediately and I'll leave proposals for an exact solution (or the absence of one) up to you as you're more familiar with the current overall state of this interface than I am, just wanted to bring this up because naively this feels a bit odd (like in a function you can't bind an argument twice--the answer could be that SLOTs aren't exactly equivalent to arguments, or maybe just a wrapper function that adds both the grad and the non-grad to the slot?). Not sure, let me know what you think (and either way probably shouldn't be fixed in this PR)


lib/local-execution/src/optimizer.cc line 127 at r2 (raw file):

Previously, reyna-abhyankar (Reyna Abhyankar) wrote…

Done.

Thanks--I usually tend to reserve the if (...) { ...; return ...; } ...; return ...; pattern for cases like error handling etc. where the behavior is more "early termination" or (probably equivalently) where the else case is very large


lib/local-execution/src/optimizer.cc line 195 at r2 (raw file):

Previously, reyna-abhyankar (Reyna Abhyankar) wrote…

I think so? https://pytorch.org/docs/stable/generated/torch.optim.SGD.html uses momentum_buffer though I guess buffer itself is broad. I can make this grad_buffer_tensors for more clarity.

Sure, that sounds good. I have no issue with buffer as long as it's standard terminology, which it sounds like it is


lib/local-execution/include/local-execution/loss_functions.h line 29 at r3 (raw file):

Previously, reyna-abhyankar (Reyna Abhyankar) wrote…

I thought local execution won't deal with tensor_guid_t

You mean parallel_tensor_guid_t?

Copy link
Collaborator Author

@reyna-abhyankar reyna-abhyankar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 65 of 82 files reviewed, 24 unresolved discussions (waiting on @lockshaw and @oOTigger)


lib/kernels/CMakeLists.txt line 10 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Sounds good, tracked in #1502 and hopefully will get addressed in a week or two once @oOTigger gets back

Done.


lib/kernels/include/kernels/array_shape.h line 45 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Probably better to instead use the function that converts between an ff_dim_t and a legion_dim_t using the number of tensor dims (which iirc was defined...somewhere, maybe in kernels where the LegionDimOrdered alias is defined?) and then just use the sub_shape(legion_dim_t, legion_dim_t) overload

Done.


lib/kernels/src/array_shape.cc line 59 at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Can probably be replaced by slice(DimOrdered, ...): https://github.com/flexflow/FlexFlow/blob/repo-refactor/lib/op-attrs/include/op-attrs/dim_ordered/slice.h#L9

Done.


lib/local-execution/include/local-execution/task_invocation.h line 65 at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Can be moved over to dtgen

Done. Just added eq for now, created an issue for hash and fmt #1503


lib/local-execution/src/local_slots_backing.cc line 100 at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Either use filter or add a count_if function to containers or something like it

This will be superceded by pcg input/weight differentiation since then we don't need to count inputs.


lib/local-execution/src/local_slots_backing.cc line 132 at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

So is the plan to merge this and then have a separate PR with the fixed representation, or to replace this PR with a new fixed PR, or something else entirely? Just so I know the context in which to review this PR

Now that PCG input/weight is merged, I can fix in this PR itself.


lib/local-execution/src/local_slots_backing.cc line 157 at r4 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Prefer explicit type names over auto

Done. (It's OpArgSpec here btw)


lib/local-execution/src/local_slots_backing.cc line 171 at r4 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Prefer explicit type names over auto

Done.


lib/local-execution/src/loss_functions.cc line 29 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Hmm, fair enough. Feels a bit unintuitive to allow binding to a slot twice, might be better to have a way to define that slot as taking both grad and non-grad? Or have separate slots for the grad and non-grad? Doesn't need to be fixed immediately and I'll leave proposals for an exact solution (or the absence of one) up to you as you're more familiar with the current overall state of this interface than I am, just wanted to bring this up because naively this feels a bit odd (like in a function you can't bind an argument twice--the answer could be that SLOTs aren't exactly equivalent to arguments, or maybe just a wrapper function that adds both the grad and the non-grad to the slot?). Not sure, let me know what you think (and either way probably shouldn't be fixed in this PR)

Functionally, a slot = a tensor, so I think it's ok to refer to multiple slots with the same name (the name is just a concept) as long as they have different specifications (e.g. grad and non-grad). Sometimes you don't want the grad version, so I don't think the wrapper function would be too helpful. I actually think the current interface is the most clear.


lib/local-execution/src/loss_functions.cc line 64 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Label shape is an ArrayShape, so it shouldn't have any parallel dims, right?

I think you're right, yes.


lib/local-execution/include/local-execution/loss_functions.h line 29 at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

You mean parallel_tensor_guid_t?

Yes.

Copy link
Collaborator Author

@reyna-abhyankar reyna-abhyankar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 54 of 83 files reviewed, 24 unresolved discussions (waiting on @lockshaw and @oOTigger)


lib/local-execution/include/local-execution/local_training_backing.h line 18 at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why is this passed as mutable?

Done.


lib/local-execution/include/local-execution/model_training_instance.struct.toml line 2 at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why isn't the model part of ModelTrainingInstance? The members seem to have changed non-trivially since the old version:

struct ModelTrainingInstance {
  ComputationGraph computation_graph;
  Optimizer optimizer;
  EnableProfiling enable_profiling;
  TrainingPCG training_pcg;
  TensorMapping tensor_map;
  LegionBacking legion_backing;
};

It's been kind of reversed and disaggregated. You can think of the backing itself as the driver and the training instance as just the fields necessary for training (logit, label, loss, optimizer). This makes it a lot easier to test.


lib/local-execution/include/local-execution/task_signature.struct.toml line 29 at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why not a map of slot_id_t -> something?

Done.


lib/local-execution/include/local-execution/tensor_guid_spec.struct.toml line 2 at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

This name seems a bit ambiguous? Maybe something clearer?

Hm, but it is literally a specification for a tensor. Like the tensor_guid and the grad will tell you exactly which tensor to access.


lib/local-execution/src/local_cost_estimator.cc line 80 at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Would it make more sense to disaggregate LocalTrainingBacking rather than having an optional field for the training instance?

I was thinking about this, but to me it makes sense to have the training backing be the driver and then the optional instance provides fields that are only necessary for training. For example, when I'm testing the local cost estimator, I don't need the loss/optimizer stuff.


lib/local-execution/src/local_slots_backing.cc line 61 at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Not a fan of this, come up with some new tensor id type

Done. It's now a non_graph_tensor_guid_t and not handled by the computation graph


lib/local-execution/src/local_slots_backing.cc line 100 at r3 (raw file):

Previously, reyna-abhyankar (Reyna Abhyankar) wrote…

This will be superceded by pcg input/weight differentiation since then we don't need to count inputs.

Done.


lib/local-execution/src/local_slots_backing.cc line 132 at r3 (raw file):

Previously, reyna-abhyankar (Reyna Abhyankar) wrote…

Now that PCG input/weight is merged, I can fix in this PR itself.

Done.


lib/local-execution/src/local_training_backing.cc line 165 at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why not key this by the layer_guid_t of the WeightAttrs?

Done.


lib/local-execution/src/local_training_backing.cc line 179 at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

I'm not convinced that the next() abstraction is the cleanest way to handle this? Idk, the ModelTrainingInstance changing during training is a bit conceptually weird. I don't have a super clear alternative idea, but I suppose it would be something along the lines of moving the mutable aspects (I think in this case some optimizer state?) into some other data structure so that ModelTrainingInstance stays the same

Done.


lib/local-execution/src/model_training_instance.cc line 9 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Should probably get pulled out of the function at least. Also, does it make sense to actually modifying the ModelTrainingInstance or would we want a separate optimizer state object/field to hold these mutable values?

Done. Now, next only changes the optimizer attrs.


lib/local-execution/src/model_training_instance.cc line 26 at r2 (raw file):

Previously, reyna-abhyankar (Reyna Abhyankar) wrote…

Will be superceded by changes to the next and ModelTrainingInstance abstraction

Done. Now, next only changes the optimizer attrs.


lib/local-execution/test/src/test_loss_e2e.cc line 45 at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

The scoping of these tests seems a bit off--I get having one test to check that local execution works with a loss function, but it seems like the code should be decomposed such that the loss functions can be tested by themselves and then all this is need is the one integration test to make sure loss functions in general work with local execution.

It's fine if these are just temporary placeholders, but in the final draft it would be nice to have a slightly cleaner approach to testing than "spin up the whole local execution training backend and make it run full passes". If this is in fact the best way to test this then I'm happy to be convinced as some tests do just need a whole bunch of surrounding state to make sense, but more often than not that surrounding state can be minimized by having cleaner abstraction boundaries

I agree generally, but in this case (for loss function), the real test is just the CPU kernels one. It doesn't really make sense to test the backing without just running a full forward+backward pass (otherwise it's just like testing that an operator works).


lib/local-execution/test/src/test_update_e2e.cc line at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

https://reviewable.io/reviews/flexflow/FlexFlow/1472#-O5tZe9t2K_AvSQBCzOY applies to this file as well

Same principle here but there's a lot more interaction with the mechanism of the local backing for update (different tensor storage location, access, etc.) that I can add tests for in the respective local slots backing / local task arg accessor tests. Working on that now


lib/pcg/include/pcg/computation_graph.h line 36 at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Remove

Done.

Copy link
Collaborator

@lockshaw lockshaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 7 of 17 files at r5, 7 of 16 files at r6, 31 of 33 files at r7, all commit messages.
Reviewable status: all files reviewed, 14 unresolved discussions (waiting on @reyna-abhyankar)


lib/kernels/include/kernels/array_shape.h line 45 at r2 (raw file):

Previously, reyna-abhyankar (Reyna Abhyankar) wrote…

Done.

The signature (sub_shape(legion_dim_t, ff_dim_t)) still appears the same?


lib/kernels/include/kernels/array_shape.h line 20 at r7 (raw file):

  ArrayShape(TensorShape const &shape);
  ArrayShape(std::vector<std::size_t> const &);
  ArrayShape(LegionTensorDims const &);

Ideally these should be explicit: https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#Rc-explicit

Code quote:

  ArrayShape(size_t *dims, size_t num_dims);
  ArrayShape(TensorShape const &shape);
  ArrayShape(std::vector<std::size_t> const &);
  ArrayShape(LegionTensorDims const &);

lib/kernels/src/array_shape.cc line 59 at r3 (raw file):

Previously, reyna-abhyankar (Reyna Abhyankar) wrote…

Done.

Would it make more sense to just convert start and end to legion_dim_t?


lib/local-execution/include/local-execution/local_training_backing.h line 43 at r7 (raw file):

  LocalSlotsBacking local_slots_backing;
  std::optional<ModelTrainingInstance> training_instance;
  std::optional<OptimizerAttrs> optimizer_attrs;

Having so many optional parameters that affect execution semantics makes me wonder if LocalTrainingBacking (or related abstractions) should be broken up--you could have a user-facing interface that's still along the lines of execute_forward, execute_backward, execute_update, but then have each of those conditionally execute lower-level functions execute_loss, etc. that don't use optional


lib/local-execution/include/local-execution/model_training_instance.struct.toml line 2 at r3 (raw file):

Previously, reyna-abhyankar (Reyna Abhyankar) wrote…

It's been kind of reversed and disaggregated. You can think of the backing itself as the driver and the training instance as just the fields necessary for training (logit, label, loss, optimizer). This makes it a lot easier to test.

That makes sense. In that case what's the point of having ModelTrainingInstance at all? I'm not against having it, I'm just not sure semantically what it represents in the current code.


lib/local-execution/include/local-execution/task_binding.h line 13 at r7 (raw file):

namespace FlexFlow {

struct TaskBinding {

Add support for hashing and fmting?


lib/local-execution/include/local-execution/task_invocation.h line 65 at r3 (raw file):

Previously, reyna-abhyankar (Reyna Abhyankar) wrote…

Done. Just added eq for now, created an issue for hash and fmt #1503

What's the issue with them? Isn't it as simple as just changing from features = ["eq"] to features = ["eq", "hash", "fmt"]?


lib/local-execution/include/local-execution/tensor_guid_spec.struct.toml line 2 at r3 (raw file):

Previously, reyna-abhyankar (Reyna Abhyankar) wrote…

Hm, but it is literally a specification for a tensor. Like the tensor_guid and the grad will tell you exactly which tensor to access.

I guess the concern is "tensor" is a bit overloaded--like "tensor_guid_t" represents just a forward tensor in the computation graph, but then other uses of "tensor" elsewhere are much broader. It might be nice to come up with some naming convention for these different concepts of what a "tensor" is to reduce ambiguity


lib/local-execution/include/local-execution/tensor_guid_spec.struct.toml line 18 at r7 (raw file):

[[fields]]
name = "tensor_guid"
type = "::FlexFlow::UnifiedTensorGuid"

Does it make sense for non_graph_tensor_guid_ts to have a gradient tensor?


lib/local-execution/src/local_cost_estimator.cc line 80 at r3 (raw file):

Previously, reyna-abhyankar (Reyna Abhyankar) wrote…

I was thinking about this, but to me it makes sense to have the training backing be the driver and then the optional instance provides fields that are only necessary for training. For example, when I'm testing the local cost estimator, I don't need the loss/optimizer stuff.

I'm not really arguing for making them non-optional fields, but potentially disaggregating LocalTrainingBackend (or its composite fields) so that the semantics of LocalTrainingBackend are clearer (note the high-level interface could be the same, but the core logic would be in separate pieces that are tested separately). Similar things were done in old-FlexFlow (e.g., tensors had a whole bunch of fields, and for most computations only a subset were really needed. The logic was that if you were testing a certain feature that you could only initialize the fields that were necessary, but that made it really hard to reason about the code as it meant that given a tensor you were never sure what you were allowed to do with it so there had to be a bunch of checking code and all the hidden invariants made the code really hard to get started with). Obviously LocalTrainingBackend doesn't have that degree of optional fields yet, and it's not that optional fields are always bad, but it feels like LocalTrainingBackend is slowly getting closer and closer to this


lib/local-execution/src/local_training_backing.cc line 43 at r7 (raw file):

    if (attrs.has<WeightAttrs>() && this->training_instance.has_value()) {
      assert(this->optimizer_attrs.has_value());
      TaskSignature sig = get_update_signature(this->optimizer_attrs.value());

FYI .value() will already throw an exception if there isn't a value, so the assert isn't strictly necessary


lib/local-execution/src/local_training_backing.cc line 130 at r7 (raw file):

                 unwrapped_training_instance.logit_tensor,
                 unwrapped_training_instance.label_tensor);
    // assert(is_invocation_valid(get_loss_bwd_signature(), loss_invocation));

Remove or uncomment?


lib/local-execution/src/local_training_backing.cc line 172 at r7 (raw file):

      TaskInvocation invocation = get_update_invocation(
          this->optimizer_attrs.value(), weight_tensor, grad_buffer_tensors);
      // assert(is_invocation_valid(get_update_signature(attrs), invocation));

Remove or uncomment?


lib/local-execution/src/loss_functions.cc line 29 at r2 (raw file):

Previously, reyna-abhyankar (Reyna Abhyankar) wrote…

Functionally, a slot = a tensor, so I think it's ok to refer to multiple slots with the same name (the name is just a concept) as long as they have different specifications (e.g. grad and non-grad). Sometimes you don't want the grad version, so I don't think the wrapper function would be too helpful. I actually think the current interface is the most clear.

I think my core confusion here is https://reviewable.io/reviews/flexflow/FlexFlow/1472#-O5tR4aD7i5LFASdtM13: "tensor" is overloaded, so what "slot = tensor" means isn't particularly clear to me. For example, if we take the runtime-definition of a tensor as a buffer of data, then what does it mean for me to pass a non-gradient tensor as a gradient, or pass something in as the gradient of a optimizer buffer? I think it would help to split out the different means of "tensor" into different types and then update the interface to use the correct types


lib/local-execution/test/src/test_local_cost_estimator.cc line 34 at r7 (raw file):

          /*vdim=*/embed_dim,
          /*dropout=*/0.0,
          /*bias=*/false,

Any reason for this change?


lib/local-execution/test/src/test_loss_e2e.cc line 45 at r3 (raw file):

Previously, reyna-abhyankar (Reyna Abhyankar) wrote…

I agree generally, but in this case (for loss function), the real test is just the CPU kernels one. It doesn't really make sense to test the backing without just running a full forward+backward pass (otherwise it's just like testing that an operator works).

I disagree that "the real test is just the CPU kernels one"--there's plenty of code between ".execute_update()and the actual kernel call that has to work properly (all of the task spec abstractions, looking up all the right tensors, filling them all in properly, etc.). It feels like the lack of testability here is due to the interface provided byLocalTrainingBackend--since it's essentially end-to-end (I call .execute_update()and all the state magically updates based on that) I agree it's difficult to test more fine-grained properties, but if the interface were changed (or a new internal interface was pulled out) such that.execute_update()returned a data structure containing, say, a list of low-level tasks to execute (or something like that) then it would be testable. Obviously at some point there does need to be a high-level object that provides the user with the necessary functions to call and is responsible for just trivially tying all the internal components together, butLocalTrainingBackend` feels like it currently has arguably a bit too much code to really be that object (though I think in its current state it's a bit borderline whether it's too large or not), and so it should either be made higher-level and the bulk of the logic it contains refactored out, or it should be made lower-level and an additional interface added on top of it.

On a more specific level though, I'm not really sure what this file is supposed to be testing? Like, what properties of the code is it supposed to be checking? If all of the logic is loss-function-agnostic, why are there tests for different loss functions? Should I view this as just an integration test and that it's really just here to sanity check that things actually compose?


lib/local-execution/test/src/test_update_e2e.cc line at r3 (raw file):

Previously, reyna-abhyankar (Reyna Abhyankar) wrote…

Same principle here but there's a lot more interaction with the mechanism of the local backing for update (different tensor storage location, access, etc.) that I can add tests for in the respective local slots backing / local task arg accessor tests. Working on that now

I'm a bit confused--what is this file actually testing?


lib/pcg/include/pcg/optimizer_attrs.h line 9 at r7 (raw file):

namespace FlexFlow {

OptimizerAttrs next(OptimizerAttrs const &old);

Any way we could get a more specific name for this? I know it makes sense in local-execution, but now that it's in next it might be nice to have it be a bit clearer what it does

Code quote:

next

lib/local-execution/include/local-execution/non_graph_tensor_guid_t.struct.toml line 2 at r7 (raw file):

namespace = "FlexFlow"
name = "non_graph_tensor_guid_t"

As mentioned in our meeting, might be nice to build this in a bit more of an "iterative refinement" style, where rather than simultaneously using tensor_guid_t (relatively semantically rich id, can be used for lookups in graphs, etc.) and non_graph_tensor_guid_t (relatively semantically weak, essentially just an opaque handle to an arbitrary buffer somewhere), you instead have some intermediate types (say, for representing gradients if tensors, optimization buffers of tensors, etc.), and then have functions that refine objects over tensor_guid_t -> (intermediate types) -> non_graph_tensor_guid_t, and then all of the LocalBackend code can simply reason over non_graph_tensor_guid_t. It might lead to more code LOC-wise, but might be a nice way to disaggregate LocalBackend and friends (which are getting quite large) and make them easier to test

Copy link
Collaborator Author

@reyna-abhyankar reyna-abhyankar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 14 unresolved discussions (waiting on @lockshaw)


lib/kernels/include/kernels/array_shape.h line 45 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

The signature (sub_shape(legion_dim_t, ff_dim_t)) still appears the same?

Oh, did you mean we should not have this at all and instead the calling code should call sub_shape(legion_dim_t, legion_dim_t)?


lib/kernels/include/kernels/array_shape.h line 20 at r7 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Ideally these should be explicit: https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#Rc-explicit

Done.


lib/kernels/src/array_shape.cc line 59 at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Would it make more sense to just convert start and end to legion_dim_t?

That works too.


lib/local-execution/include/local-execution/local_training_backing.h line 43 at r7 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Having so many optional parameters that affect execution semantics makes me wonder if LocalTrainingBacking (or related abstractions) should be broken up--you could have a user-facing interface that's still along the lines of execute_forward, execute_backward, execute_update, but then have each of those conditionally execute lower-level functions execute_loss, etc. that don't use optional

Refactored. Now ModelTrainingInstance is the driver of the training loop and local backing executes code on a per-layer basis.


lib/local-execution/include/local-execution/model_training_instance.struct.toml line 2 at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

That makes sense. In that case what's the point of having ModelTrainingInstance at all? I'm not against having it, I'm just not sure semantically what it represents in the current code.

See https://reviewable.io/reviews/flexflow/FlexFlow/1472#-O8ma9DJBp9ejqaQkx_W


lib/local-execution/include/local-execution/task_binding.h line 13 at r7 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Add support for hashing and fmting?

Done.


lib/local-execution/include/local-execution/task_invocation.h line 65 at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

What's the issue with them? Isn't it as simple as just changing from features = ["eq"] to features = ["eq", "hash", "fmt"]?

Done. Had to actually add the hash/fmt functions for some underlying types.


lib/local-execution/src/local_cost_estimator.cc line 80 at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

I'm not really arguing for making them non-optional fields, but potentially disaggregating LocalTrainingBackend (or its composite fields) so that the semantics of LocalTrainingBackend are clearer (note the high-level interface could be the same, but the core logic would be in separate pieces that are tested separately). Similar things were done in old-FlexFlow (e.g., tensors had a whole bunch of fields, and for most computations only a subset were really needed. The logic was that if you were testing a certain feature that you could only initialize the fields that were necessary, but that made it really hard to reason about the code as it meant that given a tensor you were never sure what you were allowed to do with it so there had to be a bunch of checking code and all the hidden invariants made the code really hard to get started with). Obviously LocalTrainingBackend doesn't have that degree of optional fields yet, and it's not that optional fields are always bad, but it feels like LocalTrainingBackend is slowly getting closer and closer to this

See https://reviewable.io/reviews/flexflow/FlexFlow/1472#-O8ma9DJBp9ejqaQkx_W


lib/local-execution/src/local_training_backing.cc line 43 at r7 (raw file):

Previously, lockshaw (Colin Unger) wrote…

FYI .value() will already throw an exception if there isn't a value, so the assert isn't strictly necessary

Done.


lib/local-execution/src/local_training_backing.cc line 130 at r7 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Remove or uncomment?

Currently a todo #1442 . It should be uncommented but the underlying function is NOT_IMPLEMENTED()


lib/local-execution/src/local_training_backing.cc line 172 at r7 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Remove or uncomment?

See https://reviewable.io/reviews/flexflow/FlexFlow/1472#-O8nGxOQBQbphnmK4wmJ


lib/local-execution/test/src/test_local_cost_estimator.cc line 34 at r7 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Any reason for this change?

The PCG interface added the bias tensors, which broke the test because I only had one weight tensor. I quickly wanted to just check if the actual test worked but forgot to switch it back and actually add the bias tensors. This is now corrected.


lib/pcg/include/pcg/optimizer_attrs.h line 9 at r7 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Any way we could get a more specific name for this? I know it makes sense in local-execution, but now that it's in next it might be nice to have it be a bit clearer what it does

How about get_next_iteration_optimizer_attrs?

Copy link
Collaborator Author

@reyna-abhyankar reyna-abhyankar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 68 of 102 files reviewed, 14 unresolved discussions (waiting on @lockshaw)


lib/local-execution/test/src/test_loss_e2e.cc line 45 at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

I disagree that "the real test is just the CPU kernels one"--there's plenty of code between ".execute_update()and the actual kernel call that has to work properly (all of the task spec abstractions, looking up all the right tensors, filling them all in properly, etc.). It feels like the lack of testability here is due to the interface provided byLocalTrainingBackend--since it's essentially end-to-end (I call .execute_update()and all the state magically updates based on that) I agree it's difficult to test more fine-grained properties, but if the interface were changed (or a new internal interface was pulled out) such that.execute_update()returned a data structure containing, say, a list of low-level tasks to execute (or something like that) then it would be testable. Obviously at some point there does need to be a high-level object that provides the user with the necessary functions to call and is responsible for just trivially tying all the internal components together, butLocalTrainingBackend` feels like it currently has arguably a bit too much code to really be that object (though I think in its current state it's a bit borderline whether it's too large or not), and so it should either be made higher-level and the bulk of the logic it contains refactored out, or it should be made lower-level and an additional interface added on top of it.

On a more specific level though, I'm not really sure what this file is supposed to be testing? Like, what properties of the code is it supposed to be checking? If all of the logic is loss-function-agnostic, why are there tests for different loss functions? Should I view this as just an integration test and that it's really just here to sanity check that things actually compose?

See https://reviewable.io/reviews/flexflow/FlexFlow/1472#-O8ma9DJBp9ejqaQkx_W for the splitting out. For loss functions, it really is just a sanity test to make sure things run. The reason there's different subcases are because the actual loss function op code/kernel has conditional logic based on which loss attrs we're dealing with. But the backing is agnostic to this. So yes we could have just 1 test for loss functions in general but I just wanted to sanity check.


lib/local-execution/test/src/test_update_e2e.cc line at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

I'm a bit confused--what is this file actually testing?

Still need to actually make this file check something useful, but it's easier now because of the refactor.

@lockshaw lockshaw changed the base branch from repo-refactor to master December 16, 2024 08:37
Copy link
Collaborator

@lockshaw lockshaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 21 of 36 files at r8, 81 of 92 files at r9, all commit messages.
Reviewable status: 134 of 149 files reviewed, 36 unresolved discussions (waiting on @reyna-abhyankar)


lib/kernels/include/kernels/array_shape.h line 45 at r2 (raw file):

Previously, reyna-abhyankar (Reyna Abhyankar) wrote…

Oh, did you mean we should not have this at all and instead the calling code should call sub_shape(legion_dim_t, legion_dim_t)?

Yes


lib/local-execution/src/local_training_backing.cc line 130 at r7 (raw file):

Previously, reyna-abhyankar (Reyna Abhyankar) wrote…

Currently a todo #1442 . It should be uncommented but the underlying function is NOT_IMPLEMENTED()

Can you add a link to that TODO in the code, just in case anyone else is unsure when they come across this?


lib/local-execution/src/local_training_backing.cc line 172 at r7 (raw file):

Previously, reyna-abhyankar (Reyna Abhyankar) wrote…

See https://reviewable.io/reviews/flexflow/FlexFlow/1472#-O8nGxOQBQbphnmK4wmJ

Can you add a link to that TODO in the code, just in case anyone else is unsure when they come across this?


lib/local-execution/test/src/test_loss_e2e.cc line 45 at r3 (raw file):

Previously, reyna-abhyankar (Reyna Abhyankar) wrote…

See https://reviewable.io/reviews/flexflow/FlexFlow/1472#-O8ma9DJBp9ejqaQkx_W for the splitting out. For loss functions, it really is just a sanity test to make sure things run. The reason there's different subcases are because the actual loss function op code/kernel has conditional logic based on which loss attrs we're dealing with. But the backing is agnostic to this. So yes we could have just 1 test for loss functions in general but I just wanted to sanity check.

Can you hold this as "Working" while we resolve https://reviewable.io/reviews/flexflow/flexflow-train/1472#-OESUtvlCsQMPCJ492-X?


lib/pcg/include/pcg/optimizer_attrs.h line 9 at r7 (raw file):

Previously, reyna-abhyankar (Reyna Abhyankar) wrote…

How about get_next_iteration_optimizer_attrs?

Sounds great. Could also do get_optimizer_attrs_for_next_iteration if you prefer, either is fine with me


lib/kernels/include/kernels/legion_dim.h line 13 at r9 (raw file):

legion_dim_t legion_dim_from_ff_dim(ff_dim_t, int num_dimensions);

std::optional<legion_dim_t> legion_dim_from_ff_dim(std::optional<ff_dim_t>,

Probably should just use transform from containers rather than having a whole extra specialization for this.


lib/local-execution/include/local-execution/local_slots_backing.h line 25 at r9 (raw file):

    std::unordered_map<reduced_tensor_t, GenericTensorAccessorW>;

struct LocalSlotsBacking {

This object is getting rather large in terms of methods, it would be nice to factor some of them out into separate functions, even if that means some explicit argument passing is necessary (might even be an advantage, because right now it's hard to tell what these methods do because they access so much)--wasn't this going to be solved by adding an explicit lowering phase so that LocalSlotsBacking isn't responsible for any thing OpTask* and thus only has to handle Task* objects?


lib/local-execution/include/local-execution/local_slots_backing.h line 36 at r9 (raw file):

                              ComputationGraph const &,
                              Allocator &);
  void allocate_tensors_by_role(TensorRole const &,

Why is this new API necessary?


lib/local-execution/include/local-execution/local_slots_backing.h line 49 at r9 (raw file):

  TensorSlotsBacking
      construct_tensor_slots_backing(TaskBinding const &,
                                     std::optional<layer_guid_t> const &) const;

Why does this need a layer_guid_t?


lib/local-execution/include/local-execution/local_slots_backing.h line 59 at r9 (raw file):

  GenericTensorAccessorW const &
      get_tensor_backing(TensorType const &,

Why does this function care about the TensorType--shouldn't the reduced_tensor_t by itself be unambiguous?

Code quote:

      get_tensor_backing(TensorType const &,

lib/local-execution/include/local-execution/local_slots_backing.h line 61 at r9 (raw file):

      get_tensor_backing(TensorType const &,
                         reduced_tensor_t const &,
                         std::optional<layer_guid_t> const &) const;

Why is a layer_guid_t necessary here?


lib/local-execution/include/local-execution/layer_tensor_key.struct.toml line 2 at r9 (raw file):

namespace = "FlexFlow"
name = "LayerTensorKey"

What is the purpose of LayerTensorKey? I see that it's used in LocalSlotsBacking, but I'm not clear on why every it needs to be keyed by both layer_guid_t and reduced_tensor_t rather than just by reduced_tensor_t--why is the layer_guid_t necessary?


lib/local-execution/include/local-execution/op_task_invocation.h line 87 at r9 (raw file):

  bool operator!=(OpTaskBinding const &other) const;

  std::unordered_map<SlotTensorTypeId, OpTensorSpec> const &

Is it necessary for OpTaskInvocation to deal with anything other than grad and non-grad tensors? Seems like SlotGradId as the more accurate datatype for here, but maybe there's a case I'm not thinking about


lib/local-execution/include/local-execution/arg_ref.h line 72 at r9 (raw file):

    size_t result = 0;
    hash_combine(s.type_idx);
    hash_combine(s.ref_type);

Why remove ref_type? That seems like an important field?


lib/local-execution/include/local-execution/concrete_arg.h line 49 at r9 (raw file):

  std::shared_ptr<void const> ptr;

  std::tuple<decltype(type_idx) const &> tie() const;

Why does tie not include the ptr?


lib/local-execution/include/local-execution/concrete_arg.h line 63 at r9 (raw file):

  size_t operator()(::FlexFlow::ConcreteArgSpec const &s) const {
    size_t result = 0;
    ::FlexFlow::hash_combine(result, s.get_type_index());

Why not include ptr in the hash?


lib/local-execution/include/local-execution/model_training_instance.h line 10 at r9 (raw file):

using PerLayerElapsedTime =
    std::unordered_map<layer_guid_t, std::optional<float>>;

Might be nice to eventually convert this float to a wrapper type with, e.g., non-negativity checks, could be named like milliseconds_t or whatever the unit is (also to make the unit clear) along the lines of #1533

Code quote:

std::optional<float>

lib/local-execution/include/local-execution/task_registry.h line 17 at r9 (raw file):

                              ComputationGraphOpAttrs const &attrs);

bool registry_contains_op_task(TaskRegistry const &,

Suggestion:

bool registry_contains_task_for_layer(

lib/local-execution/include/local-execution/runtime_arg_ref.h line 13 at r8 (raw file):

namespace FlexFlow {

enum class RuntimeArgRefType {

Why not just make this dtgen so you don't have to implement to_string, etc. yourself?


lib/local-execution/include/local-execution/runtime_arg_ref.h line 30 at r9 (raw file):

RuntimeArgRef<FFIterationConfig> iteration_config();

// std::string format_as(RuntimeArgRefSpec const & x) {

Delete


lib/local-execution/include/local-execution/local_task_argument_accessor.h line 26 at r9 (raw file):

  ConcreteArgSpec const &get_concrete_arg(slot_id_t) const override;

  GenericTensorAccessor get_tensor(slot_id_t slot,

I know I've mentioned this before (so sorry for bringing it up again, but I appear to have forgotten the previous response) but why have a slot represent any of a number of tensors (forward, gradient, optimizer, etc.) rather than just having a slot represent a single tensor? Is it that it makes backward inference more difficult? Conceptually feels weird, as it makes it a bit confusing who actually determines what gets communicated to a task

Code quote:

  GenericTensorAccessor get_tensor(slot_id_t slot,

lib/local-execution/include/local-execution/task_binding.h line 10 at r9 (raw file):

#include "local-execution/task_id_t.dtg.h"
#include "local-execution/task_signature.dtg.h"
#include "utils/hash/unordered_map.h"

Move to .cc file along with hash definition


lib/local-execution/include/local-execution/task_binding.h line 66 at r9 (raw file):

template <>
struct hash<::FlexFlow::TaskBinding> {
  size_t operator()(::FlexFlow::TaskBinding const &s) const {

Move definition to .cc file


lib/local-execution/include/local-execution/task_binding.h line 70 at r9 (raw file):

    hash_combine(result, s.get_tensor_bindings());
    hash_combine(result, s.get_arg_bindings());
    return result;

FYI you can just return get_std_hash(result.tie()); (as long as you make std::hash<TaskBinding> a friend of TaskBinding

Code quote:

    size_t result = 0;
    hash_combine(result, s.get_tensor_bindings());
    hash_combine(result, s.get_arg_bindings());
    return result;

lib/local-execution/include/local-execution/local_training_backing.h line 37 at r9 (raw file):

                            std::optional<layer_guid_t> const &) const;
  TaskArgumentAccessor get_op_task_arg_accessor(OpTaskInvocation const &,
                                                layer_guid_t const &) const;

Why are these on LocalTrainingBacking instead of LocalSlotsBacking? They feel lower-level than the other methods present here (i.e., execute_init, execute_forward, etc. all require arg accessors to function, so it feels odd that it's at the same level of API)

Code quote:

      get_task_arg_accessor(TaskInvocation const &,
                            std::optional<layer_guid_t> const &) const;
  TaskArgumentAccessor get_op_task_arg_accessor(OpTaskInvocation const &,
                                                layer_guid_t const &) const;

lib/local-execution/src/ops/attention.cc line 0 at r9 (raw file):
FYI these should be moved to lib/local-execution/src/local-execution/ops/attention.cc to match the standard directory layout


lib/local-execution/include/local-execution/reduced_tensor_t.struct.toml line 2 at r9 (raw file):

namespace = "FlexFlow"
name = "reduced_tensor_t"

What does this represent? It probably needs a new name as "reduced" is a bit overloaded (i.e., the reduction operation)


lib/local-execution/src/tensor_reduction.cc line 7 at r9 (raw file):

reduced_tensor_t lower(tensor_guid_t const &tensor_guid) {
  return reduced_tensor_t{tensor_guid.raw_graph_output.idx};

This seems rather error-prone: how are you going to avoid assigning a non-tensor_guid_t tensor the same index? What about instead using some kind of numbering source (like NodeSource) to avoid duplicate numberings?


lib/local-execution/src/tensor_reduction.cc line 11 at r9 (raw file):

std::vector<reduced_tensor_t>
    lower(std::vector<tensor_guid_t> const &tensor_guids) {

Just use transform(..., lower) when you need this functionality (I try to avoid having functions that are just the normal function lifted through transform, as having them would significantly increase the number of functions to maintain for little value)


lib/local-execution/src/local_cost_estimator.cc line 24 at r9 (raw file):

  }
  return total_elapsed_time;
}

Suggestion:

static float get_total_elapsed_time(PerLayerElapsedTime const &fwd,
                                    PerLayerElapsedTime const &bwd) {
  return sum(values(fwd)) + sum(values(bwd));
}

lib/local-execution/src/local_cost_estimator.cc line 29 at r9 (raw file):

    : runtime_arg_config(config) {}

CostDetails LocalCostEstimator::estimate_cost(

Any way we could factor out pieces of this function? A bunch of these blocks seem conceptually self-contained and this function is getting rather large


lib/local-execution/src/local_cost_estimator.cc line 51 at r9 (raw file):

  std::vector<tensor_guid_t> input_tensor_ids;

  ComputationGraphBuilder cg_builder;

I feel like you probably just want the raw ComputationGraph interface (i.e., computation_graph.h) here, ComputationGraphBuilder is really intended to provide a lots-of-stuff-auto-inferred pytorch-style interface for constructing models, which you're not really using here


lib/local-execution/src/local_cost_estimator.cc line 84 at r9 (raw file):

                                     tensor_backing_map,
                                     this->runtime_arg_config);
  local_backing.register_and_allocate_layer(layer_added_result.layer);

Why do we have a separate API for registering layers? Is there a time when we want a LocalTrainingBacking that doesn't have its layers registered?


lib/pcg/include/pcg/computation_graph_builder.h line 260 at r9 (raw file):

  tensor_guid_t get_output(LayerAttrs const &, int idx) const;

  LayerAddedResult add_layer_and_get_layer_added_result(

Why not just use add_layer from computation_graph.h (

LayerAddedResult add_layer(ComputationGraph &computation_graph,
LayerAttrs const &attrs,
std::vector<tensor_guid_t> const &inputs,
std::vector<TensorAttrs> const &outputs);
)? Feels unnecessary and a bit too low-level for the rest of the API to put this on ComputationGraphBuilder


lib/local-execution/include/local-execution/tensor_type.enum.toml line 11 at r9 (raw file):

[[values]]
name = "NON_GRAPH"

Instead of having a miscellaneous NON_GRAPH category, instead specify the categories that it could be (e.g., LOSS, etc.)


lib/op-attrs/include/op-attrs/operator_attrs.h line 4 at r9 (raw file):

#define _OPERATOR_PARAMS_H

#include "local-execution/ops/attention.h"

These shouldn't be in op-attrs--typo?


lib/pcg/include/pcg/parallel_computation_graph/parallel_computation_graph_builder.h line 183 at r9 (raw file):

ParallelTensorAttrs
    make_weight_attrs(ParallelTensorShape const &shape,

Why was this changed to be non-static? If it has to be non-static it shouldn't be here (it should probably be in either parallel_tensor_attrs.h or parallel_tensor_shape.h)

Copy link
Collaborator Author

@reyna-abhyankar reyna-abhyankar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 134 of 149 files reviewed, 35 unresolved discussions (waiting on @lockshaw)


lib/kernels/include/kernels/array_shape.h line 45 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Yes

Done.


lib/kernels/include/kernels/legion_dim.h line 13 at r9 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Probably should just use transform from containers rather than having a whole extra specialization for this.

Done.


lib/local-execution/include/local-execution/arg_ref.h line 72 at r9 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why remove ref_type? That seems like an important field?

Done.


lib/local-execution/include/local-execution/concrete_arg.h line 49 at r9 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why does tie not include the ptr?

In tests, I check for equality for two ConcreteArgSpec but they are ::created separately so they have different pointers for the underlying value.


lib/local-execution/include/local-execution/concrete_arg.h line 63 at r9 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why not include ptr in the hash?

Done.


lib/local-execution/include/local-execution/local_slots_backing.h line 36 at r9 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why is this new API necessary?

It made it a lot easier to test whether different were being allocated correctly.


lib/local-execution/include/local-execution/runtime_arg_ref.h line 30 at r9 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Delete

Done.


lib/local-execution/include/local-execution/task_binding.h line 10 at r9 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Move to .cc file along with hash definition

Done.


lib/local-execution/include/local-execution/task_binding.h line 66 at r9 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Move definition to .cc file

Done.


lib/local-execution/src/local_cost_estimator.cc line 84 at r9 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why do we have a separate API for registering layers? Is there a time when we want a LocalTrainingBacking that doesn't have its layers registered?

This is useful when testing. We can initialize the backing object and then register different layers in different subcases


lib/local-execution/src/local_training_backing.cc line 130 at r7 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Can you add a link to that TODO in the code, just in case anyone else is unsure when they come across this?

Done.


lib/local-execution/src/local_training_backing.cc line 172 at r7 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Can you add a link to that TODO in the code, just in case anyone else is unsure when they come across this?

Done.


lib/local-execution/src/ops/attention.cc line at r9 (raw file):

Previously, lockshaw (Colin Unger) wrote…

FYI these should be moved to lib/local-execution/src/local-execution/ops/attention.cc to match the standard directory layout

Done.


lib/op-attrs/include/op-attrs/operator_attrs.h line 4 at r9 (raw file):

Previously, lockshaw (Colin Unger) wrote…

These shouldn't be in op-attrs--typo?

Done.


lib/local-execution/include/local-execution/task_registry.h line 17 at r9 (raw file):

                              ComputationGraphOpAttrs const &attrs);

bool registry_contains_op_task(TaskRegistry const &,

Done.


lib/local-execution/src/local_cost_estimator.cc line 24 at r9 (raw file):

  }
  return total_elapsed_time;
}

Don't need this anymore.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
runtime Runtime library
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add hash and fmt for TaskBinding in local execution
2 participants