Extending your code for poses #29

Open
AmitMY opened this issue Feb 19, 2022 · 1 comment
Comments

@AmitMY

AmitMY commented Feb 19, 2022

This is a "support" request rather than a bug report or feature request.


Idea

I have a "video" sequence that is represented as skeletal poses rather than video frames.
Each pose was extracted from a video frame, so the sequence is a tensor of shape [frames, keypoints, dimensions]; for example, a [100, 137, 2] tensor is a 2D pose with 137 keypoints over 100 frames.

As there is no consistent spatial information between strides, we can treat the dimensions as channels and apply a full-size convolution with in_channels=2, out_channels=C, kernel_size=(F, 137), stride=S.
(Where C, S, and F are hyperparameters, and this can be repeated for multiple layers.)

After multiple layers of convolution, these representations would be quantized and then decoded in the reverse manner.
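For concreteness, here is a minimal PyTorch sketch of what I mean by the encoder side (the class name and the values C=256, F=4, S=2 are placeholder assumptions, not anything from this repo):

```python
import torch
import torch.nn as nn

class PoseEncoder(nn.Module):
    """Sketch: treat the pose dimensions as channels and collapse the
    keypoint axis with a full-width first convolution; later layers
    only downsample over time."""

    def __init__(self, n_keypoints=137, dims=2, channels=256,
                 kernel_frames=4, stride=2):
        super().__init__()
        # Full-width conv: kernel spans all keypoints, strides only over frames.
        self.in_conv = nn.Conv2d(dims, channels,
                                 kernel_size=(kernel_frames, n_keypoints),
                                 stride=(stride, 1))
        # Once the keypoint axis collapses, downsampling is purely temporal.
        self.temporal = nn.Sequential(
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=kernel_frames, stride=stride),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=kernel_frames, stride=stride),
        )

    def forward(self, poses):
        # poses: [batch, frames, keypoints, dims] -> [batch, dims, frames, keypoints]
        x = poses.permute(0, 3, 1, 2)
        x = self.in_conv(x).squeeze(-1)   # [batch, channels, frames']
        return self.temporal(x)           # [batch, channels, frames'']

encoder = PoseEncoder()
latents = encoder(torch.randn(8, 100, 137, 2))  # -> torch.Size([8, 256, 10])
print(latents.shape)
```

The decoder would mirror this (transposed temporal convolutions, then a final transposed convolution that restores the keypoint axis), with the quantization bottleneck in between.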

Why fork from your library?

  • It seems stable
  • It works at the same sample rate, unlike audio-based models

Support request:

While I can write the data loading module to load these tensors, perform the necessary data augmentation, etc., I'm having some trouble understanding how to properly implement the convolutional encoder and decoder. (This case is different because it is not downsampling over a spatially/temporally consistent input.)

Could you please offer some guidance?

Thanks!

@wilson1yan
Owner

wilson1yan commented Feb 19, 2022

A few possible options off the top of my head:

  1. You could try just treating the keypoints as "spatially consistent" and running a CNN autoencoder / decoder to see how well that reconstructs your keypoints.

  2. One option may be in the area of graph convolutions and graph autoencoders, if you treat your keypoint structure as a graph. Though I'm not familiar enough with that area to tell you exactly how you'd do it or how those architectures work.

  3. If you aren't too bottlenecked by compute, you could quantize your keypoint data into fine-grained enough bins and directly model it with a transformer, e.g. your input to the transformer would be of shape (100, 137, 2, n_quantization_bins). (A small quantization sketch follows this list.)

  4. Alternatively, there are a few papers (e.g. this one) that train video prediction models over keypoints and work pretty well. They essentially train a VAE over keypoint data, and have an encoder / decoder architecture, so that may also help.
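A rough sketch of the binning in option 3 (the bin count, the assumption that coordinates are normalized to [0, 1], and the helper names are illustrative, not from this repo):

```python
import torch

def quantize_keypoints(poses, n_bins=256):
    """poses: [frames, keypoints, dims], coordinates normalized to [0, 1].
    Returns integer bin ids in {0, ..., n_bins - 1} for a transformer to model."""
    return (poses.clamp(0, 1) * (n_bins - 1)).round().long()

def dequantize_keypoints(bins, n_bins=256):
    """Map bin ids back to coordinates in [0, 1] (bin centers)."""
    return bins.float() / (n_bins - 1)

poses = torch.rand(100, 137, 2)        # a 2D pose sequence, already normalized
tokens = quantize_keypoints(poses)     # shape [100, 137, 2], discrete tokens
recon = dequantize_keypoints(tokens)
print((recon - poses).abs().max())     # error bounded by 0.5 / (n_bins - 1)
```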
