Suggestion: Convolutional-neural-nets? #224
I'm interested. I think it would be good to first have a test case or proof of concept for the Epiphany. Given the limitations of the core, I would like to see a proof-of-concept parallel implementation that can efficiently handle the inter-core communication to do something useful (even though all PAL functions are currently serial).
There's no doubt in my mind it could do it - anything a cache can do, inter-core communication can; you just sometimes need to manually decide what is kept local. (In the case of convolutions, I think most of the time you'd want to assume the kernel is kept local, but for completeness one might also need the option to stream the kernels.) Another question is how far to go with inter-core communication: a single function doing multiple passes would give you the opportunity to run a whole pipeline on chip (where a GPU would be serial between layers, communicating through shared L2); however, the APIs would begin to explode in complexity, and I got the impression PAL wants to stay quite generic and straightforward. (You might need further details, like describing streams of data.) E.g. something like this:
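As a rough illustration of what a multi-layer pipeline description could look like, here is a minimal sketch; every name (`conv_layer_t`, `pipeline_out_dim`, the fields) is hypothetical and not part of PAL's API:

```c
/* Hypothetical descriptor for one convolution layer in an on-chip
 * pipeline; these names are illustrative, not part of PAL. */
typedef struct {
    int kernel_w, kernel_h;  /* spatial size of the filter      */
    int in_ch, out_ch;       /* input / output channel counts   */
    int pool;                /* max-pool factor, 1 = no pooling */
} conv_layer_t;

/* Output width of a 'valid' convolution followed by NxN pooling. */
static int layer_out_dim(int in_dim, int kernel_dim, int pool)
{
    return (in_dim - kernel_dim + 1) / pool;
}

/* Walk a whole pipeline, returning the final spatial width.  A real
 * implementation would instead schedule each layer across cores and
 * stream activations between them over the on-chip network. */
static int pipeline_out_dim(int in_dim, const conv_layer_t *layers, int n)
{
    int i;
    for (i = 0; i < n; i++)
        in_dim = layer_out_dim(in_dim, layers[i].kernel_w, layers[i].pool);
    return in_dim;
}
```

The dimension bookkeeping is just the part that is easy to pin down up front; the hard part, as discussed below, is the inter-core communication schedule.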
If you're streaming the kernel weights from off-chip memory, it seems to me the Epiphany is no longer a very good choice. Copying the weights to perform a single multiply-accumulate per weight is not an effective use of the architecture because it quickly becomes bandwidth-bound. The most effective network would keep preconfigured weights on chip and use inter-core communication while streaming the input. An effective inter-core communication scheme must be worked out for pretty much any network size that does something useful. Currently PAL can address the computation part, but the communication here is just not as straightforward as moving edges around in image processing: the communication buffers may be asymmetric in size and do not necessarily map to nearest neighbors on the on-chip network. It's complicated. Still, I agree that the prospect of doing it on the Epiphany is interesting, since it does allow you to run the entire NN without explicit synchronization through global memory between layers, as on a GPU.
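A back-of-the-envelope version of this bandwidth argument, assuming 4-byte floats and counting a multiply-accumulate as two flops (illustrative numbers, not measurements):

```c
/* Rough arithmetic-intensity estimates (flops per byte of off-chip
 * traffic) for a k x k convolution over a w x h single-channel image.
 * Back-of-the-envelope only: ignores halos, borders and overlap. */

/* Streaming weights: every weight is re-fetched for each MAC, so
 * traffic is dominated by weight bytes - one 2-flop MAC per 4-byte
 * float fetched. */
static double intensity_streamed(void)
{
    return 2.0 / 4.0;
}

/* Resident weights: each of the k*k weights is reused at every output
 * position, so off-chip traffic is just the image streamed in once. */
static double intensity_resident(int w, int h, int k)
{
    double outputs = (double)(w - k + 1) * (h - k + 1);
    double flops   = outputs * 2.0 * k * k;   /* MACs = 2 flops each */
    double bytes   = (double)w * h * 4.0;     /* image read once     */
    return flops / bytes;
}
```

For a 512x512 image and a 16x16 kernel the resident case lands around 120 flops/byte versus 0.5 when streaming weights, which is why keeping the weights on chip matters so much here.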
For convolutions, you'd load a weight map which is then applied many times across the image (e.g. a 512x512 image with a 16x16x64 kernel is a 1 MB image against a 64 KB kernel), so there should still be some value. But perhaps this can be improved further by a single call applying multiple layers to a batch of images (upload kernel, upload image 1, classify, upload image 2, ...). For image recognition, there are typically ~60 MB of weights? It's going to be a while before there's an Epiphany chip capable of holding all that; but even then, I thought holding the intermediate layer values on chip would still be a win. I guess if they do actually make a 1024-core chip with 128 KB per core, that could do it amazingly well.
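A quick check of the sizes quoted above, assuming 4-byte floats (on these numbers, the hypothetical 1024-core, 128 KB-per-core chip would indeed hold ~60 MB of weights with room to spare):

```c
/* Sanity-check the sizes quoted in the discussion (4-byte floats). */
static long image_bytes(int w, int h)
{
    return (long)w * h * 4;               /* 512x512 -> 1 MB        */
}
static long kernel_bytes(int kw, int kh, int n)
{
    return (long)kw * kh * n * 4;         /* 16x16x64 -> 64 KB      */
}
static long chip_bytes(int cores, int kb_per_core)
{
    return (long)cores * kb_per_core * 1024;  /* 1024 x 128 KB -> 128 MB */
}
```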
Yes, thinking toward the future, a larger chip should do quite well. For now, it would be best to focus on toy codes and proofs of concept. This is architecture and algorithm research, since it's not as if performance or energy-efficiency records will be smashed. I don't think inverting the data loading so that it works on a batch of images solves the bandwidth limitation: it still uses the data just once. I could be wrong or misunderstanding.
It depends on the size of the local store. If the filter weights fit on the chip, it helps; if they don't, it doesn't.
Somewhere between a fully fledged neural-net library and the existing convolution functions in pal/image - would there be any elements that are a good fit for the PAL library?
Imagine the following function:
a 3D x 3D -> 2D convolution, with bias, clamped output (supply a minimum: e.g. zero, -1, or -FLT_MAX for no effect) for 'ReLU', and optional NxN max-pooling (N = 1, 2, 3, ...; N = 1 for no pooling) to reduce the output image size.
This would be a big chunk of the basic layer evaluation of deep-learning image-recognition algorithms. You'd invoke multiple 3D -> 2D convolutions for a 3D result.
It would be important to include the ReLU and max-pooling, since this avoids significant memory traffic. You could provide a helper function for a 3D x 3D -> 2D convolution without those steps that just calls it with (min = -FLT_MAX, pooling = 1), or have an outright separate function if needed.
An input could be (width x height x channels) - image-planes - or a true 3D image (volume data).
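As a sketch of the proposed semantics, here is a plain serial reference version of such a fused layer, assuming interleaved channels and 'valid' borders; the name, argument order, and layout are all assumptions for illustration, not a PAL API:

```c
#include <float.h>  /* FLT_MAX: pass -FLT_MAX as min_out for no clamp */

/* 3D (w x h x ch) input convolved with a 3D (kw x kh x ch) kernel
 * down to one 2D plane, plus bias, clamp-to-minimum ('ReLU' when
 * min_out = 0) and NxN max-pooling (pool = 1 for none).  Channels
 * are interleaved: pixel (x, y) channel c sits at ((y*w + x)*ch + c).
 * Hypothetical reference semantics only - a real PAL version would
 * be parallelized across cores. */
static void conv3d_bias_clamp_pool_f32(
    const float *x, int w, int h, int ch,
    const float *k, int kw, int kh,
    float bias, float min_out, int pool,
    float *out)
{
    int ow = (w - kw + 1) / pool, oh = (h - kh + 1) / pool;
    for (int oy = 0; oy < oh; oy++)
        for (int ox = 0; ox < ow; ox++) {
            /* Clamping folds into the pooling max: start at min_out. */
            float best = min_out;
            for (int py = 0; py < pool; py++)
                for (int px = 0; px < pool; px++) {
                    int ix0 = ox * pool + px, iy0 = oy * pool + py;
                    float acc = bias;
                    for (int c = 0; c < ch; c++)
                        for (int ky = 0; ky < kh; ky++)
                            for (int kx = 0; kx < kw; kx++)
                                acc += x[((iy0 + ky) * w + ix0 + kx) * ch + c]
                                     * k[(ky * kw + kx) * ch + c];
                    if (acc > best) best = acc;
                }
            out[oy * ow + ox] = best;
        }
}
```

Note that because clamp(a) = max(a, min_out), seeding the pooling maximum with min_out implements clamp-then-pool in one pass, which is the memory-traffic saving argued for above.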
Then imagine functions for training such a thing (backpropagation and accumulating error deltas through the weights).
I think a single step like that would go a long way toward leveraging the Epiphany hardware; you'd have a lot of data reuse, perhaps uploading an entire 3D filter across multiple cores, then streaming an image through it.
This would be a stepping stone to a full neural-net library which could implement pipelines between net layers. Getting some capability into the PAL library might make the Epiphany chip more appealing to neural-net/deep-learning researchers.
Short of that, are there other ways to generalize 2D convolutions to be more useful?
E.g. if the 3rd dimension were interleaved (e.g. [row0 [r0,g0,b0,r1,g1,b1,...], row1 [r0,g0,b0,r1,g1,b1,...], ...]), could you treat it as a 2D convolution with strided input, merely adding 'col_step' and 'row_step' parameters (e.g. col_step=3 for r,g,b input)? This would still require inserting a clamping and max-pool stage into your 2D convolution, and again, if worried about parameter explosion, a simple helper could provide a streamlined interface. Thresholding/clamping is fairly common in image processing, I think (e.g. extracting certain edges from an image, blurring highlights, keeping results in an output range for bit reduction, etc.).
Stepped inputs/outputs would also allow using this function for filtered image downscaling, or perhaps colour-space conversions.
/*2d convolution, extended */
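A serial reference sketch of such an extended 2D convolution with 'col_step'/'row_step', clamping, and optional max-pooling; the name and exact semantics are hypothetical, not PAL's API:

```c
/* 2D convolution, extended: logical pixel (x, y) is read from
 * x[y*row_step + x*col_step], so interleaved or padded layouts can
 * be filtered without repacking.  Clamp seeds the pooling max, as
 * in the fused layer sketched earlier.  Hypothetical reference. */
static void conv2d_ext_f32(const float *x, int w, int h,
                           int col_step, int row_step,
                           const float *k, int kw, int kh,
                           float min_out, int pool, float *out)
{
    int ow = (w - kw + 1) / pool, oh = (h - kh + 1) / pool;
    for (int oy = 0; oy < oh; oy++)
        for (int ox = 0; ox < ow; ox++) {
            float best = min_out;
            for (int py = 0; py < pool; py++)
                for (int px = 0; px < pool; px++) {
                    int ix = ox * pool + px, iy = oy * pool + py;
                    float acc = 0.0f;
                    for (int ky = 0; ky < kh; ky++)
                        for (int kx = 0; kx < kw; kx++)
                            acc += x[(iy + ky) * row_step
                                     + (ix + kx) * col_step]
                                 * k[ky * kw + kx];
                    if (acc > best) best = acc;
                }
            out[oy * ow + ox] = best;
        }
}
```

With col_step = 3 and row_step = 3*width, this filters the red plane of an interleaved RGB image directly; pass x+1 or x+2 for green or blue.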
(Also a minor point: I would have thought it more logical to order 'cols, rows' as per width/height for images stored in memory; you can still label them rows/cols for people thinking about it as a matrix in the linear-algebra sense.)