Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Metal.randn! produces Nan #474

Open
rveltz opened this issue Oct 28, 2024 · 7 comments
Open

Metal.randn! produces Nan #474

rveltz opened this issue Oct 28, 2024 · 7 comments
Labels
upstream Out of our hands

Comments

@rveltz
Copy link

rveltz commented Oct 28, 2024

Hi,

On a M2 Max with

(code) pkg> st
  [e9467ef8] GLMakie v0.10.15
  [63c18a36] KernelAbstractions v0.9.29
  [dde4c033] Metal v1.4.2
  [d96e819e] Parameters v0.12.3
  [91a5bcdd] Plots v1.40.8
  [92933f4c] ProgressMeter v1.10.2
  [295af30f] Revise v3.6.2
  [2913bbd2] StatsBase v0.34.3
  [9a3f8284] Random

I observe:

using Metal
N = 10_000_000
X0 = MtlArray(zeros(Float32, Int64(N) ));
Metal.randn!(X0)
sum(isnan, X0) # random number, seems always > 0

I hope this is not too difficult to fix.

best regards

@maleadt
Copy link
Member

maleadt commented Oct 28, 2024

#321 is part of 1.4.2, so cc @christiangnrd.

@christiangnrd
Copy link
Contributor

christiangnrd commented Oct 28, 2024

I'll look into some more when I have more time, but this seems to be an upstream bug. See pytorch/pytorch#89283

@christiangnrd
Copy link
Contributor

Happens in Swift too. You even get the same NaNs when using the same seed.

Julia example:

julia> using Metal;N=100000000;X0=Metal.zeros(Float32, N);rng = MPS.RNG(device(), 1234); Metal.randn!(rng, X0);collect(1:N)[Array(isnan.(X0))]
9-element Vector{Int64}:
  5281172
 18887991
 23601183
 27449378
 28126551
 43658369
 53930280
 71339505
 76705941
Swift equivalent MWE
import Metal 
import MetalPerformanceShaders
 
func main(T: Float.Type = Float32.self, N: Int = 100000000, seed: Int = 1234) { 
    guard let device = MTLCreateSystemDefaultDevice(), 
          let commandQueue = device.makeCommandQueue() else { 
        fatalError("Metal device or command queue could not be created") 
          } 

    var a = [Float](repeating: 1, count: N)
 
    let aBuffer = device.makeBuffer(bytes: &a, length: MemoryLayout<Float>.size * N, options: [])
    let aVectorDescriptor = MPSVectorDescriptor(length: N, dataType: .float32) 
    let aVector = MPSVector(buffer: aBuffer!, descriptor: aVectorDescriptor)

    let randesc = MPSMatrixRandomDistributionDescriptor.normalDistributionDescriptor(withMean: 0, standardDeviation: 1)
    var rand = MPSMatrixRandomPhilox(device: device, destinationDataType: .float32, seed: seed, distributionDescriptor: randesc)

    let commandBuffer = commandQueue.makeCommandBuffer()!
    rand.encode(commandBuffer: commandBuffer, destinationVector: aVector)
    commandBuffer.commit()
    commandBuffer.waitUntilCompleted()



    // Check for NaNs in the result matrix 
    let aPointer = aBuffer!.contents().bindMemory(to: Float.self, capacity: N) 
    var j = 0
    var c = 0
    print(String(format: "Found NaNs:", c))
    while j < N {
        if aPointer[j].isNaN {
            c += 1
            print(j)
        }
        j += 1
    }
    print(String(format: "\n%d NaN in result\n", c))
}

main()

Swift Output:

Found NaNs:
5281171
18887990
23601182
27449377
28126550
43658368
53930279
71339504
76705940

9 NaN in result

@christiangnrd christiangnrd added the upstream Out of our hands label Oct 29, 2024
@tgymnich
Copy link
Member

Did you post the Swift MWE on the Apple forums before to see what they have to say? I suppose an autorelease pool will fix this as always.

@christiangnrd
Copy link
Contributor

christiangnrd commented Oct 29, 2024

Did you post the Swift MWE on the Apple forums before to see what they have to say?

https://developer.apple.com/forums/thread/767452

I suppose an autorelease pool will fix this as always.

I don't think so this time. Using a uniform generator has no issues. That and the consistent locations of the NaNs in the result lead me to believe it's a bug in their shaders

@rveltz
Copy link
Author

rveltz commented Dec 17, 2024

For what is worth:

In [4]:  import mlx.core as mx
   ...: n = 100000000
   ...: for i in range(10):
   ...:     a = mx.random.normal((1000000000,), stream = mx.gpu)
   ...:     print(mx.sum(mx.isnan(a)))
   ...: 
array(0, dtype=int32)
array(0, dtype=int32)
array(0, dtype=int32)
array(0, dtype=int32)
array(0, dtype=int32)
array(0, dtype=int32)
array(0, dtype=int32)
array(0, dtype=int32)
array(0, dtype=int32)
array(0, dtype=int32)

@christiangnrd
Copy link
Contributor

I received a reply from an Apple Engineer about this today on my forum post. They just asked me to report it via feedback assistant but it's a sign they may look into it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
upstream Out of our hands
Projects
None yet
Development

No branches or pull requests

4 participants