Binary Incompatibility Issue between scikit-learn and numpy causing ValueError with murmurhash #2334

JGAUG26 · 2024-09-11T10:07:27Z

I'm encountering a binary incompatibility issue between scikit-learn and numpy while running a script in a Python environment. The error occurs when trying to import CountVectorizer from sklearn.feature_extraction.text. Below is the error traceback:

Traceback (most recent call last):
  File "/tmp/dolphinscheduler/exec/process/root/1/14927164157472_1/564/683/py_564_683.py", line 13, in <module>
    from sklearn.feature_extraction.text import CountVectorizer
  File "/scity/miniconda3/envs/dp-pdfocr/lib/python3.9/site-packages/sklearn/__init__.py", line 82, in <module>
    from .base import clone
  File "/scity/miniconda3/envs/dp-pdfocr/lib/python3.9/site-packages/sklearn/base.py", line 17, in <module>
    from .utils import _IS_32BIT
  File "/scity/miniconda3/envs/dp-pdfocr/lib/python3.9/site-packages/sklearn/utils/__init__.py", line 19, in <module>
    from .murmurhash import murmurhash3_32
  File "sklearn/utils/murmurhash.pyx", line 1, in init sklearn.utils.murmurhash
ValueError: numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject
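
For anyone debugging the same traceback: this ValueError is raised at import time, when a compiled extension checks the size of `numpy.dtype` it was built against ("from C header") against the size reported by the NumPy that is actually installed ("from PyObject"). The 96 vs. 88 pair is commonly reported when an extension built against NumPy 2.x is loaded under NumPy 1.x, though that reading is an assumption here and not something the traceback proves. Below is a minimal sketch of a wrapper that catches this specific message and re-raises it with a more direct hint. The function name `import_with_abi_hint` and its `importer` parameter are hypothetical, introduced only for illustration.

```python
import importlib


def import_with_abi_hint(module_name, importer=importlib.import_module):
    """Import a module, re-raising NumPy ABI mismatches with a clearer hint.

    `importer` is a hypothetical hook so the behaviour can be exercised
    without a broken environment; by default it is importlib.import_module.
    """
    try:
        return importer(module_name)
    except ValueError as exc:
        # Only intercept the ABI-mismatch message; re-raise anything else as-is.
        if "binary incompatibility" not in str(exc):
            raise
        raise ImportError(
            f"{module_name} could not be imported because a compiled extension "
            "was built against a different NumPy than the one installed. "
            "Reinstalling the package and NumPy together in a clean "
            f"environment is the usual fix. Original error: {exc}"
        ) from exc
```

Calling `import_with_abi_hint("sklearn")` at the top of the script, before any other scikit-learn import, would surface the hint in the scheduler logs instead of the bare ValueError.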

Environment Details:
numpy version: 1.26.4
scikit-learn version: 1.0
Python version: 3.9
OS: Linux (running on server via dolphinscheduler)

I have tried upgrading scikit-learn and downgrading numpy to older versions, but the issue persists. I've also attempted recompiling scikit-learn using --no-binary :all: but that did not resolve the problem.
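
Since the job runs through a scheduler, it may be worth confirming that the interpreter it launches really sees the versions listed above. The following is a minimal sketch, assuming only the standard library (`importlib.metadata` is available from Python 3.8). It reads the installed versions from package metadata and locates the modules without importing them, so it still works when `import sklearn` fails. The helper names `describe_environment` and `module_location` are hypothetical.

```python
import sys
from importlib import metadata, util


def describe_environment(packages=("numpy", "scikit-learn")):
    """Return the interpreter path and installed versions, without importing the packages."""
    report = {"python": sys.version.split()[0], "executable": sys.executable}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed for this interpreter.
            report[name] = None
    return report


def module_location(module_name):
    """Return the file a top-level module would be loaded from, or None if not found."""
    spec = util.find_spec(module_name)
    return spec.origin if spec is not None else None


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
    for mod in ("numpy", "sklearn"):
        print(f"{mod} location: {module_location(mod)}")
```

Running this as the first task in the same dolphinscheduler workflow shows whether the job uses the `dp-pdfocr` environment and whether the reported versions match what `pip list` shows in an interactive shell.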

Code Snippet (for context):
Here's a simplified version of the operator I'm running, which uses whisper for audio transcription (note that the issue does not directly relate to this code but rather to the environment setup):

import os
import whisper

def read_audio_file(file_path):
    """Reads the audio file as raw bytes."""
    with open(file_path, "rb") as file:
        return file.read()

def write_text_file(file_path, text):
    """Writes the transcribed text to the output file."""
    with open(file_path, "w", encoding="utf-8") as file:
        file.write(text)

def clean_files(nas_source_path, nas_converted_path, config: dict):
    """Processes audio files and transcribes them using Whisper."""
    # Load the Whisper model once instead of once per file
    model = whisper.load_model("small")

    for filename in os.listdir(nas_source_path):
        if not filename.endswith(('.mp3', '.wav', '.aac')):
            continue

        input_file = os.path.join(nas_source_path, filename)
        # splitext handles extensions of any length, unlike filename[:-4]
        base_name = os.path.splitext(filename)[0]
        output_file = os.path.join(nas_converted_path, base_name + '.txt')

        # Perform transcription
        result = model.transcribe(input_file)

        # Save the transcribed text
        write_text_file(output_file, result['text'])

Can you suggest how I can resolve this binary incompatibility issue between numpy and scikit-learn? Any guidance or suggestions would be greatly appreciated. Thank you!

giorgiococci · 2024-09-20T06:47:42Z

Hi @JGAUG26, I'm having the same issue. How did you resolve it? Thanks!

dmadea Oct 21, 2024

I upgraded to Python 3.12 and it works.
