UnboundLocalError: local variable 'pipe' referenced before assignment #256

Open
SatyaRamGV opened this issue Nov 15, 2018 · 17 comments

@SatyaRamGV

text = textract.process(file, method='pdfminer')

Error:
UnboundLocalError Traceback (most recent call last)
in ()
----> 1 text = textract.process(file, method='pdfminer')

~/.local/lib/python3.6/site-packages/textract/parsers/__init__.py in process(filename, encoding, extension, **kwargs)
75
76 parser = filetype_module.Parser()
---> 77 return parser.process(filename, encoding, **kwargs)
78
79

~/.local/lib/python3.6/site-packages/textract/parsers/utils.py in process(self, filename, encoding, **kwargs)
44 # output encoding
45 # http://nedbatchelder.com/text/unipain/unipain.html#35
---> 46 byte_string = self.extract(filename, **kwargs)
47 unicode_string = self.decode(byte_string)
48 return self.encode(unicode_string, encoding)

~/.local/lib/python3.6/site-packages/textract/parsers/pdf_parser.py in extract(self, filename, method, **kwargs)
29
30 elif method == 'pdfminer':
---> 31 return self.extract_pdfminer(filename, **kwargs)
32 elif method == 'tesseract':
33 return self.extract_tesseract(filename, **kwargs)

~/.local/lib/python3.6/site-packages/textract/parsers/pdf_parser.py in extract_pdfminer(self, filename, **kwargs)
46 def extract_pdfminer(self, filename, **kwargs):
47 """Extract text from pdfs using pdfminer."""
---> 48 stdout, _ = self.run(['pdf2txt.py', filename])
49 return stdout
50

~/.local/lib/python3.6/site-packages/textract/parsers/utils.py in run(self, args)
94 # pipe.wait() ends up hanging on large files. using
95 # pipe.communicate appears to avoid this issue
---> 96 stdout, stderr = pipe.communicate()
97
98 # if pipe is busted, raise an error (unlike Fabric)

UnboundLocalError: local variable 'pipe' referenced before assignment

Originally posted by @SatyaRamGV in https://github.com/deanmalmgren/textract/issue_comments#issuecomment-439043876

@olivx

olivx commented Jan 15, 2019

I need to extract text from many PDFs and I have the same problem.
Did you fix it? Which solution did you choose?

@absingh2019

I have the same problem. Do you have a solution for it?

@karlrobertjanicki

Have you tried running it as sudo?
That solved it for me.

@jpweytjens
Contributor

@SatyaRamGV can you try textract 1.6.2? I can't reproduce this issue on my end.

This was referenced Jul 25, 2019
@SatyaRamGV
Author

@SatyaRamGV can you try textract 1.6.2? I can't reproduce this issue on my end.

This error is with 1.6.1.

I think it is solved in 1.6.2, but v1.6.2 is not available as a PyPI package, so you should install it from the git repo.

@jpweytjens
Contributor

I'm closing this issue due to inactivity. If you still encounter the issue with the latest version of textract, feel free to leave a comment with additional information and I'll reopen the issue.

@ewerkema

ewerkema commented Sep 21, 2019

I get the same error with textract 1.6.3 on Linux in a Docker container. The error doesn't occur locally (on Windows). Maybe related to this issue on Stack Overflow.

2019-09-21T12:50:41.552392889Z Traceback (most recent call last):
2019-09-21T12:50:41.552428789Z   File "/app/src/processors/document2text.py", line 32, in process
2019-09-21T12:50:41.552441289Z     text = textract.process(document.path)
2019-09-21T12:50:41.552451789Z   File "/usr/local/lib/python3.6/site-packages/textract/parsers/__init__.py", line 77, in process
2019-09-21T12:50:41.552462289Z     return parser.process(filename, encoding, **kwargs)
2019-09-21T12:50:41.552472189Z   File "/usr/local/lib/python3.6/site-packages/textract/parsers/utils.py", line 46, in process
2019-09-21T12:50:41.552482389Z     byte_string = self.extract(filename, **kwargs)
2019-09-21T12:50:41.552492189Z   File "/usr/local/lib/python3.6/site-packages/textract/parsers/pdf_parser.py", line 20, in extract
2019-09-21T12:50:41.552502389Z     return self.extract_pdftotext(filename, **kwargs)
2019-09-21T12:50:41.552512089Z   File "/usr/local/lib/python3.6/site-packages/textract/parsers/pdf_parser.py", line 43, in extract_pdftotext
2019-09-21T12:50:41.552522089Z     stdout, _ = self.run(args)
2019-09-21T12:50:41.552531989Z   File "/usr/local/lib/python3.6/site-packages/textract/parsers/utils.py", line 96, in run
2019-09-21T12:50:41.552549489Z     stdout, stderr = pipe.communicate()
2019-09-21T12:50:41.552950690Z UnboundLocalError: local variable 'pipe' referenced before assignment

@jpweytjens
Contributor

@ewerkema Thanks for the Stack Overflow link. I have no experience with Docker, but I did find this issue which might be related. Can you comment on whether this is the same issue?

Textract relies on the external command-line tool pdftotext. Is it available in your Docker container? If it isn't available, textract catches the error and falls back on the Python module pdfminer to process the PDF file. I think Docker might be raising a different kind of error that we don't check for.
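A quick way to check whether the external tools are visible inside the container is to look them up on PATH before calling textract. This is only a minimal sketch: `missing_tools` is a hypothetical helper, not part of textract, and the tool names are simply the ones mentioned in this thread.

```python
import shutil


def missing_tools(names):
    """Return the subset of `names` that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    # Tool names taken from this thread; adjust for the file types you parse.
    print(missing_tools(["pdftotext", "pdf2txt.py", "antiword"]))
```

An empty list means every tool was found. A tool that is present can still fail for other reasons (for example, insufficient memory), so this only rules out the "not installed" case.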

@jpweytjens jpweytjens reopened this Sep 24, 2019
@ewerkema

@jpweytjens It was actually a memory problem in the Docker container. Due to insufficient memory, pdftotext failed, which caused the UnboundLocalError. Following the installation instructions for the system packages (using the apt-get package manager) and increasing the memory solved the issue for me.

@ghost

ghost commented May 8, 2020

I think I know where this comes from: this bit of code in ShellParser:

        # run a subprocess and put the stdout and stderr on the pipe object
        try:
            pipe = subprocess.Popen(
                args,
                stdout=subprocess.PIPE, stderr=subprocess.PIPE,
            )
        except OSError as e:
            if e.errno == errno.ENOENT:
                # File not found.
                # This is equivalent to getting exitcode 127 from sh
                raise exceptions.ShellError(
                    ' '.join(args), 127, '', '',
                )

...coupled with forking issues on Unix: https://stackoverflow.com/questions/5306075/python-memory-allocation-error-using-subprocess-popen

Since the out-of-memory error is an OSError, it gets caught in the except block, but then eaten; the program tries to continue but since the assignment to pipe failed, it's not defined, hence the error message.
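The failure mode described here can be reproduced without textract. In this minimal sketch, `fake_popen` is a hypothetical stand-in for `subprocess.Popen` failing with an out-of-memory error, and `run` mirrors only the shape of the try/except above, not the library's actual code.

```python
import errno


def fake_popen():
    """Hypothetical stand-in for subprocess.Popen failing with out-of-memory."""
    raise OSError(errno.ENOMEM, "Cannot allocate memory")


def run():
    try:
        pipe = fake_popen()
    except OSError as e:
        if e.errno == errno.ENOENT:
            raise RuntimeError("command not found")
        # Any other OSError is silently dropped here.
    # 'pipe' was never assigned, so this line raises UnboundLocalError.
    return pipe
```

Calling `run()` raises UnboundLocalError rather than the original OSError, which is why the traceback gives no hint about the real cause.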

This could be alleviated by adding a bare raise after the errno check, at least to make it clearer what the actual error is. I could submit a PR if necessary?
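A minimal sketch of that suggestion, written as a standalone function rather than a patch to textract. `ShellError` here is a hypothetical stand-in for `exceptions.ShellError`, and the return-code check after `communicate()` is an assumption about what the rest of the method does.

```python
import errno
import subprocess
import sys


class ShellError(Exception):
    """Hypothetical stand-in for textract's exceptions.ShellError."""

    def __init__(self, command, exit_code, stdout, stderr):
        super().__init__(f"{command!r} failed with exit code {exit_code}")
        self.exit_code = exit_code


def run(args):
    """Run a command and return (stdout, stderr) as bytes."""
    try:
        pipe = subprocess.Popen(
            args,
            stdout=subprocess.PIPE, stderr=subprocess.PIPE,
        )
    except OSError as e:
        if e.errno == errno.ENOENT:
            # Command not found: same as exit code 127 from sh.
            raise ShellError(' '.join(args), 127, '', '')
        # Re-raise every other OSError (e.g. out of memory) instead of
        # falling through with 'pipe' undefined.
        raise

    stdout, stderr = pipe.communicate()
    if pipe.returncode != 0:
        raise ShellError(' '.join(args), pipe.returncode, stdout, stderr)
    return stdout, stderr


if __name__ == "__main__":
    out, _ = run([sys.executable, "-c", "print('ok')"])
    print(out)
```

With the bare `raise`, an out-of-memory failure surfaces as the original OSError, so the traceback points at the real cause instead of at `pipe.communicate()`.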

@VenkateshDharavath

@SatyaRamGV I tried textract==1.6.1, textract==1.6.2, and textract==1.6.3. All of these versions throw this error. I'm on Windows 10 and have enough memory to perform this task, but I still get the same error.

Traceback (most recent call last):

File "", line 1, in
text = textract.process(r"C:..\docs\Mortgage Security Agreement\Closed End PA MTG 5000.39.pdf", method='pdfminer')

File "C:..\venv\lib\site-packages\textract\parsers\__init__.py", line 77, in process
return parser.process(filename, encoding, **kwargs)

File "C:..\venv\lib\site-packages\textract\parsers\utils.py", line 46, in process
byte_string = self.extract(filename, **kwargs)

File "C:..\venv\lib\site-packages\textract\parsers\pdf_parser.py", line 31, in extract
return self.extract_pdfminer(filename, **kwargs)

File "C:..\venv\lib\site-packages\textract\parsers\pdf_parser.py", line 48, in extract_pdfminer
stdout, _ = self.run(['pdf2txt.py', filename])

File "C:..\venv\lib\site-packages\textract\parsers\utils.py", line 96, in run
stdout, stderr = pipe.communicate()

UnboundLocalError: local variable 'pipe' referenced before assignment

@nateGeorge

I had this problem when trying to read .doc files because I didn't have antiword properly installed. If you are on Windows 10 and are trying to read .doc files, you need antiword from here: https://www.softpedia.com/get/Office-tools/Other-Office-Tools/Antiword.shtml

https://stackoverflow.com/a/51727238/4549682

@PPGHPP

PPGHPP commented Aug 17, 2021

I have exactly the same error that VenkateshDharavath reported on Nov 5, 2020.
I'm on Windows 10, have enough memory, and have the latest versions installed.

@traverseda
Collaborator

@PPGHPP I've made some changes that should make the actual error clearer, though they're not deployed yet. Can you try installing from master?

It should be a command like pip install git+https://github.com/deanmalmgren/textract.git, although I'm not sure how you installed it on Windows.

@PPGHPP

PPGHPP commented Aug 17, 2021 via email

@PPGHPP

PPGHPP commented Aug 17, 2021 via email

@traverseda
Collaborator

I think that probably has something to do with chardet. The next release should help.


10 participants