`UnboundLocalError: local variable 'pipe' referenced before assignment` #256

SatyaRamGV · 2018-11-15T13:43:26Z

text = textract.process(file, method='pdfminer')

Error:
UnboundLocalError Traceback (most recent call last)
in ()
----> 1 text = textract.process(file, method='pdfminer')

~/.local/lib/python3.6/site-packages/textract/parsers/__init__.py in process(filename, encoding, extension, **kwargs)
75
76 parser = filetype_module.Parser()
---> 77 return parser.process(filename, encoding, **kwargs)
78
79

~/.local/lib/python3.6/site-packages/textract/parsers/utils.py in process(self, filename, encoding, **kwargs)
44 # output encoding
45 # http://nedbatchelder.com/text/unipain/unipain.html#35
---> 46 byte_string = self.extract(filename, **kwargs)
47 unicode_string = self.decode(byte_string)
48 return self.encode(unicode_string, encoding)

~/.local/lib/python3.6/site-packages/textract/parsers/pdf_parser.py in extract(self, filename, method, **kwargs)
29
30 elif method == 'pdfminer':
---> 31 return self.extract_pdfminer(filename, **kwargs)
32 elif method == 'tesseract':
33 return self.extract_tesseract(filename, **kwargs)

~/.local/lib/python3.6/site-packages/textract/parsers/pdf_parser.py in extract_pdfminer(self, filename, **kwargs)
46 def extract_pdfminer(self, filename, **kwargs):
47 """Extract text from pdfs using pdfminer."""
---> 48 stdout, _ = self.run(['pdf2txt.py', filename])
49 return stdout
50

~/.local/lib/python3.6/site-packages/textract/parsers/utils.py in run(self, args)
94 # pipe.wait() ends up hanging on large files. using
95 # pipe.communicate appears to avoid this issue
---> 96 stdout, stderr = pipe.communicate()
97
98 # if pipe is busted, raise an error (unlike Fabric)

UnboundLocalError: local variable 'pipe' referenced before assignment

Originally posted by @SatyaRamGV in https://github.com/deanmalmgren/textract/issue_comments#issuecomment-439043876


olivx · 2019-01-15T17:40:23Z

I need to extract text from many PDFs and I have the same problem.
Did you fix it? Which solution did you choose?

absingh2019 · 2019-01-16T07:50:11Z

I have the same problem. Do you have a solution for it?

karlrobertjanicki · 2019-06-26T10:47:11Z

Have you tried running it with sudo?
That solved it for me.

jpweytjens · 2019-07-25T12:53:29Z

@SatyaRamGV can you try textract 1.6.2? I can't reproduce this issue on my end.

SatyaRamGV · 2019-07-25T13:23:54Z

> @SatyaRamGV can you try textract 1.6.2? I can't reproduce this issue on my end.

This error is with 1.6.1.

I think it is solved in 1.6.2, but v1.6.2 is not available as a PyPI package, so you have to install it from the git repo.

jpweytjens · 2019-08-27T09:24:47Z

I'm closing this issue due to inactivity. If you still encounter the issue with the latest version of textract, feel free to leave a comment with additional information and I'll reopen the issue.

ewerkema · 2019-09-21T13:10:39Z

Same error in textract 1.6.3 on Linux from a Docker container. This error doesn't occur locally (on Windows). Maybe related to this issue on Stackoverflow.

2019-09-21T12:50:41.552392889Z Traceback (most recent call last):
2019-09-21T12:50:41.552428789Z   File "/app/src/processors/document2text.py", line 32, in process
2019-09-21T12:50:41.552441289Z     text = textract.process(document.path)
2019-09-21T12:50:41.552451789Z   File "/usr/local/lib/python3.6/site-packages/textract/parsers/__init__.py", line 77, in process
2019-09-21T12:50:41.552462289Z     return parser.process(filename, encoding, **kwargs)
2019-09-21T12:50:41.552472189Z   File "/usr/local/lib/python3.6/site-packages/textract/parsers/utils.py", line 46, in process
2019-09-21T12:50:41.552482389Z     byte_string = self.extract(filename, **kwargs)
2019-09-21T12:50:41.552492189Z   File "/usr/local/lib/python3.6/site-packages/textract/parsers/pdf_parser.py", line 20, in extract
2019-09-21T12:50:41.552502389Z     return self.extract_pdftotext(filename, **kwargs)
2019-09-21T12:50:41.552512089Z   File "/usr/local/lib/python3.6/site-packages/textract/parsers/pdf_parser.py", line 43, in extract_pdftotext
2019-09-21T12:50:41.552522089Z     stdout, _ = self.run(args)
2019-09-21T12:50:41.552531989Z   File "/usr/local/lib/python3.6/site-packages/textract/parsers/utils.py", line 96, in run
2019-09-21T12:50:41.552549489Z     stdout, stderr = pipe.communicate()
2019-09-21T12:50:41.552950690Z UnboundLocalError: local variable 'pipe' referenced before assignment

jpweytjens · 2019-09-24T20:38:14Z

@ewerkema Thanks for the Stackoverflow link. I have no experience with Docker, but I did find this issue which might be related. Can you comment if this is the same issue?

Textract relies on the external command line tool pdftotext. Is this available in your Docker container? If it isn't available, textract catches the error and falls back on the python module pdfminer to process the pdf file. I think Docker might be raising a different kind of error that we don't check for.
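One way to check whether the external tools named in this thread are reachable from a given environment (for example inside the Docker container) is to look them up on PATH. This is a minimal sketch using only the standard library; the helper name `missing_tools` is made up for illustration, and the tool list is just the commands that appear in this thread:

```python
import shutil


def missing_tools(tools):
    """Return the command names from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


# Command names taken from this thread; adjust for the file types you process.
print(missing_tools(["pdftotext", "pdf2txt.py", "antiword"]))
```

An empty list means all three commands were found. It says nothing about memory limits, so a tool can be present and still fail to start.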

ewerkema · 2019-09-25T07:15:45Z

@jpweytjens It was actually a memory problem of the Docker container. Due to insufficient memory, pdftotext failed, which caused the UnboundLocalError. Following the installation instructions for the system packages (using the apt-get package manager) and increasing the memory solved the issue for me.

ghost · 2020-05-08T16:05:09Z

I think I know where this comes from: this bit of code in ShellParser:

        # run a subprocess and put the stdout and stderr on the pipe object
        try:
            pipe = subprocess.Popen(
                args,
                stdout=subprocess.PIPE, stderr=subprocess.PIPE,
            )
        except OSError as e:
            if e.errno == errno.ENOENT:
                # File not found.
                # This is equivalent to getting exitcode 127 from sh
                raise exceptions.ShellError(
                    ' '.join(args), 127, '', '',
                )

...coupled with forking issues on Unix: https://stackoverflow.com/questions/5306075/python-memory-allocation-error-using-subprocess-popen

Since the out-of-memory error is an OSError, it gets caught in the except block, but then eaten; the program tries to continue but since the assignment to pipe failed, it's not defined, hence the error message.
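The control flow described above can be reproduced without textract or a real subprocess. In this sketch `popen_stub` and `run` are hypothetical stand-ins (not textract code), and `RuntimeError` takes the place of `exceptions.ShellError`:

```python
import errno
import os


def popen_stub(simulated_errno):
    """Stand-in for subprocess.Popen that always fails with the given errno."""
    raise OSError(simulated_errno, os.strerror(simulated_errno))


def run(simulated_errno):
    try:
        pipe = popen_stub(simulated_errno)  # raises before `pipe` is bound
    except OSError as e:
        if e.errno == errno.ENOENT:
            raise RuntimeError("command not found") from e
        # Any other OSError is swallowed here, as in the snippet above.
    return pipe  # `pipe` was never assigned if we got here via the except


results = {}
for code in (errno.ENOENT, errno.ENOMEM):
    try:
        run(code)
    except Exception as exc:
        results[errno.errorcode[code]] = type(exc).__name__

print(results)
```

Only the ENOENT case surfaces as a meaningful error; the ENOMEM case ends in `UnboundLocalError`, which hides the real cause.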

This could be alleviated by adding a bare raise after the errno check, at least to make it clearer what the actual error is. I could submit a PR if necessary?
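The suggested change could look roughly like this. It is a sketch, not textract's actual code: `ShellError` is a local stand-in for `exceptions.ShellError`, and `run` is written as a free function rather than a method of ShellParser.

```python
import errno
import subprocess
import sys


class ShellError(Exception):
    """Local stand-in for textract's exceptions.ShellError (assumption)."""


def run(args):
    try:
        pipe = subprocess.Popen(
            args,
            stdout=subprocess.PIPE, stderr=subprocess.PIPE,
        )
    except OSError as e:
        if e.errno == errno.ENOENT:
            # Command not found: same case the original snippet handles.
            raise ShellError("command not found: " + " ".join(args)) from e
        # Anything else (for example ENOMEM) is re-raised unchanged
        # instead of being swallowed.
        raise
    stdout, stderr = pipe.communicate()
    return stdout, stderr


# Usage: run the current Python interpreter as a harmless child process.
stdout, _ = run([sys.executable, "-c", "print('ok')"])
print(stdout)
```

With the bare `raise`, a failure such as running out of memory would surface as the original `OSError` with its own errno and message, rather than as an `UnboundLocalError` a few lines later.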

VenkateshDharavath · 2020-11-05T06:18:38Z

@SatyaRamGV I tried with versions textract==1.6.1, textract==1.6.2, and textract==1.6.3. All of these versions throw this error. I'm on Windows 10. I have enough memory to perform this task, but I still get the same error.

Traceback (most recent call last):

File "", line 1, in
text = textract.process(r"C:..\docs\Mortgage Security Agreement\Closed End PA MTG 5000.39.pdf", method='pdfminer')

File "C:..\venv\lib\site-packages\textract\parsers\__init__.py", line 77, in process
return parser.process(filename, encoding, **kwargs)

File "C:..\venv\lib\site-packages\textract\parsers\utils.py", line 46, in process
byte_string = self.extract(filename, **kwargs)

File "C:..\venv\lib\site-packages\textract\parsers\pdf_parser.py", line 31, in extract
return self.extract_pdfminer(filename, **kwargs)

File "C:..\venv\lib\site-packages\textract\parsers\pdf_parser.py", line 48, in extract_pdfminer
stdout, _ = self.run(['pdf2txt.py', filename])

File "C:..\venv\lib\site-packages\textract\parsers\utils.py", line 96, in run
stdout, stderr = pipe.communicate()

UnboundLocalError: local variable 'pipe' referenced before assignment

nateGeorge · 2021-01-31T03:38:03Z

I had this problem when trying to read .doc files because I didn't have antiword properly installed. If you are on Windows 10 and are trying to read .doc files, you need antiword from here: https://www.softpedia.com/get/Office-tools/Other-Office-Tools/Antiword.shtml

https://stackoverflow.com/a/51727238/4549682

PPGHPP · 2021-08-17T12:22:45Z

I have exactly the same error that VenkateshDharavath reported on Nov 5, 2020.
I'm on Windows 10 and have enough memory and the latest installations.

traverseda · 2021-08-17T12:29:00Z

@PPGHPP I've made some changes that should make the actual error clearer, they're not deployed yet though. Can you try installing from master?

It should be a command like pip install git+https://github.com/deanmalmgren/textract.git, although I'm not sure how you installed it on windows.

PPGHPP · 2021-08-17T18:14:46Z

Hi,

Thank you for your information. I did pip install as you asked. The only ERROR was: "ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts." It also did this: "Successfully installed pdfminer.six-20191110". Now I'm able to use it like text=textract.process("tacl_a_00344.pdf"), and the result looks OK. Thanks again!

BR PirkkoP

PPGHPP · 2021-08-17T19:01:00Z

Hi again,

One thing I noticed. A sentence comes out of textract like this: "We define improvement as the quantity\r\nmax{0, fa \xe2\x88\x92 fb }, where b is our current..." BUT the OCR-based pytesseract makes it "We define improvement as the quantity\r\nmax{0, fa — fy}. where b is our current ..." This is from p. 764 of the attachment.

BR PirkkoP
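A note on the `\xe2\x88\x92` sequence in the textract output quoted above: assuming that output is the repr of a Python `bytes` value encoded as UTF-8 (an assumption, the thread does not state the encoding), those three bytes are a single character. A minimal sketch using a shortened copy of the quoted fragment:

```python
import unicodedata

# Shortened copy of the fragment quoted above, as a bytes literal.
raw = b"max{0, fa \xe2\x88\x92 fb }"

# Decoding turns the three escaped bytes into one character.
text = raw.decode("utf-8")
print(text)

# Name every non-ASCII character in the decoded text.
non_ascii = [unicodedata.name(ch) for ch in text if ord(ch) > 127]
print(non_ascii)
```

Under that assumption the escaped bytes are a Unicode minus sign (U+2212), so the difference from the pytesseract output would be a matter of decoding rather than of missing text.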

traverseda · 2021-08-18T13:08:17Z

I think that probably has something to do with chardet. The next release should help.

This was referenced Jul 25, 2019:

PDF extract failed! #248 (Open)

textract doesn´t work #241 (Closed)

jpweytjens closed this as completed Aug 27, 2019

jpweytjens reopened this Sep 24, 2019
