How to make wagtail_textract work with Media Storage Backend #15

danieltomasku · 2018-05-09T17:25:13Z

Hi there,

I am trying to use wagtail_textract for my project. I tried previously using just textract but am interested in some of the helper utilities of wagtail_textract. I am wondering how wagtail_textract will work in production with Docker and a Media Storage backend, such as Azure.

The line here is referencing a file path:
text = textract.process(document.file.path).strip()

but when in production using a Media Storage backend, it seems like this will fail because it does not have a proper file system. Has this been tested or does anybody know how I might be able to get this to work? Any help would be much appreciated! Let me know if you need any more info about my project setup.

allcaps · 2018-05-16T15:56:49Z

Hi @danieltomasku ,

Sorry, there is nothing yet for other storage backends, since we only use the default backend for now.
What do you think is needed to make wagtail_textract play well with other storage backends?

PRs are always welcome ;).

danieltomasku · 2018-05-16T16:56:59Z

Hi @allcaps ,

Thanks for your response. I ended up using tempfile to make document indexing work with my Azure Media Storage backend. The solution handles both local development and production. Locally it uses the normal file.path; in production, where the storage backend has no local path, it downloads the file to a temporary file, extracts the text, and then closes the temporary file, which deletes it. Below is the snippet I used:

import logging
import tempfile
from urllib.request import urlretrieve  # Python 3 location of urlretrieve

logger = logging.getLogger(__name__)

# Extensions (without the leading dot) that are passed on for text extraction.
INDEXABLE_EXTENSIONS = ['pdf', 'pptx', 'html', 'htm', 'xls', 'xlsx', 'doc', 'docx', 'rtf', 'txt']

def index_documents(self):
    """Loop through all the documents in a ResourcePage and return a string with all the extracted text."""
    alltext = ''
    for block in self.body:
        if block.block_type != 'resource_document':
            continue
        document = block.value['document']
        if document.file_extension.lower() not in INDEXABLE_EXTENSIONS:
            continue
        try:
            # Local storage: the file has a real filesystem path.
            text = self.extract_text(document.file.path)
        except NotImplementedError:
            # Remote storage has no local path: download to a temporary file instead.
            logger.info('Downloading for search index %s', document.file.url)
            suffix = '.%s' % document.file_extension
            with tempfile.NamedTemporaryFile('w+b', suffix=suffix) as f:
                urlretrieve(document.file.url, filename=f.name)
                text = self.extract_text(f.name)
            # Leaving the with block closes and deletes the temporary file.
        alltext += text.decode('utf-8')
    return alltext

The relevant section is the except block. In my case, I am looping through a StreamField where users add documents to the page, and indexing each document individually. The same approach could be applied to the way wagtail_textract indexes documents through the transcribe_document handler.
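
One way the download step might be pulled out of the StreamField loop, so that the same logic could be reused from a handler such as transcribe_document, is sketched below. The helper name extract_via_tempfile and its parameters are hypothetical and not part of wagtail_textract. It copies any readable binary file object to a temporary file, and extract stands in for a path-based callable such as textract.process.

```python
import os
import shutil
import tempfile


def extract_via_tempfile(fileobj, extension, extract):
    """Copy a binary file object to a temporary file and run extract(path) on it.

    Hypothetical helper: `extract` is any callable that takes a filesystem
    path, for example textract.process.
    """
    # Keep the extension: path-based extractors often pick a parser from it.
    fd, path = tempfile.mkstemp(suffix='.' + extension)
    try:
        with os.fdopen(fd, 'wb') as tmp:
            shutil.copyfileobj(fileobj, tmp)
        return extract(path)
    finally:
        # Always remove the temporary file, even if extraction fails.
        os.remove(path)
```

Reading the file through the storage backend rather than its public URL would presumably also work for private containers. Passing document.file and document.file_extension to this helper is an untested assumption about how it would be wired in.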

Thoughts on the approach?

khink · 2018-05-24T09:37:15Z

Hi @danieltomasku ,

Thanks for bringing this use case to attention.

That approach looks pretty good to me.

Is there some way we can make this easier? Maybe the transcribe_document() method could have some pluggable helper method?
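
As a rough illustration of what such a pluggable helper could look like, the sketch below passes a path resolver into the transcription step. All names here (transcribe, local_path, resolve_path) are hypothetical and not the current wagtail_textract API, and extract stands in for textract.process.

```python
from contextlib import contextmanager


@contextmanager
def local_path(document):
    """Default resolver: assumes the storage backend exposes a filesystem path."""
    yield document.file.path


def transcribe(document, extract, resolve_path=local_path):
    """Extract text from a document, getting its path from a swappable resolver.

    A project using remote storage could pass its own context manager that
    downloads the file to a temporary location and removes it on exit.
    """
    with resolve_path(document) as path:
        return extract(path).strip()
```

Using a context manager as the hook keeps any cleanup (such as deleting a downloaded copy) in the resolver, so the transcription code itself stays storage-agnostic.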

allcaps · 2018-05-28T07:12:57Z

Seems like Wagtail does the 'is-this-a-local-file' check in the same way: https://github.com/wagtail/wagtail/blob/7034cd131774b8971ff3c7424999a28164480f29/wagtail/documents/views/serve.py#L35
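
The check relies on Django storage backends that have no local filesystem raising NotImplementedError when path is accessed. A minimal sketch of that pattern follows; get_local_path is a hypothetical name, and the exact code in the linked Wagtail view may differ.

```python
def get_local_path(file):
    """Return the file's filesystem path, or None if the storage has none."""
    try:
        return file.path
    except NotImplementedError:
        # Django's base Storage.path() raises this for backends without local paths.
        return None
```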

khink · 2018-06-05T08:00:34Z

Hi @danieltomasku, I'd be happy to accept a PR if that helps.
