How to make wagtail_textract work with Media Storage Backend #15

Open
danieltomasku opened this issue May 9, 2018 · 5 comments

@danieltomasku

Hi there,

I am trying to use wagtail_textract for my project. I previously tried using just textract, but I am interested in some of the helper utilities of wagtail_textract. I am wondering how wagtail_textract will work in production with Docker and a media storage backend, such as Azure.

The line here is referencing a file path:
text = textract.process(document.file.path).strip()

In production with a media storage backend, though, it seems like this will fail, because the files are not stored on a local file system and so have no file path. Has this been tested, or does anybody know how I might be able to get this to work? Any help would be much appreciated! Let me know if you need any more info about my project setup.

@allcaps
Member

allcaps commented May 16, 2018

Hi @danieltomasku ,

Sorry, there is nothing yet for other storage backends, since we only use the default backend for now.
What do you think is needed to make wagtail_textract play well with other storage backends?

PRs are always welcome ;).

@danieltomasku
Author

Hi @allcaps ,

Thanks for your response. I ended up using tempfile to make document indexing work with my Azure media storage backend. The solution works in both local development and production. Locally it uses the normal file.path. In production, where file.path raises NotImplementedError, it downloads the document to a temporary file, extracts the text from that, and then closes the temporary file, which also deletes it. Below is the snippet I used:

import logging
import tempfile
from urllib.request import urlretrieve

logger = logging.getLogger(__name__)

# Extensions are compared without a leading dot (the original list had '.txt', which could never match).
INDEXABLE_EXTENSIONS = ['pdf', 'pptx', 'html', 'htm', 'xls', 'xlsx', 'doc', 'docx', 'rtf', 'txt']

def index_documents(self):
    """Loops through all the documents in a ResourcePage and returns a string with all the extracted text."""
    alltext = ''
    for block in self.body:
        if block.block_type != 'resource_document':
            continue
        document = block.value['document']
        if document.file_extension.lower() not in INDEXABLE_EXTENSIONS:
            continue
        try:
            # Local storage: the file has a real path on disk.
            text = self.extract_text(document.file.path)
        except NotImplementedError:
            # Remote storage: download to a temporary file, which is deleted when the block exits.
            logger.info('Downloading for search index %s', document.file.url)
            with tempfile.NamedTemporaryFile('w+b', suffix='.%s' % document.file_extension) as f:
                urlretrieve(document.file.url, filename=f.name)
                text = self.extract_text(f.name)
        alltext += text.decode("utf-8")
    return alltext

The relevant section is the except block. In my case, I am looping through a StreamField where users add documents to the page, and indexing each document individually. The same approach could be applied to the way wagtail_textract indexes documents through the transcribe_document handler.

Thoughts on the approach?
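For readers who want to reuse the temporary-file idea outside the StreamField loop, here is a minimal sketch of it as a standalone context manager. The name `temporary_copy` is hypothetical and not part of wagtail_textract. It copies from any open binary file-like object instead of downloading from a URL, on the assumption that the storage backend can hand you such an object (in Django that would typically be the result of opening the document's file in `'rb'` mode).

```python
import contextlib
import os
import shutil
import tempfile


@contextlib.contextmanager
def temporary_copy(source, extension):
    """Copy a binary file-like object to a temp file and yield its path.

    `temporary_copy` is a hypothetical helper, not a wagtail_textract API.
    The temp file is removed when the block exits, even on error.
    """
    # delete=False lets us close the file before another reader opens it by name.
    tmp = tempfile.NamedTemporaryFile('w+b', suffix='.' + extension, delete=False)
    try:
        with tmp:
            shutil.copyfileobj(source, tmp)
        yield tmp.name
    finally:
        os.remove(tmp.name)
```

A path-based extractor such as `textract.process` could then be called on the yielded path inside the `with` block. The suffix is kept because textract picks its parser from the file extension.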

@khink
Contributor

khink commented May 24, 2018

Hi @danieltomasku ,

Thanks for bringing this use case to our attention.

That approach looks pretty good to me.

Is there some way we can make this easier? Maybe the transcribe_document() method could have some pluggable helper method?
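One way the pluggable helper idea could look is sketched below. This is only an illustration: the setting name `TEXTRACT_GET_PATH_HELPER` and the functions `default_get_path`, `load_callable` and `get_path_helper` are all hypothetical and do not exist in wagtail_textract. The helper is named by a dotted path in the settings and falls back to plain `file.path` when nothing is configured.

```python
import importlib


def default_get_path(document):
    """Default behaviour: rely on the storage backend exposing a local path."""
    return document.file.path


def load_callable(dotted_path):
    """Import 'package.module.attr' and return the attribute."""
    module_name, _, attr = dotted_path.rpartition('.')
    return getattr(importlib.import_module(module_name), attr)


def get_path_helper(settings):
    """Return the configured helper, falling back to the default."""
    # TEXTRACT_GET_PATH_HELPER is a hypothetical setting name.
    dotted_path = getattr(settings, 'TEXTRACT_GET_PATH_HELPER', None)
    return load_callable(dotted_path) if dotted_path else default_get_path
```

A project on remote storage would then point the setting at its own function (for example one that downloads to a temporary file), while projects on the default backend would not need to configure anything.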

@allcaps
Member

allcaps commented May 28, 2018

Seems like Wagtail does the 'is-this-a-local-file' check in the same way: https://github.com/wagtail/wagtail/blob/7034cd131774b8971ff3c7424999a28164480f29/wagtail/documents/views/serve.py#L35
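The check referred to here can be sketched in a few lines. Django's base storage class raises NotImplementedError from `path()` when a backend has no local files, so accessing `file.path` inside a try block is enough to tell the two cases apart. The function name `get_local_path` and the two stand-in classes below are illustrative only and are not Django or Wagtail classes.

```python
def get_local_path(file):
    """Return file.path, or None when the storage backend has no local paths."""
    try:
        return file.path
    except NotImplementedError:
        return None


class LocalFile:
    """Stand-in for a file on the default filesystem backend."""
    path = '/srv/media/documents/report.pdf'


class RemoteFile:
    """Stand-in for a file on a remote backend such as Azure."""

    @property
    def path(self):
        raise NotImplementedError("This backend doesn't support absolute paths.")
```

With these stand-ins, `get_local_path(LocalFile())` returns the path string and `get_local_path(RemoteFile())` returns None, which is the signal to fall back to a temporary copy.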

@khink
Contributor

khink commented Jun 5, 2018

Hi @danieltomasku, I'd be happy to accept a PR if that helps.
