Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SVCS-488] Improve docx rendering - MFR Part #304

Open
wants to merge 2 commits into
base: develop
Choose a base branch
from

Conversation

cslzchen
Copy link
Contributor

@cslzchen cslzchen commented Dec 6, 2017

Ticket

https://openscience.atlassian.net/browse/SVCS-488
OSF side PR: CenterForOpenScience/osf.io#8002

Purpose

This ticket replaces #282. Credit goes to @AddisonSchiller 🎆🎆.

.docx rendering is very intensive on the OSF. By using Microsoft's online rendering service to render publicly available .docx files, we can remove a lot of pressure from the unoconv container.

Changes

  • Added support for public renderers.
  • Added the office365 renderer under the public renderers.
  • Added support to parse for a public_file query param. This query param is optional.public_file=True denotes that the file is public (the project it belongs to is public) , while public_file=False denotes that it is private. All other values for public_file cause errors to be raised.
  • ProviderMetadata now has an is_public flag, with default value set to False.

Side Effects

  • Does Microsoft enforces a rate limit?
  • How can we handle the "a few hour" cache problem?
  • What if this external service is down? Should we detect in advance?
  • iframe sandboxing may cause issue, need to verify on staging.

QA Notes

The Office365 renderer does not use the.pdf renderer like unoconv used to, so the pdfs that get made by this renderer may not display exactly the same. More QA notes to come. There is also a README.md in the renderer with more information about testing.

Deployment Notes

  • OSF side must be deployed first (no side effects).

@coveralls
Copy link

coveralls commented Dec 6, 2017

Coverage Status

Coverage decreased (-0.1%) to 67.896% when pulling 02468e1 on cslzchen:feature/svcs-488-improve-docx into 8bb2dd4 on CenterForOpenScience:develop.

Adding support for a `public_file` query param so the OSF can request a
public renderer. Added office365 which is a public renderer. This uses office
online to do .docx file conversions.
@cslzchen cslzchen force-pushed the feature/svcs-488-improve-docx branch 2 times, most recently from e2d6404 to 16c92e3 Compare December 7, 2017 14:43
@cslzchen cslzchen changed the title [SVCS-488] [WIP] Improve docx rendering [SVCS-488] Improve docx rendering Dec 7, 2017
@cslzchen cslzchen force-pushed the feature/svcs-488-improve-docx branch 2 times, most recently from 84b7b7d to d6cb8f8 Compare December 7, 2017 15:28
@cslzchen
Copy link
Contributor Author

cslzchen commented Dec 7, 2017

Note: this PR cannot be tested locally since Office rendering service requires a public available file. Here is what shows up locally, and sandboxing does not cause any issue. src is https://view.officeapps.live.com/op/embed.aspx?src=http%3A//192.168.168.167%3A7777/v1/resources/2n3gt/providers/osfstorage/5a283b61a4a094002587bea3 as expected. The error is due to the service having no access to the local file.

screen shot 2017-12-07 at 11 22 21 am

Here is what shows up locally when hard-coding a staging download url with with iframe sandboxing turned on. src is empty for iframe due to sandboxing.

screen shot 2017-12-07 at 11 18 36 am

Here is what shows up locally when hard-coding a staging download url with iframe sandboxing turned off. src is "https://view.officeapps.live.com/op/embed.aspx?src=https%3A//staging-files.osf.io/v1/resources/swy3c/providers/osfstorage/5a29638421b9f2000d8f0a16".

screen shot 2017-12-07 at 11 17 48 am

@cslzchen cslzchen force-pushed the feature/svcs-488-improve-docx branch from d6cb8f8 to 7b3a984 Compare December 7, 2017 16:27
@cslzchen cslzchen force-pushed the feature/svcs-488-improve-docx branch 2 times, most recently from 95c2eec to 2b04f43 Compare December 7, 2017 18:24
@cslzchen cslzchen force-pushed the feature/svcs-488-improve-docx branch from 2b04f43 to 2eec820 Compare December 7, 2017 18:48
@cslzchen cslzchen changed the title [SVCS-488] Improve docx rendering [SVCS-488] Improve docx rendering - MFR Part Dec 7, 2017
@coveralls
Copy link

coveralls commented Dec 7, 2017

Coverage Status

Coverage decreased (-0.1%) to 67.894% when pulling 2eec820 on cslzchen:feature/svcs-488-improve-docx into 8bb2dd4 on CenterForOpenScience:develop.

@cslzchen cslzchen changed the title [SVCS-488] Improve docx rendering - MFR Part [SVCS-488] [Do Not Merge] Improve docx rendering - MFR Part Dec 12, 2017
@cslzchen cslzchen changed the title [SVCS-488] [Do Not Merge] Improve docx rendering - MFR Part [SVCS-488] Improve docx rendering - MFR Part Jan 10, 2018
from mfr.core.metrics import MetricsRecord
from mfr.server import settings as server_settings
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is settings shadowed somewhere?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. This is 6-month old code. Needs update. It no longer make sense to me.

from mako.lookup import TemplateLookup

from mfr.core import extension
from mfr.extensions.office365 import settings as office365_settings
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why does this need to be qualified?

Copy link
Contributor Author

@cslzchen cslzchen May 3, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as the previous comment. We no longer do alias if there is no shadowing issues. Needs update.

}
</style>

<iframe src=${url} frameborder='0'></iframe>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only thing it looks like is in here is another iframe - is it possible to return a 302 here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get your point but not sure how 302 works. OSF creates the parent iframe and we shouldn't modify it from MFR. Is embedded iframe a bad practice? For MFR itself, it should use iframe here since it loads content from an untrusted external web service.

response = await self._make_request('GET', download_url, allow_redirects=False, headers=headers)

if response.status >= 400:
if response.status >= HTTPStatus.BAD_REQUEST:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dig it.


async def download(self):
"""Download file from WaterButler, returning stream."""
download_url = await self._fetch_download_url()
headers = {settings.MFR_IDENTIFYING_HEADER: '1'}
headers = {provider_settings.MFR_IDENTIFYING_HEADER: '1'}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again here, wondering why this needs to be qualified

Copy link
Contributor Author

@cslzchen cslzchen May 3, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, this should be pd_settings as we have been using it for both WB and MFR.

'Unable to download the requested file, please try again later.',
download_url=download_url,
response=resp_text,
provider=self.NAME,
)

self.metrics.add('download.saw_redirect', False)
if response.status in (302, 301):
if response.status in (HTTPStatus.MOVED_PERMANENTLY, HTTPStatus.FOUND):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be a list rather than a tuple?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No sure. I had the same question. However, it was a tuple and the code base uses tuple.

Copy link
Contributor Author

@cslzchen cslzchen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@birdbrained Thanks for the review. This PR has been very old and we haven't decided whether this is the approach yet. I will rebase with the latest develop and fix the obsoleted import style.

More If You Are Interested

The blocking issues for using Microsofts rendering service are:

  • This service is not a reliable service.
  • This service cached their result for a few hours.
  • This service may have rate limiting.
    More discussion and investigation is needed before continue developing.
  • We can only render public files through this service.

from mfr.core.metrics import MetricsRecord
from mfr.server import settings as server_settings
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. This is 6-month old code. Needs update. It no longer make sense to me.

from mako.lookup import TemplateLookup

from mfr.core import extension
from mfr.extensions.office365 import settings as office365_settings
Copy link
Contributor Author

@cslzchen cslzchen May 3, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as the previous comment. We no longer do alias if there is no shadowing issues. Needs update.


async def download(self):
"""Download file from WaterButler, returning stream."""
download_url = await self._fetch_download_url()
headers = {settings.MFR_IDENTIFYING_HEADER: '1'}
headers = {provider_settings.MFR_IDENTIFYING_HEADER: '1'}
Copy link
Contributor Author

@cslzchen cslzchen May 3, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, this should be pd_settings as we have been using it for both WB and MFR.

'Unable to download the requested file, please try again later.',
download_url=download_url,
response=resp_text,
provider=self.NAME,
)

self.metrics.add('download.saw_redirect', False)
if response.status in (302, 301):
if response.status in (HTTPStatus.MOVED_PERMANENTLY, HTTPStatus.FOUND):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No sure. I had the same question. However, it was a tuple and the code base uses tuple.

}
</style>

<iframe src=${url} frameborder='0'></iframe>
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get your point but not sure how 302 works. OSF creates the parent iframe and we shouldn't modify it from MFR. Is embedded iframe a bad practice? For MFR itself, it should use iframe here since it loads content from an untrusted external web service.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants