Passing structured formatted data #101
Here's a sample (fabricated) output of the docx => txt template conversion:
While I totally oppose this on the grounds that it moves us further and further from the true origins of the project, it's probably quite correct. We should most likely do this once we have the postgres version of torque in production, and we should probably make the API less "UPLOAD ALL OF THE THINGS" and more piecemeal -- for instance, in expanding the

As an aside, the current CSV upload has a "json" type for some columns, which has been used to upload tabular data (see the financial data adder), so we're basically already there.
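For illustration, here's a sketch of what a "json"-typed column might look like in practice -- the column names and row shape here are hypothetical, not the actual torque schema:

```python
import csv
import io
import json

# Hypothetical row: a tabular "budget" field serialized as JSON
# inside a single CSV cell, alongside ordinary scalar columns.
rows = [
    {"proposal_id": "42", "budget": json.dumps([
        {"item": "Staff", "amount": 120000},
        {"item": "Travel", "amount": 8000},
    ])},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["proposal_id", "budget"])
writer.writeheader()
writer.writerows(rows)

# Reading it back: the cell round-trips to structured data.
reader = csv.DictReader(io.StringIO(buf.getvalue()))
parsed = [json.loads(r["budget"]) for r in reader]
```

The csv module's quoting handles the embedded commas and quotes, so the JSON survives the round trip intact.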
One maybe vital note: I'm treating ALL of this as one big series of preprocessing steps -- CSV is the ultimate output that gets piped to torque, so hopefully that's a good enough homage 😂
I'm thinking ArchieML is probably too much -- really I just want to go through line by line and, whenever there's a new header level (

Gonna whip up a quick script for this.
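A minimal sketch of that line-by-line approach, assuming pandoc-style markdown where ATX headers (`#`, `##`, ...) mark sections -- the function name and section shape are mine, not from any existing script:

```python
def split_by_headers(text):
    """Walk a markdown document line by line, starting a new
    (level, title, body) section whenever a header appears."""
    sections = []
    current = None
    for line in text.splitlines():
        stripped = line.lstrip()
        if stripped.startswith("#"):
            # Header level = number of leading '#' characters.
            level = len(stripped) - len(stripped.lstrip("#"))
            title = stripped[level:].strip()
            current = {"level": level, "title": title, "body": []}
            sections.append(current)
        elif current is not None:
            # Accumulate body lines under the most recent header;
            # anything before the first header is dropped.
            current["body"].append(line)
    return sections

doc = "# Overview\nSome text.\n## Details\nMore text."
sections = split_by_headers(doc)
```

Each section then becomes addressable by title and level, which is all the structure the CSV conversion needs.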
The Use Case
We rely on CSVs for passing data into torque. This makes plenty of sense, but it also limits the amount of (basic) formatting that can be assigned to that data. For instance, a diligence report might involve bulleted lists, or bolded / emphasized text within a given sentence.
There is an additional issue: some of the diligence / followup documents have lots of different sections, and asking folks to paste text into CSVs feels a bit clunky (since spreadsheets aren't really intended for long-form, multi-paragraph text blocks).
The non-engineered solution to this involved taking PDF / Word documents, using ghostscript and pandoc to extract text, and manually inserting the data into the system. This was a fraught process (but it resulted in a deeper understanding of the data, which is always nice).
An Engineered Solution
What I'm working on now is just a first draft of an engineered solution. We'll iterate over time, I'm sure.
I've created very simple Word templates which leverage the header formatting types to demarcate various sections and subsections. PDF and other long-form data is inserted into these Word "templates" manually and provided to OTS. Then:

1. As part of ETL, I use pandoc to convert them to plain text (markdown or mediawiki).
2. Remove any random HTML that got inserted (sometimes indentation causes trouble).
3. Do some basic string replacement to convert the document to ArchieML format, so each section becomes semantically accessible.
4. Convert that structured object into the CSV torque expects.
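Taken together, the steps above could look something like this sketch -- it assumes pandoc is on the PATH, and the helper names and section-to-column mapping are hypothetical, not the actual ETL code:

```python
import csv
import re
import subprocess

def docx_to_markdown(path):
    # Step 1: let pandoc extract plain markdown from the Word template.
    result = subprocess.run(
        ["pandoc", path, "-t", "markdown"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

def strip_stray_html(text):
    # Step 2: drop any inline HTML tags pandoc carried over
    # (indentation sometimes produces these).
    return re.sub(r"<[^>]+>", "", text)

def sections_to_row(text):
    # Step 3: split on headers so each section is addressable by
    # name -- roughly what the ArchieML conversion buys us.
    row = {}
    title = None
    for line in text.splitlines():
        m = re.match(r"^(#+)\s*(.+)$", line)
        if m:
            title = m.group(2).strip()
            row[title] = ""
        elif title:
            row[title] += line + "\n"
    return {k: v.strip() for k, v in row.items()}

def write_torque_csv(rows, out_path):
    # Step 4: emit the CSV torque expects, one column per section.
    fieldnames = sorted({k for row in rows for k in row})
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```

Each document becomes one CSV row, with its sections as columns -- from there it's ordinary torque input.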
At that point, it's all just Torque data like anything else.