Passing structured formatted data #101

Open
slifty opened this issue Mar 9, 2021 · 4 comments

Contributor

slifty commented Mar 9, 2021

The Use Case

We rely on CSVs for passing data into Torque. This makes plenty of sense, but it also limits the amount of (basic) formatting that can be applied to that data. For instance, a diligence report may contain bulleted lists, or bold / emphasized text within a sentence, and a CSV cell has no good way to carry either.

There is an additional issue: some of the diligence / followup documents have lots of different sections, and asking folks to paste text into CSVs feels a bit clunky (since spreadsheets aren't really intended for long-form, multi-paragraph text blocks).

The non-engineered solution to this involved taking PDF / Word documents, using ghostscript and pandoc to extract text, and manually inserting the data into the system. This was a fraught process (but resulted in a deeper understanding of the data which is always nice).

An Engineered Solution

What I'm working on now is just a first draft at an engineered solution. We'll iterate over time I'm sure.

  1. I've created very simple Word templates which use Word's heading styles to demarcate the various sections and subsections.

  2. PDF and other long-form data is inserted into these Word "templates" manually and provided to OTS.

  3. As part of ETL, I use pandoc to convert them to plain text (markdown or MediaWiki markup).

  4. Remove any random HTML that got inserted (sometimes indentation causes trouble)

  5. Do some basic string replacement to convert the document to ArchieML format, so that each section becomes semantically accessible.

  6. Convert that structured object into the CSV torque expects.

At that point, it's all just Torque data like anything else.
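
A minimal sketch of steps 3, 4 and 6 above, not the actual ETL code. The function names, the " / " column naming, and the assumption that the parsed document is a nested dict of section name to text are all hypothetical; only the pandoc flags (-f, -t, -o) are real.

```python
import csv
import io
import re


def pandoc_command(docx_path, out_path, fmt="mediawiki"):
    """Step 3: build (but do not run) a pandoc docx-to-text invocation."""
    return ["pandoc", "-f", "docx", "-t", fmt, docx_path, "-o", out_path]


# Matches opening and closing tags such as <blockquote> or </span>.
TAG_RE = re.compile(r"</?[A-Za-z][^>]*>")


def strip_html(text):
    """Step 4: drop stray HTML tags, keeping the text between them."""
    return TAG_RE.sub("", text)


def flatten(sections, prefix=""):
    """Step 6 helper: nested sections become flat "Parent / Child" keys."""
    flat = {}
    for key, value in sections.items():
        name = f"{prefix} / {key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


def to_csv(sections):
    """Step 6: emit one header row and one data row (assumed layout)."""
    flat = flatten(sections)
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(flat))
    writer.writeheader()
    writer.writerow(flat)
    return buf.getvalue()
```

The command list could be handed to subprocess.run(..., check=True) on a machine with pandoc installed; whether Torque wants one column per section or something else entirely is a guess here.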

Contributor Author

slifty commented Mar 9, 2021

Here's a sample (fabricated) output of the docx => txt template conversion:

= Fashion Review =
== Overview ==

Overall the sweat pants and t shirt while working from home look is a tried and true outfit and we have no concerns.

== Shirt ==
===Description===
The shirt is definitely in need of ironing.

===Color===
White

===Size===
L

===Type===
It's a t shirt!

== Pants ==

The outfit involved blue sweat pants, which offered comfort and a basic amount of warmth.

Collaborator

frankduncan commented Mar 9, 2021

While I totally oppose this on the grounds that it moves us further and further from the true origins of the project, this is probably quite correct. We should most likely do this once we have the Postgres version of Torque in production. We should probably also make the API less "UPLOAD ALL OF THE THINGS" and more piecemeal: for instance, by extending the -p option with a setting that says "don't upload TOCs, just proposal data", for faster turnaround when new data only touches a handful of proposals.

As an aside, the current csv upload has a "json" type for some columns, which has been used to upload tabular data (see the financial data adder), so we're basically already there.
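
A rough sketch of what packing a structured section tree into a single "json" cell could look like. The column names, the function name, and the way Torque marks a column as "json" are assumptions, not taken from the Torque code.

```python
import csv
import io
import json


def row_with_json_cell(proposal_id, sections, column="Diligence Report"):
    """Write a two-column CSV whose second cell holds the sections as JSON.

    The header names are hypothetical; the csv module takes care of quoting
    the embedded JSON so commas and quotes inside it survive the round trip.
    """
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["Proposal ID", column])
    writer.writerow([proposal_id, json.dumps(sections)])
    return buf.getvalue()
```

Reading the file back with csv.reader and passing the cell to json.loads recovers the original nested structure.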

Contributor Author

slifty commented Mar 9, 2021

One maybe vital note: I'm treating ALL of this as one big series of preprocessing steps -- CSV is the ultimate output that gets piped to torque so hopefully that's a good enough homage 😂

Contributor Author

slifty commented Mar 9, 2021

I'm thinking ArchieML is probably too much -- really I just want to go through line by line, and whenever there's a new header level (e.g. = to ==) the parser should spin up a new key and assign it to some object.

Gonna whip up a quick script for this.
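
A minimal sketch of what such a line-by-line parser might look like, assuming MediaWiki-style headings like the sample above. The function name and the "_text" key used for a section's body are made up for illustration; this is not the script that was actually written.

```python
import re

# A heading is a run of "=" signs, a title, and a closing run of "=" signs.
HEADING_RE = re.compile(r"^(=+)\s*(.+?)\s*(=+)\s*$")


def parse_sections(text):
    """Parse heading-delimited text into nested dicts keyed by section title.

    Body lines are stored under the hypothetical key "_text" of the section
    they belong to; blank lines are skipped.
    """
    root = {}
    stack = [(0, root)]  # (heading level, dict for that section)
    for line in text.splitlines():
        match = HEADING_RE.match(line)
        if match and len(match.group(1)) == len(match.group(3)):
            level = len(match.group(1))
            # Close every open section at the same or a deeper level.
            while stack[-1][0] >= level:
                stack.pop()
            node = {}
            stack[-1][1][match.group(2)] = node
            stack.append((level, node))
        elif line.strip():
            node = stack[-1][1]
            if "_text" in node:
                node["_text"] += "\n" + line
            else:
                node["_text"] = line
    return root
```

Run on the Fashion Review sample, the shirt color would be reachable as tree["Fashion Review"]["Shirt"]["Color"]["_text"]. Keeping body text under its own key lets a section hold both prose and subsections.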
