updated topic creation dataflow diagram #715

rahulbot · 2020-06-10T18:34:42Z

As part of the ongoing documentation / support process, I created an updated data flow diagram to chart how data flows when multi-platform topics are created. I think it will be helpful as another resource to provide to our researchers, and when we roll this out more broadly.

Can you take a look at the attached and let me know if you see any errors or major omissions?

MC Topic Creation Dataflow.pdf

hroberts · 2020-06-10T20:14:18Z

This is a great start, Rahul. At a glance, it looked great, but I think it is actually missing a lot of what the topic system does. Apologies if I'm being overly critical.

Comments:

We are pulling from google web search, not google news.
We should include the csv import as well as a source.
The spidering process as described is missing the relevancy pattern matching, so it looks like we are just importing all urls into the topic (as issue crawler does).
The spidering process is not distinct from the html extraction. We actually match for relevancy against the raw html (and throw away anything that doesn't match), then extract content from the html and create an actual story in our database, then check for relevancy again and only add to the topic if the extracted content matches. We do this for performance optimization (it's much cheaper to do an html regex match than to do the html content extraction). The most important thing to make clear to the reader is that we are ultimately doing relevancy matching against the extracted content before adding a story to the topic.
We don't ever deduplicate based on content. We only dedup (or match, depending on how you see it) based on normalized urls and normalized titles.
This doesn't capture: a) the fact that we are storing hyperlinks and processing them into graph data and metrics, b) network map generation, c) subtopic generation, d) timespan based analysis, e) frozen snapshot generation, f) generation of url sharing subtopics, g) generation of url sharing metrics within the larger topic, h) date guessing.
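
For anyone redrawing the diagram, the two-stage relevancy check and the url/title dedup described in the comments above can be sketched roughly as follows. This is only an illustrative sketch, not the actual topic mapper code: `extract_text`, `normalize_url`, `normalize_title`, `add_to_topic`, and the in-memory `topic` dict are all hypothetical names, the text extractor is a crude stand-in for real content extraction, and the normalization rules (lowercasing, dropping `www.` and the query string, stripping punctuation from titles) are assumptions made for the example.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlsplit


class _TextExtractor(HTMLParser):
    """Crude stand-in for a real content extractor: keeps visible text only."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0  # depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def extract_text(html):
    """Hypothetical extraction step (the expensive one in a real system)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def normalize_url(url):
    """Assumed normalization: lowercase, drop scheme, 'www.', query, trailing '/'."""
    parts = urlsplit(url.strip().lower())
    host = parts.netloc[4:] if parts.netloc.startswith("www.") else parts.netloc
    return host + parts.path.rstrip("/")


def normalize_title(title):
    """Assumed normalization: lowercase, strip punctuation, collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9 ]", " ", title.lower()).split())


def add_to_topic(topic, pattern, url, title, html):
    # Stage 1: cheap regex match against the raw html; discard on no match.
    if not pattern.search(html):
        return "rejected_raw_html"
    # Stage 2: extract content, then match again against the extracted text.
    if not pattern.search(extract_text(html)):
        return "rejected_extracted_text"
    # Dedup / match on normalized url or normalized title, never on content.
    key_url, key_title = normalize_url(url), normalize_title(title)
    if key_url in topic["urls"] or (key_title and key_title in topic["titles"]):
        return "matched_existing"
    topic["urls"].add(key_url)
    if key_title:
        topic["titles"].add(key_title)
    return "added"
```

The point the diagram needs to convey is the second check: a page whose raw html matches only in a link target or script passes stage 1 but is still dropped, because the extracted content is what ultimately decides whether a story joins the topic.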

rahulbot · 2020-06-10T23:15:05Z

Yeah, I know. I was trying not to get into too much detail, but also include a bunch. So I think I've ended up with a rather arbitrary list of the things included vs. excluded. For instance, I intentionally didn't include any of the snapshotting process (subtopics, timespans) because I thought that wouldn't be that useful to know from a metadata/data gathering perspective. Lemme take another pass at including some of those steps and corrections that are pre-snapshotting. I'm still not sure how much detail is useful to a researcher trying to understand how the data is gathered and filtered (vs. to a developer). There's so much going on in the system that it'll take a few rounds to catch it all on a single diagram, I think 👍🏽

rahulbot · 2020-06-12T19:44:43Z

I took another pass at adding in more of the features of the full topic mapper engine. Give this one a look over for errors/omissions.
MC Topic Creation Dataflow-2.pdf

hroberts · 2020-06-17T18:15:43Z

This looks great. Very impressive work, rahul!

rahulbot · 2020-06-17T18:30:10Z

Thx 👍🏽 I'm gonna share it with the rest of the Civic MC team to get feedback. After another rev or two it'll be ready to add into the repo somewhere.

rahulbot added the question label Jun 10, 2020

rahulbot assigned pypt and hroberts Jun 10, 2020
