This repository has been archived by the owner on Dec 14, 2023. It is now read-only.

updated topic creation dataflow diagram #715

Open
rahulbot opened this issue Jun 10, 2020 · 5 comments

Comments

@rahulbot
Contributor

As part of the ongoing documentation / support process, I created an updated data flow diagram to chart how data flows when multi-platform topics are created. I think it will be helpful as another resource to provide to our researchers, and when we roll this out more broadly.

Can you take a look at the attached and let me know if you see any errors or major omissions?

MC Topic Creation Dataflow.pdf

@hroberts
Contributor

This is a great start, Rahul. At a glance, it looked great, but I think it is actually missing a lot of what the topic system does. Apologies if I'm being overly critical.

Comments:

  • We are pulling from Google web search, not Google News.
  • We should also include the CSV import as a source.
  • The spidering process as described is missing the relevancy pattern matching, so it looks like we are just importing all urls into the topic (as issue crawler does).
  • The spidering process is not distinct from the HTML extraction. We first match for relevancy against the raw HTML (and throw away anything that doesn't match), then extract content from the HTML and create an actual story in our database, then check for relevancy again and only add the story to the topic if the extracted content matches. We do this as a performance optimization (an HTML regex match is much cheaper than HTML content extraction). The most important thing to make clear to the reader is that we are ultimately doing relevancy matching against the extracted content before adding a story to the topic.
  • We don't ever deduplicate based on content. We only dedup (or match, depending on how you see it) based on normalized urls and normalized titles.
  • This doesn't capture: a) the fact that we are storing hyperlinks and processing them into graph data and metrics, b) network map generation, c) subtopic generation, d) timespan based analysis, e) frozen snapshot generation, f) generation of url sharing subtopics, g) generation of url sharing metrics within the larger topic, h) date guessing.
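
To make the ordering in the bullets above concrete, here is a minimal Python sketch of the two-stage relevancy check followed by dedup on normalized urls and titles. It is an illustration only, not the actual topic mapper code: `Topic`, `consider`, `normalize_url`, `normalize_title`, and the toy `_TextExtractor` are all hypothetical names, the normalization rules are simplified guesses, and placing the dedup check after the second relevancy check is an assumption.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlsplit


class _TextExtractor(HTMLParser):
    """Toy stand-in for real content extraction: keeps the <title> and visible text."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.parts = []
        self._in_title = False
        self._skip = 0  # depth inside <script>/<style>, whose text is not content

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif not self._skip:
            self.parts.append(data)


def normalize_url(url):
    """Assumed normalization: drop scheme, 'www.', query, fragment, trailing slash."""
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return host + parts.path.rstrip("/")


def normalize_title(title):
    """Assumed normalization: lowercase, collapse punctuation and whitespace."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


class Topic:
    def __init__(self, pattern):
        self.pattern = re.compile(pattern, re.IGNORECASE)
        self.stories = []
        self._seen_urls = set()
        self._seen_titles = set()

    def consider(self, url, raw_html):
        """Run one fetched page through the checks; return a short status string."""
        # Stage 1: cheap regex match against the raw HTML.
        if not self.pattern.search(raw_html):
            return "rejected: raw html"
        # Stage 2: the (expensive) extraction step, faked here with HTMLParser.
        extractor = _TextExtractor()
        extractor.feed(raw_html)
        title = extractor.title.strip()
        text = " ".join(extractor.parts)
        # Stage 3: the match that actually decides, against extracted content.
        if not self.pattern.search(title + " " + text):
            return "rejected: extracted content"
        # Dedup on normalized url and normalized title only, never on content.
        nurl, ntitle = normalize_url(url), normalize_title(title)
        if nurl in self._seen_urls or (ntitle and ntitle in self._seen_titles):
            return "duplicate"
        self._seen_urls.add(nurl)
        self._seen_titles.add(ntitle)
        self.stories.append({"url": url, "title": title, "text": text})
        return "added"
```

A page that mentions the pattern only inside a `<script>` tag passes the raw HTML check but is rejected after extraction, which is the case the second check exists to catch.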

@rahulbot
Contributor Author

Yeah, I know. I was trying not to get into too much detail, but also to include a bunch. So I think I've ended up with a rather arbitrary list of the things included vs. excluded. For instance, I intentionally didn't include any of the snapshotting process (subtopics, timespans) because I thought that wouldn't be that useful to know from a metadata/data gathering perspective. Lemme take another pass at including some of those steps and corrections that are pre-snapshotting. I'm still not sure how much detail is useful to a researcher trying to understand how the data is gathered and filtered (vs. to a developer). There's so much going on in the system that it'll take a few rounds to catch it all on a single diagram, I think 👍🏽

@rahulbot
Contributor Author

I took another pass at adding in more of the features of the full topic mapper engine. Give this one a look over for errors/omissions.
MC Topic Creation Dataflow-2.pdf

@hroberts
Contributor

This looks great. Very impressive work, Rahul!

@rahulbot
Contributor Author

Thx 👍🏽 I'm gonna share it with the rest of the Civic MC team to get feedback. After another rev or two it'll be ready to add into the repo somewhere.
