In an effort to make Civic Graph a little smarter, I developed four building blocks to improve the quality (i.e., accuracy and completeness) of the stored data and to automate parts of the data collection process. They are:
- Web Scrapers
- Analytics Toolbox
- Classifier
- [Pipeline](/References/Pipeline Diagram.pdf) to integrate the scrapers, classifier, and analysis with the existing Civic Graph
I created a Process Map to explain how everything I built fits together. View it [here](/References/Mad Libs Visual .pdf).
I also created a Handoff Document for a Future Fellow outlining how each script works, the external libraries it uses, and how my work can be fully integrated with the existing Civic Graph in the future. View the document [here](/References/Handoff for Future Fellow.pdf).
I've compiled a list of tools and resources that I used throughout the project. They cover a range of topics, including:
- Web Scraping
- Data Analysis with Python
- Text Mining
- Natural Language Processing
- Machine Learning
BeautifulSoup: Python library for parsing XML and HTML.
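To give a feel for the library (the URL below is a placeholder, not one of the sources I actually scraped), here is a minimal sketch of fetching a page and pulling out its links:

```python
import requests
from bs4 import BeautifulSoup

# Placeholder URL for illustration only.
response = requests.get("https://example.com")
soup = BeautifulSoup(response.text, "html.parser")

# Print the visible text and target of every link on the page.
for link in soup.find_all("a"):
    print(link.get_text(strip=True), link.get("href"))
```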
spaCy: Free, open-source Python library for fast and accurate Natural Language Processing analysis.
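A minimal sketch of what spaCy usage looks like (the sentence is made up, and the small English model must be installed first with `python -m spacy download en_core_web_sm`):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Microsoft partnered with the City of New York on an open data project.")

# Named entities spaCy recognizes in the sentence, with their labels.
for ent in doc.ents:
    print(ent.text, ent.label_)
```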
textacy: Python library built on top of spaCy for higher level Natural Language Processing (NLP).
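For example (a sketch following textacy's current documented interface; the API has shifted between versions, so check the docs for the release you install):

```python
import textacy
from textacy import extract

text = "Civic tech organizations collaborate with city governments on open data."

# textacy wraps spaCy's pipeline behind a convenience constructor.
doc = textacy.make_spacy_doc(text, lang="en_core_web_sm")

# Higher-level extraction helpers layered on top of the spaCy doc.
print(list(extract.ngrams(doc, 2, filter_stops=True)))
print(list(extract.entities(doc)))
```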
nltk: Platform for writing Python programs that work with human language data. It provides over 50 corpora and lexical resources, along with text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, plus wrappers for industrial-strength NLP libraries. Its tools are easy to use and accurate, but very slow on large datasets.
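A small example of the classic nltk workflow (the model downloads only need to run once):

```python
import nltk

# One-time downloads of the tokenizer and part-of-speech tagger models.
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

sentence = "Civic Graph maps the civic technology ecosystem."

# Tokenize the sentence, then tag each token with its part of speech.
tokens = nltk.word_tokenize(sentence)
print(nltk.pos_tag(tokens))
```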
scikit-learn: Machine learning library in Python built on NumPy, SciPy, and matplotlib.
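As an illustration of the kind of pipeline a text classifier can be built from (the training examples and labels below are toy data, not the categories my classifier actually uses):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training set for illustration only.
texts = [
    "We build open source tools for city governments.",
    "Our firm provides consulting services to enterprises.",
    "Volunteers organize civic hackathons in their communities.",
    "The company sells software licenses to large clients.",
]
labels = ["civic", "commercial", "civic", "commercial"]

# TF-IDF features feeding a logistic regression classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

print(model.predict(["A nonprofit hosting open data workshops for residents."]))
```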
Created by Hannah Cutler during my fellowship at