This repository has been archived by the owner on Jun 16, 2020. It is now read-only.

RFC: More data #10

Open
wchristian opened this issue May 30, 2013 · 2 comments

Comments

@wchristian

I've been working in another direction on the same issue off and on, and here's a spreadsheet with a bunch of companies, as well as links to websites that list more companies:

https://docs.google.com/spreadsheet/ccc?key=0AoJOD6qxPwy6dHMwTVN6bkRQR1BjS3Y0TW1ITFZYaGc#gid=0

Additionally, I made a small script that takes a file with one URL per line and grabs the Alexa rank of that URL:

https://docs.google.com/file/d/0B4JOD6qxPwy6aW5VOWdWdXNBTlE/edit?usp=sharing
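
The gist of it is roughly this (a sketch only, not the exact script linked above; it assumes the old data.alexa.com XML endpoint):

```perl
#!/usr/bin/env perl
# Sketch only: read a file with one URL per line and print each URL's
# Alexa rank, using the (historical) data.alexa.com XML endpoint.
use strict;
use warnings;
use LWP::UserAgent;

my $file = shift or die "usage: $0 urls.txt\n";
open my $fh, '<', $file or die "can't open $file: $!";

my $ua = LWP::UserAgent->new( timeout => 10 );

while ( my $url = <$fh> ) {
    chomp $url;
    next unless length $url;
    my $res = $ua->get("http://data.alexa.com/data?cli=10&url=$url");
    unless ( $res->is_success ) {
        print "$url\t(error: " . $res->status_line . ")\n";
        next;
    }
    # The XML response carries the rank in a POPULARITY element,
    # e.g. <POPULARITY URL="example.com/" TEXT="1234"/>
    my ($rank) = $res->decoded_content =~ /<POPULARITY[^>]*TEXT="(\d+)"/;
    print "$url\t", ( defined $rank ? $rank : 'n/a' ), "\n";
}
```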

I'm not sure how to integrate those into your data, so this is more of a discussion ticket.

@thaljef

thaljef commented May 31, 2013

I'm not sure how to integrate those into your data, so this is more of a discussion ticket.

Here's what I think should happen next:

  • Identify the facts that we'd like to collect about each company. It's easy to go crazy here. But since we're asking human beings to curate this data, we shouldn't burden them with lots of questions, especially if the answers are subjective or may change frequently. These are the facts that I'd like to know: company name, website, HQ location (city & country), industry, total number of employees, and number of employees working full-time with Perl (can be fractional). One possible header layout is sketched after this list.
  • Regenerate the CSV (or JSON or SQLite) file with fields for those facts, and fill in what we have from the jobs data. At the same time, we should do what we can to programmatically sanitize and de-duplicate the data. At that point, we can ignore the historical job postings. We may analyze future job posts to add more records to the CSV. But as folks curate the CSV, it will start to diverge from the job data, so there won't be much value in reprocessing the old job postings.
  • Direct people on how to amend and update the data. There seems to be some confusion on whether to modify the markdown file or the CSV. I'm not sure how it works now, but the markdown file ought to be generated from the CSV (a rough generator sketch follows this list). Folks should only edit the CSV, and we need to make that clearer.
  • Consider other ways to enable folks to curate the data. Right now, people have to fork and make pull requests. That's not too bad, but it might be easier to use a public spreadsheet on Google Docs, or set up some kind of web form that people can use. The key is to make it ridiculously easy for people to contribute. I think that is far more important than worrying about data quality.
  • Eventually, make some pretty graphs and analyses. It doesn't have to be fancy, and I think we could do quite a lot by just using GitHub Pages (for hosting) and Google Graphs (for analytics).
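
To make the first point concrete, here is one possible CSV header, with a made-up example row (the field names are only a suggestion):

```
company_name,website,hq_city,hq_country,industry,total_employees,perl_employees
Example Corp,https://example.com,Portland,USA,Software,250,3.5
```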

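For the second and third points, a rough sketch of how the CSV could be de-duplicated and the markdown regenerated from it, using Text::CSV and the column names suggested above (illustration only, not existing tooling; the file name companies.csv is a placeholder):

```perl
#!/usr/bin/env perl
# Sketch only: regenerate the markdown table from the CSV so the CSV
# stays the single source of truth. Column layout as suggested above;
# "companies.csv" is a placeholder file name.
use strict;
use warnings;
use Text::CSV;

my $csv = Text::CSV->new( { binary => 1, auto_diag => 1 } );
open my $in, '<:encoding(utf8)', 'companies.csv' or die $!;

my $header = $csv->getline($in);    # first row: column names
$csv->column_names(@$header);

my %seen;                           # crude de-duplication on company name
my @rows;
while ( my $row = $csv->getline_hr($in) ) {
    my $key = lc $row->{company_name};
    $key =~ s/\s+//g;
    next if $seen{$key}++;
    push @rows, $row;
}

print "| Company | Website | Location | Industry | Employees | Perl employees |\n";
print "|---|---|---|---|---|---|\n";
for my $r ( sort { lc $a->{company_name} cmp lc $b->{company_name} } @rows ) {
    printf "| %s | %s | %s, %s | %s | %s | %s |\n",
        @{$r}{qw(company_name website hq_city hq_country industry total_employees perl_employees)};
}
```
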
So @vmbrasseur, @wchristian, what say you?

@thaljef

thaljef commented May 31, 2013

I'm not sure how to integrate those into your data,

I looked at your data some more. Once we've identified the critical facts and regenerated the base CSV, then we can merge the data from your collection. I think we can probably just divide and conquer (manually, if necessary).

My point is, you don't have to be an employee of the company to put it on the list. If you know of companies that use Perl, or have a list of them from elsewhere, then go ahead and add them. If someone with better information comes along and edits it later, then that's ok.
