[idea] Make separate file for each program from `program-list.json` (instead of keeping all programs in one file) #346

nikitastupin · 2021-11-10T11:28:37Z

Hi 👋

The problem: when more than one pull request touches program-list.json and one of them is merged, the others will most likely need a rebase, which is inconvenient.

I've thought about making a separate file for each program. This would make adding new programs easier (no rebase needed).

What are your thoughts on this?

yesnet0 · 2021-11-22T08:47:12Z

Avoiding the rebase would be good. It gets unwieldy, especially if lots of additions come in at the same time. What would you suggest as a unique key?

nikitastupin · 2021-11-22T16:16:25Z

I have two options at the moment:

1. an incrementing numerical id
2. a company slug inferred from the company name or from the company's slug on Crunchbase (e.g. "https://www.crunchbase.com/organization/bugcrowd" becomes "bugcrowd")

The first option is good because it's simple to issue new ids, but it's hard to find the program you need without grep.

The second one is more human-friendly, but it's harder to issue new ids (and there may be collisions when a company has more than one program).
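A minimal sketch of how the second option could derive a slug. The function names `slug_from_name` and `slug_from_crunchbase` are hypothetical, and the normalization rules (lowercase, accents stripped, non-alphanumeric runs collapsed to a hyphen) are an assumption rather than anything agreed on in this thread.

```python
import re
import unicodedata
from urllib.parse import urlparse


def slug_from_name(name: str) -> str:
    """Lowercase the name, strip accents, and join alphanumeric runs with '-'."""
    ascii_name = (
        unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    )
    return re.sub(r"[^a-z0-9]+", "-", ascii_name.lower()).strip("-")


def slug_from_crunchbase(url: str) -> str:
    """Take the last path segment of a Crunchbase organization URL."""
    path = urlparse(url).path.rstrip("/")
    return path.rsplit("/", 1)[-1]
```

With this, `slug_from_crunchbase("https://www.crunchbase.com/organization/bugcrowd")` returns `"bugcrowd"`, matching the example above. Neither function addresses collisions between two programs of the same company.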

nikitastupin · 2021-11-22T16:17:12Z

We can use numerical ids but have a workflow that merges the files into a single file on push to the main branch.
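The merge step such a workflow would run could look roughly like the sketch below. It assumes the per-program files live under a `programs/` directory and that the combined file is a JSON array sorted by `program_name`; the function name `merge_programs` is hypothetical and the real output format may differ.

```python
import json
from pathlib import Path


def merge_programs(programs_dir: Path, output_file: Path) -> int:
    """Combine every *.json file under programs_dir into one sorted JSON array."""
    programs = [
        json.loads(path.read_text(encoding="utf-8"))
        for path in sorted(programs_dir.rglob("*.json"))
    ]
    # Sort by name so the generated file is stable regardless of file layout.
    programs.sort(key=lambda p: str(p.get("program_name", "")).lower())
    output_file.write_text(
        json.dumps(programs, indent=2, ensure_ascii=False) + "\n", encoding="utf-8"
    )
    return len(programs)
```

A workflow step would call something like `merge_programs(Path("programs"), Path("program-list.json"))` and commit the result, so contributors only ever touch their own file.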

prodigysml · 2021-11-23T23:05:52Z

That's a really good idea! Maybe something like bucket sorting by the company's name (e.g. apple goes in a, bugcrowd goes in b, etc.) might be a way to organise things. It would also let us fix up the structure to allow multiple domains, as we discussed previously. Also, the idea of a GitHub Action to collate that information is awesome too! Sounds like a good improvement to the data structure!

nikitastupin · 2021-11-24T09:18:32Z

Hi @prodigysml, do you mean something like this?

tree programs

programs
├── a
│   ├── amara.json
│   └── apple.json
└── b
    └── bugcrowd.json

where each .json file is:

cat programs/a/apple.json

{
	"program_name":"Apple",
	"policy_url":"https://developer.apple.com/security-bounty/",
	"launch_date":"",
	"offers_bounty":"yes",
	"offers_swag":false,
	"hall_of_fame":"",
	"safe_harbor":"none",
	"public_disclosure":"",
	"pgp_key":"",
	"hiring":"",
	"securitytxt_url":"",
	"preferred_languages":"",
	"policy_url_status":"alive",
	"contact_email":"[email protected]"
}

I'm not sure how to deal with collisions in such a case. For example, both the Android and Chrome programs are hosted at https://bughunters.google.com, so both would get a google.json file. As a workaround we could name them google.1.json and google.2.json, for example.
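A rough sketch of the bucketed layout with a numeric suffix on collision. Note that it is a variation on the workaround above: the first file keeps the plain name (`google.json`) and only later ones get a suffix (`google.1.json`, ...), so existing files never need renaming. The helper name `program_path` is hypothetical.

```python
from pathlib import Path


def program_path(root: Path, slug: str) -> Path:
    """Return the first free path for slug, bucketed by its first character."""
    if not slug:
        raise ValueError("slug must not be empty")
    bucket = root / slug[0]
    candidate = bucket / f"{slug}.json"
    index = 1
    # Try slug.json first, then slug.1.json, slug.2.json, ... until one is free.
    while candidate.exists():
        candidate = bucket / f"{slug}.{index}.json"
        index += 1
    return candidate
```

This only picks a free filename; it does not tell whether the new entry duplicates a program that is already stored under another name.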

prodigysml · 2021-11-24T09:26:13Z

Yup, that's what I meant! Hmm, I didn't think about the collisions, honestly. Your idea sounds good, but maintainers will need to make sure they merge carefully and don't duplicate data.

nikitastupin · 2021-11-24T09:32:07Z

Maybe we can automatically deduplicate (via GitHub Actions) based on the data provided? For example, check that "policy_url" is unique or that some combination of fields is unique.
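One way such a check might look, as a sketch: group programs by `policy_url` and report any URL that appears more than once. The function name `find_duplicate_policy_urls` is hypothetical, and treating URLs that differ only by surrounding whitespace or a trailing slash as equal is an assumption.

```python
from collections import defaultdict


def find_duplicate_policy_urls(programs: list[dict]) -> dict[str, list[str]]:
    """Map each policy_url used more than once to the program names sharing it."""
    names_by_url: dict[str, list[str]] = defaultdict(list)
    for program in programs:
        # Assumed normalization: ignore surrounding whitespace and a trailing slash.
        url = str(program.get("policy_url", "")).strip().rstrip("/")
        if url:
            names_by_url[url].append(str(program.get("program_name", "<unnamed>")))
    return {url: names for url, names in names_by_url.items() if len(names) > 1}
```

A CI job could fail the build when the returned dict is non-empty, or merely print it as a warning if several programs legitimately share one policy page.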

nikitastupin · 2021-12-03T09:30:41Z

Hi @prodigysml, hi @yesnet0! I've just added a proof of concept for the issue in #351 😃

yesnet0 · 2021-12-04T01:15:02Z

@nikitastupin I like it so far. In the context of what diodb is trying to solve (catalog all known policy URLs along with their safe harbor status and optional attributes), policy_url itself could be considered the primary key. @prodigysml and @jmanoto - Any thoughts on how we could better manage collisions here?

sickcodes · 2021-12-06T10:59:33Z

Resolving conflicts on JSON isn't difficult IMO. Do you mean doubles?

nikitastupin · 2021-12-08T08:39:27Z

Hi @sickcodes 👋

> Resolving conflicts on JSON isn't difficult IMO ...

I agree that, compared to more complex merge conflicts, resolving this might be easy (though I personally don't know an easy way to do it). However, if we can avoid conflicts with little or no tradeoff, why not avoid them?

I'm aware of the following tradeoffs: (1) we would have to change the repo structure (short-term), and (2) whoever adds a program has to figure out the primary key (filename) for it (long-term). We can mitigate the second tradeoff by providing a script that helps generate a new program (kind of like npm init helps generate package.json).
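Such a helper could be as small as the sketch below. The field names are copied from the apple.json example earlier in the thread; the empty defaults and the function name `new_program` are assumptions, not the project's schema.

```python
import json

# Field names taken from the apple.json example; the defaults are placeholders.
TEMPLATE = {
    "program_name": "",
    "policy_url": "",
    "launch_date": "",
    "offers_bounty": "",
    "offers_swag": False,
    "hall_of_fame": "",
    "safe_harbor": "",
    "public_disclosure": "",
    "pgp_key": "",
    "hiring": "",
    "securitytxt_url": "",
    "preferred_languages": "",
    "policy_url_status": "",
    "contact_email": "",
}


def new_program(program_name: str, policy_url: str, **fields) -> str:
    """Return tab-indented JSON text for a new program entry."""
    unknown = set(fields) - set(TEMPLATE)
    if unknown:
        raise KeyError(f"unknown fields: {sorted(unknown)}")
    program = {**TEMPLATE, "program_name": program_name, "policy_url": policy_url, **fields}
    return json.dumps(program, indent="\t") + "\n"
```

A contributor would run it once, write the returned text to the file chosen for the program, and fill in the remaining fields by hand.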

Also, storing each program in a separate file could help avoid duplicate entries. For example, we now have 272 https://g.co/vrp policy_urls (grep 'g.co/vrp' program-list.json | wc -l) that point to the same Google VRP program. This fact isn't obvious when we store all programs in one file.

> ... do you mean doubles?

Sorry, I didn't quite get the point. Could you elaborate on what you mean by "doubles"?

sickcodes · 2022-01-15T18:24:51Z

Doubles, meaning duplicate entries.

a58b6b2#diff-3209bee5852a8fc2dde56c367fffe517831fa18462004498955d355234899867R39-R40

Previously, pandas would load the raw JSON into a DataFrame and write out an alphabetically sorted, de-duplicated JSON, with jq used to pretty-print it.
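For illustration only, the same sort-and-deduplicate step can be sketched with the standard library. This is not the pandas code from the linked commit; the function name `sort_and_dedupe` is hypothetical, and both "duplicate means every field is identical" and "sorted by `program_name`" are assumptions.

```python
import json


def sort_and_dedupe(programs: list[dict]) -> list[dict]:
    """Drop exact duplicate entries, then sort by program_name (case-insensitive)."""
    seen: set[str] = set()
    unique = []
    for program in programs:
        # Serialize with sorted keys so key order does not affect equality.
        key = json.dumps(program, sort_keys=True)
        if key not in seen:
            seen.add(key)
            unique.append(program)
    return sorted(unique, key=lambda p: str(p.get("program_name", "")).lower())
```

Entries that describe the same program but differ in any field survive this pass, so it complements rather than replaces a uniqueness check on policy_url.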

nikitastupin · 2022-07-04T15:06:46Z

Another way to solve the rebase problem would be to use issue forms (https://docs.github.com/en/communities/using-templates-to-encourage-useful-issues-and-pull-requests/syntax-for-issue-forms) instead of PRs: a contributor opens an issue and fills in a form, then a maintainer checks the submission and, if everything is fine, runs a chat-ops command to merge the program. It's not clear how to handle changes or deletions in this case, though.

nikitastupin mentioned this issue Dec 3, 2021

Store each program in a separate file #351

Closed

nikitastupin mentioned this issue Jul 28, 2022

Revamp the contribution process #387

Open
