Add scripts for PR metrics from github API #13

Open · wants to merge 29 commits into base: main
Conversation

Contributor
@mpg commented Oct 18, 2021

These scripts track a number of metrics about PRs over time, using the GitHub API.

So far they've only been used by me on my machine, but they should be a shared team resource, so here's a PR. The current versions are not intended to be final; they're a starting point that we can evolve in future PRs as needed.

(Some previous review history at https://github.com/ARMmbed/mbedtls-test-old/pull/153 (private).)

Contributor

@gilles-peskine-arm left a comment

I checked that I could generally understand what the scripts are doing and that it matches what they're documented to do. I also checked that I could run the scripts on my machine and that their output looks plausible.

I left a number of comments suggesting improvements to the documentation, the coding style or the code structure. The only one I consider a blocker is about the script docstrings not really explaining how to run the scripts. I would prefer to have the other documentation improvements as well. The non-documentation changes are merely nice-to-haves or things to do in the future, not obstacles to merging what's already there. In the future, I would like this code to pass the same checks on Python code as in the mbedtls repository.


from github import Github

if "GITHUB_API_TOKEN" in os.environ:
Contributor

Suggestion for later: make the token API (command line options, file storage, …) compatible with the official command line tool gh. (This is not the markedly less popular python gh.) I'd like to make my own scripts compatible too — we (you, me, and I guess at least @bensze01 as well) should coordinate and write a python module for that, if there isn't already one. (I didn't find one in a cursory search but “python github gh” are not very specific search terms.)
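As a sketch of what gh-compatible token handling might look like: the official gh CLI honours the GH_TOKEN and GITHUB_TOKEN environment variables, so a shared helper could check those before this project's own GITHUB_API_TOKEN. The function name and lookup order here are assumptions, not taken from the PR:

```python
import os

def github_token():
    """Return a GitHub API token, or None if none is configured.

    Checks the variables honoured by the official `gh` CLI (GH_TOKEN,
    GITHUB_TOKEN) before this project's historical GITHUB_API_TOKEN.
    """
    for var in ("GH_TOKEN", "GITHUB_TOKEN", "GITHUB_API_TOKEN"):
        token = os.environ.get(var)
        if token:
            return token
    return None
```

Reading gh's own config file (~/.config/gh/hosts.yml) would be a further step, left out here.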

#!/usr/bin/env python3
# coding: utf-8

"""Produce summary or PRs pending per branch and their mergeability status."""
Contributor

Please document the requirements to run this script (run get-pr-data.py to produce pr-data.p). Goes for the other scripts as well.
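Besides documenting it in the docstring, one lightweight way to satisfy this is to fail early with a helpful message when the data file is missing. A sketch (the helper name is made up):

```python
import os
import sys

def require_data_file(path="pr-data.p"):
    """Exit with a helpful message if the pickled PR data is missing."""
    if not os.path.exists(path):
        sys.exit("Missing {}: run get-pr-data.py first (it fetches PR data "
                 "from the GitHub API and pickles it).".format(path))
```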

c_open = Counter()
c_mergeable = Counter()
c_recent = Counter()
c_recent2 = Counter()
Contributor

Non-blocker: is recent2 more recent than recent or less? Are they exclusive or staggered? Better names would help.


I'd prefer cnt_open or even count_open (etc), which I think help readability quite a bit

# coding: utf-8

"""Get PR data from github and pickle it."""

Contributor

General suggestion, not a blocker: all scripts should support --help, and should have most of their code inside functions or classes so that they can be called from another script bypassing the command line interface.
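A minimal skeleton of that structure might look like this (all names here are illustrative, not from the PR):

```python
import argparse

def summarize(data_file):
    """Core logic, importable and callable without going through the CLI."""
    ...  # placeholder for the script's actual work

def main(argv=None):
    """Command line entry point; wire this under an
    `if __name__ == "__main__":` guard in the real script."""
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("--data", default="pr-data.p",
                        help="pickled PR data (from get-pr-data.py)")
    args = parser.parse_args(argv)
    summarize(args.data)
```

argparse generates --help automatically from the description and argument help strings, and another script can call summarize() directly, bypassing the CLI.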

last_q = quarter(last)

cnt_all = Counter()
cnt_com = Counter()
Contributor

I could guess that cnt stands for “count” or “counter” even without seeing = Counter(), but what's com? “Command”? “Communication”?

(Obviously by reading the rest of the code I can tell that it's “community”. But I wouldn't have guessed.)

And actually, as I wasn't familiar with collections.Counter, I first thought that these were counters, but they're actually dictionaries of counts, i.e. multisets. (I consider this a bad choice of name in the Python standard library; it's fine for this script to follow it.) It would help to know what the keys are. This goes for the other scripts that use Counter as well.
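For illustration, the pattern in question: a Counter used as a multiset, mapping each key to how many PRs fall under it (the key choice here, quarter strings, is my guess from context):

```python
from collections import Counter

# cnt_all maps a quarter string (e.g. "21q1") to the number of PRs
# created in that quarter -- a dictionary of counts, not a single counter.
cnt_all = Counter()
for q in ("21q1", "21q1", "21q2"):
    cnt_all[q] += 1

print(cnt_all["21q1"])  # -> 2
```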


import matplotlib.pyplot as plt

cnt_tot = Counter()
Contributor

So cnt_tot[date] is the number of open PRs at date, right?

@@ -0,0 +1,88 @@
#!/usr/bin/env python3
Contributor

This doesn't really do anything useful when called as a standalone program. It should probably not have a shebang line. It already isn't executable.


def is_community(pr):
"""Return False if the PR is from a team member or from inside Arm."""
# starting from 2021 we consistently label community PRs
Contributor

We're not that consistent, actually. There's some ambiguity over whether PRs from other Arm teams or from close contributors such as Silabs are considered community. I'm not sure “has the community label” is any better than the old heuristic.
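For reference, a sketch of how the two heuristics could be combined: prefer the explicit label (applied from 2021 onwards), falling back to an author-based guess for older PRs. The label name, the team list, and the attribute access on the PR object are assumptions, not taken from the PR:

```python
# Hypothetical, incomplete list of team logins.
TEAM_MEMBERS = {"mpg", "gilles-peskine-arm"}

def is_community(pr):
    """Return False if the PR is from a team member or from inside Arm.

    Checks the explicit "community" label first, then falls back to the
    author-based heuristic.
    """
    if any(label.name == "community" for label in pr.labels):
        return True
    return pr.user.login not in TEAM_MEMBERS
```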



def quarter(date):
"""Return a string decribing this date's quarter, for example 19q3."""
Contributor

Useful information: comparisons on the returned value implement chronological order.

(Our successors will deal with the Y21K problem.)
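For concreteness, an implementation consistent with that docstring (a sketch, not necessarily the PR's exact code), where lexicographic comparison of the results matches chronological order until 2100:

```python
import datetime

def quarter(date):
    """Return a string describing this date's quarter, for example 19q3."""
    return "{:02d}q{}".format(date.year % 100, (date.month - 1) // 3 + 1)

print(quarter(datetime.date(2019, 7, 1)))  # -> 19q3
```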

Comment on lines +10 to +11
with open("pr-data.p", "rb") as f:
prs = pickle.load(f)
Contributor

Importing a module shouldn't have side effects. All the code here should go into a class and the loading should go into the constructor.
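A sketch of that refactoring (the class and attribute names are invented here):

```python
import pickle

class PrData:
    """Hold the pickled PR data; loading happens at construction time,
    not as a side effect of importing the module."""

    def __init__(self, path="pr-data.p"):
        with open(path, "rb") as f:
            self.prs = pickle.load(f)
```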

@gilles-peskine-arm gilles-peskine-arm added the size-s Estimated task size: small (~2d) label Dec 16, 2021
@gilles-peskine-arm
Contributor

Labeling "size:S" based on review time (I would guess a size:M based on coding time, but this is relatively easy to review since checking what a library function does is easier than looking for a library function that does a given thing).

- matplotlib >= 3.1 (3.0 doesn't work)
- PyGithub >= 1.43 (any version should work, that was just the oldest tested)

### Ubuntu 20.04 (and probaly 18.04)

Nit: probaly -> probably

Suggested change
### Ubuntu 20.04 (and probaly 18.04)
### Ubuntu 20.04 (and probably 18.04)

- matplotlib >= 3.1 (3.0 doesn't work)
- PyGithub >= 1.43 (any version should work, that was just the oldest tested)

### Ubuntu 20.04 (and probaly 18.04)

Also works on 22.04 (tested by me) in case you want to note that for the future's sake.


prs = list()
for p in r.get_pulls(state="all"):
print(p.number)

I'd be tempted to make this more informative - maybe changing it from p.number to "Fetching #" + str(p.number)

Might not be necessary for such a simple script though




print(" branch: open, mergeable, <31d, <8d")
for b in sorted(c_open, key=lambda b: c_open[b], reverse=True):

I'd traditionally define lambda functions as lambda x: etc. It's notable here since the lambda b and for b syntax makes it look like there's some relationship between the two, which there isn't.
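As an illustration of the point, and of an alternative that avoids the lambda entirely, since Counter.most_common() already sorts by count in descending order:

```python
from collections import Counter

c_open = Counter({"development": 177, "mbedtls-2.28": 6, "mbedtls-2.16": 1})

# Renaming the bound variable makes it clear the lambda's argument is
# independent of any surrounding loop variable:
by_count = sorted(c_open, key=lambda x: c_open[x], reverse=True)

# Equivalent, with no lambda at all:
also_by_count = [branch for branch, _ in c_open.most_common()]

print(by_count)  # -> ['development', 'mbedtls-2.28', 'mbedtls-2.16']
```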


print(" branch: open, mergeable, <31d, <8d")
for b in sorted(c_open, key=lambda b: c_open[b], reverse=True):
print("{:>20}: {: 10}, {: 10}, {: 10}, {:10}".format(

This produces hard to read tables if a branch name is > 20 chars long (see below). However, I can't immediately see a good/simple way to fix this, so it might just have to be accepted as a limitation

              branch:       open,  mergeable,       <31d,        <8d
         development:        177,         91,         51,         41
        mbedtls-2.28:          6,          6,          3,          3
dev/gilles-peskine-arm/psa-test-op-fail:          1,          1,          1,          1
        mbedtls-2.16:          1,          0,          0,          0
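One simple fix (not in the PR) would be to compute the first column's width from the longest branch name instead of hard-coding 20; str.format supports nested replacement fields for exactly this:

```python
from collections import Counter

# Illustrative data, including one long branch name.
c_open = Counter({"development": 177,
                  "dev/gilles-peskine-arm/psa-test-op-fail": 1})

width = max(len(branch) for branch in c_open)
row = "{:>{w}}: {:>10}"  # the column width is itself a format argument
print(row.format("branch", "open", w=width))
for branch in sorted(c_open, key=lambda x: c_open[x], reverse=True):
    print(row.format(branch, c_open[branch], w=width))
```

The trade-off is that the table's width then varies from run to run, which may or may not be acceptable.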



first = datetime.date(2015, 1, 1)
last = datetime.date(2099, 12, 31)

These names are tricky to guess the purpose of, particularly when imported into a different file so the value isn't known. Maybe date_range_start and date_range_end?

ax.set_ylabel("median lifetime in days of PRs created that quarter")
ax.tick_params(axis="x", labelrotation=90)
bot, top = ax.set_ylim()
ax.set_ylim(0, min(365, top)) # we don't care about values over 1 year

Is there a reason why we don't care about values over a year? Several quarters (for example 16q1) exceed this value

@mpg mpg self-assigned this Apr 25, 2022
Otherwise we end up including too recent PRs in the quarterly reports,
which introduces unwanted variability.

Signed-off-by: Manuel Pégourié-Gonnard <[email protected]>
Merge today's members of the mbed-tls-reviewers into the list.

Signed-off-by: Manuel Pégourié-Gonnard <[email protected]>
This reverts commit e69fb3a.

This was a misguided attempt: it makes the result harder to read without
actually solving the underlying problem that we sometimes ask for the
median of a series that's still not complete enough.
The statistics module entered the standard library in 3.4, so it should
be safe to use by now.

Signed-off-by: Manuel Pégourié-Gonnard <[email protected]>
Signed-off-by: Manuel Pégourié-Gonnard <[email protected]>
Signed-off-by: Manuel Pégourié-Gonnard <[email protected]>
mpg added 4 commits April 6, 2023 08:48
Signed-off-by: Manuel Pégourié-Gonnard <[email protected]>
Signed-off-by: Manuel Pégourié-Gonnard <[email protected]>
Signed-off-by: Manuel Pégourié-Gonnard <[email protected]>
- default dates - defaults should correspond to typical use
- one legend was not in the same place as the others

Signed-off-by: Manuel Pégourié-Gonnard <[email protected]>

import matplotlib.pyplot as plt

new_days = 90
Contributor

There was a bit of a push to standardise ages for what is considered recent/old in OSS a while back. The thresholds picked were 15 and 90.

It might be useful to have a few extra ranges, e.g. <15, 15-90, 90-365, >365 to align better with this? If I have time I'll push an update.
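For example, bucketing by those thresholds could look like this (the function and constants are illustrative, not from the PR):

```python
import datetime

# Thresholds discussed above: 15 and 90 days, plus one year.
AGE_BUCKETS = ((15, "<15d"), (90, "15-90d"), (365, "90-365d"))

def age_bucket(created, today):
    """Classify a PR by its age in days on a given date."""
    age = (today - created).days
    for limit, name in AGE_BUCKETS:
        if age < limit:
            return name
    return ">365d"
```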

Contributor

Actually, thinking more, this is a lot like the median lifetime graph, but with a couple of thresholds which are based on age rather than percentiles. Would it be better to have a graph that shows e.g., median, 75th percentile, 95th percentile? Pros and cons either way I think, maybe worth exploring it if it's quick/easy to do but I'm not sure if it would be better or not?

Contributor Author

Oh, I didn't know about the push for standardised thresholds, indeed it would be good to align with that (and add some of our own if needed). Unfortunately, the way the script is structured currently makes it a very manual change (you can't just give a list of thresholds at the top and have everything else work automagically).

Contributor Author

Regarding adding other percentiles to the median one, that's something I've been considering for a while, but the problem is over what set. Currently it's over "PRs created this quarter", which means (considering Q1 for the sake of concreteness) we can only compute the median (or an upper bound for it) after we've closed at least 50% of the PRs created in Q1. Fortunately, that's usually the case at the very beginning of Q2 when we prepare our report. (Even then, we might get only a range, as the value could still get lower if we closed a lot of PRs created near the end of Q1 at the very beginning of Q2.)

But for the 75th, resp. 95th, percentile, we'd need to have closed 75%, resp. 95%, of the PRs created in Q1 by the time we produce our report in early Q2, which realistically is not going to happen most of the time.

We could avoid the problem of incomplete data entirely by instead considering the set of PRs we closed this quarter: there, all the lifetimes are known for sure and we can do stats without uncertainty. But it doesn't tell quite the same thing: for example, doing a lot of historical review one quarter would raise the median age of PRs closed that quarter, even though that's a good thing. So the data might become more difficult to interpret.

So, I think there are basically three ways to select the PRs over which we compute stats / groupings / etc. for one quarter:

  • PRs created this quarter;
  • PRs still open at the end of the quarter (or any specific date);
  • PRs closed this quarter;

and each set will give slightly different information.
Wdyt?
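To make the "upper bound" part of the argument concrete, here is a sketch (simplified to the upper median, and not code from the PR): treating every still-open PR as if it will live forever, the median over the full cohort is bounded above as soon as more than half the cohort is closed:

```python
def median_upper_bound(closed_lifetimes, n_open):
    """Upper bound on the full cohort's (upper) median lifetime in days,
    or None if too few PRs are closed for the median to be bounded yet.

    Still-open PRs can only get older, so pessimistically treating their
    lifetimes as infinite pushes them past every closed value."""
    n_total = len(closed_lifetimes) + n_open
    k = n_total // 2  # 0-based rank of the upper median
    ordered = sorted(closed_lifetimes)
    if k < len(ordered):
        return ordered[k]
    return None  # half or more still open: median not yet bounded
```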

Contributor

Agree, there isn't an easy answer or obvious better way. So let's leave as-is for now.

Labels
enhancement New feature or request needs: work priority-low size-s Estimated task size: small (~2d)

5 participants