Optimize notebook validation #2053

Open
wants to merge 8 commits into
base: main
from
29 changes: 17 additions & 12 deletions docs/notebook_validation.py
@@ -13,9 +13,11 @@
 import subprocess
 from shutil import which

-if which('jupyter') is None:
-    print("Please install jupyter, e.g. with `pip install notebook`.")
-    exit(1)
+
+def check_jupyter_installed():
+    if which('jupyter') is None:
+        print("Please install jupyter, e.g. with `pip install notebook`.")
+        exit(1)


 def read_available_backends():
@@ -26,15 +28,16 @@ def read_available_backends():

 def validate(notebook_filename, available_backends):
     with open(notebook_filename) as f:
-        lines = f.readlines()
-    for notebook_content in lines:
-        match = re.search('set_target[\\\s\(]+"(.+)\\\\"[)]', notebook_content)
-        if match and (match.group(1) not in available_backends):
-            return False
-    for notebook_content in lines:
-        match = re.search('--target ([^ ]+)', notebook_content)
-        if match and (match.group(1) not in available_backends):
-            return False
+        lines = f.read()
+    if any(
+            re.search(pattern, lines) for pattern in
+            ['set_target[\\\s\(]+"(.+)\\\\"[)]', '--target ([^ ]+)']):
+        matches = re.findall(
+            'set_target[\\\s\(]+"(.+)\\\\"[)]|--target ([^ ]+)', lines)
Collaborator

I'm having trouble figuring out what this code change is doing. It looks more confusing to me with the duplicate strings in there. What exactly is this optimizing?

And I'm also curious which notebooks (if any) show a measurable difference in runtime because of this optimization?
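
One way to answer the runtime question is to time both approaches on the same notebook text. The sketch below uses only the standard library. The names per_line, whole_file, and compare, the simplified pattern, and the synthetic sample text are hypothetical stand-ins, not code from this PR. No expected timings are shown because they depend on the machine and on the notebooks measured.

```python
import re
import timeit

# Simplified stand-in pattern (an assumption, not the PR's exact regex).
PATTERN = re.compile(r'--target ([^\s\\"]+)')


def per_line(text, available):
    """Old style: search each line separately."""
    for line in text.splitlines():
        match = PATTERN.search(line)
        if match and match.group(1) not in available:
            return False
    return True


def whole_file(text, available):
    """New style: scan the whole text with a single findall call."""
    return all(name in available for name in PATTERN.findall(text))


def compare(text, available, repeats=20):
    """Return (per_line_seconds, whole_file_seconds) for `repeats` runs each."""
    t_line = timeit.timeit(lambda: per_line(text, available), number=repeats)
    t_whole = timeit.timeit(lambda: whole_file(text, available), number=repeats)
    return t_line, t_whole


if __name__ == "__main__":
    # Synthetic input; real notebooks would be read from disk instead.
    sample = "\n".join(["x = 1"] * 10000 + ["!nvq++ --target nvidia a.cpp"])
    print(compare(sample, {"nvidia"}))
```

Running compare on the text of each notebook in the repository would show whether any of them differs measurably between the two approaches.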

Collaborator Author

Thanks @bmhowe23 for reviewing this change. I will put these patterns in a list instead.

This change avoids unnecessary computation by first checking whether any pattern from the pattern list occurs in the file. If none does, the remaining checks (the if condition for a match) are skipped.

If any pattern exists, it uses re.findall to collect all matching substrings in a single operation.

Basically, the idea of this PR is to read the entire file at once with f.read(). This avoids f.readlines(), which creates a list of strings (one per line) in memory and could use a lot of memory.
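
A minimal sketch of the single-pass idea, with the patterns kept in one list so that no string is duplicated. The names TARGET_PATTERNS, COMBINED, and find_backends are hypothetical, and the two regular expressions are simplified stand-ins for the ones in the diff, written as raw strings. They assume the notebook text contains set_target(\"name\") with JSON-escaped quotes, or a --target name flag. Because re.findall returns an empty list when nothing matches, this sketch leaves out the separate re.search pre-check.

```python
import re

# Hypothetical, simplified patterns; each has exactly one capture group.
TARGET_PATTERNS = [
    r'set_target\(\\"([^"\\]+)\\"\)',  # set_target(\"name\") as stored in notebook JSON
    r'--target ([^\s\\"]+)',           # --target name on a command line
]
COMBINED = re.compile('|'.join(TARGET_PATTERNS))


def find_backends(text):
    """Return every backend name referenced in text, in order of appearance."""
    # With two groups, findall yields one 2-tuple per match; the group that
    # did not take part in the match is an empty string.
    return [first or second for first, second in COMBINED.findall(text)]


def validate(notebook_filename, available_backends):
    """Return True if every referenced backend is available."""
    with open(notebook_filename) as f:
        text = f.read()
    return all(b in available_backends for b in find_backends(text))
```

The tuple handling in find_backends plays the same role as the match[0] if match[0] else match[1] expression in the diff.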

+        for match in matches:
+            backend = match[0] if match[0] else match[1]
+            if backend not in available_backends:
+                return False
     return True


@@ -75,6 +78,8 @@ def print_results(success, failed, skipped=[]):


 if __name__ == "__main__":
+    check_jupyter_installed()
+
     if len(sys.argv) > 1:
         notebook_filenames = sys.argv[1:]
         notebooks_success, notebooks_failed = ([] for i in range(2))