Skip to content

Latest commit

 

History

History
311 lines (204 loc) · 18.5 KB

JupyterCon-2017.md

File metadata and controls

311 lines (204 loc) · 18.5 KB

Watch the video

About

This file hosts a submission on rerunning Jupyter notebooks from PMC for JupyterCon, to be held in New York on August 23-25, 2017 (see my notes on the event). The deadline for submitting talks was March 14, 2017, 11:59pm Eastern Daylight Time, and I received confirmation of my submission at 11:49pm EDT that day. On April 13, I was informed that notifications of acceptance or rejection would be sent out "between now and next week (the week of April 17)." On April 22, I received notification that the talk had been selected, and I accepted the offer to present. The talk is now scheduled for 4:10pm - 4:50pm EDT (40 minutes) on August 25 in Murray Hill (see also its entry in the program).

Like the rest of the project, this was a collaborative effort. Special thanks go to Tom Pollard for help with the submission.

Outline of the talk

Timing

The slot is 40 min, so I'll be aiming for something like 25-30 min of presentation and 15-10 min of Q & A.

Title

Post-publication peer review of Jupyter Notebooks referenced in articles on PubMed Central

Definitions

Premises & comments on previous talks

Inspiration

Hackathon project

Observations

  • initial write-up
  • Jupyter notebooks
    • provide great learning experiences when they actually run
    • need to be shared in a more standardized fashion to allow reuse
    • can trigger frustration or additional learning experiences when they do not run
    • are growing in popularity on PMC

Early prototypes for notebook verification

Discussion

Recommendations

Open questions

  • multiple
    • to what extent can we reuse existing infrastructure for the purpose of
      • validating a given notebook?
        • its containerized version?
    • what standards exist or are emerging in terms of best practices for
      • mentioning notebooks
      • citing notebooks
      • expressing dependencies
      • documenting when a validation was run, and by whom
        • e.g.
          • before submission of the paper (perhaps by author)
          • during review (perhaps by reviewer)
          • upon publication of the paper (perhaps by journal)

Outlook

  • research
  • Wikimedia
  • offline
  • data journalism
  • looking for partners to pilot potential solutions

Contact

Submission

Below the horizontal line is a copy of the submission form, with my responses filled in as blockquotes. The form used a non-markdown way of markup, which is preserved in the way the hyperlinks are given.

In the "key takeaways" section, I went for just three, omitting the following two:

Research institutions and publishers should establish workflows for verifying Notebooks in a standardized fashion. We are looking into possible approaches to do this.

Researchers and others using Jupyter Notebooks to document their workflows should make use of such verification environments routinely.

I listed only myself as a speaker, since they were asking for email, bio and pic.


Submitting a proposal is a two stage process. First, enter the proposal details on this page. Because the first round will be a blind review, please do not include your name, affiliation, or any other identifying information in the title, description, or abstract of your proposal.

After submitting these, you will then be asked to input the speaker details.

Note: You will be logged out after 6 hours so please save your proposal locally if you need extra time.

Proposal title*

Post-publication peer review of Jupyter Notebooks referenced in articles on PubMed Central

Description* (brief overview for marketing purposes, max. length 400 characters—about 65 words)

Jupyter Notebooks are a popular option for sharing data science workflows. We sought to explore best practices in this regard and chose to analyze Jupyter Notebooks referenced in PubMed Central in terms of their reproducibility and other aspects of usability (e.g. documentation, ease of reuse). The project started at a hackathon earlier this month, is still ongoing and documented on GitHub.

Characters remaining: 400

Topic (Please choose the one topic most relevant to your proposal) *

  • Core architecture
  • Jupyter subprojects
  • Usage and application
  • JupyterHub deployments

Reproducible research and open science

  • Development and community
  • Documentation
  • Kernels
  • Extensions and customization

Session type*

40-minute session

Abstract* (Longer, more detailed description of your presentation to help the program committee understand what you will cover. If your proposal is chosen, this is the description that will appear on the website. Note that our copywriters may edit it for consistency and O'Reilly voice.) read formatting help

Jupyter Notebooks are a popular option for sharing data science workflows.

We sought to explore best practices for sharing reproducible research studies, focusing on the Jupyter Notebooks that had been referenced in published research articles. Our analysis reviewed aspects of reproducibility and usability, covering areas like quality of documentation, licensing, and ease of reuse.

Our aim was to understand and document the extent to which these publicly accessible Notebooks are reproducible, both individually and collectively. By identifying the existing barriers to reproducibility, our hope is to lower those barriers for Notebooks shared in the future.

To find research articles with associated Jupyter Notebooks, we performed a search for "ipynb OR Jupyter" on PubMed Central, a full-text database of biomedical articles. This yielded approximately 100 articles, which were then screened for mentions of actual Notebooks.

The association of articles with Notebooks took multiple forms, such as screenshots, supplementary files, links to nbviewer, GitHub repositories and individual Notebooks on GitHub. For those Notebooks that were available in an executable form, we executed them in a clean Jupyter environment.

When executing Notebooks, we recorded whether they ran through without any errors, and the first error message if there was one. Subsequently, we looked at individual errors and tried to resolve them. Most frequently, this involved fixing code and data dependencies. Lack of documentation was often a barrier to reproducibility, as was code that was dependent on the platform (e.g. shell commands) and use of non-Python software packages (e.g. Java).

Our next step is to analyze the remaining errors, on the basis of which we are deducing recommendations on how they could be avoided. In addition, for those Notebooks that we manage to execute, we are documenting whether the results match the ones originally reported in the associated papers.

Some Notebooks are provided along with a containerized version, usually through Docker. In such cases, we are taking an approach similar to analyzing the Notebooks themselves: attempting to build the container from scratch, run it and document problems encountered on the way, as well as attempts to solve them.

The project is a collaborative effort, still ongoing and documented in an open-science manner "on GitHub":(sparcopen/open-research-doathon#25), as is "this submission":(https://github.com/Daniel-Mietchen/events/blob/master/JupyterCon-2017.md). I would like to thank everyone who has contributed so far.

Suggested secondary topic(s)

Please suggest up to 3 secondary topics, separated by a comma, that will best represent your proposal. Who is this presentation for? *

Notebook validation, publishing, education

(Job titles/functions and/or experience, for instance)

researchers, data librarians, reviewers, publishers and research administrators

Audience level *

Intermediate

What's the takeaway for the audience? *

Main ideas and/or skills attendees will learn from your presentation

Efforts are needed to improve and standardize the way Jupyter Notebooks and associated containers are shared along with published research articles.

Mechanisms (e.g. badges) should be explored to signal whether a given Notebook has passed such a standardized procedure.

Reviewers of research involving Jupyter Notebooks should be made aware of the reusability issues outlined here.

Prerequisite knowledge for this presentation *

This information is crucial for attendees. Please describe what skills and knowledge attendees need to have in order to get the most from your talk.

We expect attendees to have some basic familiarity with Jupyter Notebooks and with the concept and practicalities of reproducibility in research.

Is this session more conceptual or how-to? *

How-to

Application area

(i.e. science, education, industry, etc.)

research, education

Tutorial hardware and/or installation requirements for attendees

If you are submitting a proposal for a 3-hour tutorial, what materials or downloads will attendees need in advance of your tutorial (e.g., a GitHub account, most current version of MySQL, a laptop, etc.)?

N/A

Please let us know if you have any special equipment needs for your presentation.

(Note: Each presentation room will have a podium, projector, 16:9 screen, head table, microphone, ethernet connection for speaker, and wifi.)

N/A

Video URL

While not compulsory, we strongly encourage you to provide at least one video clip of the speaker. If you don’t have video of the speaker in action at an event, please create a very short clip (2-3 minutes, see examples) proposing the session. We don’t care about the quality of the video; we care about the quality of the speakers. If your video isn’t already online, post it to a third-party site (YouTube is fine), and then share the link with us.

https://github.com/Daniel-Mietchen/events/blob/master/recordings.md

O'Reilly Author

If you are an author or content contributor at O’Reilly, please fill in the name of your main O’Reilly editorial contact.

N/A

Did someone recommend or encourage you to submit a proposal? If so, please let us know who sent you:

N/A

Diversity *

Diversity is one of the factors we seriously consider when reviewing proposals as we seek to broaden our speaker roster.

Does your presentation have the participation of a woman, person of color, person with disabilities, or member of another group often underrepresented at tech conferences?

No

Travel & other expense reimbursements *

If your proposal is chosen, would you need assistance with hotel or travel related expenses (for example, because you are self-employed)? Please note: Your answer to this question will have no impact on the committee’s review of your proposal.

Yes.

If so, please describe.

This is a spare time project, and I do not have any funding for attending JupyterCon at the moment, though I might be able to arrange for accommodation. If the submission is accepted, it would thus be helpful if you could assist with covering my return flight from Germany (ca. USD 800). If this is not possible, a fallback option would be to make use of the collaborative nature of the project and send someone for whom travel costs would be lower. One of the contributors - Tom Pollard, who is based in Boston - has signaled his availability for that.

Additional notes

Any other information you would like the program committee and/or conference organizers to know? Code of Conduct

All participants, including speakers and presenters, must follow our Code of Conduct, the core of which is this: JupyterCon should be a safe and productive environment for everyone.

By submitting a proposal, I (and any other presenters included in this speaking proposal) agree to follow the conference Code of Conduct. Read more »

You must now add one or more speakers before you can submit this proposal.

Add speaker