Making biological network knowledge discoverable and accessible on search engines #218

Open
cannin opened this issue Jan 20, 2023 · 42 comments

@cannin

cannin commented Jan 20, 2023

Background

A biological pathway is a network consisting of interactions between biological molecules (e.g. proteins and chemicals) in a cell that lead to a certain product or a change. Researchers commonly use such networks to summarize research results about how biological molecules interact in healthy individuals and how changes in these relationships can lead to diseases, such as arthritis and cancer.

Pathway Commons (PC) is a popular web resource that aggregates machine-readable data about biological pathways (>5,700) and interactions (>2.4 million) from 22 popular curated public databases. Users can interactively explore this resource through the PC Search page to find out how a query (e.g. gene or disease) is connected to millions of pathways and interactions in the collection.

Goal

Create a sitemap for PC Search pathway and interaction network pages.

Sub-Tasks

This work will involve modifications to the PC Search source code:

  • Identify the biological network data items in PC Search to be included in a sitemap
  • Create the sitemap referencing networks
    • Generate a sitemap referencing the individual pathway and interaction pages, possibly including snapshot images (see Stretch goals below), all of which should conform to Google's sitemap recommendations
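
As a rough illustration of the sitemap sub-task, the following is a minimal sketch in Python (the PC Search codebase itself is JavaScript, so this only shows the shape of the output). It writes one `<url><loc>` entry per page using the standard sitemap protocol namespace. The page URLs are hypothetical placeholders, not the real PC Search URL pattern.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemap protocol (sitemaps.org).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a sitemap XML string with one <url><loc> entry per URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        # ElementTree escapes characters such as '&' in the text for us.
        ET.SubElement(entry, "loc").text = url
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Hypothetical page URLs, for illustration only.
example_urls = [
    "https://example.org/pathways?uri=reactome%3AR-HSA-1",
    "https://example.org/pathways?uri=reactome%3AR-HSA-2&x=1",
]
sitemap_xml = build_sitemap(example_urls)
print(sitemap_xml)
```

The same structure could be produced from the JavaScript side of PC Search; the important parts are the namespace and the escaping of query-string characters in each `<loc>`.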

Stretch goals

  • Include useful metadata in web pages to be indexed
    • Descriptions
    • Entities (e.g. genes, organism)
  • Index images/snapshots of each network
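
For the image/snapshot stretch goal, one option is Google's image sitemap extension, which attaches `<image:image>` entries to each page entry. The sketch below is a minimal Python illustration under that assumption; the page and image URLs are hypothetical, and whether PC Search would host snapshots at stable URLs is not stated in this issue.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
# Namespace of Google's image sitemap extension.
IMAGE_NS = "http://www.google.com/schemas/sitemap-image/1.1"

# Serialize the sitemap namespace as the default and the extension as "image:".
ET.register_namespace("", SITEMAP_NS)
ET.register_namespace("image", IMAGE_NS)


def build_image_sitemap(pages):
    """pages: iterable of (page_url, image_url) pairs."""
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for page_url, image_url in pages:
        entry = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(entry, f"{{{SITEMAP_NS}}}loc").text = page_url
        image = ET.SubElement(entry, f"{{{IMAGE_NS}}}image")
        ET.SubElement(image, f"{{{IMAGE_NS}}}loc").text = image_url
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical page and snapshot URLs, for illustration only.
image_sitemap = build_image_sitemap(
    [("https://example.org/pathways?uri=p1", "https://example.org/img/p1.png")]
)
print(image_sitemap)
```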

Significance

This will immediately improve the indexing of PC Search pages by Google. More broadly, this project will expand the audience of researchers able to access and reuse biological research knowledge curated from publications. This will accelerate research discovery and increase the value of each knowledge item in PC Search.

How to start

Difficulty Level: Easy

Size and Length of Project

  • 175 hours
  • 12 weeks

Skills

  • JavaScript and HTML (essential), CSS

Public Repository

Potential Mentors

  • Augustin Luna ({first_name}_{last_name} AT hms.harvard.edu)

Previous post

Background

Pathway Commons (http://pathwaycommons.org/) is an aggregated database of millions of molecular interactions. Data in Pathway Commons is stored in the XML-based BioPAX (http://biopax.org/) format and is aggregated from a collection of approximately 20 databases. Data from Pathway Commons is accessible here.

A previous version of the Pathway Commons site included a details page for each pathway, but these pages are missing from the current site. Previous examples:

Goal

The goal is to generate a new static site from Pathway Commons content, especially the visualization of pathways using the Systems Biology Graphical Notation (SBGN).

Sub-Tasks

  1. Generate static images for each pathway using the SyBLaRS web service: https://gist.github.com/cannin/7e35f3fae274370bd0a70c7b1840c743
  2. Support ease of maintenance and modification.
    • Use a classless CSS design: https://classless.de/
    • Explore client-side search: lunr.js
    • Use a well-supported static site generator (pick one of these: Jekyll, Hugo, Gatsby, Pelican)
    • Minimize the use of JavaScript
    • Build a responsive, mobile-first site that is functional on larger monitors
  3. Dependent on time: generate pathway images for other large collections of SBGN content, including BioModels.
  4. Dependent on time: generate text descriptions for pathways lacking a description.
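
To make the "classless CSS, minimal JavaScript" idea concrete, here is a hand-rolled Python sketch that renders one static page per pathway. It is not how Jekyll, Hugo, Gatsby, or Pelican work internally; it only illustrates the kind of plain semantic HTML a classless stylesheet expects. The field names (`name`, `description`, `image`) and the local `classless.css` file name are assumptions made for the example.

```python
from html import escape
from string import Template

# Plain semantic HTML: no class attributes, so a classless stylesheet can style it.
PAGE = Template("""<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <meta name="viewport" content="width=device-width, initial-scale=1">
  <title>$name</title>
  <link rel="stylesheet" href="classless.css">
</head>
<body>
  <main>
    <h1>$name</h1>
    <p>$description</p>
    <img src="$image" alt="SBGN diagram of $name">
  </main>
</body>
</html>
""")


def render_pathway_page(name, description, image):
    """Fill the template, escaping every value before it reaches the HTML."""
    return PAGE.substitute(
        name=escape(name),
        description=escape(description),
        image=escape(image),
    )


# Hypothetical pathway record, for illustration only.
page = render_pathway_page("A & B signaling", "Demo <description>", "a-b.png")
print(page)
```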

How to Start

Interested applicants should:

  1. Run example code for static image generation: https://gist.github.com/cannin/7e35f3fae274370bd0a70c7b1840c743
  2. Explore libraries/sites mentioned in Goal section
  3. Explore Pathway Commons API: https://www.pathwaycommons.org/pc2/home

Difficulty Level: Easy

Size and Length of Project

175 hours
12 weeks

Skills

HTML (essential), CSS, Python, Jekyll, JavaScript

Public Repository

Potential Mentors

Augustin Luna ({first_name}_{last_name} AT hms.harvard.edu)

@cannin cannin self-assigned this Jan 20, 2023
@cannin cannin added the XML label Jan 20, 2023
@Simer13

Simer13 commented Jan 21, 2023

Hello.
I am a first-year BTech student who has been trying to contribute to open source projects, and I am interested in helping with this project for GSoC 2023. I am proficient in the skills you mentioned and have built some projects as well. I really hope that you assign me this project. I will do my best and fix the issue.

@deep-poharkar

Hey @cannin, I believe I have the required skills, with significant experience, and I would love to contribute to this if you allow me to.

@cannin
Author

cannin commented Jan 24, 2023

@Simer13 @deep-poharkar Thanks for your interest. We are still in the process of applying as a mentoring organization; we should have an answer by Feb 23. Check for an update on that date here.

@yagyesh-bobde

Hello, I am Yagyesh Bobde. I have experience in web dev. I am also interested in working on this project.

@khanspers
Contributor

NRNB has been accepted as a mentoring organization for GSoC 2023! Contributor applications open on March 20. Here are some useful links:

GSoC contributor guide
NRNB project proposal template
Eligibility requirements
Full program timeline

@cd-vishal

@khanspers Is there a Slack channel or community forum where I can ask some questions related to this project?

@khanspers
Contributor

Maybe try [email protected] or email the potential mentor @cannin directly. See email in project description.

@drstone-genius04

Hi, Allan here. I am a third-year IT engineering student working in the web dev and ML domains, and I hope to contribute to this project. I wanted to ask about the required documentation for this project.

@drstone-genius04

Hey @cannin @khanspers, I am getting the following error when trying to set up the project on my PC. Do let me know if I have missed any step.

@drstone-genius04

Screenshot (308)

@drstone-genius04

Hi @cannin @khanspers, I wanted to let you know that I was able to solve the error and get the required data.

@drstone-genius04

Screenshot (311)

@drstone-genius04

I had a small question: instead of Jekyll, is it OK if I use Pelican?

@cannin
Author

cannin commented Mar 14, 2023

@drstone-genius04 Yes, it is okay to use Pelican.

@drstone-genius04

drstone-genius04 commented Mar 14, 2023 via email

@drstone-genius04

@cannin I am getting the following error while installing Pelican plugins. Should I update my Python version and try again? Note that my Python version is 3.10.10.

@awantikamallick

Hello, this is Awantika. I went through the project and would like to work on it. Could you please tell me where I can connect with the organization (a Slack channel or similar), and how and where I should submit my idea?

@cannin
Author

cannin commented Mar 21, 2023

@awantikamallick You can post questions here or email me. If you have a good draft proposal, you can make a Google Doc and I will try to comment on it. Final proposals need to be submitted to Google.

GSoC contributor guide
NRNB project proposal template
Eligibility requirements
Full program timeline

@duckcommit

duckcommit commented Mar 22, 2023

Good day @cannin, I am Vyshnav Ajith. The project looks interesting to me, and I have a few questions regarding Pathway Commons. It would be kind of you to share your email address so that I can get in touch with you.
Thank you.

@awantikamallick

@cannin I think there is an error with this pathway: http://identifiers.org/kegg.pathway/hsa01100. The rest worked well!

@awantikamallick

error

@cannin
Author

cannin commented Mar 22, 2023

@vyshaj See the project description.

@cannin
Author

cannin commented Mar 22, 2023

@awantikamallick Make a note of this in your proposal. Ignore it for now as you work on the rest of your proposal.

@duckcommit

Good day @cannin. I tried to contact you via email. I would like to be part of this project and learn under you, not because I am already well-versed in the necessary technologies, but because I aim to learn them in more depth.
I have looked into the libraries and I am ready for this. How can I send you my proposal so that you can suggest changes to it?

Thank You.

@Murdock9803

Hello @cannin @khanspers, I wish you a very prosperous new year ahead. I am Ayush Sahu, an undergraduate developer from India. I was going through various projects under the NRNB organisation, and Mr. Alexander Pico suggested this repository to me.
I believe my technical skills and stack make me a good fit for this project. I have been reading about Pathway Commons and find this project interesting.
Is this project still open to work on?

@maxkfranz
Member

@Murdock9803, yes. The first steps would be to do background research and plan how to carry out the project. For instance, you could start by researching tools like SyBLaRS, Puppeteer, and Playwright.
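
As one possible starting point for the snapshot research, below is a minimal sketch using Playwright's Python sync API to capture a PNG of a rendered page. It assumes Playwright and a Chromium build are installed (`pip install playwright`, then `playwright install chromium`); the output naming scheme is hypothetical, and waiting for `networkidle` is only a guess at when a network view has finished rendering.

```python
from pathlib import Path
from urllib.parse import quote


def snapshot_path(out_dir, network_id):
    """Map a network identifier to a filesystem-safe PNG path (hypothetical scheme)."""
    return Path(out_dir) / (quote(network_id, safe="") + ".png")


def capture_snapshot(page_url, out_file, width=1280, height=800):
    """Render page_url in headless Chromium and save a PNG screenshot."""
    # Imported lazily so snapshot_path() works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": width, "height": height})
        # Assumption: the view is ready once network activity settles.
        page.goto(page_url, wait_until="networkidle")
        page.screenshot(path=str(out_file))
        browser.close()


# Example of the naming helper only; no browser is launched here.
example_file = snapshot_path("snapshots", "http://identifiers.org/kegg.pathway/hsa01100")
print(example_file.name)
```

Puppeteer offers an equivalent flow in JavaScript, which may fit the PC Search codebase more directly.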

FYI -- @jvwong

@Murdock9803

@maxkfranz As you suggested, I am currently learning about the above technologies. I am also planning to study Lunr.js and classless CSS. I will update here once I am done reading about these or if I get stuck somewhere. Although I think I can find resources and documentation on my own, are there any specific resources you would suggest for learning these?

@maxkfranz
Member

Those are good starting points.

@Murdock9803

Murdock9803 commented Jan 22, 2024

Hello @maxkfranz and @cannin, I hope you are doing fine. Sorry for the late update; I had college examinations coming up. I will let you know about such delays beforehand from now on.

Firstly, here is the update regarding the tools you suggested I learn about:

  • Puppeteer - I researched this thoroughly and tried some simple tasks, such as taking screenshots, generating PDFs, and setting page dimensions. Web automation and web scraping were new to me, but the documentation on the Puppeteer website was really helpful. My knowledge of async JavaScript also helped.

  • Playwright - Since it supports more browsers than Puppeteer (which was mainly developed for Chromium), this was a bigger task. I learned about writing tests, testing a single file as well as a whole application, debugging, etc.

  • Lunr.js - I researched this and found a good amount of documentation. I am looking forward to applying it in a project of mine soon so I can gain experience working with it.

    • Solr (written in Java) is also good but targets larger datasets, as does Elasticsearch. I explored these too, but did not spend much time on them.
  • classless CSS (classless.de) - Learning this was easy, as it is a CSS framework like others, but it offers a minimalistic and simple design for web pages.

  • SyBLaRS and SBGN - I am from an engineering background, so I found it difficult (or overwhelming) to learn about systems biology at first. I am also trying to connect with a professor from my college who can help me get acquainted with it.

Next plans

  • Learn more about the biology involved in the project.
  • Improve my backend development knowledge, as I will have to handle data from Pathway Commons.
  • Since Google has released the GSoC timeline, I plan to prepare a timeline-based goal list so that everything goes smoothly.
  • Give you detailed weekly reports, in addition to small updates.

Some questions I have

  • This project was opened last year but did not make it to the GSoC projects list, so I would like to know its present status. Do we have to start from scratch, or has some work already been done?
  • To what extent should I learn the systems biology part, and are there areas where I can easily get started?
  • Should I start working on the project before the official GSoC coding period begins, or should I work on some other issue until then?

Thank you very much for your time and attention. I look forward to hearing from you soon and am very excited to work on this project and on further projects. I am not targeting only GSoC; I also want to keep working on projects like this, as I never thought biology and programming could be merged. This looks very exciting. Thank you.

@AswalGaurang

Hi @cannin @khanspers, I'm a dual degree BE Computer Science and MSc Biological Sciences student from BITS Pilani, India. I'm excited about the projects mentioned above and have started learning the tools recommended in the comments. Looking forward to staying connected and actively participating in GSoC.

@Murdock9803

@maxkfranz @cannin @khanspers
This is just a follow-up to my previous message. Please share your valuable insight. I am really looking forward to completing this project in GSoC this year.


@jvwong

jvwong commented Jan 29, 2024

I (@jvwong) have updated this project description to:

  • Reflect the progress the PC team has made in the interim
  • Provide more specific goals and tasks

@jvwong jvwong changed the title Static Site Generation for Pathway Commons Making biological network knowledge discoverable and accessible on search engines Jan 29, 2024
@Murdock9803

Thanks @jvwong, I was confused regarding the progress made. This surely helps clear up some doubts. I will read the project description thoroughly again and learn the required technologies.

@Murdock9803

Murdock9803 commented Feb 13, 2024

@jvwong @maxkfranz @cannin Update regarding the project :

  • I am presently studying the codebase, and have explored the PC search page.
  • Trying to run the project locally on my machine.
  • Working more on my tech stack to align with the project needs.
  • Will surely start working on issues in the PC repository to get experience.

@khanspers
Contributor

khanspers commented Feb 22, 2024

NRNB has been accepted as a mentoring organization for GSoC 2024. The contributor application period is March 18 – April 2. Here are some useful links:

GSoC contributor guide
NRNB project proposal template
Eligibility requirements
Full program timeline

@Murdock9803

@cannin I had a small question regarding the planning of the project.
Apart from the main goal, are the stretch goals supposed to be completed within the 12 weeks, or will the period be extended?

@Murdock9803

@jvwong @cannin please have a look.
Also, do we have to make both HTML and XML sitemaps?

@jvwong

jvwong commented Mar 6, 2024

@cannin I had a small question regarding the planning of the project. Apart from the main goal, are the stretch goals supposed to be completed within the 12 weeks, or will the period be extended?

Prototypes for this already exist, so it should be achievable within the GSoC period.

@jvwong @cannin please have a look.
Also, do we have to make both HTML and XML sitemaps?

XML sitemap is a requirement.
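
One practical consequence of the XML requirement: the sitemap protocol caps each sitemap file at 50,000 URLs, and the issue mentions more than 2.4 million interactions, so the pages would likely need to be split across several files tied together by a sitemap index. A minimal Python sketch follows; the `sitemap-N.xml` file naming and the base URL are hypothetical.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50000  # limit set by the sitemap protocol


def build_sitemap_index(sitemap_urls):
    """Return a <sitemapindex> document pointing at each child sitemap file."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for url in sitemap_urls:
        entry = ET.SubElement(index, "sitemap")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(index, encoding="unicode")


def plan_sitemaps(page_urls, base_url, size=MAX_URLS_PER_FILE):
    """Split page URLs into chunks and build the index that lists them."""
    chunks = [page_urls[i:i + size] for i in range(0, len(page_urls), size)]
    # Hypothetical naming scheme for the child sitemap files.
    names = [f"{base_url}/sitemap-{n}.xml" for n in range(1, len(chunks) + 1)]
    return chunks, build_sitemap_index(names)


# Small demonstration with an artificially low chunk size.
demo_pages = [f"https://example.org/page/{n}" for n in range(5)]
demo_chunks, demo_index = plan_sitemaps(demo_pages, "https://example.org", size=2)
print(len(demo_chunks))
print(demo_index)
```

Each chunk would then be written out as its own sitemap file, and only the index needs to be submitted to the search engine.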

@semsoum-712

@cannin Given that the pathway IDs are already available, using the file PathwayCommons12.All.hgnc.gmt.gz to generate an XML sitemap with a widely used, user-friendly language like Python sounds like a solid plan. It would streamline project development and save valuable time.
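
A rough sketch of what this could look like in Python is shown below. It relies on the general GMT layout (tab-separated columns: set name, description, then genes) and assumes that the first column of the Pathway Commons file holds the pathway identifier; that assumption, and the page URL pattern, should be checked against the real file and the real PC Search routes.

```python
import gzip
from urllib.parse import quote

# Hypothetical URL pattern for a pathway page; not the confirmed PC Search route.
PAGE_URL = "https://example.org/pathways?uri={uri}"


def pathway_ids_from_gmt(lines):
    """Yield the first tab-separated column of each non-empty GMT line."""
    for line in lines:
        line = line.rstrip("\n")
        if line:
            yield line.split("\t", 1)[0]


def page_urls(lines):
    """Build one percent-encoded page URL per pathway identifier."""
    return [PAGE_URL.format(uri=quote(pid, safe="")) for pid in pathway_ids_from_gmt(lines)]


def page_urls_from_file(path):
    """Read a gzipped GMT file and return the page URLs for its pathways."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return page_urls(handle)


# Made-up GMT line for illustration; the gene symbols are placeholders.
sample = ["http://identifiers.org/kegg.pathway/hsa01100\tname: example\tGENE1\tGENE2\n", "\n"]
sample_urls = page_urls(sample)
print(sample_urls)
```

The resulting list can be fed to whichever sitemap writer the project settles on.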

@cannin
Author

cannin commented Mar 26, 2024

@semsoum-712 yes; application deadline is April 2 (https://developers.google.com/open-source/gsoc/timeline)

@semsoum-712

semsoum-712 commented Apr 1, 2024

@cannin
Hi Luna,

I hope this comment finds you well. I've just completed a draft of my proposal and would greatly appreciate your expertise in reviewing it. Specifically, I'm seeking your insights to identify any potential mistakes and gather informative suggestions for improvement.

I'm eager to incorporate your feedback to refine my proposal further.

Thank you in advance for your time and support.

Best regards,
Asma

GSOC proposal 2024.pdf

@darsh609

darsh609 commented Sep 7, 2024

Hello, I am Darsh. I am also interested in this project.
