Speeding up the Neo4j import #5

Open
veleritas opened this issue Feb 1, 2017 · 11 comments

Comments

@veleritas
Contributor

Hi Daniel,

What was the reasoning behind importing the nodes and edges of the hetnet using the py2neo interface? I'm finding that the import process is quite slow even for small networks, and was wondering whether I should look into the batch CSV import that Neo4j comes with.

From my experiments it seems like importing 20000 nodes and 22000 edges into neo4j with the current code takes roughly 45 minutes on an AWS instance with 8 cores and 32 GB RAM. At this speed it would basically take forever to load the entire network, so I'm wondering if I'm missing anything here.

Best,
Toby

@dhimmel
Member

dhimmel commented Feb 2, 2017

> At this speed it would basically take forever to load the entire network, so I'm wondering if I'm missing anything here.

Yeah it took ~10 hours.

> I'm finding that the import process is quite slow even for small sized networks, and was wondering whether I should look into the batch CSV import that neo4j comes with.

I personally didn't spend too much time optimizing because I didn't plan on running this Neo4j import step too often. If you're running it a lot, you may want to look into solutions.

The problem with the batch TSV import is that TSVs are bad at representing properties that only exist for a single node or relationship type. However, if you don't care about properties (besides name, which every node has), you could use the TSV import. Or perhaps you can make a TSV where missing values don't get written as properties. Or make several TSVs (one for each node and relationship type). Looking at the import tool doc, this could be the way to go.

Ah, now I remember another reason I didn't use the import tool. I don't think I was able to fully automate the import, i.e. there were network-specific commands that had to be written, which would decrease the versatility of the code. I wanted hetio to work for any hetnet, not just Hetionet. Not sure if it's now possible to use the import tool for any hetio network.
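
The "one file per node type" idea can be sketched in a few lines of Python. This is only an illustration, not hetio code: the `(node_id, node_type, properties)` tuple format, the `write_node_files` helper, and the file naming are all assumptions made for the example. The `:ID` and `:LABEL` header fields follow the Neo4j import tool's CSV header convention. Because each file only has columns for the properties its node type actually uses, most missing-value cells disappear; how the import tool treats any remaining empty fields should be checked against its documentation.

```python
import csv
import os
import tempfile
from collections import defaultdict


def write_node_files(nodes, directory):
    """Write one CSV per node type and return {node_type: path}.

    `nodes` is an iterable of (node_id, node_type, properties) tuples,
    a hypothetical in-memory format, not hetio's own data model.
    """
    by_type = defaultdict(list)
    for node_id, node_type, props in nodes:
        by_type[node_type].append((node_id, props))

    paths = {}
    for node_type, rows in by_type.items():
        # Only properties actually used by this node type become columns.
        columns = sorted({key for _, props in rows for key in props})
        path = os.path.join(directory, f"nodes_{node_type}.csv")
        with open(path, "w", newline="") as handle:
            writer = csv.writer(handle)
            # ":ID" and ":LABEL" are header fields of the Neo4j import tool.
            writer.writerow([":ID", *columns, ":LABEL"])
            for node_id, props in rows:
                values = [props.get(column, "") for column in columns]
                writer.writerow([node_id, *values, node_type])
        paths[node_type] = path
    return paths


# Made-up nodes: the second type has no "formula" property at all.
example_nodes = [
    ("C1", "Compound", {"name": "compound one", "formula": "H2O"}),
    ("C2", "Compound", {"name": "compound two"}),
    ("D1", "Disease", {"name": "disease one"}),
]

with tempfile.TemporaryDirectory() as tmp:
    paths = write_node_files(example_nodes, tmp)
    headers = {}
    for node_type, path in paths.items():
        with open(path) as handle:
            headers[node_type] = handle.readline().strip()
    print(headers)
```

Relationship files would follow the same pattern with `:START_ID`, `:END_ID`, and `:TYPE` columns, one file per relationship type.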

@dhimmel
Member

dhimmel commented Feb 2, 2017

If you want to use the import tool, the Hetionet TSVs could be a good place to start and get benchmarks.

@dhimmel changed the title from "Neo4j Import" to "Speeding up the Neo4j import" Feb 2, 2017
@veleritas
Contributor Author

After some testing it turns out that it is actually much faster to import edges individually into neo4j when using py2neo version 3. As described in this link, it seems that py2neo version 3 uses subgraphs in order to make updates to neo4j.

Effect of batch size on import speed:

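| Imported object | Batch size | Import speed |
| --- | --- | --- |
| Node | 1 | ~100/s |
| Node | 10 | ~400/s |
| Node | 100 | ~500/s |
| Node | 200 | ~550/s |
| Node | 500 | ~530/s |
| Node | 1000 | ~500/s |
| Edge | 1 | ~120/s |
| Edge | 5 | ~90/s |
| Edge | 10 | ~100/s |
| Edge | 100 | ~7/s |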

Based on these results, it seems that updating Neo4j with a subgraph containing multiple edges is actually slower than updating the graph one edge at a time. I suspect that this is because the underlying py2neo code converts the subgraph back into individual edges anyway, and therefore spends time on redundant calculations. All speed estimates are approximate, and testing was done on an AWS m4.2xlarge instance with EBS.
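
For anyone wanting to repeat this kind of measurement, a minimal timing harness might look like the sketch below. It is not the code used for the numbers above. `commit_batch` is a hypothetical callable standing in for whatever writes one batch to the database (for example, creating a py2neo subgraph inside a transaction); here it is replaced by a stub so the sketch runs without a Neo4j server, which means the printed rates say nothing about real import speed.

```python
import time


def chunked(items, size):
    """Yield consecutive lists of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def measure_throughput(items, batch_size, commit_batch):
    """Return items per second when committing `items` in batches.

    `commit_batch` is a hypothetical callable that writes one batch
    to the database; swap in a real implementation to benchmark.
    """
    started = time.perf_counter()
    for batch in chunked(items, batch_size):
        commit_batch(batch)
    elapsed = time.perf_counter() - started
    return len(items) / elapsed if elapsed > 0 else float("inf")


# Stub used only to keep the sketch runnable without a database.
committed = []


def fake_commit(batch):
    committed.append(len(batch))


edges = list(range(25))
for size in (1, 5, 10, 100):
    committed.clear()
    rate = measure_throughput(edges, size, fake_commit)
    print(f"batch size {size:>3}: {len(committed)} commits, {rate:,.0f} items/s")
```

Passing the batch writer in as a callable keeps the timing logic separate from the database client, so the same harness can compare nodes against edges or one client version against another.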

@dhimmel
Member

dhimmel commented Feb 15, 2017

@veleritas your benchmarks are awesome. Let's address this after #6 is merged. The easy fix would be changing the default value for edge_queue to 10 or another small value.

But happy to consider a more dramatic code refactoring if you think it's worth it.

dhimmel pushed a commit that referenced this issue Mar 2, 2017
Intended to speed up the hetnet importation into Neo4j.
See #5
@dhimmel
Member

dhimmel commented Mar 2, 2017

@veleritas I changed the defaults in d026d13. I made you the commit author, since you did all of the hard work!

@veleritas
Contributor Author

I've been experimenting with the batch CSV import provided by Neo4j (version 3), and so far it seems that the batch import can be made to work with Rephetio v2.0. Initial testing shows that a half-scale Rephetio with 1.2 million edges and ~20k nodes can be imported into Neo4j in roughly 10 seconds.
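
To put the two approaches side by side, here is a rough back-of-the-envelope comparison using only the approximate figures quoted in this thread. The networks differ, and the py2neo figure predates the change to the queue defaults, so treat the ratio as an order-of-magnitude estimate rather than a benchmark.

```python
def objects_per_second(nodes, edges, seconds):
    """Combined node and edge throughput for one import run."""
    return (nodes + edges) / seconds


# Original py2neo import: 20,000 nodes and 22,000 edges in about 45 minutes.
py2neo_rate = objects_per_second(20_000, 22_000, 45 * 60)

# Batch CSV import: about 20,000 nodes and 1.2 million edges in about 10 seconds.
csv_rate = objects_per_second(20_000, 1_200_000, 10)

speedup = csv_rate / py2neo_rate
print(f"py2neo: ~{py2neo_rate:.1f} objects/s")
print(f"CSV import: ~{csv_rate:,.0f} objects/s")
print(f"ratio: ~{speedup:,.0f}x")
```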

@dhimmel
Member

dhimmel commented Mar 9, 2017

> batch import can be made to work with Rephetio v2.0

@veleritas awesome, I'd be really interested in getting this feature implemented in hetio. Happy to review a pull request or help out in any way you see fit.

@veleritas
Contributor Author

At the moment the batch CSV import is implemented as a tack-on script to the integrate repository. It basically sidesteps the hetio export_neo4j() function completely.

Process:

  1. This script creates the CSV files needed after the integrate.ipynb script finishes.
  2. This script then downloads neo4j and makes the necessary configuration modifications to allow access from python with py2neo.
  3. Finally, a bash script loads everything into neo4j.
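
As a rough illustration of the last step (not the actual script referenced above), the load command could be assembled like this. The sketch assumes the `neo4j-import` command-line tool shipped with Neo4j 3.x and its `--into`, `--nodes`, and `--relationships` options; the flag names should be checked against the Neo4j version in use. The `build_import_command` helper, the database directory, and the file names are hypothetical.

```python
import shlex


def build_import_command(database_dir, node_files, relationship_files):
    """Assemble an argument list for the Neo4j 3.x `neo4j-import` tool.

    Assumes one CSV file per node type and per relationship type.
    The command is only built here, not executed.
    """
    command = ["neo4j-import", "--into", database_dir]
    for path in node_files:
        command += ["--nodes", path]
    for path in relationship_files:
        command += ["--relationships", path]
    return command


# Hypothetical paths, for illustration only.
command = build_import_command(
    "data/databases/graph.db",
    ["nodes_Compound.csv", "nodes_Disease.csv"],
    ["edges_treats.csv"],
)
print(" ".join(shlex.quote(part) for part in command))
```

Building the command as a list means it can be handed to `subprocess.run(command, check=True)` directly, and because the file lists are parameters, nothing in it is specific to one network.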

At the moment things seem to work just fine, and neo4j has had no complaints so far, but I haven't tested full compatibility with the entire network yet. I'm going to need some more time to figure out if the pipeline will work with the entire network before I'm ready to push anything back into hetio. This method also discards a lot of the metadata you put into the network, so I'm not sure if that's a concern for you.

@dhimmel
Member

dhimmel commented Mar 10, 2017

> This method also discards a lot of the metadata you put into the network, so I'm not sure if that's a concern for you.

The main reason I avoided the CSV import is that I didn't see a way to losslessly export a graph (in its entirety) like hetio.export_neo4j(). Code in hetio should be general (work for more than just Hetionet v1.0). However, it's fine to have a lossy export that documents its limitations.

Let's revisit this at a later time when we know more. I would say, if you find yourself constantly copying and pasting the CSV import code, then it would make sense to move it upstream.

@veleritas
Contributor Author

Hi Daniel,

Just wanted to ask if we should be revisiting the Neo4j integration code issue. I've since switched over to using the built-in Neo4j CSV loader since it is so much faster, and haven't had any issues with the loss of license metadata. It's been working without problems on the full network so far. Latest code is here. The CSV import method has also been easy to adapt to the matrix DWPC calculation method by @mmayers12.

Again we've been discussing on our end that any changes we make to the project should be integrated back upstream if it makes sense, so let us know if you're interested in these changes, or if we need to tweak it slightly further before you're willing to pull upstream.

Toby

@dhimmel
Member

dhimmel commented Aug 3, 2017

Hey @veleritas, I see two options for incorporating CSV Neo4j import functionality into hetio.

  1. Create a function like export_neo4j_via_csv, presumably in hetio.neo4j. This would not replace export_neo4j but instead add another route to Neo4j import for users who don't require edge properties to be maintained and would like the speed increase.

  2. Add a note in the export_neo4j or in a README that references your code for CSV import. The note would direct users looking for a speedup to check out CSV import.

If you're willing to do the work to submit a PR for option 1, then this is preferable. However, we need to make sure implementations in hetio are modular... so there may be some additional work needed to convert the code from your notebooks. Anyways, I'm happy to help with review and some implementation if needed. This would be a valuable feature, and it would be nice for you to not have to maintain an independent CSV import patch.
