[Request] regenerate the GitHub repo network data #34

tyn1998 · 2023-02-20T06:16:39Z

As mentioned in #32, we want to bring this repo back to life, so we should also regenerate the whole GitHub repo network data for that purpose. The current dataset is too old, I think.

Hi @frank-zsy, can you help with this?

frank-zsy · 2023-02-20T10:09:04Z

It is somewhat difficult. For the new OpenRank we no longer have a network with repo data only; the network we use now contains repos and users at the same time, so I need to dig out the export script and modify it to fit the current data model.

I also need to understand the script, because the data is only part of the problem: I also need to generate the position of each node with a 3D force-layout algorithm.

tyn1998 · 2023-02-20T10:56:23Z

Understood!

Once the data export script is ready, will it run as an OpenDigger cron task?

frank-zsy · 2023-02-20T11:55:18Z

Hopefully. But the category information produced by the Louvain algorithm is not in any database currently in use, and it is not actually accurate, so label information is another difficult part.

frank-zsy · 2023-02-21T01:57:13Z

I found the script, and I think things have changed a lot in the past 2 years. We need to reconsider how to do this rather than simply refresh the data.

The changes include:

| Type | Before | After |
| --- | --- | --- |
| Network model | Repo-Repo relation network | Repo-User activity network |
| Time handling | Relationships are simply calculated by year | Always calculated by month |
| Data filter | Relationships with weight smaller than 2 are filtered out | TBD |
| Category data | A top-level community label from the Louvain algorithm | No label data; can be derived from the 2020 data |

So right now, if we want a new dataset with a full year of data for 2022, the export will be quite time-consuming: we don't have yearly data directly, so we first need to calculate it for 40 million nodes and 75 million relationships.

And if we want to add data on how the network evolves over time, we should export the network month by month.

WDYT?
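The yearly roll-up described above can be sketched as follows. This is a minimal illustration, not the actual export script: the edge shape `{ source, target, month, weight }`, the function name `aggregateYearly`, and the `minWeight` parameter are all assumptions made for the example, and the real filter for the new model is still TBD.

```javascript
// Hypothetical edge record: { source, target, month: 'YYYY-MM', weight }.
// Sums the monthly weights of each (source, target) pair for one year and
// drops pairs whose yearly total is below minWeight.
function aggregateYearly(monthlyEdges, year, minWeight) {
  const SEP = '\u0000'; // separator assumed not to occur in node IDs
  const totals = new Map();
  for (const e of monthlyEdges) {
    if (!e.month.startsWith(year + '-')) continue; // ignore other years
    const key = e.source + SEP + e.target;
    totals.set(key, (totals.get(key) || 0) + e.weight);
  }
  const result = [];
  for (const [key, weight] of totals) {
    if (weight < minWeight) continue;
    const [source, target] = key.split(SEP);
    result.push({ source, target, weight });
  }
  return result;
}

// Made-up sample data, for illustration only.
const sampleEdges = [
  { source: 'alice', target: 'org/repo-a', month: '2022-01', weight: 1 },
  { source: 'alice', target: 'org/repo-a', month: '2022-02', weight: 1.5 },
  { source: 'bob', target: 'org/repo-a', month: '2022-03', weight: 0.5 },
  { source: 'alice', target: 'org/repo-a', month: '2021-12', weight: 9 },
];
const yearlyEdges = aggregateYearly(sampleEdges, '2022', 2);
console.log(yearlyEdges);
```

At the scale mentioned above (tens of millions of relationships) this aggregation would more likely run inside the database than in memory; the sketch only shows the arithmetic.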

frank-zsy · 2023-02-21T02:06:45Z

And if we want to demonstrate the network's dynamic evolution, we need to handle a continuous 3D force layout, which is also quite a difficult problem.

tyn1998 · 2023-02-21T02:55:12Z

Thanks for your explanation, @frank-zsy.

I'm wondering: if we set a high threshold so that only a small subset of repos and users is exported, will the cost of computing the repo-repo relation network (by month) be affordable?
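A minimal sketch of that kind of thresholding, assuming the threshold is applied to each node's OpenRank value (the thread does not say which quantity would be thresholded). The node shape `{ id, or }`, the edge shape `{ source, target, weight }`, and the function name `filterByOpenRank` are hypothetical.

```javascript
// Keeps nodes whose OpenRank (`or`) is at least `threshold`, then keeps
// only the edges whose two endpoints both survived.
function filterByOpenRank(nodes, edges, threshold) {
  const keptNodes = nodes.filter((n) => n.or >= threshold);
  const keptIds = new Set(keptNodes.map((n) => n.id));
  const keptEdges = edges.filter(
    (e) => keptIds.has(e.source) && keptIds.has(e.target)
  );
  return { nodes: keptNodes, edges: keptEdges };
}

// Made-up sample data, for illustration only.
const allNodes = [
  { id: 'org/repo-a', or: 120 },
  { id: 'alice', or: 35 },
  { id: 'bob', or: 2 },
];
const allEdges = [
  { source: 'alice', target: 'org/repo-a', weight: 3 },
  { source: 'bob', target: 'org/repo-a', weight: 1 },
];
const filtered = filterByOpenRank(allNodes, allEdges, 10);
console.log(filtered.nodes.length, filtered.edges.length);
```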

> And if we want to demonstrate the network's dynamic evolution, we need to handle a continuous 3D force layout, which is also quite a difficult problem.

This feature may be a second priority.

frank-zsy · 2023-02-21T02:56:49Z

So you still want a repo-repo relationship network? How about a repo-user activity network? We don't really have a lot of data to show for users, that is true.

tyn1998 · 2023-02-21T03:06:26Z

> How about a repo-user activity network?

We can have a try. Is there a way to identify which node is user and which node is repo then?

tyn1998 · 2023-02-21T03:09:01Z

Do you think OpenGalaxy is a possible way to present Repo OpenRank details?

frank-zsy · 2023-02-21T03:42:24Z

> Do you think OpenGalaxy is a possible way to present Repo OpenRank details?

Of course it can, but it is a different model and needs some refactoring to fit the data.

frank-zsy · 2023-02-21T06:28:20Z

> > How about a repo-user activity network?
>
> We can have a try. Is there a way to identify which node is user and which node is repo then?

Of course. The tech domain is already a field in the data, and the type can be a field as well.

frank-zsy · 2023-02-21T07:13:01Z

Can you use repo OpenRank details data to generate an evolving network in OpenGalaxy? I think it is also hard here.

tyn1998 · 2023-02-21T08:00:22Z

I can with an ECharts force-layout graph, based on the demo you wrote.

However, as you pointed out, because a force-layout algorithm underpins the graph layout, an animation of the evolution has little meaning (see this codepen):

2023-02-21.15.52.49.mov

Similarly, for OpenDigger with a force-layout algorithm underpinning it, the animation (if we could implement it) would be meaningless too. The idea of using OpenGalaxy to present OpenRank details for repos came to my mind because I think node enhancement can be applied to the graph, so that when we hover over or click on a node we can review more details. I wasn't actually thinking about the evolution animation.

frank-zsy · 2023-02-21T09:37:59Z

The set data implementation above does not provide a smooth transition. Still, I think data export and visualization are independent: I can export the data first and try to produce a continuous 3D layout. How to use the data to generate a smooth animation of the evolving network could be a future task.
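One simple way to get a smoother transition than swapping the whole dataset at once is to interpolate node positions between two exported layouts. The sketch below is only an assumption about how such an animation could start, not something implemented in this thread: the `Map` of node id to `[x, y, z]` and the function name `interpolatePositions` are hypothetical, and plain linear interpolation says nothing about whether the two layouts themselves are consistent with each other.

```javascript
// from, to: Map of node id -> [x, y, z]; t goes from 0 (old layout) to 1 (new layout).
// Nodes that exist only in `to` appear directly at their final position;
// nodes that exist only in `from` are dropped.
function interpolatePositions(from, to, t) {
  const out = new Map();
  for (const [id, target] of to) {
    const start = from.get(id) || target;
    out.set(id, start.map((s, k) => s + (target[k] - s) * t));
  }
  return out;
}

// Made-up sample layouts, for illustration only.
const layoutA = new Map([['org/repo-a', [0, 0, 0]]]);
const layoutB = new Map([
  ['org/repo-a', [10, 20, -10]],
  ['alice', [1, 1, 1]],
]);
const halfway = interpolatePositions(layoutA, layoutB, 0.5);
console.log(halfway.get('org/repo-a')); // [5, 10, -5]
```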

frank-zsy · 2023-02-21T10:42:57Z

I will try to export a static repo-user activity network for 202301 and upload it to oss.x-lab.info/open_galaxy/v2/. Nodes will have no `c` field; instead, a `t` field indicates the type of the node: 'r' is a repo and 'u' is a user.

You can check out the data to see if it is good enough to present. The node count will be 94,789 and the edge count will be 133,960.

frank-zsy · 2023-02-21T13:49:04Z

I have uploaded the data, but I cannot tell whether it renders correctly. The size (PageRank) field has been renamed from `pg` to `or`; you can modify the code and check whether the data can be used.

tyn1998 · 2023-02-21T13:52:25Z

Thank you! I will try it later.

tyn1998 · 2023-02-21T14:00:59Z

😆

frank-zsy · 2023-02-21T14:18:19Z

I think the `or` value is right in the data but is not imported correctly by galaxy. We should also adjust the node size to fit the current dataset.

tyn1998 · 2023-02-21T14:36:15Z

It seems that the size of a node depends on its in-degree, according to the codebase:

open-galaxy/src/galaxy/native/renderer.js

Lines 202 to 215 in 2e0443a

    function updateSizes(outLinks, inLinks) {
      var maxInDegree = getMaxSize(inLinks);
      var view = renderer.getParticleView();
      var sizes = view.sizes();
      for (var i = 0; i < sizes.length; ++i) {
        var degree = inLinks[i];
        if (degree) {
          sizes[i] = ((200 / maxInDegree) * degree.length + 15);
        } else {
          sizes[i] = 30;
        }
      }
      view.sizes(sizes);
    }
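To see what this formula produces, here is the same arithmetic as a standalone function without the renderer. `getMaxSize` is not shown in the quoted snippet, so the maximum is computed here as the length of the longest in-link list, which is an assumption; the function name `sizesFromInDegree` and the sample data are hypothetical.

```javascript
// inLinks[i] is the array of incoming links of node i, or undefined if there are none.
function sizesFromInDegree(inLinks, nodeCount) {
  // Assumed stand-in for getMaxSize: the largest in-link count.
  let maxInDegree = 0;
  for (const links of inLinks) {
    if (links && links.length > maxInDegree) maxInDegree = links.length;
  }
  const sizes = new Array(nodeCount);
  for (let i = 0; i < nodeCount; ++i) {
    const links = inLinks[i];
    // Same branches as updateSizes: scaled size with links, fixed 30 without.
    sizes[i] = links ? (200 / maxInDegree) * links.length + 15 : 30;
  }
  return sizes;
}

// Node 0 has four incoming links, node 1 has one, node 2 has none.
const degreeSizes = sizesFromInDegree([[1, 2, 3, 4], [0], undefined], 3);
console.log(degreeSizes); // [215, 65, 30]
```

Note that a node without any in-links gets a fixed size of 30, which can be larger than a node with a few in-links when the maximum in-degree is large.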

In my view, the biggest problem with the new graph is the spacing between nodes.

frank-zsy · 2023-02-22T02:16:26Z

Not really. From the code you can see that the graph loading order is meta -> position -> link -> label (node).

open-galaxy/src/galaxy/service/graphLoader.js

Lines 47 to 51 in 2e0443a

    return loadManifest()
      .then(loadPositions)
      .then(loadLinks)
      .then(loadLabels)
      .then(convertToGraph);
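The order of this chain matters because a later stage can overwrite what an earlier one set. The sketch below uses stub stages (not the real loaders) that only record their name; two of them also mark themselves as the last one to "set" the node size. The helper `stage` and the `state` object are hypothetical.

```javascript
// Stub stages: each records its name in `order`. Stages created with
// setsSize = true stand in for code that writes node sizes.
const order = [];
const state = { sizeSetBy: null };

function stage(name, setsSize) {
  return function () {
    order.push(name);
    if (setsSize) state.sizeSetBy = name;
    return Promise.resolve();
  };
}

const finished = stage('manifest', false)()
  .then(stage('positions', false))
  .then(stage('links', true))  // default sizing happens with the links
  .then(stage('labels', true)) // label stage runs later and overwrites it
  .then(function () {
    console.log(order.join(' -> '), '| size set by:', state.sizeSetBy);
    return state;
  });
```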

In the default implementation, node size is determined by degree, which means the code you referred to is used to calculate the size. But since we need the size to follow the OpenRank value, I added the code below to recalculate node size and color after the label data is loaded.

open-galaxy/src/galaxy/native/renderer.js

Lines 152 to 174 in 2e0443a

    // set color
    var view = renderer.getParticleView();
    var colors = view.colors();
    nodeCommunity = [];
    for (var i = 0; i < labels.length; i++) {
      if (!communityColorMap.has(labels[i].c)) {
        var c = getColor(labels[i].c);
        communityColorMap.set(labels[i].c, c);
      }
      colorNode(i * 3, colors, communityColorMap.get(labels[i].c));
      nodeCommunity.push(labels[i].c);
    }
    view.colors(colors);
    // set size
    var sizes = view.sizes();
    var max = parseFloat(labels[0].pg);
    for (var i = 1; i < labels.length; i++) {
      if (max < parseFloat(labels[i].pg)) max = parseFloat(labels[i].pg);
    }
    for (var i = 0; i < sizes.length; ++i) {
      sizes[i] = (180 * parseFloat(labels[i].pg) / max) + 8;
    }
    view.sizes(sizes);

Since the `setLabels` function is called after `setLinks`, the code I added overwrites the default node size and also adds color to the nodes. You can change `pg` to `or` to read the new OpenRank value, and use `.t` rather than `.c` to get the type and color the different node types.
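A minimal sketch of that adaptation, written as standalone functions so it can be tried without the renderer. It keeps the `180 * value / max + 8` scaling from the quoted code but reads `or` and `t`. The function names, the color values, the fallback color, and the guards for empty or non-numeric input are assumptions added for the example, not part of the repository code.

```javascript
// Hypothetical palette keyed by node type: 'r' = repo, 'u' = user.
const TYPE_COLORS = new Map([['r', 0x4f8cff], ['u', 0xffd24f]]);
const FALLBACK_COLOR = 0xffffff;

// The OpenRank value may arrive as a string; treat anything non-numeric as 0.
function toNumber(value) {
  const n = parseFloat(value);
  return Number.isFinite(n) ? n : 0;
}

// Same scaling as the quoted code, reading `or` instead of `pg`.
function sizesFromOpenRank(labels) {
  let max = 0;
  for (const label of labels) max = Math.max(max, toNumber(label.or));
  return labels.map(
    (label) => (max > 0 ? (180 * toNumber(label.or)) / max : 0) + 8
  );
}

// One color per node, chosen by the `t` field instead of the community `c`.
function colorsFromType(labels) {
  return labels.map((label) =>
    TYPE_COLORS.has(label.t) ? TYPE_COLORS.get(label.t) : FALLBACK_COLOR
  );
}

// Made-up sample labels, for illustration only.
const sampleLabels = [
  { or: '10', t: 'r' },
  { or: '5', t: 'u' },
  { or: '0', t: 'u' },
];
const openRankSizes = sizesFromOpenRank(sampleLabels);
const typeColors = colorsFromType(sampleLabels);
console.log(openRankSizes); // [188, 98, 8]
```

In the renderer these values would still have to be written back through `view.sizes(...)` and `colorNode(...)` as in the quoted code.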

tyn1998 · 2023-02-22T02:37:50Z

Thank you for pointing that out, @frank-zsy! I just forgot that the original code had also been adapted for our needs. Now the graph looks much better (blue is repo, yellow is user/bot):

OG-newdata.mov

Overview:

I'm going to create a PR and /build-test it so you can have a try online :)

frank-zsy · 2023-02-22T02:48:20Z

Great, you can also open an issue in OpenDigger and I will add a cron task to OpenDigger to generate data for OpenGalaxy.

tyn1998 · 2023-03-02T13:16:04Z

Anyhow, we can now rely on the 2023-01 data :)

closed via X-lab2017/open-digger#1208

tyn1998 mentioned this issue Feb 20, 2023

[Refactor] bring this repo back to life #32

Closed

4 tasks

tyn1998 mentioned this issue Feb 22, 2023

refactor: adapt for the new genertaed data(v2) #37

Merged

tyn1998 mentioned this issue Feb 22, 2023

[Cron] could you generate data for OpenGalaxy? X-lab2017/open-digger#1208

Open

tyn1998 closed this as completed Mar 2, 2023
