[Request] regenerate the GitHub repo network data #34

Closed
Tracked by #32
tyn1998 opened this issue Feb 20, 2023 · 24 comments

Comments

@tyn1998
Member

tyn1998 commented Feb 20, 2023

As mentioned in #32, we want to bring this repo back to life, so we should also regenerate the whole GitHub repo network for that purpose. The current one is too old, I think.

Hi @frank-zsy, can you help on this?

@frank-zsy
Contributor

This is somewhat difficult. For the new OpenRank we do not have a network with repo data only; the network we use contains repos and users at the same time. So I need to dig out the export script and modify it to fit the current data model.

I also need to understand the script, because the data is only part of the problem: I also need to generate the position of each node with a 3D force-layout algorithm.

@tyn1998
Member Author

tyn1998 commented Feb 20, 2023

Understood!

Once the data export script is ready, will it run as an OpenDigger cron task?

@frank-zsy
Contributor

Hopefully. However, the category information produced by the Louvain algorithm is not in any database currently in use, and it is not actually accurate, so the label information is another difficult part.

@frank-zsy
Contributor

I found the script, and I think things have changed a lot in the past 2 years. We need to reconsider how to do this rather than simply refresh the data.

The changes include:

| Type | Before | After |
| --- | --- | --- |
| Network model | Repo-Repo relation network | Repo-User activity network |
| Time handle | Relationship is simply calculated by year | Always calculate by month |
| Data filter | Filter out the relationship weight smaller than 2 | TBD |
| Category data | A top level community label by Louvain algorithm | No label data, can be derived from 2020 |

So right now, if we want a new dataset with a full year of data for 2022, the export will be quite time-consuming. We don't have yearly data directly, so we would first need to calculate yearly data for 40 million nodes and 75 million relationships.

And if we want to add data on how the network evolves over time, we should export the network month by month.

WDYT.
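
A minimal sketch of the yearly aggregation step mentioned above. It assumes each monthly export is an array of edges shaped like `{ source, target, weight }` and that a yearly weight is simply the sum of the monthly weights; both are assumptions for illustration, not the actual OpenDigger data model. The `minWeight` parameter mirrors the old "smaller than 2" filter, since the new threshold is still TBD.

```javascript
// Assumed (hypothetical) edge shape: { source, target, weight }.
// monthlyEdgeLists holds one edge array per month of the year.
function aggregateYearly(monthlyEdgeLists, minWeight) {
  const totals = new Map();
  for (const edges of monthlyEdgeLists) {
    for (const { source, target, weight } of edges) {
      // JSON-encode the pair so any characters in the names are safe in the key.
      const key = JSON.stringify([source, target]);
      totals.set(key, (totals.get(key) || 0) + weight);
    }
  }
  const yearly = [];
  for (const [key, weight] of totals) {
    // Keep only relationships at or above the cut-off.
    if (weight >= minWeight) {
      const [source, target] = JSON.parse(key);
      yearly.push({ source, target, weight });
    }
  }
  return yearly;
}
```

At 40 million nodes and 75 million relationships this would have to run inside the database or as a streaming job rather than in memory; the sketch only shows the shape of the computation.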

@frank-zsy
Contributor

And if we want to demonstrate the network's dynamic evolution, we need to handle a continuous 3D force layout, which is also quite a difficult problem.

@tyn1998
Member Author

tyn1998 commented Feb 21, 2023

Thanks for your explanation, @frank-zsy.

I'm wondering: if we set a high threshold so that only a small portion of the repos and users is exported, will the cost of computing the repo-repo relation network (by month) be affordable?

> And if we want to demonstrate the network's dynamic evolution, we need to handle a continuous 3D force layout, which is also quite a difficult problem.

This feature may be a second priority.
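
A minimal sketch of the thresholding idea, assuming nodes shaped like `{ id, openrank }` and edges shaped like `{ source, target }`. These field names are hypothetical and chosen only for illustration. Nodes below the threshold are dropped, and an edge is kept only if both of its endpoints survive.

```javascript
// Hypothetical shapes: node = { id, openrank }, edge = { source, target }.
function filterByThreshold(nodes, edges, threshold) {
  const keptNodes = nodes.filter((node) => node.openrank >= threshold);
  const keptIds = new Set(keptNodes.map((node) => node.id));
  // An edge survives only when both endpoints passed the threshold.
  const keptEdges = edges.filter(
    (edge) => keptIds.has(edge.source) && keptIds.has(edge.target)
  );
  return { nodes: keptNodes, edges: keptEdges };
}
```

Filtering before any relation computation is what would reduce the cost; whether the result is small enough to be affordable depends on the threshold chosen.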

@frank-zsy
Contributor

So you still want a repo-repo relationship network? How about a repo-user activity network? We don't really have a lot of data to show for users, that is true.

@tyn1998
Member Author

tyn1998 commented Feb 21, 2023

> How about a repo-user activity network?

We can have a try. Is there a way to identify which node is user and which node is repo then?

image

@tyn1998
Member Author

tyn1998 commented Feb 21, 2023

Do you think OpenGalaxy is a possible way to present Repo OpenRank details?

@frank-zsy
Contributor

> Do you think OpenGalaxy is a possible way to present Repo OpenRank details?

Of course it can, but it is a different model and needs some refactoring to fit the data.

@frank-zsy
Contributor

> > How about a repo-user activity network?
>
> We can have a try. Is there a way to identify which node is user and which node is repo then?

Of course. The tech domain is already a field in the data, and the type can be a field too.

@frank-zsy
Contributor

Can you use repo OpenRank details data to generate an evolving network in OpenGalaxy? I think it is also hard here.

@tyn1998
Member Author

tyn1998 commented Feb 21, 2023

I can with an ECharts force-layout graph, based on the demo you wrote.

However, as you pointed out, because a force-layout algorithm underpins the graph layout, animating the evolution has little meaning (see this codepen):

2023-02-21.15.52.49.mov

Similarly, for OpenDigger with a force-layout algorithm underpinning it, the animation (if we could implement it) would be meaningless too. The idea of using OpenGalaxy to present OpenRank details for repos came to mind because I think node enhancement can be applied to the graph, so that when we hover over or click on a node we can review more details. I wasn't actually thinking about the evolution animation.

@frank-zsy
Contributor

The set data implementation above does not provide a smooth transition. That said, I think data export and visualization are independent: I can export the data first and try to produce a continuous 3D layout. How to use the data to generate a smooth animation of the evolving network could be a future task.
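
One possible way to get a smoother transition between two exported layouts is to interpolate node positions linearly between snapshots. The sketch below assumes each snapshot is a `Map` from node id to an `[x, y, z]` array, which is a hypothetical shape and not the OpenGalaxy position format. It only smooths the movement between two given layouts; it does not make successive force layouts consistent with each other, which is the harder problem described above.

```javascript
// Hypothetical snapshot shape: Map<nodeId, [x, y, z]>.
// t runs from 0 (the `from` layout) to 1 (the `to` layout).
function interpolatePositions(from, to, t) {
  const result = new Map();
  const ids = new Set([...from.keys(), ...to.keys()]);
  for (const id of ids) {
    // A node present in only one snapshot stays at its known position.
    const start = from.get(id) || to.get(id);
    const end = to.get(id) || from.get(id);
    result.set(id, start.map((value, axis) => value + (end[axis] - value) * t));
  }
  return result;
}
```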

@frank-zsy
Contributor

frank-zsy commented Feb 21, 2023

I will try to export a static repo-user activity network for 202301 and upload it to oss.x-lab.info/open_galaxy/v2/. There will be no c field in the node, but there will be a t field indicating the type of the node: 'r' is repo and 'u' is user.

You can check out the data to see if it is good enough to present. The node count will be 94,789 and the edge count will be 133,960.
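
A quick way to sanity-check such an export is to count nodes by type. This is a minimal sketch that assumes the node labels have already been loaded into an array of objects carrying the `t` field described above; how the file is fetched and parsed is not shown, and the function name is hypothetical.

```javascript
// Counts nodes per type. 'r' is repo and 'u' is user, as described above.
// Anything else is reported separately so unexpected values are visible.
function countByType(labels) {
  const counts = { total: labels.length, repos: 0, users: 0, unknown: 0 };
  for (const label of labels) {
    if (label.t === 'r') counts.repos += 1;
    else if (label.t === 'u') counts.users += 1;
    else counts.unknown += 1;
  }
  return counts;
}
```

For the 202301 export, `total` would be expected to match the stated node count of 94,789.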

@frank-zsy
Contributor

I have uploaded the data, but I cannot tell whether it is correct until it is rendered. The field that holds the size (PageRank) value has been renamed from pg to or; you can modify the code and check whether the data can be used.

@tyn1998
Member Author

tyn1998 commented Feb 21, 2023

Thank you! I will try it later.

@tyn1998
Member Author

tyn1998 commented Feb 21, 2023

image

😆

image

image

@frank-zsy
Contributor

I think the or value is correct in the data but is not imported correctly by Galaxy. We should also adjust the node size to fit the current dataset.

@tyn1998
Member Author

tyn1998 commented Feb 21, 2023

According to the codebase, it seems that the size of a node depends on its in-degree:

function updateSizes(outLinks, inLinks) {
  var maxInDegree = getMaxSize(inLinks);
  var view = renderer.getParticleView();
  var sizes = view.sizes();
  for (var i = 0; i < sizes.length; ++i) {
    var degree = inLinks[i];
    if (degree) {
      sizes[i] = ((200 / maxInDegree) * degree.length + 15);
    } else {
      sizes[i] = 30;
    }
  }
  view.sizes(sizes);
}

In my view, the biggest problem in the new graph is the spacing between nodes.

@frank-zsy
Contributor

frank-zsy commented Feb 22, 2023

Not really. From the code you can see that the graph loading order is meta -> position -> link -> label (node).

return loadManifest()
  .then(loadPositions)
  .then(loadLinks)
  .then(loadLabels)
  .then(convertToGraph);

In the default implementation, the size of each node is determined by its degree, which means the code you referred to is used to calculate the size. But since we need the size to reflect the OpenRank value instead, I added the code below to recalculate the node size and color after the label data is loaded.

// set color
var view = renderer.getParticleView();
var colors = view.colors();
nodeCommunity = [];
for (var i = 0; i < labels.length; i++) {
  if (!communityColorMap.has(labels[i].c)) {
    var c = getColor(labels[i].c);
    communityColorMap.set(labels[i].c, c);
  }
  colorNode(i * 3, colors, communityColorMap.get(labels[i].c));
  nodeCommunity.push(labels[i].c);
}
view.colors(colors);
// set size
var sizes = view.sizes();
var max = parseFloat(labels[0].pg);
for (var i = 1; i < labels.length; i++) {
  if (max < parseFloat(labels[i].pg)) max = parseFloat(labels[i].pg);
}
for (var i = 0; i < sizes.length; ++i) {
  sizes[i] = (180 * parseFloat(labels[i].pg) / max) + 8;
}
view.sizes(sizes);

Since the setLabels function is called after setLinks, the code I added overwrites the default node size and also adds color to the nodes. You can change pg to or to read the new OpenRank value, and use .t rather than .c to get the type and color the different types of nodes.
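
The suggested change can be sketched as two small helpers that are independent of the renderer. This is an illustration under assumptions, not the code that was merged: the helper names, the color values, and the fallback color are hypothetical, and the calls into the renderer (`view.sizes`, `colorNode`) stay as in the snippet above. The size formula is the same `180 * value / max + 8` scaling, reading `or` instead of `pg`.

```javascript
// Node sizes from the `or` (OpenRank) field, using the same scaling as above.
function computeSizes(labels) {
  const values = labels.map((label) => parseFloat(label.or));
  // reduce avoids spreading a very large array into Math.max.
  const max = values.reduce((m, v) => (v > m ? v : m), -Infinity);
  // Guard against empty or non-positive data: fall back to the minimum size.
  if (!(max > 0)) return values.map(() => 8);
  return values.map((v) => (180 * v) / max + 8);
}

// Placeholder colors keyed by the `t` field: 'r' is repo, 'u' is user.
const TYPE_COLORS = new Map([
  ['r', 0x4f8ff7],
  ['u', 0xf7c948],
]);
const FALLBACK_COLOR = 0xffffff;

function colorForType(t) {
  return TYPE_COLORS.has(t) ? TYPE_COLORS.get(t) : FALLBACK_COLOR;
}
```

Keying the color on `t` means only two map entries are needed, instead of one per Louvain community as with the `c` field.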

@tyn1998
Member Author

tyn1998 commented Feb 22, 2023

Thank you for pointing that out, @frank-zsy! I forgot that the original code had also been adapted for our needs. Now the graph looks much better (blue is repo, yellow is user/bot):

OG-newdata.mov

Overview:

image

I'm going to create a PR and /build-test it so you can have a try online :)

@frank-zsy
Contributor

Great, you can also open an issue in OpenDigger and I will add a cron task to OpenDigger to generate data for OpenGalaxy.

@tyn1998
Member Author

tyn1998 commented Mar 2, 2023

Anyhow, we can now rely on the 2023-01 data :)

closed via X-lab2017/open-digger#1208

@tyn1998 tyn1998 closed this as completed Mar 2, 2023