
Question about calculating support by considering pattern occurrences inside each graph #4

Open
w-zx opened this issue Apr 10, 2018 · 8 comments


w-zx commented Apr 10, 2018

Hi, this work is great and very helpful, but I notice that the policy for calculating the support of a pattern is to count each pattern at most once per graph, no matter how many times it occurs inside that graph.

For example, if a dataset contains 2 graphs, t # 0 and t # 1, and a certain pattern occurs 3 times inside graph t # 0 and 4 times inside t # 1, the mining result reports this pattern with a support of 2, not 3 + 4 = 7, which is what I have been trying to compute.
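To make the two counting policies concrete, here is a toy sketch (not code from this repo; the `occurrences` list is an invented stand-in for the pattern embeddings):

```python
# Toy illustration: graph-based vs. occurrence-based support.
# Each tuple is (graph_id, occurrence_id) for one embedding of the pattern:
# 3 embeddings in graph 0, 4 embeddings in graph 1.
occurrences = [(0, a) for a in range(3)] + [(1, b) for b in range(4)]

graph_support = len({gid for gid, _ in occurrences})   # each graph counted once
occurrence_support = len(occurrences)                  # every embedding counted

print(graph_support)       # 2
print(occurrence_support)  # 7
```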

I looked into the code and found the following at gspan.py line 314:

def _get_support(self, projected):
    return len(set([pdfs.gid for pdfs in projected]))

I think this function calculates the support of each pattern. Since a set is used, only distinct graphs (gids) are counted, and multiple occurrences inside the same graph are ignored.

To achieve the goal I mentioned above, I changed pdfs.gid to pdfs.edge, supposing that counting distinct edges would give the real (occurrence-based) support of each pattern.

Now this part of the code looks like this:

def _get_support(self, projected):
    return len(set([pdfs.edge for pdfs in projected]))

However, after several tests on the datasets graph.data.simple.5 and graph.data.5, I compared the algorithm's result with my hand count and found that the algorithm's result is always 2 times the real result (e.g. 5 occurrences by hand, but 10 by the algorithm). This is the command I used:

python main.py -s 5 ./graphdata/graph.data.5

So I think it is not a directed vs. undirected graph issue, and I wonder if you could tell me whether I modified the wrong code or whether this goal can be realized at all.

Thank you very much.

betterenvi (Owner) commented Apr 11, 2018

The feature you requested is not available in this repo yet, but you may try the following code to achieve your goal.

def _get_support(self, projected):
    return len(projected)

w-zx (Author) commented Apr 11, 2018

Thank you very much for the reply.

I tested the code you suggested on the dataset graph.data.simple.5 and found that the result is the same (so far, within limited tests) as my adjustment from yesterday, so I think this is the correct direction to work in. Thank you for the suggestion.

As for the case I mentioned, the 2-times-larger result: after further inspection, I found that it only happens for one-edge subgraphs, and only when the subgraph's two vertices share the same label.

For example, one occurrence of the pattern below is counted twice in undirected graph mode, whereas in directed graph mode it is counted only once. So I think the code still works correctly otherwise.

t # 0
v 0 2
v 1 2
e 0 1 5

Support: ...
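The double count above can be reproduced with a small stand-alone sketch (the tuples below are an invented representation of the projections, not the repo's data structures): because the two vertices carry the same label, the DFS enumeration can root the single edge at either endpoint, yielding two mirrored projections for one physical edge. Normalizing the endpoints collapses them:

```python
# Hypothetical sketch: the same physical edge enumerated from both
# endpoints because vertices 0 and 1 share label 2.
raw_projections = [
    (0, 0, 1, 5),  # (gid, frm, to, edge_label)
    (0, 1, 0, 5),  # mirrored enumeration of the same edge
]

# frozenset((frm, to)) makes the edge direction-insensitive,
# so (0, 1) and (1, 0) collapse into one key
normalized = {(gid, frozenset((frm, to)), elb)
              for gid, frm, to, elb in raw_projections}

print(len(raw_projections))  # 2 (double-counted)
print(len(normalized))       # 1 (true occurrence count)
```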

In addition, I have a question about how your test datasets were generated, because I'd like to conduct more experiments. Did you follow the rules described in the Synthetic Datasets section of "gSpan: Graph-Based Substructure Pattern Mining" by X. Yan and J. Han, Proc. 2002 Int. Conf. on Data Mining (ICDM'02), or did you use other data-generation tools?

Finally, I notice that you didn't add a LICENSE to this project, so I wonder if I could use and adapt your code (with proper attribution) as the mining component of one of my MIT-licensed projects?

Thank you very much.

betterenvi (Owner) commented Apr 11, 2018

Indeed, we cannot get a correct answer just by changing len(set([pdfs.gid for pdfs in projected])) to len(projected), since there are duplicate counts (please refer to lines 303-312). To achieve your goal, we need to figure out how to avoid the duplicate counts.
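One possible way to avoid the duplicate counts is to deduplicate projections by the set of physical edges they cover. This is only a sketch, not the repo's implementation; the minimal PDFS and Edge classes below merely mimic the attribute names (gid, edge, prv, frm, to, elb) used in gspan.py:

```python
class Edge:
    """Minimal stand-in for gspan.py's Edge (frm, to, elb only)."""
    def __init__(self, frm, to, elb):
        self.frm, self.to, self.elb = frm, to, elb

class PDFS:
    """Minimal stand-in for gspan.py's PDFS: a linked list of edges via prv."""
    def __init__(self, gid, edge, prv=None):
        self.gid, self.edge, self.prv = gid, edge, prv

def occurrence_support(projected):
    """Count occurrences, treating projections that cover the same set of
    physical edges in the same graph as one occurrence."""
    seen = set()
    for pdfs in projected:
        edges, cur = set(), pdfs
        while cur is not None:
            e = cur.edge
            # frozenset collapses (u, v) and (v, u) for undirected graphs
            edges.add((frozenset((e.frm, e.to)), e.elb))
            cur = cur.prv
        seen.add((pdfs.gid, frozenset(edges)))
    return len(seen)

# The mirrored enumeration of a symmetric one-edge pattern:
projected = [PDFS(0, Edge(0, 1, 5)), PDFS(0, Edge(1, 0, 5))]
print(occurrence_support(projected))  # 1
```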

You may adapt my code with proper attribution.

w-zx (Author) commented Apr 12, 2018

Thank you very much for the suggestion. I will do some further research on that problem.

And thanks for the permission. If it's not a bother, could you provide some more information about the datasets you used? It would be helpful to be able to run tests on a wider variety of datasets.

Thank you very much.

betterenvi (Owner)

Please refer to Section 3.1 of http://glaros.dtc.umn.edu/gkhome/fetch/papers/fsgICDM01.pdf

I don't have the code or a tool to synthesize graph data now, but it is not difficult to write code to do that.
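For what it's worth, a minimal generator in the spirit of those synthetic datasets might look like the sketch below. The parameters, the connectivity scheme, and the trailing "t # -1" terminator are assumptions for illustration, not the paper's exact procedure:

```python
import random

def generate(num_graphs=10, avg_edges=8, num_labels=4, seed=42):
    """Emit graphs in the 't # / v / e' text format used by graph.data files.

    Assumed format per graph: a 't # <gid>' header, then 'v <id> <label>'
    lines, then 'e <from> <to> <label>' lines; 't # -1' ends the file.
    """
    random.seed(seed)
    lines = []
    for gid in range(num_graphs):
        lines.append('t # %d' % gid)
        num_edges = max(1, int(random.gauss(avg_edges, 2)))
        num_vertices = num_edges + 1
        for v in range(num_vertices):
            lines.append('v %d %d' % (v, random.randrange(num_labels)))
        for _ in range(num_edges):
            u, v = random.sample(range(num_vertices), 2)
            lines.append('e %d %d %d' % (u, v, random.randrange(num_labels)))
    lines.append('t # -1')  # assumed end-of-file marker
    return '\n'.join(lines)

print(generate(num_graphs=2, avg_edges=3).splitlines()[0])  # 't # 0'
```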

w-zx (Author) commented Apr 13, 2018

Thank you very much for your time and replies; they have been very helpful. I will study that paper now.

@caijiangyao1991

@w-zx Did you fix the problem? Could you share your code?


Matt-81 commented Mar 8, 2023

Hi, could you please tell me whether you resolved the issue or whether you still need support? If you have already solved it, could you please share the updates? Thank you.
