
Question about calculating support by considering pattern occurrences inside each graph #4

Open
w-zx opened this issue Apr 10, 2018 · 8 comments


w-zx commented Apr 10, 2018

Hi, this work is great and very helpful, but I notice that the policy for calculating the support of a pattern is to count each pattern at most once per graph, no matter how many times it occurs inside that graph.

For example, if a dataset contains 2 graphs, t # 0 and t # 1, and a certain pattern occurs 3 times inside graph t # 0 and 4 times inside t # 1, the mining result reports this pattern with a support of 2, not 3 + 4 = 7, which is what I have been trying to compute.
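To make the two counting policies concrete, here is a toy sketch (not code from this repo; the `occurrences` list is an invented stand-in for the pattern embeddings):

```python
# Toy illustration: graph-based vs. occurrence-based support.
# Each tuple is (graph_id, occurrence_id) for one embedding of the pattern:
# 3 embeddings in graph 0, 4 embeddings in graph 1.
occurrences = [(0, a) for a in range(3)] + [(1, b) for b in range(4)]

graph_support = len({gid for gid, _ in occurrences})   # each graph counted once
occurrence_support = len(occurrences)                  # every embedding counted

print(graph_support)       # 2
print(occurrence_support)  # 7
```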

I looked into the code and found the following at gspan.py line 314:

def _get_support(self, projected):
    return len(set([pdfs.gid for pdfs in projected]))

I think this function calculates the support of each pattern. Since a set is used, only distinct graphs (gids) are counted, and multiple occurrences inside the same graph are ignored.

To achieve the goal I mentioned above, I changed pdfs.gid to pdfs.edge, supposing that counting distinct edges would give the real (occurrence-based) support of each pattern.

Now this part of the code looks like this:

def _get_support(self, projected):
    return len(set([pdfs.edge for pdfs in projected]))

However, after several tests on the datasets graph.data.simple.5 and graph.data.5, I compared the algorithm's result with my hand count and found that the algorithm's result is always 2 times the real result (e.g. 5 occurrences by hand, but 10 by the algorithm). This is the command I used:

python main.py -s 5 ./graphdata/graph.data.5

So I think it is not a directed vs. undirected graph issue, and I wonder if you could tell me whether I modified the wrong code or whether this goal can be realized at all.

Thank you very much.

betterenvi (Owner) commented Apr 11, 2018

The feature you requested is not available in this repo yet, but you may try the following code to achieve your goal.

def _get_support(self, projected):
    return len(projected)

w-zx (Author) commented Apr 11, 2018

Thank you very much for the reply.

I tested the code you suggested on the dataset graph.data.simple.5 and found that the result is the same (so far, within limited tests) as my adjustment from yesterday, so I think this is the correct direction to work in. Thank you for the suggestion.

As for the case I mentioned, the 2-times-larger result: after further inspection, I found that it only happens for one-edge subgraphs, and only when the subgraph's two vertices share the same label.

For example, one occurrence of the pattern below is counted twice in undirected graph mode, whereas in directed graph mode it is counted only once. So I think the code still works correctly otherwise.

t # 0
v 0 2
v 1 2
e 0 1 5

Support: ...
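The double count above can be reproduced with a small stand-alone sketch (the tuples below are an invented representation of the projections, not the repo's data structures): because the two vertices carry the same label, the DFS enumeration can root the single edge at either endpoint, yielding two mirrored projections for one physical edge. Normalizing the endpoints collapses them:

```python
# Hypothetical sketch: the same physical edge enumerated from both
# endpoints because vertices 0 and 1 share label 2.
raw_projections = [
    (0, 0, 1, 5),  # (gid, frm, to, edge_label)
    (0, 1, 0, 5),  # mirrored enumeration of the same edge
]

# frozenset((frm, to)) makes the edge direction-insensitive,
# so (0, 1) and (1, 0) collapse into one key
normalized = {(gid, frozenset((frm, to)), elb)
              for gid, frm, to, elb in raw_projections}

print(len(raw_projections))  # 2 (double-counted)
print(len(normalized))       # 1 (true occurrence count)
```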

In addition, I have a question about how your test datasets were generated, because I'd like to conduct more experiments. Did you follow the rules described in the Synthetic Datasets section of "gSpan: Graph-Based Substructure Pattern Mining" by X. Yan and J. Han, Proc. 2002 Int. Conf. on Data Mining (ICDM'02), or did you use other data-generation tools?

Finally, I notice that you didn't add a LICENSE to this project, so I wonder if I could use and adapt your code (with proper attribution) as the mining component of one of my MIT-licensed projects?

Thank you very much.

betterenvi (Owner) commented Apr 11, 2018

Indeed, we cannot get a correct answer just by changing len(set([pdfs.gid for pdfs in projected])) to len(projected), since there are duplicate counts (please refer to lines 303-312). To achieve your goal, we need to figure out how to avoid the duplicate counts.
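One possible way to avoid the duplicate counts is to deduplicate projections by the set of physical edges they cover. This is only a sketch, not the repo's implementation; the minimal PDFS and Edge classes below merely mimic the attribute names (gid, edge, prv, frm, to, elb) used in gspan.py:

```python
class Edge:
    """Minimal stand-in for gspan.py's Edge (frm, to, elb only)."""
    def __init__(self, frm, to, elb):
        self.frm, self.to, self.elb = frm, to, elb

class PDFS:
    """Minimal stand-in for gspan.py's PDFS: a linked list of edges via prv."""
    def __init__(self, gid, edge, prv=None):
        self.gid, self.edge, self.prv = gid, edge, prv

def occurrence_support(projected):
    """Count occurrences, treating projections that cover the same set of
    physical edges in the same graph as one occurrence."""
    seen = set()
    for pdfs in projected:
        edges, cur = set(), pdfs
        while cur is not None:
            e = cur.edge
            # frozenset collapses (u, v) and (v, u) for undirected graphs
            edges.add((frozenset((e.frm, e.to)), e.elb))
            cur = cur.prv
        seen.add((pdfs.gid, frozenset(edges)))
    return len(seen)

# The mirrored enumeration of a symmetric one-edge pattern:
projected = [PDFS(0, Edge(0, 1, 5)), PDFS(0, Edge(1, 0, 5))]
print(occurrence_support(projected))  # 1
```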

You may adapt my code with proper attribution.

w-zx (Author) commented Apr 12, 2018

Thank you very much for the suggestion. I will do some further research on that problem.

And thanks for the permission. If it's not a bother, could you provide some more information about the datasets you used? It would be helpful to be able to run tests on a wider variety of datasets.

Thank you very much.

betterenvi (Owner)

Please refer to Section 3.1 of http://glaros.dtc.umn.edu/gkhome/fetch/papers/fsgICDM01.pdf

I don't have the code or a tool to synthesize graph data now, but it is not difficult to write code to do that.
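For what it's worth, a minimal generator in the spirit of those synthetic datasets might look like the sketch below. The parameters, the connectivity scheme, and the trailing "t # -1" terminator are assumptions for illustration, not the paper's exact procedure:

```python
import random

def generate(num_graphs=10, avg_edges=8, num_labels=4, seed=42):
    """Emit graphs in the 't # / v / e' text format used by graph.data files.

    Assumed format per graph: a 't # <gid>' header, then 'v <id> <label>'
    lines, then 'e <from> <to> <label>' lines; 't # -1' ends the file.
    """
    random.seed(seed)
    lines = []
    for gid in range(num_graphs):
        lines.append('t # %d' % gid)
        num_edges = max(1, int(random.gauss(avg_edges, 2)))
        num_vertices = num_edges + 1
        for v in range(num_vertices):
            lines.append('v %d %d' % (v, random.randrange(num_labels)))
        for _ in range(num_edges):
            u, v = random.sample(range(num_vertices), 2)
            lines.append('e %d %d %d' % (u, v, random.randrange(num_labels)))
    lines.append('t # -1')  # assumed end-of-file marker
    return '\n'.join(lines)

print(generate(num_graphs=2, avg_edges=3).splitlines()[0])  # 't # 0'
```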

w-zx (Author) commented Apr 13, 2018

Thank you very much for your time and replies; they have been very helpful. I will study that paper now.

@caijiangyao1991

@w-zx Did you fix the problem? Could you share your code?


Matt-81 commented Mar 8, 2023

Hi, could you please tell me whether you resolved the issue or whether you still need support? If you have already solved it, could you please share the updates? Thank you.
