Add score for how close two terms are when using broad/narrow/close predicates #36
I think this could be a very valuable addition. Thank you @sbello for taking the time to write this up. There are a few alternatives we could use to handle this:

S1. Record the score in the existing semantic_similarity_score metadata element.
S2. Record the score as a continuous value in a new, dedicated distance field.
S3. Record the score as a discrete level in a new, dedicated field.

I think I like S3 and S1.

After some contemplation, I am tending towards S3, but I can be convinced to do S1. I am a bit less enthusiastic about S2 because it has the disadvantages of both, and its only advantage is that the semantic_similarity_score metadata element is not repurposed slightly idiosyncratically.
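To make the three options concrete, here is a minimal sketch of what a single broadMatch mapping might look like under each alternative. Only semantic_similarity_score is an existing metadata element; "distance" and "distance_level" are hypothetical field names invented for illustration:

```python
# One mapping, recorded under each of the three alternatives.
# The identifiers are placeholders; "distance" and "distance_level"
# are hypothetical field names, not existing SSSOM elements.
mapping = {
    "subject_id": "A:0000001",
    "predicate_id": "skos:broadMatch",
    "object_id": "B:0000002",
}

s1 = {**mapping, "semantic_similarity_score": 0.8}  # S1: repurpose the existing field
s2 = {**mapping, "distance": 0.2}                   # S2: new continuous field
s3 = {**mapping, "distance_level": "low"}           # S3: new discrete field
```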
I wonder if there's any relevant literature on precisely this case. I feel there must be some existing thinking on this topic, and there are a variety of different algorithms you could use to derive such a distance score.

Algorithms

S1. Semantic similarity

I think that S1 certainly sounds the easiest, but semantic similarity alone doesn't really get at the extent of broadness/narrowness.

S2-3.a. Distance as a flattening of existing narrow/broad hierarchies

- Using existing ontologizations
- On-the-fly ontologization

S2-3.b. Distance as a composite of other properties

As in 'depth of blueness' or 'condition of vehicle' in the example below.

Multi-branch averaging; car example

Maybe terms are connected in more than one respect / via more than one branch, and maybe the branches have different lengths. Like A is narrower than B in 2 different respects, and 1 of those respects is very narrow, and the other is narrow but not as narrow. I don't know how often this actually happens; I'm just thinking of an example.

Let's say we have 2 cars. A is 'average blue' and in generally poor condition but functional. B is dark blue and, say, has a flat tire. Via the 'blueness' branch/vector, maybe we could say that there is a short narrow distance. But on the condition branch/vector, maybe we could say the narrowness is farther (e.g. 'flat tire').

Something that would make this even more complicated is if the 'weights' of the branches differ. Like, if we're comparing the two cars above, maybe for some reason in our classification system the color of the car is more important, so we would treat the narrowness/broadness along the color branch as more important than along the condition branch. A sketch of this weighted averaging appears after this comment.

I think 0-1 is good. I think that the most robust distance metric would be S2 (continuous) rather than S3 (discrete), though S3 could get "quantized", e.g. low=0.33, moderate=0.5, high=0.66.

After thinking about this, I think semantic similarity is really just a proxy for distance. If we used semantic similarity as distance, I would recommend 2 fields: 'distance' and 'distance_algorithm', where the distance algorithm would be 'semantic similarity'. Or we could simply have 2 different fields: 1 for 'distance' (using some other algorithm), plus a 2nd 'semantic similarity' field.

I can't think of too many other ways to think of distance. I can think of it as representing discrete, explicit relationships between entities, as in S2-3.a/b above. And I can think of distance on a continuous spectrum, too. Like in the 'average' vs 'dark blue' car example; the real color spectrum is not discrete like that. But I think for our use cases, things are likely to be discrete. I would be interested to know if anyone has any other ideas as to what 'distance' can really represent here.
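A minimal sketch of the multi-branch idea above, assuming made-up branch names, step counts, weights, and a normalization cap; none of this is an existing SSSOM algorithm or API:

```python
# Hypothetical sketch of weighted multi-branch distance averaging.
# Branch names, step counts, weights, and the max_steps cap are all
# invented for illustration.

def branch_distance(steps: int, max_steps: int = 5) -> float:
    """Continuous (S2-style) distance along one branch, normalized to 0-1."""
    return min(steps, max_steps) / max_steps

def weighted_distance(branches: dict[str, tuple[int, float]]) -> float:
    """Average per-branch distances, weighted by how much each branch matters.

    `branches` maps a branch name to (steps_apart, weight).
    """
    total_weight = sum(weight for _, weight in branches.values())
    return sum(
        branch_distance(steps) * weight for steps, weight in branches.values()
    ) / total_weight

# Car example: color differs by 1 step (average blue -> dark blue) but is
# weighted as more important; condition differs by 3 steps.
print(weighted_distance({"color": (1, 2.0), "condition": (3, 1.0)}))  # ~0.33
```

The quantized S3 levels mentioned above (low=0.33, moderate=0.5, high=0.66) could then be produced by snapping this continuous value to the nearest level.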
Thank you @joeflack4 for your thoughts! I think this is all reasonable thinking; however, the main problem is that we are not that interested in "technical" distance as such, we are interested in real-world distance. This means that even if the two aligned ontologies contained only a single term each, those two terms could still be more or less distant from each other, and there would be no semantic similarity to compute at all. Indeed, when aligning two semantic spaces, there is not really a proper semantic similarity score à la Jaccard in most cases. There is only the conceptual model in the heads of the experts (the embedding space in the brain), and how distant the two terms are according to that. This is what the issuer @sbello is trying to capture here.
At the last meeting (8/3/23) @matentzn proposed adding a score to the manual mappings for how close two terms are when mapped using broad/narrow/close. I'm using this ticket to write up initial thoughts and track proposals for implementing this.
My initial proposal is to estimate how often the given match would result in the inclusion of unwanted data when traversing from the narrower term to the broader term; basically, how much noise is inherent in the match. The consideration has to be from narrow to broad, as all annotations to the narrower term are, or should be, applicable to the broader term. If this is not the case, then you should use related.
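As a minimal sketch of this noise estimate, with made-up counts (nothing here is an existing tool or field):

```python
# Illustrative only: if we map NARROW -> BROAD, what fraction of the data
# retrieved via the broad term is actually relevant to the narrow term?
# The counts are invented for the example.
annotations_under_broad = 120   # everything retrieved via the broad term
annotations_relevant = 108      # the subset relevant to the narrow term

score = round(annotations_relevant / annotations_under_broad, 1)
print(score)  # 0.9 -> little noise on the scale proposed below
```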
The proposed scale is 0-1, where 1 is an exact match; you should never actually use 1, as those cases should use the skos:exactMatch predicate.

I've essentially been treating close as almost but not quite an exact match, so those should have a high score on the scale.

Given that this is at best a rough estimate, I'm going to stick with 1 decimal place for now. So a score of:

0.9 = little noise, almost everything should be useful; I think these should be mostly skos:closeMatch
0.5 = moderately noisy; should be broad/narrow/related
0.1 = very noisy; still broad/narrow, but several steps away from each other in the hierarchies of the ontologies. I often question the value of even making the mapping; I would not make 'related' mappings that were this noisy.
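A small sketch of how these constraints could be enforced when recording a score; the function name is hypothetical, but the rules come straight from the proposal above:

```python
# Hypothetical validator for the proposed score; not part of any existing
# library. Rules: scores live strictly between 0 and 1 (a 1 should be a
# skos:exactMatch instead), and are kept to one decimal place.

def validate_closeness_score(predicate: str, score: float) -> float:
    if predicate == "skos:exactMatch":
        raise ValueError("exact matches should not carry a closeness score")
    if not 0.0 < score < 1.0:
        raise ValueError("score must be strictly between 0 and 1; "
                         "use skos:exactMatch instead of a score of 1")
    return round(score, 1)  # rough estimate: one decimal place

print(validate_closeness_score("skos:closeMatch", 0.93))  # 0.9, little noise
print(validate_closeness_score("skos:broadMatch", 0.5))   # 0.5, moderately noisy
```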