Optimization: SqlImplementation.relationships_metadata()
#659
base: main
Conversation
@cmungall This draft is ready for your eyes.
You recently added the ability to retrieve axiom annotations on relationships (#656), but there is a significant performance bottleneck. I noticed this during my mondo-ingest synchronization pipeline work, where this operation was taking 104 minutes to complete.
I've sped this up to between 30 seconds and 10 minutes, depending on which implementation you prefer; I prefer the faster method. I also thought of a possible implementation that leans more on sqlite and/or SQLAlchemy rather than Python, but Python is easier for me.
I know that before you can merge this, you need at the very least a test case and code style adherence. I will meet with @matentzn and see if he wants me to complete this PR. If not, then this is here for you / your team as an example of what could / should be done.
FYI, even though there's no test case, I was able to check that the code works as expected in an ad hoc way, by editing/running this code in the local version of oaklib in the site-packages of mondo-ingest, and using it in my synchronization PR.
```python
@@ -1065,8 +1065,58 @@ def _rbox_relationships(
    def relationships_metadata(
```
Recommended implementation
| n_relationships | seconds_taken |
|---|---|
| 500 | 21 |
| 5000 | 23 |
| 37129 | 27 |
I wasn't sure if you'd like this approach because it is heavier on memory during the function's runtime, but it seems to work well. I added a simple method `._axiom_annotations_all()` which returns all annotations from the DB. Then the rest of the work is done in Python.
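The "fetch everything, then index in Python" idea can be sketched roughly as below. This is a hedged illustration, not the PR's actual code: the row shape, the type alias `RELATIONSHIP`, and the helper name `index_annotations_by_edge` are assumptions for demonstration.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple

# Hypothetical alias mirroring oaklib's RELATIONSHIP: (subject, predicate, object)
RELATIONSHIP = Tuple[str, str, str]


def index_annotations_by_edge(
    rows: List[Dict[str, str]],
) -> Dict[RELATIONSHIP, List[Tuple[str, Any]]]:
    """Group flat annotation rows into a lookup keyed by (sub, pred, obj),
    so per-edge annotation retrieval becomes an O(1) dict access."""
    lookup: Dict[RELATIONSHIP, List[Tuple[str, Any]]] = defaultdict(list)
    for r in rows:
        edge = (r["subject"], r["predicate"], r["object"])
        lookup[edge].append((r["annotation_predicate"], r["annotation_value"]))
    return dict(lookup)


# Toy data: two annotations on the same edge
rows = [
    {"subject": "M:123", "predicate": "is_a", "object": "M:456",
     "annotation_predicate": "oio:source", "annotation_value": "MEDGEN:1"},
    {"subject": "M:123", "predicate": "is_a", "object": "M:456",
     "annotation_predicate": "oio:source", "annotation_value": "NCIT:2"},
]
lookup = index_annotations_by_edge(rows)
```

The one-time cost of scanning all rows is amortized across every edge lookup, which is why the timings above barely grow with `n_relationships`.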
what if n_relationships > 1m?
If `n_relationships` > 1M, then this further increases the memory burden, and that could be too significant for some machines. The higher `n` is, though, the better this implementation performs compared to the "alternative implementation", although there are other possible implementations, e.g. leaning more on sqlite as you say.
Just in the interest of keeping comments in threads / resolvable, replying to your summary comment here:

> Your 2nd approach is on the right lines.

I thought you might feel that way. It's currently much less performant than this one, but you have an idea to speed that up:

> I'd simplify; how about instead making a method like `_axiom_annotations_multi` that takes in a list of tuples, and let sqlite do the optimization?
I was thinking of that too, but it might take me significantly more time due to my lack of familiarity. I might not have that time, but I'll ask Nico; maybe I do. If I don't, I was wondering if you wouldn't mind finishing up what I've started?
I'm not sure we need two variations of `._axiom_annotations_multi()`. Why don't we increase its performance? Here are my ideas so far:
Details: `._axiom_annotations_multi()` optimizations
These would be modifications to semsql rather than oaklib, unless we went with (b) and tore down those mutations at the end of the function.
a. Indexing
Are the columns `subject`, `object`, and `predicate` indexed? If not, I wonder if adding indexes might be all we need to make this performant.
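As a minimal sketch of the indexing idea: the table and column names below are assumptions for illustration (semsql's actual schema may differ), but the pattern is standard sqlite. A composite index matching the per-edge equality filter lets the query planner do an index search instead of a full table scan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table mirroring an axiom-annotation table; names are illustrative
conn.execute(
    "CREATE TABLE owl_axiom_annotation ("
    "subject TEXT, predicate TEXT, object TEXT, "
    "annotation_predicate TEXT, annotation_object TEXT)"
)
# Composite index covering the (subject, predicate, object) equality filter
conn.execute(
    "CREATE INDEX idx_axiom_ann_spo "
    "ON owl_axiom_annotation (subject, predicate, object)"
)
# EXPLAIN QUERY PLAN shows whether the per-edge lookup uses the index
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT * FROM owl_axiom_annotation "
    "WHERE subject = ? AND predicate = ? AND object = ?",
    ("M:123", "is_a", "M:456"),
).fetchall()
# Expect a SEARCH ... USING INDEX line rather than SCAN
```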
b. (Temporary) lookup column
We can basically do the same thing as my first approach, but in SQL rather than Python. I am creating a lookup using `RELATIONSHIP`s as dictionary keys. We might do the same in sqlite, creating either/both columns that concat `sub-pred-obj` and `sub-pred`:
| sub-pred-obj | sub-obj | sub | pred | obj | MORE_COLS |
|---|---|---|---|---|---|
| M:123-is_a-M:456 | M:123-M:456 | M:123 | is_a | M:456 | ... |
| M:123-is_a-M:456 | M:123-M:456 | M:123 | is_a | M:456 | ... |
This will have some duplicate rows due to multiple axioms per edge, but can still make for faster queries. Alternatively, we could create a new table with only 1 row per edge, and collect all the axioms for that edge as a JSON serialization in a single column. That should be really fast.
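The one-row-per-edge variant can be sketched with sqlite's JSON1 aggregate functions, assuming they are available in the sqlite build (table and column names below are illustrative, not semsql's):

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE axiom_ann (sub TEXT, pred TEXT, obj TEXT, ann_pred TEXT, ann_val TEXT);
INSERT INTO axiom_ann VALUES
  ('M:123', 'is_a', 'M:456', 'oio:source', 'MEDGEN:1'),
  ('M:123', 'is_a', 'M:456', 'oio:source', 'NCIT:2');
-- One row per edge; all annotations collapsed into a JSON array column
CREATE TABLE edge_ann AS
SELECT sub, pred, obj,
       json_group_array(json_array(ann_pred, ann_val)) AS anns
FROM axiom_ann
GROUP BY sub, pred, obj;
""")
row = conn.execute(
    "SELECT anns FROM edge_ann WHERE sub='M:123' AND pred='is_a' AND obj='M:456'"
).fetchone()
anns = json.loads(row[0])  # list of [annotation_predicate, annotation_value] pairs
```

The `GROUP BY` is paid once when the lookup table is built; afterwards each edge is a single-row fetch.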
```python
for rel, anns in rel_anns.items():
    yield rel, anns

def relationships_metadata2(
```
Alternate implementation
| n_relationships | seconds_taken |
|---|---|
| 500 | 2 |
| 5000 | 53 |
| 37129 | 516 |
This approach splits `relationships` into groups based on predicates. Then it uses the existing `._axiom_annotations_multi()` method to fetch all annotations for each group of relationships.
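The split-by-predicate step can be sketched as below. This is an illustrative reconstruction, not the PR's code; the helper name and `RELATIONSHIP` alias are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical alias mirroring oaklib's RELATIONSHIP: (subject, predicate, object)
RELATIONSHIP = Tuple[str, str, str]


def group_by_predicate(
    rels: List[RELATIONSHIP],
) -> Dict[str, List[RELATIONSHIP]]:
    """Bucket relationships by predicate so each bucket can be fetched
    with a single annotation query."""
    groups: Dict[str, List[RELATIONSHIP]] = defaultdict(list)
    for rel in rels:
        groups[rel[1]].append(rel)
    return dict(groups)


rels = [("M:1", "is_a", "M:2"), ("M:1", "part_of", "M:3"), ("M:4", "is_a", "M:5")]
groups = group_by_predicate(rels)
```

One query per predicate rather than per edge reduces round trips, which is why this variant still beats the naive per-edge loop, though the timings above show it scaling worse than the fetch-all approach.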
```python
# Fetch all annotations
anns: List[om.Annotation] = self._axiom_annotations_all()
# Simplify objects
anns: List[Dict[str, str]] = [{
```
Minor declutter refactor
This point needs the least of Chris' attention.
This function can be simplified if we move this conversion of `List[Dict[str, str]]` (annotation objects) --> `Dict[RELATIONSHIP, List[Tuple[PRED_CURIE, Any]]]` from `.relationships_metadata()` to the private axiom methods `._axiom_annotations_multi()` and my new `._axiom_annotations_all()`.
I did things this way because there is an inconsistency between `._axiom_annotations()` (returns `List[om.Annotation]`) and `._axiom_annotations_multi()` (returns `Iterator[OwlAxiomAnnotation]`), and so I wasn't sure which pattern I should adhere to.
b0c110f to bc3a543
Codecov Report
Attention: ❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main     #659      +/-   ##
==========================================
- Coverage   76.63%   76.56%   -0.08%
==========================================
  Files         251      251
  Lines       29214    29250     +36
==========================================
+ Hits        22389    22395      +6
- Misses       6825     6855     +30
```

☔ View full report in Codecov by Sentry.
```python
) -> Iterator[Tuple[RELATIONSHIP, List[Tuple[PRED_CURIE, Any]]]]:
    """For given relationships, returns axiom annotation metadata"""
    # Split based on predicate
    preds: Set[PRED_CURIE] = {x[1] for x in relationships}
```
Remember the input is an `Iterable`, so if it's an iterator, not a collection, this will consume it.
Good catch! I added a line above creating `rels` and using that variable instead for the rest of the function:

```python
rels: List[RELATIONSHIP] = [x for x in relationships]
```
```python
anns = [(ann.predicate, ann.object) for ann in self._axiom_annotations(*rel)]
```

```python
"""For given relationships, returns axiom annotation metadata"""
# Fetch all annotations
anns: List[om.Annotation] = self._axiom_annotations_all()
```
I think you are making assumptions about the distribution of numbers of these. Consider a version of OMOP where every axiom has an annotation.
By "distribution", do you rather mean the potentially high number of annotations returned, as you query in this comment?

> what if n_relationships > 1m?

If that's what you mean, let's continue discussion in that thread and mark this one resolved. If you mean something else though, can you clarify?
```python
rel_anns_all = {}
for pred in preds:
    # Fetch annotations for given relationships
    subs, objs = [x[0] for x in relationships], [x[2] for x in relationships]
```
Don't you want to filter this?
Nice catch! This is something I didn't need for my use case, so I added it after and hadn't tested it. I added:

```python
# Filter rels
rels = [x for x in relationships if x[1] == pred]
```

And changed references in that for loop to use `rels` instead of `relationships`.
I appreciate the time taken here. Your 2nd approach is on the right lines. I'd simplify: how about instead making a method like `_axiom_annotations_multi` that takes in a list of tuples, and let sqlite do the optimization?
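The tuple-taking variant suggested here can be sketched with a temp-table join: insert the wanted `(sub, pred, obj)` tuples into a temporary table and let sqlite's planner optimize a single JOIN, rather than issuing one query per edge or predicate. This is a hedged sketch under assumed table/column names, not oaklib's or semsql's actual schema.

```python
import sqlite3
from typing import List, Tuple


def axiom_annotations_multi(
    conn: sqlite3.Connection, rels: List[Tuple[str, str, str]]
) -> List[Tuple[str, str, str, str, str]]:
    """Fetch annotations for many edges with a single JOIN query,
    using a temporary table of wanted (sub, pred, obj) tuples."""
    conn.execute("CREATE TEMP TABLE wanted (sub TEXT, pred TEXT, obj TEXT)")
    conn.executemany("INSERT INTO wanted VALUES (?, ?, ?)", rels)
    rows = conn.execute(
        "SELECT a.sub, a.pred, a.obj, a.ann_pred, a.ann_val "
        "FROM axiom_ann AS a "
        "JOIN wanted AS w "
        "ON a.sub = w.sub AND a.pred = w.pred AND a.obj = w.obj"
    ).fetchall()
    conn.execute("DROP TABLE temp.wanted")  # tear down the temp table
    return rows


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE axiom_ann (sub TEXT, pred TEXT, obj TEXT, ann_pred TEXT, ann_val TEXT);
INSERT INTO axiom_ann VALUES
  ('M:1', 'is_a', 'M:2', 'oio:source', 'X:1'),
  ('M:3', 'is_a', 'M:4', 'oio:source', 'X:2');
""")
rows = axiom_annotations_multi(conn, [("M:1", "is_a", "M:2")])
```

With an index on `(sub, pred, obj)` the join degenerates to an index lookup per wanted tuple, which is what "let sqlite do the optimization" buys over Python-side filtering.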
bc3a543 to 6b0be57
- Add: `SqlImplementation._axiom_annotations_all()`
- Update: `SqlImplementation.relationships_metadata()`: increased performance to O(log n)
- Update: added docstrings for several related methods
6b0be57 to 1bfe896
Nice, I didn't realise you were working on this! Cool!
@matentzn we discussed it a little bit at one of our recent meetings; I thought you had seen the PR. But basically I just need to hear from you whether you think it's worth my time to complete this. It will speed up any of our scripts that use this function by about an hour or two. I'm actually not sure now that we will even need this method for the synchronization pipeline, now that I understand the work better.

Chris has a suggested implementation, but I think it involves using SQLAlchemy a bit more. However, I think the bulk of the time on this will involve a lot of back and forth between me and Chris on the implementation. I don't know if you will like the idea that I have in mind, which would be creating an on-the-fly column, basically an index. An alternative solution to that is to create additional indexes, but that would require an update to semsql, not oak. And Chris hasn't gotten back to me yet on what he thinks about that.
Updates
Optimization: `SqlImplementation.relationships_metadata()`
Addresses `SqlImplementation.relationships_metadata()` #660