Optimization: SqlImplementation.relationships_metadata()
#659
base: main
Conversation
@cmungall This draft is ready for your eyes.
You recently added the ability to retrieve axiom annotations on relationships (#656), but there is a significant performance bottleneck. I noticed this during my mondo-ingest synchronization pipeline work, where this operation was taking 104 minutes to complete.
I've sped this up to between 30 seconds and 10 minutes, depending on which implementation you prefer; I prefer the faster method. I also thought of a possible implementation that leans more on sqlite and/or SQLAlchemy rather than Python, but Python is easier for me.
I know that before you can merge this, you need at the very least a test case and code style adherence. I will meet with @matentzn and see if he wants me to complete this PR. If not, then this is here for you / your team as an example of what could / should be done.
FYI, even though there's no test case, I was able to check that the code works as expected in an ad hoc way, by editing/running this code in the local version of oaklib in the site-packages of mondo-ingest, and using it in my synchronization PR.
```python
@@ -1065,8 +1065,58 @@ def _rbox_relationships(
    def relationships_metadata(
```
Recommended implementation
| n_relationships | seconds_taken |
|---|---|
| 500 | 21 |
| 5000 | 23 |
| 37129 | 27 |
I wasn't sure if you'd like this approach because it is heavier on memory during the function's runtime, but it seems to work well. I added a simple method `._axiom_annotations_all()` which returns all annotations from the DB. Then the rest of the work is done in Python.
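The "fetch everything, then index in Python" idea can be sketched roughly as below. This is a hedged illustration, not the PR's actual code: the row shape, the type alias `RELATIONSHIP`, and the helper name `index_annotations_by_edge` are assumptions for demonstration.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple

# Hypothetical alias mirroring oaklib's RELATIONSHIP: (subject, predicate, object)
RELATIONSHIP = Tuple[str, str, str]


def index_annotations_by_edge(
    rows: List[Dict[str, str]],
) -> Dict[RELATIONSHIP, List[Tuple[str, Any]]]:
    """Group flat annotation rows into a lookup keyed by (sub, pred, obj),
    so per-edge annotation retrieval becomes an O(1) dict access."""
    lookup: Dict[RELATIONSHIP, List[Tuple[str, Any]]] = defaultdict(list)
    for r in rows:
        edge = (r["subject"], r["predicate"], r["object"])
        lookup[edge].append((r["annotation_predicate"], r["annotation_value"]))
    return dict(lookup)


# Toy data: two annotations on the same edge
rows = [
    {"subject": "M:123", "predicate": "is_a", "object": "M:456",
     "annotation_predicate": "oio:source", "annotation_value": "MEDGEN:1"},
    {"subject": "M:123", "predicate": "is_a", "object": "M:456",
     "annotation_predicate": "oio:source", "annotation_value": "NCIT:2"},
]
lookup = index_annotations_by_edge(rows)
```

The one-time cost of scanning all rows is amortized across every edge lookup, which is why the timings above barely grow with `n_relationships`.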
what if n_relationships > 1m?
If `n_relationships` > 1M, then this further increases the memory burden, and that could be too significant for some machines. The higher `n` is, though, the better this implementation performs compared to the "alternative implementation", although there are other possible implementations, e.g. leaning more on sqlite as you say.
Just in the interest of keeping comments in threads / resolvable, replying to your summary comment here:

> Your 2nd approach is on the right lines.

I thought you might feel that way. It's currently much less performant than this one, but you have an idea to speed that up:

> I'd simplify; how about instead making a method like `_axiom_annotations_multi` that takes in a list of tuples, and let sqlite do the optimization?
I was thinking of that too, but it might take me significantly more time due to my lack of familiarity. I might not have that time, but I'll ask Nico; maybe I do. If I don't, I was wondering if you wouldn't mind finishing up what I've started?
I'm not sure we need two variations of `._axiom_annotations_multi()`. Why don't we increase its performance? Here are my ideas so far:
Details: `._axiom_annotations_multi()` optimizations
These would be modifications to semsql rather than oaklib, unless we went with (b) and tore down those mutations at the end of the function.
a. Indexing
Are the columns `subject`, `object`, and `predicate` indexed? If not, I wonder if adding indexes might be all we need to make this performant.
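As a minimal sketch of the indexing idea: the table and column names below are assumptions for illustration (semsql's actual schema may differ), but the pattern is standard sqlite. A composite index matching the per-edge equality filter lets the query planner do an index search instead of a full table scan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table mirroring an axiom-annotation table; names are illustrative
conn.execute(
    "CREATE TABLE owl_axiom_annotation ("
    "subject TEXT, predicate TEXT, object TEXT, "
    "annotation_predicate TEXT, annotation_object TEXT)"
)
# Composite index covering the (subject, predicate, object) equality filter
conn.execute(
    "CREATE INDEX idx_axiom_ann_spo "
    "ON owl_axiom_annotation (subject, predicate, object)"
)
# EXPLAIN QUERY PLAN shows whether the per-edge lookup uses the index
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT * FROM owl_axiom_annotation "
    "WHERE subject = ? AND predicate = ? AND object = ?",
    ("M:123", "is_a", "M:456"),
).fetchall()
# Expect a SEARCH ... USING INDEX line rather than SCAN
```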
b. (Temporary) lookup column
We can basically do the same thing as my first approach, but in SQL rather than Python. I am creating a lookup using `RELATIONSHIP`s as dictionary keys. We might do the same in sqlite, creating either/both columns that concat `sub-pred-obj` and `sub-pred`:
| sub-pred-obj | sub-obj | sub | pred | obj | MORE_COLS |
|---|---|---|---|---|---|
| M:123-is_a-M:456 | M:123-M:456 | M:123 | is_a | M:456 | ... |
| M:123-is_a-M:456 | M:123-M:456 | M:123 | is_a | M:456 | ... |
This will have some duplicate rows due to multiple axioms per edge, but can still make for faster queries. Alternatively, we could create a new table with only 1 row per edge, and collect all the axioms for that edge as a JSON serialization in a single column. That should be really fast.
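The one-row-per-edge variant can be sketched with sqlite's JSON1 aggregate functions, assuming they are available in the sqlite build (table and column names below are illustrative, not semsql's):

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE axiom_ann (sub TEXT, pred TEXT, obj TEXT, ann_pred TEXT, ann_val TEXT);
INSERT INTO axiom_ann VALUES
  ('M:123', 'is_a', 'M:456', 'oio:source', 'MEDGEN:1'),
  ('M:123', 'is_a', 'M:456', 'oio:source', 'NCIT:2');
-- One row per edge; all annotations collapsed into a JSON array column
CREATE TABLE edge_ann AS
SELECT sub, pred, obj,
       json_group_array(json_array(ann_pred, ann_val)) AS anns
FROM axiom_ann
GROUP BY sub, pred, obj;
""")
row = conn.execute(
    "SELECT anns FROM edge_ann WHERE sub='M:123' AND pred='is_a' AND obj='M:456'"
).fetchone()
anns = json.loads(row[0])  # list of [annotation_predicate, annotation_value] pairs
```

The `GROUP BY` is paid once when the lookup table is built; afterwards each edge is a single-row fetch.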
```python
for rel, anns in rel_anns.items():
    yield rel, anns

def relationships_metadata2(
```
Alternate implementation
| n_relationships | seconds_taken |
|---|---|
| 500 | 2 |
| 5000 | 53 |
| 37129 | 516 |
This approach splits `relationships` into groups based on predicates. Then it uses the existing `._axiom_annotations_multi()` method to fetch all annotations for each group of relationships.
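The split-by-predicate step can be sketched as below. This is an illustrative reconstruction, not the PR's code; the helper name and `RELATIONSHIP` alias are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical alias mirroring oaklib's RELATIONSHIP: (subject, predicate, object)
RELATIONSHIP = Tuple[str, str, str]


def group_by_predicate(
    rels: List[RELATIONSHIP],
) -> Dict[str, List[RELATIONSHIP]]:
    """Bucket relationships by predicate so each bucket can be fetched
    with a single annotation query."""
    groups: Dict[str, List[RELATIONSHIP]] = defaultdict(list)
    for rel in rels:
        groups[rel[1]].append(rel)
    return dict(groups)


rels = [("M:1", "is_a", "M:2"), ("M:1", "part_of", "M:3"), ("M:4", "is_a", "M:5")]
groups = group_by_predicate(rels)
```

One query per predicate rather than per edge reduces round trips, which is why this variant still beats the naive per-edge loop, though the timings above show it scaling worse than the fetch-all approach.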
```python
# Fetch all annotations
anns: List[om.Annotation] = self._axiom_annotations_all()
# Simplify objects
anns: List[Dict[str, str]] = [{
```
Minor declutter refactor
This point needs the least of Chris' attention.
This function can be simplified if we move this conversion of `List[Dict[str, str]]` (annotation objects) --> `Dict[RELATIONSHIP, List[Tuple[PRED_CURIE, Any]]]` from `.relationships_metadata()` to the private axiom methods `._axiom_annotations_multi()` and my new `._axiom_annotations_all()`.
I did things this way because there is an inconsistency between `._axiom_annotations()` (returns `List[om.Annotation]`) and `._axiom_annotations_multi()` (returns `Iterator[OwlAxiomAnnotation]`), and so I wasn't sure which pattern I should adhere to.
b0c110f to bc3a543
Codecov Report
Attention: ❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main     #659      +/-   ##
==========================================
- Coverage   76.63%   76.56%   -0.08%
==========================================
  Files         251      251
  Lines       29214    29250     +36
==========================================
+ Hits        22389    22395      +6
- Misses       6825     6855     +30
```

☔ View full report in Codecov by Sentry.
```python
) -> Iterator[Tuple[RELATIONSHIP, List[Tuple[PRED_CURIE, Any]]]]:
    """For given relationships, returns axiom annotation metadata"""
    # Split based on predicate
    preds: Set[PRED_CURIE] = {x[1] for x in relationships}
```
Remember the input is an `Iterable`, so if it's an iterator, not a collection, this will consume it.
Good catch! I added a line above creating `rels` and using that variable instead for the rest of the function:

```python
rels: List[RELATIONSHIP] = [x for x in relationships]
```
```python
anns = [(ann.predicate, ann.object) for ann in self._axiom_annotations(*rel)]
```

```python
"""For given relationships, returns axiom annotation metadata"""
# Fetch all annotations
anns: List[om.Annotation] = self._axiom_annotations_all()
```
I think you are making assumptions about the distribution of numbers of these. Consider a version of OMOP where every axiom has an annotation.
By "distribution", do you rather mean the potentially high number of annotations returned, as you query in this comment?

> what if n_relationships > 1m?

If that's what you mean, let's continue discussion in that thread and mark this one resolved. If you mean something else though, can you clarify?
```python
rel_anns_all = {}
for pred in preds:
    # Fetch annotations for given relationships
    subs, objs = [x[0] for x in relationships], [x[2] for x in relationships]
```
Don't you want to filter this?
Nice catch! This is something I didn't need for my use case, so I added it after and hadn't tested it. I added:

```python
# Filter rels
rels = [x for x in relationships if x[1] == pred]
```

And changed references in that for loop to use `rels` instead of `relationships`.
I appreciate the time taken here. Your 2nd approach is on the right lines. I'd simplify: how about instead making a method like `_axiom_annotations_multi` that takes in a list of tuples, and let sqlite do the optimization?
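The tuple-taking variant suggested here can be sketched with a temp-table join: insert the wanted `(sub, pred, obj)` tuples into a temporary table and let sqlite's planner optimize a single JOIN, rather than issuing one query per edge or predicate. This is a hedged sketch under assumed table/column names, not oaklib's or semsql's actual schema.

```python
import sqlite3
from typing import List, Tuple


def axiom_annotations_multi(
    conn: sqlite3.Connection, rels: List[Tuple[str, str, str]]
) -> List[Tuple[str, str, str, str, str]]:
    """Fetch annotations for many edges with a single JOIN query,
    using a temporary table of wanted (sub, pred, obj) tuples."""
    conn.execute("CREATE TEMP TABLE wanted (sub TEXT, pred TEXT, obj TEXT)")
    conn.executemany("INSERT INTO wanted VALUES (?, ?, ?)", rels)
    rows = conn.execute(
        "SELECT a.sub, a.pred, a.obj, a.ann_pred, a.ann_val "
        "FROM axiom_ann AS a "
        "JOIN wanted AS w "
        "ON a.sub = w.sub AND a.pred = w.pred AND a.obj = w.obj"
    ).fetchall()
    conn.execute("DROP TABLE temp.wanted")  # tear down the temp table
    return rows


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE axiom_ann (sub TEXT, pred TEXT, obj TEXT, ann_pred TEXT, ann_val TEXT);
INSERT INTO axiom_ann VALUES
  ('M:1', 'is_a', 'M:2', 'oio:source', 'X:1'),
  ('M:3', 'is_a', 'M:4', 'oio:source', 'X:2');
""")
rows = axiom_annotations_multi(conn, [("M:1", "is_a", "M:2")])
```

With an index on `(sub, pred, obj)` the join degenerates to an index lookup per wanted tuple, which is what "let sqlite do the optimization" buys over Python-side filtering.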
bc3a543 to 6b0be57
- Add: `SqlImplementation._axiom_annotations_all()`
- Update: `SqlImplementation.relationships_metadata()`: increased performance to O(log n)
- Update: added docstrings for several related methods
6b0be57 to 1bfe896
Nice, I didn't realise you were working on this! Cool!
@matentzn we discussed it a little bit at one of our recent meetings; I thought you had seen the PR. But basically I just need to hear from you whether you think it's worth my time to complete this. It will speed up any of our scripts that use this function by about an hour or two. I'm actually not sure now that we will even need this method for the synchronization pipeline, now that I understand the work better.

Chris has a suggested implementation, but I think it involves using SQLAlchemy a bit more. However, I think the bulk of the time on this will involve a lot of back and forth between me and Chris on the implementation. I don't know if you will like the idea that I have in mind, which would be creating an on-the-fly column, basically an index. An alternative solution to that is to create additional indexes, but that would require an update to semsql, not oak. And Chris hasn't gotten back to me yet on what he thinks about that.
Updates
Optimization: `SqlImplementation.relationships_metadata()`
Addresses `SqlImplementation.relationships_metadata()` #660