-
Notifications
You must be signed in to change notification settings - Fork 89
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Define three levels of support #388
Comments
@MarcoGorelli the Ibis issue noted as a blocker has been closed since January, is there still any blocker? also, is the Narwhals API mimicking the |
it's closed in the sense that Ibis forbids that operation - so, I'd say yes 😄
both, we have both I think you can go very easily to PyArrow for Ibis's table though, right? We're ramping up PyArrow.table support, so it should be possible for anyone passing in an Ibis table to go straight to PyArrow.table, and then we can support the remaining operations |
this is the case where you want to sort a single column, while maintaining the order of the other columns? Ibis/the relational model don't have an ordered index right, so you'd have to model this by splitting into two tables and joining back (i.e. pretty sure mimicking the behavior is possible, people have built the pandas API on Ibis and presumably support this) wrote an example over on the original issue are there any other blockers? going through Arrow tables for larger datasets is going to be slow |
thanks for the example! In general it's not clear to me how to make Ibis fully compatible - but, I'm working on make Narwhals more pluggable, so then the structure should be:
and |
'metadata' level if coming along in #517 If we're adding levels, I think a "sql frontend" level could also be in scope, which Ibis and DuckDb could fit into. This is a reversal of my original stance, and I need to acknowledge this, so sorry for having originally only wanted to stick to full-api-support libraries |
I'm still confused w.r.t. Ibis, are there any APIs in the subset of the Polars APIs ( |
For example:
In Polars, that's also unsupported for pl.LazyFrame, but pl.DataFrame allows it. And For a concrete use case, check how scikit-lego is using Narwhals, especially Is it possible to rewrite that using Ibis? |
what do you mean with this example?
you achieve this with the Ibis API
I'm not sure what you mean by this or why exactly it would matter, concepts and behaviors will differ across dataframe libraries right. my understanding is the Narwhals API is to abstract these differences with a subset of the Polars API I'm fairly confident you could write that |
In the linked code, there's something like this: date_max = X_index_df["__date__"].max()
# [...]
if current_date + self.train_duration + time_shift + self.gap_duration > date_max:
break If I try evaluating this kind of operation with an ibis table, it raises: In [28]: type(tbl)
Out[28]: ibis.expr.types.relations.Table
In [29]: if 3 > tbl['b'].max():
...: print('here')
...:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[29], line 1
----> 1 if 3 > tbl['b'].max():
2 print('here')
File ~/scratch/.venv/lib/python3.11/site-packages/ibis/expr/types/core.py:170, in Expr.__bool__(self)
169 def __bool__(self) -> bool:
--> 170 raise ValueError("The truth value of an Ibis expression is not defined")
ValueError: The truth value of an Ibis expression is not defined
Thanks - has this been done? If so, then we can (trivially) reuse |
as I mentioned, the concepts and behaviors will differ -- with Ibis, you'd achieve that like: [ins] In [1]: import ibis
[ins] In [2]: t = ibis.examples.penguins.fetch()
[nav] In [3]: if t["bill_length_mm"].mean().to_pyarrow().as_py() > 0:
...: print("hello")
hello though there are a few other options to achieve the same behavior. this is explicit (go through PyArrow to a Python object) and we could put some syntactic sugar around this if that's something people want, but it seems sufficient for abstracting behind the Narwhals API
it's tightly coupled to a single backend and has a ton of performance overhead. like Polars, Ibis meaningfully differs from the pandas API and of course does not maintain an ordered index. you can still achieve the same behavior as an ordered index and mimic the pandas API (I'm just using this to explain why I'm fairly confident you can implement the breadth of the Narwhals API using the Ibis API) |
sure but then you're automatically triggering computation for the user? If someone does
and these comparisons are each automatically triggering Side note: automatic implicit computation was one of the major disagreements I had with some of your colleagues during the dataframe consortium project, which spurred me to take several ideas they'd rejected (expressions, separating lazy vs eager apis, |
I'm back to confused -- the origin of this issue is:
and a claim that you can't implement the full Narhwals API with Ibis. there have been 2 examples I've seen, and in both cases you could implement the API with Ibis
it depends? I'm a little confused about the use case or how you could achieve this without computation, i.e. how would you know the mean of a column without computing it? for performance reasons, you wouldn't write those as Python if/else statements using Ibis (even though you can), you'd rather use
ok. I have never been involved in the dataframe consortium project, my point is that the examples given for the justification of putting Ibis in a different class of support are all implementable with the Ibis API. I don't think I care about automatic implicit computation personally and I'm just commenting as an Ibis contributor |
I wouldn't, which is why in that class from scikit-lego they use If your table represents a DAG with some very expensive operations, then repeatedly triggering them implicitly risks getting really expensive. Whereas if you're working with an eager dataframe, then there's no risk of repeatedly triggering an expensive
In both cases we would end up doing something expensive under the hood (join in the first case, and collection + conversion to pyarrow in the second) I don't really want such performance footguns in Narwhals itself - having said that, it is possible to extend Narwhals by implementing
Right, that's exactly what I feared 😄 |
I think more assumptions are being made about performance -- I'm not sure why you'd repeatedly trigger a DAG with some very expensive operations. there are numerous ways to cache computations or mimic eager execution fully if you really want to with the Ibis API I'm not sure what performance you're optimizing for, but my understanding is out-of-core larger-than-RAM datasets will necessitate something like how Ibis with the DuckDB backend works, and I'm pretty sure you'd get performance on-par or better for smaller workloads too anyway sorry for hijacking this issue, you can obviously do whatever you want. my overall point is I'm reasonably confident the Narwhawls API is fully implementable in a performant manner with Ibis. I do not have the time to work on this but am a little concerned about the repeated assumptions being made; feel free to reach out within the Ibis community if you or anyone are working on this in the future and isn't sure how to accomplish it with Ibis. it can probably done, and it can probably be done performantly |
Thanks for your inputs, really appreciate it 🙏
Sure, then please allow me to rephrase: I'm not saying it can't be done, I'm saying - I can't do it. For now, I'll aim for partial API support - that, at least, I think I have the skills to manage. If things get more traction, we can be more ambitious |
I don't think this is the correct takeaway from my earlier statement :) the fully breadth of the pandas APIs, including all of the inherently inefficient operations that the Polars and Ibis APIs omit, is a lot different than the subset of the Polars API Narwhals implements |
It's the "automatically figure out when to trigger compute / cache in a Python API" that seems intractably complex to me, not the pandas-specific parts like row labels |
why would you need to do this? I'm still not understanding. in eager mode, you'll always immediately execute, in lazy mode you'll only do it on explicit API calls, right? it would be a great to have a high-level overview of concerns or a concrete example of something in the Narwhals API you couldn't implement performantly with the Ibis API |
Sure, please take a look at the Perhaps the easiest thing would be if you put together a POC of how you'd wrap Ibis in the Polars API such that it can be used in that class |
that is code using the Narwhals API? are there any specific Narwhals operations you're concerned can't be implemented? it looks like they could all be implemented to me. if you can implement the full Narwhals API on Ibis (the subset of the Polars API), won't scikit-lego's code just work? |
There's no operation which, in isolation, I'm concerned about But taken together, the fact that the solution you proposed requires converting to pyarrow means that, if I want to avoid repeatedly converting the same dataframe to pyarrow, you're going to have to do some caching And automatically figuring out when to do caching on behalf of the user strikes me as intractably complex - or at least, something that's beyond my skill-level. But, if anyone could show at least a POC of how it can be done, I'd gladly incorporate it into Narwhals |
I'm not convinced this is something you need to do with how the Narwhals API works, but it's probable I'm missing something -- as I understand it, you're opting into eager or lazy evaluation of expressions. you can easily simulate eager behavior with Ibis if you want, no need for complicating caching mechanics, though of course you're limiting to result sets that fit in-memory (as with any other eager API) my overarching concern here is misrepresentative statements (or takeaways from my statements) about Ibis predicated on misunderstandings: we've gone from you can't do this with the Ibis API -> it's not performant with the Ibis API -> it's too complicated to manage, with little effort put into the investigations of each claim. if performance footguns are a key concern for Narwhals, I think some effort in understanding options with Ibis and the tradeoffs should be considered before implementation as a third tier backend and claiming that's how it had to be. anyway sorry for taking so much time, again you're free to do whatever you want (and to be clear I'm just one person, my views are not necessarily representative of opinions of any open source projects I work on or companies I'm employed by :) ) -- if anyone in the Narwhals community is interested in implementing the full API for an Ibis backend that's performant, I'm sure the Ibis team is happy to help with any questions! |
right, that's enough update: Narwhals 1.1.0 adds "interchange"-level support. Which includes Ibis 🤝 🤗 Increased levels of support may come in future releases |
Alright, now that we've all slept on this, let's see if we can move forwards amicably 🤗 What rubbed me the wrong way was that I'd put a fair bit of effort into adding "interchange"-level support, which includes Ibis. Even if this isn't full-API-level support, it's enough to help fix a bug which Ibis users are currently facing when using Altair: vega/altair#3473 Narwhals can already make things better for Ibis users, even if we don't achieve full API level support I should clarify my concerns around performance. My objective with Narwhals is to make (close-to-)zero-cost-abstraction around libraries. If I was to implement That's why I suggested to start with partial API support and to consider going further later if things get more traction - that, at least, I'm reasonably confident I can implement as a zero-cost-abstraction. To then be told that, no, that's not good enough, I'm misrepresenting things, and if I don't think full API support is realistic, it's because I've put little effort in to understand Ibis - that comes across as unfriendly. Doubly-so after having put in the effort to fix a bug for Ibis users (though, admittedly, you might not have been aware of this last part) If anyone from Ibis wants to claim that it's possible to implement the full Narwhals on Ibis, as a zero-cost abstraction, then that's fine - but I'd like to invite you to provide the implementation, or at least a POC. You said you "might be missing something", and indeed, I can assure you that if you try to complete what I suggested, you'll realise what I've removed the Ibis comparison on the README, as I understand you didn't appreciate it - I'm sorry for having posted it. Conversely, I'd also like to ask that folks from Ibis stop asking me "why don't you just use Ibis" I hope we'll be on good terms going forwards 🤝 One more addendum: if anyone wants to talk anything over, my calendly is here |
it's clearly not worth continuing this discussion but as far as I'm concerned things didn't get too heated and we're on perfectly professional terms, happy to collaborate on whatever going forward. again, this is your project and you can do whatever you want. but please don't ascribe the actions/intent of others as my own as you've done for a second time here -- my intent on this issue was to convince you or whoever that you can use Ibis as a performant backend for the Narwhals API. I have tried my best but you clearly disagree, which is fine (and to your point the proof would be an implementation). however from my perspective, I have been met with shifting goalposts, false assumptions about my intent or what I think, mischaracterization of my statements, and now an ask to specifically tell people I might know to not ask you a question you don't like answering FWIW thanks for fixing a bug Ibis users run into, I fix bugs for Polars users too and would for Narwhals if anything came up that I could handle (though, admittedly, you might not have been aware and I'm not sure why this is something to point out in the context of this discussion) |
Currently all the backends we support implement the full Narwhals API, which includes expressions
But if someone just needs to inspect schemas and some other simple things, it should be possible to do so in a much simpler way
So, we could have
twothree levels of support:Metadatainterchange: other libraries which wouldn't necessarily be compatible with all of Narwhal's API. Ibis for example could fit in here (see here for a blocker to full API support - there's a few others too). Any library that implements the dataframe interchange protocol can probably fit in here - but, by being a self-standing library, we can move much quicker than the interchange protocol and work around its limitationsThe objective of "Metadata" would be to be something like an extended and more user-friendly DataFrame Interchange Protocol. Any library which already supports the interchange protocol would be in-scope
The text was updated successfully, but these errors were encountered: