Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Integration of official async ES client into DSL library #921

Closed
vaidsu opened this issue Jun 20, 2018 · 19 comments
Closed

Integration of official async ES client into DSL library #921

vaidsu opened this issue Jun 20, 2018 · 19 comments

Comments

@vaidsu
Copy link

vaidsu commented Jun 20, 2018

Hi team,

Great effort on contributing and maintaining a great layer on top of ES access.

While there are some discussions around the async integration into the official DSL library like #704 and #556 I did not find any conclusion on that. I did my research and also found there isn't anything existing. While there are some prototypes and suggestions like here I did not see an implementation.

I was presuming, it could be due to two reasons: Either its really not there and no body has raised a PR yet, (or) this could be a trivial implementation so most users took care by overriding the parent classes.

In any case, I needed the integration of the official async python library into DSL. I was experimenting many ways, and something simple yet required to duplicate was to create async_xyz classes on top of the low level APIs that were accessing the es client functions and eventually not awaiting the coroutines from async ES client.

That said, here is a very basic working/and tested implementation in my fork master...vaidsu:master

Please consider this as a basic prototype implementation and I just started writing tests. I am not sure, if this is good to go, or not. Then if yes, need to understand how far I need to write tests, I am just planning to make all the sync tests runnable on async calls.

Please let me know. If you like the thought, but have suggestions, also feel free to share -- I am open to contribute back.

Thanks

@vaidsu
Copy link
Author

vaidsu commented Jun 21, 2018

Team, do you have any suggestions please?

@vaidsu
Copy link
Author

vaidsu commented Jun 25, 2018

Any help is appreciated

@jaddison
Copy link

I'm not sure that your changes would be compatible with Python2.

I think a good example to look at is the Python ZMQ bindings; they are structured where all asyncio-related components exist in, and are only importable from, a separate sub-module called zmq.asyncio. I believe this is what allows the overall package to remain Python2-compatible.

See PyZMQ asyncio docs and PyZMQ asyncio code; basically any non-asyncio component has an asyncio-specific counterpart in the zmq.asyncio module.

So instead of from zmq import Context (non asyncio) you do from zmq.asyncio import Context.

I'm not sure that PyZMQ approach would work here, but I think it's worth investigating it to achieve asyncio support in elasticsearch-dsl-py.

@vaidsu
Copy link
Author

vaidsu commented Jun 28, 2018

@jaddison Thanks for your response. Let me check on that and get back. 👍

@vaidsu
Copy link
Author

vaidsu commented Jun 28, 2018

@jaddison Thanks for the approach, contemplating on how I arrived at this solution, I did not take Py2 into account. I was just thinking between sync and async users and make it compatible. In fact I was trying my best to make the signature exactly the same, but async def, syntax would not allow that. Anyway, packaging it better is a better though, let me vet it further. Thanks for checking this back.

@jaddison
Copy link

@vaidsu - note that I'm not a representative from Elastic, I'm just a user like yourself. So please don't take my comments as the way to get this support merged in. 🙂

I would like to see asyncio support in the library too, but it's up to the Elastic team to give royal assent or not.

@vaidsu
Copy link
Author

vaidsu commented Jul 2, 2018

@jaddison still I have at least one person voting an interest 😄 I will try my best to see how we can fit into the import issues.

@honzakral
Copy link
Contributor

I think that the best way to do this would be to actually have a module as part of https://github.com/elastic/elasticsearch-py-async (something like elasticsearch_async.dsl) that would provide a wrapper/subclass of the Search and Document classes.

If there are any hooks or changes we need to make in this repo for it to work more cleanly we totally can, but having the code live in the async library will take care of all the compatibility/import issues. Alternatively it can be a completely separate package for the same reason.

@vaidsu
Copy link
Author

vaidsu commented Jul 6, 2018

@honzakral Thanks for your input,

Alternatively it can be a completely separate package for the same reason.

The changes are pertaining to only few places, Search -> AsyncSearch, FacetedSearch -> AsyncFacetedSearch, Document -> AsyncDocument.

  1. Basically wherever there are low level calls, we have to make them awaitable or may be even schedule into the loop. In this case, are you suggesting, it may be nice to have elasticsearch-dsl-async-py? Worry is, not sure how far it can be kept in sync with the core fixes in this library.

  2. Alternatively, thinking about intelligent importing like @jaddison mentioned, will it make sense to hold within the same library, but import them without breaking the py2 to 3 users?

  3. Or, may be can we add this as a plugin? More like how pytest, allows? So, the async users might install both, dsl-py and dsl-async-py.

Thanks for your time, please let me know.

@honzakral
Copy link
Contributor

Basically wherever there are low level calls, we have to make them awaitable or may be even schedule into the loop.....

I would be very open to changing the implementation of elasticsearch-dsl to allow for easier plug-in points. For example isolating all low level calls to separate methods etc.

Alternatively, thinking about intelligent importing like @jaddison mentioned....

The problem with that approach is that it wouldn't allow the code to use syntax and features that is only available in 3.x, thus greatly limiting what could be used and done. A separate library or a module in elasticsearch-py-async is a much better option.

Or, may be can we add this as a plugin?

I see no reason why this would have to be a plugin - it should be perfectly possible for this module to live outside of the library codebase and extend and wrap the functionality. As you mentioned before it is only several API calls (.execute(), .count() on Search, save() and init() on Document etc) that need to be wrapped, rest can remain as is.

Overall I see there is demand for this functionality so I propose a following scenario:

Let us start with a set of simple helper methods as a proof-of-concept - simple funtctions like async_execute(search) that will, using elasticsearch-py-async execute the search and asynchronously return proper Response object. That should be straightforwards and give us an idea of what is needed and how we can restructure the code in Search to make it even simpler.

Once we have that code we can have another round of discussion about where should the code live - on its own, in elasticsearch-dsl, or in elasticsearch-async-py.

Unfortunately I have neither bandwidth nor expertise to work on this; I am, however, available for any questions or feedback if someone chooses to take this on and remain open to include the resulting code in any of the official libraries if that is what would work best.

Does that make sense?

@honzakral
Copy link
Contributor

Also /cc @fxdgear since he is currently maintaining elasticsearch-py-async.

@vaidsu
Copy link
Author

vaidsu commented Jul 22, 2018

@honzakral I can work on that, since I am currently using that aspect of the code integration from my fork at work. And the core heart of our systems are purely async, so we cannot afford sync whatsoever.

That said, really appreciate taking your time and provide valuable comments to move forward.

Taking your example,

You want to start with async_execute which will be an additional function on top of execute?

Current execute is here and the async_execute is here

What I have done is., while the changes are very minimal, I abstracted the async methods into its own class called AsyncSearch instead of Search. Just wanted to clarify are you saying you want to localize the async_execute within the Search class already?

OR

For example isolating all low level calls to separate methods etc.

Or should we provide full control to users to take care of save/delete low level APIs on their own?

Sure, I think this is doable, only concern is when it comes to async Vs. sync due to its syntax differences, even having an abstract interface may be difficult.

Former option seems feasible, how do you want me to proceed to give a try.

[EDIT]

We may still need to balance out the Python3 and 2 differences in localizing in the same place.

Thanks 👍

@braunsonm
Copy link
Contributor

Is there anything that I can do to help make this a reality? Glad to see your progress on this already @vaidsu !

@vaidsu
Copy link
Author

vaidsu commented Aug 13, 2018

@Chaosca whatever code I have posted in the fork is what I am using internally in the internal infrastructure. Due to immediate stake holder MVP demos I did not complete the proper separation of concerns in the code. I might close in the weeks to come.

Do you want this feature urgently? Or do you have any ideas to quickly upgrade my fork?

@braunsonm
Copy link
Contributor

I'm not sure what is all needed to make it complete. However it's for sure something that I would like to use on a project that I'm working on.

Does your changes also use the elasticsearch-async-py wrapper for things that the DSL can't do?

For example I use the DSL create_connection to have a regular elasticsearch-py object so i can run the update method with a inline script. It would be nice if that also used the async version.

@vaidsu
Copy link
Author

vaidsu commented Aug 14, 2018

@Chaosca sorry for delay. The idea is to use the same library, but when it comes to async you use the async APIs with async/await.

# SYNC
# HelloWorld is the DocType 
hw = HelloWorld(
        hello='foo',
        timestamp=time.time())
    hw.save()
    hw.update(hello="anotherworld", timestamp=int(time.time() * 1000))

And when it comes to async, the same is reused (by using using=<new async es wrapper obj>

Picking a snippet from the internal documentation template I wrote with the fork.

# ASYNC
async def async_fn(future):
    print('Executing faceted search')
    # Now lets try a faceted search
    for i in ['some', 'cool']:
        hw = HelloWorld(hello=i, timestamp=int(time.time() * 1000))
        await hw.async_save(using=es, refresh=True)
    await hw.async_refresh(using=es)
    adv = await AdvancedSearch().execute()
    for hello, count, selected in adv.facets.hello:
        print('Got {0} for {1} counts'.format(hello, count))
    future.set_result('type3')

Two main differences to note:

  1. using the new object
  2. save Vs. async_save different dialect due to async API.

I was given nice options from folks to modify the current state of implementation to slightly different version to attempt to merge. I was also given an option to use as a separate library, but that is difficult considering that most audience come here.

Here is the current state. I am sad to not being able to spend much immediately to close this right away, but at least I have the code running internally.

Am I clear? Please let me know.

@braunsonm
Copy link
Contributor

braunsonm commented Aug 14, 2018

Thanks for the info @vaidsu

Is there a way to access the async es wrapper after initializing the DSL with create_connection? If so that would be useful as well for using the regular elasticsearch py syntax as well.

from elasticsearch_dsl.connections import connections

es = connections.create_connection(["hosts"])  # Way to create the connection with the async wrapper?
hw = HelloWorld(
        hello='foo',
        timestamp=time.time())
await hw.async_save()
await hw.async_update(hello="anotherworld", timestamp=int(time.time() * 1000))

# Async update?
await es.update(body={
    "script": {
        ...
    }
}, index='myindex', doc_type='hello_world', id=hw.meta.id)

Since there isn't a way to run update scripts in the DSL right now there are some points where the underlying API has to be used.

Seems like your implementation idea is well thought out though and I hope it can be merged in the near future :)

@timothycrosley
Copy link

timothycrosley commented Sep 3, 2019

Now that Python2 is imminently reaching EOL, would a Python3 solution be accepted? If not, what would need to be changed from @vaidsu's solution to make it acceptable?

Thanks!

~Timothy

@sethmlarson
Copy link
Contributor

Closing this as a duplicate of #1355

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants