Scalable and testable support for Django #235
Comments
Some reply would be nice ;)
In skimming this, this is a huge project with a lot of bits, and it seems like it rearchitects portions of ElasticUtils, too. I don't have time to work through this right now. I don't know offhand when I'll be able to.
It does not break ElasticUtils in any way (unless we assume BatchIndexable would replace Indexable); it builds upon it and reuses most of what has already been done in terms of Django integration. Maybe some other collaborators would be willing to work it through? Delaying it would only make a merge much harder once ElasticUtils and my code progress.
Hi there, I've just noticed some asynchronicity issues (they came up in production) in the current version, so you can refrain from reviewing it until I fix it. Best of luck
An updated version (working in production, fully distributed, with heartbeat support for flushing document queues into Elasticsearch) can be found here. To define heartbeat tasks you have to:
@app.task(base=Task)
def index_users(lock_timeout, async=False):
    index_documents_for_mapping_type(UserMapping, lock_timeout, async=async)

@app.task(base=Task)
def index_items(lock_timeout, async=False):
    index_documents_for_mapping_type(ItemMapping, lock_timeout, async=async)

CELERYBEAT_SCHEDULE = {
    'index_items': {
        'task': 'recommendation.tasks.index_items',
        'schedule': datetime.timedelta(seconds=2),
        'args': (8,)
    },
    'index_users': {
        'task': 'recommendation.tasks.index_users',
        'schedule': datetime.timedelta(seconds=1),
        'args': (4,)
    },
}

Hope you guys like it, in our tests it's super fast :)
Will there be any chance for some comments? This has been hanging here for quite some time.
It has been sitting around for a while. I haven't had time to spend on this. Maybe in July, but I can't make any promises. Right now my priorities are to get 0.10 out asap because that's totally fux0ring everyone. I have no idea whether the issues you're fixing here affect other people or not. It'd be nice to find out if anyone else is affected and if so, by what specific aspects. Knowing that might adjust my priorities.
Ok, thanks for an honest reply. We have already advanced it to allow … In terms of interest, I have a consulting enterprise project on the side …
Hi there,
Our project is quite a stretch for ElasticUtils, since documents are very frequently updated and inserted and everything has to be testable, including the things done via ElasticUtils in Elasticsearch. That's why we took some time to extend ElasticUtils with a proper infrastructure that meets these requirements, which we would like to give back to the community.
The purpose of this issue is to discuss our approach and suggest areas that have to be improved before we do a legitimate pull request with this (major) chunk of functionality.
Here you can find all our code with quite thorough coverage.
Once we reach an agreement I'll remove all dependencies (mainly utils, and our fork of hot_redis) and do a proper PR.
Below you will find a quick overview of our approach and references that explain some architectural decisions.
BatchIndexable
This class is intended to be a replacement for using MappingType and Indexable when integrating with Django. It:
- is inspired by django.models.Model
- is known to DjangoElasticsearchTestCase, which is aware of all indices and can prefix them for tests, so tests can be run on the same Elasticsearch instance as production or development and won't break anything
- can have custom index settings
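To make the list above concrete, here is a minimal sketch of what a BatchIndexable subclass could look like. It is only an illustration: the import path, the index_settings attribute name, and the PostMapping/Post names are assumptions modeled on the existing MappingType/Indexable API, not the actual code from the PR.

# Illustrative sketch only -- module path, index_settings, and model names are
# assumptions; the extract_document signature matches the one described below.
from myproject.search.batch import BatchIndexable  # hypothetical import path
from myapp.models import Post  # hypothetical Django model


class PostMapping(BatchIndexable):
    # custom per-mapping-type index settings (assumed attribute name)
    index_settings = {
        'number_of_shards': 1,
        'number_of_replicas': 0,
    }

    @classmethod
    def get_model(cls):
        return Post

    @classmethod
    def get_mapping(cls):
        return {
            'properties': {
                'id': {'type': 'integer'},
                'text': {'type': 'string', 'analyzer': 'snowball'},
                'comments': {'type': 'integer'},
            }
        }

    @classmethod
    def extract_document(cls, obj_id, obj=None):
        # same signature as Indexable.extract_document; the returned dict is
        # what later gets pickled onto the indexing queue
        if obj is None:
            obj = cls.get_model().objects.get(pk=obj_id)
        return {
            'id': obj.id,
            'text': obj.text,
            'comments': obj.comments.count(),
        }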
DjangoElasticsearchCase
This is a very convenient TransactionCase that prefixes all indexes defined by BatchIndexables, makes sure they exist during tests, and cleans up after itself. When you need to make sure all documents have been indexed before moving forward in your test, it provides a refresh_all() method that will make sure everything you did with BatchIndexables is reflected in Elasticsearch.
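A sketch of how a test might use it, assuming the PostMapping class from the earlier sketch; the import paths and the query are illustrative, only refresh_all() comes from the description above.

# Illustrative sketch -- import paths and the test scenario are assumptions;
# the flow (create, refresh_all(), query) follows the description above.
from myproject.search.testcases import DjangoElasticsearchCase  # hypothetical path
from myapp.mappings import PostMapping  # the sketch above
from myapp.models import Post


class PostSearchTest(DjangoElasticsearchCase):
    def test_new_post_is_searchable(self):
        Post.objects.create(text='hello elasticsearch')
        # flush/refresh the prefixed test indices so everything done through
        # BatchIndexables is visible to queries
        self.refresh_all()
        results = PostMapping.search().query(text__match='hello')
        self.assertEqual(results.count(), 1)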
es_setup
This is a collection of helper methods that are used by DjangoElasticsearchCase, and a command that can be used when deploying your application to production.
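None of the helper names below appear in the description above; this is just a hypothetical sketch of how such deploy-time helpers could be wired into a Django management command.

# Entirely hypothetical sketch -- es_setup's real helper names are not given
# in this issue; the point is only that index creation and mapping setup can
# be run from a management command at deploy time.
from django.core.management.base import BaseCommand

from myproject.search import es_setup  # hypothetical module path


class Command(BaseCommand):
    help = 'Create Elasticsearch indices and mappings for all BatchIndexables'

    def handle(self, *args, **options):
        # assumed helpers: create any missing indices (with their custom
        # settings) and push the mapping of every registered BatchIndexable
        es_setup.create_indices()
        es_setup.put_mappings()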
indexing_queues
These consist of a regular queue used for handling inserts and an id set used for handling frequent updates. They are based on hot_redis datatypes, but can easily be abstracted to use any persistence/cache layer as long as it supports queue and set datatypes.
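As an illustration of how easily the backend can be swapped, here is a sketch of the two structures backed by plain redis-py instead of hot_redis; the class names and key layout are assumptions.

# Illustrative sketch -- InsertQueue/DirtyIdSet and the Redis key names are
# assumptions; hot_redis provides equivalent Redis-backed datatypes.
import pickle

import redis

client = redis.StrictRedis()


class InsertQueue(object):
    """FIFO queue of pickled documents waiting to be bulk-indexed."""

    def __init__(self, name):
        self.key = 'indexing:queue:%s' % name

    def push(self, document):
        client.rpush(self.key, pickle.dumps(document))

    def pop_all(self):
        # atomically read and clear the queue
        pipe = client.pipeline()
        pipe.lrange(self.key, 0, -1)
        pipe.delete(self.key)
        raw, _ = pipe.execute()
        return [pickle.loads(item) for item in raw]


class DirtyIdSet(object):
    """Set of model ids whose documents need to be re-extracted."""

    def __init__(self, name):
        self.key = 'indexing:dirty:%s' % name

    def add(self, obj_id):
        client.sadd(self.key, obj_id)

    def pop_all(self):
        # atomically read and clear the set
        pipe = client.pipeline()
        pipe.smembers(self.key)
        pipe.delete(self.key)
        ids, _ = pipe.execute()
        return ids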
The architecture is based on this post.
In short:
Inserting is handled by queues that keep pickled documents (the return value of BatchIndexable.extract_document(obj_id, obj=None)), since the full model is usually already available in the post_save signal, and by extracting it in the receiver we save an extra db request.
Updating is handled by sets that keep the ids of models that have been updated. This is a robust way of handling denormalized data (which is what you usually end up with in Elasticsearch) like counters, arrays, or nested documents.
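A sketch of the two receivers implied by this; only the general flow (push a pickled document on insert, add an id to the dirty set on related updates) comes from the description, while the model names, the Comment-to-Post foreign key, and the queue objects are assumptions carried over from the earlier sketches.

# Illustrative sketch -- names and module paths are assumptions.
from django.db.models.signals import post_save
from django.dispatch import receiver

from myapp.indexing import InsertQueue, DirtyIdSet  # hypothetical module
from myapp.mappings import PostMapping
from myapp.models import Post, Comment

post_insert_queue = InsertQueue('post')
post_dirty_ids = DirtyIdSet('post')


@receiver(post_save, sender=Post)
def queue_post_document(sender, instance, created, **kwargs):
    if created:
        # the full model is already in memory here, so extract the document
        # now and push it onto the queue -- no extra db request later
        post_insert_queue.push(
            PostMapping.extract_document(instance.pk, obj=instance))


@receiver(post_save, sender=Comment)
def mark_post_dirty(sender, instance, **kwargs):
    # only the id goes into the set; the expensive extraction happens later,
    # in one batch, no matter how many comments arrive in the meantime
    post_dirty_ids.add(instance.post_id)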
This is best explained with an example. Let's say we have a post mapping type which has a comments counter (used for boosting) and a lot of text. Normally you would receive a post_save signal when a comment is created and call extract_document() on the post, which would mean analyzing all the text just to increment the comments counter by one. Now imagine each post gets hundreds of comments per second. With our approach, the id that is already available in the post_save of the comment is added to a set (marked as dirty). The set is processed periodically (for example every 30 seconds), and all dirty documents are extracted and indexed using the batch API. We go from possibly hundreds of thousands of indexing requests to a few Elasticsearch batch requests, and from hundreds of thousands of additional db requests to just a few.
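A sketch of the periodic flush step described above, using the elasticsearch-py bulk helper; the index/doctype names and the queue objects are assumptions carried over from the earlier sketches.

# Illustrative sketch of the periodic flush -- bulk() is the standard
# elasticsearch-py helper that sends one batch request instead of one
# request per document; everything else here is an assumption.
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

from myapp.indexing import InsertQueue, DirtyIdSet  # hypothetical, see earlier sketch
from myapp.mappings import PostMapping

post_insert_queue = InsertQueue('post')   # same keys as in the receiver sketch
post_dirty_ids = DirtyIdSet('post')


def flush_post_queues(es=None):
    es = es or Elasticsearch()
    actions = []

    # freshly inserted posts: their documents were already extracted and
    # pickled in the post_save receiver
    for doc in post_insert_queue.pop_all():
        actions.append({'_index': 'posts', '_type': 'post',
                        '_id': doc['id'], '_source': doc})

    # dirty posts: hundreds of comment signals collapse into a single
    # extract_document() call per post
    for obj_id in post_dirty_ids.pop_all():
        doc = PostMapping.extract_document(obj_id)
        actions.append({'_index': 'posts', '_type': 'post',
                        '_id': doc['id'], '_source': doc})

    # one bulk request instead of one indexing request per save
    bulk(es, actions)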