Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Baseline Django Field Tests #62

Draft
wants to merge 14 commits into
base: master
Choose a base branch
from
Draft

Conversation

NoiSek
Copy link
Collaborator

@NoiSek NoiSek commented May 5, 2023

This PR adds testing and confirms functionality for all of Django's ORM coverage and field types. mixer has been added as a dev dependency for test fuzzing.

I've opened this as a draft for discussion on a few points.

@NoiSek NoiSek requested a review from r4fek May 5, 2023 21:28
from django.db.models.sql import compiler

from datetime import timedelta
from cassandra.util import Duration


def unique_rowid():
# TODO: guarantee that this is globally unique
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This caused a failure in the test for SmallAutoField because the number generated was beyond the 32767 limit for short ints. Which leads me to:

I don't think supporting SmallAutoField is very important, but I do think that models should be able to generate unique identifiers consistently without risk of collision. I think that row IDs generated with this driver should use the standard UUID, since Scylla does not actually support the paradigm of "unique counters".

Realistically, the only reason you would specifically want to preserve the tradition of keeping row_ids as ints is because it gives you an idea of when an object was created chronologically and because it makes URL slugs cleaner, but if you are consciously choosing to use Scylla I think that's an acceptable tradeoff and you can model that requirement in better ways.

@r4fek are you opposed to the idea of making these UUID values by default?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not at all. Let's use uuids by default.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As an update to this:

There are actually many major obstacles to using UUIDs as the default for AutoFields. Django assumes by default that all AutoFields are subclasses of IntegerField [1], and the workarounds for that require a lot of monkeypatching. The canonical approach is to have an AutoID field + a separate UUID field, but that sort of defeats the purpose since Scylla can't really do that anyway...

Transparently overriding AutoFields into UUIDs doesn't work so well because the generated UUID gets cast as an integer regardless internally due to field validation in many places (and UUIDs translate to 128 bit integers, much larger than even bigint's range).

I did try creating a UUIDAutoField that subclasses the existing AutoField so it could be used by setting DEFAULT_AUTO_FIELD to UUIDAutoField. With that I was able to override the existing validation methods and prevent it from being cast to an integer, but even in that situation it would lead to column conflict errors since internal models still expect a BigAutoField value-- meaning it would generate a UUID value and pass every validation check internally but fail once the actual query executed.

Django devs seem to be aware of this and working on it, so I figured at this point it is probably best to wait for full support with an official UUIDAutoField and double down on the original approach (all AutoFields as BigInts, generate suitably random number). I've reverted back to the original approach and added a few additional steps so that generated values are guaranteed to be more random.

[1] https://code.djangoproject.com/ticket/32577
[2] https://code.djangoproject.com/ticket/33450


# Ensure that the other side of this relationship has been created and is selectable
for library in libraries:
assert library.book_set is not None
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aside from the SmallAutoField test, this is currently the only failure.

Creation works as expected, but the ORM's internal JOIN behavior is hard to work around when these values are being selected afterward.

pid = getpid()
sys_random = SystemRandom().randint(0, 65535)

return timestamp + counter['value'] + pid + sys_random
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally this approach means that given the same code running simultaneously: each worker will return a different pid, each thread of that worker will return a different counter, and each call will return a different sys_random value.

The counter in particular is worth noting because it uses a little known "pitfall" of Python, which is that default values specified as dicts are persistent between runs.

e.g.

>>> def foo(counter={'value':0}):
...   counter['value'] += 1
...   return counter['value']
...
>>> foo()
1
>>> foo()
2
>>> foo()
3
>>> foo()
4

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants