Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unstable advisory unique_content_id #1583

Open
keshav-space opened this issue Sep 4, 2024 · 0 comments
Open

Unstable advisory unique_content_id #1583

keshav-space opened this issue Sep 4, 2024 · 0 comments

Comments

@keshav-space
Copy link
Member

keshav-space commented Sep 4, 2024

Current implementation of unique_content_id is unstable because the order of keys is not preserved in JsonField.
On PostgreSQL by default JsonField uses jsonb, which does not preserve the order or whitespace.
For more information, see https://docs.djangoproject.com/en/5.1/ref/models/fields/#django.db.models.JSONField and https://www.postgresql.org/docs/16/datatype-json.html.

This can lead to widespread duplication of advisory data, resulting in increased storage usage.

Below is a snippet to reproduce the bug, where the same advisory data leads to different unique_content_id values:

In [1]: from vulnerabilities import importer

In [2]: from packageurl import PackageURL

In [3]: from univers.version_range import VersionRange

In [4]: from django.utils import timezone

In [5]: from vulnerabilities.pipes.advisory import insert_advisory

In [6]: from vulnerabilities.importer import AdvisoryData

In [7]: advisory_data = importer.AdvisoryData(
   ...:     aliases=["CVE-2020-13371337"],
   ...:     summary="vulnerability description here",
   ...:     affected_packages=[
   ...:         importer.AffectedPackage(
   ...:             package=PackageURL(type="pypi", name="dummy"),
   ...:             affected_version_range=VersionRange.from_string("vers:pypi/>=1.0.0|<=2.0.0"),
   ...:         )
   ...:     ],
   ...:     references=[importer.Reference(url="https://example.com/with/more/info/CVE-2020-13371337")],
   ...:     date_published=timezone.now(),
   ...:     url="https://test.com",
   ...: )

In [8]: r = insert_advisory(advisory_data, "test")

In [9]: r.unique_content_id
Out[9]: '2ececc550f7f6b5537e5f1a767ef0f25'

In [10]: k = Advisory.objects.get(unique_content_id=r.unique_content_id)

In [11]: k.unique_content_id
Out[11]: '2ececc550f7f6b5537e5f1a767ef0f25'

In [12]: k.date_imported = None # Change any field not used for computing content id

In [12]: k.save()

In [13]: k.unique_content_id
Out[13]: 'bf83e58fc8f7eb54d04a59c27f0680f8'

In [14]: assert k.unique_content_id == r.unique_content_id
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[14], line 1
----> 1 assert k.unique_content_id == r.unique_content_id
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants