Packages #478

ctron · 2024-06-28T15:29:08Z

ctron
Jun 28, 2024
Maintainer

I am sorry for reiterating this, but maybe we can find a better way to name and do things. So I'll try to take a step back; maybe it needs a bit more than just renaming. Maybe not.

I assume the intention is to ingest all kinds of data, and then collect/aggregate that data into a model that grows the more we ingest.

Right now, we have the following tables:

Qualified Package -> Versioned Package -> Package

However, these tables only store fragments of PURLs, in a way that lets us easily reference them inside the
database. Better names would IMO be:

Qualified PURL -> Versioned PURL -> Base PURL

This set of information grows with each SBOM we ingest, because we extract PURLs from SBOMs. We also extract PURLs from other sources. But let's ignore that for now.
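To make the Qualified -> Versioned -> Base split more concrete, here is a minimal sketch of how one PURL string could be broken into those three fragments. The struct and field names (`BasePurl`, `VersionedPurl`, `QualifiedPurl`, `split_purl`) are assumptions for illustration, not the actual table or entity definitions, and the parser is deliberately naive: it ignores percent-encoding, subpaths and type-specific rules, and it requires a version.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for the "base PURL" table: type, namespace, name.
#[derive(Debug, Clone, PartialEq, Eq)]
struct BasePurl {
    ty: String,
    namespace: Option<String>,
    name: String,
}

/// Hypothetical "versioned PURL": a base PURL plus a version.
#[derive(Debug, Clone, PartialEq, Eq)]
struct VersionedPurl {
    base: BasePurl,
    version: String,
}

/// Hypothetical "qualified PURL": a versioned PURL plus qualifiers.
#[derive(Debug, Clone, PartialEq, Eq)]
struct QualifiedPurl {
    versioned: VersionedPurl,
    qualifiers: BTreeMap<String, String>,
}

/// Naive split of `pkg:type/namespace/name@version?k=v&k2=v2`.
/// Returns None if the prefix, the type separator or the version is missing.
fn split_purl(purl: &str) -> Option<QualifiedPurl> {
    let rest = purl.strip_prefix("pkg:")?;
    // Separate the optional qualifier string first.
    let (rest, query) = match rest.split_once('?') {
        Some((r, q)) => (r, Some(q)),
        None => (rest, None),
    };
    let (path, version) = rest.split_once('@')?;
    let (ty, full_name) = path.split_once('/')?;
    // Everything before the last '/' is treated as the namespace.
    let (namespace, name) = match full_name.rsplit_once('/') {
        Some((ns, n)) => (Some(ns.to_string()), n.to_string()),
        None => (None, full_name.to_string()),
    };
    let qualifiers = query
        .unwrap_or("")
        .split('&')
        .filter_map(|kv| kv.split_once('='))
        .map(|(k, v)| (k.to_string(), v.to_string()))
        .collect();
    Some(QualifiedPurl {
        versioned: VersionedPurl {
            base: BasePurl { ty: ty.to_string(), namespace, name },
            version: version.to_string(),
        },
        qualifiers,
    })
}

fn main() {
    let q = split_purl("pkg:maven/org.apache.logging.log4j/log4j-core@2.17.1?type=jar")
        .expect("example PURL should parse");
    println!("base:       {:?}", q.versioned.base);
    println!("version:    {}", q.versioned.version);
    println!("qualifiers: {:?}", q.qualifiers);
}
```

Two SBOMs that mention the same package at different versions would then share one base row and differ only at the versioned and qualified levels.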

A simplified view on the SBOMs looks like this:

SBOM:
  sbom: Uuid

SBOM Package:
  sbom: Uuid
  node: Uuid
  name: String
  version: Option<String>

SBOM -[0..*]-> SBOM Packages -[0..*]-> Qualified PURL
                             -[0..*]-> CPE

So SBOMs contain packages (SBOM packages) and may (or may not) declare an alternative name for their packages.

We can browse through SBOM packages, get them by ID. Get relationships between them. All without ever touching PURLs.

In some cases (for RH data, in most cases), these have PURLs attached, which we can use to cross-reference
other documents that use PURLs.

Going through the conversations again, I think we might actually miss another "package".

SBOM -[0..*]-> SBOM Packages -[1..1]-> THIS_PACKAGE -[0..*]-> Qualified PURL
                                                    -[0..*]-> CPE

THIS_PACKAGE:
  purls: Vec<>,
  cpes: Vec<>,
  …

THIS_PACKAGE is independent of an SBOM. And collects (grows) with each package that gets ingested into the system.

The question to me is: how do we identify this package?

By name (from the SBOM) won't work. By hash of an artifact? When ingesting a new SBOM package, how would we know
which THIS_PACKAGE it needs to contribute its information to?

And if we move references (like PURLs and CPEs) from the SBOM package to THIS_PACKAGE, how would we know what the
SBOM contributed? How would we know what the SBOM named that package (aside from the SBOM package name)?

On the other side, do we really need to store THIS_PACKAGE in the database? All the information is there via the SBOM packages anyway. Using SBOM packages also doesn't cause the issue of not knowing where it came from or finding an identifier that would be required to aggregate information.

Maybe THIS_PACKAGE is just a virtual construct, returned by some APIs, based on the PURL tables and the SBOM packages?
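As a rough sketch of what such a virtual construct could look like: the view below is computed on demand from SBOM packages that share a PURL, and it keeps a list of sources so we still know which SBOM contributed what, and under which name. All type and function names (`SbomPackage`, `PackageView`, `package_view`) are hypothetical, string IDs stand in for UUIDs, and an in-memory slice stands in for the database lookups.

```rust
use std::collections::BTreeSet;

/// Simplified SBOM package, following the model above.
#[derive(Debug, Clone)]
struct SbomPackage {
    sbom: String,
    node: String,
    name: String,
    purls: Vec<String>,
    cpes: Vec<String>,
}

/// Hypothetical virtual "package": built per request, never stored.
#[derive(Debug, Default)]
struct PackageView {
    purls: BTreeSet<String>,
    cpes: BTreeSet<String>,
    /// (sbom, node, name): who contributed, and how they named the package.
    sources: Vec<(String, String, String)>,
}

/// Collect every SBOM package that references `purl` and merge their identifiers.
fn package_view(purl: &str, all: &[SbomPackage]) -> PackageView {
    let mut view = PackageView::default();
    for p in all.iter().filter(|p| p.purls.iter().any(|x| x == purl)) {
        view.purls.extend(p.purls.iter().cloned());
        view.cpes.extend(p.cpes.iter().cloned());
        view.sources.push((p.sbom.clone(), p.node.clone(), p.name.clone()));
    }
    view
}

/// Small helper to build example data.
fn pkg(sbom: &str, node: &str, name: &str, purls: &[&str], cpes: &[&str]) -> SbomPackage {
    SbomPackage {
        sbom: sbom.to_string(),
        node: node.to_string(),
        name: name.to_string(),
        purls: purls.iter().map(|s| s.to_string()).collect(),
        cpes: cpes.iter().map(|s| s.to_string()).collect(),
    }
}

fn main() {
    let log4j = "pkg:maven/org.apache.logging.log4j/log4j-core@2.17.1";
    let all = vec![
        pkg("sbom-a", "n1", "log4j-core", &[log4j], &[]),
        pkg("sbom-b", "n7", "Apache Log4j Core", &[log4j],
            &["cpe:2.3:a:apache:log4j:2.17.1:*:*:*:*:*:*:*"]),
        pkg("sbom-b", "n8", "openssl", &["pkg:generic/openssl@3.0.0"], &[]),
    ];
    let view = package_view(log4j, &all);
    println!("purls:   {:?}", view.purls);
    println!("cpes:    {:?}", view.cpes);
    println!("sources: {:?}", view.sources);
}
```

Because nothing is persisted, this sidesteps the question of a stable identifier for THIS_PACKAGE; the price is that the aggregation runs on every request.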

We could call that "package". Maybe there's a better name for that too?

bobmcwhirter · 2024-06-28T17:22:50Z

bobmcwhirter
Jun 28, 2024
Maintainer

Thanks for this.

First, I'm unclear on what THIS_PACKAGE ... is? So any further comments are based on a giant black-hole of my understanding.

It sounds like SBOMs aggregate/reference a multitude of $things, which may be addressed by various names.

Perhaps if we take @JimFuller-RedHat's thought of calling those $things Components, we can avoid other thing-or-name-of-thing issues.

Given that a pURL is a "package URL", it feels like "package" may indeed be "things you can assign a pURL to".

Likewise, CPEs are names of $things, but tend to name "products" for want of a better word.

Like pURLs, CPEs can be roughly ambiguous, in that they are based around pattern-matching. While every product identifiable by a CPE should have a canonical CPE, an arbitrary CPE (with possible wildcards) may not canonically point to a single product.
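To illustrate the pattern-matching point, here is a deliberately simplified sketch. It compares two CPE 2.3 formatted strings field by field and treats a bare `*` in the pattern as "match anything". The function name `cpe_matches` is made up for this example, and the sketch ignores escaping, partial wildcards such as `8.*`, and the special meaning of `-`, so it is not a conforming CPE name matcher.

```rust
/// Returns true if every colon-separated field of `pattern` is either `*`
/// or equal to the corresponding field of `candidate`.
/// Simplification: no escaping, no partial wildcards, `-` is a plain literal.
fn cpe_matches(pattern: &str, candidate: &str) -> bool {
    let p: Vec<&str> = pattern.split(':').collect();
    let c: Vec<&str> = candidate.split(':').collect();
    p.len() == c.len() && p.iter().zip(&c).all(|(a, b)| *a == "*" || a == b)
}

fn main() {
    // Example product names; the concrete values are illustrative only.
    let products = [
        "cpe:2.3:o:redhat:enterprise_linux:8.2:*:*:*:*:*:*:*",
        "cpe:2.3:o:redhat:enterprise_linux:9.0:*:*:*:*:*:*:*",
        "cpe:2.3:a:apache:log4j:2.17.1:*:*:*:*:*:*:*",
    ];
    // A pattern with a wildcard version does not point to a single product.
    let pattern = "cpe:2.3:o:redhat:enterprise_linux:*:*:*:*:*:*:*:*";
    for product in products.iter().filter(|c| cpe_matches(pattern, c)) {
        println!("matches: {product}");
    }
}
```

The wildcard pattern selects both RHEL entries, which is the ambiguity described above: the pattern works as a query, while each matched string could serve as a canonical name.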

So, thinking, in human prose (not DB DDL)...

An SBOM references 0-or-more components
A component may be a reference to a package or a product (or a third thing we've not yet discovered).

That being said, I think you're 100% right in that our current package-related tables are indeed pURL-centric.

I also agree that those tables could/should be better named, such as base_purl versioned_purl and qualified_purl.

Likewise, we have a cpe table that may or may not be in play at the moment (ignorance on my part).

Jumping up from human prose to APIs, connecting through the DB DDL...

A human wants to find packages and products, and just so happens to need to use a pURL or CPE to communicate their desires.

If we stick to the idea that a "package" is "anything addressable using a pURL", then /api/v1/package/... still continues to make sense to me. Sometimes we want to speak about log4j. Sometimes we want to speak about a specific version of log4j. But a human is still talking about "the package generally known as Apache Log4j".

Likewise, a human may want to understand things about a product (RHEL, RHEL8.2, RHEL8.2 on Sparc), and may end up using a CPE to do so. Ergo, still /api/v1/product/... endpoints.

So if we separate the human-facing prose-centric desires from the DB DDL implementation details, I think my proposal is...

SBOM "packages" should be DDLd as "components" name-wise
Our "package" tables which are pURL-centric should be renamed to be pURL-centric.
Our "package" API makes sense, where human-provided keys are pURL-centric.
Our "product" API makes sense, where human-provided keys are CPE-centric.
Both of the above APIs could/should have our UUID-based escape hatch for simpler URLs and determinism.

Another analogy would be the use-case of "I'd like to call Jim on the telephone". You certainly dial his phone number (the key used to indicate your desires), but you're not talking to the phone number. You're talking to Jim on the other end of the connection.

"I'd like to call Jim" -> "I'd like information about log4j, version 1.2.3"
<dials +1-404-xxx-xxxx> -> /api/v1/package/{purlish}
Chats with Jim -> learns about log4j version 1.2.3

0 replies

bobmcwhirter · 2024-06-28T17:25:18Z

bobmcwhirter
Jun 28, 2024
Maintainer

wrt THIS_PACKAGE...

Perhaps this is ... another set of tables?

Just like a product can be associated with a cpe, maybe we need another package table that can reference to a qualified_purl table?

Or possibly 1+ pURLs/CPEs, depending?

Keep the distinction between $things and $one_of_possibly_many_names_for_a_thing.

0 replies

JimFuller-RedHat · 2024-06-28T18:30:33Z

JimFuller-RedHat
Jun 28, 2024
Collaborator

A few (parochial and maybe obvious) random comments of a DBA nature ... please don't let any of these comments put us off what we have right now, which I think is right and good ... offering more as inspiration:

In the beginning, the Oracle said there shall be a logical model and a conceptual model ... the conceptual model is for humans (or other machines), the logical model is for the system machine (and for the humans who manage this machine) ... we might consider the REST API the conceptual model ... naming things in the logical model is for developer team consumption ... naming things in the conceptual model is for consumer consumption ... a lot of times developers try to make the logical model 1:1 with the conceptual model (as well as the interchange format), which has productivity gains in the short term, then the passage of time reveals it's hard to keep data (and nomenclature) in sync everywhere.

SBOMs are an interchange format ... if we choose to represent SBOMs in the logical model that is fine ... but the interchange format will change over time and 'pouring concrete' on any specific notion of interchange might cause churn later on ... one could argue that SBOM packages are just 'packages' regardless of their membership in an SBOM ... it is common for a package to exist in many SBOMs, e.g. that relationship can be resolved by consulting the SBOM and the package is blissfully unaware. Of course the details matter for performance if one wants to enumerate all the SBOMs a particular package exists in. Maybe corgi got it wrong by not making SBOMs central to the logical model ... but rather we used builds, which were a surrogate SBOM (and also directly referred to advisories). A build coming from a build system we own has some predictability ... SBOMs are distributed and coming from everywhere.

Product is just another container = a set of packages ... though Product also has a hierarchy in terms of release version which has to be catered for (this translates to queries one might want to perform like 'does curl exist in Ansible ProductVersion, ProductStream, etc.'). It's unclear how that will be represented in trustify ... in corgi we had a product taxonomy (graph) and normal product entity tables (product, version, stream, channel).

PURL is just a unique id ... we really want a PURL to point to a single 'bag of bits' (in spacetime). It simplifies everything. Reality dictates that we probably need to have purl aliases or more purls, etc., and associated logical machinery to manage all the complexity, but a component really should have one identity (and a bunch of additional labels, with some of those labels being purls). In corgi we chose the purl to be used internally with the sbom and product containers because it meant that whatever changed in the system (uuids for example), the relationship would continue to make sense internally.

A package could be a set of components and/or just a single component - we did not have this convention in corgi ... as we derived via depends child relationship - maybe we should have ... and in fact I believe we had a ticket to do just that at some point.

For me, the challenge is that we need to know what questions are going to be asked of the conceptual model to some level of detail to get a stable logical model in place that has the performance characteristics to answer such questions, efficiently, in a reasonable amount of time - otherwise bolting on things after the logical model has calcified can be painful.

For inspiration - here are some questions prodsec wanted to answer with component registry (with OSIDB) ... notice that none of them asked about a specific SBOM:

Retrieve a specific component by purl
Retrieve a specific Component dependencies
Retrieve a specific Component root component
Retrieve/Search list of Components by name
Retrieve latest component used by product stream eg. a list of product streams listing latest components (could be narrowed down to specific component name/version etc) ... note this was the most difficult query to do efficiently and reason for a graph (though we never got around to implementing a real graph)
Search for Components by regular expression name (and version)
Retrieve/Search a Product Stream
List Product Streams
Retrieve a Product Stream manifest/sbom

then the following (with osidb, but I think trustify gets from vex):

Given a CVE ID, what products are affected ?
Given a CVE ID, what components are affected ?
What products + version + stream contain a given component (e.g. full text search) ?
Which unfixed CVEs are affecting a component ?
Which unfixed CVEs are affecting a product + version + stream ?
What are the fixed CVEs of a product + version + stream?
What are the fixed CVEs for a component?
What are the WONTFIX CVEs for a component?
What are the WONTFIX CVEs for a product?
How many CVEs are filed against a product + version?

2 replies

ctron Jul 1, 2024
Maintainer Author

I think except for the product stream (which might be solved with labels now) we are not far away from that. What is fun again is that you query for a "component" using a "package URL" 😁 … so I am not sure just using "component" instead of "package" is a big win.

So I think we "simply" need to pay more attention to our naming. And reconsider names along the changes we make. Which is what the PR and this discussion try to accomplish and which feels like it's working, because we talk about that stuff now.

ctron Jul 1, 2024
Maintainer Author

Added a tracking issue for "product streams": #481

bobmcwhirter · 2024-06-28T18:35:12Z

bobmcwhirter
Jun 28, 2024
Maintainer

Part of our ambiguity in pURLs, and in our table layout, is there to support "pointing to more than a single bag of bits".

An advisory says log4j@[2.0,3.0) is affected, let's say. So we point to the versionless purl table, joined to a version-range table. And then we want to know which concrete fully-qualified pURLs fall inside that assertion.

Ultimately, to answer the questions you posited above.
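A small sketch of that lookup, under loud assumptions: versions are plain dot-separated numbers compared numerically, the range is half-open as in `[2.0,3.0)`, and the "table" is an in-memory slice of version strings. The names `parse_version`, `in_range` and `affected_versions` are invented for this example. Real ecosystems each have their own version ordering rules, which this does not attempt to model.

```rust
/// Parse "2.17.1" into [2, 17, 1]. Anything non-numeric is rejected here.
fn parse_version(v: &str) -> Option<Vec<u64>> {
    v.split('.').map(|part| part.parse::<u64>().ok()).collect()
}

/// Half-open range check: low <= version < high.
/// Components are compared left to right; a shorter version that is a
/// prefix of a longer one sorts first (so "2" < "2.0" in this sketch).
fn in_range(version: &str, low: &str, high: &str) -> bool {
    match (parse_version(version), parse_version(low), parse_version(high)) {
        (Some(v), Some(lo), Some(hi)) => lo <= v && v < hi,
        _ => false,
    }
}

/// Given the concrete versions known for one versionless PURL, return those
/// that fall inside the advisory's range.
fn affected_versions<'a>(concrete: &[&'a str], low: &str, high: &str) -> Vec<&'a str> {
    concrete
        .iter()
        .copied()
        .filter(|v| in_range(v, low, high))
        .collect()
}

fn main() {
    // Concrete versions we have seen for the versionless PURL (example data).
    let known = ["1.2.17", "2.0", "2.17.1", "3.0", "3.1.0"];
    // The advisory asserts: [2.0, 3.0) is affected.
    let affected = affected_versions(&known, "2.0", "3.0");
    println!("{:?}", affected);
}
```

Here the PURL plus range acts as a query over concrete rows, instead of as an identifier for a single row.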

1 reply

JimFuller-RedHat Jun 28, 2024
Collaborator

ya, in that respect we are using purls as a query ... instead of an id ... doing both might create challenges ... my assertion for corgi was that a purl is an ID ... of course we could make a little query language out of purls ... but we did not build the system on that basis - not saying it's problematic ... but it does mean that purls must be perfect in construction. At scale that might be problematic, e.g. a badly constructed purl can still be a perfectly good and unique id, though if the purl-as-query is parsed as the basis of resolving relationships (instead of concrete relationships) it can be problematic - also, each ecosystem has its own rules, which begets complexity and exceptions.

I think it's entirely fine to use a purl query as a lookup to get to concrete packages but when we started corgi ... purls were not stable enough to be considered for lookup table duty ... maybe things have changed.

ctron · 2024-07-01T05:54:06Z

ctron
Jul 1, 2024
Maintainer Author

I wish there were a mindmap mode for discussions. Branching off individual topics :)

So I'll start slow and try to separate this:

The idea of THIS_PACKAGE was to have (without having a proper name) a counterpart to an "SBOM package" but on a global/universal level. An SBOM package is a resource/thing owned by an SBOM. A global package (this_package) is a package which exists outside of any SBOM. There is a reference from an SBOM package to that THIS_PACKAGE. Probably more than one SBOM package points to that global THIS_PACKAGE.

Then again, that might just be a virtual thing, as the same can be achieved by doing the lookups we do today.

0 replies

ctron · 2024-07-01T06:04:33Z

ctron
Jul 1, 2024
Maintainer Author

I think I mostly agree with your initial comment on this @bobmcwhirter … I am not sure if "component" is a better pick for a name, because it feels like a term that is equally ambiguous and overloaded. Aside from that, SPDX uses "package" and CDX uses "component", but they mostly mean the same thing.

What might make sense, is to come up with a mapping glossary: Trustify / SPDX, Trustify / CDX, Trustify / Klingon.

And as you said, the endpoints still deal with "packages", that's why I think it makes sense keeping that prefix. But part of this interaction is based on PURLs. So it might make sense to have that by-purl indicator. And we can extend this with by-cpe or by-digest to make clear when we do operations by other identifiers. I think this pattern works.

And it might be, that those endpoints go to the same service functions internally, just with different enums or ID types. But I think having a /api/v1/package/by-purl/{purl} is much clearer than having something like /api/v1/package/{anything}. Because it makes it clear on the API what to expect.
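A minimal sketch of that idea: separate `by-xxx` paths that all resolve to one internal key enum, which a single service function then handles. `PackageKey`, `parse_route` and `describe_lookup` are hypothetical names, not existing Trustify code, and the path handling is simplified (no URL decoding, no real HTTP framework).

```rust
/// Hypothetical key type: the path segment tells us how to interpret the value.
#[derive(Debug, PartialEq, Eq)]
enum PackageKey {
    Purl(String),
    Cpe(String),
    Digest(String),
}

/// Map `/api/v1/package/by-<kind>/<value>` to a typed key.
/// Everything after the `by-<kind>/` segment is taken verbatim as the value,
/// so a PURL containing `/` stays intact in this sketch.
fn parse_route(path: &str) -> Option<PackageKey> {
    let rest = path.strip_prefix("/api/v1/package/")?;
    let (kind, value) = rest.split_once('/')?;
    if value.is_empty() {
        return None;
    }
    match kind {
        "by-purl" => Some(PackageKey::Purl(value.to_string())),
        "by-cpe" => Some(PackageKey::Cpe(value.to_string())),
        "by-digest" => Some(PackageKey::Digest(value.to_string())),
        _ => None,
    }
}

/// Stand-in for the shared service function; it only describes the lookup.
fn describe_lookup(key: &PackageKey) -> String {
    match key {
        PackageKey::Purl(p) => format!("look up package by PURL {p}"),
        PackageKey::Cpe(c) => format!("look up package by CPE {c}"),
        PackageKey::Digest(d) => format!("look up package by digest {d}"),
    }
}

fn main() {
    let paths = [
        "/api/v1/package/by-purl/pkg:maven/org.apache.logging.log4j/log4j-core@2.17.1",
        "/api/v1/package/by-digest/sha256:0123abcd",
        "/api/v1/package/something-else",
    ];
    for path in paths {
        match parse_route(path) {
            Some(key) => println!("{}", describe_lookup(&key)),
            None => println!("no route for {path}"),
        }
    }
}
```

The caller states in the path which kind of identifier it is sending, so the server never has to guess whether a value is a PURL, a CPE or a digest.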

I am not sure how to name things internally. I think what would help a lot is to add code comments that explain the idea behind a function or structure. Renaming all structs and functions might be overkill.

Renaming the database tables is the right thing to do IMO.

5 replies

jcrossley3 Jul 1, 2024
Maintainer

> And it might be, that those endpoints go to the same service functions internally, just with different enums or ID types. But I think having a /api/v1/package/by-purl/{purl} is much clearer than having something like /api/v1/package/{anything}. Because it makes it clear on the API what to expect.

Though I agree that /api/v1/package/{anything} is too clever, I feel like our API design is being driven by our DDL rather than the UI. If the common way the UI requests a package is by purl, then we need /api/v1/package/{purl}. If the UI less commonly needs to request a single package by {cpe}, {uuid} or {digest}, then we can add those .../by-xxx/{xxx} path segments. But let's not do it just because we can. We control the UI, so we can remove the ambiguity in our responses (why return multiple identifiers for a single resource?) and simplify our API.

ctron Jul 2, 2024
Maintainer Author

I agree that we should have an optimized API for the UI. However, I don't think that the UI is the only client to the API.

Having a /api/v1/package/{purl} just for the UI's sake might still make it confusing for all others. On the other side, for the UI, it's "just carlos" that needs to code it once. So the user will never interact with that API directly. So I think that the path actually shouldn't matter.

carlosthe19916 Jul 2, 2024
Maintainer

IMHO, generally speaking, we can do better in 2 things:

Single way of fetching entities: It does not matter if the "key" of a package is a PURL, UUID, name, etc. What matters is that there must be a single consistent way of identifying and fetching that entity. If, besides the "main" key, we add other ways of fetching that entity, then that is a bonus. I mentioned "Package", but the same concept applies to every single entity.
- This issue is a clear/concrete example of not being able to stay consistent while defining the main key for fetching packages: "/api/v1/package/by-purl/{id} requires an UUID rather than a PURL" (#490)
DTO models: A "Package" should be a "Package" regardless of where it is. If I hit Endpoint1 and the response tells me EntityA has field1, field2, field3, then if for some reason I hit another endpoint, Endpoint2, and EntityA is also part of that response, then EntityA should have field1, field2, field3 there as well.

@jcrossley3 @ctron I don't mean to be negative but I am having a difficult time dealing with the current endpoints, sometimes it confuses me what we are actually doing. If there are other clients of the API I hope they are not having the same difficulties as me.

ctron Jul 2, 2024
Maintainer Author

Just quoting from above:

> I think it's entirely fine to use a purl query as a lookup to get to concrete packages but when we started corgi ... purls were not stable enough to be considered for lookup table duty ... maybe things have changed.

I still think that's the case. So I think having the ability to search by an actual purl is fine, but in most cases we return the UUID of a known PURL. And we know we can resolve this. So I think we should not make our life more complicated. And I don't see any benefit for the user in this case. The UI gets an ID and forwards it to another call. The user never interacts with that.

> there must be a single consistent way of identifying and fetching that entity

And that's exactly the problem. If there is a single way, it will be hard to stay consistent, because it makes a difference whether it's a base PURL, a qualified PURL, the UUID of an SBOM package, the name of an SBOM package, or the name of a PURL. Having dedicated endpoints for this makes it predictable what the operation is. Many of the issues opened lately are exactly around that confusion.

jcrossley3 Jul 2, 2024
Maintainer

> I agree that we should have an optimized API for the UI. However, I don't think that the UI is the only client to the API.

It's our only client now. We should incorporate feedback we get from it. Other clients will likely give the same feedback.

> Having a /api/v1/package/{purl} just for the UI's sake might still make it confusing for all others. On the other side, for the UI, it's "just carlos" that needs to code it once. So the user will never interact with that API directly. So I think that the path actually shouldn't matter.

Show me the client who is less confused by /api/v1/package/by-purl/{uuid} than /api/v1/package/{purl}.
