Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve: Parse POM from maven repo to extract and save license information in Java DB #21
base: main
Are you sure you want to change the base?
Improve: Parse POM from maven repo to extract and save license information in Java DB #21
Changes from 2 commits
c3c9e2c
e5e60fa
eb56cc3
ecab1d4
8ff0829
6ea37a5
239ddc8
4c23e78
302e619
153a133
4dbd769
9bc1dbb
28ef2b3
ff74564
9a557a4
e214d84
21885b1
3a4156e
afd7dbb
c5b40fd
c7fca19
6608af3
560980f
0608b3e
5ec049b
2467ee3
30bc298
d483d27
7be6cf4
2f4c07c
1640f0a
512dba5
c4108cd
7cc48c1
33f2473
b001b9f
029dcfe
176caa8
87c7b41
a0769b5
d7a9bdf
3f3344b
e991f76
674534e
8a924f3
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like we need to think about this, because some licenses (e.g.
The Apache License, Version 2.0
) can contain commas. In this case will be confusion if we will split by commas (e.g.The Apache License, Version 2.0,The 2-Clause BSD License
).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding the
,
separator, we can probably use|
Gave license classifier a try.
It expects a license file . Does not result in matches using just the license name in POM in most cases.
We essentially need to extract license URL from POM, fetch the contents of URL, dump it in a file and then use license classifier for the right results.
It will add an additional overhead. We might have to add an in memory cache as well in order to avoid redundant processing.
In cases where URL is missing in POM, we'll have to rely on the license name in POM.
Effectively a 3 step process:
In all cases where license classifier is being used, confidence threshold used could be 80%
Integrating this in the current design of java db might be a problem, since concurrent goroutines for license classification might not work. It instantiates a backend instance , expects a list of license files, spawns go routines to process them and accumulates results in the backend.
Still need to figure out the specifics.
=========
The other perspective could be not over complicate things and claim support for whats highlighted in pom.xml. Since the user controls the pom.xml , it's their responsibility to specify the license information accurately. We will show whatever resides in pom.xml (license.Name) with some precautionary checks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really problem. We can add new step before crawler: parse all
pom.xml
files (as withsha1
files) as you said (Effectively a 3 step process:
) and save to map/file (url -> license name). And use this map/file in crawler.But there may be problem with CI/CD (work time, memory, etc...).
I think it is good solution. At least we can start with this
Maven docs say(https://maven.apache.org/pom.html#Licenses):
Using an SPDX identifier as the license name is recommended.
It is not required, but we can refer on it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me give this a try.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The DB size now is just 50MB more with license information(800MB) vs 750MB without license information.
| separated list of unique licenses :
licenses.csv
The only problem seems to be time taken for building java db. https://github.com/aquasecurity/trivy-java-db/pull/21/files#diff-526a7142f7b15a72eab215dcfa540f5b6d502951a2f9f3b34dd900013e82c675R426 seems to be a major contributor other than additional POM parsing.
Don't think it should be a problem though since we refresh the database once a week.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is good way for us. And changes of place look good. 50mb is not much.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like we should create a license table and store foreign keys here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could have done that but it would require 2 tables. One to track licenses and the other to maintain a mapping of indices logical id to license logical id since it isn't a 1-1 mapping.
Querying the data would involve JOINS which would mean a change in all queries. Moreever the effort involved didn't seem to reap proportionate benefits since the db size with the current change is just 50MB more than normal and queries are simpler.
Intended to keep the changes to existing code as minimal as possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is RDBMS. I don't think this change would be complex.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. Since there is no use case of filtering by license strings, I thought normalization can be deferred and queries can be kept simpler where a direct access to the row in the indices table will have all required info.
It does affect the database size since information is replicated, but it didn't seem to have a major impact hence decided to stick to it.
But I agree, in the long run it might blow up and cause issues w.r.t database size. Don't foresee query performance issues since indexes will help efficiently access the artifact metadata. Will let you guys take a call.