Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Process information from Artifact POM files #47

Closed
wants to merge 19 commits into from

Conversation

johannesduesing
Copy link

Reason for this PR
According to #15, the Delphi crawler does not process any artifact information stored in the respective POM file yet. This means that potentially interesting data fields (including project name, description, etc..) are not accessible when querying Delphi. In addition to that, the publication date of an artifact is not processed either (see #37).

Changes in this PR

  • Extended the MavenArtifact class with optional attributes publicationDate and metadata of type ArtifactMetadata
  • Introduced new type ArtifactMetadata that is supposed to hold information parsed from POM files, currently name, description and system name & URL of the issueManagement
  • Publication date of artifacts is extracted from HTTP header in MavenDownloadActor and set accordingly
  • Introduced PomFileReadActor. Reads POM file for a given MavenArtifact and sets the ArtifactMetadata accordingly. Currently triggered in the MavenDiscoveryProcess as part of preprocessing. Uses Apache Xpp3Reader for POM file processing.

Open for discussion

  • What other attributes shall be parsed from the POM file?
  • Is it sensible to have POM file processing as part of the 'preprocessing', or does it belong in the 'processing' phase?
  • Currently, when POM processing fails, the artifact will be removed from the list of artifacts to process, ie will not be passed to Hermes. What is the desired behavior for when POM processing fails?

@bhermann , what's your opinion on these questions?

@johannesduesing johannesduesing added this to the 0.9.6 milestone Jul 13, 2020
@johannesduesing johannesduesing self-assigned this Jul 13, 2020
@bhermann bhermann marked this pull request as draft July 14, 2020 17:17
@bhermann
Copy link
Member

A partial answer:

  • I would rather see them in the processing package than in the preprocessing package.
  • When POM processing fails it should not affect processing of the Java package.

@johannesduesing
Copy link
Author

I fixed the two points you addressed in the latest commit. Now there is the issue of storing the data. Currently the ElasticStoreQueries trait supports storing a MavenIdentifier and a HermesResult, however the publication date and metadata is only available in the MavenArtifact class.

My plan would be to write an additional method that stores a MavenArtifact by extracting its MavenIdentifier and writing the publication date and metadata, if available, to the database (similar to what is being done for HermesResult). I would then attach this as a sink to the "Processing" stage using the .alsoTo operator, similar to the current implementation for storing MavenIdentifiers.

Do you agree with that plan? And if so, do you want me to implement the whole thing or make it a skeleton implementation until we dicussed the elastic data model changes in depth?

Johannes Düsing added 11 commits August 2, 2020 15:29
…g, but no parent processing yet. Also no storage yet
…the whole parent hierarchy. Parents are only downloaded once, however, currently for every POM, not on-demand.
…least one version / attribute failed to resolve locally. However, if any parent is required the whole hierarchy will be downloaded! Fixed a bug in test shutdown.
@johannesduesing
Copy link
Author

Here's the latest update to this PR:

  • POM file processing now extracts the parent (optional) and packaging
  • POM file processing now extracts dependencies. If variables are used (e.g. ${foo.version}) they are attempted to be resolved. Resolving variables starts in the local POM, but downloads and processes parent-POMs if required and available. Same goes for dependencies without a version, the implementation will recurse through all parents to find the matching version definition. Also the scope of dependencies is being extracted.

I tested the application on my machine using a fresh elasticsearch instance (version 5.6.9), and POM file processing seems to work fine. For me, the only thing left to discuss is a suitable data model for storing the data. Using the current implementation, a search query to ElasticSearch yields the following result:

[...]
"identifier" : {
            "groupId" : "xom",
            "artifactId" : "xom",
            "version" : "1.2.5"
          },
          "discovered" : "2020-09-21T15:10:34.824+02:00",
          "published" : "2010-05-12T06:22:10.000Z",
          "pom" : {
            "parent" : "None",
            "licenses" : [
              {
                "name" : "The GNU Lesser General Public License, Version 2.1",
                "url" : "http://www.gnu.org/licenses/lgpl-2.1.html"
              }
            ],
            "issueManagement" : "None",
            "developers" : "elharo",
            "name" : "XOM",
            "description" : "The XOM Dual Streaming/Tree API for Processing XML",
            "packaging" : "jar",
            "dependencies" : [
              {
                "groupId" : "xml-apis",
                "scope" : "default",
                "artifactId" : "xml-apis",
                "version" : "1.3.03"
              },
              {
                "groupId" : "xerces",
                "scope" : "default",
                "artifactId" : "xercesImpl",
                "version" : "2.8.0"
              },
              {
                "groupId" : "xalan",
                "scope" : "default",
                "artifactId" : "xalan",
                "version" : "2.7.0"
              }
            ]
          }
        }

I am unsure whether or not this is the correct way to deal with lists (for dependencies and licenses) in ElasticSearch. @bhermann what is your opinion on that ?

@johannesduesing johannesduesing marked this pull request as ready for review October 8, 2020 09:38
@johannesduesing johannesduesing changed the title WIP: Process information from Artifact POM files Process information from Artifact POM files Oct 8, 2020
@sonarcloud
Copy link

sonarcloud bot commented Oct 8, 2020

Kudos, SonarCloud Quality Gate passed!

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities (and Security Hotspot 0 Security Hotspots to review)
Code Smell A 0 Code Smells

No Coverage information No Coverage information
0.0% 0.0% Duplication

@johannesduesing
Copy link
Author

Closed as this functionality is now part of the redesign proposed in #50

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants