Process information from Artifact POM files #47

johannesduesing · 2020-07-13T10:13:50Z

Reason for this PR
According to #15, the Delphi crawler does not yet process any of the artifact information stored in the respective POM file. This means that potentially interesting data fields (including project name, description, etc.) are not accessible when querying Delphi. In addition, the publication date of an artifact is not processed either (see #37).

Changes in this PR

Extended the MavenArtifact class with optional attributes publicationDate and metadata of type ArtifactMetadata
Introduced a new type ArtifactMetadata that holds information parsed from POM files: currently the name, the description, and the system name and URL of the issueManagement section
The publication date of an artifact is extracted from the HTTP header in MavenDownloadActor and set accordingly
Introduced PomFileReadActor, which reads the POM file for a given MavenArtifact and sets the ArtifactMetadata accordingly. It is currently triggered in the MavenDiscoveryProcess as part of preprocessing and uses Apache Xpp3Reader for POM file processing.
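
For illustration, the extended data model and the date extraction could look roughly like the sketch below. The field names follow the description above, but the exact shapes, the wrapper object ModelSketch, the helper publicationDateFromHeader and the use of java.time.Instant are assumptions for this example, not the code in this PR. It also assumes that the header value is an RFC 1123 date, as used by headers such as Last-Modified.

```scala
import java.time.{Instant, ZonedDateTime}
import java.time.format.DateTimeFormatter
import scala.util.Try

object ModelSketch {
  // Hypothetical shapes; only the attribute names are taken from the PR description.
  final case class MavenIdentifier(groupId: String, artifactId: String, version: String)

  final case class IssueManagementData(system: String, url: String)

  final case class ArtifactMetadata(
    name: String,
    description: String,
    issueManagement: Option[IssueManagementData]
  )

  // Both new attributes are optional, so existing call sites keep working.
  final case class MavenArtifact(
    identifier: MavenIdentifier,
    publicationDate: Option[Instant] = None,
    metadata: Option[ArtifactMetadata] = None
  )

  // Assumption: the header carries an RFC 1123 date such as
  // "Wed, 12 May 2010 06:22:10 GMT". Unparsable values yield None.
  def publicationDateFromHeader(value: String): Option[Instant] =
    Try(ZonedDateTime.parse(value, DateTimeFormatter.RFC_1123_DATE_TIME).toInstant).toOption
}
```

Keeping both attributes as Option means an artifact without a parsable header or POM is still a valid MavenArtifact.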

Open for discussion

What other attributes shall be parsed from the POM file?
Is it sensible to have POM file processing as part of the 'preprocessing', or does it belong in the 'processing' phase?
Currently, when POM processing fails, the artifact is removed from the list of artifacts to process, i.e. it will not be passed to Hermes. What is the desired behavior when POM processing fails?

@bhermann , what's your opinion on these questions?

bhermann · 2020-07-14T17:20:40Z

A partial answer:

I would rather see them in the processing package than in the preprocessing package.
When POM processing fails it should not affect processing of the Java package.
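
A minimal sketch of that second point, assuming simplified stand-in types and a hypothetical readPom function (none of these names are taken from the crawler): the artifact is returned in every case, and a failed POM read only leaves the metadata empty while the error is reported separately.

```scala
import scala.util.{Failure, Success, Try}

object FailureSketch {
  // Hypothetical, simplified stand-ins for the crawler's types.
  final case class ArtifactMetadata(name: String, description: String)
  final case class MavenArtifact(id: String, metadata: Option[ArtifactMetadata] = None)

  // Returns the artifact in every case. On success the metadata is attached;
  // on failure the artifact is passed on unchanged together with the error,
  // so downstream processing of the Java package is not affected.
  def enrich(
    artifact: MavenArtifact,
    readPom: MavenArtifact => Try[ArtifactMetadata]
  ): (MavenArtifact, Option[Throwable]) =
    readPom(artifact) match {
      case Success(meta)  => (artifact.copy(metadata = Some(meta)), None)
      case Failure(error) => (artifact, Some(error))
    }
}
```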

johannesduesing · 2020-07-15T10:06:26Z

I fixed the two points you addressed in the latest commit. Now there is the issue of storing the data. Currently the ElasticStoreQueries trait supports storing a MavenIdentifier and a HermesResult; however, the publication date and metadata are only available in the MavenArtifact class.

My plan would be to write an additional method that stores a MavenArtifact by extracting its MavenIdentifier and writing the publication date and metadata, if available, to the database (similar to what is being done for HermesResult). I would then attach this as a sink to the "Processing" stage using the .alsoTo operator, similar to the current implementation for storing MavenIdentifiers.
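
To make the idea concrete, here is a rough sketch of the extraction part only, without any Elasticsearch or stream wiring. The types are simplified stand-ins, and documentFields as well as the key names "identifier", "published" and "pom" are assumptions for this example, not the actual data model. Optional values are only written when they are present.

```scala
import java.time.Instant

object StorageSketch {
  // Hypothetical, simplified stand-ins for the crawler's types.
  final case class MavenIdentifier(groupId: String, artifactId: String, version: String)
  final case class ArtifactMetadata(name: String, description: String)
  final case class MavenArtifact(
    identifier: MavenIdentifier,
    publicationDate: Option[Instant],
    metadata: Option[ArtifactMetadata]
  )

  // Builds the fields to write for one artifact. The identifier is always
  // present; publication date and POM metadata are added only if available.
  def documentFields(artifact: MavenArtifact): Map[String, Any] = {
    val id = artifact.identifier
    val base: Map[String, Any] = Map(
      "identifier" -> Map(
        "groupId"    -> id.groupId,
        "artifactId" -> id.artifactId,
        "version"    -> id.version
      )
    )
    val published = artifact.publicationDate.map(d => "published" -> d.toString)
    val pom = artifact.metadata.map(m =>
      "pom" -> Map("name" -> m.name, "description" -> m.description))
    base ++ published.toList ++ pom.toList
  }
}
```

A sink attached via .alsoTo would then only need to call such a function for each artifact and hand the result to the store.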

Do you agree with that plan? And if so, do you want me to implement the whole thing, or make it a skeleton implementation until we have discussed the Elastic data model changes in depth?

johannesduesing · 2020-09-21T14:04:19Z

Here's the latest update to this PR:

POM file processing now extracts the parent (optional) and packaging
POM file processing now extracts dependencies. If variables are used (e.g. ${foo.version}), the implementation attempts to resolve them. Resolution starts in the local POM, but parent POMs are downloaded and processed if required and available. The same goes for dependencies without a version: the implementation recurses through all parents to find the matching version definition. The scope of each dependency is extracted as well.
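
The resolution strategy can be illustrated with the simplified sketch below. It is not the code from this PR: the Pom class, its fields and the loadParent callback are hypothetical, nested variables inside property values are not expanded, and a managed version is resolved in the context of the POM that defines it. Unlike the behavior described in the commits, this sketch loads parents one level at a time instead of fetching the whole hierarchy at once.

```scala
import scala.util.matching.Regex

object ResolutionSketch {
  // Hypothetical, minimal view of a POM: its properties, the versions it
  // manages for (groupId, artifactId) pairs, and a callback that loads the parent.
  final class Pom(
    val properties: Map[String, String],
    val managedVersions: Map[(String, String), String],
    loadParent: () => Option[Pom]
  ) {
    // Evaluated at most once, and only when a local lookup fails.
    lazy val parent: Option[Pom] = loadParent()
  }

  // Matches ${some.name} and captures the name.
  private val Variable: Regex = """\$\{([^}]+)\}""".r

  // Looks a property up locally first, then walks up the parent chain.
  def property(pom: Pom, key: String): Option[String] =
    pom.properties.get(key).orElse(pom.parent.flatMap(property(_, key)))

  // Replaces every variable that can be resolved; unresolved ones stay as-is.
  def resolve(pom: Pom, raw: String): String =
    Variable.replaceAllIn(raw, m =>
      Regex.quoteReplacement(property(pom, m.group(1)).getOrElse(m.matched)))

  // Finds a version for a dependency that was declared without one.
  def managedVersion(pom: Pom, groupId: String, artifactId: String): Option[String] =
    pom.managedVersions.get((groupId, artifactId))
      .map(resolve(pom, _))
      .orElse(pom.parent.flatMap(managedVersion(_, groupId, artifactId)))
}
```

Because parent is a lazy val, a POM whose variables all resolve locally never triggers a parent download.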

I tested the application on my machine using a fresh elasticsearch instance (version 5.6.9), and POM file processing seems to work fine. For me, the only thing left to discuss is a suitable data model for storing the data. Using the current implementation, a search query to ElasticSearch yields the following result:

[...]
  "identifier" : {
    "groupId" : "xom",
    "artifactId" : "xom",
    "version" : "1.2.5"
  },
  "discovered" : "2020-09-21T15:10:34.824+02:00",
  "published" : "2010-05-12T06:22:10.000Z",
  "pom" : {
    "parent" : "None",
    "licenses" : [
      {
        "name" : "The GNU Lesser General Public License, Version 2.1",
        "url" : "http://www.gnu.org/licenses/lgpl-2.1.html"
      }
    ],
    "issueManagement" : "None",
    "developers" : "elharo",
    "name" : "XOM",
    "description" : "The XOM Dual Streaming/Tree API for Processing XML",
    "packaging" : "jar",
    "dependencies" : [
      {
        "groupId" : "xml-apis",
        "scope" : "default",
        "artifactId" : "xml-apis",
        "version" : "1.3.03"
      },
      {
        "groupId" : "xerces",
        "scope" : "default",
        "artifactId" : "xercesImpl",
        "version" : "2.8.0"
      },
      {
        "groupId" : "xalan",
        "scope" : "default",
        "artifactId" : "xalan",
        "version" : "2.7.0"
      }
    ]
  }
}

I am unsure whether this is the correct way to deal with lists (for dependencies and licenses) in ElasticSearch. @bhermann, what is your opinion on that?

sonarcloud · 2020-10-08T14:20:24Z

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities (and 0 Security Hotspots to review)
0 Code Smells

No Coverage information
0.0% Duplication

johannesduesing · 2021-10-21T10:50:20Z

Closed as this functionality is now part of the redesign proposed in #50

Johannes Düsing added 6 commits July 9, 2020 15:04

First shot at POM reading, extracts PublicationDate, Description and Name.

3ecf6bd

Made style of pom file processing more inline with rest of application.

e96b08e

Add test for pom file reader

b3a2dfa

Proper error handling in POM file reading actor

cc9ffda

Added processing of issue management system to POM reader.

ece0b24

Revert unnecessary changes (whitespaces)

d254798

johannesduesing added enhancement question actor labels Jul 13, 2020

johannesduesing added this to the 0.9.6 milestone Jul 13, 2020

johannesduesing self-assigned this Jul 13, 2020

bhermann marked this pull request as draft July 14, 2020 17:17

Moved PomFileReadActor to processing package, changed behavior on failure

f063cac

Johannes Düsing added 11 commits August 2, 2020 15:29

Added storage trait for POM file properties. Now also extracting licenses and developers

2740992

Add dependency extraction for POM files. Some basic variable resolving, but no parent processing yet. Also no storage yet

7c84ead

Recursively resolve POM variables in parents if possible. On-Demand and NOT optimized, but working

82748c9

Fix code smell

7440e8a

Remove code duplication in test

24aed36

PomReadActor now also resolves dependencies where version is not specified in file itself, but in parent

577f543

Optimized dependency resolving. Versions are now resolved throughout the whole parent hierarchy. Parents are only downloaded once, however, currently for every POM, not on-demand.

5e535d2

Optimization: Parent hierarchy is now lazy, ie only downloaded if at least one version / attribute failed to resolve locally. However, if any parent is required the whole hierarchy will be downloaded! Fixed a bug in test shutdown.

8853ed7

Now extracting scopes for dependencies from POM files

0992b64

Now extracting parent and packaging. Fixed some storage issues

221ff7d

Fixed a bug in actor communication. Code style improvements

1b0e71f

johannesduesing marked this pull request as ready for review October 8, 2020 09:38

johannesduesing changed the title ~~WIP: Process information from Artifact POM files~~ Process information from Artifact POM files Oct 8, 2020

johannesduesing requested a review from bhermann October 8, 2020 09:38

Adapt tests to latest actor api change

9ae5e3c

johannesduesing removed the question label Oct 8, 2020

johannesduesing mentioned this pull request Oct 20, 2020

Processing Errors are now Stored in the ElasticSearch Index #48

Closed

This was referenced Oct 21, 2021

Merge redesign into develop johannesduesing/delphi-crawler#2

Merged

Merge redesign into develop #50

Merged

johannesduesing closed this Oct 21, 2021
