Adding JPO-Deduplicator to jpo-utils #18

Merged: 9 commits, Dec 16, 2024
24 changes: 24 additions & 0 deletions .github/workflows/docker.yml
@@ -0,0 +1,24 @@
name: Docker build

on:
  pull_request:
    types: [opened, synchronize, reopened]

jobs:
  jpo-deduplicator:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v3
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v2
      - name: Build
        uses: docker/build-push-action@v3
        with:
          context: jpo-deduplicator
          build-args: |
            MAVEN_GITHUB_TOKEN_NAME=${{ vars.MAVEN_GITHUB_TOKEN_NAME }}
            MAVEN_GITHUB_TOKEN=${{ secrets.MAVEN_GITHUB_TOKEN }}
            MAVEN_GITHUB_ORG=${{ github.repository_owner }}
          secrets: |
            MAVEN_GITHUB_TOKEN: ${{ secrets.MAVEN_GITHUB_TOKEN }}
39 changes: 39 additions & 0 deletions .github/workflows/dockerhub.yml
@@ -0,0 +1,39 @@
name: "DockerHub Build and Push"

on:
  push:
    branches:
      - "develop"
      - "master"
      - "release/*"

jobs:
  dockerhub-jpo-deduplicator:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v3
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v2
      - name: Login to DockerHub
        uses: docker/login-action@v2
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}

      - name: Replace Docker tag
        id: set_tag
        run: echo "TAG=$(echo ${GITHUB_REF##*/} | sed 's/\//-/g')" >> $GITHUB_ENV

      - name: Build
        uses: docker/build-push-action@v3
        with:
          context: jpo-deduplicator
          push: true
          tags: usdotjpoode/jpo-deduplicator:${{ env.TAG }}
          build-args: |
            MAVEN_GITHUB_TOKEN_NAME=${{ vars.MAVEN_GITHUB_TOKEN_NAME }}
            MAVEN_GITHUB_TOKEN=${{ secrets.MAVEN_GITHUB_TOKEN }}
            MAVEN_GITHUB_ORG=${{ github.repository_owner }}
          secrets: |
            MAVEN_GITHUB_TOKEN: ${{ secrets.MAVEN_GITHUB_TOKEN }}
70 changes: 70 additions & 0 deletions README.md
@@ -21,6 +21,10 @@ The JPO ITS utilities repository serves as a central location for deploying open
  - [Configuration](#configuration)
  - [Configure Kafka Connector Creation](#configure-kafka-connector-creation)
  - [Quick Run](#quick-run-2)
- [5. Deduplicator](#5-jpo-deduplicator)
  - [Deduplication Configuration](#deduplication-config)
  - [GitHub Token Generation](#generate-a-github-token)
  - [Quick Run](#quick-run-3)


<a name="base-configuration"></a>
@@ -185,4 +189,70 @@ The following environment variables can be used to configure Kafka Connectors:
    3. Click `OdeBsmJson`, and now you should see your message!
8. Feel free to test this with other topics or by producing to these topics using the [ODE](https://github.com/usdot-jpo-ode/jpo-ode)


<a name="deduplicator"></a>

## 5. jpo-deduplicator
The JPO-Deduplicator is a Java Spring Boot Kafka application designed to reduce the number of messages stored and processed in the ODE system. It reads messages from an input topic (such as topic.ProcessedMap) and writes a subset of those messages to a related output topic (topic.DeduplicatedProcessedMap). Functionally, it removes duplicate messages from the input topic and passes on only unique messages. In addition, each topic passes on at least one message per hour even if that message is a duplicate, which helps ensure messages are still flowing through the system. The following topics currently support deduplication:

- topic.ProcessedMap -> topic.DeduplicatedProcessedMap
- topic.ProcessedMapWKT -> topic.DeduplicatedProcessedMapWKT
- topic.OdeMapJson -> topic.DeduplicatedOdeMapJson
- topic.OdeTimJson -> topic.DeduplicatedOdeTimJson
- topic.OdeRawEncodedTIMJson -> topic.DeduplicatedOdeRawEncodedTIMJson
- topic.OdeBsmJson -> topic.DeduplicatedOdeBsmJson
- topic.ProcessedSpat -> topic.DeduplicatedProcessedSpat
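
The forwarding rule described above (drop repeats, but let at least one message through per hour) can be sketched roughly as follows. This is a minimal illustration, not the deduplicator's actual implementation: the class name `DedupSketch`, the method `shouldForward`, and the choice to compare whole payload strings are all assumptions made for the example. The real application may compare only selected fields of each message.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch of the forwarding rule: a message is passed on if it
 * differs from the last message forwarded for the same key, or if at least
 * one hour has passed since a message was last forwarded for that key.
 */
class DedupSketch {
    // Longest time a key may go without a forwarded message (one hour, per the README).
    static final Duration MAX_SILENCE = Duration.ofHours(1);

    // Last payload forwarded for each key, and the time it was forwarded.
    private final Map<String, String> lastPayload = new HashMap<>();
    private final Map<String, Instant> lastForwarded = new HashMap<>();

    /** Returns true if the message should be written to the deduplicated topic. */
    boolean shouldForward(String key, String payload, Instant now) {
        String previous = lastPayload.get(key);
        Instant previousTime = lastForwarded.get(key);

        // Assumption: "duplicate" means an identical payload string.
        boolean changed = !payload.equals(previous);
        boolean stale = previousTime == null
                || Duration.between(previousTime, now).compareTo(MAX_SILENCE) >= 0;

        if (changed || stale) {
            lastPayload.put(key, payload);
            lastForwarded.put(key, now);
            return true;
        }
        return false;
    }
}
```

Calling `shouldForward` with the same key and payload twice within an hour returns `true` and then `false`. The same call made an hour or more after the last forwarded message returns `true` again.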

### Deduplication Config

When running the jpo-deduplicator as a submodule in jpo-utils, the deduplicator automatically turns on deduplication for a topic when that topic is created. For example, if the `KAFKA_TOPIC_CREATE_GEOJSONCONVERTER` environment variable is set to `true`, the deduplicator performs deduplication for ProcessedMap, ProcessedMapWKT, and ProcessedSpat data.

To manually configure deduplication for a topic, the following environment variables can also be used.

| Environment Variable | Description |
|---|---|
| `ENABLE_PROCESSED_MAP_DEDUPLICATION` | `true` / `false` - Enable ProcessedMap message Deduplication |
| `ENABLE_PROCESSED_MAP_WKT_DEDUPLICATION` | `true` / `false` - Enable ProcessedMap WKT message Deduplication |
| `ENABLE_ODE_MAP_DEDUPLICATION` | `true` / `false` - Enable ODE MAP message Deduplication |
| `ENABLE_ODE_TIM_DEDUPLICATION` | `true` / `false` - Enable ODE TIM message Deduplication |
| `ENABLE_ODE_RAW_ENCODED_TIM_DEDUPLICATION` | `true` / `false` - Enable ODE Raw Encoded TIM Deduplication |
| `ENABLE_PROCESSED_SPAT_DEDUPLICATION` | `true` / `false` - Enable ProcessedSpat Deduplication |
| `ENABLE_ODE_BSM_DEDUPLICATION` | `true` / `false` - Enable ODE BSM Deduplication |

### Generate a GitHub Token

A GitHub token is required to pull artifacts from GitHub repositories. This is required to obtain the jpo-deduplicator jars and must be done before attempting to build this repository.

1. Log into GitHub.
2. Navigate to Settings -> Developer settings -> Personal access tokens.
3. Click "New personal access token (classic)".
    1. As of now, GitHub does not support `Fine-grained tokens` for obtaining packages.
4. Provide a name and expiration for the token.
5. Select the `read:packages` scope.
6. Click "Generate token" and copy the token.
7. Copy the token name and token value into your `.env` file.

For local development, the following steps are also required:

8. Create a copy of [settings.xml](jpo-deduplicator/jpo-deduplicator/settings.xml) and save it to `~/.m2/settings.xml`.
9. Update the variables in your `~/.m2/settings.xml` with the token value and the target jpo-ode organization.

### Quick Run
1. Create a copy of `sample.env` and rename it to `.env`.
2. Set the `MAVEN_GITHUB_TOKEN` variable to a GitHub token used for downloading jar file dependencies. For full instructions on how to generate a token, see [Generate a GitHub Token](#generate-a-github-token).
3. Set the `MONGO_ADMIN_DB_PASS` and `MONGO_READ_WRITE_PASS` environment variables to secure passwords.
4. Set the `COMPOSE_PROFILES` variable to: `kafka,kafka_ui,kafka_setup,deduplicator`
5. Navigate back to the root directory and run the following command: `docker compose up -d`
6. Produce a sample message to one of the input topics using `kafka_ui`:
    1. Go to `localhost:8001`
    2. Click local -> Topics
    3. Select `topic.OdeMapJson`
    4. Select `Produce Message`
    5. Copy in sample JSON for a MAP message
    6. Click `Produce Message` multiple times
7. View the deduplicated message in `kafka_ui`:
    1. Go to `localhost:8001`
    2. Click local -> Topics
    3. Select `topic.DeduplicatedOdeMapJson`
    4. You should now see only one copy of the MAP message.

[Back to top](#toc)
44 changes: 44 additions & 0 deletions docker-compose-deduplicator.yml
@@ -0,0 +1,44 @@
services:
  deduplicator:
    profiles:
      - all
      - deduplicator
    build:
      context: jpo-deduplicator
      dockerfile: Dockerfile
      args:
        MAVEN_GITHUB_TOKEN: ${MAVEN_GITHUB_TOKEN:?error}
        MAVEN_GITHUB_ORG: ${MAVEN_GITHUB_ORG:?error}
    image: jpo-deduplicator:latest
    restart: ${RESTART_POLICY}
    environment:
      DOCKER_HOST_IP: ${DOCKER_HOST_IP}
      KAFKA_BOOTSTRAP_SERVERS: ${KAFKA_BOOTSTRAP_SERVERS:?error}
      spring.kafka.bootstrap-servers: ${KAFKA_BOOTSTRAP_SERVERS:?error}
      enableProcessedMapDeduplication: ${ENABLE_PROCESSED_MAP_DEDUPLICATION}
      enableProcessedMapWktDeduplication: ${ENABLE_PROCESSED_MAP_WKT_DEDUPLICATION}
      enableOdeMapDeduplication: ${ENABLE_ODE_MAP_DEDUPLICATION}
      enableOdeTimDeduplication: ${ENABLE_ODE_TIM_DEDUPLICATION}
      enableOdeRawEncodedTimDeduplication: ${ENABLE_ODE_RAW_ENCODED_TIM_DEDUPLICATION}
      enableProcessedSpatDeduplication: ${ENABLE_PROCESSED_SPAT_DEDUPLICATION}
      enableOdeBsmDeduplication: ${ENABLE_ODE_BSM_DEDUPLICATION}

    healthcheck:
      test: ["CMD", "java", "-version"]
      interval: 10s
      timeout: 10s
      retries: 20
    logging:
      options:
        max-size: "10m"
        max-file: "5"
    deploy:
      resources:
        limits:
          memory: 3G
    depends_on:
      kafka:
        condition: service_healthy
        required: false
3 changes: 2 additions & 1 deletion docker-compose.yml
@@ -1,4 +1,5 @@
 include:
   - docker-compose-connect.yml
   - docker-compose-mongo.yml
-  - docker-compose-kafka.yml
+  - docker-compose-kafka.yml
+  - docker-compose-deduplicator.yml
4 changes: 2 additions & 2 deletions jikkou/kafka-connectors-values.yaml
@@ -133,8 +133,8 @@ apps:
     collectionName: OdeTimJson
     generateTimestamp: true
     connectorName: DeduplicatedOdeTimJson
-  - topicName: topic.OdeRawEncodedTIMJson
-    collectionName: OdeTimJson
+  - topicName: topic.DeduplicatedOdeRawEncodedTIMJson
+    collectionName: OdeRawEncodedTIMJson
     generateTimestamp: true
     connectorName: DeduplicatedOdeRawEncodedTIMJson
   - topicName: topic.DeduplicatedOdeBsmJson
1 change: 1 addition & 0 deletions jpo-deduplicator/.dockerignore
@@ -0,0 +1 @@
jpo-deduplicator/target/
57 changes: 57 additions & 0 deletions jpo-deduplicator/Dockerfile
@@ -0,0 +1,57 @@
FROM maven:3.8-eclipse-temurin-21-alpine AS builder

WORKDIR /home

ARG MAVEN_GITHUB_TOKEN
ARG MAVEN_GITHUB_ORG

ENV MAVEN_GITHUB_TOKEN=$MAVEN_GITHUB_TOKEN
ENV MAVEN_GITHUB_ORG=$MAVEN_GITHUB_ORG

# COPY ./jpo-conflictmonitor/pom.xml ./jpo-conflictmonitor/
# COPY ./settings.xml ./jpo-conflictmonitor/

# # Copy and Build Conflict Monitor
# # Download dependencies alone to cache them first
# WORKDIR /home/jpo-conflictmonitor
# RUN mvn -s settings.xml dependency:resolve

# # Copy the source code and build the conflict monitor
# COPY ./jpo-conflictmonitor/src ./src
# RUN mvn -s settings.xml install -DskipTests -Ppackage-jar

# Copy and Build Deduplicator
WORKDIR /home
COPY ./jpo-deduplicator/pom.xml ./jpo-deduplicator/
COPY ./jpo-deduplicator/settings.xml ./jpo-deduplicator/

WORKDIR /home/jpo-deduplicator
RUN mvn -s settings.xml dependency:resolve

COPY ./jpo-deduplicator/src ./src
RUN mvn -s settings.xml install -DskipTests

FROM amazoncorretto:21

WORKDIR /home

COPY --from=builder /home/jpo-deduplicator/src/main/resources/application.yaml /home
COPY --from=builder /home/jpo-deduplicator/src/main/resources/logback.xml /home
COPY --from=builder /home/jpo-deduplicator/target/jpo-deduplicator.jar /home

#COPY cert.crt /home/cert.crt
#RUN keytool -import -trustcacerts -keystore /usr/local/openjdk-11/lib/security/cacerts -storepass changeit -noprompt -alias mycert -file cert.crt

# Shell form with exec: $DOCKER_HOST_IP is expanded when the container starts
# (the JSON exec form passes the literal string), and java still runs as PID 1.
ENTRYPOINT exec java \
    -Djava.rmi.server.hostname=$DOCKER_HOST_IP \
    -Dcom.sun.management.jmxremote.port=9090 \
    -Dcom.sun.management.jmxremote.rmi.port=9090 \
    -Dcom.sun.management.jmxremote \
    -Dcom.sun.management.jmxremote.local.only=true \
    -Dcom.sun.management.jmxremote.authenticate=false \
    -Dcom.sun.management.jmxremote.ssl=false \
    -Dlogback.configurationFile=/home/logback.xml \
    -jar /home/jpo-deduplicator.jar

# ENTRYPOINT ["tail", "-f", "/dev/null"]