Update: MongoDB I/O procedure #120

Merged
merged 2 commits into from
Aug 27, 2024
2 changes: 1 addition & 1 deletion docs/install/container/docker.rst
@@ -501,7 +501,7 @@ example::
.. _containerization: https://www.docker.com/resources/what-container
.. _CrateDB Docker image: https://hub.docker.com/_/crate/
.. _default bridge network: https://docs.docker.com/network/drivers/bridge/#use-the-default-bridge-network
-.. _Docker Stack YAML file: https://docs.docker.com/docker-cloud/apps/stack-yaml-reference/
+.. _Docker Stack YAML file: https://docs.oldtimes.me/docker/docker-cloud/apps/stack-yaml-reference/index.html
.. _Docker Swarm: https://docs.docker.com/engine/swarm/
.. _Docker volume: https://docs.docker.com/engine/tutorials/dockervolumes/
.. _Docker: https://www.docker.com/
2 changes: 1 addition & 1 deletion docs/integrate/etl/kafka-connect.rst
@@ -495,7 +495,7 @@ The remaining steps from above remain are applicable without changes.
.. _Kafka: https://www.confluent.io/what-is-apache-kafka/
.. _Kafka Connect JDBC connector: https://docs.confluent.io/kafka-connect-jdbc/current/sink-connector/
.. _Confluent Platform: https://docs.confluent.io/current/cli/index.html
-.. _Avro schema: https://avro.apache.org/docs/current/spec.html
+.. _Avro schema: https://avro.apache.org/docs/1.10.2/spec.html
.. _PostgreSQL Kafka Connect JDBC driver: https://docs.confluent.io/kafka-connect-jdbc/current/index.html#postgresql-database
.. _Sink Connector: https://docs.confluent.io/current/connect/kafka-connect-jdbc/sink-connector/index.html
.. _Source Connector: https://docs.confluent.io/current/connect/kafka-connect-jdbc/source-connector/index.html
216 changes: 216 additions & 0 deletions docs/integrate/etl/mongodb.md
@@ -0,0 +1,216 @@
(integrate-mongodb)=
(migrating-mongodb)=
(integrate-mongodb-quickstart)=
(import-mongodb)=

# Import data from MongoDB

In this quick tutorial, you'll use the [CrateDB Toolkit MongoDB I/O subsystem]
to import data from [MongoDB] into [CrateDB].

:::{note}
**Important:** The tutorial uses adapter software which is currently in beta testing.
If you discover any issues, please [report them] back to us.
:::

## Synopsis
Transfer data from a MongoDB database/collection into a CrateDB schema/table.
:::{code} shell
ctk load table \
"mongodb+srv://admin:a1..d1@cluster0.nttj7.mongodb.net/testdrive/demo" \
--cratedb-sqlalchemy-url='crate://admin:p..d@gray-wicket-systri-warrick.aks1.westeurope.azure.cratedb.net:4200/testdrive/demo?ssl=true'
:::

Query data in CrateDB.
:::{code} shell
export CRATEPW=password
crash --host=cratedb.example.org --username=user --command="SELECT * FROM testdrive.demo;"
:::

## Data Model

MongoDB stores data in collections and documents. CrateDB stores
data in schemas and tables.

- A **database** in MongoDB is a physical container for collections, similar
to a schema in CrateDB, which groups tables together within a database.
- A **collection** in MongoDB is a grouping of documents, similar to a table
in CrateDB, which is a structured collection of rows.
- A **document** in MongoDB is a record in a collection, similar to a row in
a CrateDB table. It is a set of key-value pairs, where each key represents
a field, and the value represents the data.
- A **field** in MongoDB is similar to a column in a CrateDB table. In both
systems, fields (or columns) define the attributes for the records
(or rows/documents).
- A **primary key** in MongoDB is typically the `_id` field, which uniquely
identifies a document within a collection. In CrateDB, a primary key
uniquely identifies a row in a table.
- An **index** in MongoDB is similar to an index in CrateDB. Both are used to
improve query performance by providing a fast lookup for fields (or columns)
within documents (or rows).

-- [Databases and Collections]

## Tutorial

The tutorial relies on Docker to provide services and to run jobs.
Alternatively, you can use Podman as a drop-in replacement.
The walkthrough uses a basic example setup including MongoDB v7.0.x, CrateDB,
and a few sample records that are transferred to CrateDB.

### Services

Prerequisites are running instances of CrateDB and MongoDB.

Start MongoDB.
:::{code} shell
docker run --rm -it --name=mongodb \
--publish=27017:27017 \
--volume="$PWD/var/lib/mongodb:/data/db" \
mongo:latest
:::

Start CrateDB.
:::{code} shell
docker run --rm -it --name=cratedb \
--publish=4200:4200 \
--volume="$PWD/var/lib/cratedb:/data" \
crate:latest -Cdiscovery.type=single-node
:::
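
Both containers need a moment before they accept connections. The following is a minimal sketch of a retry helper that can gate the later steps on service readiness. The `wait_for` function is a hypothetical helper written for this illustration; it is not part of Docker, MongoDB, or CrateDB Toolkit.

```shell
# Run a command repeatedly until it succeeds, or give up after N attempts.
# `wait_for` is a hypothetical helper, not part of Docker or CrateDB Toolkit.
wait_for() {
  _attempts=$1
  shift
  _try=1
  while [ "$_try" -le "$_attempts" ]; do
    # Succeed as soon as the probe command exits with status 0.
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    _try=$((_try + 1))
    sleep 1
  done
  return 1
}
```

For example, assuming `curl` is installed on the host, `wait_for 30 curl --silent --fail http://localhost:4200/` would probe the published CrateDB HTTP port, and `wait_for 30 docker exec mongodb mongosh --quiet --eval 'db.runCommand({ ping: 1 })'` would probe MongoDB inside its container.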

### Sample Data

This walkthrough uses demo data in JSON format, which is imported into MongoDB:

:::{code} shell
[
{
"_id": "66bb0bd8e17c5c509fbc8b2c",
"VendorID": 2,
"tpep_pickup_datetime": 1563051934000,
"tpep_dropoff_datetime": 1563053222000,
"passenger_count": 2,
"trip_distance": 3.29,
"RatecodeID": 1,
"store_and_fwd_flag": "N",
"PULocationID": 79,
"DOLocationID": 170,
"payment_type": 1,
"fare_amount": 15.5,
"extra": 0.5,
"mta_tax": 0.5,
"tip_amount": 3.86,
"tolls_amount": 0,
"improvement_surcharge": 0.3,
"total_amount": 23.16,
"congestion_surcharge": 2.5,
"airport_fee": ""
}, ...
]
:::
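
The `tpep_pickup_datetime` and `tpep_dropoff_datetime` values in the sample document look like Unix timestamps in milliseconds; that interpretation is an assumption, not something the source states. Under that assumption, a quick sanity check in the shell could look like the sketch below. The variable names are illustrative, and `date -d @SECONDS` is a GNU/BusyBox extension that is not available on every platform.

```shell
# Values copied from the sample document; assumed to be epoch milliseconds.
pickup_ms=1563051934000
dropoff_ms=1563053222000

# Trip duration in seconds, using plain integer arithmetic.
duration_s=$(( (dropoff_ms - pickup_ms) / 1000 ))
echo "trip duration: ${duration_s}s"

# Human-readable UTC timestamp; `date -d @SECONDS` is a GNU/BusyBox extension.
pickup_utc=$(date -u -d "@$(( pickup_ms / 1000 ))" '+%Y-%m-%dT%H:%M:%SZ')
echo "pickup: ${pickup_utc}"
```

A plausible date and a duration of a few minutes suggest that the millisecond reading of these fields is right.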

Import data to MongoDB:
:::{code} shell
mongoimport --db testdrive --collection demo --file demodata.json --jsonArray
:::

:::{note}
`mongoimport` is part of the [MongoDB Database Tools].
:::

Verify data is present:
:::{code} shell
docker exec -it mongodb mongosh
:::

At the `mongosh` prompt, switch to the database and list its documents:

:::{code} shell
use testdrive
db.demo.find().pretty()
:::

### Data Import

First, define these command aliases for convenience. Both containers are linked
to the running services, so they can reach them by container name.
:::{code} shell
alias crash="docker run --rm -it --link=cratedb ghcr.io/crate-workbench/cratedb-toolkit:latest crash"
alias ctk="docker run --rm -it --link=cratedb --link=mongodb ghcr.io/crate/cratedb-toolkit:latest ctk"
:::

Now, import data from the MongoDB database/collection into the CrateDB schema/table.
:::{code} shell
ctk load table \
"mongodb://mongodb:27017/testdrive/demo" \
--cratedb-sqlalchemy-url="crate://crate@cratedb:4200/testdrive/demo"
:::

Verify that relevant data has been transferred to CrateDB.
:::{code} shell
crash --host=cratedb --command="SELECT * FROM testdrive.demo;"
:::

## Cloud to Cloud

The procedure for importing data from [MongoDB Atlas] into [CrateDB Cloud] is
similar, with a few small adjustments.

First, define the command aliases again:
:::{code} shell
alias ctk="docker run --rm -it ghcr.io/crate/cratedb-toolkit:latest ctk"
alias crash="docker run --rm -it ghcr.io/crate-workbench/cratedb-toolkit:latest crash"
:::

You will need your credentials for both CrateDB and MongoDB.
The following values are examples:

**CrateDB Cloud**
* Host: `gray-wicket-systri-warrick.aks1.westeurope.azure.cratedb.net`
* Username: `admin`
* Password: `-9..nn`

**MongoDB Atlas**
* Host: `cluster0.nttj7.mongodb.net`
* Username: `admin`
* Password: `a1..d1`

For CrateDB, the credentials are displayed at the time of cluster creation.
For MongoDB, they can be found in the [cloud platform] itself.
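
Because both connection URLs embed the password, reserved characters such as `@`, `:`, or `/` generally need to be percent-encoded first. The following is a minimal sketch, assuming ASCII-only passwords. The `urlencode` helper, the variable names, and the placeholder host and password are illustrative; they are not part of CrateDB Toolkit.

```shell
# Percent-encode every character outside the RFC 3986 "unreserved" set.
# ASCII input is assumed; the function body runs in a subshell, so its
# variables do not leak into the calling shell.
urlencode() (
  s=$1
  out=
  while [ -n "$s" ]; do
    rest=${s#?}        # everything after the first character
    c=${s%"$rest"}     # the first character itself
    case $c in
      [A-Za-z0-9._~-]) out=$out$c ;;
      *) out=$out$(printf '%%%02X' "'$c") ;;
    esac
    s=$rest
  done
  printf '%s\n' "$out"
)

# Hypothetical placeholder values; substitute your real credentials.
MONGODB_HOST='cluster0.example.mongodb.net'
MONGODB_USER='admin'
MONGODB_PASSWORD='p@ss/word'

# Compose the source URL with the encoded password.
MONGODB_URL="mongodb+srv://${MONGODB_USER}:$(urlencode "$MONGODB_PASSWORD")@${MONGODB_HOST}/testdrive/demo"
echo "$MONGODB_URL"
```

The same helper can be applied to the CrateDB password before placing it into the `crate://` URL.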

Now, same as before, import data from MongoDB database/collection into
CrateDB schema/table.
:::{code} shell
ctk load table \
"mongodb+srv://admin:a1..d1@cluster0.nttj7.mongodb.net/testdrive/demo" \
--cratedb-sqlalchemy-url='crate://admin:-..n@gray-wicket-systri-warrick.aks1.westeurope.azure.cratedb.net:4200/testdrive/demo?ssl=true'
:::

:::{note}
Note the **necessary** `ssl=true` query parameter at the end of the CrateDB connection URL
when working on Cloud-to-Cloud transfers. The `mongodb+srv://` scheme already enables TLS
by default, so the MongoDB URL does not need it.
:::

Verify that relevant data has been transferred to CrateDB.
:::{code} shell
crash --hosts 'https://admin:-..n@gray-wicket-systri-warrick.aks1.westeurope.azure.cratedb.net:4200' --command 'SELECT * FROM testdrive.demo;'
:::

## More information

There are more ways to use the I/O subsystem of CrateDB Toolkit as
pipeline elements in your daily data operations routines. Please visit the
[CrateDB Toolkit MongoDB I/O subsystem] documentation to learn more about what's possible.

The MongoDB I/O subsystem is based on the [migr8] migration utility package. Please also
check its documentation to learn about its other capabilities for working with MongoDB.


[cloud platform]: https://cloud.mongodb.com
[CrateDB]: https://github.com/crate/crate
[CrateDB Cloud]: https://console.cratedb.cloud/
[CrateDB Toolkit MongoDB I/O subsystem]: https://cratedb-toolkit.readthedocs.io/io/mongodb/loader.html
[Databases and Collections]: https://www.mongodb.com/docs/manual/core/databases-and-collections/#databases-and-collections
[migr8]: https://cratedb-toolkit.readthedocs.io/io/mongodb/migr8.html
[MongoDB]: https://www.mongodb.com/docs/manual/tutorial/install-mongodb-community-with-docker/
[MongoDB Atlas]: https://www.mongodb.com/cloud/atlas
[MongoDB Database tools]: https://www.mongodb.com/docs/database-tools/installation/installation-linux/
[report them]: https://github.com/crate-workbench/cratedb-toolkit/issues
141 changes: 0 additions & 141 deletions docs/integrate/etl/mongodb.rst

This file was deleted.