! Work in progress !
See https://mertend.github.io/node-crawler/ for detailed documentation.
Node-Crawler is a highly customizable, node-based web application for building web crawlers and for processing and transforming the data they retrieve. The output can be written in a range of formats, including JSON, CSV, and various database formats.
The app is built with Next.js; the React Flow library provides the framework for the node-based editor.
- Node-based Editing: Users can create and edit their own crawler workflows by dragging and dropping nodes.
- Data Transformation: The application supports a variety of data manipulation and transformation operations for cleaning and restructuring the gathered data.
- Data Export: The transformed data can be output in a variety of formats including JSON, CSV, and various database formats.
Make sure you have Node.js and npm installed on your system before you start.
- Clone the repository:
git clone https://github.com/MertenD/node-crawler.git
- Navigate into the directory and install the dependencies:
cd node-crawler
npm install
- Start the development server:
npm run dev
You should now be able to open the web application at http://localhost:3000 in your browser.
Follow these steps when you need to create a new Node:
Add an entry for the new Node to the config/NodeType enum.
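As a rough sketch, a new enum entry might look like the following. The member name follows the NodeType.[NAME]_NODE pattern used by the template in this guide; the existing member and the string values shown here are assumptions, not the actual contents of config/NodeType.

```typescript
// Hypothetical sketch of the NodeType enum; members and values are assumed.
enum NodeType {
    START_NODE = "startNode",     // assumed existing entry
    EXAMPLE_NODE = "exampleNode", // new entry for a hypothetical "Example" node
}

// The new member can then be referenced wherever a node type is expected.
const newNodeType: NodeType = NodeType.EXAMPLE_NODE
```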
Then, create a new file in the components/editor/pages/canvas/nodes directory. In this file, define the following elements:
- Data Interface: Create a Data interface that stores all data the user can configure.
- Style Function: Create a Style function (using the createNodeShapeStyle() function), where you can customize the Node's appearance.
- Node Component: Create a Node component (using the createNodeComponent() function), which will be the Node on the canvas.
- Options Component: Create an Options component (using the createOptionsComponent() function), where the user can configure the Node's behavior.
You can use the following template to create a new Node:
// TODO: Replace [NAME] everywhere

// --- Data ---
export interface [NAME]NodeData extends NodeData {
    // TODO: Add data attributes here
}

// --- Style ---
export const [NAME]ShapeStyle = createNodeShapeStyle({
    // TODO: Add additional CSS for the node's shape here
})

// --- Node ---
export const [NAME]Node = createNodeComponent<[NAME]NodeData>(
    NodeType.[NAME]_NODE,
    [NAME]ShapeStyle,
    (id, selected, data) => {
        // TODO: Place the node content here
    }
)

// --- Options ---
export const [NAME]Options = createOptionsComponent<[NAME]NodeData>("[NAME]", ({ id, data, onDataUpdated }) => {
    return // TODO: Place options here
})
Add all metadata of the new Node to the config/NodesMetadata.tsx file.
Define the connection rules for the new Node in the config/ConnectionRules.ts file.
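To illustrate the idea of connection rules, here is a minimal sketch. It assumes the rules can be expressed as a mapping from a source node type to the target node types it may connect to; the real config/ConnectionRules.ts may be structured differently. The node types, the connectionRules object, and the isConnectionAllowed() helper are all hypothetical.

```typescript
// Hypothetical node types, for illustration only.
enum NodeType {
    START_NODE = "startNode",
    FETCH_NODE = "fetchNode",
    SAVE_NODE = "saveNode",
}

// Assumed shape: for each source node type, the target types it may connect to.
const connectionRules: Record<NodeType, NodeType[]> = {
    [NodeType.START_NODE]: [NodeType.FETCH_NODE],
    [NodeType.FETCH_NODE]: [NodeType.SAVE_NODE],
    [NodeType.SAVE_NODE]: [], // terminal node: no outgoing connections
}

// Returns true if an edge from `source` to `target` is permitted by the rules.
function isConnectionAllowed(source: NodeType, target: NodeType): boolean {
    return connectionRules[source].indexOf(target) !== -1
}
```

Under these assumed rules, a start node may connect to a fetch node, but a save node may not connect back to a start node.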
Create a new class in the engine/nodes directory. This class should implement the BasicNode interface. Below is a basic template for your reference:
// TODO: Replace [NAME]
export class Engine[NAME]Node implements BasicNode {
    id: string;
    nodeType: NodeType
    data: // TODO

    constructor(id: string, data: /* TODO */) {
        this.id = id
        this.nodeType = // TODO
        this.data = data
    }

    async run() {
        // Optional: Get inputs from previous nodes
        const input = usePlayStore.getState().getInput(this.id, "input")

        if (input) {
            // TODO: Put the logic of the node here

            // Optional: Add downloadable file
            usePlayStore.getState().addFile(/* TODO */)

            // Optional: Make outputs accessible for the next node
            usePlayStore.getState().addOutgoingPipelines(this.id, /* TODO */)

            // Optional: Write to the log
            usePlayStore.getState().writeToLog(/* TODO */)

            // End with calling the next node
            usePlayStore.getState().nextNode()
        }
    }
}
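The template above could be filled in roughly as follows for a hypothetical node that uppercases its input. To keep the sketch self-contained, NodeType, BasicNode, and usePlayStore are replaced by simplified stand-ins; the argument shapes of getInput(), addOutgoingPipelines(), and writeToLog() are assumptions and may not match the project's actual store.

```typescript
// --- Assumed stand-ins, not the project's real code ---
enum NodeType { UPPERCASE_NODE = "uppercaseNode" }

interface BasicNode {
    id: string
    nodeType: NodeType
    data: unknown
    run(): Promise<void>
}

// Minimal in-memory replacement for usePlayStore; all signatures are guesses.
const log: string[] = []
const outputs = new Map<string, string>()
const inputs = new Map<string, string>([["node-1:input", "hello"]])
let nextNodeCalls = 0

const usePlayStore = {
    getState: () => ({
        getInput: (nodeId: string, handle: string) => inputs.get(`${nodeId}:${handle}`),
        addOutgoingPipelines: (nodeId: string, value: string) => { outputs.set(nodeId, value) },
        writeToLog: (message: string) => { log.push(message) },
        nextNode: () => { nextNodeCalls += 1 },
    }),
}

// --- The node itself, following the template ---
interface UppercaseNodeData { prefix: string } // hypothetical user-configurable data

class EngineUppercaseNode implements BasicNode {
    id: string
    nodeType: NodeType
    data: UppercaseNodeData

    constructor(id: string, data: UppercaseNodeData) {
        this.id = id
        this.nodeType = NodeType.UPPERCASE_NODE
        this.data = data
    }

    async run() {
        const store = usePlayStore.getState()
        // Read the value produced by the previous node
        const input = store.getInput(this.id, "input")
        if (input) {
            // Node logic: prepend the configured prefix and uppercase the input
            const result = this.data.prefix + input.toUpperCase()
            store.addOutgoingPipelines(this.id, result)
            store.writeToLog(`Uppercased ${input.length} characters`)
            // Hand control to the next node
            store.nextNode()
        }
    }
}
```

With the stand-in store above, running `new EngineUppercaseNode("node-1", { prefix: "> " }).run()` records "> HELLO" as the node's output, writes one log entry, and calls nextNode() once.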
The final step is to add the transformation logic for the node. This transformation converts a React Flow Node into an instance of your newly created class from Step 5. To do this, open the util/NodeMapTransformer.ts file and add a new case to the getNodeFromType() method that creates the instance.
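A new case in getNodeFromType() might look something like the sketch below. The FlowNode shape, the EngineExampleNode class, and the null fallback for unknown types are assumptions made to keep the example self-contained; the actual method in util/NodeMapTransformer.ts may take different parameters or handle unknown types differently.

```typescript
// Assumed stand-ins; the real types come from the project and React Flow.
enum NodeType { START_NODE = "startNode", EXAMPLE_NODE = "exampleNode" }

interface FlowNode {
    id: string
    type?: string
    data: Record<string, unknown>
}

interface BasicNode {
    id: string
    nodeType: NodeType
    data: unknown
    run(): Promise<void>
}

// Hypothetical engine class for the new node.
class EngineExampleNode implements BasicNode {
    id: string
    nodeType: NodeType = NodeType.EXAMPLE_NODE
    data: Record<string, unknown>

    constructor(id: string, data: Record<string, unknown>) {
        this.id = id
        this.data = data
    }

    async run() { /* node logic goes here */ }
}

// Maps an editor node to its engine counterpart based on the node's type.
function getNodeFromType(node: FlowNode): BasicNode | null {
    switch (node.type) {
        // ... existing cases for other node types ...
        case NodeType.EXAMPLE_NODE:
            return new EngineExampleNode(node.id, node.data)
        default:
            return null // assumed fallback for unknown types
    }
}
```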