TilesPatcher

The TilesPatcher node is responsible for collecting and merging the results from the tiled frames processed by the neural network.

Key Responsibilities

Extracts bounding boxes by decoding the neural network output.
Scales and maps the output bounding boxes to global image coordinates based on tile_index.
Performs Non-Maximum Suppression (NMS) (read more in paper or this Medium post) to eliminate duplicate detections, then outputs the final set of bounding boxes.

_extract_bboxes(nn_output)

This method extracts bounding boxes and other necessary data from the neural network output. Bounding boxes are extracted as follows:

Confidence Filtering: Bounding boxes with confidence scores below self.conf_thresh are removed.
Box Format: The bounding boxes are converted from center-based format $(x, y, w, h)$ to corner-based format $(x_1, y_1, x_2, y_2)$ using the utility method xywh2xyxy(x).
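
The two steps above can be sketched roughly as follows. This is a minimal illustration, not the node's actual implementation: it assumes the decoded network output is an (N, 5) NumPy array with rows (cx, cy, w, h, conf), and the helper name extract_bboxes is hypothetical. Only xywh2xyxy and conf_thresh are names taken from the text.

```python
import numpy as np

def xywh2xyxy(x: np.ndarray) -> np.ndarray:
    """Convert center-based (cx, cy, w, h) boxes to corner-based (x1, y1, x2, y2)."""
    x = np.asarray(x, dtype=float)
    y = x.copy()
    y[:, 0] = x[:, 0] - x[:, 2] / 2  # x1 = cx - w/2
    y[:, 1] = x[:, 1] - x[:, 3] / 2  # y1 = cy - h/2
    y[:, 2] = x[:, 0] + x[:, 2] / 2  # x2 = cx + w/2
    y[:, 3] = x[:, 1] + x[:, 3] / 2  # y2 = cy + h/2
    return y

def extract_bboxes(preds: np.ndarray, conf_thresh: float) -> np.ndarray:
    """Hypothetical helper; assumes rows of (cx, cy, w, h, conf)."""
    preds = np.asarray(preds, dtype=float)
    # Confidence filtering: drop boxes scoring below the threshold.
    kept = preds[preds[:, 4] >= conf_thresh].copy()
    # Box format: center-based -> corner-based.
    kept[:, :4] = xywh2xyxy(kept[:, :4])
    return kept
```

For example, a box centered at (10, 10) with width 4 and height 6 becomes the corner pair (8, 7) and (12, 13).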

_map_bboxes_to_tile(bboxes, tile_index)

This method adjusts the bounding boxes from the tile's local coordinates to the global image coordinates.

Retrieve Tile Information: The method gets the tile’s position and dimensions using _get_tile_info.
Adjust Coordinates: Bounding boxes are scaled and translated based on the tile's top-left corner and the scale factor used to resize the tile for the neural network.

The adjusted bounding box coordinates $(x_1', y_1', x_2', y_2')$ are calculated as:

$$x_1' = \frac{x_1 - \text{x\_offset}}{\text{tile\_scale}} + \text{tile\_x}$$

$$y_1' = \frac{y_1 - \text{y\_offset}}{\text{tile\_scale}} + \text{tile\_y}$$

$$x_2' = \frac{x_2 - \text{x\_offset}}{\text{tile\_scale}} + \text{tile\_x}$$

$$y_2' = \frac{y_2 - \text{y\_offset}}{\text{tile\_scale}} + \text{tile\_y}$$

Where $x_1, y_1, x_2, y_2$ are the bounding box coordinates in the tile, and $x_1', y_1', x_2', y_2'$ are the adjusted global coordinates.
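
As a rough sketch, the four formulas can be applied to a whole array of boxes at once. The function name map_bboxes_to_global and its flat argument list are assumptions made for illustration; in the node itself these values come from _get_tile_info. The offsets are treated here simply as the x_offset and y_offset terms of the formulas.

```python
import numpy as np

def map_bboxes_to_global(bboxes, tile_x, tile_y, tile_scale,
                         x_offset=0.0, y_offset=0.0):
    """Hypothetical helper: map (x1, y1, x2, y2) boxes from tile to global coordinates."""
    boxes = np.asarray(bboxes, dtype=float).copy()
    # x' = (x - x_offset) / tile_scale + tile_x, applied to x1 and x2
    boxes[:, [0, 2]] = (boxes[:, [0, 2]] - x_offset) / tile_scale + tile_x
    # y' = (y - y_offset) / tile_scale + tile_y, applied to y1 and y2
    boxes[:, [1, 3]] = (boxes[:, [1, 3]] - y_offset) / tile_scale + tile_y
    return boxes

# A box at (100, 50, 200, 150) in a tile whose top-left corner is (640, 0),
# resized by a factor of 0.5 with a vertical offset of 20:
global_boxes = map_bboxes_to_global(
    [[100, 50, 200, 150]], tile_x=640, tile_y=0, tile_scale=0.5, y_offset=20
)
```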

Syncing

This node also synchronizes the tiles that belong to the same frame. Because frames are sent sequentially (not in parallel), we can rely on two facts when deciding how long to wait for tiles:

Tiles are sent to the network sequentially, and the network outputs them in the same order.
We know how many tiles to expect.

Hence, we send the bounding boxes only once either of the following happens:

all the expected tiles have been received
the timestamp of the newly received tile differs from the timestamp of the tiles in the buffer
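
The two conditions above could be implemented along these lines. This is only a sketch: the TileBuffer class, its add method, and the use of plain floats as timestamps are all assumptions for illustration, not the node's actual code.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional, Tuple

@dataclass
class TileBuffer:
    """Hypothetical buffer that groups tile results by frame timestamp."""
    expected_tiles: int
    timestamp: Optional[float] = None
    tiles: List[Any] = field(default_factory=list)

    def add(self, tile_result: Any, timestamp: float) -> List[Tuple[float, List[Any]]]:
        """Store one tile result; return the (timestamp, tiles) groups ready to send."""
        ready = []
        # A tile from a different frame arrived: flush the incomplete buffer.
        if self.tiles and timestamp != self.timestamp:
            ready.append((self.timestamp, self.tiles))
            self.tiles = []
        self.timestamp = timestamp
        self.tiles.append(tile_result)
        # All expected tiles for this frame have arrived: flush the buffer.
        if len(self.tiles) == self.expected_tiles:
            ready.append((self.timestamp, self.tiles))
            self.tiles = []
        return ready
```

Flushing on a timestamp change keeps a dropped tile from stalling the pipeline: the incomplete frame is still sent as soon as the next frame's first tile arrives.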

_send_output(timestamp, device_timestamp)

Once all tiles for a frame are processed, the method sends the final set of bounding boxes after performing Non-Maximum Suppression (NMS).

Merging Bounding Boxes: The bounding boxes from all tiles are merged into one list.
Non-Maximum Suppression: NMS is applied to remove overlapping boxes with high IoU (above $\text{iou\_thresh}$), retaining only the box with the highest confidence score.

For more information about NMS, read here.

The final bounding boxes are then sent as output.
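
The merge-and-suppress step described in this section can be sketched with a plain greedy NMS in NumPy. The function names and the (x1, y1, x2, y2, conf) row layout are assumptions made for illustration; the node may use a different NMS routine.

```python
import numpy as np

def nms(boxes: np.ndarray, scores: np.ndarray, iou_thresh: float) -> list:
    """Greedy NMS; returns indices of the kept boxes, highest score first."""
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]  # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Intersection of the best box with every remaining box.
        iw = np.maximum(0.0, np.minimum(x2[i], x2[rest]) - np.maximum(x1[i], x1[rest]))
        ih = np.maximum(0.0, np.minimum(y2[i], y2[rest]) - np.maximum(y1[i], y1[rest]))
        inter = iw * ih
        iou = inter / (areas[i] + areas[rest] - inter + 1e-9)
        # Discard boxes that overlap the best box by more than iou_thresh.
        order = rest[iou <= iou_thresh]
    return keep

def merge_and_suppress(per_tile_boxes, iou_thresh: float) -> np.ndarray:
    """per_tile_boxes: non-empty list of (N_i, 5) arrays in global coordinates."""
    merged = np.concatenate(per_tile_boxes, axis=0)  # merge all tiles into one list
    keep = nms(merged[:, :4], merged[:, 4], iou_thresh)
    return merged[keep]
```

With overlapping tiles, the same object is often detected once per tile; after mapping to global coordinates those detections overlap heavily, so only the most confident one is retained.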
