
TilesPatcher

Han Bui edited this page Oct 2, 2024 · 1 revision

The TilesPatcher node is responsible for collecting and merging the results from the tiled frames processed by the neural network.

Key Responsibilities

  • Extracts bounding boxes by decoding the neural network output.
  • Scales and maps the output bounding boxes to global image coordinates based on tile_index.
  • Performs Non-Maximum Suppression (NMS) (read more in the paper or this Medium post) to eliminate duplicate detections, then outputs the final set of bounding boxes.

_extract_bboxes(nn_output)

This method extracts bounding boxes and other necessary data from the neural network output. Bounding boxes are extracted as follows:

  • Confidence Filtering: Bounding boxes with confidence scores below self.conf_thresh are removed.
  • Box Format: The bounding boxes are converted from center-based format $(x, y, w, h)$ to corner-based format $(x_1, y_1, x_2, y_2)$ using the utility method xywh2xyxy(x).
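
The two steps above can be sketched in plain Python. This is only an illustration, not the node's actual code: the row layout `(x, y, w, h, conf)`, the function name `extract_bboxes`, and the per-box signature of `xywh2xyxy` are assumptions made for the example, and the real network output may be laid out differently.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]


def xywh2xyxy(box: Sequence[float]) -> Box:
    """Convert one (x_center, y_center, w, h) box to (x1, y1, x2, y2)."""
    x, y, w, h = box
    return (x - w / 2, y - h / 2, x + w / 2, y + h / 2)


def extract_bboxes(
    rows: Sequence[Sequence[float]], conf_thresh: float
) -> List[Tuple[Box, float]]:
    """Drop low-confidence rows and convert the rest to corner format.

    Each row is assumed to be (x, y, w, h, conf); this layout is hypothetical.
    """
    kept = []
    for x, y, w, h, conf in rows:
        if conf < conf_thresh:  # confidence filtering
            continue
        kept.append((xywh2xyxy((x, y, w, h)), conf))  # box format conversion
    return kept


# Two made-up detections: only the first one passes the 0.5 threshold.
rows = [(50.0, 40.0, 20.0, 10.0, 0.9), (10.0, 10.0, 4.0, 4.0, 0.2)]
detections = extract_bboxes(rows, conf_thresh=0.5)
```

Here the surviving box centered at (50, 40) with size 20 × 10 becomes the corner-format box (40, 35, 60, 45).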

_map_bboxes_to_tile(bboxes, tile_index)

This method adjusts the bounding boxes from the tile's local coordinates to the global image coordinates.

  1. Retrieve Tile Information: The method gets the tile’s position and dimensions using _get_tile_info.
  2. Adjust Coordinates: Bounding boxes are scaled and translated based on the tile's top-left corner and the scale factor used to resize the tile for the neural network.

The adjusted bounding box coordinates $(x_1, y_1, x_2, y_2)$ are calculated as:

$$x_1' = \frac{x_1 - \text{x\_offset}}{\text{tile\_scale}} + \text{tile\_x}$$

$$y_1' = \frac{y_1 - \text{y\_offset}}{\text{tile\_scale}} + \text{tile\_y}$$

$$x_2' = \frac{x_2 - \text{x\_offset}}{\text{tile\_scale}} + \text{tile\_x}$$

$$y_2' = \frac{y_2 - \text{y\_offset}}{\text{tile\_scale}} + \text{tile\_y}$$

Where $x_1, y_1, x_2, y_2$ are the bounding box coordinates in the tile, and $x_1', y_1', x_2', y_2'$ are the adjusted global coordinates.
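
As a minimal sketch, the mapping can be written as a small function that applies the same formula to both corners. The function name `map_bbox_to_image` and the numeric values are made up for illustration. The example treats `x_offset` and `y_offset` as padding added when the tile was resized for the network, which is an assumption about how the node defines them.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]


def map_bbox_to_image(
    bbox: Box,
    x_offset: float,
    y_offset: float,
    tile_scale: float,
    tile_x: float,
    tile_y: float,
) -> Box:
    """Map a tile-local (x1, y1, x2, y2) box to global image coordinates."""
    x1, y1, x2, y2 = bbox
    return (
        (x1 - x_offset) / tile_scale + tile_x,
        (y1 - y_offset) / tile_scale + tile_y,
        (x2 - x_offset) / tile_scale + tile_x,
        (y2 - y_offset) / tile_scale + tile_y,
    )


# Hypothetical tile: top-left corner at (300, 0) in the full image,
# shrunk by a factor of 0.5 and padded by 20 px vertically.
global_box = map_bbox_to_image(
    (100.0, 60.0, 200.0, 160.0),
    x_offset=0.0,
    y_offset=20.0,
    tile_scale=0.5,
    tile_x=300.0,
    tile_y=0.0,
)
```

With these values, $x_1' = (100 - 0) / 0.5 + 300 = 500$ and $y_1' = (60 - 20) / 0.5 + 0 = 80$, so the box maps to $(500, 80, 700, 280)$ in the full image.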

Syncing

This node also synchronizes the tile results. Because frames are sent sequentially (not in parallel), we can rely on a few heuristics when waiting for the tiles of a frame.

  • Tiles are sent to the network sequentially, and the network outputs them sequentially as well.
  • We can find out how many tiles we are expecting.

Hence, we only send the bounding boxes once either:

  1. we have received all the expected tiles, or
  2. the timestamp of the newly received tile differs from the timestamp of the tiles in the buffer.
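
The two conditions can be illustrated with a small buffer class. This is a hedged sketch rather than the node's implementation: the class `TileBuffer`, its `add` method, and the idea of returning the frames that are ready to send are all hypothetical names and design choices for this example.

```python
from typing import Any, List, Optional, Tuple

Frame = Tuple[Any, List[Any]]  # (timestamp, merged bounding boxes)


class TileBuffer:
    """Collects per-tile results until a frame is ready to be sent."""

    def __init__(self, expected_tiles: int) -> None:
        self.expected_tiles = expected_tiles
        self.timestamp: Optional[Any] = None
        self.bboxes: List[Any] = []
        self.received = 0

    def add(self, timestamp: Any, tile_bboxes: List[Any]) -> List[Frame]:
        """Add one tile result and return the frames that are ready to send."""
        ready: List[Frame] = []
        # Condition 2: a tile with a new timestamp arrived while older tiles
        # are still buffered, so the buffered frame is sent as it is.
        if self.received > 0 and timestamp != self.timestamp:
            ready.append(self._flush())
        self.timestamp = timestamp
        self.bboxes.extend(tile_bboxes)
        self.received += 1
        # Condition 1: all expected tiles for this frame have arrived.
        if self.received == self.expected_tiles:
            ready.append(self._flush())
        return ready

    def _flush(self) -> Frame:
        frame = (self.timestamp, self.bboxes)
        self.bboxes = []
        self.received = 0
        return frame


buf = TileBuffer(expected_tiles=2)
first = buf.add(1, ["a"])   # frame 1 incomplete, nothing to send
second = buf.add(1, ["b"])  # frame 1 complete
third = buf.add(2, ["c"])   # frame 2 incomplete
fourth = buf.add(3, ["d"])  # timestamp changed: frame 2 is sent with one tile
```

In this run, frame 1 is sent once both tiles arrive, while frame 2 is sent with a single tile because a tile with timestamp 3 shows up before its second tile does.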

_send_output(timestamp, device_timestamp)

Once all tiles for a frame are processed, the method sends the final set of bounding boxes after performing Non-Maximum Suppression (NMS).

  1. Merging Bounding Boxes: The bounding boxes from all tiles are merged into one list.
  2. Non-Maximum Suppression: NMS is applied to remove overlapping boxes with high IoU (above $\text{iou\_thresh}$), retaining only the box with the highest confidence score.
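
A minimal sketch of the merge and NMS steps, in plain Python, is shown below. It uses the common greedy, class-agnostic variant of NMS; whether the node applies NMS per class or uses a library routine is not stated here, so treat the function names `iou` and `nms` and the `(box, conf)` tuple layout as assumptions for this example.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]
Detection = Tuple[Box, float]  # ((x1, y1, x2, y2), confidence)


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two corner-format boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(detections: List[Detection], iou_thresh: float) -> List[Detection]:
    """Greedy NMS: visit boxes by descending confidence and keep a box only
    if its IoU with every already kept box is at most iou_thresh."""
    kept: List[Detection] = []
    for box, conf in sorted(detections, key=lambda d: d[1], reverse=True):
        if all(iou(box, kept_box) <= iou_thresh for kept_box, _ in kept):
            kept.append((box, conf))
    return kept


# Made-up results from two overlapping tiles, already in global coordinates.
tile_a = [((0.0, 0.0, 10.0, 10.0), 0.9)]
tile_b = [((1.0, 1.0, 11.0, 11.0), 0.8), ((20.0, 20.0, 30.0, 30.0), 0.7)]

merged = tile_a + tile_b            # step 1: merge into one list
final = nms(merged, iou_thresh=0.5)  # step 2: suppress duplicates
```

The first two boxes overlap with an IoU of 81 / 119 ≈ 0.68, which is above the threshold, so only the box with confidence 0.9 survives. The third box does not overlap either of them and is kept.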

For more information about NMS, read here.

The final bounding boxes are then sent as output.