[ICPR 2024 Competition on Rider Intention Prediction] - [Top Submission] - State-Space Model based sequence modelling of rider's-view videos for intent prediction tasks.

SSM-based RIP Model for Rider Intention Prediction

1. Preliminaries

The backbone of our classification models relies on recent advances in sequence-modelling architectures, particularly state-space models (SSMs). A recent selective SSM, Mamba, has shown great promise across a range of domains, starting with text and images and, most relevant to our task, time series.
Rider intention prediction can be framed as high-dimensional time-series classification, where subtle changes in the frame embeddings over the course of a video clip hint at the prediction class.
1.1 Mamba Architecture Overview
Mamba is a recent SSM architecture built for information-dense data. It is, by design, a fully recurrent model, which makes it well suited to long sequences while selectively prioritizing information through its data-driven selection mechanism. It achieves Transformer-quality performance while maintaining linear-time complexity.
1.1.1 Mamba Block Architecture
The fundamental building block of our model is the Mamba block, which processes input sequences through the following series of operations:

  1. Input Projection: The input sequence X of shape (B, V, D) is linearly projected to create two intermediate representations, x and z, both of shape (B, V, ED), where B is the batch size, V is the sequence length, D is the input dimension, and E is an expansion factor.
  2. Convolutional Layer: A 1D convolutional operation is applied to x, followed by the SiLU activation function, producing x'.
  3. Parameter Generation: The block generates the input-dependent parameters B and C through linear projections of x', while A is a learned structured state matrix. Additionally, it computes Δ, which controls the discretization of the continuous-time SSM.
  4. Selective SSM: The core of the Mamba block is the selective SSM operation, which processes the input sequence using the generated parameters.
  5. Output Projection: The output of the selective SSM is combined with the z intermediate representation and projected back to the original input dimension.

Algorithm 1: The process of the Mamba block
Input: X : (B, V, D)
Output: Y : (B, V, D)
1: x, z : (B, V, ED) ← Linear(X)   {linear projection}
2: x' : (B, V, ED) ← SiLU(Conv1D(x))
3: A : (D, N) ← Parameter   {structured state matrix}
4: B, C : (B, V, N) ← Linear(x'), Linear(x')
5: Δ : (B, V, D) ← Softplus(Parameter + Broadcast(Linear(x')))
6: A, B : (B, V, D, N) ← discretize(Δ, A, B)   {input-dependent parameters and discretization}
7: y : (B, V, ED) ← SelectiveSSM(A, B, C)(x')
8: y' : (B, V, ED) ← y ⊗ SiLU(z)
9: Y : (B, V, D) ← Linear(y')   {linear projection}
Where ⊗ denotes element-wise multiplication.
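
For reference, below is a minimal, unoptimized PyTorch sketch of these steps: a plain sequential scan with a rank-1 Δ projection, not the fused, hardware-aware kernel of the official Mamba implementation. The class and variable names are illustrative, not the code used in this repository.

import torch
import torch.nn as nn
import torch.nn.functional as F

class NaiveMambaBlock(nn.Module):
    """Illustrative Mamba block following Algorithm 1 (sequential scan, no fused kernel)."""
    def __init__(self, d_model, d_state=16, d_conv=4, expand=2):
        super().__init__()
        self.d_state = d_state
        self.d_inner = expand * d_model                         # ED
        self.in_proj = nn.Linear(d_model, 2 * self.d_inner)     # step 1: produce x and z
        self.conv1d = nn.Conv1d(self.d_inner, self.d_inner, d_conv,
                                groups=self.d_inner, padding=d_conv - 1)
        self.x_proj = nn.Linear(self.d_inner, 2 * d_state + 1)  # steps 4-5: B, C and a rank-1 Δ
        self.dt_proj = nn.Linear(1, self.d_inner)                # broadcast Δ to the ED channels
        # Structured state matrix A, kept in log space; the scan runs over the expanded dim ED.
        self.A_log = nn.Parameter(torch.log(torch.arange(1, d_state + 1).float())
                                  .repeat(self.d_inner, 1))      # (ED, N)
        self.out_proj = nn.Linear(self.d_inner, d_model)         # step 9

    def forward(self, X):                                        # X: (B, V, D)
        Bsz, V, _ = X.shape
        x, z = self.in_proj(X).chunk(2, dim=-1)                  # (B, V, ED) each
        x = self.conv1d(x.transpose(1, 2))[..., :V].transpose(1, 2)
        x = F.silu(x)                                            # x'
        Bmat, Cmat, dt = torch.split(self.x_proj(x),
                                     [self.d_state, self.d_state, 1], dim=-1)
        delta = F.softplus(self.dt_proj(dt))                     # Δ: (B, V, ED)
        A = -torch.exp(self.A_log)                               # (ED, N)
        h = torch.zeros(Bsz, self.d_inner, self.d_state, device=X.device)
        ys = []
        for t in range(V):                                       # selective scan, one step per frame
            dA = torch.exp(delta[:, t].unsqueeze(-1) * A)                # discretized A: (B, ED, N)
            dB = delta[:, t].unsqueeze(-1) * Bmat[:, t].unsqueeze(1)     # discretized B: (B, ED, N)
            h = dA * h + dB * x[:, t].unsqueeze(-1)                      # state update
            ys.append((h * Cmat[:, t].unsqueeze(1)).sum(-1))             # per-step output: (B, ED)
        y = torch.stack(ys, dim=1)                               # (B, V, ED)
        y = y * F.silu(z)                                        # step 8: gate with z
        return self.out_proj(y)                                  # Y: (B, V, D)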

2. Methodology

2.1 Single-View Model
2.1.1 Model Structure
Our single-view model for rider intention prediction uses the Mamba SSM architecture to process temporal sequences of frame embeddings extracted from video clips. This approach allows us to capture both short-term and long-term dependencies in the rider's behavior, which are crucial for accurate intention prediction.
The input to our model consists of VGG16 embeddings extracted from each frame of the frontal view camera footage. Each embedding has a dimension of 512, providing a comprehensive representation of the visual content in each frame.
To handle the variable-length nature of our video clips, we employ a dynamic padding strategy within each batch during training and inference. This approach allows us to efficiently process sequences of different lengths without losing temporal information or introducing unnecessary computational overhead.
The architecture of our single-view model is designed to progressively refine the temporal representation of the input sequence. It consists of a series of Mamba blocks, each of which processes the entire sequence and updates its internal state. The number of Mamba blocks is a hyperparameter that we tuned based on empirical performance on the validation set.
After the sequence has been processed by the Mamba blocks, we apply global average pooling across the temporal dimension. This operation condenses the temporal information into a fixed-size representation, which is crucial for our classification task as it needs to produce a single prediction for the entire sequence.
The pooled features are then passed through a final linear layer, which maps them to class probabilities corresponding to different rider intentions. We use a softmax activation function to ensure that the output represents a valid probability distribution over the possible intention classes.
The key hyperparameters of this model are:

  • d_state: state dimension of the selective SSM layer
  • d_conv: kernel size of the convolutional layer in the Mamba block
  • expand: expansion factor for the internal dimension

Algorithm 2: Process of Single-View Model
Input:
X: (B, T, F) - Batch of video features
B: Batch size
T: Sequence length
F: Feature dimension (512 for VGG16 features)
Output:
Y: (B, C) - Predicted class probabilities
C: Number of classes (rider intentions)
Flow:
1. Mamba Block Processing:
   x ← X
   for each Mamba block:
       x ← MambaBlock(x)

2. Global Pooling:
   x_pooled: (B, D) ← GlobalAveragePooling(x)

3. Classification:
   Y: (B, C) ← SoftmaxClassifier(x_pooled)

Function SoftmaxClassifier(x):
   logits ← LinearLayer(x)
   probabilities ← Softmax(logits)
   return probabilities
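
A minimal PyTorch sketch of this single-view pipeline is shown below. It assumes the mamba_ssm package's Mamba class (the naive block above would work as a drop-in); the number of blocks, the n_classes argument, and the masked pooling over padded frames are illustrative choices rather than the exact implementation in this repository.

import torch
import torch.nn as nn
from mamba_ssm import Mamba   # assumption: the mamba-ssm package provides the Mamba block

class SingleViewRIP(nn.Module):
    """Sketch: stacked Mamba blocks -> global average pooling -> linear classifier."""
    def __init__(self, n_classes, d_model=512, n_blocks=4, d_state=32, d_conv=4, expand=8):
        super().__init__()
        self.blocks = nn.ModuleList(
            [Mamba(d_model=d_model, d_state=d_state, d_conv=d_conv, expand=expand)
             for _ in range(n_blocks)]
        )
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, x, lengths=None):            # x: (B, T, 512) padded VGG16 features
        for block in self.blocks:
            x = block(x)
        if lengths is not None:                    # average only over real (unpadded) frames
            mask = (torch.arange(x.size(1), device=x.device)[None, :]
                    < lengths[:, None]).unsqueeze(-1).float()
            x = (x * mask).sum(dim=1) / mask.sum(dim=1)
        else:
            x = x.mean(dim=1)                      # global average pooling over time
        return self.classifier(x)                  # logits; softmax gives class probabilities

Returning logits rather than probabilities keeps the sketch compatible with CrossEntropyLoss, which applies the softmax internally; applying a softmax at inference time recovers the SoftmaxClassifier step of Algorithm 2.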

2.2 Multi-View Model
2.2.1 Model Structure
To incorporate information from multiple camera views and potentially improve the accuracy of our predictions, we developed a multi-view model that processes features from the frontal, left side mirror, and right side mirror cameras simultaneously. This approach allows our model to capture a more comprehensive representation of the rider's environment and behavior.
Our multi-view architecture is based on an ensemble of three single-view Mamba models, each dedicated to processing the features from one of the camera views. This design choice allows each model to specialize in extracting relevant information from its respective view while maintaining the ability to capture view-specific temporal dynamics.
The multi-view model processes the input features as follows: First, each view's features (frontal, left mirror, and right mirror) are independently fed through a separate Mamba model, as described in the single-view architecture. This parallel processing allows each model to focus on the unique information provided by its respective view.
After obtaining predictions from each view-specific model, we employ a learnable weighting mechanism to combine these predictions. This approach allows our model to automatically determine the relative importance of each view for the intention prediction task. The weights are initialized randomly and are trained end-to-end with the rest of the model parameters, allowing them to adapt to the specific characteristics of our dataset.
The final prediction is computed as a weighted sum of the individual view outputs. To ensure that the resulting combination represents a valid probability distribution, we apply a softmax function to the weighted sum. This process can be formalized as:

Y_combined = W[0] * Y_front + W[1] * Y_left + W[2] * Y_right
Y_final = Softmax(Y_combined)

Where W represents the learnable weights, and Y_front, Y_left, and Y_right are the outputs from the frontal, left mirror, and right mirror view models, respectively.

This ensemble-based approach offers several advantages. It allows our model to leverage complementary information from different views, potentially capturing aspects of the rider's behavior or environment that may not be visible from a single perspective. Additionally, the learnable weighting mechanism provides a degree of interpretability, as the final weights can give insights into which views are most informative for the intention prediction task.

Algorithm 3: Process of Multi-view Model
Input:
X_front: (B, T, F) - Batch of frontal view features
X_left: (B, T, F) - Batch of left mirror view features
X_right: (B, T, F) - Batch of right mirror view features
B: Batch size
T: Sequence length
F: Feature dimension (512 for VGG16 features)
Output:
Y: (B, C) - Predicted class probabilities
C: Number of classes (rider intentions)
Flow:
1. Single-View Processing:
   Y_front: (B, C) ← MambaModel(X_front)
   Y_left: (B, C) ← MambaModel(X_left)
   Y_right: (B, C) ← MambaModel(X_right)

2. Ensemble Weighting:
   W: (3,) ← LearnableWeights()

3. Weighted Combination:
   Y_combined: (B, C) ← W[0] * Y_front + W[1] * Y_left + W[2] * Y_right

4. Final Classification:
   Y: (B, C) ← Softmax(Y_combined)

Function MambaModel(X):
   x ← X
   for each Mamba block:
       x ← MambaBlock(x)
   x_pooled: (B, D) ← GlobalAveragePooling(x)
   logits: (B, C) ← LinearLayer(x_pooled)
   return logits
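
The ensemble can be expressed compactly in PyTorch. The sketch below assumes three instances of the single-view model above (each returning logits) and a randomly initialized learnable weight vector, as described in Section 2.2.1; the names are illustrative.

import torch
import torch.nn as nn

class MultiViewRIP(nn.Module):
    """Sketch: one single-view model per camera view, combined with learnable scalar weights."""
    def __init__(self, make_view_model, n_views=3):
        super().__init__()
        # make_view_model() should return a single-view model producing (B, C) logits.
        self.views = nn.ModuleList([make_view_model() for _ in range(n_views)])
        self.view_weights = nn.Parameter(torch.randn(n_views))   # W, trained end-to-end

    def forward(self, x_front, x_left, x_right):
        y_front, y_left, y_right = (m(x) for m, x in zip(self.views, (x_front, x_left, x_right)))
        combined = (self.view_weights[0] * y_front
                    + self.view_weights[1] * y_left
                    + self.view_weights[2] * y_right)             # Y_combined
        return torch.softmax(combined, dim=-1)                    # Y_final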

3. Implementation

3.1 Data Preprocessing

For data preprocessing, we utilize pre-extracted VGG16 features for both single-view and multi-view tasks. These features are normalized using z-score normalization (zero mean, unit variance) on a per-sequence basis. This normalization step is crucial for ensuring that the input features are on a consistent scale, which can help with the stability and efficiency of the training process.

We experimented with additional dimensionality reduction or projection of the input features before feeding them into the Mamba models. However, we found that these additional steps did not yield significant performance improvements. As a result, we decided to use the VGG16 features directly, maintaining the original 512-dimensional representation for each frame.

Given that our input sequences (video clips) can have variable lengths, we implement a batch-wise padding strategy where sequences within a batch are padded to the maximum length in that specific batch. This approach allows us to efficiently process diverse sequence lengths without introducing unnecessary padding across the entire dataset.
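
As a concrete illustration of these two steps, a per-sequence z-score normalization and a batch-wise padding collate function might look as follows; the dataset item format and the epsilon constant are assumptions, not taken from this repository.

import torch
from torch.nn.utils.rnn import pad_sequence

def normalize_sequence(feats, eps=1e-6):
    """Z-score normalize one clip's (T, 512) VGG16 features to zero mean, unit variance."""
    return (feats - feats.mean(dim=0, keepdim=True)) / (feats.std(dim=0, keepdim=True) + eps)

def collate_pad(batch):
    """Pad every sequence in the batch to the longest sequence in that batch."""
    feats, labels = zip(*batch)                      # assumes items are (features, label) pairs
    lengths = torch.tensor([f.size(0) for f in feats])
    padded = pad_sequence(feats, batch_first=True)   # (B, T_max, 512), zero-padded
    return padded, lengths, torch.tensor(labels)

The returned lengths tensor lets the model's pooling ignore padded frames, as in the single-view sketch above.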

3.2 Training Procedure
Both models were trained using the following configuration:
● Optimizer: AdamW
● Learning rate: 0.001
● Weight decay: 1e-5
● Batch size: 16
● Number of epochs: 20
● Learning rate scheduler: StepLR (step_size=3, gamma=0.8)
● Loss function: Cross-Entropy Loss
We implemented early stopping based on validation accuracy, saving the best-performing model during training.
The Mamba block hyperparameters were set as follows:
● d_model: 512 (matching the input feature dimension)
● d_state: 32
● d_conv: 4
● expand: 8
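
Under this configuration, the training loop reduces to a fairly standard PyTorch routine. The sketch below assumes a model with the forward(features, lengths) signature from the single-view sketch and an illustrative checkpoint path; the early-stopping patience criterion is omitted for brevity.

import torch
import torch.nn as nn

def train(model, train_loader, val_loader, device="cuda", epochs=20):
    """Training sketch: AdamW + StepLR + cross-entropy, keeping the best model by validation accuracy."""
    model.to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-5)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.8)
    criterion = nn.CrossEntropyLoss()
    best_acc = 0.0
    for epoch in range(epochs):
        model.train()
        for feats, lengths, labels in train_loader:
            feats, lengths, labels = feats.to(device), lengths.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(feats, lengths), labels)       # cross-entropy on logits
            loss.backward()
            optimizer.step()
        scheduler.step()

        model.eval()
        correct, total = 0, 0
        with torch.no_grad():
            for feats, lengths, labels in val_loader:
                preds = model(feats.to(device), lengths.to(device)).argmax(dim=-1)
                correct += (preds == labels.to(device)).sum().item()
                total += labels.size(0)
        val_acc = correct / total
        if val_acc > best_acc:                                    # keep the best checkpoint seen so far
            best_acc = val_acc
            torch.save(model.state_dict(), "best_model.pt")       # illustrative path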
