Franka Arm Manipulation Using Human Demonstrations in a Kitchen Environment
This project enhances robotic arm manipulation by integrating human demonstrations into a modified Soft Actor-Critic (SAC) method, enabling a robot to perform complex tasks, such as opening cabinets, more effectively.
SAC is a reinforcement learning algorithm that trains an agent to act optimally in continuous action spaces, such as controlling a robot arm or navigating a drone. In the code:
- The environment is FrankaKitchen-v1, where the agent completes tasks like opening a cabinet.
- The agent optimizes its policy using the Soft Actor-Critic (SAC) algorithm.
- The algorithm prioritizes reward maximization while encouraging exploration via entropy.
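For reference, the entropy-regularized objective that SAC maximizes can be written as follows (standard SAC notation; $\alpha$ is the temperature that trades off reward against entropy):

$$J(\pi) = \mathbb{E}_{\pi}\left[\sum_{t} r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big)\right]$$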
SAC involves three key networks:
- Actor (Policy): Learns which actions to take in a given state to maximize reward.
- Critics (Q-value estimators): Evaluate how good a given action is in a particular state.
- Target Critic: Provides stable Q-value targets for training the critics.
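As a rough illustration, the Actor and Critic networks could be defined in PyTorch as below; the layer sizes and the names `state_dim`, `action_dim`, and `hidden_dim` are illustrative assumptions, not the exact architecture used in this repository:

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Gaussian policy: outputs the mean and log standard deviation of the action distribution."""
    def __init__(self, state_dim, action_dim, hidden_dim=256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        self.mean = nn.Linear(hidden_dim, action_dim)
        self.log_std = nn.Linear(hidden_dim, action_dim)

    def forward(self, state):
        h = self.trunk(state)
        # Clamp log-std to keep sampled actions numerically well-behaved.
        return self.mean(h), self.log_std(h).clamp(-20, 2)

class Critic(nn.Module):
    """Q-network: maps a (state, action) pair to a scalar value; SAC keeps two, plus target copies."""
    def __init__(self, state_dim, action_dim, hidden_dim=256):
        super().__init__()
        self.q = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, state, action):
        return self.q(torch.cat([state, action], dim=-1))
```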
The overall flow begins with the following setup steps, followed by three training phases:
- Set Up Environment:
  - The environment is created (`gym.make`), and a wrapper processes observations for compatibility (see the setup sketch after this list).
- Agent Initialization:
  - Actor:
    - Learns a policy represented as a probability distribution.
    - Outputs the mean and log standard deviation of the action distribution.
    - Ensures exploration via stochastic sampling.
  - Critics:
    - Two independent networks (Q1 and Q2) estimate action values for stability (avoiding overestimation bias).
  - Target Critic:
    - Initially copies the weights of the Critics and updates slowly to ensure stable targets.
- Replay Buffer:
  - Stores past experiences (`state`, `action`, `reward`, `next_state`, `done`).
  - Enables efficient learning by reusing past experiences.
- Loading Expert Data:
  - In Phase 1, the agent leverages human demonstration data (`human_memory.npz`) to jumpstart training.
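A minimal sketch of this setup stage, assuming `gymnasium` and `gymnasium-robotics` are installed; the flattening wrapper, the `tasks_to_complete` choice, and the buffer capacity are illustrative assumptions, not necessarily what this repository uses:

```python
import numpy as np
import gymnasium as gym
import gymnasium_robotics  # noqa: F401  -- importing registers FrankaKitchen-v1

class FlattenObs(gym.ObservationWrapper):
    """Illustrative wrapper: keep only the flat robot/object state vector from the dict observation."""
    def observation(self, obs):
        return np.asarray(obs["observation"], dtype=np.float32)

env = FlattenObs(gym.make("FrankaKitchen-v1", tasks_to_complete=["slide cabinet"]))

class ReplayBuffer:
    """Ring buffer of (state, action, reward, next_state, done) transitions."""
    def __init__(self, capacity=1_000_000):
        self.storage, self.capacity, self.pos = [], capacity, 0

    def add(self, state, action, reward, next_state, done):
        item = (state, action, reward, next_state, done)
        if len(self.storage) < self.capacity:
            self.storage.append(item)
        else:
            self.storage[self.pos] = item  # overwrite the oldest transition once full
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size):
        idx = np.random.randint(0, len(self.storage), size=batch_size)
        states, actions, rewards, next_states, dones = zip(*(self.storage[i] for i in idx))
        return (np.stack(states), np.stack(actions),
                np.asarray(rewards, dtype=np.float32),
                np.stack(next_states), np.asarray(dones, dtype=np.float32))

# The human demonstrations (`human_memory.npz`) would be loaded into a second buffer of the same type;
# the exact array layout inside the .npz file is specific to this repo and not assumed here.
```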
The core training loop, used across three phases with decreasing reliance on expert data, works as follows (a condensed sketch of the update step appears after this list):
- The agent uses the Actor to:
  - Sample an action based on the current policy.
  - Observe the resulting next state, reward, and whether the episode ends.
- The transition (`state`, `action`, `reward`, `next_state`, `done`) is stored in the Replay Buffer.
- The agent randomly samples a batch of transitions to train itself, ensuring diverse learning.
- The Critics learn to predict Q-values, which represent the expected reward for a state-action pair.
- Target Q-value computation:
  - Uses the Target Critic to estimate future rewards for `next_state`.
  - Incorporates the current reward and a discount factor (`gamma`) to compute the target:
    $$Q_{\text{target}} = r + \gamma \cdot (1 - \text{done}) \cdot \big(\min(Q_1', Q_2') - \alpha \cdot \text{log\_prob}\big)$$
  - The entropy term ($\alpha \cdot \text{log\_prob}$) encourages exploration by penalizing deterministic policies.
- Critic Loss:
  - Compares the predicted Q-values ($Q_1$, $Q_2$) to the computed target Q-value using Mean Squared Error.
- The Actor improves its policy to maximize the Q-values predicted by the critics.
- Actor Loss:
  - Encourages actions that:
    - Maximize Q-values ($\min(Q_1, Q_2)$).
    - Maintain high entropy (exploration).
- The Target Critic's weights are soft-updated:
  $$\theta_{\text{target}} \gets \tau \cdot \theta + (1 - \tau) \cdot \theta_{\text{target}}$$
  - Ensures smoother, more stable training.
- Logging and checkpoints:
  - TensorBoard logs the Critic loss, Actor loss, and episode rewards.
  - Checkpoints are saved so training can be resumed later.
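A condensed, illustrative sketch of one update step, under the same assumptions as the network and buffer sketches above; `alpha`, `gamma`, `tau`, and the single optimizer covering both critics are assumptions, not this repository's exact hyperparameters or structure:

```python
import torch
import torch.nn.functional as F

def sample_action(actor, state):
    """Sample a tanh-squashed action and its log-probability from the Gaussian policy."""
    mean, log_std = actor(state)
    dist = torch.distributions.Normal(mean, log_std.exp())
    pre_tanh = dist.rsample()                      # reparameterized sample keeps gradients
    action = torch.tanh(pre_tanh)
    log_prob = (dist.log_prob(pre_tanh)
                - torch.log(1 - action.pow(2) + 1e-6)).sum(-1, keepdim=True)
    return action, log_prob

def sac_update(actor, critic1, critic2, target1, target2, batch,
               actor_opt, critic_opt, alpha=0.2, gamma=0.99, tau=0.005):
    state, action, reward, next_state, done = (
        torch.as_tensor(x, dtype=torch.float32) for x in batch)
    reward, done = reward.unsqueeze(-1), done.unsqueeze(-1)

    # Critic update: regress Q1/Q2 toward the entropy-regularized target.
    with torch.no_grad():
        next_action, next_log_prob = sample_action(actor, next_state)
        min_q_next = torch.min(target1(next_state, next_action),
                               target2(next_state, next_action))
        q_target = reward + gamma * (1 - done) * (min_q_next - alpha * next_log_prob)
    critic_loss = (F.mse_loss(critic1(state, action), q_target)
                   + F.mse_loss(critic2(state, action), q_target))
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor update: maximize min(Q1, Q2) while keeping entropy high.
    new_action, log_prob = sample_action(actor, state)
    q_new = torch.min(critic1(state, new_action), critic2(state, new_action))
    actor_loss = (alpha * log_prob - q_new).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Soft-update the target critics toward the online critics.
    for target, critic in ((target1, critic1), (target2, critic2)):
        for tp, p in zip(target.parameters(), critic.parameters()):
            tp.data.mul_(1 - tau).add_(tau * p.data)
```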
The agent is trained in three phases:
- Phase 1: High reliance on expert data:
  - Expert data ratio = 50%.
  - Balances learning from the replay buffer and human-provided data (see the mixed-batch sampling sketch after this list).
- Phase 2: Reduced expert reliance:
  - Expert data ratio = 25%.
  - Encourages the agent to learn more from its own exploration.
- Phase 3: Full autonomy:
  - Expert data ratio = 0%.
  - The agent learns purely from its own experience.
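One plausible way to implement the expert-data ratio, assuming both the demonstration data and the self-collected transitions live in buffers with the `sample` interface sketched earlier (the function and buffer names are assumptions):

```python
import numpy as np

def sample_mixed_batch(agent_buffer, expert_buffer, batch_size, expert_ratio):
    """Draw a batch where `expert_ratio` of transitions come from human demonstrations
    (0.5 in Phase 1, 0.25 in Phase 2, 0.0 in Phase 3)."""
    n_expert = int(batch_size * expert_ratio)
    agent_batch = agent_buffer.sample(batch_size - n_expert)
    if n_expert == 0:
        return agent_batch
    expert_batch = expert_buffer.sample(n_expert)
    # Concatenate field by field: (states, actions, rewards, next_states, dones).
    return tuple(np.concatenate([a, e], axis=0) for a, e in zip(agent_batch, expert_batch))
```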
In summary, the key components are:
- Actor (Policy):
  - Learns how to act optimally by maximizing rewards while maintaining exploration.
- Critics (Q-Values):
  - Evaluate the quality of actions taken by the policy.
  - Two critics reduce overestimation bias.
- Replay Buffer:
  - Ensures sample efficiency by reusing past experiences.
  - Decorrelation: helps prevent learning from sequentially correlated data.
- Entropy Regularization:
  - Encourages exploration, preventing premature convergence to suboptimal strategies.
- Target Networks:
  - Provide stable targets for critic training, avoiding instability caused by rapidly changing Q-values.
- Expert Data:
  - Jumpstarts training by introducing good behaviors early on, which is especially useful in complex tasks like robotics.
The overall training workflow:
- Initialize the environment, agent, and replay buffer.
- Phase 1 (Exploration with Expert Data):
  - Train using a mix of expert and self-collected data.
- Phase 2 (Reduced Expert Reliance):
  - Gradually shift focus to agent-collected experiences.
- Phase 3 (Full Autonomy):
  - Train entirely on self-collected experiences.
- For each episode:
  - Interact with the environment.
  - Store experiences in the replay buffer.
  - Periodically sample experiences to:
    - Update the Critics using target Q-values.
    - Update the Actor using learned Q-values and entropy regularization.
  - Log metrics and save model checkpoints (see the loop sketch below).
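Putting the pieces together, an illustrative episode loop with TensorBoard logging and checkpointing, reusing the sketches above; `total_episodes`, `batch_size`, `expert_ratio`, the checkpoint interval, and the file paths are all assumptions rather than this repository's actual values:

```python
import torch
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter("runs/franka_kitchen_sac")  # illustrative log directory

for episode in range(total_episodes):
    obs, _ = env.reset()
    episode_reward, done = 0.0, False
    while not done:
        # Act with the current stochastic policy.
        with torch.no_grad():
            action, _ = sample_action(
                actor, torch.as_tensor(obs, dtype=torch.float32).unsqueeze(0))
        action = action.squeeze(0).numpy()
        next_obs, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated

        agent_buffer.add(obs, action, reward, next_obs, float(done))
        obs, episode_reward = next_obs, episode_reward + reward

        # Train on a mixed batch once enough transitions have been collected.
        if len(agent_buffer.storage) > batch_size:
            batch = sample_mixed_batch(agent_buffer, expert_buffer, batch_size, expert_ratio)
            sac_update(actor, critic1, critic2, target1, target2, batch, actor_opt, critic_opt)

    writer.add_scalar("reward/episode", episode_reward, episode)
    if episode % 50 == 0:  # checkpoint interval is an assumption
        torch.save(actor.state_dict(), f"checkpoints/actor_{episode}.pt")
```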
Tested on:
- macOS Sequoia 15.1.1
- Python 3.11.9
- Required packages are listed in `requirements.txt`.