
If the model has converged, what range should the total loss be in? Did someone find a set of hyperparameters that works with a small batch size? #479

risingClouds opened this issue Nov 6, 2024 · 5 comments


@risingClouds

I want to retrain DINOv2 on my own domain-specific dataset of 220k unlabeled images. I resize each image to [896, 896] and train on 4 A100 GPUs with batch_size_per_gpu set to 20. The config is as follows:

MODEL:
  WEIGHTS: ''
ibot:
  separate_head: true
train:
  batch_size_per_gpu: 20
  OFFICIAL_EPOCH_LENGTH: 650
  dataset_path: ImageNet:split=TRAIN
  centering: sinkhorn_knopp
student:
  arch: vit_base
  patch_size: 14
  drop_path_rate: 0.4
  ffn_layer: swiglufused
  block_chunks: 4
teacher:
  momentum_teacher: 0.994
optim:
  epochs: 200
  weight_decay_end: 0.2
  base_lr: 2.0e-04  # learning rate for a batch size of 1024
  warmup_epochs: 10
  layerwise_decay: 1.0
crops:
  global_crops_size: 896
  local_crops_size: 392
evaluation:
  eval_period_iterations: 5000

The total loss only converges from 14 to about 11. Is that because of bad hyperparameters?

{"iteration": 0, "iter_time": 20.96675682067871, "data_time": 17.528287887573242, "lr": 0.0, "wd": 0.04000000000000001, "mom": 0.994, "last_layer_lr": 0.0, "current_batch_size": 20.0, "total_loss": 14.873044967651367, "dino_local_crops_loss": 10.032546043395996, "dino_global_crops_loss": 1.2540682554244995, "koleo_loss": 0.775390625, "ibot_loss": 2.811039686203003}
{"iteration": 10, "iter_time": 5.467006206512451, "data_time": 1.5937976837158203, "lr": 4.300792386328262e-08, "wd": 0.04000000058400027, "mom": 0.9940000000219, "last_layer_lr": 0.0, "current_batch_size": 20.0, "total_loss": 14.863398551940918, "dino_local_crops_loss": 10.03254508972168, "dino_global_crops_loss": 1.2540682554244995, "koleo_loss": 0.765625, "ibot_loss": 2.811257839202881}
{"iteration": 20, "iter_time": 3.6815314292907715, "data_time": 0.07533036172389984, "lr": 8.601584772656524e-08, "wd": 0.04000000233600104, "mom": 0.9940000000876, "last_layer_lr": 0.0, "current_batch_size": 20.0, "total_loss": 14.860908508300781, "dino_local_crops_loss": 10.032539367675781, "dino_global_crops_loss": 1.2540677785873413, "koleo_loss": 0.76318359375, "ibot_loss": 2.811257839202881}
{"iteration": 30, "iter_time": 3.3603122234344482, "data_time": 0.32184287905693054, "lr": 1.7203169545313049e-07, "wd": 0.040000009344003995, "mom": 0.9940000003504001, "last_layer_lr": 0.0, "current_batch_size": 20.0, "total_loss": 14.85217571258545, "dino_local_crops_loss": 10.032195091247559, "dino_global_crops_loss": 1.2540218830108643, "koleo_loss": 0.7548828125, "ibot_loss": 2.8113298416137695}
{"iteration": 40, "iter_time": 3.7108612060546875, "data_time": 0.578641951084137, "lr": 2.580475431796957e-07, "wd": 0.04000002102400846, "mom": 0.9940000007884003, "last_layer_lr": 0.0, "current_batch_size": 20.0, "total_loss": 14.834327697753906, "dino_local_crops_loss": 10.031959533691406, "dino_global_crops_loss": 1.2539944648742676, "koleo_loss": 0.736328125, "ibot_loss": 2.811502456665039}
{"iteration": 50, "iter_time": 3.860743761062622, "data_time": 0.43079957365989685, "lr": 3.4406339090626097e-07, "wd": 0.040000037376013764, "mom": 0.9940000014016005, "last_layer_lr": 0.0, "current_batch_size": 20.0, "total_loss": 14.810681343078613, "dino_local_crops_loss": 10.031583786010742, "dino_global_crops_loss": 1.2539479732513428, "koleo_loss": 0.7138671875, "ibot_loss": 2.811478614807129}
{"iteration": 60, "iter_time": 3.7952067852020264, "data_time": 0.42305120825767517, "lr": 4.300792386328262e-07, "wd": 0.04000005840001894, "mom": 0.9940000021900007, "last_layer_lr": 0.0, "current_batch_size": 20.0, "total_loss": 14.78423023223877, "dino_local_crops_loss": 10.030817031860352, "dino_global_crops_loss": 1.253852367401123, "koleo_loss": 0.68896484375, "ibot_loss": 2.8113462924957275}
{"iteration": 70, "iter_time": 3.729526996612549, "data_time": 0.5089820027351379, "lr": 5.160950863593914e-07, "wd": 0.040000084096022764, "mom": 0.9940000031536008, "last_layer_lr": 0.0, "current_batch_size": 20.0, "total_loss": 14.754712104797363, "dino_local_crops_loss": 10.029962539672852, "dino_global_crops_loss": 1.2537462711334229, "koleo_loss": 0.65869140625, "ibot_loss": 2.811479330062866}
{"iteration": 80, "iter_time": 3.8763680458068848, "data_time": 0.6289194822311401, "lr": 6.021109340859567e-07, "wd": 0.04000011446402377, "mom": 0.9940000042924009, "last_layer_lr": 0.0, "current_batch_size": 20.0, "total_loss": 14.729358673095703, "dino_local_crops_loss": 10.029097557067871, "dino_global_crops_loss": 1.2536370754241943, "koleo_loss": 0.6357421875, "ibot_loss": 2.811481475830078}
{"iteration": 90, "iter_time": 3.622586727142334, "data_time": 0.6086481809616089, "lr": 6.881267818125219e-07, "wd": 0.040000149504020116, "mom": 0.9940000056064008, "last_layer_lr": 0.0, "current_batch_size": 20.0, "total_loss": 14.707359313964844, "dino_local_crops_loss": 10.027824401855469, "dino_global_crops_loss": 1.2534765005111694, "koleo_loss": 0.61376953125, "ibot_loss": 2.811448097229004}
{"iteration": 100, "iter_time": 3.3090415000915527, "data_time": 0.35113075375556946, "lr": 7.741426295390871e-07, "wd": 0.04000018921600981, "mom": 0.9940000070956003, "last_layer_lr": 0.0, "current_batch_size": 20.0, "total_loss": 14.687376976013184, "dino_local_crops_loss": 10.026717185974121, "dino_global_crops_loss": 1.2533398866653442, "koleo_loss": 0.5966796875, "ibot_loss": 2.8113155364990234}
{"iteration": 110, "iter_time": 3.7091193199157715, "data_time": 0.7397410273551941, "lr": 8.601584772656524e-07, "wd": 0.0400002335999905, "mom": 0.9940000087599996, "last_layer_lr": 0.0, "current_batch_size": 20.0, "total_loss": 14.66700553894043, "dino_local_crops_loss": 10.025345802307129, "dino_global_crops_loss": 1.2531678676605225, "koleo_loss": 0.5771484375, "ibot_loss": 2.8110175132751465}
{"iteration": 120, "iter_time": 3.6070103645324707, "data_time": 0.9350411295890808, "lr": 9.461743249922176e-07, "wd": 0.04000028265595962, "mom": 0.9940000105995985, "last_layer_lr": 0.0, "current_batch_size": 20.0, "total_loss": 14.645130157470703, "dino_local_crops_loss": 10.023780822753906, "dino_global_crops_loss": 1.2529722452163696, "koleo_loss": 0.55859375, "ibot_loss": 2.810994863510132}
{"iteration": 130, "iter_time": 3.4192874431610107, "data_time": 0.5709220170974731, "lr": 1.0321901727187828e-06, "wd": 0.04000033638391426, "mom": 0.9940000126143967, "last_layer_lr": 0.0, "current_batch_size": 20.0, "total_loss": 14.628183364868164, "dino_local_crops_loss": 10.022283554077148, "dino_global_crops_loss": 1.2527868747711182, "koleo_loss": 0.5419921875, "ibot_loss": 2.810994863510132}

.....

{"iteration": 83100, "iter_time": 3.6564724445343018, "data_time": 0.5249921083450317, "lr": 1.8331884971103754e-05, "wd": 0.1538698011829041, "mom": 0.9982701175443589, "last_layer_lr": 1.8331884971103754e-05, "current_batch_size": 20.0, "total_loss": 10.993171691894531, "dino_local_crops_loss": 7.494181156158447, "dino_global_crops_loss": 0.9350773692131042, "koleo_loss": -0.029388427734375, "ibot_loss": 2.5918610095977783}
{"iteration": 83110, "iter_time": 3.742112636566162, "data_time": 0.5908024311065674, "lr": 1.8325394093353296e-05, "wd": 0.15388731492445962, "mom": 0.9982707743096673, "last_layer_lr": 1.8325394093353296e-05, "current_batch_size": 20.0, "total_loss": 10.972248077392578, "dino_local_crops_loss": 7.489750862121582, "dino_global_crops_loss": 0.9348238110542297, "koleo_loss": -0.029571533203125, "ibot_loss": 2.5916199684143066}
{"iteration": 83120, "iter_time": 3.7471721172332764, "data_time": 0.3674837350845337, "lr": 1.8318903870812672e-05, "wd": 0.15390482668699507, "mom": 0.9982714310007623, "last_layer_lr": 1.8318903870812672e-05, "current_batch_size": 20.0, "total_loss": 11.006226539611816, "dino_local_crops_loss": 7.509341239929199, "dino_global_crops_loss": 0.9356716871261597, "koleo_loss": -0.029571533203125, "ibot_loss": 2.5926804542541504}
{"iteration": 83130, "iter_time": 3.761854648590088, "data_time": 0.09775485843420029, "lr": 1.8312414303901845e-05, "wd": 0.15392233646948772, "mom": 0.9982720876176058, "last_layer_lr": 1.8312414303901845e-05, "current_batch_size": 20.0, "total_loss": 11.014735221862793, "dino_local_crops_loss": 7.510159969329834, "dino_global_crops_loss": 0.9359200596809387, "koleo_loss": -0.029449462890625, "ibot_loss": 2.592142105102539}
{"iteration": 83140, "iter_time": 3.8580803871154785, "data_time": 0.04276227951049805, "lr": 1.8305925393040766e-05, "wd": 0.15393984427091512, "mom": 0.9982727441601593, "last_layer_lr": 1.8305925393040766e-05, "current_batch_size": 20.0, "total_loss": 10.995990753173828, "dino_local_crops_loss": 7.500287055969238, "dino_global_crops_loss": 0.9372903108596802, "koleo_loss": -0.0295867919921875, "ibot_loss": 2.5913407802581787}
{"iteration": 83150, "iter_time": 3.751682996749878, "data_time": 0.16627313196659088, "lr": 1.829943713864931e-05, "wd": 0.15395735009025469, "mom": 0.9982734006283845, "last_layer_lr": 1.829943713864931e-05, "current_batch_size": 20.0, "total_loss": 10.992744445800781, "dino_local_crops_loss": 7.498546600341797, "dino_global_crops_loss": 0.9366305470466614, "koleo_loss": -0.0296173095703125, "ibot_loss": 2.5913407802581787}
{"iteration": 83160, "iter_time": 3.728980302810669, "data_time": 0.5257463455200195, "lr": 1.8292949541147337e-05, "wd": 0.15397485392648413, "mom": 0.9982740570222431, "last_layer_lr": 1.8292949541147337e-05, "current_batch_size": 20.0, "total_loss": 11.040926933288574, "dino_local_crops_loss": 7.53302526473999, "dino_global_crops_loss": 0.9391398429870605, "koleo_loss": -0.0296173095703125, "ibot_loss": 2.592230796813965}
{"iteration": 83170, "iter_time": 3.726938247680664, "data_time": 0.48002785444259644, "lr": 1.8286462600954654e-05, "wd": 0.15399235577858122, "mom": 0.9982747133416968, "last_layer_lr": 1.8286462600954654e-05, "current_batch_size": 20.0, "total_loss": 11.04371452331543, "dino_local_crops_loss": 7.533961296081543, "dino_global_crops_loss": 0.9398974776268005, "koleo_loss": -0.02960205078125, "ibot_loss": 2.5900421142578125}
{"iteration": 83180, "iter_time": 4.050873756408691, "data_time": 0.28923410177230835, "lr": 1.8279976318491018e-05, "wd": 0.15400985564552389, "mom": 0.9982753695867072, "last_layer_lr": 1.8279976318491018e-05, "current_batch_size": 20.0, "total_loss": 11.041265487670898, "dino_local_crops_loss": 7.533961296081543, "dino_global_crops_loss": 0.9396708607673645, "koleo_loss": -0.0293426513671875, "ibot_loss": 2.590700149536133}
{"iteration": 83190, "iter_time": 4.258638381958008, "data_time": 0.8373247385025024, "lr": 1.8273490694176158e-05, "wd": 0.15402735352629013, "mom": 0.9982760257572358, "last_layer_lr": 1.8273490694176158e-05, "current_batch_size": 20.0, "total_loss": 11.032852172851562, "dino_local_crops_loss": 7.531869888305664, "dino_global_crops_loss": 0.939987301826477, "koleo_loss": -0.0293426513671875, "ibot_loss": 2.588805675506592}
{"iteration": 83200, "iter_time": 3.8045802116394043, "data_time": 1.1129602193832397, "lr": 1.8267005728429744e-05, "wd": 0.15404484941985802, "mom": 0.9982766818532447, "last_layer_lr": 1.8267005728429744e-05, "current_batch_size": 20.0, "total_loss": 11.032852172851562, "dino_local_crops_loss": 7.535743713378906, "dino_global_crops_loss": 0.9400883316993713, "koleo_loss": -0.0294647216796875, "ibot_loss": 2.5855214595794678}
{"iteration": 83210, "iter_time": 3.729626178741455, "data_time": 1.0515087842941284, "lr": 1.826052142167143e-05, "wd": 0.1540623433252058, "mom": 0.9982773378746952, "last_layer_lr": 1.826052142167143e-05, "current_batch_size": 20.0, "total_loss": 11.02900505065918, "dino_local_crops_loss": 7.531525135040283, "dino_global_crops_loss": 0.9418781399726868, "koleo_loss": -0.0294189453125, "ibot_loss": 2.5855214595794678}
{"iteration": 83220, "iter_time": 3.7799758911132812, "data_time": 1.0355918407440186, "lr": 1.82540377743208e-05, "wd": 0.1540798352413119, "mom": 0.9982779938215492, "last_layer_lr": 1.82540377743208e-05, "current_batch_size": 20.0, "total_loss": 11.012650489807129, "dino_local_crops_loss": 7.503568649291992, "dino_global_crops_loss": 0.9382875561714172, "koleo_loss": -0.0294342041015625, "ibot_loss": 2.584805488586426}
{"iteration": 83230, "iter_time": 3.6318047046661377, "data_time": 0.8730975985527039, "lr": 1.82475547867974e-05, "wd": 0.15409732516715471, "mom": 0.9982786496937683, "last_layer_lr": 1.82475547867974e-05, "current_batch_size": 20.0, "total_loss": 11.015081405639648, "dino_local_crops_loss": 7.518352031707764, "dino_global_crops_loss": 0.9382875561714172, "koleo_loss": -0.02947998046875, "ibot_loss": 2.587644100189209}
{"iteration": 83240, "iter_time": 3.2084033489227295, "data_time": 0.4374796450138092, "lr": 1.8241072459520745e-05, "wd": 0.15411481310171288, "mom": 0.9982793054913143, "last_layer_lr": 1.8241072459520745e-05, "current_batch_size": 20.0, "total_loss": 11.025972366333008, "dino_local_crops_loss": 7.535113334655762, "dino_global_crops_loss": 0.9390830993652344, "koleo_loss": -0.0296630859375, "ibot_loss": 2.589679002761841}
{"iteration": 83250, "iter_time": 3.0241451263427734, "data_time": 0.18751582503318787, "lr": 1.8234590792910296e-05, "wd": 0.1541322990439651, "mom": 0.9982799612141486, "last_layer_lr": 1.8234590792910296e-05, "current_batch_size": 20.0, "total_loss": 11.022649765014648, "dino_local_crops_loss": 7.514020919799805, "dino_global_crops_loss": 0.9350559115409851, "koleo_loss": -0.029571533203125, "ibot_loss": 2.5968191623687744}
{"iteration": 83260, "iter_time": 2.94983172416687, "data_time": 0.18638351559638977, "lr": 1.8228109787385494e-05, "wd": 0.15414978299289014, "mom": 0.9982806168622333, "last_layer_lr": 1.8228109787385494e-05, "current_batch_size": 20.0, "total_loss": 11.001459121704102, "dino_local_crops_loss": 7.49662446975708, "dino_global_crops_loss": 0.934248149394989, "koleo_loss": -0.029266357421875, "ibot_loss": 2.6000168323516846}
{"iteration": 83270, "iter_time": 2.8373987674713135, "data_time": 0.1555553823709488, "lr": 1.8221629443365692e-05, "wd": 0.15416726494746696, "mom": 0.99828127243553, "last_layer_lr": 1.8221629443365692e-05, "current_batch_size": 20.0, "total_loss": 11.018768310546875, "dino_local_crops_loss": 7.529201984405518, "dino_global_crops_loss": 0.9378029108047485, "koleo_loss": -0.029296875, "ibot_loss": 2.5948092937469482}
{"iteration": 83280, "iter_time": 3.3542747497558594, "data_time": 0.6156438589096069, "lr": 1.8215149761270247e-05, "wd": 0.15418474490667466, "mom": 0.9982819279340003, "last_layer_lr": 1.8215149761270247e-05, "current_batch_size": 20.0, "total_loss": 11.030592918395996, "dino_local_crops_loss": 7.530805587768555, "dino_global_crops_loss": 0.940298318862915, "koleo_loss": -0.029266357421875, "ibot_loss": 2.5932984352111816}
@MyHogsmeade

I've encountered a similar issue as well. Do you have any suggestions?

@1921134176

I also trained on my own dataset and found that the loss decreased rapidly at first, then became difficult to decrease further, but the final model's performance was still good. I suggest using frozen-backbone fine-tuning or kNN evaluation to check whether your model has converged. So even though the loss stops decreasing, continuing to train should still move the model in a positive direction.
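For example, a minimal kNN probe on frozen embeddings might look like the sketch below (it assumes CLS features and labels have already been extracted for a small annotated subset; the file names are placeholders):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical arrays: frozen-backbone embeddings plus labels for a small
# annotated subset of the domain-specific data.
features = np.load("embeddings.npy")    # shape (N, embed_dim)
labels = np.load("labels.npy")          # shape (N,)

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=0, stratify=labels
)

# Cosine-distance kNN probe: if this accuracy keeps improving across
# checkpoints, the backbone is still learning even when the SSL loss is flat.
knn = KNeighborsClassifier(n_neighbors=20, metric="cosine")
knn.fit(X_train, y_train)
print("kNN accuracy:", knn.score(X_test, y_test))
```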

@MyHogsmeade

I also trained on my own dataset and found that the loss decreased rapidly at first, then became difficult to decrease further, but the final model's performance was still good. I suggest using frozen-backbone fine-tuning or kNN evaluation to check whether your model has converged. So even though the loss stops decreasing, continuing to train should still move the model in a positive direction.

Thank you for your suggestion. I will try to evaluate the effect through downstream testing.

@zhangjiancheng1024

I have also encountered similar problems. I found that, perhaps because my images are medical images rather than natural images, removing the preprocessing step that normalizes with the ImageNet mean and variance allowed the loss to converge from 11 to around 4.
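As a rough sketch of that change (assuming torchvision-style transforms; the loader below is a placeholder), one can either drop the `Normalize` step from the augmentations or replace the ImageNet constants with statistics computed on the domain images:

```python
import torch

# ImageNet statistics used by the default DINOv2 augmentations; these can be
# a poor fit for non-natural images such as medical scans.
IMAGENET_MEAN, IMAGENET_STD = (0.485, 0.456, 0.406), (0.229, 0.224, 0.225)

def compute_dataset_stats(loader):
    """Estimate per-channel mean/std over an unnormalized image loader."""
    n, mean, sq_mean = 0, torch.zeros(3), torch.zeros(3)
    for images, *_ in loader:            # images: (B, 3, H, W) in [0, 1]
        b = images.size(0)
        mean += images.mean(dim=(0, 2, 3)) * b
        sq_mean += (images ** 2).mean(dim=(0, 2, 3)) * b
        n += b
    mean /= n
    std = (sq_mean / n - mean ** 2).sqrt()
    return mean.tolist(), std.tolist()

# Hypothetical usage: compute domain statistics once, then use them in place
# of IMAGENET_MEAN / IMAGENET_STD in the training transforms (or skip
# normalization entirely, as described above).
# mean, std = compute_dataset_stats(unnormalized_loader)
```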

@natwille1

natwille1 commented Jan 14, 2025

I have also encountered similar problems. I found that, perhaps because my images are medical images rather than natural images, removing the preprocessing step that normalizes with the ImageNet mean and variance allowed the loss to converge from 11 to around 4.
@zhangjiancheng1024

I'm researching a similar scenario - medical images, training a ViT with DINOv2 from scratch. I'm struggling to get training to stabilize - is there a range of hyperparameters you found worked well for you?
