
If the model has converged, what range should the total loss be in? Did someone find a set of hyperparameters that works with a small batch size? #479

risingClouds opened this issue Nov 6, 2024 · 5 comments


@risingClouds

I want to retrain DINOv2 on my own domain-specific dataset of 220k unlabeled images. I resize each image to [896, 896] and train on 4 A100 GPUs with batch_size_per_gpu set to 20. The config is as follows:

MODEL:
  WEIGHTS: ''
ibot:
  separate_head: true
train:
  batch_size_per_gpu: 20
  OFFICIAL_EPOCH_LENGTH: 650
  dataset_path: ImageNet:split=TRAIN
  centering: sinkhorn_knopp
student:
  arch: vit_base
  patch_size: 14
  drop_path_rate: 0.4
  ffn_layer: swiglufused
  block_chunks: 4
teacher:
  momentum_teacher: 0.994
optim:
  epochs: 200
  weight_decay_end: 0.2
  base_lr: 2.0e-04  # learning rate for a batch size of 1024
  warmup_epochs: 10
  layerwise_decay: 1.0
crops:
  global_crops_size: 896
  local_crops_size: 392
evaluation:
  eval_period_iterations: 5000

The total loss only converges from 14 to about 11. Is that because of bad hyperparameters?

{"iteration": 0, "iter_time": 20.96675682067871, "data_time": 17.528287887573242, "lr": 0.0, "wd": 0.04000000000000001, "mom": 0.994, "last_layer_lr": 0.0, "current_batch_size": 20.0, "total_loss": 14.873044967651367, "dino_local_crops_loss": 10.032546043395996, "dino_global_crops_loss": 1.2540682554244995, "koleo_loss": 0.775390625, "ibot_loss": 2.811039686203003}
{"iteration": 10, "iter_time": 5.467006206512451, "data_time": 1.5937976837158203, "lr": 4.300792386328262e-08, "wd": 0.04000000058400027, "mom": 0.9940000000219, "last_layer_lr": 0.0, "current_batch_size": 20.0, "total_loss": 14.863398551940918, "dino_local_crops_loss": 10.03254508972168, "dino_global_crops_loss": 1.2540682554244995, "koleo_loss": 0.765625, "ibot_loss": 2.811257839202881}
{"iteration": 20, "iter_time": 3.6815314292907715, "data_time": 0.07533036172389984, "lr": 8.601584772656524e-08, "wd": 0.04000000233600104, "mom": 0.9940000000876, "last_layer_lr": 0.0, "current_batch_size": 20.0, "total_loss": 14.860908508300781, "dino_local_crops_loss": 10.032539367675781, "dino_global_crops_loss": 1.2540677785873413, "koleo_loss": 0.76318359375, "ibot_loss": 2.811257839202881}
{"iteration": 30, "iter_time": 3.3603122234344482, "data_time": 0.32184287905693054, "lr": 1.7203169545313049e-07, "wd": 0.040000009344003995, "mom": 0.9940000003504001, "last_layer_lr": 0.0, "current_batch_size": 20.0, "total_loss": 14.85217571258545, "dino_local_crops_loss": 10.032195091247559, "dino_global_crops_loss": 1.2540218830108643, "koleo_loss": 0.7548828125, "ibot_loss": 2.8113298416137695}
{"iteration": 40, "iter_time": 3.7108612060546875, "data_time": 0.578641951084137, "lr": 2.580475431796957e-07, "wd": 0.04000002102400846, "mom": 0.9940000007884003, "last_layer_lr": 0.0, "current_batch_size": 20.0, "total_loss": 14.834327697753906, "dino_local_crops_loss": 10.031959533691406, "dino_global_crops_loss": 1.2539944648742676, "koleo_loss": 0.736328125, "ibot_loss": 2.811502456665039}
{"iteration": 50, "iter_time": 3.860743761062622, "data_time": 0.43079957365989685, "lr": 3.4406339090626097e-07, "wd": 0.040000037376013764, "mom": 0.9940000014016005, "last_layer_lr": 0.0, "current_batch_size": 20.0, "total_loss": 14.810681343078613, "dino_local_crops_loss": 10.031583786010742, "dino_global_crops_loss": 1.2539479732513428, "koleo_loss": 0.7138671875, "ibot_loss": 2.811478614807129}
{"iteration": 60, "iter_time": 3.7952067852020264, "data_time": 0.42305120825767517, "lr": 4.300792386328262e-07, "wd": 0.04000005840001894, "mom": 0.9940000021900007, "last_layer_lr": 0.0, "current_batch_size": 20.0, "total_loss": 14.78423023223877, "dino_local_crops_loss": 10.030817031860352, "dino_global_crops_loss": 1.253852367401123, "koleo_loss": 0.68896484375, "ibot_loss": 2.8113462924957275}
{"iteration": 70, "iter_time": 3.729526996612549, "data_time": 0.5089820027351379, "lr": 5.160950863593914e-07, "wd": 0.040000084096022764, "mom": 0.9940000031536008, "last_layer_lr": 0.0, "current_batch_size": 20.0, "total_loss": 14.754712104797363, "dino_local_crops_loss": 10.029962539672852, "dino_global_crops_loss": 1.2537462711334229, "koleo_loss": 0.65869140625, "ibot_loss": 2.811479330062866}
{"iteration": 80, "iter_time": 3.8763680458068848, "data_time": 0.6289194822311401, "lr": 6.021109340859567e-07, "wd": 0.04000011446402377, "mom": 0.9940000042924009, "last_layer_lr": 0.0, "current_batch_size": 20.0, "total_loss": 14.729358673095703, "dino_local_crops_loss": 10.029097557067871, "dino_global_crops_loss": 1.2536370754241943, "koleo_loss": 0.6357421875, "ibot_loss": 2.811481475830078}
{"iteration": 90, "iter_time": 3.622586727142334, "data_time": 0.6086481809616089, "lr": 6.881267818125219e-07, "wd": 0.040000149504020116, "mom": 0.9940000056064008, "last_layer_lr": 0.0, "current_batch_size": 20.0, "total_loss": 14.707359313964844, "dino_local_crops_loss": 10.027824401855469, "dino_global_crops_loss": 1.2534765005111694, "koleo_loss": 0.61376953125, "ibot_loss": 2.811448097229004}
{"iteration": 100, "iter_time": 3.3090415000915527, "data_time": 0.35113075375556946, "lr": 7.741426295390871e-07, "wd": 0.04000018921600981, "mom": 0.9940000070956003, "last_layer_lr": 0.0, "current_batch_size": 20.0, "total_loss": 14.687376976013184, "dino_local_crops_loss": 10.026717185974121, "dino_global_crops_loss": 1.2533398866653442, "koleo_loss": 0.5966796875, "ibot_loss": 2.8113155364990234}
{"iteration": 110, "iter_time": 3.7091193199157715, "data_time": 0.7397410273551941, "lr": 8.601584772656524e-07, "wd": 0.0400002335999905, "mom": 0.9940000087599996, "last_layer_lr": 0.0, "current_batch_size": 20.0, "total_loss": 14.66700553894043, "dino_local_crops_loss": 10.025345802307129, "dino_global_crops_loss": 1.2531678676605225, "koleo_loss": 0.5771484375, "ibot_loss": 2.8110175132751465}
{"iteration": 120, "iter_time": 3.6070103645324707, "data_time": 0.9350411295890808, "lr": 9.461743249922176e-07, "wd": 0.04000028265595962, "mom": 0.9940000105995985, "last_layer_lr": 0.0, "current_batch_size": 20.0, "total_loss": 14.645130157470703, "dino_local_crops_loss": 10.023780822753906, "dino_global_crops_loss": 1.2529722452163696, "koleo_loss": 0.55859375, "ibot_loss": 2.810994863510132}
{"iteration": 130, "iter_time": 3.4192874431610107, "data_time": 0.5709220170974731, "lr": 1.0321901727187828e-06, "wd": 0.04000033638391426, "mom": 0.9940000126143967, "last_layer_lr": 0.0, "current_batch_size": 20.0, "total_loss": 14.628183364868164, "dino_local_crops_loss": 10.022283554077148, "dino_global_crops_loss": 1.2527868747711182, "koleo_loss": 0.5419921875, "ibot_loss": 2.810994863510132}

.....

{"iteration": 83100, "iter_time": 3.6564724445343018, "data_time": 0.5249921083450317, "lr": 1.8331884971103754e-05, "wd": 0.1538698011829041, "mom": 0.9982701175443589, "last_layer_lr": 1.8331884971103754e-05, "current_batch_size": 20.0, "total_loss": 10.993171691894531, "dino_local_crops_loss": 7.494181156158447, "dino_global_crops_loss": 0.9350773692131042, "koleo_loss": -0.029388427734375, "ibot_loss": 2.5918610095977783}
{"iteration": 83110, "iter_time": 3.742112636566162, "data_time": 0.5908024311065674, "lr": 1.8325394093353296e-05, "wd": 0.15388731492445962, "mom": 0.9982707743096673, "last_layer_lr": 1.8325394093353296e-05, "current_batch_size": 20.0, "total_loss": 10.972248077392578, "dino_local_crops_loss": 7.489750862121582, "dino_global_crops_loss": 0.9348238110542297, "koleo_loss": -0.029571533203125, "ibot_loss": 2.5916199684143066}
{"iteration": 83120, "iter_time": 3.7471721172332764, "data_time": 0.3674837350845337, "lr": 1.8318903870812672e-05, "wd": 0.15390482668699507, "mom": 0.9982714310007623, "last_layer_lr": 1.8318903870812672e-05, "current_batch_size": 20.0, "total_loss": 11.006226539611816, "dino_local_crops_loss": 7.509341239929199, "dino_global_crops_loss": 0.9356716871261597, "koleo_loss": -0.029571533203125, "ibot_loss": 2.5926804542541504}
{"iteration": 83130, "iter_time": 3.761854648590088, "data_time": 0.09775485843420029, "lr": 1.8312414303901845e-05, "wd": 0.15392233646948772, "mom": 0.9982720876176058, "last_layer_lr": 1.8312414303901845e-05, "current_batch_size": 20.0, "total_loss": 11.014735221862793, "dino_local_crops_loss": 7.510159969329834, "dino_global_crops_loss": 0.9359200596809387, "koleo_loss": -0.029449462890625, "ibot_loss": 2.592142105102539}
{"iteration": 83140, "iter_time": 3.8580803871154785, "data_time": 0.04276227951049805, "lr": 1.8305925393040766e-05, "wd": 0.15393984427091512, "mom": 0.9982727441601593, "last_layer_lr": 1.8305925393040766e-05, "current_batch_size": 20.0, "total_loss": 10.995990753173828, "dino_local_crops_loss": 7.500287055969238, "dino_global_crops_loss": 0.9372903108596802, "koleo_loss": -0.0295867919921875, "ibot_loss": 2.5913407802581787}
{"iteration": 83150, "iter_time": 3.751682996749878, "data_time": 0.16627313196659088, "lr": 1.829943713864931e-05, "wd": 0.15395735009025469, "mom": 0.9982734006283845, "last_layer_lr": 1.829943713864931e-05, "current_batch_size": 20.0, "total_loss": 10.992744445800781, "dino_local_crops_loss": 7.498546600341797, "dino_global_crops_loss": 0.9366305470466614, "koleo_loss": -0.0296173095703125, "ibot_loss": 2.5913407802581787}
{"iteration": 83160, "iter_time": 3.728980302810669, "data_time": 0.5257463455200195, "lr": 1.8292949541147337e-05, "wd": 0.15397485392648413, "mom": 0.9982740570222431, "last_layer_lr": 1.8292949541147337e-05, "current_batch_size": 20.0, "total_loss": 11.040926933288574, "dino_local_crops_loss": 7.53302526473999, "dino_global_crops_loss": 0.9391398429870605, "koleo_loss": -0.0296173095703125, "ibot_loss": 2.592230796813965}
{"iteration": 83170, "iter_time": 3.726938247680664, "data_time": 0.48002785444259644, "lr": 1.8286462600954654e-05, "wd": 0.15399235577858122, "mom": 0.9982747133416968, "last_layer_lr": 1.8286462600954654e-05, "current_batch_size": 20.0, "total_loss": 11.04371452331543, "dino_local_crops_loss": 7.533961296081543, "dino_global_crops_loss": 0.9398974776268005, "koleo_loss": -0.02960205078125, "ibot_loss": 2.5900421142578125}
{"iteration": 83180, "iter_time": 4.050873756408691, "data_time": 0.28923410177230835, "lr": 1.8279976318491018e-05, "wd": 0.15400985564552389, "mom": 0.9982753695867072, "last_layer_lr": 1.8279976318491018e-05, "current_batch_size": 20.0, "total_loss": 11.041265487670898, "dino_local_crops_loss": 7.533961296081543, "dino_global_crops_loss": 0.9396708607673645, "koleo_loss": -0.0293426513671875, "ibot_loss": 2.590700149536133}
{"iteration": 83190, "iter_time": 4.258638381958008, "data_time": 0.8373247385025024, "lr": 1.8273490694176158e-05, "wd": 0.15402735352629013, "mom": 0.9982760257572358, "last_layer_lr": 1.8273490694176158e-05, "current_batch_size": 20.0, "total_loss": 11.032852172851562, "dino_local_crops_loss": 7.531869888305664, "dino_global_crops_loss": 0.939987301826477, "koleo_loss": -0.0293426513671875, "ibot_loss": 2.588805675506592}
{"iteration": 83200, "iter_time": 3.8045802116394043, "data_time": 1.1129602193832397, "lr": 1.8267005728429744e-05, "wd": 0.15404484941985802, "mom": 0.9982766818532447, "last_layer_lr": 1.8267005728429744e-05, "current_batch_size": 20.0, "total_loss": 11.032852172851562, "dino_local_crops_loss": 7.535743713378906, "dino_global_crops_loss": 0.9400883316993713, "koleo_loss": -0.0294647216796875, "ibot_loss": 2.5855214595794678}
{"iteration": 83210, "iter_time": 3.729626178741455, "data_time": 1.0515087842941284, "lr": 1.826052142167143e-05, "wd": 0.1540623433252058, "mom": 0.9982773378746952, "last_layer_lr": 1.826052142167143e-05, "current_batch_size": 20.0, "total_loss": 11.02900505065918, "dino_local_crops_loss": 7.531525135040283, "dino_global_crops_loss": 0.9418781399726868, "koleo_loss": -0.0294189453125, "ibot_loss": 2.5855214595794678}
{"iteration": 83220, "iter_time": 3.7799758911132812, "data_time": 1.0355918407440186, "lr": 1.82540377743208e-05, "wd": 0.1540798352413119, "mom": 0.9982779938215492, "last_layer_lr": 1.82540377743208e-05, "current_batch_size": 20.0, "total_loss": 11.012650489807129, "dino_local_crops_loss": 7.503568649291992, "dino_global_crops_loss": 0.9382875561714172, "koleo_loss": -0.0294342041015625, "ibot_loss": 2.584805488586426}
{"iteration": 83230, "iter_time": 3.6318047046661377, "data_time": 0.8730975985527039, "lr": 1.82475547867974e-05, "wd": 0.15409732516715471, "mom": 0.9982786496937683, "last_layer_lr": 1.82475547867974e-05, "current_batch_size": 20.0, "total_loss": 11.015081405639648, "dino_local_crops_loss": 7.518352031707764, "dino_global_crops_loss": 0.9382875561714172, "koleo_loss": -0.02947998046875, "ibot_loss": 2.587644100189209}
{"iteration": 83240, "iter_time": 3.2084033489227295, "data_time": 0.4374796450138092, "lr": 1.8241072459520745e-05, "wd": 0.15411481310171288, "mom": 0.9982793054913143, "last_layer_lr": 1.8241072459520745e-05, "current_batch_size": 20.0, "total_loss": 11.025972366333008, "dino_local_crops_loss": 7.535113334655762, "dino_global_crops_loss": 0.9390830993652344, "koleo_loss": -0.0296630859375, "ibot_loss": 2.589679002761841}
{"iteration": 83250, "iter_time": 3.0241451263427734, "data_time": 0.18751582503318787, "lr": 1.8234590792910296e-05, "wd": 0.1541322990439651, "mom": 0.9982799612141486, "last_layer_lr": 1.8234590792910296e-05, "current_batch_size": 20.0, "total_loss": 11.022649765014648, "dino_local_crops_loss": 7.514020919799805, "dino_global_crops_loss": 0.9350559115409851, "koleo_loss": -0.029571533203125, "ibot_loss": 2.5968191623687744}
{"iteration": 83260, "iter_time": 2.94983172416687, "data_time": 0.18638351559638977, "lr": 1.8228109787385494e-05, "wd": 0.15414978299289014, "mom": 0.9982806168622333, "last_layer_lr": 1.8228109787385494e-05, "current_batch_size": 20.0, "total_loss": 11.001459121704102, "dino_local_crops_loss": 7.49662446975708, "dino_global_crops_loss": 0.934248149394989, "koleo_loss": -0.029266357421875, "ibot_loss": 2.6000168323516846}
{"iteration": 83270, "iter_time": 2.8373987674713135, "data_time": 0.1555553823709488, "lr": 1.8221629443365692e-05, "wd": 0.15416726494746696, "mom": 0.99828127243553, "last_layer_lr": 1.8221629443365692e-05, "current_batch_size": 20.0, "total_loss": 11.018768310546875, "dino_local_crops_loss": 7.529201984405518, "dino_global_crops_loss": 0.9378029108047485, "koleo_loss": -0.029296875, "ibot_loss": 2.5948092937469482}
{"iteration": 83280, "iter_time": 3.3542747497558594, "data_time": 0.6156438589096069, "lr": 1.8215149761270247e-05, "wd": 0.15418474490667466, "mom": 0.9982819279340003, "last_layer_lr": 1.8215149761270247e-05, "current_batch_size": 20.0, "total_loss": 11.030592918395996, "dino_local_crops_loss": 7.530805587768555, "dino_global_crops_loss": 0.940298318862915, "koleo_loss": -0.029266357421875, "ibot_loss": 2.5932984352111816}
@MyHogsmeade

I've encountered a similar issue as well. Do you have any suggestions?

@1921134176

I also trained on my own dataset and found that the loss decreased rapidly at first, then became difficult to decrease further, but the final model's performance was still good. I suggest using frozen-backbone fine-tuning or kNN evaluation to check whether your model has converged. So even though the loss stops decreasing, continuing to train should still move the model in a positive direction.
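For example, a minimal kNN probe on frozen embeddings might look like the sketch below (it assumes CLS features and labels have already been extracted for a small annotated subset; the file names are placeholders):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical arrays: frozen-backbone embeddings plus labels for a small
# annotated subset of the domain-specific data.
features = np.load("embeddings.npy")    # shape (N, embed_dim)
labels = np.load("labels.npy")          # shape (N,)

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=0, stratify=labels
)

# Cosine-distance kNN probe: if this accuracy keeps improving across
# checkpoints, the backbone is still learning even when the SSL loss is flat.
knn = KNeighborsClassifier(n_neighbors=20, metric="cosine")
knn.fit(X_train, y_train)
print("kNN accuracy:", knn.score(X_test, y_test))
```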

@MyHogsmeade

I also trained on my own dataset and found that the loss decreased rapidly at first, then became difficult to decrease further, but the final model's performance was still good. I suggest using frozen-backbone fine-tuning or kNN evaluation to check whether your model has converged. So even though the loss stops decreasing, continuing to train should still move the model in a positive direction.

Thank you for your suggestion. I will try to evaluate the effect through downstream testing.

@zhangjiancheng1024

I have also encountered similar problems. I found that, perhaps because my images are medical images rather than natural images, removing the preprocessing step that normalizes with the ImageNet mean and variance allowed the loss to converge from 11 to around 4.
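As a rough sketch of that change (assuming torchvision-style transforms; the loader below is a placeholder), one can either drop the `Normalize` step from the augmentations or replace the ImageNet constants with statistics computed on the domain images:

```python
import torch

# ImageNet statistics used by the default DINOv2 augmentations; these can be
# a poor fit for non-natural images such as medical scans.
IMAGENET_MEAN, IMAGENET_STD = (0.485, 0.456, 0.406), (0.229, 0.224, 0.225)

def compute_dataset_stats(loader):
    """Estimate per-channel mean/std over an unnormalized image loader."""
    n, mean, sq_mean = 0, torch.zeros(3), torch.zeros(3)
    for images, *_ in loader:            # images: (B, 3, H, W) in [0, 1]
        b = images.size(0)
        mean += images.mean(dim=(0, 2, 3)) * b
        sq_mean += (images ** 2).mean(dim=(0, 2, 3)) * b
        n += b
    mean /= n
    std = (sq_mean / n - mean ** 2).sqrt()
    return mean.tolist(), std.tolist()

# Hypothetical usage: compute domain statistics once, then use them in place
# of IMAGENET_MEAN / IMAGENET_STD in the training transforms (or skip
# normalization entirely, as described above).
# mean, std = compute_dataset_stats(unnormalized_loader)
```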

@natwille1

natwille1 commented Jan 14, 2025

I have also encountered similar problems. I found that, perhaps because my images are medical images rather than natural images, removing the preprocessing step that normalizes with the ImageNet mean and variance allowed the loss to converge from 11 to around 4.
@zhangjiancheng1024

I'm researching a similar scenario - medical images, training a ViT with DINOv2 from scratch. I'm struggling to get training to stabilize - is there a range of hyperparameters you found worked well for you?
