error in migration to production instance #70

Closed
belforte opened this issue Jul 28, 2022 · 11 comments

@belforte
Member

Following up on some terminally failed migrations from prod/global to prod/phys03:
it seems the servers do not agree on key names inside blocks.
I do not know whether this has something to do with the new dbs2go server.
Sorry if I posted in the wrong place.

belforte@vocms0750/dbs-logs> pwd
/cephfs/product/dbs-logs
belforte@vocms0750/dbs-logs> grep -C5 3310726  dbsmigration-20220728-dbsmigration-5d784c775d-dncdm.log 
--------------------getResource--  Thu Jul 28 04:20:29 2022 Migration request ID: 3310726
--------------------Thu Jul 28 04:20:30 2022 Inserting block: /DYToLL_M-50_TuneCP5_14TeV-pythia8/Run3Summer21DRPremix-120X_mcRun3_2021_realistic_v6-v2/GEN-SIM-DIGI-RAW#1b483377-b507-4249-b2b9-4eaa5b9723d0 for request id: 3310726
Thu Jul 28 04:20:30 2022 dbsException-invalid-input2: DBSBlockInsert/FileParents:		KeyError exception: parent_file_id. 
Traceback (most recent call last):
  File "/data/srv/HG2112c/sw/slc7_amd64_gcc630/cms/dbs3-migration/3.16.0-comp5/lib/python2.7/site-packages/dbs/business/DBSBlockInsert.py", line 258, in insertBlockFile
    del fileParentList[k]['parent_file_id']
KeyError: 'parent_file_id'
belforte@vocms0750/dbs-logs> 

@vkuznet
Contributor

vkuznet commented Jul 28, 2022

Stefano, this is a difference in obsolete information returned by the blockdump API of the DBSReader server. The parent_file_id is not returned by the Go-based server since it can't be used anywhere during migration, and it is obsolete because the parent LFN is returned. That said, the Python DBSReader server does return it. The issue here is that the DBSMigration code tries to delete it without checking whether the key exists in the dictionary.

You can see it for yourself:

blk=/DYToLL_M-50_TuneCP5_14TeV-pythia8/Run3Summer21DRPremix-120X_mcRun3_2021_realistic_v6-v2/GEN-SIM-DIGI-RAW%231b483377-b507-4249-b2b9-4eaa5b9723d0

# output of python DBSReader
https://cmsweb.cern.ch/dbspy/prod/global/DBSReader/blockdump?block_name=$blk

# if you save it and parse it you'll see something like
      "parent_file_id": 761940581,
      "parent_logical_file_name": "/store/mc/Run3Summer21GS/DYToLL_M-50_TuneCP5_14TeV-pythia8/GEN-SIM/120X_mcRun3_2021_realistic_v5-v2/30002/50d225ed-dd41-4fa4-aebe-8c0805fdc9b5.root"

Now, the output of the Go-based server is the following:

https://cmsweb.cern.ch/dbs/prod/global/DBSReader/blockdump?block_name=$blk

and if you parse it you'll see that there is no parent_file_id in it. Since the information is redundant, I did not include it in the query (in fact it increases the size of the returned JSON object, and the more parent files exist, the larger the output becomes).
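One way to confirm the difference is to collect the key names from each saved blockdump and compare the two sets. The sketch below is illustrative only: `collect_keys` is a hypothetical helper, and the two sample documents are hand-written stand-ins rather than real server output (the `file_parent_list` wrapper and the placeholder LFN are assumptions).

```python
import json


def collect_keys(node, found=None):
    """Recursively gather every dictionary key name in a parsed JSON document."""
    if found is None:
        found = set()
    if isinstance(node, dict):
        for key, value in node.items():
            found.add(key)
            collect_keys(value, found)
    elif isinstance(node, list):
        for item in node:
            collect_keys(item, found)
    return found


# Hand-written stand-ins for the two saved blockdump outputs (assumed layout).
py_dump = json.loads(
    '{"file_parent_list": [{"parent_file_id": 761940581,'
    ' "parent_logical_file_name": "/store/placeholder/parent.root"}]}'
)
go_dump = json.loads(
    '{"file_parent_list": [{"parent_logical_file_name": "/store/placeholder/parent.root"}]}'
)

# Keys present only in the Python server output.
only_in_python = collect_keys(py_dump) - collect_keys(go_dump)
print(sorted(only_in_python))  # ['parent_file_id']
```

With real data, `py_dump` and `go_dump` would be loaded from the files saved from the two URLs above.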

Therefore, we have two solutions here:

  • either fix DBSMigration.py to check for this key and delete it only if it exists (this is my preferred solution), or
  • add parent_file_id back into the Go-based DBSReader server for compatibility with the Python one. As I pointed out, it is redundant information and increases the size of the returned JSON object, so I would rather avoid this.
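The first option amounts to making the delete tolerant of a missing key. A minimal sketch, assuming the file parent list is a list of dictionaries as the traceback suggests; `strip_parent_file_id` is a hypothetical helper and not the actual change made to the DBS code, and the LFN is a placeholder:

```python
def strip_parent_file_id(file_parent_list):
    """Remove 'parent_file_id' from every entry, ignoring entries that lack it."""
    for entry in file_parent_list:
        # dict.pop with a default never raises KeyError, unlike `del entry[key]`.
        entry.pop("parent_file_id", None)
    return file_parent_list


# Entry shaped like the Python server output (key present).
python_style = [{"parent_file_id": 761940581,
                 "parent_logical_file_name": "/store/placeholder/parent.root"}]
# Entry shaped like the Go server output (key absent).
go_style = [{"parent_logical_file_name": "/store/placeholder/parent.root"}]

cleaned_py = strip_parent_file_id(python_style)
cleaned_go = strip_parent_file_id(go_style)
```

Both inputs end up identical after cleaning, so the same migration code could consume the output of either server.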

Please let me know which solution you think is appropriate from your point of view. I can simply adjust the DBSMigration.py codebase, but that will require a new release and I don't know how easy or hard it would be to get it and use it with CRAB. And if we want to adjust the Go-based server, I'll need to roll a new DBS server release, test and validate it, and put it on production k8s, which will take time too.

Please note, I'll be on vacation for 2.5 weeks starting tomorrow.

@belforte
Member Author

Thanks Valentin
I surely understand, and I thought we had already hit this difference in keys between the new Reader and old Migration servers.
Anyhow...
Deploying a new Python-based DBSMigration server "today" is IMHO out of the question.
A quick fix to the Go-based Reader may do, but how often does this particular problem hit? I do not know. So this is up to you.

I do not like that publication in the production instance is degrading more and more, with an increasing number of tasks which I need to blacklist from publication. That was not expected, of course.

But I see no practical alternative to letting things degrade further, while I keep testing publication to testbed in parallel, and we can complete the update of the migration server in production when you are back. The main reason for failure is the error
Required Configuration application name, release version, pset hash and global tag: cmsRun, CMSSW_10_1_8 ,GIBBERISH,fake not found in DB
which we discussed already. There our best hope is to switch to the new server, and if that is not enough to fix the DB inconsistency, someone will have to fix the DB by hand.

@vkuznet
Contributor

vkuznet commented Jul 28, 2022

During the migration from Python to Go, I not only try to preserve existing functionality but also to improve things. I doubt we need to add to the new implementation things which are either redundant, wrong, or inefficient. As such, a certain amount of adjustment is required and I try my best to hide this from clients. The migration of the reader and writer went quite smoothly, but I need to admit that the DBS Migration server (which turns out to rely on both the reader and writer implementations) is not ideal.

That said, I don't think it is a good idea to upgrade DBS servers on production nodes a day before I go on vacation. As such, things should wait until I am back. Meanwhile, to do the right thing and make things compatible, I'll provide a fix to DBSMigration.py and create a new set of RPMs. I'll ask Muhammad to pick it up in the next round of upgrades on VMs.

@belforte
Member Author

I fully agree with "no changes now", but I would also rather stick with the evil I know and not make any changes while you are away. I am sure you have many things to do. Fixing code that we'll pitch in one month does not sound like a good investment of your time.

@vkuznet
Contributor

vkuznet commented Jul 28, 2022

I created a separate PR for the DBSMigration codebase, see dmwm/DBS#661, which addresses this issue.

@vkuznet
Contributor

vkuznet commented Jul 28, 2022

I also changed the DBS Go-based code, see #71, which now puts parent_file_id back into the bulkblocks API. The new release is deployed on testbed and I'll be ready to test it once I am back from vacation.

At that time I'll decide which will be simpler:

  • upgrade the Python-based DBS migration codebase. We'll need new RPMs and to pass all DBS Python tests (I checked the code and the most recent tag dates from 2018), or
  • deploy the new version of the Go-based code with the redundant parent_file_id info as a temporary solution, and clean it up once we perform the migration. This may be the easiest solution, as it provides backward compatibility between the Python and Go servers, but we'll need to revert the changes once we switch the DBSMigrate/DBSMigration servers.

@belforte
Member Author

Thanks Valentin. I hope that by the time you come back we will have enough evidence that things work to put the new DBSMigration in production and go on from there. So far no new problem has popped up with the new migration server.

@vkuznet
Contributor

vkuznet commented Aug 16, 2022

Stefano, could you please give me an update on migration requests during my absence, and let me know if there is anything I need to address?

@belforte
Member Author

I have not seen any other problems with the servers since you fixed #74,
only some occasional HTTP 400 errors which went away when trying again later. I will improve logging so that if/when this happens again we have the full message. But I am still in parasitic mode: I am not using the result of publications in testbed for bookkeeping at the file level, so something may slip through.
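The "retry later, but log the full message" idea could look roughly like the sketch below. Everything here is hypothetical and not taken from the CRAB or DBS client code: `HTTPError` is a stand-in exception type that carries the response body, and `call_with_retry` is an illustrative helper.

```python
import logging
import time

logger = logging.getLogger("publication")


class HTTPError(Exception):
    """Hypothetical error type carrying the status code and full response body."""

    def __init__(self, status, body):
        super(HTTPError, self).__init__("HTTP %s" % status)
        self.status = status
        self.body = body


def call_with_retry(request, attempts=3, delay=0.0):
    """Call request(); on HTTPError log the full body, then retry up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return request()
        except HTTPError as err:
            # Log the whole body so the server's explanation is not lost.
            logger.warning("attempt %d/%d failed: HTTP %s, body: %s",
                           attempt, attempts, err.status, err.body)
            if attempt == attempts:
                raise  # give up and surface the last error to the caller
            time.sleep(delay)
```

In practice `request` would wrap the actual publication call, and `delay` would be set to something longer than zero.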

@belforte
Member Author

still no new problems

@vkuznet
Contributor

vkuznet commented Aug 23, 2022

closing as resolved before migration to k8s.

@vkuznet vkuznet closed this as completed Aug 23, 2022