Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

downloader: catch more error situations #98

Open
tiborsimko opened this issue Nov 18, 2020 · 4 comments
Open

downloader: catch more error situations #98

tiborsimko opened this issue Nov 18, 2020 · 4 comments
Assignees

Comments

@tiborsimko
Copy link
Member

Now that we are catching downloading errors via --retry-limit and --retry-sleep parameters thanks to #91, we can build upon this basis and enrich the error handling logic to catch more errors (such as Error 500 received from server, etc).

Example how to reproduce:

  1. Run locally a CERN Open Data instance:
$ cd opendata.cern.ch
$ docker-compose build
$ docker-compose up
$ docker exec -i -t opendatacernch_web_1 ./scripts/populate-instance.sh --skip-records
$ docker exec -i -t opendatacernch_web_1 cernopendata fixtures records --mode insert-or-replace -f cernopendata/modules/fixtures/data/records/cms-tools-higgsexample20112012.json
$ firefox http://localhost/record/5500
  1. Launch downloading of files with big enough retry sleep time:
$ cernopendata-client download-files --recid 5500 --verify  --server http://localhost --retry-sleep 30
  1. While the download is running, bring down the web site.

Current behaviour of the client is to output Python traceback:

==> Verifying file demoanalyzer_cfg_level4data.py...
  -> Expected size 3821, found 3821
  -> Expected checksum adler32:177b49c0, found adler32:177b49c0
==> Downloading file 10 of 11
  -> File: ./5500/mass4l_combine.pdf
Traceback (most recent call last):
  File ".../bin/cernopendata-client", line 8, in <module>
    sys.exit(cernopendata_client())
  File ".../lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File ".../lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File ".../lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File ".../lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File ".../lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File ".../lib/python3.8/site-packages/cernopendata_client/cli.py", line 337, in download_files
    download_single_file(path=path, file_location=file_location, protocol=protocol)
  File ".../lib/python3.8/site-packages/cernopendata_client/downloader.py", line 109, in download_single_file
    c.perform()
pycurl.error: (52, 'Empty reply from server')

Expected behaviour is to:

  • client should catch errors of this kind in some try/except and not exit immediately
  • .. or use pycurl/requests/xrootd client native retry functionalities directly!
  • client should wait 30 seconds for a retry
  • in the meantime, you bring the server up again via docker-compose up
  • client should happily reconnect and download the file
  1. The same test can be performed not only when the server is down, but simulating various responses from the server, such as Error 500, Error 206, etc. This can be done easily by editing Nginx configuration (or the Flask app) to return various kinds of errors.
@tiborsimko
Copy link
Member Author

Example of "native" retry tools:

Haven't looked at what pycurl/xrootd may have built-in. We could always improve our own for _retry in range loop not to check only fro error = ( DOWNLOAD_ERROR_PAGE[..., but turn it into a more complete function catching other exceptional situations.

@tiborsimko
Copy link
Member Author

Example: I just launch two commands quickly after each other, and I saw:

$ wget http://localhost/eos/opendata/atlas/OutreachDatasets/2020-01-22/GamGam.zip
--2020-11-19 11:56:39--  http://localhost/eos/opendata/atlas/OutreachDatasets/2020-01-22/GamGam.zip
Resolving localhost (localhost)... ::1, 127.0.0.1
Connecting to localhost (localhost)|::1|:80... connected.
HTTP request sent, awaiting response... 403 Forbidden
2020-11-19 11:56:40 ERROR 403: Forbidden.

$ $ wget http://localhost/eos/opendata/atlas/OutreachDatasets/2020-01-22/GamGam.zip
--2020-11-19 11:56:44--  http://localhost/eos/opendata/atlas/OutreachDatasets/2020-01-22/GamGam.zip
Resolving localhost (localhost)... ::1, 127.0.0.1
Connecting to localhost (localhost)|::1|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1594567234 (1.5G) [application/zip]
Saving to: ‘GamGam.zip’

The 403 is a rather frequent error by the underlying reverse proxy that we also worked around in #91 via size and checksum.

After we introduce exit status based retry, we could probably remove that check-error-page-size-and-checksum hack.

@ParthS007 ParthS007 self-assigned this Nov 20, 2020
@tiborsimko
Copy link
Member Author

Example: 206 Partial Content situation occurred several times while downloading one big file via wget:

$ wget http://opendata.cern.ch/eos/opendata/cms/MonteCarlo2011/Summer11LegDR/BuToKstarMuMuV2_EtaPtFilter_7TeV-pythia6-evtgen/AODSIM/PU_S13_START53_LV6-v1/00000/000043B3-0E0B-E411-B2FE-0026189438E9.root
--2020-06-24 14:53:19--  http://opendata.cern.ch/eos/opendata/cms/MonteCarlo2011/Summer11LegDR/BuToKstarMuMuV2_EtaPtFilter_7TeV-pythia6-evtgen/AODSIM/PU_S13_START53_LV6-v1/00000/000043B3-0E0B-E411-B2FE-0026189438E9.root
Resolving opendata.cern.ch (opendata.cern.ch)... 188.184.93.89, 188.185.82.144, 188.184.28.138
Connecting to opendata.cern.ch (opendata.cern.ch)|188.184.93.89|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4011567937 (3.7G) [application/octet-stream]
Saving to: ‘000043B3-0E0B-E411-B2FE-0026189438E9.root.1’

000043B3-0E0B-E411-B2FE-0026189438E9.root.1                         26%[=========================================>                                                                                                                    ]   1.01G  7.82MB/s    in 87s

2020-06-24 14:54:46 (11.9 MB/s) - Connection closed at byte 1081769706. Retrying.

--2020-06-24 14:54:47--  (try: 2)  http://opendata.cern.ch/eos/opendata/cms/MonteCarlo2011/Summer11LegDR/BuToKstarMuMuV2_EtaPtFilter_7TeV-pythia6-evtgen/AODSIM/PU_S13_START53_LV6-v1/00000/000043B3-0E0B-E411-B2FE-0026189438E9.root
Connecting to opendata.cern.ch (opendata.cern.ch)|188.184.93.89|:80... connected.
HTTP request sent, awaiting response... 206 Partial Content
Length: 4011567937 (3.7G), 2929798231 (2.7G) remaining [application/octet-stream]
Saving to: ‘000043B3-0E0B-E411-B2FE-0026189438E9.root.1’

000043B3-0E0B-E411-B2FE-0026189438E9.root.1                         29%[++++++++++++++++++++++++++++++++++++++++++===>                                                                                                                ]   1.09G  11.5MB/s    in 6.7s

2020-06-24 14:54:54 (12.6 MB/s) - Connection closed at byte 1169634135. Retrying.

--2020-06-24 14:54:56--  (try: 3)  http://opendata.cern.ch/eos/opendata/cms/MonteCarlo2011/Summer11LegDR/BuToKstarMuMuV2_EtaPtFilter_7TeV-pythia6-evtgen/AODSIM/PU_S13_START53_LV6-v1/00000/000043B3-0E0B-E411-B2FE-0026189438E9.root
Connecting to opendata.cern.ch (opendata.cern.ch)|188.184.93.89|:80... connected.
HTTP request sent, awaiting response... 206 Partial Content
Length: 4011567937 (3.7G), 2841933802 (2.6G) remaining [application/octet-stream]
Saving to: ‘000043B3-0E0B-E411-B2FE-0026189438E9.root.1’

000043B3-0E0B-E411-B2FE-0026189438E9.root.1                         56%[++++++++++++++++++++++++++++++++++++++++++++++=========================================>                                                                      ]   2.09G  15.1MB/s    in 88s

2020-06-24 14:56:24 (11.7 MB/s) - Connection closed at byte 2248454684. Retrying.

--2020-06-24 14:56:27--  (try: 4)  http://opendata.cern.ch/eos/opendata/cms/MonteCarlo2011/Summer11LegDR/BuToKstarMuMuV2_EtaPtFilter_7TeV-pythia6-evtgen/AODSIM/PU_S13_START53_LV6-v1/00000/000043B3-0E0B-E411-B2FE-0026189438E9.root
Connecting to opendata.cern.ch (opendata.cern.ch)|188.184.93.89|:80... connected.
HTTP request sent, awaiting response... 206 Partial Content
Length: 4011567937 (3.7G), 1763113253 (1.6G) remaining [application/octet-stream]
Saving to: ‘000043B3-0E0B-E411-B2FE-0026189438E9.root.1’

000043B3-0E0B-E411-B2FE-0026189438E9.root.1                         82%[++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++==========================================>                           ]   3.10G  13.1MB/s    in 82s

2020-06-24 14:57:49 (12.6 MB/s) - Connection closed at byte 3328930017. Retrying.

--2020-06-24 14:57:53--  (try: 5)  http://opendata.cern.ch/eos/opendata/cms/MonteCarlo2011/Summer11LegDR/BuToKstarMuMuV2_EtaPtFilter_7TeV-pythia6-evtgen/AODSIM/PU_S13_START53_LV6-v1/00000/000043B3-0E0B-E411-B2FE-0026189438E9.root
Connecting to opendata.cern.ch (opendata.cern.ch)|188.184.93.89|:80... connected.
HTTP request sent, awaiting response... 206 Partial Content
Length: 4011567937 (3.7G), 682637920 (651M) remaining [application/octet-stream]
Saving to: ‘000043B3-0E0B-E411-B2FE-0026189438E9.root.1’

000043B3-0E0B-E411-B2FE-0026189438E9.root.1                         85%[+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++===>                       ]   3.20G  6.88MB/s    in 10s

2020-06-24 14:58:04 (9.83 MB/s) - Connection closed at byte 3433695865. Retrying.

--2020-06-24 14:58:09--  (try: 6)  http://opendata.cern.ch/eos/opendata/cms/MonteCarlo2011/Summer11LegDR/BuToKstarMuMuV2_EtaPtFilter_7TeV-pythia6-evtgen/AODSIM/PU_S13_START53_LV6-v1/00000/000043B3-0E0B-E411-B2FE-0026189438E9.root
Connecting to opendata.cern.ch (opendata.cern.ch)|188.184.93.89|:80... connected.
HTTP request sent, awaiting response... 206 Partial Content
Length: 4011567937 (3.7G), 577872072 (551M) remaining [application/octet-stream]
Saving to: ‘000043B3-0E0B-E411-B2FE-0026189438E9.root.1’

000043B3-0E0B-E411-B2FE-0026189438E9.root.1                        100%[+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++======================>]   3.74G  14.6MB/s    in 36s

2020-06-24 14:58:44 (15.5 MB/s) - ‘000043B3-0E0B-E411-B2FE-0026189438E9.root.1’ saved [4011567937/4011567937]

@ParthS007
Copy link
Member

Musing after researching on the Issue:

There are two things we need to address

  • Error handling
  • Retry functionality

Every library we use(pycurl/requests/xrootdpyfs) has their own way of handling these two things.

  1. pycurl

    • Error handling:
      • SO Answer(Basic error handling)
      • Using pycurl.error and getinfo for different HTTP Codes
    • Retry:
      • Add a retry loop on WRITEDATA
      • Using pycurl.TIMEOUT and/or pycurl.CONNECTTIMEOUT or something similar(haven't tested yet)
  2. requests

    • Error handling
      • Add general try exception handling
      • Display messages according to the error
    • Retry
      • Add retry loop on requests.get
      • Use HTTPAdapter and retry from requests library - Blog link
  3. xrootdpyfs

    • Error handling
      • Add general try exception handling
      • Display messages according to error
    • Retry

Please let me know if I can break those steps as well!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants