Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DEV] Suspicious replica recoveror: Configure conveyor-finisher such that it starts marking replicas suspicious #692

Closed
haozturk opened this issue Dec 19, 2023 · 15 comments
Assignees
Labels
dev development activity

Comments

@haozturk
Copy link
Contributor

Context: https://indico.cern.ch/event/1356295/
Needed for #403

I suggest to put the following patterns to the config in integration and watch how it goes:

- No such file or directory,
- CHECKSUM MISMATCH Source and
destination checksums do not match,
- SOURCE CHECKSUM MISMATCH,
- Unable to read file - wrong file checksum,
- checksum verification failed

Even if we do it in production, it should do nothing but marking some replicas suspicious depending on transfer errors, which has no effect as long as suspicious-replica-recoveror daemon is not running. But let's not do anything in production for now, especially before the xmas break @ericvaandering @dynamic-entropy any thoughts?

@haozturk haozturk added the dev development activity label Dec 19, 2023
@haozturk haozturk self-assigned this Dec 19, 2023
@haozturk haozturk added this to the Recover suspicious replicas milestone Dec 19, 2023
@haozturk
Copy link
Contributor Author

I've added the suspicious pattern to the finisher config in the integration:

{"message": "Suspicious patterns: ['.*No such file or directory.*', '.*no such file or directory.*', '.*CHECKSUM MISMATCH Source and destination checksums do not match.*', '.*SOURCE CHECKSUM MISMATCH.*', '.*Unable to read file - wrong file checksum.*', '.*checksum verification failed.*', '.*direct_access.*', '.*Copy failed with mode 3rd pull', 'with error: Transfer failed: failure: Server returned nothing.*', '.*HTTP 404 : File not found.*']", "error": {"type": null, "message": null, "stack_trace": null}, "log": {"logger": "root", "level": "DEBUG"}, "process": {"pid": 9}, "@timestamp": "2024-01-19T12:26:05.379Z"}

Now working on producing relevant transfer errors to see whether finisher will be able to mark them as suspicious

@haozturk
Copy link
Contributor Author

I produced some stuck rules with checksum problems in the integration. The finisher is able to spot them, however not able to mark them as suspicious

{"log": {"logger": "root", "level": "DEBUG"}, "@timestamp": "2024-01-23T14:06:23.430Z", "process": {"pid": 9}, "message": "Checking suspicious file for request: 23c97c6470fd44d88d1dfa4abce0ffa9, transfer error: RequestErrMsg.TRANSFER_FAILED:TRANSFER [5] SOURCE CHECKSUM MISMATCH Source and user-defined ADLER32 do not match (6bc75436 != 18134cde)", "error": {"type": null, "message": null, "stack_trace": null}}
{"log": {"logger": "root", "level": "DEBUG"}, "@timestamp": "2024-01-23T14:06:23.433Z", "process": {"pid": 9}, "message": "Found suspicious urls: ['davs://eoscms.cern.ch:443/eos/cms/store/backfill/1/DMWM_Test/RelValSinglePiE50HCAL/GEN-SIM/TC_SLC7_CMS_Home_IDRv5d-v11/00000/99141D34-7C33-3747-B330-E16940FA6DAB.root']", "error": {"type": null, "message": null, "stack_trace": null}}

Problem seems to be in the get_pfn_to_rse function [1] which cannot obtain the rse from the pfns. Looking deeper.

[1] https://github.com/rucio/rucio/blob/e819e3c001d8613076728ffd8f424fa3f2bdc812/lib/rucio/core/replica.py#L445

@haozturk
Copy link
Contributor Author

Okay, I confirm that the reason why conveyor-finisher cannot mark a replica suspicious in the bad_replicas table is this ATLAS specific code [1]

In particular the did name is computed incorrectly. For instance, the replicas variable gets the following value

[{'scope': store, 'name': '9E814597-78C2-CE44-BDAB-C1F333E4BAF8.root', 'rse_id': '0f87087236244c4ca443472498e0aef2', 'state': <BadFilesStatus.SUSPICIOUS: 'S'>}]

for the following pfn

davs://eoscms.cern.ch:443/eos/cms/store/temp/sitetest/store/mc/RunIISummer20UL17MiniAODv2/GluGluSpin0ToGammaGamma_W-0p014_M-650_TuneCP2_13TeV_pythia8/MINIAODSIM/106X_mc2017_realistic_v9BACKFILL-v3/00000/9E814597-78C2-CE44-BDAB-C1F333E4BAF8.root

where the name of the replica is incorrect.

@ericvaandering @voetberg you mentioned that this is already known. Should we handle this as a 3rd step in [2] or is there already a ticket open for this in Rucio?

[1] https://github.com/rucio/rucio/blob/e819e3c001d8613076728ffd8f424fa3f2bdc812/lib/rucio/core/replica.py#L273
[2] rucio/rucio#6396

@voetberg
Copy link

@haozturk Yep, this is already known and there's an open PR with a solution. Issue rucio/rucio#6363 and PR rucio/rucio#6444. It's based on vo's instead of config settings but should have similar results.

@haozturk
Copy link
Contributor Author

Ah, thanks a lo @voetberg . I'll keep this issue in Waiting then. I can re-test it once the PR is merged and installed in the integration cluster.

@haozturk
Copy link
Contributor Author

Hi @voetberg I patched integration cluster with your PR 6444 for testing and I see that we still can't mark replicas suspicious. This is the relevant snippet from the finisher logs:

{"@timestamp": "2024-05-23T14:45:50.229Z", "log": {"logger": "root", "level": "DEBUG"}, "message": "Checking suspicious file for request: 22cc26303cb9423f8d67f273e263e476, transfer error: RequestErrMsg.TRANSFER_FAILED:TRANSFER [93] Protocol not supported or path/url invalid: gsiftp://gftp.t2.ucsd.edu:2811/hadoop/cms/phedex/store/test/rucio/int/store/data/Run2016F/MuonEG/MINIAOD/HIPM_UL2016_MiniAODv2-v2/280000", "error": {"type": null, "message": null, "stack_trace": null}, "process": {"pid": 9}}
{"@timestamp": "2024-05-23T14:45:50.233Z", "log": {"logger": "root", "level": "DEBUG"}, "message": "Found suspicious urls: ['srm://cmsdcatape.fnal.gov:8443/srm/managerv2?SFN=/11/store/data/Run2016F/MuonEG/MINIAOD/HIPM_UL2016_MiniAODv2-v2/280000/B3FF92F9-855D-5144-BA28-3877560A93B2.root']", "error": {"type": null, "message": null, "stack_trace": null}, "process": {"pid": 9}}
{"@timestamp": "2024-05-23T14:45:50.253Z", "log": {"logger": "root", "level": "DEBUG"}, "message": "gfal.Default: parsing 1 pfns", "error": {"type": null, "message": null, "stack_trace": null}, "process": {"pid": 9}}

and this shows that we couldn't mark them suspicious:

$ rucio-int list-suspicious-replicas
RSE Expression:    Scope:    Created at:    Nattempts:    File Name:
-----------------  --------  -------------  ------------  ------------

rucio-int is an alias to alias rucio-int="rucio -H http://cms-rucio-int.cern.ch --auth-host https://cms-rucio-auth-int.cern.ch"

Can you please help me understand what's the issue?

@voetberg
Copy link

6444 only covers correctly parsing the scope and name in "declare bad replicas". Did you also patch with rucio/rucio#6764 ?

@haozturk
Copy link
Contributor Author

6764 shouldn't be necessary to mark replicas suspicious. The suspicious replica recoveror daemon doesn't mark replicas suspicious. It processes the replicas that are marked suspicious and declares them bad when necessary. I'm testing out the parts which mark replicas suspicious which happens in two places in rucio: kronos and conveyor-finisher. In the context of this issue, I'm testing the latter and simply this line [1] doesn't work

[1] https://github.com/rucio/rucio/blob/master/lib/rucio/daemons/conveyor/finisher.py#L373

@voetberg
Copy link

Ah okay I misunderstood you there - was looking for the obvious solution.

Then I'm not immediately sure, I'm as stumped as you are. Can you verify the test rse is deterministic? Or else I can jump in and start adding logging statements everywhere to see where this gets stuck

@haozturk
Copy link
Contributor Author

Thanks Maggie, yes it's deterministic:

$ rucio-admin-int rse info T1_US_FNAL_Disk | grep determ
  deterministic: True

It'd indeed help adding more logging. Current logs don't help much. I couldn't figure out an easy way to add logging lines. Probably we need to add new log lines, make a temporary patch, deploy it and watch what's going on.

@haozturk
Copy link
Contributor Author

haozturk commented May 27, 2024

Hi @voetberg I've found what's wrong. scope and name extraction from pfn looks problematic. For this pfn

srm://cmsdcatape.fnal.gov/11/store/data/Run2016F/MuonEG/MINIAOD/HIPM_UL2016_MiniAODv2-v2/280000/B3FF92F9-855D-5144-BA28-3877560A93B2.root

it yields

name: B3FF92F9-855D-5144-BA28-3877560A93B2.root
scope: 11

whereas it should've been

scope = "cms"
name = "/store/data/Run2016F/MuonEG/MINIAOD/HIPM_UL2016_MiniAODv2-v2/280000/B3FF92F9-855D-5144-BA28-3877560A93B2.root

which is computed here: https://github.com/rucio/rucio/blob/cf9c2c7afad37fc5b7bdd782e281918b4230b2f1/lib/rucio/core/replica.py#L275

As far as I'm aware scope cannot be extracted from PFN in CMS. Once you know the lfn and the rse_id, you can get it from the replicas table. If there are multiple replicas with the same name and rse_id but different scopes, we can default to cms which is the scope we use for all centrally produced data. Maybe @ericvaandering has a better idea?

Does it sound reasonable? You want me to fix this upstream?

@voetberg
Copy link

@haozturk The way the code is set up, you may be able to write a plugin that just returns "cms" for the scope for any pfn. Right now it's written so that it uses VO to determine the code to use in the scope from pfn. I think it would be as simple as patching in

class CMSScopeTranslation(RSEDeterministicScopeTranslation):

    def __init__(self, vo: str = 'def'):
        super().__init__()
        self.register("cms", CMSScopeTranslation._cms)   # <- Name here corresponds to vo

    @static_method
    def _cms(pfn:str) -> str: 
          return "cms"

@haozturk
Copy link
Contributor Author

Thanks @voetberg indeed this would be the most convenient. However, I'm a bit worried for the exceptions where scope isn't cms. If @ericvaandering has no other suggestion, I can apply it.

Lastly, we need to change the way we extract the name/lfn, but that should be easy.

@dynamic-entropy
Copy link
Contributor

Igor did something similar here, rucio/rucio#6350
Not sure if it's entirely the same though.

@haozturk
Copy link
Contributor Author

We fixed the relevant issues in upstream rucio. This is done and deployed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dev development activity
Projects
None yet
Development

No branches or pull requests

3 participants