Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Report redirects as broken links #25

Closed
gromakovsky opened this issue Feb 25, 2020 · 21 comments · Fixed by #264
Closed

Report redirects as broken links #25

gromakovsky opened this issue Feb 25, 2020 · 21 comments · Fixed by #264
Assignees
Labels
improvement Updates existing functionality
Milestone

Comments

@gromakovsky
Copy link
Member

Clarification and motivation

As far as I understand currently xrefcheck treats redirects as valid links, but usually they indicate that something is wrong. If you have a link x in your md file and it redirects to y, it will usually be the case that:

  1. You can safely replace x with y and eliminate unnecessary redirection step.
  2. y is some page that says "x does not exist, but you can enjoy z right here" (and does not make xrefcheck fail).

Acceptance criteria

By default 3xx codes should be reported as broken links (I don't know these codes well enough, maybe some specific codes are ok, you are welcome to consider them ok if you find it better).

Optional: make it possible (add a flag or something) to say that 3xx codes are valid.

Note that there is another (private) issue about ignoring specific broken links.

@Martoon-00
Copy link
Member

I encountered an interesting case with redirects:
3. For example, when accessing https://github.com/serokell/xrefcheck/issues/new/choose page usually I get it opened fine with 200, but if I'm not authorized (and xrefcheck is not), I get redirected to the login page.
Till now I was thinking that all problems with not being authorized could be mitigated just via treating 401 as ok, but now it gets ambiguous.

How do you think, is it the problem on that page and we should use some another page, or xrefcheck should be able to handle such cases? If the latter holds, would it be enough if I allow specifying exclusions list?

@Martoon-00
Copy link
Member

There is another possible solution to this which may be handy in general, that is, allow local errors suppressions.

For instance, prefixing a link with <!-- xrefcheck: allow-redirect --> comment allows this link to produce 302.

@kirelagin
Copy link

I would suggest treating 302, 304, and 307 as ok.

@kirelagin
Copy link

The semantics of 302 and 307 are temporary redirects, so if you get them, it means that, most likely, the URL is valid, it is just altered temporarily, but for future requests you should use this same URL.

304 is only used for caching, so it is equivalent to 2xx for our purposes.

@Martoon-00
Copy link
Member

Martoon-00 commented May 14, 2020

What about the case with gitlab, that on access to non-existing file it returns "302 Found" and redirects to the root of repo?

I guess I should allow adding it an exception written down in config then 🤔

@kirelagin
Copy link

As I said, I am for just accepting 302, 304, and 307 as valid unconditionally.

@Martoon-00
Copy link
Member

Ah okay, I got your point.

@gromakovsky what do you think?

@gromakovsky
Copy link
Member Author

I guess I should allow adding it an exception written down in config then thinking

I think this is what should be done. I am not sure what's the best syntax for that, but I believe you will figure it out :)
Probably the default config should come with this exception for gitlab because we are already treating gitlab specially there (mention some gitlab-specific files).

@Martoon-00 Martoon-00 added the improvement Updates existing functionality label Aug 25, 2021
@Martoon-00 Martoon-00 mentioned this issue Nov 29, 2022
3 tasks
@aeqz
Copy link
Contributor

aeqz commented Nov 29, 2022

While investigating #218, we have seen that xrefcheck does not consider redirects as valid links. The HTTP client is configured to automatically follow redirects so:

  • If it ends up reaching a response that is considered as valid, then it considers the link as valid.
  • Otherwise, as in the gitlab case, it considers the link as invalid.

If we currently configure the client to not follow redirects, then any redirect link is considered as invalid.

This changes the point of view that we can take for configurations:

  1. Do not follow redirects and consider them as valid.
  2. Do not follow redirects and consider them as invalid.
  3. Follow redirects and decide depending on the reached link (current behaviour).

As @gromakovsky pointed out, if xrefcheck reports an error due to a redirect from x to y, usually x can be replaced by y for fixing that, so I would prefer the second option by default instead of the third one. The difficult part would be then to come up with a practical way to configure xrefcheck to ignore problematic links (e.g. gitlab requiring authentication).

@Martoon-00
Copy link
Member

Let me first leave my thoughts on permanent redirects like with 301 error code, and I want to discuss 302 after that.

I personally like the logic of "if x redirects me to y, better substitute it with y". In this case, if we detect e.g. 301 response code, we should report this as error and suggest the user to replace the link.

However I'm thinking about a couple of edge cases when this does not seem to be perfect:

  • A temporarily-living link: perhaps there are cases when a concise link redirects to a temporarily-living verbose link, I remember a behaviour like that in sites serving as file storages, but I'm not sure. In this case there is no way to provide a link that would not result in a redirect.
  • Maybe some people prefer using URL-shorteners, e.g. to keep their documentation shorter. And they are ready to sacrifice bits of the traffic of their readers 🙈

How much this motivation makes sense?

If it sounds reasonable, then probably we want the user to have some control over how redirects are treated, and e.g. let'em say: "Expect a redirect here, I know what I'm doing". But I agree that the 2nd option from the comment above should rather be the default behaviour when getting 301.

a practical way to configure xrefcheck to ignore problematic links (e.g. gitlab requiring authentication).

Yeah, this is a concern for another issue.

Currently, we treat authentication errors as valid links, because one of the main goals - use in CI - assumes that we at all costs avoid false detection of invalid links. But I hope that one day we will also let the users to handle such links in a smarter way.

@aeqz
Copy link
Contributor

aeqz commented Nov 30, 2022

let'em say: "Expect a redirect here, I know what I'm doing". But I agree that the 2nd option from the comment above should rather be the default behaviour when getting 301.

It seems like a way of doing this in the same configuration style that is being used in the project would be with a new list in the configuration file, like redirectExternalRefsTo, and with an annotation like <!-- xrefcheck: redirect link -->. We could try to set de defaults in a way for this to be required only when dealing with edge cases.

I am thinking about a couple of possible default behaviours:

  • Do not follow redirects, fail for response codes that denote a permanent redirect and pass for codes that denote other kind of redirects.
  • Try first without following redirects, fail for response codes that do not denote a temporary redirect and try following redirects otherwise.

The first one is simpler and will lead to less false positives. The second one will keep failing for the gitlab case because it responds with a 503 after redirect.

@Martoon-00
Copy link
Member

would be with a new list in the configuration file, like redirectExternalRefsTo, and with an annotation like <!-- xrefcheck: redirect link -->.

Yep, looks so. Let's add these two ways to configure things.

@Martoon-00
Copy link
Member

Now on the temporary redirect error codes (302 and others).

I tried to read about it and understand a few use cases for it (yeah, this area is quite new to me). Those include (from this article):

  1. Some products are out of stock, and redirection follows to a page with other suggestions; currently the original page is unavailable, but later it may be back to live.

    Probably what GitLab does is similar to this scenario, and when a link refers to a non-existing file then 302 is returned meaning "this file is not here currently, but maybe it will appear later", and I get redirected to the repository root (example).

    I think, here we can report the link as invalid as this "maybe appear later" does not promise anything, more likely the link will never become valid; same logic holds for shops too. But it's true, by doing so we can produce many false positives too.

  2. Use 302 where 301 should in fact be used, in order to deceive Google's ranking system (kek).

    In such cases, we would like to treat 302s as 301s and probably report an invalid link.

  3. A/B testing: some particular users may get 302 redirecting them to the new version of the site, while other users will work with the site normally.

    The only sensible way to handle this is to always follow 302 redirects OR treat the link as valid since we likely got redirected to a sensible location.

And given all this diversity of the scenarios and how to adequately react on them, I see no other option than to put this on the user.

Sometimes the user wants to treat 302 as invalid (1 and 2 cases), and sometimes they don't want that.


Also, on whether to follow redirects, this also seems to differ depending on the site!

Sometimes it is reasonable to treat 302 as valid because the site knows what it is doing and the suggested new link is valid.

Sometimes it makes sense to follow redirects and check the target because 302 can be produced by one site that leads me to another site, and the link can easily become outdated.

  • E.g. this URL shortener on attempt to open via https first redirects to its own http version via 302, and that http version redirects to the target site with 301. Not sure how much is it sensible, but this is the real life.

Given all this, and assuming that my interpretation of the situation is correct, I propose the following:

  • Have "treat 302 as valid" default. This way we produce fewer false positives, and the user is free of an unnecessary hassle.

  • Let the user specify a different behaviour for 302 links in the way we discussed it, but probably allow both "follow redirect" and "treat as invalid".

    I really want us to handle bad links to GitLab.

Plus one note: if we allow "follow redirect", it would be best if we din't automatically follow the entire chain, but rather picked the immediate target, and re-run the check recursively. Making sure that all the rules (exclusions and redirects treatment) are applied to every link in the chain, which sounds good.

  • This way one can probably ignore that link that causes 503. Not sure it is the best way to handle [BUG] Investigate 503 #218, just a thought.

How does that sound?

If we go with this plan, it turns out to be a relatively large issue, but I would like us to try it unless we hit some severe difficulties implementing it.

@aeqz
Copy link
Contributor

aeqz commented Dec 1, 2022

I will try to sum up the conclusions here in order to check that it is clear. We can refine or correct it if I missed something:

  • Configure the HTTP client to not automatically follow redirects.
  • Set it to treat 301 and 308 as invalid, and 302, 304 and 307 as valid.
  • Add a new list configuration parameter named redirectExternalRefsTo, similar to ignoreExternalRefsTo, that tells xrefcheck to follow redirects for that links.
  • Add a new comment annotation <!-- xrefcheck: redirect link -->, similar to <!-- xrefcheck: ignore link -->, that tells xrefcheck to follow an annotated link.
  • This 'follow redirect' from the configuration will be explicitly performed by xrefcheck, instead of automatically by the client, in order to apply the same rules for every link in the redirection chain.

I am not sure about how to introduce the 'treat temporary redirects as invalid' configuration. Perhaps with a global parameter that is false by default?

This default behaviour will solve #218 and, if the configuration does not work in some case and produces a false positive, ignoreExternalRefsTo and <!-- xrefcheck: ignore link --> can continue to be used in order to ignore it. Also, the new annotations could be used to cover several cases:

  • To really follow a temporary redirect in order to verify it.
  • To follow a permanent redirect when it cannot be replaced by the final URL.

@Martoon-00
Copy link
Member

Looks so 👍 Only a couple of corrections:

On the exact codes: Wikipedia says that 303 should be treated as similar to 302, is that incorrect?

Also on 304 (we didn't discuss this yet) - could you elaborate on your suggestion?


On how to make this all configurable:

I would suggest going with config like

externalRefsRedirects:
  - to: *
    outcome: valid
  - to: gitlab.com/*  # pseudo-item, in reality should follow the format of `ignoreExternalRefsTo`
    outcome: invalid
  - to: xxx
    on: permanent
    outcome: follow
  - to: yyy
    on: 304
    outcome: follow

where to and on fields are optional. This will let the user specify the desired defaults for all sites at once too.

Probably the annotations should be made a bit more configurable too, e.g. <!-- xrefcheck: follow redirect in link --> or <!-- xrefcheck: follow 302 redirect in link -->. Such annotation will override the behaviour specified by config.

I suggest inserting follow not only for readability, but to also allow other behaviours in the future ("treat as valid" or "treat as invalid"). However at the moment, I don't see a use case for these two, so let's not support them, and support only follow (temporary|permanent|<numcode>)? redirect in <area> pattern.

But, I would actually leave these annotations to a separate pull request. This all is already very bug amount of work, and I'd like it to be split into two or more pieces if possible. It's fine to create several PRs for the same #25 ticket in such cases.

@aeqz
Copy link
Contributor

aeqz commented Dec 5, 2022

Also on 304 (we didn't discuss this yet) - could you elaborate on your suggestion?

Sorry, I took it from a comment above in this thread. For the defaults, I would agree with treating 301 and 308 as invalid (those that we consider to be permanent redirects), and treating 302, 303 and 307 as valid (those that we consider to be temporary redirects).

I like your proposal, it will allow to configure the redirect behaviour more precisely beyond the defaults. In case of rule overlapping in the externalRefsRedirects list, should we apply the first one that matches or the last one? In your example it would make sense to apply the last one because of the first rule containing *.

@Martoon-00
Copy link
Member

Martoon-00 commented Dec 5, 2022

Sorry, I took it from a comment above in this thread.

Ah I see, fair. I think we will have to discuss 304 separately, to me it seems like a completely different thing compared to temporary and permanent redirects.

In case of rule overlapping in the externalRefsRedirects list, should we apply the first one that matches or the last one?

I was thinking myself about picking the first matching, but somehow provided an example that illustrates the opposite 😄

I guess I did so because it feels natural to add new entries to the end of the list, and those are concrete patterns that will be added later, and the default rules are usually added in the very beginning. So picking the last matching may make sense.

I find this a relatively weak argument, though. What do you think?

aeqz added a commit that referenced this issue Dec 12, 2022
…nagement

[#218] Change redirects default behaviour
@aeqz
Copy link
Contributor

aeqz commented Dec 12, 2022

I have closed #218, which is included in the 0.3.0 milestone, with a first PR that changes the redirect links default behaviour. Should we also add this issue to 0.3.0 in order to include the work that remains to do?

@Martoon-00
Copy link
Member

Good point, please include it.

@aeqz aeqz added this to the 0.3.0 milestone Dec 12, 2022
@aeqz aeqz self-assigned this Dec 19, 2022
aeqz added a commit that referenced this issue Dec 19, 2022
Problem: We previously changed the default behaviour of Xrefcheck when
following link redirects, but do not provided a way to configure it.

Solution: We are adding a new field in the configuration file to allow
writing a list of redirect rules that will be applied to links that
match them.
aeqz added a commit that referenced this issue Dec 20, 2022
Some tests for testing errors related to redirect chains: broken chains
and cycles.
aeqz added a commit that referenced this issue Dec 21, 2022
Proposal of showing redirect chains in the verify errors output.
aeqz added a commit that referenced this issue Dec 21, 2022
aeqz added a commit that referenced this issue Dec 21, 2022
aeqz added a commit that referenced this issue Dec 21, 2022
Adding some config redirect tests.
aeqz added a commit that referenced this issue Dec 21, 2022
Removing some repetitive code from tests.
aeqz added a commit that referenced this issue Dec 21, 2022
Changelog entry added.
aeqz added a commit that referenced this issue Dec 21, 2022
Redirect configuration explanation added in the FAQ section.
aeqz added a commit that referenced this issue Dec 22, 2022
aeqz added a commit that referenced this issue Dec 22, 2022
Problem: We previously changed the default behaviour of Xrefcheck when
following link redirects, but do not provided a way to configure it.

Solution: We are adding a new field in the configuration file to allow
writing a list of redirect rules that will be applied to links that
match them.
aeqz added a commit that referenced this issue Dec 22, 2022
aeqz added a commit that referenced this issue Dec 22, 2022
aeqz added a commit that referenced this issue Dec 29, 2022
Problem: We previously changed the default behaviour of Xrefcheck when
following link redirects, but do not provided a way to configure it.

Solution: We are adding a new field in the configuration file to allow
writing a list of redirect rules that will be applied to links that
match them.
aeqz added a commit that referenced this issue Dec 29, 2022
aeqz added a commit that referenced this issue Dec 29, 2022
aeqz added a commit that referenced this issue Dec 30, 2022
Problem: We previously changed the default behaviour of Xrefcheck when
following link redirects, but did not provide a way to configure it.

Solution: We are adding a new field in the configuration file to allow
writing a list of redirect rules that will be applied to links that
match them.
aeqz added a commit that referenced this issue Dec 30, 2022
Problem: We previously changed the default behaviour of Xrefcheck when
following link redirects, but did not provide a way to configure it.

Solution: We are adding a new field in the configuration file to allow
writing a list of redirect rules that will be applied to links that
match them.
aeqz added a commit that referenced this issue Dec 30, 2022
…rules

[#25] Redirect links with configuration rules
aeqz added a commit that referenced this issue Jan 10, 2023
Problem: after recent work on Xrefcheck redirect behavior, it remained
to discuss how to handle 304 redirects.

Solution: with the current default configuration, 304 redirects are
considered as valid, which seems appropriate taking into account that
304 responses usually mean that you previously received a successful
response for that same request and there is no need to retransmit the
resource again. We add a test case for this default config and update
the FAQ section.
@aeqz aeqz mentioned this issue Jan 10, 2023
8 tasks
aeqz added a commit that referenced this issue Jan 16, 2023
Problem: after recent work on Xrefcheck redirect behavior, it remained
to discuss how to handle 304 redirects.

Solution: with the current default configuration, 304 redirects are
considered as valid, which seems appropriate taking into account that
304 responses usually mean that you previously received a successful
response for that same request and there is no need to retransmit the
resource again. We add a test case for this default config and update
the FAQ section.
aeqz added a commit that referenced this issue Jan 16, 2023
Problem: After recent work on Xrefcheck redirect behavior, it remained
to discuss how to handle 304 redirects.

Solution: With the current default configuration, 304 redirects are
considered as valid, which seems appropriate taking into account that
304 responses usually mean that you previously received a successful
response for that same request and there is no need to retransmit the
resource again. We add a test case for this default config and update
the FAQ section.
PhilTaken pushed a commit that referenced this issue Jan 18, 2023
Problem: We previously changed the default behaviour of Xrefcheck when
following link redirects, but did not provide a way to configure it.

Solution: We are adding a new field in the configuration file to allow
writing a list of redirect rules that will be applied to links that
match them.
@aeqz aeqz closed this as completed in #264 Jan 18, 2023
aeqz added a commit that referenced this issue Jan 18, 2023
@Martoon-00
Copy link
Member

Note: a real life use case when a permanent redirect was intentionally desired.

That measure is kinda about achieving the same goal as xrefcheck, but some users may still prefer using such proxies.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
improvement Updates existing functionality
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants