Sources can be duplicated when similar sources for different chunks on !sources command #39

Open
bolinocroustibat opened this issue May 26, 2024 · 8 comments

@bolinocroustibat
Member

bolinocroustibat commented May 26, 2024

We should de-duplicate similar sources.

Screenshot 2024-05-26 at 18 50 27
@bolinocroustibat bolinocroustibat changed the title Sources can be duplicated when similar sources for different chunks Sources can be duplicated when similar sources for different chunks on !sources command May 26, 2024
@dtrckd
Collaborator

dtrckd commented May 26, 2024

Hey, I don't see any duplicated sources.
There are duplicated sheets (e.g. F[0-9]{6}), but sources are derived from chunks, and a sheet can contain multiple chunks. So it is not surprising to find the same sheet referenced several times. But if you look at the context (the text in parentheses), they are all different. The context is like a breadcrumb of the chunk inside the sheet. And finally, there are no direct links to chunks; since they come from the same sheets, that explains why the same links appear.

@dtrckd dtrckd closed this as completed May 26, 2024
@dtrckd dtrckd reopened this May 26, 2024
@dtrckd
Collaborator

dtrckd commented May 26, 2024

Ah, I missed the ones that are actually real duplicates.

EDIT: my bad, I don't see any duplicates in fact; I confused the sources inside the answer with the actual sources from !sources.

@dtrckd dtrckd closed this as completed May 26, 2024
@bolinocroustibat
Member Author

bolinocroustibat commented May 26, 2024

@pedevineau Can you confirm you're OK with the current !sources behaviour?
@dtrckd This was opened after some user feedback; it might be better to reopen while we make sure we all agree on the current behaviour.

@dtrckd
Collaborator

dtrckd commented May 26, 2024

What can be done is to add an anchor to the link for each chunk. For example:

https://www.service-public.fr/particuliers/vosdroits/F59#chunk1

Even if the anchor does not exist on the page, it can give the user a hint of why this is actually the same URL.
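
A minimal sketch of that idea, assuming chunks are simply numbered within a sheet. The helper name `with_chunk_anchor` and the `chunkN` fragment format are hypothetical and only follow the example URL above; they are not taken from the project's code.

```python
from urllib.parse import urlsplit, urlunsplit


def with_chunk_anchor(url: str, chunk_index: int) -> str:
    """Return `url` with a `#chunkN` fragment, replacing any existing fragment."""
    parts = urlsplit(url)
    # The fragment is only a visual hint: the target page does not need
    # to define a matching anchor.
    return urlunsplit(parts._replace(fragment=f"chunk{chunk_index}"))


sheet_url = "https://www.service-public.fr/particuliers/vosdroits/F59"
links = [with_chunk_anchor(sheet_url, i) for i in range(1, 4)]
print("\n".join(links))
```

Browsers ignore a fragment that has no matching anchor, so every such link would still open the top of the same sheet.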

Let me know if you have a better idea.

@dtrckd dtrckd reopened this May 26, 2024
@pedevineau
Member

How do we choose the titles related to chunks? My suggestion would be: let us return the title of the sheet once, with the URL. That way it will be easy to deduplicate.
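
As a rough sketch of this suggestion, sources could be collapsed to one entry per sheet URL. The dictionary keys (`title`, `url`, `context`) and the example.org URLs are assumptions made for illustration; the real source objects may be shaped differently.

```python
def dedupe_by_sheet(sources: list[dict]) -> list[dict]:
    """Keep one (title, url) entry per sheet URL, preserving first-seen order."""
    unique: dict[str, dict] = {}
    for source in sources:
        # dicts preserve insertion order, so the first chunk seen for a sheet wins
        unique.setdefault(
            source["url"], {"title": source["title"], "url": source["url"]}
        )
    return list(unique.values())


# Hypothetical data: two chunks from the same sheet, one from another sheet.
sources = [
    {"title": "Sheet A", "url": "https://example.org/F1", "context": "Section 1"},
    {"title": "Sheet A", "url": "https://example.org/F1", "context": "Section 2"},
    {"title": "Sheet B", "url": "https://example.org/F2", "context": "Section 1"},
]
print(dedupe_by_sheet(sources))
```

The trade-off is that the per-chunk context is dropped from the output, so the user no longer sees which parts of the sheet were used.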

@dtrckd
Collaborator

dtrckd commented May 28, 2024

The title of a chunk is the title of the sheet it comes from. The subtitle (context) is the path to that chunk within the sheet, composed of the successive subtitles met before reaching the chunk. The subtitle is the string that enables us to deduplicate (we also use a hash of the chunk as a unique identifier internally). But again, there are no duplicated chunks; they are already deduplicated in the backend.
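
To illustrate the description above, here is a small sketch of a breadcrumb-style context and a hash-based chunk identifier. The `" > "` separator, the choice of SHA-256, and the function names are assumptions; the thread does not say which hash or format the backend actually uses.

```python
import hashlib


def chunk_context(subtitles: list[str]) -> str:
    """Join the subtitles met on the way to a chunk into a breadcrumb string."""
    return " > ".join(subtitles)


def chunk_id(text: str) -> str:
    """Derive a stable identifier from the chunk text (SHA-256 is an assumption)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Hypothetical chunks from one sheet: same title and URL, different context.
first = {"context": chunk_context(["Conditions", "Age"]), "text": "First chunk text"}
second = {"context": chunk_context(["Conditions", "Income"]), "text": "Second chunk text"}

print(first["context"])
print(chunk_id(first["text"]) != chunk_id(second["text"]))
```

Under these assumptions, two chunks from the same sheet share a title and URL but differ in both context and identifier, which is why they show up as separate sources.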

@pedevineau
Member

Yes, I know there are no duplicate chunks; I was considering deduplicating sheets, because in the end every URL targets the same page. The anchor system doesn't work in general, because our chunks are not always related to the DILA webpage anchors, are they?

@dtrckd
Collaborator

dtrckd commented May 28, 2024

Yes, you're right, the anchor idea was just to give a visual hint.
