
learn.microsoft.com too big to index #1

Open
idnovic opened this issue Nov 5, 2023 · 11 comments


idnovic commented Nov 5, 2023

Greetings,

I am testing Bloo on https://learn.microsoft.com/de-de/docs/

But it wants to index every language of the docs, not just German. I am aware that I chose a large data set; I still wanted to ask about optimisations.

Bloo tries to index the other languages via the sitemap XML files.
Would it be possible to exclude the other languages? The naming is somewhat standardised (e.g. de-de, en-us, etc.), and the language codes are part of the sitemap XML filenames.
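Something like this minimal sketch is what I have in mind (hypothetical code, not Bloo's; it assumes we already have the list of URLs from the sitemap):

    import Foundation

    // Hypothetical sketch: keep only sitemap entries whose first path
    // component matches the locale the user started from.
    func filterSitemapURLs(_ urls: [URL], keepingLocale locale: String) -> [URL] {
        urls.filter { url in
            // On learn.microsoft.com the locale is the first path component,
            // e.g. /de-de/docs/... or /en-us/docs/...
            let first = url.pathComponents.dropFirst().first // drop the leading "/"
            return first?.lowercased() == locale.lowercased()
        }
    }

    // filterSitemapURLs(urls, keepingLocale: "de-de") keeps only the German entries.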

I think if Bloo can handle the Microsoft docs, it should be able to handle everything.
I am open to other suggestions.


idnovic commented Nov 5, 2023

My second test set is https://www.bsi.bund.de/DE/Home/home_node.html
It is only around 1% of the size of the Microsoft docs set.
I will give feedback after it finishes indexing.

ptsochantaris (Owner) commented

Hi @idnovic - thank you for your feedback. I'm sorry I missed it, and sorry for the delay.

Correct me if I'm wrong, but it sounds like what you're looking for are "filters" for what is indexed (or what is ignored) for a specific domain. In your example, I assume it would involve only indexing items with a "de-de" in their path. Am I understanding your requirement correctly?

ptsochantaris self-assigned this Dec 18, 2023
ptsochantaris added the enhancement label Dec 18, 2023

idnovic commented Dec 18, 2023

> Hi @idnovic - thank you for your feedback. I'm sorry I missed it, and sorry for the delay.
>
> Correct me if I'm wrong, but it sounds like what you're looking for are "filters" for what is indexed (or what is ignored) for a specific domain. In your example, I assume it would involve only indexing items with a "de-de" in their path. Am I understanding your requirement correctly?

No problem. I read that it is alpha software, so I did not expect an answer right away.

Well, maybe. Filters do sound useful. In my test with the Microsoft docs, the URL path itself already contained the German language code.

It seems that Bloo found the links to the other translations in the site metadata and tried to index every language.

I think it would be best if you try it yourself, to see what I mean.

My second test, with the BSI site, was not successful. Bloo was not able to index it; maybe rate limiting or a parsing issue.

ptsochantaris (Owner) commented

Thanks for the feedback; I think I understand the issue you're having: you'd like to index only a specific language on the site. My thinking is more about the best solution - one that solves your issue and is also as useful as possible in as many other situations and for as many other people as possible, without being too general, if you get what I mean :)

In your mind what kind of feature/option would be the ideal solution for your issue?


idnovic commented Dec 20, 2023

I think Bloo should first create a list of all domains it wants to index, and present me this list sorted by domain/subdomain/directory.

I think this list needs to be in a tree view.

Let me disable branches.

That way it would work for most websites, and I could disable every branch of another language.

I am just not sure how you can pre-create the URL list without loading every sub-page.
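A rough sketch of the tree I mean (hypothetical code):

    import Foundation

    // Hypothetical sketch: one node per path component; discovered URLs are
    // inserted as they are found, and the user can toggle whole branches off.
    final class PathNode {
        let name: String
        var enabled = true
        var children: [String: PathNode] = [:]

        init(name: String) { self.name = name }

        // Record a discovered URL under this (root) node.
        func insert(_ url: URL) {
            var node = self
            for part in url.pathComponents where part != "/" {
                let child = node.children[part] ?? PathNode(name: part)
                node.children[part] = child
                node = child
            }
        }

        // A URL may be indexed only if no node on its path is disabled.
        func allows(_ url: URL) -> Bool {
            var node = self
            for part in url.pathComponents where part != "/" {
                guard let child = node.children[part] else { return true } // unknown branch
                if !child.enabled { return false } // user switched this branch off
                node = child
            }
            return true
        }
    }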

ptsochantaris (Owner) commented

Indeed, there is a chicken-and-egg issue there. For some sites, reading the sitemap could provide a starting point for something like this, but it wouldn't work consistently across sites. In a way, a regex-based filter would accomplish the same thing as links come in (in fact, Bloo already does this based on rules specified in the robots.txt file) - so perhaps a way for the user to add "extra" rules for a domain could make sense?
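As a sketch (hypothetical, not a final API), the extra rules could look like this - deny patterns checked first, then allow patterns:

    import Foundation

    // Sketch of per-domain "extra" rules: with no allow patterns, anything
    // not denied passes; otherwise the path must match an allow pattern.
    struct DomainFilter {
        var allow: [NSRegularExpression] = []
        var deny: [NSRegularExpression] = []

        func permits(_ url: URL) -> Bool {
            let path = url.path
            let range = NSRange(path.startIndex..., in: path)
            if deny.contains(where: { $0.firstMatch(in: path, options: [], range: range) != nil }) {
                return false
            }
            if allow.isEmpty { return true }
            return allow.contains(where: { $0.firstMatch(in: path, options: [], range: range) != nil })
        }
    }

    // Example: only index the German section of learn.microsoft.com
    // let filter = DomainFilter(allow: [try! NSRegularExpression(pattern: "^/de-de/")])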


idnovic commented Dec 22, 2023

I think a single rule may already improve this situation.

  1. example.com/en/page
  2. example.com/de/page
  3. example.com/en/page/help

Let’s keep the Microsoft docs situation in mind. We are on URL 1 and tell Bloo to index. Bloo finds URL 2 via metadata.
This seems to be a logical error,
because it is traversing upwards (from 1 to 2) instead of downwards (from 1 to 3).
If I go to a website and make an effort to find indexable content, then I probably do not want content upwards of my chosen directory.
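A minimal sketch of that rule (hypothetical code):

    import Foundation

    // Only traverse downwards: a discovered link is followed only if it
    // sits under the path the user originally provided.
    func isDownward(from start: URL, to candidate: URL) -> Bool {
        guard candidate.host == start.host else { return false }
        // "/en/page" allows "/en/page/help" but not "/de/page".
        return candidate.path.hasPrefix(start.path)
    }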

I think fully custom rules may be too complex, because Bloo does not tell me why it wants to index something. But basic rules I can switch on/off to change the indexing seem like a good idea.

Good switchable rules that come to mind (sketched as a struct below) are:

  • traversal depth
  • traversal direction
  • minimum number of words to index (per page)
  • maximum number of words to index (per page)
  • cross-domain indexing
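Bundled together as a sketch (hypothetical, not Bloo's API):

    import Foundation

    // The switches above, as one settings value.
    struct CrawlRules {
        var maxDepth: Int? = nil         // traversal depth; nil = unlimited
        var downwardOnly = true          // traversal direction
        var minWordCount: Int? = nil     // skip pages with too few words
        var maxWordCount: Int? = nil     // skip pages with too many words
        var crossDomain = false          // allow indexing across domains

        // Word-count gate for a fetched page.
        func permitsWordCount(_ count: Int) -> Bool {
            if let min = minWordCount, count < min { return false }
            if let max = maxWordCount, count > max { return false }
            return true
        }
    }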

Also, since Bloo runs mostly on Apple platforms, it may be possible to detect the page language and offer a setting to only index pages in certain languages.
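For example, Apple's NaturalLanguage framework can detect the dominant language of a page's text; a sketch:

    import NaturalLanguage

    // Detect the dominant language of the extracted page text and index
    // only pages in an allowed set of languages.
    func shouldIndex(text: String, allowedLanguages: Set<NLLanguage>) -> Bool {
        let recognizer = NLLanguageRecognizer()
        recognizer.processString(text)
        guard let language = recognizer.dominantLanguage else { return false }
        return allowedLanguages.contains(language)
    }

    // shouldIndex(text: pageText, allowedLanguages: [.german])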

ptsochantaris (Owner) commented

Hi, sorry for the slow reply, and I hope you're having a happy holiday season. Those options do indeed sound very useful, so if I understand correctly we'd have:

  • Option to lock direction (e.g. don't index anything outside the URL path prefix originally provided)
  • Maximum depth (I like that one, I'd definitely use it :))
  • Min/Max word count for page to be indexed (I would totally use that too :))

Can you tell me more about cross-domain indexing? Do you mean e.g. https://support.apple.com/de/data.html would also allow https://developer.apple.com/de/other_data.html? If so, how would the option work? Perhaps some wildcard? e.g. https://*.apple.com/de for instance?
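(For instance, a wildcard like that could be matched along these lines - a hypothetical sketch:)

    import Foundation

    // "*.apple.com" matches apple.com and any of its subdomains, combined
    // with a path prefix such as "/de".
    func matches(_ url: URL, hostPattern: String, pathPrefix: String = "/") -> Bool {
        guard let host = url.host else { return false }
        let hostOK: Bool
        if hostPattern.hasPrefix("*.") {
            let suffix = String(hostPattern.dropFirst(2)) // e.g. "apple.com"
            hostOK = host == suffix || host.hasSuffix("." + suffix)
        } else {
            hostOK = host == hostPattern
        }
        return hostOK && url.path.hasPrefix(pathPrefix)
    }

    // matches(URL(string: "https://developer.apple.com/de/other_data.html")!,
    //         hostPattern: "*.apple.com", pathPrefix: "/de") // true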

Thanks again for these suggestions - I'll definitely be working on the first three; they sound very useful. When you find some time (no hurry), let me know more about the cross-domain stuff.


idnovic commented Dec 27, 2023

> Can you tell me more about cross-domain indexing? Do you mean e.g. https://support.apple.com/de/data.html would also allow https://developer.apple.com/de/other_data.html? If so, how would the option work? Perhaps some wildcard? e.g. https://*.apple.com/de for instance?

No problem. Holidays are more important than GitHub.
By cross-domain indexing I was thinking about articles that provide sources at the end of the article. Think of Wikipedia: I may want to index wikipedia.com/cheesecake, but I may also want to index all the sources given at the end of the page. These sources are probably external.

(Let's rename cross-domain to "follow sources".)
With "follow sources" turned on, I would expect the following:

  1. Index wikipedia.com/cheesecake
  2. Check the article markup for references to sources (for example grandma-cocking.com/cheesecake)
  3. Follow the sources, but only index a single document/page and do not traverse any further at the source page.

Another example would be an article on a news website.
I am reading about dogs, and the article lists external sources in the markup.
If "follow sources" is enabled, then also index the given sources.

Important: only follow sources from the article itself, not from other components/divs of the website.
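As a sketch of the behaviour (hypothetical types):

    import Foundation

    // "Follow sources": links found in the article body are followed;
    // external ones are indexed exactly once and never traversed further.
    struct ArticlePage {
        let url: URL
        let articleLinks: [URL] // extracted from the article markup only
    }

    func linksToFollow(from page: ArticlePage,
                       startHost: String,
                       followSources: Bool) -> [(url: URL, traverseFurther: Bool)] {
        var result: [(url: URL, traverseFurther: Bool)] = []
        for link in page.articleLinks {
            if link.host == startHost {
                result.append((url: link, traverseFurther: true))  // same site: crawl on
            } else if followSources {
                result.append((url: link, traverseFurther: false)) // source: one page only
            }
            // External link with "follow sources" off: ignored.
        }
        return result
    }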

ptsochantaris (Owner) commented

Just a note to let you know I haven't forgotten about this :) Time is very limited these days, but I suspect the eventual solution will be to allow a custom extension to be added to the domain's robots.txt, so that, for instance, you can add extra rules, such as only allowing certain paths, or URLs that match specific regexes, etc. (Apologies in case you're no longer using Bloo - just ignore me then :) But I still want to implement this in some form, as it's a great suggestion and I want to use it too :))


ptsochantaris commented Feb 13, 2024

Ok, the initial hack is in the upcoming build 55. For example, if you want to add access rules to https://developer.apple.com so it only indexes the /de section, you can do this:

  • Create a file called local-robots.txt at ~/Library/Containers/build.bru.bloo.app/Data/Documents/storage.noindex/developer.apple.com/local-robots.txt
  • This file should contain a single agent definition, in the same format you'd write in a normal robots.txt file. In this example:

      User-agent: _bloo_local_domain_agent
      Disallow: /
      Allow: /de/

If you refresh, or pause and un-pause the indexing of that domain, it should no longer access URLs outside this scope. Note that if you already have indexed items outside this scope, they will only be removed if a full refresh of the domain is made.

Note that _bloo_local_domain_agent must be the name of the agent; anything else will be ignored.

I'm going to improve this with a bit of GUI when possible, but I wanted to get the ability to do this in initially, so if you really need it, you can add that text file and get started. It supports all the syntax that a regular agent definition in a robots.txt file has, so if it can be done in a regular robots.txt file, it can be done here too.
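Applied to the original learn.microsoft.com case, the same idea would presumably be a local-robots.txt under that domain's storage folder containing:

      User-agent: _bloo_local_domain_agent
      Disallow: /
      Allow: /de-de/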

I hope this helps and sorry for taking ages to reply!
