learn.microsoft.com too big to index #1

Greetings,

I am testing Bloo on https://learn.microsoft.com/de-de/docs/, but it wants to index every language of the docs, not just German.

I am aware that I chose a large data set; I still wanted to ask about possible optimisations. Bloo tries to index the other languages via the sitemap XML files. Would it be possible to exclude the other languages? The naming is somewhat standardised (e.g. de-de, en-us) and the language codes are part of the sitemap XML filenames.

I think if Bloo can handle the Microsoft docs, it should be able to handle anything. I am open to other suggestions.
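As a minimal sketch of the kind of filtering this request implies (the sitemap URLs below are invented for illustration; real Microsoft Docs sitemaps may be organised differently):

```swift
import Foundation

// Keep only sitemap files whose names carry the wanted locale code.
// The URLs here are made up for illustration.
let sitemaps = [
    "https://learn.microsoft.com/sitemaps/docs_de-de_1.xml",
    "https://learn.microsoft.com/sitemaps/docs_en-us_1.xml",
    "https://learn.microsoft.com/sitemaps/docs_fr-fr_1.xml",
]
let wantedLocale = "de-de"

let filtered = sitemaps.filter { $0.contains(wantedLocale) }
print(filtered)  // ["https://learn.microsoft.com/sitemaps/docs_de-de_1.xml"]
```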
My second test site is https://www.bsi.bund.de/DE/Home/home_node.html
Hi @idnovic - thank you for your feedback, and sorry for the delay; I missed it earlier. Correct me if I'm wrong, but it sounds like what you're looking for are "filters" controlling what is indexed (or what is ignored) for a specific domain. In your example, I assume that would mean only indexing items with "de-de" in their path. Am I understanding your requirement correctly?
No problem. I did read that it is alpha software, so I did not expect an answer right away. Well, maybe. Filters do sound useful. For my test with the Microsoft docs, the URL path itself already contained the German language code. It seems that Bloo found the links to the other translations in the site metadata and tried to index every language. I think it would be best if you try it yourself to see what I mean. My second test, with the BSI site, was not successful: Bloo was not able to index it, perhaps due to rate limiting or a parsing issue.
Thanks for the feedback; I think I understand the issue you're having - you'd like to index only a specific language on the site. My thinking is more about what the best solution would be: one that solves your issue and is also useful to as many other people and situations as possible, without being too general, if you get what I mean :) In your mind, what kind of feature/option would be the ideal solution?
I think that Bloo should first create a list of everything it wants to index, and present that list to me sorted by domain/subdomain/directory. I think this list needs to be a tree view that lets me disable branches. That way it would work for most websites, and I could disable every branch of another language. I am just not sure how you can pre-create the URL list without loading every sub-page.
Indeed, there is a chicken-and-egg issue there - for some sites, reading the sitemap could provide a starting point for something like this, but it wouldn't work consistently across sites. In a way, a regex-based filter would accomplish the same thing as links come in (in fact, Bloo already does this based on rules specified in the robots.txt file) - so perhaps a way for the user to add "extra" rules for a domain could make sense?
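A rough sketch of what such user-supplied "extra" rules could look like as regex filters over incoming URLs; the `DomainFilter` type and its property names are hypothetical illustrations, not Bloo's actual API:

```swift
import Foundation

// Per-domain "extra rules" as user-supplied regexes (hypothetical API).
struct DomainFilter {
    var allow: [Regex<AnyRegexOutput>] = []  // if non-empty, a URL must match one
    var deny: [Regex<AnyRegexOutput>] = []   // a URL matching any of these is skipped

    func shouldIndex(_ url: URL) -> Bool {
        let path = url.path
        if deny.contains(where: { path.contains($0) }) { return false }
        return allow.isEmpty || allow.contains(where: { path.contains($0) })
    }
}

// Example: only index the German part of the Microsoft docs.
let filter = DomainFilter(allow: [try! Regex("^/de-de/")])
let candidate = URL(string: "https://learn.microsoft.com/en-us/docs/")!
print(filter.shouldIndex(candidate))  // false - the path is not under /de-de/
```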
I think a single rule may already improve this situation.
Let's keep the Microsoft docs situation in mind: we are on the first URL and tell Bloo to index it; Bloo then finds the second URL via metadata. I think fully custom rules may be too complex, because Bloo does not tell me why it wants to index something, but basic rules I can switch on/off to change the indexing seem like a good idea; a few switchable rules come to mind. Also, since Bloo runs mostly on Apple platforms, it may be possible to detect the page language and offer a setting to only index pages in certain languages.
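For reference, Apple's NaturalLanguage framework does offer on-device language detection; a minimal sketch of the idea (only an illustration, not how Bloo actually works):

```swift
import NaturalLanguage

// Detect the dominant language of a page's text on-device.
func dominantLanguage(of pageText: String) -> NLLanguage? {
    let recognizer = NLLanguageRecognizer()
    recognizer.processString(pageText)
    return recognizer.dominantLanguage
}

let sample = "Dies ist ein Beispieltext aus der deutschen Dokumentation."
if dominantLanguage(of: sample) == .german {
    print("Index this page")  // the sample text is detected as German
}
```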
Hi, sorry for the slow reply, and I hope you're having a happy holiday season. Those options do indeed sound very useful.
Can you tell me more about the cross-domain indexing idea? Thanks again for these suggestions - I'll definitely be working on the first three. When you find some time (no hurry), let me know more about the cross-domain part.
No problem. Holidays are more important than GitHub. (Let's rename cross-domain to "follow sources".)
Another example would be an article on a news website. Important: only follow sources linked from the article itself, not from other components/divs of the page.
Just a note to let you know I haven't forgotten about this :) Time is very limited these days, but I suspect the eventual solution will be to allow a custom extension to a domain's robots.txt, so that you can add extra rules - for instance only allowing certain paths, or URLs that match specific regexes, etc. (Apologies in case you're no longer using Bloo - just ignore me then :) But I still want to implement this in some form, as it's a great suggestion and I want to use it too :))
Ok, the initial hack is in the upcoming build 55: you can now add extra, robots.txt-style access rules for a domain by providing your own text file.
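As a sketch of what such rules could look like, using ordinary robots.txt syntax (the `bloo` agent name here is an assumption for illustration, not documented behaviour):

```
# Hypothetical extra rules for learn.microsoft.com.
# The user-agent name "bloo" is an assumption, not confirmed by the issue.
User-agent: bloo
Allow: /de-de/
Disallow: /
```

Under longest-match precedence, as Google-style robots.txt parsers apply it, the more specific `Allow: /de-de/` wins over `Disallow: /`, so only the German paths would be fetched.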
If you refresh, or pause and un-pause, the indexing of that domain, it should no longer access URLs outside this scope. Note that if you have already indexed items outside the scope, they will only be removed by a full refresh of the domain. I'm going to improve this with a bit of GUI when possible, but I wanted to get the basic ability in first, so if you really need it, you can add that text file and get started. It supports all the syntax that a regular agent definition in a robots.txt file has, so if it can be done in a regular robots.txt file, it can be done here too. I hope this helps, and sorry for taking ages to reply!