Reverse URLs
(LRU is a Reverse URL)
A Reverse URL or LRU is a way to rewrite a URL in the form of a path of stems sorted in a hierarchical order.
This is an actual URL:
http://jiminy.medialab.sciences-po.fr/hci/index.php?title=Reverse_URLs#bottom
These are the tokens:
- http - type: scheme (or protocol)
- jiminy - type: host (a subdomain)
- medialab - type: host (also a subdomain)
- sciences-po - type: host (the actual domain)
- fr - type: host (actually the TLD)
- hci - type: path
- index.php - type: path
- title=Reverse_URLs - type: query
- bottom - type: fragment (anchor)
- NB: there is also a hidden transport token, the port (here 80), and there may also be a user:password string.
The tokens are not sorted hierarchically. From the most generic to the most specific, we would have:
http > fr > sciences-po > medialab > jiminy > hci > index.php > title=Reverse_URLs > bottom
To be able to rebuild the URL from the LRU, we have to indicate the type of each token with a code letter. Once encoded, the LRU of our URL is:
s:http|h:fr|h:sciences-po|h:medialab|h:jiminy|p:hci|p:index.php|q:title=Reverse_URLs|f:bottom
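The encoding above can be sketched in a few lines of Python. This is only an illustration, not the HCI implementation: the function names `url_to_lru` and `lru_to_url` are hypothetical, the port and user:password tokens are left out, and the sketch assumes that no token contains the `|` separator.

```python
from urllib.parse import urlsplit


def url_to_lru(url):
    """Encode a URL as an LRU string, most generic stem first (sketch)."""
    parts = urlsplit(url)
    stems = [("s", parts.scheme)]
    # Host labels are reversed so that the TLD comes first.
    host = parts.hostname or ""
    stems += [("h", label) for label in reversed(host.split(".")) if label]
    # Each non-empty path segment becomes one stem.
    stems += [("p", segment) for segment in parts.path.split("/") if segment]
    if parts.query:
        stems.append(("q", parts.query))
    if parts.fragment:
        stems.append(("f", parts.fragment))
    return "|".join(f"{code}:{value}" for code, value in stems)


def lru_to_url(lru):
    """Rebuild a URL from an LRU produced by url_to_lru (sketch)."""
    stems = [stem.split(":", 1) for stem in lru.split("|") if stem]

    def values(code):
        return [value for c, value in stems if c == code]

    # The host labels go back to their usual specific-to-generic order.
    url = values("s")[0] + "://" + ".".join(reversed(values("h")))
    url += "".join("/" + segment for segment in values("p"))
    if values("q"):
        url += "?" + values("q")[0]
    if values("f"):
        url += "#" + values("f")[0]
    return url


example = "http://jiminy.medialab.sciences-po.fr/hci/index.php?title=Reverse_URLs#bottom"
print(url_to_lru(example))
print(lru_to_url(url_to_lru(example)) == example)
```

The type letters are what make the round trip possible: without them, `hci` could be read as either a host or a path token. A trailing slash in the original URL is lost by this sketch.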
The whole idea is to reverse the URL into generic-to-specific order. HCI tools will need to test which of two LRUs is more specific.
We therefore have to define a measure of specificity. It could simply be the number of tokens:
s:http|h:fr|h:sciences-po|h:medialab|h:jiminy|p:hci specificity = 6
But should we not give different weights to the token types? Is having a fragment more specific than having a subdomain?
If so, we could set weights such as, for example:
s = 0 h = 1 p = 2 ...
then
s:http|h:fr|h:sciences-po|h:medialab|h:jiminy|p:hci specificity_2 = 6
s:http|h:fr|h:sciences-po|h:medialab|h:jiminy|h:hci specificity_2 = 5
The question remains open.
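Both candidate measures are easy to compare in code. The sketch below assumes LRUs written as `code:value` stems separated by `|`; the function names are hypothetical, and the weight table contains only the example values given above (s = 0, h = 1, p = 2). Since no weight is proposed for the other token types, the weighted version deliberately fails on them rather than guessing.

```python
# Example weights from the text; weights for other token types are not defined.
WEIGHTS = {"s": 0, "h": 1, "p": 2}


def stem_codes(lru):
    """Return the type letter of every stem in an LRU string."""
    return [stem.split(":", 1)[0] for stem in lru.split("|") if stem]


def specificity(lru):
    """First measure: the number of tokens."""
    return len(stem_codes(lru))


def weighted_specificity(lru, weights=WEIGHTS):
    """Second measure: the sum of per-type weights (KeyError on unknown types)."""
    return sum(weights[code] for code in stem_codes(lru))


with_path = "s:http|h:fr|h:sciences-po|h:medialab|h:jiminy|p:hci"
with_host = "s:http|h:fr|h:sciences-po|h:medialab|h:jiminy|h:hci"

print(specificity(with_path), weighted_specificity(with_path))  # 6 6
print(specificity(with_host), weighted_specificity(with_host))  # 6 5
```

With plain counting the two LRUs tie; with these weights the one ending in a path token ranks as more specific.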
In the crawling facilities that the HCI project offers through the hyphen crawler, the page level is far too specific to be useful. Instead, users tend to group pages into larger ensembles called Web entities, which can range from a domain name down to any level of subdirectory in the path.
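One way to picture this grouping, assuming a Web entity is represented by an LRU prefix (an assumption made for this illustration, not something stated here), is to test whether a page's LRU starts with the entity's stems. The function names are hypothetical.

```python
def stems(lru):
    """Split an LRU string into its list of stems."""
    return [stem for stem in lru.split("|") if stem]


def belongs_to(page_lru, entity_prefix):
    """True if the page's LRU starts with all the stems of the entity prefix."""
    page, prefix = stems(page_lru), stems(entity_prefix)
    return page[:len(prefix)] == prefix


def most_specific_entity(page_lru, entity_prefixes):
    """Among the matching prefixes, pick the one with the most stems."""
    matches = [p for p in entity_prefixes if belongs_to(page_lru, p)]
    return max(matches, key=lambda p: len(stems(p)), default=None)


page = "s:http|h:fr|h:sciences-po|h:medialab|h:jiminy|p:hci|p:index.php"
entities = [
    "s:http|h:fr|h:sciences-po",
    "s:http|h:fr|h:sciences-po|h:medialab|h:jiminy|p:hci",
    "s:http|h:fr|h:sciences",
]
print(most_specific_entity(page, entities))
```

Comparing stem by stem rather than character by character keeps `h:sciences` from matching `h:sciences-po`.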
The process of transforming a URL into an LRU is called Tokenization, for illustrative as well as historic reasons.
Each LRU is stored in the Memory structure as a tokenized string divided into several components, excluding page-level ones. The elementary components, or stems, follow the definitions of RFC 1738, Uniform Resource Locators (URL), and were chosen to match the ones used by linkfluence:
- s - scheme or protocol. As each protocol may lead to different web pages, this has to be stored. It includes HTTP, HTTPS, FTP, etc.
- t - port. As each port may lead to different web pages, this also has to be stored. By default HTTP responds on port 80, but some websites host tools on several concurrent ports.
- h - domain. The domain is similar to the host name and is generally composed of 3 elements separated by dots: a subdomain, a domain name and an extension. The stem logic is to go from generality to specificity. In the host, specificity increases from right to left, so the host elements are stored in reverse order. Consequently, the host www.sciences-po.fr becomes h:fr | h:sciences-po, as www is ignored.
- p - path. The path is everything contained between the host and the last slash. It corresponds to subdirectories. Each subdirectory is extracted and appended to the reversed host. On the web, a path can be as long as one wants, but it seems irrelevant for HCI to go too deep, as a very deep path generally reveals an overly complex or lazy web structure. Therefore, the maximum path depth is arbitrarily set to 8, which is enough to cover the vast majority of websites.
- q - query. NOT USED BY DEFAULT
- f - fragment (anchor). NOT USED BY DEFAULT
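The stem rules above can be sketched as follows. This is a minimal illustration under several assumptions that the text does not settle: the function name `url_to_stored_stems` is hypothetical, the `t` stem is placed right after the scheme, a missing port falls back to the well-known default for the scheme, and only a leading `www` label is dropped.

```python
from urllib.parse import urlsplit

# Well-known default ports, used when the URL does not give one explicitly.
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}
MAX_PATH_DEPTH = 8  # maximum number of path stems kept, as stated above


def url_to_stored_stems(url, keep_query=False, keep_fragment=False):
    """Return the list of stems stored for a URL, page level excluded (sketch)."""
    parts = urlsplit(url)
    stems = ["s:" + parts.scheme]

    # Assumption: the port stem comes right after the scheme.
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(parts.scheme)
    if port is not None:
        stems.append(f"t:{port}")

    # Host labels are stored from generic to specific; a leading www is ignored.
    labels = [label for label in (parts.hostname or "").split(".") if label]
    if labels and labels[0] == "www":
        labels = labels[1:]
    stems += ["h:" + label for label in reversed(labels)]

    # Only what lies between the host and the last slash is kept,
    # so the final segment (the page itself) is dropped.
    directories = [seg for seg in parts.path.split("/")[1:-1] if seg]
    stems += ["p:" + d for d in directories[:MAX_PATH_DEPTH]]

    # Query and fragment are not used by default.
    if keep_query and parts.query:
        stems.append("q:" + parts.query)
    if keep_fragment and parts.fragment:
        stems.append("f:" + parts.fragment)
    return stems


print(url_to_stored_stems("http://www.sciences-po.fr/"))
# ['s:http', 't:80', 'h:fr', 'h:sciences-po']
```

Note that a URL such as `http://example.org/hci` (no trailing slash) yields no path stem here, because `hci` is then treated as the page rather than as a subdirectory.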
Tokenization rules are useful to create web entities from known web platforms, such as blog platforms or the internal service websites of specific universities or corporations. These rules are heuristics that have to be defined manually, and their implementation depends on future choices of HCI. They will be either specific to an HCI server, shared among servers, or both, depending mostly on the community's needs and feedback.
Although our goal is to handle as many kinds of website programming as possible, some issues remain technical dead ends. Issues include:
- fragment marker and query used for ajax (on sites like twitter.com/ic05 for example)
- javascript link
- need to establish whether the anchor's id letter is "r" or "f" (it differs between the definition and the examples)
- how to handle order of queries, maybe always reorder alphabetically by argument?
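For the last issue, alphabetical reordering could look like the sketch below. It is only one possible answer to a question that is still open: the function name `normalize_query` is hypothetical, arguments are assumed to be separated by `&`, and the raw text of each argument is kept untouched so that percent-encoding is not altered.

```python
def normalize_query(query):
    """Reorder query arguments alphabetically by argument name (sketch)."""
    arguments = [arg for arg in query.split("&") if arg]
    # sorted() is stable, so repeated arguments keep their relative order.
    arguments = sorted(arguments, key=lambda arg: arg.split("=", 1)[0])
    return "&".join(arguments)


print(normalize_query("title=Reverse_URLs&action=edit"))
# action=edit&title=Reverse_URLs
```

Sorting by name only, instead of by the whole `name=value` string, avoids changing the meaning of queries in which the order of repeated arguments matters.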
In the Heritrix project, the LRU principle is called SURT (Sort-friendly URI Reordering Transform); see the SURT Java API.