Software architecture group
Benjamin Ooghe-Tabanou edited this page Dec 21, 2012 · 1 revision
Paul, Erik, Guillaume, Thomas, Elias, Sébastian, Camille, Mathieu, Guillaume, Raphaël
- What to do with web contents? Indexation vs archiving? Scraping?
- How should qualitative navigation and quantitative crawling work with each other?
- What place should the exploratory tools take in the method?
- server side - client side issue
- which crawler?
- standalone version?
- written in Java
- only crawls
- Firefox plugin
- XUL and JavaScript
- crawls using Firefox browser engine to harvest links
- set a corpus : boundaries + tag system
- export in GDF for graph view in Gephi
- save in WXSF XML format
- crawls and archives
- bots (phagosite, Heritrix, automated Firefox) connect to a scheduler which distributes jobs, with a crawl policy on top (prospect, archive, ...)
- output = XML
- easier to distribute on different machines
- uses DAFF (handles redundancy, in contrast to ARC...)
- VORTEX (proxy, scheduler, application proxy)
- a lot of libraries to parse links and provide ‘services’
- can be a shared server / repository
- installable on ‘any’ Linux machine, turning that machine into a proxy which properly archives whatever is requested through it. It outputs DAFF or ARC, the latter of which can be converted into WARC
- Heritrix: http://crawler.archive.org/
- Crawler used by IIPC and BNF. For big crawls, difficult to distribute, centralized architecture.
- IIPC Web Curator Tool http://webcurator.sourceforge.net/
- Topology: GEXF, Guess .GDF, List of Gephi supported formats
- web corpus:
- ARC
- Internet Archive
- very simple
- record based HTTP response storage
- WARC:
- IIPC, Heritrix
- ISO standard
- complex
- must define a policy
- IssueCrawler format (based on XML)
- Navicrawler formats: GDF, WXSF (based on XML)
- CSV
- DAFF
- ARC
- Analysis: statistics
- live crawling
- archiving
- web corpus handler
- web corpus definition
- exploration tool
- granularity of crawling
- The granularity defines a tree of “grains”
- Granularity is the technical precision (makes an approximation that may be necessary in large crawls)
- Examples:
- “Full” precision: we take all URLs, and consider them just as strings.
- The graph of pages is stored
- We can define WebEntities by regexp
- “Stem” precision: URLs are considered as a series of stems.
- The graph of stemmed URLs (and links from a stemmed URL to stemmed URLs) is stored
- We can define WebEntities as all the stemmed URLs that begin with the same series of stems, e.g. “blogspot.com” followed by anything.
- boundaries on the URLs
- qualification
- yes, within the limits of granularity
- before crawling iteration
- depends on granularity
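
The “stem” precision discussed under granularity can be illustrated with a minimal Python sketch. It is not the group's implementation: the function names (`url_to_stems`, `belongs_to_entity`, `stemmed_graph`) are hypothetical, and the stemming rule is an assumption, namely that host labels are reversed so the most general stem comes first (“com”, “blogspot”, ...) and are followed by the path segments.

```python
from urllib.parse import urlsplit


def url_to_stems(url):
    """Split a URL into a series of stems, most general first.

    Assumption (not from the notes): host labels are reversed
    ("com", "blogspot", "myblog") and followed by the path segments.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    host_stems = [label for label in reversed(host.split(".")) if label]
    path_stems = [segment for segment in parts.path.split("/") if segment]
    return tuple(host_stems + path_stems)


def belongs_to_entity(url, entity_prefix):
    """Return True if the stemmed URL begins with the entity's series of stems."""
    stems = url_to_stems(url)
    return stems[:len(entity_prefix)] == tuple(entity_prefix)


def stemmed_graph(page_links, depth):
    """Collapse (source, target) page links into links between URLs
    truncated to `depth` stems, i.e. a coarser grain of the same graph."""
    edges = set()
    for source, target in page_links:
        edges.add((url_to_stems(source)[:depth], url_to_stems(target)[:depth]))
    return edges


# A WebEntity defined as "blogspot.com followed by anything".
blogspot = ("com", "blogspot")
print(belongs_to_entity("http://myblog.blogspot.com/2012/12/post.html", blogspot))
print(belongs_to_entity("http://example.org/about", blogspot))
```

With this representation a WebEntity is just a prefix of stems, and truncating every URL to a fixed number of stems before storing links gives the approximation that large crawls may need.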