cl-mechanize

The README says this library tries to be a clone of Perl’s WWW::Mechanize. There is also a Python library called mechanize. It seems stateful web scrapers are popular among some developers.

When I tried to use cl-mechanize to log into Reddit, it didn’t work. The fetch function should discover all forms along with their inputs, but the login form came back empty. Without the CSRF token, I wasn’t able to log in.

But I found a fork https://github.com/ilook/cl-mechanize where this problem was fixed.

Let’s create a program which will fetch your karma and latest comments from Reddit!

First, we need to log in. Mechanize operates on a browser object, which keeps information about the current page and cookies:

POFTHEDAY> (defparameter *browser*
             (make-instance 'cl-mechanize:browser))

POFTHEDAY> (cl-mechanize:fetch "https://www.reddit.com/login/"
                               *browser*)
#<CL-MECHANIZE:PAGE {100A2D7FA3}>

POFTHEDAY> (mechanize:page-forms *)
(#<CL-MECHANIZE:FORM {100A2D4923}>)

POFTHEDAY> (defparameter *login-form* (first *))

POFTHEDAY> (mechanize:form-inputs *login-form*)
(("otp-type" . "app") ("otp" . "") ("password" . "") ("username" . "")
 ("is_mobile_ui" . "False") ("ui_mode" . "") ("frontpage_signup_variant" . "")
 ("is_oauth" . "False")
 ("csrf_token" . "ba038152b86951ab28725c37ed0b3e96d640d083")
 ("dest" . "https://www.reddit.com") ("cookie_domain" . ".reddit.com"))

POFTHEDAY> (setf (alexandria:assoc-value
                  (mechanize:form-inputs *login-form*)
                  "username" :test #'string=)
                 "svetlyak40wt")

POFTHEDAY> (setf (alexandria:assoc-value
                  (mechanize:form-inputs *login-form*)
                  "password" :test #'string=)
                 "********")
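
Setting fields this way gets repetitive, and the :test #'string= part is easy to forget: the default test is eql, which does not compare strings by content. Here is a minimal sketch of a helper for it. The name set-field is hypothetical and not part of cl-mechanize; it works on any alist with string keys, assuming Alexandria is available through Quicklisp:

```lisp
(ql:quickload :alexandria)

;; SET-FIELD is a hypothetical helper, not part of cl-mechanize.
(defun set-field (inputs name value)
  "Set the field NAME to VALUE in the alist INPUTS and return the alist.
An existing entry is modified in place; a missing one is added in front."
  (setf (alexandria:assoc-value inputs name :test #'string=) value)
  inputs)

;; Build the alist with LIST and CONS, because modifying
;; a quoted literal is undefined behaviour.
(let ((inputs (list (cons "password" "") (cons "username" ""))))
  (set-field inputs "username" "alice"))
;; => (("password" . "") ("username" . "alice"))
```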

However, ilook’s version of cl-mechanize does not work either. It fails on form submission with the following error:

“Don’t know how to handle method :|post|.”
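
The odd-looking :|post| in this message is a keyword whose name is lowercase. The standard reader upcases symbol names, so :post in source code is really :POST, which is a different symbol. My guess is that the method is taken from the form’s HTML attribute and interned without upcasing; I did not verify this in the library’s source. The difference itself is easy to show in plain Common Lisp:

```lisp
;; Interning a lowercase string produces a keyword distinct from :POST.
(let ((raw (intern "post" :keyword))
      (normalized (intern (string-upcase "post") :keyword)))
  (list (eq raw :post)          ; the lowercase keyword is not :POST
        (eq normalized :post)   ; upcasing first gives the usual keyword
        (symbol-name raw)))
;; => (NIL T "post")
```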

To overcome this issue we’ll set the method to the proper keyword:

POFTHEDAY> (setf (mechanize:form-method *login-form*)
                 :post)

POFTHEDAY> (mechanize:submit *login-form* *browser*)

POFTHEDAY> (cl-mechanize:fetch "https://www.reddit.com/"
                               *browser*)

POFTHEDAY> (cl-ppcre:scan-to-strings
            "(\\d+) karma"
            (mechanize:page-content *))
"708 karma"
#("708")
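
If you need the karma as a number rather than a string, cl-ppcre’s register-groups-bind can convert the captured group on the fly. This is only a sketch: extract-karma is a hypothetical helper name, and the sample HTML string is made up for the example:

```lisp
(ql:quickload :cl-ppcre)

;; EXTRACT-KARMA is a hypothetical helper for illustration.
(defun extract-karma (html)
  "Return the karma found in HTML as an integer, or NIL if nothing matches."
  (cl-ppcre:register-groups-bind ((#'parse-integer karma))
      ("(\\d+) karma" html)
    karma))

(extract-karma "<span>708 karma</span>")
;; => 708
```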

Now we’ll fetch the last 3 comments:

;; Mechanize could be enhanced to handle relative URLs; for now they fail:
POFTHEDAY> (cl-mechanize:fetch "/message/inbox"
                               *browser*)
; Debugger entered on #<DRAKMA:PARAMETER-ERROR
; "Don't know how to handle scheme ~S." {100252AA63}>

;; I found that the page /message/inbox does not contain messages,
;; and you have to fetch this one instead:
POFTHEDAY> (cl-mechanize:fetch "https://www.reddit.com/message/inbox?embedded=true"
                               *browser*)
; Debugger entered on #<TYPE-ERROR expected-type: STRING datum: NIL>

As you can see, cl-mechanize failed to fetch even this simple page. This library is 10 years old and still has so many bugs :(
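
The relative URL failure, at least, can be worked around from the outside by resolving the path against a base URL before calling fetch. Below is a rough sketch using the PURI library. The name absolute-url is hypothetical, and the base URL has to be passed explicitly because I don’t know of a documented way to ask the browser object for the current page’s address:

```lisp
(ql:quickload :puri)

;; ABSOLUTE-URL is a hypothetical helper, not part of cl-mechanize.
(defun absolute-url (url base)
  "Resolve URL against BASE and return the result as a string."
  (puri:render-uri (puri:merge-uris url base) nil))

;; Intended usage (not run here, it needs a logged-in *browser*):
;; (cl-mechanize:fetch (absolute-url "/message/inbox"
;;                                   "https://www.reddit.com/")
;;                     *browser*)
```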

Also, I found it very unpleasant to work with cxml-stp’s API. CL-Mechanize parses the page’s body into cxml data structures, and it was hard to figure out how to search for the nodes I need.

If you know of another Common Lisp library that can keep cookies and is suitable for web scraping, please let me know.