Skip to content

cweatureapps/SwiftScraper

Repository files navigation

SwiftScraper

Web scraping library for Swift.

Overview

This framework provides a simple way to declaratively define a series of steps in Swift that represent how to scrape a web site, allowing the app to read this web page data.

Features

  • Declarative API - clearly define the steps to run and avoid the spaghetti 🍝 code that comes with using the WebView and the delegate pattern
  • Custom JavaScript integration - Simple integration with custom JavaScript to perform complicated scraping, using the language of the web to process the web page
  • Perform custom processing at each step
  • Passing data between steps
  • Control flow to determine which step to run next, allowing basic conditionals and loops

Example

If you want to read the finished code example straight away:

For a step by step guide on how to implement these, read the next two sections.

Tutorial

In this tutorial, we'll cover the basic usage of this framework by performing a search on the google web site.

CocoaPod integration

Reference this pod in your Podfile:

pod "SwiftScraper", git: "https://github.com/cweatureapps/SwiftScraper.git"

JavaScript setup

By convention, all the steps will use the functions exposed in a single module which is defined in a single JavaScript file.

For this exercise, create a new file, GoogleSearch.js

Start by creating the blank JavaScript module structure, making sure the module name is the same as the file name:

var GoogleSearch = (function() {
    return {
    };
})()

Loading a web page

Create a new view controller.

Import the framework:

import SwiftScraper

In the view controller, we'll create a step and run it:

var stepRunner: StepRunner!

override func viewDidLoad() {
    super.viewDidLoad()
    let step1 = OpenPageStep(path: "https://www.google.com")
    stepRunner = StepRunner(moduleName: "GoogleSearch", steps: [step1])
    stepRunner.insertWebViewIntoView(parent: view)
    stepRunner.run()
}

When you run this, you will see a web view opening the Google home page.

The web view typically needs to be have a visible frame size, because web sites often use responsive breakpoints and will even sometimes change the HTML structure based on the dimensions of the page.

The insertWebViewIntoView method helps you to easily insert the web view into any UIView that you have. It is up to you to set up the dimensions of the parent view, or you can even hide it where the user cannot see it.

Check that the page loaded

We can add an assertion to run some JavaScript code when the page loads, to make sure the page that loaded is expected. We can do this by referencing a JavaScript function which is exposed by the module.

In the GoogleSearch.js file, add the following function which will just check the title of the page is correct.

var GoogleSearch = (function() {
    function assertGoogleTitle() {
        return document.title == "Google";
    }
    return {
        assertGoogleTitle: assertGoogleTitle
    };
})()

In the view controller where you created the step, include the name of the assertion function:

let step1 = OpenPageStep(
                path: "https://www.google.com",
                assertionName: "assertGoogleTitle")

The assertion function runs immediately when the page loads. Sometimes, what you are asserting may not be ready at the point when the page loads, as the website may modify the page asynchronously after loading.

Observe progress of the run

You can observe the progress of the execution by observing the state property of the StepRunner object.

    stepRunner.state.afterChange.add { change in
        print("-----", change.newValue, "-----")
        switch change.newValue {
        case .inProgress(let index):
            print("About to run step at index", index)
        case .failure(let error):
            print("Failed: ", error.localizedDescription)
        case .success:
            print("Finished successfully")
        default:
            break
        }
    }
    stepRunner.run()

Run script that loads page

Let's now run some custom JavaScript to submit a google search. This is the PageChangeStep which runs some JavaScript, which will result in a new page being loaded. When the page is loaded, it will proceed to the next step.

Firstly, in the GoogleSearch.js file, add the following 2 functions which perform the search, and exposes them in the module:

var GoogleSearch = (function() {

    // ...

    function performSearch(searchText) {
        document.querySelector('input[type="text"], input[type="Search"]').value = searchText;
        document.forms[0].submit();
    }    
    function assertSearchResultTitle() {
        return document.title == "SwiftScraper iOS - Google Search";
    }  
    return {
        assertGoogleTitle: assertGoogleTitle,
        performSearch: performSearch,
        assertSearchResultTitle: assertSearchResultTitle
    };
})()

In the view controller, add step 2 which is the PageChangeStep, referencing the JavaScript functions you just implemented:

let step2 = PageChangeStep(
                functionName: "performSearch",
                params: "SwiftScraper iOS",
                assertionName: "assertSearchResultTitle")

Notice the params parameter in the initializer, which allows you to pass data to the JavaScript function.

Make sure to include this in the array of steps when you create the StepRunner:

stepRunner = StepRunner(moduleName: "GoogleSearch", steps: [step1, step2])

Run script and process

We're at the last step - we can run a script to scrape the contents of the page. Add the following JavaScript function which will get the search results, and return an Array of JSON objects with the text and href of each link.

var GoogleSearch = (function() {

    // ...

    function getSearchResults() {
        var headings = document.querySelectorAll('h3.r');
        return Array.prototype.slice.call(headings).map(function (h3) {
            return { 'text': h3.innerText, 'href': h3.childNodes[0].href };
        });
    }

    return {
        assertGoogleTitle: assertGoogleTitle,
        performSearch: performSearch,
        assertSearchResultTitle: assertSearchResultTitle,
        getSearchResults: getSearchResults
    };
})()

In the Swift code, add the 3rd step which is a ScriptStep, a step which runs a JavaScript function and returns the response that the function returns.

let step3 = ScriptStep(functionName: "getSearchResults") { response, _ in
    if let responseArray = response as? [JSON] {
        responseArray.forEach { json in
            if let text = json["text"], let href = json["href"] {
                print(text, "(", href, ")")
            }
        }
    }
    return .proceed
}

And make sure to include this in the array of steps when you create the StepRunner:

stepRunner = StepRunner(moduleName: "GoogleSearch", steps: [step1, step2, step3])

Run this. You should see the steps complete successfully, and print the search results to the console.

Congratulations! πŸŽ‰ You've finished the tutorial on the basic usage of this library! πŸ‘

Advanced Usage

Run script that returns data async

It is possible to run some JavaScript that does not return immediately, and wait for it to asynchronously call back the Swift code after some time has passed. For example, you may need to do something on the web page, poll for the operation to complete, and then pass the data back to Swift.

To pass data back to Swift world, call SwiftScraper.postMessage(), passing a single object that can be serialized back to a Swift object.

In this example, we'll do a google image search, and then scroll down to the bottom. The infinite scroll pattern employed here will load more images when we do this, and we'll do a count of the images before and after the scroll.

var GoogleSearch = (function() {

    // ...

    function scrollAndCountImages() {
        var firstCount = document.querySelectorAll('img').length;
        window.scrollTo(0, document.body.scrollHeight);
        setTimeout(function () {
            var secondCount = document.querySelectorAll('img').length;
            SwiftScraper.postMessage({'first': firstCount, 'second': secondCount});
        }, 2000);        
    }

    return {
        assertGoogleTitle: assertGoogleTitle,
        performSearch: performSearch,
        assertSearchResultTitle: assertSearchResultTitle,
        getSearchResults: getSearchResults,
        scrollAndCountImages: scrollAndCountImages
    };
})()

For those familiar with WKWebView, the SwiftScraper.postMessage() function is just an alias for webkit.messageHandlers.swiftScraperResponseHandler.postMessage()

In Swift, use the AsyncScriptStep, which is used in the same way as ScriptStep, with the difference being the handler is not called until SwiftScraper.postMessage is called. It is expected that the JavaScript function itself does not return anything.

let step1 = OpenPageStep(path: "https://www.google.com.au/search?tbm=isch")

let step2 = PageChangeStep(
    functionName: "performSearch",
    params: "ankylosaurus")

let step3 = AsyncScriptStep(functionName: "scrollAndCountImages") { response, _ in
    if let json = response as? JSON {
        if let first = json["first"], let second = json["second"] {
            print("first: ", first, "second: ", second)
        }
    }
    return .proceed
}

Process Step

Use the ProcessStep when you need a step that requires some custom action to be performed.

let processStep = ProcessStep { model in
    // perform some custom action here
    return .proceed
}

Two main concepts to note here are:

  • The model parameter, used for passing model data between steps
  • The return value, which can be used for control flow

These concepts apply to the ProcessStep, ScriptStep and AsyncScriptStep. We'll explore them in the next two sections.

Passing model data

The ProcessStep, ScriptStep and AsyncScriptStep all have a handler closure to perform processing, and these handlers all have a model parameter of type inout JSON. Modify this JSON dictionary to save data during one step, and then read it in another step.

Let's modify the AsyncScriptStep from the previous section to save the before and after counts to the dictionary.

let step3 = AsyncScriptStep(functionName: "scrollAndCountImages") { response, model in  // notice the model param
    if let json = response as? JSON {
        if let first = json["first"], let second = json["second"] {
            print("first: ", first, "second: ", second)

            // Save the data to the model dictionary
            model["first"] = first
            model["second"] = second            
        }
    }
    return .proceed
}

Control Flow

The return value is an enum which can be used for rudimentary control flow. We've seen .proceed which means to go to the next step. The .jumpToStep(n) allows you to jump to another step, either before or after the current step. This allows you to define loops (by jumping back) as well as conditionals (by jumping forward).

Let's continue the infinite scrolling image search example, and add a ProcessStep that will keep looping back to step3 until the before count and after count are the same, meaning there are no more images on the page to load.

Add this step as the last step to run. When you run this, you should see the screen keep scrolling down until no more images can be found.

let conditionStep = ProcessStep { model in
    if let first = model["first"] as? Int,
        let second = model["second"] as? Int,
        first == second {
        return .proceed
    } else {
        return .jumpToStep(2) // This is a zero-based index, i.e. step3
    }
}

This technique is most useful for repeating a sequence of steps. While it can also be used to model IF-THEN style conditionals, it is essentially a GOTO construct and can easily lead to unmaintainable spaghetti 🍝 steps.

You can also have an early exit from the steps. The return value of .finish will stop execution as a success, while .failure(Error) will stop execution with a failure.

Wait Step

A step that waits for a set period of time.

let waitStep = WaitStep(waitTimeInSeconds: 0.5)

Wait for Condition Step

This is a step that waits for a condition to become true before proceeding, or it will fail if the condition is still false when the timeout occurs.

In this example, the iOS code will repeatedly call the JavaScript function testThatStuffIsReady, proceeding as soon as it returns true, or failing with a timeout if it doesn't return true within 2 seconds.

let waitForConditionStep = WaitForConditionStep(
    assertionName: "testThatStuffIsReady",
    timeoutInSeconds: 2)

FAQ

I'm getting the error: "An SSL error has occurred and a secure connection to the server cannot be made."

App Transport Security (ATS) rules apply web views as well. If the website you are loading is not HTTPS, or uses outdated security protocols, then iOS will refuse to load it.

The quick workaround is to disable ATS by putting the following setting in your Info.plist

<key>NSAppTransportSecurity</key>
<dict>
    <key>NSAllowsArbitraryLoads</key>
    <true/>
</dict>

However, at some point in the future, Apple may require that all apps submitted to the App Store support ATS.

For more information, see the following links:

License

SwiftScraper is available under the MIT license. See the LICENSE file for more info.