Splash 用来执行爬取的 JS 代码,可渲染,也可直接获取到变量值,就跟在 Chrome console 执行 JS 一样!
首先安装 Sqlash 服务,Sqlash 最好用 Docker 安装,这里我使用 K8s 部署的:
splash-service.yaml
apiVersion: v1
kind: Service
metadata:
labels:
app: app
name: splash
namespace: ns-app
spec:
type: NodePort
ports:
- name: "8050"
port: 8050
targetPort: 8050
protocol: TCP
nodePort: 30001
selector:
app: splash
splash-deployment.yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
labels:
app: my-app
name: splash
namespace: ns-app
spec:
replicas: 1
strategy: {}
template:
metadata:
labels:
app: splash
spec:
containers:
- image: scrapinghub/splash:3.2
name: splash
ports:
- containerPort: 8050
resources: {}
restartPolicy: Always
status: {}
然后创建命名空间 kubectl create namespace ns-app
再执行 kubectl apply -f .
即可,验证安装完成:
$ kubectl get pods -n ns-app
NAME READY STATUS RESTARTS AGE
splash-675d87bdfb-x87v6 1/1 Running 0 26h
kubectl get svc -n ns-app
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
splash NodePort 10.191.47.80 <none> 8050:30001/TCP 26h
这样就可以使用了,访问 Pod 所在宿主机的 ip 加上端口 30001 即可,比如我的是:http://fueltank-1:30001
我这里是使用的 Scrapy 搭配 Splash 来使用的。
首先安装客户端库 pip3 install scrapy-splash
然后在 setting.py 中加入以下代码:
DOWNLOADER_MIDDLEWARES = {
# 'yarn_spider.middlewares.YarnSpiderDownloaderMiddleware': 543,
'scrapy_splash.SplashCookiesMiddleware': 723,
'scrapy_splash.SplashMiddleware': 725,
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPLASH_URL = 'http://fueltank-1:30001'
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
OK,可以写代码了。爬虫代码:
# -*- coding: utf-8 -*-
import json
import scrapy
from scrapy_splash import SplashRequest
class YarnSpider(scrapy.Spider):
name = 'yarn'
allowed_domains = ['fueltank-2.cloud.bbdops.com']
lua_script = """
function main(splash)
assert(splash:go(splash.args.url))
splash:wait(0.1)
local result = splash:evaljs("appsTableData")
return result
end
"""
def start_requests(self):
yield SplashRequest(
"http://10.28.102.152:8088/cluster/apps/RUNNING", self.parse,
endpoint='execute',
args={
'lua_source': self.lua_script
}
)
def parse(self, response):
# print(response.body)
cell_list = json.loads(response.body)
# print(cell_list)
for cell in cell_list:
app_id = cell[0].split(">")[1].split("<")[0]
user = cell[1]
name = cell[2]
application_type = cell[3]
queue = cell[4]
start_time = cell[5]
finish_time = cell[6]
state = cell[7]
final_status = cell[8]
running_containers = cell[9]
allocated_cpu_vcores = cell[10]
allocated_memory_mb = cell[11]
tracking_ui_url = cell[13].split("href='")[1].split("'")[0]
json_str = json.dumps({"app_id": app_id, "user": user, "name": name, "application_type": application_type,
"queue": queue, "start_time": start_time, "finish_time": finish_time, "state": state,
"final_status": final_status, "running_containers": running_containers,
"allocated_cpu_vcores": allocated_cpu_vcores,
"allocated_memory_mb": allocated_memory_mb,
"tracking_ui_url": tracking_ui_url
}).encode("utf-8").decode('unicode_escape')
print(json_str)
python 里面写了一段 lua 脚本。
具体使用方式看官方教程:https://splash.readthedocs.io/en/stable/