Web Scraping Notes

Publish Date: 2022-05-07

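Using Selenium

Selenium is a browser automation and testing tool. Through a browser driver it simulates a user's actions in a real browser, so a scraper can read the fully rendered page directly instead of working out how the site encrypts its requests.
Official documentation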

Initializing the browser object

from selenium import webdriver
browser = webdriver.Chrome()  # create a Chrome browser object
browser = webdriver.Edge()  # create an Edge browser object
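
Each browser object starts a real browser process, so it is worth closing it once the work is done. The following is a minimal sketch, assuming a hypothetical helper named managed (it is not part of Selenium) that calls quit() even when the body raises an exception:

```python
from contextlib import contextmanager


@contextmanager
def managed(driver):
    """Yield the driver and always call driver.quit() afterwards.

    `managed` is a hypothetical helper name, not a Selenium API.
    """
    try:
        yield driver
    finally:
        driver.quit()  # closes every window and ends the driver session


def demo():
    # Needs Selenium and a matching browser driver; call it manually.
    from selenium import webdriver

    with managed(webdriver.Chrome()) as browser:
        browser.get("https://www.baidu.com")
        print(browser.title)
```

The helper only relies on the driver having a quit() method, so it works the same way for Chrome and Edge objects.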

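Visiting a page

Use the get method to send a GET request and load a page. The URL must include the scheme, for example https://.

browser.get("https://www.baidu.com")
print(browser.page_source)  # print the page source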
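
A bare host name such as www.baidu.com is not a complete URL, and get expects one. As a small sketch of one way to guard against that, the hypothetical helper below (ensure_scheme is an illustrative name, not a Selenium function) adds a scheme when none is present:

```python
def ensure_scheme(url, default="https"):
    """Return url unchanged if it already has a scheme, else prefix one."""
    if "://" in url:
        return url
    return f"{default}://{url}"


print(ensure_scheme("www.baidu.com"))          # https://www.baidu.com
print(ensure_scheme("http://www.baidu.com"))   # http://www.baidu.com
```

The result can then be passed straight to browser.get().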

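Finding nodes

A single node

Older Selenium releases provided methods such as find_element_by_name and find_element_by_id, along with find_element_by_xpath for XPath and find_element_by_css_selector for CSS selectors. These helpers were deprecated in Selenium 4 and removed in version 4.3.
Selenium also provides find_element(), which takes two arguments: the lookup strategy and the value to look up. This is the form to use.

from selenium.webdriver.common.by import By

browser.find_element_by_id('q')  # legacy form, removed in Selenium 4.3
# is equivalent to
browser.find_element(By.ID, 'q')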

Multiple nodes

find_element returns only the first matching node. To get every match, use find_elements, which takes the same arguments and returns a list.
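
Because find_elements returns a list, the usual pattern is to loop over the result. A minimal sketch, assuming a hypothetical helper texts_of and, in the demo, an assumed h2 selector for the titles on the page:

```python
def texts_of(driver, by, value):
    """Return the stripped text of every element matching the locator."""
    return [element.text.strip() for element in driver.find_elements(by, value)]


def demo():
    # Needs Selenium and a Chrome driver; call it manually.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    browser = webdriver.Chrome()
    try:
        browser.get("https://spa2.scrape.center/")
        # "h2" is an assumed selector, not taken from the site's markup.
        print(texts_of(browser, By.CSS_SELECTOR, "h2"))
    finally:
        browser.quit()
```

If nothing matches, find_elements returns an empty list instead of raising, so the helper returns an empty list as well.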

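Interacting with nodes

Selenium can also interact with the page through the nodes it finds:

send_keys() types text into a form field (locate the input node first)
click() clicks a node (locate the button node first)
screenshot() saves the current element as a PNG file; pass it the file path
submit() submits the form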

from selenium import webdriver
from selenium.webdriver.common.by import By
import time

url = "https://www.taobao.com"
browser = webdriver.Chrome()
browser.get(url)
search_box = browser.find_element(By.ID, 'q')  # the search input
search_box.send_keys("IPhone")
time.sleep(2)  # pause briefly before clicking
button = browser.find_element(By.CLASS_NAME, 'btn-search')  # the search button
button.click()

Getting node information

Getting node attributes

After selecting a node, call its get_attribute method to read one of its attributes.

from selenium import webdriver
from selenium.webdriver.common.by import By

url = "https://spa2.scrape.center/"
browser = webdriver.Chrome()
browser.get(url)
logo = browser.find_element(By.CLASS_NAME,'logo-image')
print(logo.get_attribute('src'))

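Getting node text

Every WebElement has a text attribute; read it directly to get the node's text.

Getting the ID, location, tag name, and size

A WebElement's id, location, tag_name, and size attributes hold the corresponding information.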
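
To see these attributes together, the sketch below collects them into one dictionary. The helper name describe is hypothetical; it only reads the attributes named above and works on any object that exposes them:

```python
def describe(element):
    """Collect the basic attributes of a WebElement into a dict."""
    return {
        "id": element.id,              # Selenium's internal element id
        "tag_name": element.tag_name,  # e.g. "img" or "div"
        "text": element.text,          # visible text of the node
        "location": element.location,  # dict with "x" and "y"
        "size": element.size,          # dict with "height" and "width"
    }


def demo():
    # Needs Selenium and a Chrome driver; call it manually.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    browser = webdriver.Chrome()
    try:
        browser.get("https://spa2.scrape.center/")
        logo = browser.find_element(By.CLASS_NAME, "logo-image")
        print(describe(logo))
    finally:
        browser.quit()
```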

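Action chains

The interaction methods above are all called on a node, but some operations are not tied to one, such as keyboard input and mouse movement. These are handled with action chains.

drag_and_drop(source, target) drags the source node onto the target node
drag_and_drop_by_offset(source, xoffset, yoffset) drags the source node by the given offset
move_to_element(target) moves the mouse to the target node
click() left-clicks the node under the current mouse position
click_and_hold() presses the left button on the node under the mouse and holds it down
context_click() right-clicks
double_click() double-clicks
key_down(value) holds down a key, and key_up(value) releases it
Example: sending Ctrl+C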
```
ActionChains(driver).key_down(Keys.CONTROL).send_keys('c').key_up(Keys.CONTROL).perform()
```

Usage:

Instantiate an ActionChains object. The constructor takes the driver, which is required, and an optional duration, which defaults to 250 (milliseconds).

from selenium import webdriver
from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By

url = "https://www.runoob.com/try/try.php?filename=jqueryui-api-droppable"
browser = webdriver.Chrome()
browser.get(url)
browser.switch_to.frame("iframeResult")
action = ActionChains(browser,duration=1000)
source = browser.find_element(By.ID,"draggable")
target = browser.find_element(By.ID,"droppable")
action.drag_and_drop(source,target).perform()
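
Calls on an action chain are only queued; nothing happens in the browser until perform() runs. As a further sketch of chaining, the hypothetical helper below hovers over one node and then clicks another. The helper name and the menu and item nodes are assumptions for illustration:

```python
def hover_then_click(actions, menu, item):
    """Queue a hover over `menu`, a short pause, and a click on `item`.

    `actions` is expected to be an ActionChains instance. Every call
    returns the chain itself, and perform() runs the queued steps.
    """
    actions.move_to_element(menu).pause(0.5).click(item).perform()


def demo(browser, menu, item):
    # Needs Selenium; `browser`, `menu` and `item` come from your own page.
    from selenium.webdriver import ActionChains

    hover_then_click(ActionChains(browser), menu, item)
```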

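Running JavaScript

Selenium has no dedicated method for some actions, such as calling alert() or dragging a progress bar. For these it provides execute_script(), which runs JavaScript that you write directly.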

from selenium import webdriver

browser = webdriver.Chrome()
url = "https://www.zhihu.com/explore"
browser.get(url)
browser.execute_script("window.scrollTo(0,document.body.scrollHeight)")
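
execute_script also returns whatever the script returns, which makes it possible to scroll repeatedly until a lazily loaded page stops growing. A minimal sketch, assuming a hypothetical helper scroll_to_bottom and arbitrary values for the pause and the round limit:

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=20):
    """Scroll down until the page height stops changing; return that height."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        time.sleep(pause)  # give newly requested content time to load
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was added, so this is the bottom
        last_height = new_height
    return last_height
```

Calling scroll_to_bottom(browser) after browser.get(url) would replace the single scrollTo call in the example above. The max_rounds limit keeps the loop from running forever on pages that never stop loading.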

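Switching frames

A page can contain iframe nodes, i.e. child frames, each of which acts as a sub-page of the main page. find_element called on the main page cannot find nodes inside a sub-page; you have to switch to the frame first.
switch_to.frame() switches to the given frame, and switch_to.parent_frame() switches back to the parent page.
switch_to can also switch to other targets; for example, switch_to.window() switches to the given tab.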
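
It is easy to forget to switch back after working inside a frame. One possible pattern, sketched here with a hypothetical context manager named in_frame, returns to the top-level page automatically:

```python
from contextlib import contextmanager


@contextmanager
def in_frame(driver, frame_reference):
    """Switch into a frame for the duration of the with-block.

    On exit this returns to the top-level document; for nested frames
    switch_to.parent_frame() may be the better choice.
    """
    driver.switch_to.frame(frame_reference)
    try:
        yield driver
    finally:
        driver.switch_to.default_content()


def demo(browser):
    # Needs Selenium; `browser` is a driver already on the demo page.
    from selenium.webdriver.common.by import By

    with in_frame(browser, "iframeResult"):
        print(browser.find_element(By.ID, "draggable").text)
```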

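Waiting

The browser does not render everything at once; content loaded through Ajax takes extra time to arrive. Looking up nodes therefore often requires waiting.

Implicit waits

With an implicit wait set, Selenium does not raise an exception as soon as a lookup fails. It keeps retrying the lookup until the wait time runs out, and raises only if the node is still missing. The default implicit wait is 0.

from selenium import webdriver

browser = webdriver.Chrome()
browser.implicitly_wait(10)  # set the implicit wait to 10 seconds (the argument is in seconds, not milliseconds)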

Explicit waits

An explicit wait sets a wait time and a condition for one specific lookup, so different lookups can wait for different things.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = "https://www.taobao.com/"
browser = webdriver.Chrome()
browser.get(url)
wait = WebDriverWait(browser, 10)  # create a WebDriverWait object; the second argument is the maximum wait in seconds
search_box = wait.until(EC.presence_of_element_located((By.ID, 'q')))  # wait for the node to appear; raises TimeoutException if it has not appeared after 10 s
button = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, '.btn-search')))  # element_to_be_clickable waits until the node can be clicked
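
WebDriverWait.until accepts any callable that takes the driver and returns a truthy value once the condition holds, so custom conditions are possible in addition to the built-in ones. A sketch, assuming a hypothetical condition factory at_least_n_elements and an assumed .item selector in the demo:

```python
def at_least_n_elements(by, value, n):
    """Build a wait condition that passes once n matching nodes exist."""

    def _condition(driver):
        elements = driver.find_elements(by, value)
        # Returning False tells the wait to keep polling.
        return elements if len(elements) >= n else False

    return _condition


def demo(browser):
    # Needs Selenium; `browser` is a driver that has already loaded a page.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait

    # ".item" is an assumed selector used only for illustration.
    items = WebDriverWait(browser, 10).until(
        at_least_n_elements(By.CSS_SELECTOR, ".item", 5)
    )
    print(len(items))
```

until returns whatever the condition returned, so items here is the list of matching nodes.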

Going forward and back

The back method goes back one page in the history, and forward goes forward again.

from selenium import webdriver

url1 = "https://www.taobao.com/"
url2 = ""  # left blank in the original; set to a second page URL
url3 = ""  # left blank in the original; set to a third page URL
browser = webdriver.Chrome()
browser.get(url1)
browser.get(url2)
browser.get(url3)
browser.back()  # go back to url2
browser.forward()  # go forward to url3 again

Cookie

Selenium also provides methods for managing cookies:

get_cookies(): returns all cookies
add_cookie(): adds a cookie
delete_all_cookies(): deletes all cookies
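
get_cookies() returns a list of dictionaries, each with at least a name and a value key. A common follow-up step is to flatten that list into a plain mapping, for example to reuse a logged-in session in another HTTP client. A minimal sketch with a hypothetical helper cookies_to_dict and made-up sample cookies:

```python
def cookies_to_dict(cookies):
    """Flatten Selenium's cookie list into a {name: value} dict."""
    return {cookie["name"]: cookie["value"] for cookie in cookies}


# Sample data in the shape returned by browser.get_cookies();
# the names and values here are invented for illustration.
sample = [
    {"name": "session", "value": "abc123", "domain": "example.com"},
    {"name": "lang", "value": "zh", "domain": "example.com"},
]
print(cookies_to_dict(sample))  # {'session': 'abc123', 'lang': 'zh'}
```

With a real browser, the call would be cookies_to_dict(browser.get_cookies()).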

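Tab management

A browser can have several tabs open, and Selenium can operate on them. Note that get navigates within the current tab rather than opening a new one; new tabs are opened by the page itself or by script, for example with execute_script("window.open()").

window_handles: an attribute of browser that lists the handles of all its tabs
switch_to.window() switches to the given tab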
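
After a new tab opens, the driver still points at the old one until you switch. One way to find the new tab, sketched below with a hypothetical helper switch_to_new_tab, is to compare window_handles against the handles recorded beforehand:

```python
def switch_to_new_tab(driver, known_handles):
    """Switch to the first tab whose handle is not in known_handles.

    Returns the new handle, or None if no new tab was found.
    """
    for handle in driver.window_handles:
        if handle not in known_handles:
            driver.switch_to.window(handle)
            return handle
    return None


def demo(browser):
    # Needs Selenium; `browser` is a driver that has already loaded a page.
    before = set(browser.window_handles)
    browser.execute_script("window.open()")  # open a blank tab
    switch_to_new_tab(browser, before)
    browser.get("https://www.taobao.com")  # loads in the new tab
```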

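Exception handling

Selenium defines a number of exceptions. A few common ones are listed here; see the official documentation for the rest.

ElementNotInteractableException: the node cannot be interacted with, possibly because it has not finished loading.
NoSuchElementException: the node was not found.
InvalidSelectorException: the selector is invalid; the syntax may be wrong, or the expression may not select an element (for example, an XPath that returns a non-element value).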
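
Since some of these exceptions only mean that the page is not ready yet, one possible way to handle them is to retry the action a few times. The helper below is a generic sketch, not a Selenium feature; the name retry and the default attempts and delay values are assumptions:

```python
import time


def retry(action, exceptions, attempts=3, delay=0.5):
    """Call action(); on one of `exceptions`, wait and try again.

    The last failure is re-raised once all attempts are used up.
    """
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except exceptions:
            if attempt == attempts:
                raise
            time.sleep(delay)


def demo(button):
    # Needs Selenium; `button` is a WebElement found earlier.
    from selenium.common.exceptions import ElementNotInteractableException

    retry(button.click, (ElementNotInteractableException,))
```

For waiting on a node to appear or become clickable, an explicit wait is usually the more direct tool; a retry like this is a fallback for actions that have no matching wait condition.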

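Avoiding detection

Many sites detect Selenium to keep scrapers out. This section describes one common countermeasure.
In most cases the check looks at whether the window.navigator object in the current browser window has the webdriver property set. You can therefore use CDP (the Chrome DevTools Protocol), or the equivalent protocol for another browser, to run a script before each page loads that makes navigator.webdriver return undefined.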

from selenium import webdriver
from selenium.webdriver import ChromeOptions

option = ChromeOptions()
option.add_experimental_option('excludeSwitches', ['enable-automation'])
option.add_experimental_option('useAutomationExtension', False)
browser = webdriver.Chrome(options=option)
# register a script that runs before every new document loads
browser.execute_cdp_cmd('Page.addScriptToEvaluateOnNewDocument', {
  'source': 'Object.defineProperty(navigator,"webdriver",{get:() => undefined})'
})
browser.get(url)  # url is the address of the target page

Dovahkiin

https://the-tarnished.github.io/2022/05/07/Selenium%E7%9A%84%E4%BD%BF%E7%94%A8/

Unless otherwise stated, all articles in this blog are licensed under CC BY 4.0. If you reproduce this article, please credit the source: Dovahkiin.

python selenium
