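In my opinion, web scraping ranks among the most practical web techniques. Whether it is fetching stock prices, collecting job openings from 104 Job Bank, logging in to an admin back end automatically, or looking up rental listings on 591, a scraper gets the job done with far less effort.
Ruby has no shortage of gems for scraping.
The one used today is Selenium, with Chrome as the underlying browser.
First install chromedriver. Installation guides are easy to find online, so I won't go into detail here.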
Gemfile
gem 'selenium-webdriver'
Then remember to run
bundle install
require 'selenium-webdriver'
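# Target URL for the scraper
login_url = 'https://www.example.com/admin'
# Configure the driver through options
options = Selenium::WebDriver::Chrome::Options.new
# Run without a GUI; leave this out early in development so you can watch the browser
options.add_argument('--headless')
# Set the browser window size
options.add_argument('--window-size=1440,900')
# Docker's default shared memory at /dev/shm is only 64MB, which can crash Chrome, so write to /tmp instead
options.add_argument('--disable-dev-shm-usage')
# Disable Chrome's sandbox; usually needed when Chrome runs as root, e.g. inside a Docker container
options.add_argument('--no-sandbox')
# Google's documentation mentions adding this flag to work around a bug
options.add_argument('--disable-gpu')
# Use Chrome as the underlying browser
@driver = Selenium::WebDriver.for :chrome, options: options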
# Configure the bridge so files can be downloaded in headless mode
# (this reaches into the driver's private bridge object, so it may break between selenium-webdriver versions)
bridge = @driver.send(:bridge)
path = "/session/#{bridge.session_id}/chromium/send_command"
bridge.http.call(:post,
                 path,
                 cmd: 'Page.setDownloadBehavior',
                 params: {
                   behavior: 'allow',
                   # Download into the tmp folder
                   downloadPath: Dir.pwd + '/tmp/'
                 })
# Navigate to the target URL
@driver.navigate.to login_url
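Allowing downloads is only half the job: the script also has to know when a file has finished arriving. Below is a minimal sketch of one way to do that using only the standard library. `wait_for_download` is a hypothetical helper of mine, not part of Selenium. It assumes the download folder contains no older files matching the pattern, and it relies on Chrome keeping unfinished downloads under a `.crdownload` name.

```ruby
# Hypothetical helper (not a Selenium API): poll `dir` until a file matching
# `pattern` exists and no partial Chrome download (*.crdownload) is left,
# then return the path of the newest match.
def wait_for_download(dir, pattern: '*', timeout: 30, interval: 0.2)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    partial = Dir.glob(File.join(dir, '*.crdownload'))
    finished = Dir.glob(File.join(dir, pattern)).reject do |file|
      file.end_with?('.crdownload') || File.directory?(file)
    end
    return finished.max_by { |file| File.mtime(file) } if partial.empty? && !finished.empty?

    if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
      raise "no completed download in #{dir} after #{timeout} seconds"
    end
    sleep interval
  end
end
```

With the setup above, this would be called as something like `wait_for_download(Dir.pwd + '/tmp/', pattern: '*.csv')` right after clicking the download link. The `*.csv` pattern is only an example.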
find_element -- finds the first matching element
find_elements -- finds all matching elements
element = driver.find_element(:id, "q")
element = driver.find_element(:class, 'highlight')
# or
element = driver.find_element(:class_name, 'highlight')
# <div class="highlight" style="display: none; ">...</div>
element = driver.find_element(:tag_name, 'div')
P.S. Elements with display: none are found as well.
# <input id="q" name='search' type='text'>…</input>
element = driver.find_element(:name, 'search')
# <a href="http://www.google.com/search?q=cheese">cheese</a>
element = driver.find_element(:link, 'cheese')
# or
element = driver.find_element(:link_text, 'cheese')
Useful when the target site's HTML structure may change but the link text stays the same.
# <a href="http://www.google.com/search?q=cheese">search for cheese</a>
element = driver.find_element(:partial_link_text, 'cheese')
# <ul class="dropdown-menu">
# <li><a href="/login/form">Login</a></li>
# <li><a href="/logout">Logout</a></li>
# </ul>
element = driver.find_element(:xpath, "//a[@href='/logout']")
# <div id="food">
# <span class="dairy">milk</span>
# <span class="dairy aged">cheese</span>
# </div>
element = driver.find_element(:css, '#food span.dairy')
P.S. This works the same way as document.querySelector.
# click a button
driver.find_element(:id, 'BUTTON_ID').click
# input some text
driver.find_element(:id, 'TextArea').send_keys 'InputText'
driver.find_element(:id,'Element').displayed?
driver.find_element(:id,'Element').text
driver.find_element(:id, 'Element').attribute('class')
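# check if it is selected
driver.find_element(:id, 'CheckBox').selected?
# select the element
driver.find_element(:id, 'CheckBox').click
# deselect the element: click again if it is currently selected
# (clear is meant for text inputs and may raise an invalid element state error on a checkbox)
checkbox = driver.find_element(:id, 'CheckBox')
checkbox.click if checkbox.selected?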
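Because a click toggles a checkbox rather than setting it, it helps to check the current state before clicking. Here is a small sketch of that idea. `set_checkbox` is a hypothetical helper, and `FakeCheckbox` is only a stand-in so the example runs without a browser; a real element returned by `find_element` responds to the same `selected?` and `click` methods.

```ruby
# Hypothetical helper: bring a checkbox to the desired state, clicking only
# when the current state differs. Returns the resulting state.
def set_checkbox(element, checked)
  element.click if element.selected? != checked
  element.selected?
end

# Stand-in for a real element, used only for demonstration.
class FakeCheckbox
  def initialize(selected = false)
    @selected = selected
  end

  def selected?
    @selected
  end

  # Like a real checkbox, a click flips the state.
  def click
    @selected = !@selected
  end
end

box = FakeCheckbox.new(true)
set_checkbox(box, false) # clicks once to uncheck
set_checkbox(box, false) # already unchecked, so nothing is clicked
```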
# pick the <option> whose value matches the year (start_day is defined elsewhere in the script)
@driver.find_element(:xpath, "//option[@value='#{start_day.year}']").click
# get the select element, then get all the options for this element
all_options = driver.find_element(:tag_name, "select").find_elements(:tag_name, "option")
# select the options
all_options.each do |option|
  puts "Value is: " + option.attribute("value")
  option.click
end
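The `<option>` lookup by XPath earlier in this section interpolates a Ruby value straight into the expression. That is fine for a year, but a value containing quotes would break the XPath. Here is a sketch of a quoting helper. `xpath_literal` is my own hypothetical name, and `year` is an assumed value standing in for `start_day.year`. XPath 1.0 has no escape character, so a value containing both kinds of quote has to be assembled with `concat()`.

```ruby
# Hypothetical helper: turn a Ruby value into an XPath 1.0 string literal.
def xpath_literal(value)
  str = value.to_s
  return "'#{str}'" unless str.include?("'")
  return "\"#{str}\"" unless str.include?('"')

  # Both quote kinds are present: split on single quotes and glue the pieces
  # back together with a double-quoted single quote between them.
  parts = str.split("'", -1).map { |part| "'#{part}'" }
  "concat(#{parts.join(%q(, "'", ))})"
end

year = 2024 # assumed value for illustration
xpath = "//option[@value=#{xpath_literal(year)}]"
# xpath is now //option[@value='2024']
```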
driver.execute_script("return window.location.pathname")
# Scroll to a given position (here, the bottom of the page)
driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
# explicit wait: set the timeout to 10 seconds
wait = Selenium::WebDriver::Wait.new(timeout: 10)
# wait up to 10 seconds for the element to appear
wait.until { driver.find_element(id: "foo") }
# implicit wait: every element lookup waits up to 10 seconds
driver.manage.timeouts.implicit_wait = 10
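To make it clearer what an explicit wait does, here is a simplified sketch of the polling idea in plain Ruby. This is not Selenium's actual implementation. `wait_until` and its parameters are names I made up, and the block below simulates an element that only shows up on the third try.

```ruby
# Simplified stand-in for an explicit wait: call the block repeatedly until it
# returns a truthy value, swallowing the errors listed in `ignore`, and raise
# once `timeout` seconds have passed.
def wait_until(timeout: 10, interval: 0.2, ignore: [StandardError])
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  last_error = nil
  loop do
    begin
      result = yield
      return result if result
    rescue *ignore => e
      last_error = e
    end
    if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
      detail = last_error ? " (last error: #{last_error.message})" : ''
      raise "timed out after #{timeout} seconds#{detail}"
    end
    sleep interval
  end
end

attempts = 0
value = wait_until(timeout: 2, interval: 0.01) do
  attempts += 1
  raise 'not there yet' if attempts < 3 # simulates an element that is missing at first
  "found on attempt #{attempts}"
end
```

An implicit wait applies the same kind of retrying to every element lookup, whereas an explicit wait only retries the one condition you pass in.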
# Get the handles of all open windows
windows = driver.window_handles
# Switch to the most recently opened window
driver.switch_to.window(windows[-1])
# or, in one line
driver.switch_to.window(driver.window_handles.last)
# close the current window
driver.close
# end the whole session and close every window
driver.quit
# switch to a frame
driver.switch_to.frame "some-frame" # name or id
driver.switch_to.frame driver.find_element(:id, 'some-frame') # frame element
# switch back to the main document
# if you don't switch back, you stay stuck inside the iframe
driver.switch_to.default_content
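Since forgetting to switch back is such an easy mistake, one option is to wrap the frame work in a block that always returns to the main document. The following is only a sketch. `within_frame` is a hypothetical helper, and `FakeDriver` is a stand-in that records calls so the example runs without a browser; a real driver would be passed in the same way.

```ruby
# Hypothetical helper: switch into a frame, run the block, and always switch
# back to the main document, even if the block raises.
def within_frame(driver, frame)
  driver.switch_to.frame(frame)
  yield
ensure
  driver.switch_to.default_content
end

# Stand-in driver that only records which calls were made.
class FakeDriver
  attr_reader :calls

  def initialize
    @calls = []
  end

  # The real driver returns a locator object here; the fake returns itself.
  def switch_to
    self
  end

  def frame(id)
    @calls << [:frame, id]
  end

  def default_content
    @calls << :default_content
  end
end

driver = FakeDriver.new
text = within_frame(driver, 'some-frame') { 'text read inside the frame' }
```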
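# Switch to the JavaScript alert/confirm dialog
alert = @driver.switch_to.alert
# The string below is the site's own message ("click OK to agree and finish logging in; click Cancel to stop"), so it must stay in Chinese
if alert.text.include? '同意请点选「确定」完成登入;不同意请点选「取消」停止登入'
  alert.accept
end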
# Path where the screenshot is saved
screenshot_path = 'tmp/reconcile_task_files/screenshot.png'
driver.save_screenshot(screenshot_path)
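A fixed path gets overwritten on every run, and saving will fail if the folder does not exist yet. Here is a minimal sketch of a helper that handles both. `timestamped_screenshot_path` and the `login` label are hypothetical names chosen for illustration.

```ruby
require 'fileutils'

# Hypothetical helper: build a timestamped screenshot path under `dir`,
# creating the directory first so saving does not fail on a missing folder.
def timestamped_screenshot_path(dir, label, now: Time.now)
  FileUtils.mkdir_p(dir)
  File.join(dir, "#{label}-#{now.strftime('%Y%m%d-%H%M%S')}.png")
end

# A fixed time is passed here only to make the result predictable.
path = timestamped_screenshot_path('tmp/reconcile_task_files', 'login',
                                   now: Time.new(2024, 1, 2, 3, 4, 5))
# With a real driver: driver.save_screenshot(path)
```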