2020-06-04

Ubuntu 18.04服务器上安装 Chrome Headless+Selenium

安装 Chrome Headless

Chrome 在最近推出了headless模式。原生的Chrome，更好的通用性，更快的速度……这些优点都足以表名目前来说 PhantomJS 已经要被取代了，果不其然，在最新版中的 Selenium 中已经不支持 PhantomJS了。

因此，为了学习 web2.0 的爬虫，必须得将 Chrome Headless 安装到服务器版的linux中运行。

在服务器的 Ubuntu 版本中必须得通过命令行安装：

sudo apt-get install libxss1 libappindicator1 libindicator7
wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
sudo dpkg -i google-chrome*.deb  # Might show "errors", fixed by next line
sudo apt-get install -f

测试安装

启动 Chrome

1	google-chrome --headless --remote-debugging-port=9222 https://chromium.org --disable-gpu

这里是使用headless模式进行远程调试，ubuntu 上大多没有 gpu，所以–disable-gpu以免报错。

之后另开一个连接端口来访问本地的9222端口：

1	curl http://localhost:9222

看到如下信息就表明安装成功了

<html>
<head>
<title>Headless remote debugging</title>
<style>
</style>

<script>
const fetchjson = (url) => fetch(url).then(r => r.json());

function loadData() {
  const getList = fetchjson("/json/list");
  const getVersion = fetchjson('/json/version');
  Promise.all([getList, getVersion]).then(parseResults);
}

function parseResults([listData, versionData]){
    const version = versionData['WebKit-Version'];
    const hash = version.match(/\s\(@(\b[0-9a-f]{5,40}\b)/)[1];
    listData.forEach(item => appendItem(item, hash));
}

function appendItem(item, hash) {
  let link;
  if (item.devtoolsFrontendUrl) {
    link = document.createElement("a");
    var devtoolsFrontendUrl = item.devtoolsFrontendUrl.replace(/^\/devtools\//,'');
    link.href = `https://chrome-devtools-frontend.appspot.com/serve_file/@${hash}/${devtoolsFrontendUrl}&remoteFrontend=true`;
    link.title = item.title;
  } else {
    link = document.createElement("div");
    link.title = "The tab already has active debugging session";
  }

  var text = document.createElement("div");
  if (item.title)
    text.textContent = item.title;
  else
    text.textContent = "(untitled tab)";
  if (item.faviconUrl)
    text.style.cssText = "background-image:url(" + item.faviconUrl + ")";
  link.appendChild(text);

  var p = document.createElement("p");
  p.appendChild(link);

  document.getElementById("items").appendChild(p);
}
</script>
</head>
<body onload='loadData()'>
  <div id='caption'>Inspectable WebContents</div>
  <div id='items'></div>
</body>
</html>

下载 chromedriver

chromedriver 提供了操作 Chrome 的api，是 Selenium 控制Chrome 的桥梁。查看最新的Chrome版本

下载并解压：

1 2	wget https://chromedriver.storage.googleapis.com/2.41/chromedriver_linux64.zip unzip chromedriver_linux64.zip

安装完之后将解压出来的文件配置到环境变量中去

1	sudo vi ~/.profile

将下载到的 chromedriver 的路径添加进去

1	export PATH="$PATH:/home/wx/application/chromedriver"

更新环境变量

1	source ~/.profile

使用 Selenium 和 Chrome Headless 来访问网页

from  selenium import webdriver
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')
client = webdriver.Chrome(chrome_options=chrome_options,executable_path='/home/wx/application/chromedriver')# 如果没有把chromedriver加入到PATH中，就需要指明路径
client.get("https://www.baidu.com")
content = client.page_source.encode('utf-8')
print (content)
client.quit()