Scrapy 的坑【1】

Caveats with inspecting the live browser DOM

Since Firefox add-ons operate on a live browser DOM, what you’ll actually see when inspecting the page source is not the original HTML, but a modified one after applying some browser clean up and executing Javascript code. Firefox, in particular, is known for adding <tbody> elements to tables. Scrapy, on the other hand, does not modify the original page HTML, so you won’t be able to extract any data if you use <tbody> in your XPath expressions.

Therefore, you should keep in mind the following things when working with Firefox and XPath:

  • Disable Firefox Javascript while inspecting the DOM looking for XPaths to be used in Scrapy
  • Never use full XPath paths, use relative and clever ones based on attributes (such as id, class, width, etc) or any identifying features like contains(@href, 'image').
  • Never include <tbody> elements in your XPath expressions unless you really know what you’re doing

在浏览器中检查DOM的注意事项

Firefox插件操作的是活动的浏览器DOM(live browser DOM),这意味着当您检查网页源码的时候, 其已经不是原始的HTML,而是经过浏览器清理并执行一些Javascript代码后的结果。 Firefox是个典型的例子,其会在table中添加 <tbody> 元素。 而Scrapy相反,其并不修改原始的HTML,因此如果在XPath表达式中使用 <tbody> ,您将获取不到任何数据。

所以,当XPath配合Firefox使用时您需要记住以下几点:

  • 当检查DOM来查找Scrapy使用的XPath时,禁用Firefox的Javascrpit。
  • 永远不要用完整的XPath路径。使用相对及基于属性(例如 id , class , width 等)的路径或者具有区别性的特性例如contains(@href, 'image') 。
  • 永远不要在XPath表达式中加入 <tbody> 元素,除非您知道您在做什么

Actually,

  1. Chrome 也会这样
  2. 使用 .css() CSS 选择器也会有同样的问题

from: https://doc.scrapy.org/en/1.3/topics/firefox.html 中文文档

添加新评论