from scrapy.cmdline import execute execute(['scrapy', 'crawl', 'mzitu'])
其中的 mzitu 就为待会儿 spider.py 文件中的 name 属性。这点请务必记住哦!不然是跑不起来的。 在 mzitu_scrapy\spider 目录中创建 spider.py。文件作为爬虫文件。 好了!现在我们来想想,怎么来抓 mzitu.com 了。 首先我们的目标是当然是全站的妹子图片!!! 但是问题来了,站长把之前那个 mzitu.com\all 这个 URL 地址给取消了,我们没办法弄到全部的套图地址了! 我们可以去仔细观察一下站点所有套图的地址都是:http://www.mzitu.com/ 几位数字结尾的。 这种格式地址。 有木有小伙伴儿想到了啥? CrawlSpider !!!就是这玩儿!! 有了它我们就能追踪 “http://www.mzitu.com/ 几位数字结尾的” 这种格式的 URL 了。 Go Go Go Go!开始搞事。 首先在 item.py 中新建我们需要的字段。我们需要啥?我们需要套图的名字和图片地址!! 那我们新建三个字段:
1 2 3 4 5 6 7 8 9 10
import scrapy class MzituScrapyItem(scrapy.Item): # define the fields for your item here like: # name = scrapy.Field() name = scrapy.Field() image_urls = scrapy.Field() url = scrapy.Field() pass
第一步完成啦!开始写 spider.py 啦! 首先导入我们需要的包:
1 2 3 4
from scrapy import Request from scrapy.spider import CrawlSpider, Rule from scrapy.linkextractors import LinkExtractor from mzitu_scrapy.items import MzituScrapyItem
from scrapy import Request from scrapy.spider import CrawlSpider, Rule from scrapy.linkextractors import LinkExtractor from mzitu_scrapy.items import MzituScrapyItem
from scrapy import Request from scrapy.spider import CrawlSpider, Rule from scrapy.linkextractors import LinkExtractor from mzitu_scrapy.items import MzituScrapyItem
from scrapy import Request from scrapy.spider import CrawlSpider, Rule from scrapy.linkextractors import LinkExtractor from mzitu_scrapy.items import MzituScrapyItem
# Define your item pipelines here # # Don't forget to add your pipeline to the ITEM_PIPELINES setting # See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html from scrapy import Request from scrapy.pipelines.images import ImagesPipeline from scrapy.exceptions import DropItem import re
defitem_completed(self, results, item, info): image_paths = [x['path'] for ok, x in results if ok] ifnot image_paths: raise DropItem("Item contains no images") return item
Warning: get_headers(): SSL operation failed with code 1. OpenSSL Error messages:
error:14090086:SSL routines:ssl3_get_server_certificate:certificate verify failed in /mydata/web/wwwshanhubei/web/wp-content/themes/shanhuke/single.php on line 57
Warning: get_headers(): Failed to enable crypto in /mydata/web/wwwshanhubei/web/wp-content/themes/shanhuke/single.php on line 57
Warning: get_headers(https://static.shanhubei.com/qrcode/qrcode_viewid_11223.jpg): failed to open stream: operation failed in /mydata/web/wwwshanhubei/web/wp-content/themes/shanhuke/single.php on line 57