Python Learning (5) – Fixing Garbled Chinese When Writing to CSV
This encoding problem has been tormenting me since the day before yesterday. I tried all kinds of approaches, every one of them found through Google, and along the way I realised that stackoverflow is basically a teaching library: search almost any error message and you will find a matching solution. My English isn't great, so it's sometimes harder to follow than Chinese would be, but at least there is always a path to follow.
Encoding problems are a real headache for an outsider like me. I still don't quite understand why there have to be so many traps around character encodings; an amateur writing code steps into every single one of them. Exhausting. Fortunately it was solved cleanly in the end, and the final fix could not be simpler.
Code
DoubanSpider.py
# coding=utf-8
from scrapy.spiders import CrawlSpider
from scrapy.selector import Selector
from scrapy.http import Request
from douban.items import DoubanItem
import sys

# sys.setdefaultencoding('utf-8')  # Python 2 only; removed in Python 3, where str is Unicode by default


class DoubanSpider(CrawlSpider):
    name = "douban"
    start_urls = ['https://book.douban.com/top250']

    def parse(self, response):
        selector = Selector(response)
        infos = selector.xpath('//tr[@class="item"]')
        for info in infos:
            item = DoubanItem()
            bookname = info.xpath('td/div/a/@title').extract()[0]    # book title
            url = info.xpath('td/div/a/@href').extract()[0]          # book detail page url
            author_info = info.xpath('td/p/text()').extract()[0]     # "author / publisher / date / price"
            author_infos = str(author_info).split('/')
            price = str(author_infos[len(author_infos) - 1])
            rating = info.xpath('td/div/span[2]/text()').extract()[0]
            comment_nums = info.xpath('td/div/span[3]/text()').extract()[0]
            quote = info.xpath('td/p/span/text()').extract()
            if len(quote) > 0:
                quote = quote[0]
            else:
                quote = ''
            item['bookname'] = bookname
            item['author'] = author_infos[0]
            item['rating_nums'] = rating
            item['quote'] = quote
            # filter() returns an iterator in Python 3, so join it back into a string of digits
            item['comment_nums'] = ''.join(filter(str.isdigit, str(comment_nums)))
            item['pubday'] = author_infos[len(author_infos) - 2]
            item['price'] = price
            item['url'] = url
            yield item
        for i in range(25, 250, 25):
            url = 'https://book.douban.com/top250?start=%s' % i
            yield Request(url, callback=self.parse)
settings.py
# -*- coding: utf-8 -*-
# Scrapy settings for douban project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# http://doc.scrapy.org/en/latest/topics/settings.html
# http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
# http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'douban'
SPIDER_MODULES = ['douban.spiders']
NEWSPIDER_MODULE = 'douban.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'douban (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'douban.middlewares.MyCustomSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'douban.middlewares.MyCustomDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
# 'douban.pipelines.SomePipeline': 300,
#}
# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.54 Safari/536.5'
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 1  # 1 second of delay between requests
FEED_EXPORT_ENCODING = 'utf-8-sig'
FEED_URI = '/Users/sig_2/Desktop/douban-top250.csv'
FEED_FORMAT = 'csv'
Note: only the code involved in this issue is shown here; the rest of the project can be found in the original repository on GitHub.
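The spider imports DoubanItem from douban.items, which isn't reproduced here; below is a minimal sketch of what that file would contain, with the field names taken from the spider above (the actual definition lives in the GitHub project):

# items.py (sketch; field names inferred from DoubanSpider)
import scrapy

class DoubanItem(scrapy.Item):
    bookname = scrapy.Field()
    author = scrapy.Field()
    rating_nums = scrapy.Field()
    quote = scrapy.Field()
    comment_nums = scrapy.Field()
    pubday = scrapy.Field()
    price = scrapy.Field()
    url = scrapy.Field()

With both files in place, the crawl is started from the project root with the standard scrapy crawl douban command, and the CSV ends up at the path given in FEED_URI.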
Process
1. When I first ran the Douban Books Top 250 project in PyDev, the call to sys.setdefaultencoding raised an error. Googling the error message showed that the code had been written for Python 2.x and that this function is no longer supported in Python 3.x; the technical explanation is that Python 3 strings are Unicode by default, so sys.setdefaultencoding no longer exists. After commenting out that line the error did go away, and after also fixing the output path the data downloaded without a problem.
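A quick way to confirm this in a Python 3 interpreter (a minimal check, nothing project-specific):

import sys

# Python 3 strings are Unicode and the interpreter already defaults to UTF-8,
# so the old Python 2 hack no longer exists on the sys module.
print(sys.getdefaultencoding())            # 'utf-8'
print(hasattr(sys, 'setdefaultencoding'))  # False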
2. Just when I thought the job was done, I opened the collected CSV file and found all the Chinese garbled. Garbled text means something went wrong with the encoding, so my first thought was to Google a Python 3.x replacement for the encoding call I had commented out. One suggestion was to add a line to settings.py:
FEED_EXPORT_ENCODING = 'utf-8'
That still didn't fix it, so I kept Googling, and this is where things went sideways: a search engine can easily lead you astray, a bit like an iteration loop in a program. I started out searching for Python 3.x encoding statements, then noticed results about garbled data scraped from Douban and began to suspect Douban had anti-scraping measures in place, although checking the page encoding showed nothing unusual. Later I came across a thread about garbled output in the Wing IDE Python environment and spent half a day installing and trying to crack that software, only to discover that it does the same job as the PyDev I already use, just with more features, and none of the cracks floating around online work on the latest version anyway.
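For reference, one quick way to rule out a server-side or download-stage encoding problem is to inspect the response in the Scrapy shell (these are standard Scrapy Response attributes; the URL is the one used by the spider above):

# started with: scrapy shell 'https://book.douban.com/top250'
response.headers.get('Content-Type')   # encoding declared by the server, if any
response.encoding                      # encoding Scrapy actually used to decode the body
response.text[:200]                    # decoded text; the Chinese should already be readable here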
3. At that point I began to suspect the IDE itself, so I followed an online tutorial and set Eclipse's output encoding to UTF-8, thinking the problem was finally solved; still garbled. If the tooling was fine, the problem had to be in the code. Opening the CSV file in Notepad++ showed no garbling at all, so I could be fairly sure the problem was in the CSV writing step. Once the problem was pinned down the rest was easy: a UTF-8 CSV needs to be written with a BOM for programs like Excel to recognise the encoding, otherwise the Chinese comes out garbled. Change the earlier setting to:
FEED_EXPORT_ENCODING = 'utf-8-sig'
Save, run the code again, and the garbling is gone for good. The result is shown in the screenshot below:
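The only difference between the two encodings is the three BOM bytes at the start of the output; a minimal illustration:

# 'utf-8-sig' prepends the UTF-8 byte order mark (BOM, the bytes EF BB BF),
# which Excel uses to recognise the file as UTF-8 when opening a CSV.
'书名'.encode('utf-8')      # b'\xe4\xb9\xa6\xe5\x90\x8d'
'书名'.encode('utf-8-sig')  # b'\xef\xbb\xbf\xe4\xb9\xa6\xe5\x90\x8d'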
Note
1. Encoding issues are a real pain, but they matter. The basic principle is to stick with UTF-8 everywhere; if errors still appear, work out which step of the pipeline they come from and search for that specific step, which is far more efficient.
2. Keep a clear line of reasoning when debugging. Make good use of Google, but don't let the flood of proposed solutions drag you off track; tools are only an aid.
3. Eclipse's PyDev is so convenient that running this douban250 project from the old cmd prompt now throws errors, probably because the file structure isn't recognised there. That's a bit annoying, but I'll leave it for now.
Source
Use utf-8-sig when writing CSV in Python
Scraping Douban Books Top 250
Setting the encoding format in Eclipse
The difference between utf-8 and utf-8-sig in Python