VookLess

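Python Learning (5) – Fixing Garbled Chinese Text When Writing CSV

This garbled-text problem has been tormenting me since the day before yesterday. I tried all sorts of approaches along the way, all of them found through Google. While searching I also realized that Stack Overflow is practically a teaching library: search for almost any error message and you will find a matching solution. My English is not great, so reading it is harder for me than reading Chinese, but at least there is always a method to follow.
For an outsider like me, encoding problems are a real headache. Sometimes I cannot understand why encoding has so many pitfalls; an amateur writing code falls into nearly every one of them, which is exhausting. Fortunately I solved it in the end, and the final fix could hardly be simpler.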

Code

DoubanSpider.py

# coding=utf-8
from scrapy.http import Request
from scrapy.spiders import Spider

from douban.items import DoubanItem

# Python 3 strings are already Unicode, so the Python 2 idiom
# reload(sys) + sys.setdefaultencoding('utf-8') is not needed.


class DoubanSpider(Spider):
    # A plain Spider is used because no crawl rules are defined, and the
    # Scrapy docs advise against overriding parse() on CrawlSpider.
    name = "douban"
    start_urls = ['https://book.douban.com/top250']

    def parse(self, response):
        for info in response.xpath('//tr[@class="item"]'):
            bookname = info.xpath('td/div/a/@title').extract()[0]  # book title
            url = info.xpath('td/div/a/@href').extract()[0]  # book URL

            # Slash-separated text: author first, pubday and price last
            author_info = info.xpath('td/p/text()').extract()[0]
            author_infos = [part.strip() for part in author_info.split('/')]

            rating = info.xpath('td/div/span[2]/text()').extract()[0]
            comment_nums = info.xpath('td/div/span[3]/text()').extract()[0]

            # Some books have no quote, so fall back to an empty string
            quote = info.xpath('td/p/span/text()').extract_first(default='')

            # Build a new item for each row instead of reusing one object
            item = DoubanItem()
            item['bookname'] = bookname
            item['author'] = author_infos[0]
            item['rating_nums'] = rating
            item['quote'] = quote
            # filter() is lazy in Python 3, so join the digits into a str
            item['comment_nums'] = ''.join(filter(str.isdigit, comment_nums))
            item['pubday'] = author_infos[-2]
            item['price'] = author_infos[-1]
            item['url'] = url
            yield item

        # Queue the remaining pages; Scrapy's duplicate filter drops
        # requests for URLs that have already been scheduled.
        for i in range(25, 250, 25):
            url = 'https://book.douban.com/top250?start=%s' % i
            yield Request(url, callback=self.parse)
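
The field handling in parse can be checked without running a crawl. The sketch below applies the same slash splitting and digit filtering to made-up input; the helper names split_book_info and digits_only and both sample strings are assumptions for illustration, not real Douban output.

```python
def split_book_info(raw):
    """Split 'author / ... / pubday / price' into stripped parts."""
    return [part.strip() for part in raw.split('/')]


def digits_only(text):
    """Keep only the digit characters of text, as a str."""
    # In Python 3, filter() returns a lazy iterator, so join it back
    return ''.join(filter(str.isdigit, text))


# Hypothetical sample values, shaped like the text the spider reads
sample_info = 'Some Author / Some Press / 2006-5 / 29.00'
sample_comments = '( 12345 ratings )'

parts = split_book_info(sample_info)
author, pubday, price = parts[0], parts[-2], parts[-1]
comment_nums = digits_only(sample_comments)
print(author, pubday, price, comment_nums)
```

Indexing from the end (parts[-2], parts[-1]) keeps the publication date and price correct even when a translator or extra publisher field makes the list longer.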

settings.py

# -*- coding: utf-8 -*-
# Scrapy settings for douban project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'douban'

SPIDER_MODULES = ['douban.spiders']
NEWSPIDER_MODULE = 'douban.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'douban (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'douban.middlewares.MyCustomSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'douban.middlewares.MyCustomDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'douban.pipelines.SomePipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

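USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.54 Safari/536.5'
# Overrides the ROBOTSTXT_OBEY = True set earlier in this file
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 1  # wait 1 second between requests

# 'utf-8-sig' writes a UTF-8 BOM at the start of the exported file
FEED_EXPORT_ENCODING = 'utf-8-sig'
FEED_URI = u'/Users/sig_2/Desktop/douban-top250.csv'
FEED_FORMAT = 'csv'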
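
On newer Scrapy releases (2.1 and later), FEED_URI and FEED_FORMAT are deprecated in favor of a single FEEDS dictionary. A minimal sketch of what the equivalent configuration might look like, assuming the same output path as above:

```python
# settings.py (Scrapy 2.1+): one FEEDS entry takes the place of
# FEED_URI, FEED_FORMAT and FEED_EXPORT_ENCODING for this output file
FEEDS = {
    '/Users/sig_2/Desktop/douban-top250.csv': {
        'format': 'csv',
        'encoding': 'utf-8-sig',  # UTF-8 with a BOM
    },
}
```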

Note: only the code relevant to this post is shown; the rest can be found in the original project on GitHub.

Process

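1. When I ran the Douban Top 250 book list project in PyDev, the first error came from the call to sys.setdefaultencoding. Googling the error message showed that the code was written for Python 2.x and that this function is no longer available in Python 3.x. The technical explanation is that Python 3 strings are Unicode and the default encoding is already UTF-8, so sys.setdefaultencoding no longer exists. After I commented that line out the error did go away, and once I changed the output path the download ran successfully.
2. Just when I thought I was done, I opened the scraped CSV file and found that all the Chinese text was garbled. Garbled text means something went wrong at the encoding stage. My first idea was to Google for a Python 3.x replacement for the encoding call I had commented out, and I found one suggestion, which was to add this line to settings.py: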

FEED_EXPORT_ENCODING = 'utf-8'

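It still did not work, so I kept Googling, and this is where the trouble started. With a search engine it is easy to get led astray, much like iteration in a program. I started out searching for Python 3.x encoding statements, then noticed results reporting garbled data scraped from Douban and began to suspect that Douban had anti-scraping measures, but the page encoding looked fine. Then I found a thread about garbled text in Wing IDE, a Python IDE, and spent ages trying to install and crack it, only to find that it does the same job as the PyDev I already use, just with more features. I never managed to crack it either; none of the methods online work on the latest version.
3. At this point I started to wonder whether the IDE was the problem. Following an online tutorial I set Eclipse's output encoding to UTF-8 and thought that had finally fixed it, but the text was still garbled. Since the tools were fine, it had to be the code. Opening the CSV file in Notepad++ showed no garbled text at all, which all but confirmed that the problem lay in how the CSV file was written. Once the problem was pinned down, the rest was easy: for the Chinese to display correctly, the UTF-8 written to the CSV file needs a BOM (spreadsheet programs such as Excel rely on it to detect UTF-8). So I changed the encoding setting above to: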

FEED_EXPORT_ENCODING = 'utf-8-sig'

I saved and ran the code again, and the garbled text was gone. The result is shown in the figure below:
[Image: douban_top250 (2).png]
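
The difference between the two encodings can be seen without Scrapy. The following is a minimal sketch using only the standard library; the sample rows, the temporary file names and the helper functions write_csv and first_bytes are assumptions for illustration. It writes the same rows twice and prints the first three bytes of each file.

```python
import csv
import os
import tempfile

# Hypothetical sample rows containing Chinese text
rows = [['bookname', 'quote'], ['示例书名', '示例引文']]


def write_csv(path, encoding):
    # newline='' is what the csv module documentation recommends
    with open(path, 'w', encoding=encoding, newline='') as f:
        csv.writer(f).writerows(rows)


def first_bytes(path, n=3):
    with open(path, 'rb') as f:
        return f.read(n)


tmpdir = tempfile.mkdtemp()
plain = os.path.join(tmpdir, 'plain.csv')
with_bom = os.path.join(tmpdir, 'with_bom.csv')
write_csv(plain, 'utf-8')
write_csv(with_bom, 'utf-8-sig')

print(first_bytes(plain))     # starts directly with the header text
print(first_bytes(with_bom))  # b'\xef\xbb\xbf', the UTF-8 BOM
```

Both files hold identical UTF-8 text; the second one is simply three bytes longer. A text editor such as Notepad++ detects UTF-8 either way, which matches what happened above: the file looked fine in the editor but garbled in the program that needed the BOM.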

Note

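1. Encoding is a very troublesome issue but also a very important one. The basic rule is to stick with UTF-8 throughout; if errors persist, work out which stage the error occurs in and look for answers for that specific stage, which is much more efficient.
2. Keep your problem-solving approach clear and make good use of Google, but do not let the sheer number of suggested fixes lead you astray. Remember that tools are only an aid.
3. PyDev for Eclipse is really excellent, to the point that the douban250 project now fails when I run it from cmd as I did before, perhaps because the project structure is not recognized there. That is a bit of a hassle, but I will not worry about it for now.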
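
To narrow down which stage is at fault, as the first note suggests, it can help to inspect the file itself. This is a rough sketch, not part of the original project: the helpers has_utf8_bom and add_utf8_bom and the temporary demo file are assumptions for illustration. The first checks whether a file starts with a BOM, and the second rewrites an existing UTF-8 file so that it has one.

```python
import codecs
import os
import tempfile


def has_utf8_bom(path):
    """Return True if the file starts with the UTF-8 BOM."""
    with open(path, 'rb') as f:
        return f.read(3) == codecs.BOM_UTF8


def add_utf8_bom(path):
    """Rewrite a UTF-8 file so that it starts with exactly one BOM."""
    # Decoding with 'utf-8-sig' strips a BOM if one is present and
    # raises UnicodeDecodeError if the file is not valid UTF-8
    with open(path, 'r', encoding='utf-8-sig', newline='') as f:
        text = f.read()
    with open(path, 'w', encoding='utf-8-sig', newline='') as f:
        f.write(text)


# Demo on a temporary file written without a BOM
fd, demo_path = tempfile.mkstemp(suffix='.csv')
os.close(fd)
with open(demo_path, 'w', encoding='utf-8', newline='') as f:
    f.write('书名,评分\r\n')

before = has_utf8_bom(demo_path)
add_utf8_bom(demo_path)
after = has_utf8_bom(demo_path)
print(before, after)
```

If the read step raises UnicodeDecodeError, the file was not written as UTF-8 in the first place, which points at the export stage rather than at the program used to view the file.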

Source

Writing CSV in Python with utf-8-sig encoding
Scraping the Douban Books Top 250
Setting the encoding in Eclipse
The difference between utf-8 and utf-8-sig in Python
