Python Odds and Ends 2

Posted on 2014-12-04 in program

This post continues the previous article, Python Odds and Ends. It collects more Python usage tips, along with a few details that are easy to overlook.

Note: the discussion below mainly targets Python 2.7; Python 3 coverage is still to be added.

Get MD5 hash of big files

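When we need to compute the MD5 of a very large file in Python, reading the file in chunks keeps memory usage low, and choosing a suitable chunk size can also improve throughput somewhat.
chksum.py uses memory_profiler to record memory usage during execution and times each computation; it also includes test results for 1 GB of data.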
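
The chunked approach described above can be sketched as follows. This is a minimal illustration, not the contents of chksum.py; the function name `md5_of_file` and the 1 MiB default chunk size are assumptions made for the example.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of the file at `path`.

    The file is read `chunk_size` bytes at a time, so memory use stays
    roughly constant regardless of file size. `md5_of_file` and the
    default chunk size are hypothetical choices for this sketch.
    """
    md5 = hashlib.md5()
    with open(path, 'rb') as f:
        # iter(callable, sentinel) keeps calling f.read() until it
        # returns b'', which signals end of file.
        for chunk in iter(lambda: f.read(chunk_size), b''):
            md5.update(chunk)
    return md5.hexdigest()
```

Calling `md5_of_file('/path/to/big.iso')` should give the same digest as hashing the whole file at once; only the peak memory differs. The best chunk size depends on the machine and is worth measuring rather than guessing.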

Related discussions on Stack Overflow: Get MD5 hash of big files in Python, Lazy Method for Reading Big File in Python

io.BytesIO vs cStringIO.StringIO

StringIO and BytesIO differ in many ways between Python 2 and Python 3. six provides a solution that works on both. The table below summarizes how these modules differ.

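Module	Python 2	Python 3
StringIO.StringIO	In-memory string buffer; can hold str or unicode objects	Removed
cStringIO.StringIO	C implementation with an interface similar to StringIO.StringIO and faster, but with some restrictions compared with StringIO.StringIO	Removed
io.StringIO	In-memory buffer for Unicode text; accepts only unicode objects	In-memory buffer for text; accepts only str (Unicode text), not bytes
io.BytesIO	In-memory buffer for bytes	In-memory buffer for bytes
six.StringIO	StringIO.StringIO	io.StringIO
six.BytesIO	StringIO.StringIO	io.BytesIO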

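In terms of performance, cStringIO.StringIO is usually the fastest. io.BytesIO is also implemented in C, but initializing it with data, for example io.BytesIO(b'data'), copies the data once, and that copy costs performance.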

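The performance gap between StringIO and BytesIO matters a great deal in IO-sensitive scenarios; it has been discussed in the tornado and scrapy projects and on the Python mailing lists.
Python 3.5 is slated to add a copy-on-write optimization to io.BytesIO; see Python Issue22003.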

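When we need a file-like data stream and the code must work on both Python 2 and Python 3, we have to pick the module based on the data type (str, unicode, or bytes) and on how performance-sensitive the use case is.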
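
As one possible way to apply this advice, the sketch below uses only the `io` module, which exists on both Python 2.7 and Python 3. It assumes the data type is known up front (bytes or Unicode text) and that the extra speed of cStringIO is not required; the variable names are made up for the example.

```python
import io

# Binary data: io.BytesIO accepts only bytes on Python 2.7 and Python 3.
binary_buf = io.BytesIO()
binary_buf.write(b'hello ')
binary_buf.write(b'world')
binary_data = binary_buf.getvalue()

# Text data: io.StringIO accepts only Unicode text
# (unicode on Python 2, str on Python 3).
text_buf = io.StringIO()
text_buf.write(u'caf\xe9\n')
text_buf.seek(0)
first_line = text_buf.readline()

# Mixing the two raises TypeError instead of silently converting.
try:
    io.BytesIO().write(u'text')
    mixed_ok = True
except TypeError:
    mixed_ok = False
```

Being strict about types this way trades some Python 2 speed for code that behaves the same on both versions. Where cStringIO performance matters on Python 2, a conditional import or six is the alternative discussed above.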

List comprehensions leak the loop control variable

Consider this very simple list-building code:

>>> x = 'before'
>>> a = [x for x in range(5)]
>>> x
4
>>> x = 'before'
>>> a = (x for x in range(5))
>>> x
'before'

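In Python 2.x, the loop variable of a list comprehension is not confined to the brackets; it leaks into the enclosing scope. A generator expression, by contrast, runs in its own execution frame, so its variable does not leak. Python 3 fixed the leak for list comprehensions.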
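
To check which behavior a given interpreter has, a small version-agnostic sketch like the following may help. The names `leaked` and `gen_leaked` are made up for the illustration.

```python
import sys

x = 'before'
squares = [x * x for x in range(5)]
# Python 2: x was rebound to 4 by the comprehension, so leaked is True.
# Python 3: the comprehension has its own scope, so leaked is False.
leaked = (x != 'before')

x = 'before'
total = sum(x * x for x in range(5))
# A generator expression has its own frame on both versions,
# so x is never rebound here.
gen_leaked = (x != 'before')

print(sys.version_info[0], leaked, gen_leaked)
```

Code that must run on both versions is safest when it never reads a comprehension variable after the comprehension ends.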

Below is the original passage from Python History:

This was an artifact of the original implementation of list comprehensions; it was one of Python’s “dirty little secrets” for years. It started out as an intentional compromise to make list comprehensions blindingly fast, and while it was not a common pitfall for beginners, it definitely stung people occasionally. For generator expressions we could not do this. Generator expressions are implemented using generators, whose execution requires a separate execution frame. Thus, generator expressions (especially if they iterate over a short sequence) were less efficient than list comprehensions.

socket.settimeout(value)

Once a timeout is set on a socket, the socket is internally in non-blocking mode:

Timeout mode internally sets the socket in non-blocking mode. The blocking and timeout modes are shared between file descriptors and socket objects that refer to the same network endpoint. A consequence of this is that file objects returned by the makefile() method must only be used when the socket is in blocking mode; in timeout or non-blocking mode file operations that cannot be completed immediately will fail.
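
The behavior quoted above can be observed with a small experiment. The sketch below is Unix-only (it relies on `fcntl` and `socket.socketpair`) and inspects the `O_NONBLOCK` flag, which is a CPython implementation detail rather than a documented guarantee; it also assumes no default timeout has been set with `socket.setdefaulttimeout`.

```python
import fcntl
import os
import socket

# A connected pair of local sockets; no network access is needed.
a, b = socket.socketpair()
try:
    # Freshly created sockets are in blocking mode.
    flags_before = fcntl.fcntl(a.fileno(), fcntl.F_GETFL)
    nonblocking_before = bool(flags_before & os.O_NONBLOCK)

    # Setting a timeout switches the underlying descriptor to
    # non-blocking mode; Python emulates the timeout on top of it.
    a.settimeout(0.1)
    flags_after = fcntl.fcntl(a.fileno(), fcntl.F_GETFL)
    nonblocking_after = bool(flags_after & os.O_NONBLOCK)

    # Nothing was sent from b, so recv() gives up after 0.1 seconds.
    try:
        a.recv(1)
        timed_out = False
    except socket.timeout:
        timed_out = True
finally:
    a.close()
    b.close()
```

From the caller's point of view the socket still appears to block for up to the timeout; only the file descriptor underneath is non-blocking, which is why file objects from makefile() are affected.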