Levenshtein 距离（编辑距离，动归） Python 实现

发表于 2016-09-05 | 分类于 Python | 暂无评论

Levenshtein 距离

叫编辑距离大家就都秒 get√ 了，经典动归。
这里给出 Python 的实现。

def levenshtein(first, second):
    first_length = len(first)
    second_length = len(second)
    if first_length > second_length:
        first, second = second, first
        first_length, second_length = second_length, first_length
    if first_length == 0:
        return second_length
    if second_length == 0:
        return first_length
    distance_matrix = [range(second_length+1) for x in range(first_length+1)]
    # print distance_matrix
    for i in range(1, first_length+1):
        for j in range(1, second_length+1):
            deletion = distance_matrix[i - 1][j] + 1
            insertion = distance_matrix[i][j - 1] + 1
            substitution = distance_matrix[i - 1][j - 1]
            if first[i - 1] != second[j - 1]:
                substitution += 1
            distance_matrix[i][j] = min(insertion, deletion, substitution)
    print distance_matrix
    return distance_matrix[first_length][second_length]

有需要的自取。

Transforming Code into Beautiful, Idiomatic Python

发表于 2016-08-28 | 分类于 Python | 暂无评论

本文内容可以在这里下载 http://tech.gladuo.com/beautiful_idiomatic_python.html

Notes from Raymond Hettinger's talk at pycon US 2013 video, slides.

The code examples and direct quotes are all from Raymond's talk. I've reproduced them here for my own edification and the hopes that others will find them as handy as I have!

Looping over a range of numbers

for i in [0, 1, 2, 3, 4, 5]:
    print i**2

for i in range(6):
    print i**2

Better

for i in xrange(6):
    print i**2

xrange creates an iterator over the range producing the values one at a time. This approach is much more memory efficient than range. xrange was renamed to range in python 3.

Looping over a collection

colors = ['red', 'green', 'blue', 'yellow']

for i in range(len(colors)):
    print colors[i]

Better

for color in colors:
    print color

Looping backwards

colors = ['red', 'green', 'blue', 'yellow']

for i in range(len(colors)-1, -1, -1):
    print colors[i]

Better

for color in reversed(colors):
    print color

Looping over a collection and indices

colors = ['red', 'green', 'blue', 'yellow']

for i in range(len(colors)):
    print i, '--->', colors[i]

Better

for i, color in enumerate(colors):
    print i, '--->', color

It's fast and beautiful and saves you from tracking the individual indices and incrementing them.
Whenever you find yourself manipulating indices [in a collection], you're probably doing it wrong.

Looping over two collections

names = ['raymond', 'rachel', 'matthew']
colors = ['red', 'green', 'blue', 'yellow']

n = min(len(names), len(colors))
for i in range(n):
    print names[i], '--->', colors[i]

for name, color in zip(names, colors):
    print name, '--->', color

Better

for name, color in izip(names, colors):
    print name, '--->', color

zip creates a new list in memory and takes more memory. izip is more efficient than zip.
Note: in python 3 izip was renamed to zip and promoted to a builtin replacing the old zip.

Looping in sorted order

colors = ['red', 'green', 'blue', 'yellow']

# Forward sorted order
for color in sorted(colors):
    print colors

# Backwards sorted order
for color in sorted(colors, reverse=True):
    print colors

Custom Sort Order

colors = ['red', 'green', 'blue', 'yellow']

def compare_length(c1, c2):
    if len(c1) < len(c2): return -1
    if len(c1) > len(c2): return 1
    return 0

print sorted(colors, cmp=compare_length)

Better

print sorted(colors, key=len)

The original is slow and unpleasant to write. Also, comparison functions are no longer available in python 3.

Call a function until a sentinel value

blocks = []
while True:
    block = f.read(32)
    if block == '':
        break
    blocks.append(block)

Better

blocks = []
for block in iter(partial(f.read, 32), ''):
    blocks.append(block)

iter takes two arguments. The first you call over and over again and the second is a sentinel value.

Distinguishing multiple exit points in loops

def find(seq, target):
    found = False
    for i, value in enumerate(seq):
        if value == target:
            found = True
            break
    if not found:
        return -1
    return i

Better

def find(seq, target):
    for i, value in enumerate(seq):
        if value == target:
            break
    else:
        return -1
    return i

Inside of every for loop is an else.

Looping over dictionary keys

d = {'matthew': 'blue', 'rachel': 'green', 'raymond': 'red'}

for k in d:
    print k

for k in d.keys():
    if k.startswith('r'):
        del d[k]

When should you use the second and not the first? When you're mutating the dictionary.

If you mutate something while you're iterating over it, you're living in a state of sin and deserve what ever happens to you.

d.keys() makes a copy of all the keys and stores them in a list. Then you can modify the dictionary.
Note: in python 3 to iterate through a dictionary you have to explicidly write: list(d.keys()) because d.keys() returns a "dictionary view" (an iterable that provide a dynamic view on the dictionary’s keys). See documentation.

Looping over dictionary keys and values

# Not very fast, has to re-hash every key and do a lookup
for k in d:
    print k, '--->', d[k]

# Makes a big huge list
for k, v in d.items():
    print k, '--->', v

Better

for k, v in d.iteritems():
    print k, '--->', v

iteritems() is better as it returns an iterator.
Note: in python 3 there is no iteritems() and items() behaviour is close to what iteritems() had. See documentation.

Construct a dictionary from pairs

names = ['raymond', 'rachel', 'matthew']
colors = ['red', 'green', 'blue']

d = dict(izip(names, colors))
# {'matthew': 'blue', 'rachel': 'green', 'raymond': 'red'}

For python 3: d = dict(zip(names, colors))

Counting with dictionaries

colors = ['red', 'green', 'red', 'blue', 'green', 'red']

# Simple, basic way to count. A good start for beginners.
d = {}
for color in colors:
    if color not in d:
        d[color] = 0
    d[color] += 1

# {'blue': 1, 'green': 2, 'red': 3}

Better

d = {}
for color in colors:
    d[color] = d.get(color, 0) + 1

# Slightly more modern but has several caveats, better for advanced users
# who understand the intricacies
d = defaultdict(int)
for color in colors:
d[color] += 1

Grouping with dictionaries -- Part I and II

names = ['raymond', 'rachel', 'matthew', 'roger',
         'betty', 'melissa', 'judith', 'charlie']

# In this example, we're grouping by name length
d = {}
for name in names:
    key = len(name)
    if key not in d:
        d[key] = []
    d[key].append(name)

# {5: ['roger', 'betty'], 6: ['rachel', 'judith'], 7: ['raymond', 'matthew', 'melissa', 'charlie']}

d = {}
for name in names:
    key = len(name)
    d.setdefault(key, []).append(name)

Better

d = defaultdict(list)
for name in names:
    key = len(name)
    d[key].append(name)

Is a dictionary popitem() atomic?

d = {'matthew': 'blue', 'rachel': 'green', 'raymond': 'red'}

while d:
    key, value = d.popitem()
    print key, '-->', value

popitem is atomic so you don't have to put locks around it to use it in threads.

Linking dictionaries

defaults = {'color': 'red', 'user': 'guest'}
parser = argparse.ArgumentParser()
parser.add_argument('-u', '--user')
parser.add_argument('-c', '--color')
namespace = parser.parse_args([])
command_line_args = {k:v for k, v in vars(namespace).items() if v}

# The common approach below allows you to use defaults at first, then override them
# with environment variables and then finally override them with command line arguments.
# It copies data like crazy, unfortunately.
d = defaults.copy()
d.update(os.environ)
d.update(command_line_args)

Better

d = ChainMap(command_line_args, os.environ, defaults)

ChainMap has been introduced into python 3. Fast and beautiful.

Improving Clarity

Positional arguments and indicies are nice
Keywords and names are better
The first way is convenient for the computer
The second corresponds to how human’s think

Clarify function calls with keyword arguments

twitter_search('@obama', False, 20, True)

Better

twitter_search('@obama', retweets=False, numtweets=20, popular=True)

Is slightly (microseconds) slower but is worth it for the code clarity and developer time savings.

Clarify multiple return values with named tuples

# Old testmod return value
doctest.testmod()
# (0, 4)
# Is this good or bad? You don't know because it's not clear.

Better

# New testmod return value, a namedTuple
doctest.testmod()
# TestResults(failed=0, attempted=4)

A namedTuple is a subclass of tuple so they still work like a regular tuple, but are more friendly.

To make a namedTuple:

TestResults = namedTuple('TestResults', ['failed', 'attempted'])

Unpacking sequences

p = 'Raymond', 'Hettinger', 0x30, 'python@example.com'

# A common approach / habit from other languages
fname = p[0]
lname = p[1]
age = p[2]
email = p[3]

Better

fname, lname, age, email = p

The second approach uses tuple unpacking and is faster and more readable.

Updating multiple state variables

def fibonacci(n):
    x = 0
    y = 1
    for i in range(n):
        print x
        t = y
        y = x + y
        x = t

Better

def fibonacci(n):
    x, y = 0, 1
    for i in range(n):
        print x
        x, y = y, x + y

Problems with first approach

x and y are state, and state should be updated all at once or in between lines that state is mis-matched and a common source of issues
ordering matters
it's too low level

The second approach is more high-level, doesn't risk getting the order wrong and is fast.

Simultaneous state updates

tmp_x = x + dx * t
tmp_y = y + dy * t
tmp_dx = influence(m, x, y, dx, dy, partial='x')
tmp_dy = influence(m, x, y, dx, dy, partial='y')
x = tmp_x
y = tmp_y
dx = tmp_dx
dy = tmp_dy

Better

x, y, dx, dy = (x + dx * t,
                y + dy * t,
                influence(m, x, y, dx, dy, partial='x'),
                influence(m, x, y, dx, dy, partial='y'))

Efficiency

An optimization fundamental rule
Don’t cause data to move around unnecessarily
It takes only a little care to avoid O(n**2) behavior instead of linear behavior

Basically, just don't move data around unecessarily.

Concatenating strings

names = ['raymond', 'rachel', 'matthew', 'roger',
         'betty', 'melissa', 'judith', 'charlie']

s = names[0]
for name in names[1:]:
    s += ', ' + name
print s

Better

print ', '.join(names)

Updating sequences

names = ['raymond', 'rachel', 'matthew', 'roger',
         'betty', 'melissa', 'judith', 'charlie']

del names[0]
# The below are signs you're using the wrong data structure
names.pop(0)
names.insert(0, 'mark')

Better

names = deque(['raymond', 'rachel', 'matthew', 'roger',
               'betty', 'melissa', 'judith', 'charlie'])

# More efficient with deque
del names[0]
names.popleft()
names.appendleft('mark')

Decorators and Context Managers

Helps separate business logic from administrative logic
Clean, beautiful tools for factoring code and improving code reuse
Good naming is essential.
Remember the Spiderman rule: With great power, comes great responsibility!

Using decorators to factor-out administrative logic

# Mixes business / administrative logic and is not reusable
def web_lookup(url, saved={}):
    if url in saved:
        return saved[url]
    page = urllib.urlopen(url).read()
    saved[url] = page
    return page

Better

@cache
def web_lookup(url):
    return urllib.urlopen(url).read()

Note: since python 3.2 there is a decorator for this in the standard library: functools.lru_cache.

Factor-out temporary contexts

# Saving the old, restoring the new
old_context = getcontext().copy()
getcontext().prec = 50
print Decimal(355) / Decimal(113)
setcontext(old_context)

Better

with localcontext(Context(prec=50)):
    print Decimal(355) / Decimal(113)

How to open and close files

f = open('data.txt')
try:
    data = f.read()
finally:
    f.close()

Better

with open('data.txt') as f:
    data = f.read()

How to use locks

# Make a lock
lock = threading.Lock()

# Old-way to use a lock
lock.acquire()
try:
    print 'Critical section 1'
    print 'Critical section 2'
finally:
    lock.release()

Better

# New-way to use a lock
with lock:
    print 'Critical section 1'
    print 'Critical section 2'

Factor-out temporary contexts

try:
    os.remove('somefile.tmp')
except OSError:
    pass

Better

with ignored(OSError):
    os.remove('somefile.tmp')

ignored is is new in python 3.4, documentation.
Note: ignored is actually called suppress in the standard library.

To make your own ignored context manager in the meantime:

@contextmanager
def ignored(*exceptions):
    try:
        yield
    except exceptions:
        pass

Stick that in your utils directory and you too can ignore exceptions

Factor-out temporary contexts

# Temporarily redirect standard out to a file and then return it to normal
with open('help.txt', 'w') as f:
    oldstdout = sys.stdout
    sys.stdout = f
    try:
        help(pow)
    finally:
        sys.stdout = oldstdout

Better

with open('help.txt', 'w') as f:
    with redirect_stdout(f):
        help(pow)

redirect_stdout is proposed for python 3.4, bug report.

To roll your own redirect_stdout context manager

@contextmanager
def redirect_stdout(fileobj):
    oldstdout = sys.stdout
    sys.stdout = fileobj
    try:
        yield fieldobj
    finally:
        sys.stdout = oldstdout

Concise Expressive One-Liners

Two conflicting rules:

Don’t put too much on one line
Don’t break atoms of thought into subatomic particles

Raymond’s rule:

One logical line of code equals one sentence in English

List Comprehensions and Generator Expressions

result = []
for i in range(10):
s = i ** 2
    result.append(s)
print sum(result)

Better

print sum(i**2 for i in xrange(10))

First way tells you what to do, second way tells you what you want.

Pillow installation on OS X

发表于 2016-04-03 | 分类于问题解决 | 暂无评论

ValueError: zlib required for Pillow on OS X

$ brew tap homebrew/dupes 
$ brew install zlib 
$ brew link --force zlib

Then install Pillow:

$ pip install Pillow

参考链接：
http://pillow.readthedocs.org/en/3.1.x/installation.html
http://stackoverflow.com/questions/19532125/cant-install-pil-after-mac-os-x-10-9
http://stackoverflow.com/questions/18246211/no-zlib-support-with-pillow-w-python3-3

Python 的变量作用域

发表于 2016-04-01 | 分类于 python知识库 , Python | 暂无评论

几个概念：

python能够改变变量作用域的代码段是def、class、lamda
if/elif/else、try/except/finally、for/while 并不能涉及变量作用域的更改，也就是说他们的代码块中的变量，在外部也是可以访问的
变量搜索路径是：本地变量->全局变量

这意味着当你要用全局变量的时候，务必记得global var，这样才能保证你此刻操作的变量是哪个全局变量，否则相当于你在此刻新产生了一个变量，跟 C/C++ 是有不同的，需要注意。

更多实验可以看这里 Links，通过实验应该可以更清楚一点吧。

django.core.exceptions.ImproperlyConfigured: Requested setting LOGGING_CONFIG... 错误的两种解决办法

发表于 2016-03-28 | 分类于 Python , 问题解决 | 2 条评论

在网上取数据，用 Django 存储的时候遇到的错误提示
django.core.exceptions.ImproperlyConfigured: Requested setting LOGGING_CONFIG, but settings are not configured. You must either define the environment variable DJANGO_SETTINGS_MODULE or call settings.configure() before accessing settings.

根据提示给出的两种办法：

define the environment variable DJANGO_SETTINGS_MODULE

import os
os.environ.setdefault("DJANGO_SETTINGS_MODULE", "你的project.settings")

call settings.configure()

from django.conf import settings
settings.configure(DEBUG=True)

关于call settings.configure()官方的说法：

In some cases, you might want to bypass the DJANGO_SETTINGS_MODULE
environment variable. For example, if you’re using the template system
by itself, you likely don’t want to have to set up an environment
variable pointing to a settings module.

Exactly what I want to do.