Python Big Data Tricks

I recently finished a fruitful five-day book camp based on the Python Tricks book by Dan Bader. There were 81 tricks that were new to me or that I found highly remarkable. Listing all of them here would not be very practical, and it would probably not even comply with the publisher’s copyright. Luckily, I had my data-driven glasses on during the book camp.

Dan was kind enough to mention from time to time which Python tricks have an impact on memory, speed, and performance when data is processed at a large scale. This is how the Python Big Data Tricks compilation was born.


Copies

There are several ways to create a copy of a list in Python:

import copy

a = ['foo', 'bar']

b = a.copy()          # shallow copy via the list method
c = a[:]              # shallow copy via slicing
d = list(a)           # shallow copy via the constructor
e = copy.copy(a)      # shallow copy via the copy module
f = copy.deepcopy(a)  # deep copy: nested objects are copied recursively too

Creating deep copies is slower and requires more space; in my benchmark, copy.deepcopy() was about 270 times slower than the slice approach.
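You can reproduce a benchmark like this with timeit; here is a minimal sketch (the list size and iteration count are arbitrary choices of mine, so the exact factor will vary):

import copy
import timeit

a = list(range(100))

slice_time = timeit.timeit(lambda: a[:], number=100_000)
deep_time = timeit.timeit(lambda: copy.deepcopy(a), number=100_000)

print(f'slice:    {slice_time:.3f}s')
print(f'deepcopy: {deep_time:.3f}s ({deep_time / slice_time:.0f}x slower)')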

Namedtuples

namedtuples are great for creating immutable, record-like classes in Python, and they are more space-efficient than regular classes.

>>> from collections import namedtuple
>>> Goodie = namedtuple('Goodie', [
...     'url',
...     'followers',
... ])

>>> goodie = Goodie('datagoodie.com', 5765776523764)
>>> goodie.followers
5765776523764

A beautiful benchmark on space efficiency compares a namedtuple instance with a regular class instance.
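Here is a minimal sketch with sys.getsizeof; the regular class GoodieClass is my own stand-in, and exact byte counts vary by Python version:

import sys
from collections import namedtuple

Goodie = namedtuple('Goodie', ['url', 'followers'])

class GoodieClass:
    def __init__(self, url, followers):
        self.url = url
        self.followers = followers

nt = Goodie('datagoodie.com', 5765776523764)
obj = GoodieClass('datagoodie.com', 5765776523764)

# a namedtuple is a plain tuple under the hood
print(sys.getsizeof(nt))

# a regular instance additionally carries a per-instance __dict__
print(sys.getsizeof(obj) + sys.getsizeof(obj.__dict__))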

Generators

  • generators work like list comprehensions but are streams of data
  • they allow for maintainable data-processing pipelines
  • use generators for memory efficiency: they produce values on the fly instead of materializing a whole list (see the sketch after this list), e.g.
    >>> sum(x * 2 for x in range(3))
    6
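
A quick way to make the memory win visible (sys.getsizeof reports only the container or generator object itself, and sizes vary by Python version):

import sys

# the list comprehension materializes all 100,000 items at once
full_list = [x * 2 for x in range(100_000)]
print(sys.getsizeof(full_list))   # hundreds of kilobytes

# the generator expression only stores its iteration state, not the items
lazy = (x * 2 for x in range(100_000))
print(sys.getsizeof(lazy))        # roughly a hundred bytes, regardless of the range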
    

Iterator Chains

  • create data pipelines with iterator chains (dbader.org)
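A minimal sketch of such a chain (the stage functions are my own illustrations): each generator consumes the previous one, so every item streams through the whole pipeline lazily.

def read_numbers(lines):
    # stage 1: parse raw strings into integers
    for line in lines:
        yield int(line)

def only_even(numbers):
    # stage 2: keep only even values
    for n in numbers:
        if n % 2 == 0:
            yield n

def double(numbers):
    # stage 3: transform each value
    for n in numbers:
        yield n * 2

# chain the generators into one pipeline
pipeline = double(only_even(read_numbers(['1', '2', '3', '4'])))
print(list(pipeline))  # [4, 8]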

Arrays

There is a huge variety of array-like data structures in Python:

  • go for a generic array structure like a list when you begin your project, and switch to a more efficient data structure once the data load becomes critical
  • use NumPy/Pandas for a great choice of fast array implementations for scientific calculations and data analysis

  • use array.array for more space efficiency (strictly typed)
  • tuples require less space than lists

  • bytes objects are immutable, bytearrays are mutable. Converting a bytearray back to bytes is slow because it copies all of the data (O(n))

  • you can turn regular primitives into binary blobs with struct.Struct. Doing that, you can keep more data in memory or send it as a compact packet over a network; see the sketch below
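A minimal struct.Struct sketch; the field layout '<iq' (a 32-bit and a 64-bit little-endian integer) is just my example format:

from struct import Struct

# pack an (id, followers) record into exactly 12 bytes
record = Struct('<iq')

blob = record.pack(42, 5765776523764)
print(len(blob))            # 12 bytes instead of a full tuple of Python ints

print(record.unpack(blob))  # (42, 5765776523764)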

Queues

  • you can use lists as stacks: append() and pop() add and remove the last element of the list
  • collections.deque is great for push/pop at the end AND at the beginning (both O(1)), but performs poorly at random access (O(n)); see the sketch after this list
  • in parallel and distributed environments, use synchronized queues: queue.Queue provides locking for multiple concurrent producers and consumers, while multiprocessing.Queue lets items be shared between processes
  • for priority queues use heapq in the simple, single-threaded case, or queue.PriorityQueue when you need that locking in concurrent settings
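A minimal sketch of the two structures mentioned above:

from collections import deque
import heapq

# deque: O(1) appends and pops at both ends
d = deque([2, 3])
d.appendleft(1)              # deque([1, 2, 3])
d.append(4)                  # deque([1, 2, 3, 4])
print(d.popleft(), d.pop())  # 1 4

# heapq: a binary heap on top of a plain list, smallest item first
heap = []
heapq.heappush(heap, (2, 'write report'))
heapq.heappush(heap, (1, 'fix production bug'))
print(heapq.heappop(heap))   # (1, 'fix production bug')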

Measure Performance

  • deconstruct your functions and data structures with Python’s disassembler, the dis module
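A minimal sketch; the exact bytecode printed depends on your Python version:

import dis

def double_sum(items):
    return sum(x * 2 for x in items)

# prints the bytecode instructions the interpreter runs for double_sum
dis.dis(double_sum)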

Please let me know if you have other great tricks and code examples to make Big Data development with Python more efficient.
