Python Big Data Tricks
I recently attended a fruitful book camp based on the Python Tricks book by Dan Bader. There were 81 tricks that were new to me or that I found highly remarkable. It would not be practical to list them all here, and it would probably not even comply with the publisher’s copyright. Luckily, I had my data-driven glasses on during the five-day book camp.
Dan was kind enough to mention from time to time which Python Tricks have an impact on memory, speed, and performance when data is processed at a large scale. This is how the Python Big Data Tricks compilation was born.
Copies
There are several ways to create a copy in Python:

import copy

a = ['foo', 'bar']
b = a.copy()
c = a[:]
d = list(a)
e = copy.copy(a)
f = copy.deepcopy(a)
Creating deep copies is slower and requires more space; in one benchmark, deepcopy was about 270 times slower than the slice approach.
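You can reproduce a comparison like this yourself. Here is a minimal timeit sketch; the exact ratio depends heavily on how deeply nested the data is:

import copy
import timeit

a = [['foo'] * 10 for _ in range(100)]  # nested, so deepcopy has real work to do

print(timeit.timeit(lambda: a[:], number=1_000))             # shallow slice copy
print(timeit.timeit(lambda: copy.deepcopy(a), number=1_000)) # recursive deep copy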
Namedtuples
namedtuples are great for creating immutable classes in Python, and they are more space-efficient than regular classes.
>>> from collections import namedtuple
>>> Goodie = namedtuple('Goodie', [
... 'url',
... 'followers',
... ])
>>> goodie = Goodie('datagoodie.com', 5765776523764)
>>> goodie.followers
5765776523764
A simple benchmark makes the space savings visible.
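Here is a minimal sketch along those lines using sys.getsizeof; GoodieClass is a hypothetical plain-class equivalent, and the exact byte counts vary across Python versions:

import sys
from collections import namedtuple

Goodie = namedtuple('Goodie', ['url', 'followers'])

class GoodieClass:  # hypothetical plain-class equivalent for comparison
    def __init__(self, url, followers):
        self.url = url
        self.followers = followers

nt = Goodie('datagoodie.com', 42)
obj = GoodieClass('datagoodie.com', 42)

# The namedtuple stores its fields in a tuple layout; the regular
# instance also carries a per-instance __dict__.
print(sys.getsizeof(nt))
print(sys.getsizeof(obj) + sys.getsizeof(obj.__dict__))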
Generators
- generators work like list comprehensions, but they are streams of data
- they allow for maintainable data-processing pipelines
- use generators for memory efficiency, because they produce values on the fly, e.g.

>>> sum(x * 2 for x in range(3))
6
Iterator Chains
- create data pipelines with iterator chains (dbader.org), as in the sketch below
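A minimal sketch of such a chain: each generator lazily consumes the previous one, so no intermediate lists are ever materialized:

integers = range(8)
squared = (i * i for i in integers)   # generator expression, not a list
negated = (-i for i in squared)       # consumes squared lazily

print(list(negated))                  # [0, -1, -4, -9, -16, -25, -36, -49]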
Arrays
There is a huge variety of array-like structures in Python:
- go for a generic structure like a list when you begin your project, and switch to a more efficient data structure once the data load becomes critical
- use NumPy/Pandas for a great choice of fast array implementations for scientific calculations and data analysis
- use array.array for more space efficiency (it is strictly typed)
- tuples require less space than lists
- bytes objects are immutable, bytearrays are mutable; converting a bytearray to bytes is slow because all the data has to be copied
- you can turn regular primitives into binary blobs with struct.Struct; doing that, you can keep more data in memory or send it in a compact package over a network (see the sketch after this list)
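A minimal sketch of the last two space-saving options; the byte counts are from a 64-bit CPython and will vary:

import sys
from array import array
from struct import Struct

# array.array stores values as raw C types instead of full Python objects.
nums_list = list(range(1000))
nums_array = array('i', range(1000))   # 4-byte signed ints
print(sys.getsizeof(nums_list))        # a list of pointers to int objects
print(sys.getsizeof(nums_array))       # a compact block of raw ints

# struct.Struct packs primitives into a fixed binary layout.
record = Struct('IId')                 # two unsigned ints + one double
blob = record.pack(23, 42, 3.14)
print(len(blob))                       # 16 bytes
print(record.unpack(blob))             # (23, 42, 3.14)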
Queues
- you can use lists as stacks, with append() and pop() adding and removing the last element of the list
- collections.deque is great for pushes and pops at the end AND at the beginning (both O(1)), but performs poorly at random access (O(n))
- in parallel environments, queues can be used to share elements with either synchronized or unsynchronized access
- for priority queues use queue.PriorityQueue, which is synchronized for parallel environments, or heapq when you do not need that locking (see the sketch below)
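A minimal sketch of both structures:

from collections import deque
import heapq

# deque: O(1) appends and pops at both ends.
q = deque()
q.append('a')         # push right
q.appendleft('b')     # push left
print(q.pop())        # 'a'
print(q.popleft())    # 'b'

# heapq: a plain list kept in heap order -- the smallest item pops first.
heap = []
heapq.heappush(heap, (2, 'write report'))
heapq.heappush(heap, (1, 'fix bug'))
print(heapq.heappop(heap))  # (1, 'fix bug')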
Measure Performance
- deconstruct your functions and data structures with Python’s disassembler, the dis module (see the sketch below)
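A minimal sketch: dis prints the bytecode a function compiles to, which helps explain why two seemingly equivalent constructs perform differently:

import dis

def double_sum(items):
    # the generator expression compiles to its own code object
    return sum(x * 2 for x in items)

dis.dis(double_sum)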
Please let me know if you have other great tricks and code examples to make Big Data development with Python more efficient.