Save up to 90% of memory with simple tricks in Pandas

A few days back, while working on sentiment analysis for a huge dataset, I kept running into memory issues. I was cleaning up memory, nullifying objects, and not loading all the columns at once; still, the memory problems persisted. I knew the culprit was pandas, so I started researching solutions.
Pandas, as many of you already know, started out as a data wrangling and analysis tool, but thanks to its capabilities and ease of use it has become a production-level API.
With small datasets there are no issues, but with large datasets memory becomes a real concern. In this post, I will cover a few easy but important techniques that help pandas use memory efficiently and can reduce memory consumption by up to 90%.
1. Load Data in chunks
When I first read about this solution, I didn’t realize how useful it could be. pandas.read_csv has a parameter called chunksize that can be used to load the data in chunks.
import pandas as pd

for reader in pd.read_csv('train.csv', chunksize=500000):
    print(reader.count())
The chunksize parameter is the number of rows to read at a time, so only that many rows need to fit in local memory; read_csv then returns a TextFileReader object you can iterate over.
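As a minimal sketch of how that iteration is typically used (the 'label' column in train.csv is just a hypothetical example), you can process each chunk, keep only the rows you need, and stitch the results back together:
import pandas as pd

# Each pass holds only one chunk in memory; the filtered pieces are much smaller.
filtered_chunks = []
for chunk in pd.read_csv('train.csv', chunksize=500000):
    filtered_chunks.append(chunk[chunk['label'] == 1])  # 'label' is a hypothetical column
df_positive = pd.concat(filtered_chunks, ignore_index=True)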
Also, if you are willing to step out of the pandas zone for heavy work on large data, such as aggregations, dask is a much better fit because it provides advanced parallelism.
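Here is a rough sketch of what that looks like with dask, assuming dask is installed and, again, a hypothetical 'label' column:
import dask.dataframe as dd

# dask reads the CSV lazily in partitions and parallelizes the work;
# nothing is materialized until .compute() is called.
ddf = dd.read_csv('train.csv')
print(ddf['label'].value_counts().compute())  # 'label' is a hypothetical column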
2. Load only useful columns
With so much unstructured and open-source data available, it’s very common for a dataset to have many columns while you actually use only a few of them. So there is no point loading them all!!
df = pd.read_csv('train.csv')
df = df[['col1', 'col2', 'col3', 'col4', 'col5']]
Always load what you need.
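Even better, if memory is already tight, you can avoid ever reading the unused columns by passing usecols to read_csv. A small sketch, where the column names are just placeholders:
import pandas as pd

# Peek at the header only (zero data rows) to see which columns exist
print(pd.read_csv('train.csv', nrows=0).columns.tolist())
# Then read only the columns you actually need
df = pd.read_csv('train.csv', usecols=['col1', 'col2', 'col3', 'col4', 'col5'])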
3. Change numeric column dtypes
This is one of the most important techniques for reducing the memory used by numeric columns. If your values don’t need the full capacity of the default dtype, this technique can save a lot of memory.
Every pandas dtype has a range that defines which values it can store. For example, the integer dtypes have these ranges:
int8 from -128 to 127
int16 from -32768 to 32767
int32 from -2147483648 to 2147483647
int64 from -9223372036854775808 to 9223372036854775807
The larger the range, the more memory is used. By default pandas assigns int64 to integer columns (and float64 to float columns). But if the values in a numeric column fit in a smaller range, you can use a smaller dtype like this.
df = pd.read_csv('train.csv', dtype={'Col1': 'int8', 'Col2': 'int16'})
Try this. It can change the pandas memory game forever.
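If the data is already loaded, a minimal sketch of the same idea is to downcast with pd.to_numeric and verify the saving with memory_usage (the numbers in the comments are approximate):
import pandas as pd

df = pd.DataFrame({'col1': range(1000000)})
print(df.memory_usage(deep=True).sum())        # int64: roughly 8 MB
# Downcast to the smallest integer dtype that still fits the values
df['col1'] = pd.to_numeric(df['col1'], downcast='integer')
print(df['col1'].dtype)                        # int32 here, since the values exceed int16
print(df.memory_usage(deep=True).sum())        # roughly half of the original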
4. Change categorical column dtypes
Just like numeric dtypes, non-numeric column dtypes can also be changed or defined explicitly. For example, when we know that a particular column will only contain values from a fixed set, we can declare it as categorical right when creating the data frame.
df = pd.DataFrame({'A': list('abca'), 'B': list('bccd')}, dtype='category')
String dtypes are Fun!!
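For columns that already exist, a quick sketch of the same trick is to convert a repetitive string column with astype('category') and compare the footprint:
import pandas as pd

# A repetitive string column: object dtype stores every string separately,
# category stores small integer codes plus the three unique labels.
s = pd.Series(['positive', 'negative', 'neutral'] * 100000)
print(s.memory_usage(deep=True))                      # object dtype
print(s.astype('category').memory_usage(deep=True))   # category: a small fraction of the above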
5. Use Sparse data structures
Pandas provides data structures for efficiently storing sparse data. These are not necessarily sparse in the typical “mostly 0” sense. Rather, you can view these objects as being “compressed”: any data matching a specific value (NaN/missing by default, though any value can be chosen, including 0) is omitted. The compressed values are not actually stored in the array.
Sparse objects exist for memory-efficiency reasons: when you have a large, mostly-NA data frame, the density (the % of values that have not been “compressed”) is extremely low, and the sparse object takes up much less memory on disk (pickled) and in the Python interpreter.
import numpy as np
sdf = df.astype(pd.SparseDtype("float", np.nan))
Sparse has more benefits!!
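As a rough sketch of the saving (sizes in the comments are approximate), compare a mostly-NaN frame before and after the conversion:
import numpy as np
import pandas as pd

# 100,000 rows x 2 columns, almost entirely NaN
df = pd.DataFrame(np.nan, index=range(100000), columns=['a', 'b'])
df.iloc[:100] = 1.0
sdf = df.astype(pd.SparseDtype('float', np.nan))
print(df.memory_usage(deep=True).sum())    # dense float64: about 1.6 MB
print(sdf.memory_usage(deep=True).sum())   # sparse: only the ~200 stored values
print(sdf.sparse.density)                  # fraction of values actually stored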
6. Nullify/Delete unused objects
During data cleaning/pre-processing we often create many temporary data frames and objects; deleting them afterwards also helps keep memory usage down.
It’s not a big task, but it is very useful, especially while working on text data: during text cleanup we tend to create a lot of objects for sequencing/padding. Try deleting them afterwards and you will notice a huge difference in memory.
import gc
# Drop the temporary objects and let the garbage collector reclaim the memory
del X_train_sequence
del X_test_sequence
gc.collect()
Delete unused!!
Final Thoughts
So, these are some of the tricks you can apply to use pandas without memory issues. They will not only help you keep memory in check but also speed up your computations.
If you want to see a practical implementation of a few of these techniques, please check my Kaggle kernel here, where I was able to reduce memory consumption by up to 90%.
I hope you learned something new and that it will help you in some way.
As always, if you have any questions or comments, feel free to reach out and I will be happy to help!!