I have recently started using Jupyter Lab and my problem is that I work with quite large datasets (usually the dataset itself is approx. 1/4 of my computer RAM). After few trans
There is no reason to view the entire output of a large dataframe. Viewing or manipulating large dataframes will unnecessarily use large amounts of your computer resources.
Whatever you are doing can be done in miniature. Writing code and manipulating data is far easier when the data frame is small. The best way to work with big data is to create a new data frame that holds only a small portion or a small sample of the large data frame. You can then explore the data and develop your code on the smaller data frame. Once you have explored the data and gotten your code working, run that code on the larger data frame.
The easiest way is to take the first n rows of the data frame using the head() function, which returns only the first n rows. You can create a mini data frame by calling head() on the large data frame. Below I select the first 50 rows and assign them to small_df. This assumes BigData is a dataset that comes from a library you loaded for this project.
library(namedPackage)
df <- data.frame(BigData) # Assign big data to df
small_df <- head(df, 50) # Assign the first 50 rows to small_df
This will work most of the time, but sometimes the big data frame comes with presorted variables or with variables already grouped. In that case the first rows are not representative of the whole, so you need to take a random sample of the rows from the big data instead. Use the code that follows:
df <- data.frame(BigData)
set.seed(1016) # set your own seed
df_small <- df[sample(nrow(df), size = floor(0.03 * nrow(df)), replace = FALSE), ] # sample 3% of the rows
df_small # much smaller df
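For readers working in pandas rather than R, the same two ideas (first n rows, or a random sample) can be sketched as below. The dataframe here is a small hypothetical stand-in for the large one, and the seed is arbitrary.

```python
import pandas as pd

# Hypothetical stand-in for a large, possibly pre-sorted dataframe.
df = pd.DataFrame({"id": range(1000), "value": [i % 7 for i in range(1000)]})

# First 50 rows: fine when row order carries no meaning.
small_df = df.head(50)

# Random 3% of the rows: safer when the data is sorted or grouped.
# random_state plays the role of set.seed() in the R version.
sampled_df = df.sample(frac=0.03, random_state=1016)

print(len(small_df), len(sampled_df))
```

Both calls return new, much smaller dataframes that you can explore freely before running the final code on the full data.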
I also work with very large datasets (3GB) on Jupyter Lab and have been experiencing the same issue on Labs.
It's unclear whether you need to keep access to the pre-transformed data. If not, I've started using del on large dataframe variables that I no longer need. del removes the variable name, and the memory can be freed once nothing else references the object. Edit: there are multiple possible causes for the issue I'm encountering. I run into it more often when I'm using a remote Jupyter instance, and in Spyder as well when I'm performing large transformations.
e.g.
import pandas as pd
df = pd.read_csv('some_giant_dataframe') # or whatever your import is
new_df = my_transform(df) # my_transform stands for your own transformation
del df # if unneeded.
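One caveat with del is that it only removes a name; the object is released once no other reference to it remains. Below is a minimal sketch of that behaviour in CPython, using a hypothetical Payload class as a stand-in for a large dataframe and a weak reference to observe whether the object is still alive.

```python
import gc
import weakref


class Payload:
    """Hypothetical stand-in for a large dataframe."""

    def __init__(self, n_bytes):
        self.data = bytearray(n_bytes)


def is_alive(ref):
    # A weak reference returns None once its target has been freed.
    return ref() is not None


big = Payload(10_000_000)
probe = weakref.ref(big)  # observes the object without keeping it alive
alias = big               # a second name bound to the same object

del big
alive_after_first_del = is_alive(probe)  # alias still holds the object

del alias
gc.collect()  # only needed for reference cycles; harmless here
alive_after_all_dels = is_alive(probe)

print(alive_after_first_del, alive_after_all_dels)
```

In a notebook, keep in mind that IPython's output history (Out, _, __) can also hold references to objects that were displayed as cell output, so a del on your own variable may not be enough on its own.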
Jakes, you may also find this thread on large data workflows helpful. I've been looking into Dask to help with memory usage.
I've noticed in Spyder and Jupyter that the freeze-up usually happens when I work in another console while a memory-heavy console is running. As to why it just freezes up instead of crashing out, I think this has something to do with the kernel. There are a couple of memory issues open in the IPython GitHub; #10082 and #10117 seem most relevant. One user here suggests disabling tab completion in jedi or updating jedi.
In #10117 they propose checking the output of get_ipython().history_manager.db_log_output. I have the same issues and my setting is correct, but it's worth checking.
I think you should use chunks, like this:
df_chunk = pd.read_csv(r'../input/data.csv', chunksize=1000000)
chunk_list = [] # append each chunk df here
# Each chunk is in df format
for chunk in df_chunk:
    # perform data filtering
    chunk_filter = chunk_preprocessing(chunk)
    # Once the data filtering is done, append the chunk to list
    chunk_list.append(chunk_filter)
# concat the list into dataframe
df_concat = pd.concat(chunk_list)
For more information, see: https://towardsdatascience.com/why-and-how-to-use-pandas-with-large-data-9594dda2ea4c
I suggest not appending the chunks to a list again (the RAM will probably fill up again). Instead, finish your work on each chunk inside the for loop.
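A minimal sketch of finishing the work inside the loop, assuming the goal is a per-group sum: each chunk is reduced to a small partial result and only a running total is kept, so no chunk outlives its iteration. The in-memory CSV, the column names, and the sum_by_group helper are all hypothetical placeholders for your own file and logic.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for a large file on disk.
csv_text = "group,value\n" + "\n".join(f"{'ab'[i % 2]},{i}" for i in range(10))


def sum_by_group(source, chunksize):
    """Stream the CSV in chunks, keeping only a small running total."""
    totals = {}
    for chunk in pd.read_csv(source, chunksize=chunksize):
        # Reduce the chunk right away instead of storing it.
        partial = chunk.groupby("group")["value"].sum()
        for key, value in partial.items():
            totals[key] = totals.get(key, 0) + int(value)
    return totals


totals = sum_by_group(io.StringIO(csv_text), chunksize=3)
print(totals)
```

This only works directly for results that can be combined across chunks (sums, counts, minima, maxima); something like a median would need a different approach.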
If you are using Ubuntu, check out OOM killers; you can get information from here.
You can use earlyoom. It can be configured as you wish, e.g. earlyoom -s 90 -m 15 will start earlyoom, and when free swap is below 90% and available memory is below 15%, it will kill the process that causes the OOM and prevent the whole system from freezing. You can also configure the priority of the processes.
You can also use notebooks in the cloud, such as Google Colab here. They provide the recommended amount of RAM, and Jupyter notebooks are supported by default.
I am going to summarize the answers from the following question.
You can limit the memory usage of your program. In the following, the memory-intensive code is the function ram_intense_foo(). Before calling it, you need to call the function memory_limit(), for example memory_limit(95).
import resource
import platform
import sys
import numpy as np

def memory_limit(percent_of_free):
    # Cap the address space at a percentage of the currently available memory (Linux only)
    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    # /proc/meminfo reports kB; setrlimit expects an integer number of bytes
    resource.setrlimit(resource.RLIMIT_AS, (int(get_memory() * 1024 * percent_of_free / 100), hard))

def get_memory():
    # Read the MemAvailable value (in kB) from /proc/meminfo
    with open('/proc/meminfo', 'r') as mem:
        free_memory = 0
        for i in mem:
            sline = i.split()
            if str(sline[0]) == 'MemAvailable:':
                free_memory = int(sline[1])
                break
    return free_memory

def ram_intense_foo(a,b):
    A = np.random.rand(a,b)
    return A.T@A

if __name__ == '__main__':
    memory_limit(95)
    try:
        temp = ram_intense_foo(4000,10000)
        print(temp.shape)
    except MemoryError:
        sys.stderr.write('\n\nERROR: Memory Exception\n')
        sys.exit(1)