Jupyter Lab freezes the computer when out of RAM - how to prevent it?

后端未结

关注

 7  1341

I have recently started using Jupyter Lab and my problem is that I work with quite large datasets (usually the dataset itself is approx. 1/4 of my computer RAM). After few trans

相关标签:

7条回答

旧巷少年郎

2021-02-04 08:29
There is no reason to view the entire output of a large dataframe. Viewing or manipulating large dataframes will unnecessarily use large amounts of your computer resources.

Whatever you are doing can be done in miniature. It's far easier working on coding and manipulating data when the data frame is small. The best way to work with big data is to create a new data frame that takes only small portion or a small sample of the large data frame. Then you can explore the data and do your coding on the smaller data frame. Once you have explored the data and get your code working, then just use that code on the larger data frame.

The easiest way is simply take the first n, number of the first rows from the data frame using the head() function. The head function prints only n, number of rows. You can create a mini data frame by using the head function on the large data frame. Below I chose to select the first 50 rows and pass their value to the small_df. This assumes the BigData is a data file that comes from a library you opened for this project.
```
library(namedPackage) 

df <- data.frame(BigData)                #  Assign big data to df
small_df <- head(df, 50)         #  Assign the first 50 rows to small_df
```
This will work most of the time, but sometimes the big data frame comes with presorted variables or with variables already grouped. If the big data is like this, then you would need to take a random sample of the rows from the big data. Then use the code that follows:
```
df <- data.frame(BigData)

set.seed(1016)                                          # set your own seed

df_small <- df[sample(nrow(df),replace=F,size=.03*nrow(df)),]     # samples 3% rows
df_small                                                         # much smaller df
```
0 讨论(0)
发布评论:

提交评论
- 加载中...
情歌与酒

2021-02-04 08:33
I also work with very large datasets (3GB) on Jupyter Lab and have been experiencing the same issue on Labs. It's unclear if you need to maintain access to the pre-transformed data, if not, I've started using del of unused large dataframe variables if I don't need them. del removes variables from your memory. Edit** : there a multiple possibilities for the issue I'm encountering. I encounter this more often when I'm using a remote jupyter instance, and in spyder as well when I'm perfoming large transformations.

e.g.
```
df = pd.read('some_giant_dataframe') # or whatever your import is
new_df = my_transform(df)
del df # if unneeded.
```
Jakes you may also find this thread on large data workflows helpful. I've been looking into Dask to help with memory storage.

I've noticed in spyder and jupyter that the freezeup will usually happen when working in another console while a large memory console runs. As to why it just freezes up instead of crashing out, I think this has something to do with the kernel. There are a couple memory issues open in the IPython github - #10082 and #10117 seem most relevant. One user here suggest disabling tab completion in jedi or updating jedi.

In 10117 they propose checking the output of get_ipython().history_manager.db_log_output. I have the same issues and my setting is correct, but it's worth checking
0 讨论(0)
发布评论:

提交评论
- 加载中...

情深已故

2021-02-04 08:37

I think you should use chunks. Like that:

df_chunk = pd.read_csv(r'../input/data.csv', chunksize=1000000)
chunk_list = []  # append each chunk df here 

# Each chunk is in df format
for chunk in df_chunk:  
    # perform data filtering 
    chunk_filter = chunk_preprocessing(chunk)

    # Once the data filtering is done, append the chunk to list
    chunk_list.append(chunk_filter)

# concat the list into dataframe 
df_concat = pd.concat(chunk_list)

For more information check it out: https://towardsdatascience.com/why-and-how-to-use-pandas-with-large-data-9594dda2ea4c

I suggest don't append a list again(probably the RAM will overload again). You should finish your job in that for loop.

0 讨论(0)

悲&欢浪女

2021-02-04 08:39

If you are using Ubuntu, check out OOM killers, you can get information from here

You can use earlyoom. It can be configured as you wish, e.g. earlyoom -s 90 -m 15 will start the earlyoom and when swap size is less than %90 and memory is less than %15, it will kill the process that causes OOM and prevent the whole system to freeze. You can also configure the priority of the processes.

0 讨论(0)
发布评论:

提交评论
- 加载中...
生来不讨喜

2021-02-04 08:42

You can also use notebooks in the cloud also, such as Google Colab here. They have provided facility for recommended RAMs and support for Jupyter notebook is by default.

0 讨论(0)
发布评论:

提交评论
- 加载中...

独厮守ぢ

2021-02-04 08:48

I am going to summarize the answers from the following question. You can limit the memory usage of your programm. In the following this will be the function ram_intense_foo(). Before calling that you need to call the function limit_memory(10)

import resource
import platform
import sys
import numpy as np 

def memory_limit(percent_of_free):
    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    resource.setrlimit(resource.RLIMIT_AS, (get_memory() * 1024 * percent_of_free / 100, hard))

def get_memory():
    with open('/proc/meminfo', 'r') as mem:
        free_memory = 0
        for i in mem:
            sline = i.split()
            if str(sline[0]) == 'MemAvailable:':
                free_memory = int(sline[1])
                break
    return free_memory

def ram_intense_foo(a,b):
    A = np.random.rand(a,b)
    return A.T@A

if __name__ == '__main__':
    memory_limit(95)
    try:
        temp = ram_intense_foo(4000,10000)
        print(temp.shape)
    except MemoryError:
        sys.stderr.write('\n\nERROR: Memory Exception\n')
        sys.exit(1)

0 讨论(0)

1 2 下一页