Jupyter Lab freezes the computer when out of RAM - how to prevent it?

佛祖请我去吃肉 2021-02-04 08:11

I have recently started using Jupyter Lab and my problem is that I work with quite large datasets (usually the dataset itself is approx. 1/4 of my computer's RAM). After a few transformations the memory usage approaches the RAM limit and the whole computer freezes, forcing a restart. How can I prevent this?

7 Answers
  • 2021-02-04 08:29

    There is no reason to view the entire output of a large dataframe. Viewing or manipulating a large dataframe will unnecessarily consume large amounts of your computer's resources.

    Whatever you are doing can be done in miniature. It is far easier to write and debug your code when the data frame is small. The best way to work with big data is to create a new data frame that holds only a small portion, or a small random sample, of the large data frame. Then explore the data and develop your code on the smaller data frame. Once the code works, apply it to the full data frame.

    The easiest approach is to take the first n rows of the data frame with the head() function, which returns only the first n rows. You can create a mini data frame by calling head() on the large data frame. Below I select the first 50 rows and assign them to small_df. This assumes BigData is a data set that comes from a library you loaded for this project.

    library(namedPackage) 
    
    df <- data.frame(BigData)                #  Assign big data to df
    small_df <- head(df, 50)         #  Assign the first 50 rows to small_df
    

    This will work most of the time, but sometimes the big data frame comes with presorted variables or with variables already grouped. If the big data is like this, then you need to take a random sample of rows from the big data frame instead, using code like the following:

    df <- data.frame(BigData)
    
    set.seed(1016)                                        # set your own seed
    
    df_small <- df[sample(nrow(df), replace = FALSE, size = 0.03 * nrow(df)), ]   # sample 3% of the rows
    df_small                                              # much smaller data frame
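
    The original question is about Jupyter Lab, where pandas is more common than R; a rough pandas equivalent of the same head/sample idea might look like this (big_df and the file path are placeholders, not from the answer):

    import pandas as pd
    
    big_df = pd.read_csv('big_data.csv')                       # placeholder data source
    
    small_df = big_df.head(50)                                 # first 50 rows
    sample_df = big_df.sample(frac=0.03, random_state=1016)    # random 3% of the rows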
    
  • 2021-02-04 08:33

    I also work with very large datasets (3 GB) in Jupyter Lab and have been experiencing the same issue. It's unclear whether you need to keep access to the pre-transformed data; if not, I've started using del on large dataframe variables I no longer need. del removes the name's reference to the object, so Python can free the memory once nothing else refers to it. Edit: there are multiple possibilities for the issue I'm encountering. I hit it more often when using a remote Jupyter instance, and in Spyder as well when performing large transformations.

    e.g.

    import pandas as pd
    
    df = pd.read_csv('some_giant_dataframe')  # or whatever your import is
    new_df = my_transform(df)
    del df  # if the original is no longer needed
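
    If memory still does not seem to be released after the del (for example because of reference cycles), an explicit garbage-collection pass can help; this is my own hedged addition rather than part of the original answer:

    import gc
    
    gc.collect()  # after del, ask the garbage collector to reclaim unreachable objects now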
    

    Jakes, you may also find this thread on large data workflows helpful. I've been looking into Dask to help with the memory pressure by processing the data in partitions rather than all at once.
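
    Not part of the original answer, but a minimal sketch of how the Dask approach might look (the file path and column names are placeholders):

    import dask.dataframe as dd
    
    # Dask reads the CSV lazily in partitions instead of loading everything into RAM
    ddf = dd.read_csv('some_giant_dataframe.csv')
    
    # Operations build a task graph; compute() materializes only the (small) result
    result = ddf.groupby('some_column')['some_value'].mean().compute()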

    I've noticed in Spyder and Jupyter that the freeze-up usually happens when working in another console while a console with heavy memory use is running. As for why it freezes instead of crashing outright, I think this has something to do with the kernel. There are a couple of memory issues open on the IPython GitHub (#10082 and #10117 seem most relevant). One user there suggests disabling tab completion in jedi or updating jedi.

    In #10117 they propose checking the output of get_ipython().history_manager.db_log_output. I have the same issues and my setting is correct, but it is worth checking; both checks are sketched below.
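
    A minimal sketch of those two checks, run from a notebook cell (these are standard IPython settings, but verify them against your IPython version):

    # Output logging to the history database should normally be off (False)
    get_ipython().history_manager.db_log_output
    
    # If jedi-based tab completion is the suspect, it can be disabled for the session
    %config IPCompleter.use_jedi = False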

  • 2021-02-04 08:37

    I think you should read the data in chunks, like this:

    import pandas as pd
    
    # chunksize makes read_csv return an iterator of DataFrames instead of one big frame
    df_chunk = pd.read_csv(r'../input/data.csv', chunksize=1000000)
    chunk_list = []  # append each filtered chunk here
    
    for chunk in df_chunk:
        # perform data filtering (chunk_preprocessing is your own function)
        chunk_filter = chunk_preprocessing(chunk)
    
        # once the filtering is done, append the chunk to the list
        chunk_list.append(chunk_filter)
    
    # concatenate the list back into a single dataframe
    df_concat = pd.concat(chunk_list)
    

    For more information, check out: https://towardsdatascience.com/why-and-how-to-use-pandas-with-large-data-9594dda2ea4c

    I suggest not appending every chunk to a list if the concatenated result is still too large for RAM (it will probably overload the memory again). Instead, finish your work inside that for loop and keep only a small per-chunk result, as sketched below.
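
    A minimal sketch of that idea, assuming a hypothetical numeric column 'value' that you want to aggregate (chunk_preprocessing is the same user-defined filter as above):

    import pandas as pd
    
    total, count = 0.0, 0
    for chunk in pd.read_csv(r'../input/data.csv', chunksize=1000000):
        chunk = chunk_preprocessing(chunk)   # your own filtering function
        total += chunk['value'].sum()        # keep only a small running aggregate
        count += len(chunk)
    
    print('mean value:', total / count)      # final result, without keeping all chunks in RAM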

  • 2021-02-04 08:39

    If you are using Ubuntu, check out OOM killers; you can get more information from here.

    You can use earlyoom. It can be configured as you wish, e.g. earlyoom -s 90 -m 15 will start earlyoom, and when free swap falls below 90% and available memory falls below 15%, it will kill the process causing the OOM condition and prevent the whole system from freezing. You can also configure the priority of the processes.

  • 2021-02-04 08:42

    You can also use notebooks in the cloud, such as Google Colab here. They provide machines with a reasonable amount of RAM, and support for Jupyter notebooks comes by default.

  • 2021-02-04 08:48

    I am going to summarize the answers from the following question. You can limit the memory usage of your program. In the code below, this is done for the function ram_intense_foo(); before calling it, you need to call memory_limit() with the percentage of free memory you want to allow. Note that this relies on the resource module and /proc/meminfo, so it only works on Linux.

    import resource
    import sys
    import numpy as np
    
    def memory_limit(percent_of_free):
        # Cap this process's address space at a percentage of the currently free memory
        soft, hard = resource.getrlimit(resource.RLIMIT_AS)
        new_soft = int(get_memory() * 1024 * percent_of_free / 100)   # /proc/meminfo reports kB
        resource.setrlimit(resource.RLIMIT_AS, (new_soft, hard))
    
    def get_memory():
        # Return the available memory in kB, read from /proc/meminfo (Linux only)
        free_memory = 0
        with open('/proc/meminfo', 'r') as mem:
            for line in mem:
                sline = line.split()
                if str(sline[0]) == 'MemAvailable:':
                    free_memory = int(sline[1])
                    break
        return free_memory
    
    def ram_intense_foo(a, b):
        A = np.random.rand(a, b)
        return A.T @ A
    
    if __name__ == '__main__':
        memory_limit(95)   # allow at most 95% of the currently free memory
        try:
            temp = ram_intense_foo(4000, 10000)
            print(temp.shape)
        except MemoryError:
            sys.stderr.write('\n\nERROR: Memory Exception\n')
            sys.exit(1)
    