Regex to remove new lines up to a specific character

情歌与酒 2021-01-27 14:41

I have a series of strings in a file of the format:

>HEADER_Text1
Information here, yada yada yada
Some more information here, yada yada yada
Even some more information here, yada yada yada
5 Answers
  • 2021-01-27 14:56

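    You don't have to use a regex. A list comprehension that keeps header lines as they are and strips the newline from every other line should work:

    [ x.startswith('>') and x or x.replace('\n','') for x in f.readlines()]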

    In [43]: f=open('test.txt')
    
    In [44]: contents=[ x.startswith('>') and x or x.replace('\n','') for x in f.readlines()]                                                                                   
    
    In [45]: contents
    Out[45]: 
    ['>HEADER_Text1\n',
     'Information here, yada yada yada',
     'Some more information here, yada yada yada',
     'Even some more information here, yada yada yada',
     '>HEADER_Text2\n',
     'Information here, yada yada yada',
     'Some more information here, yada yada yada',
     'Even some more information here, yada yada yada',
     '>HEADER_Text3\n',
     'Information here, yada yada yada',
     'Some more information here, yada yada yada',
     'Even some more information here, yada yada yada']
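
The list above still has to be stitched back together, and a plain "".join(contents) would run the data lines into each other and into the next header. One possible way to finish the job is sketched below with itertools.groupby; the join_data_lines name, the shortened sample list, and the choice of a space as separator are assumptions for illustration, not part of the answer.

```python
from itertools import groupby

def join_data_lines(lines):
    """Yield header lines unchanged and each run of data lines as one line."""
    # groupby splits the input into alternating runs of header / non-header lines
    for is_header, group in groupby(lines, key=lambda s: s.startswith(">")):
        stripped = [s.rstrip("\n") for s in group]
        if is_header:
            yield from stripped          # consecutive headers stay separate
        else:
            yield " ".join(stripped)     # data lines collapse into one line

lines = [
    ">HEADER_Text1\n",
    "Information here, yada yada yada\n",
    "Some more information here, yada yada yada\n",
    ">HEADER_Text2\n",
    "Information here, yada yada yada\n",
]
print("\n".join(join_data_lines(lines)))
```

Because the function only iterates over its argument, an open file object can be passed in place of the list.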
    
  • 2021-01-27 14:58

    Given that > is always the first character of a header line, replace every match of the first pattern below with the second:

    "\n([^>])" with " \1"

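A minimal sketch of that substitution in Python, assuming the whole file fits in memory as a single string; the join_records name and the shortened sample text are made up for illustration. The pattern consumes the newline together with the character after it, which is why the captured character is written back after the inserted space. Note that the newline after each header is replaced too, so the header ends up on the same line as its data.

```python
import re

def join_records(text):
    # Replace a newline followed by any character other than '>' with a
    # space plus that same character (restored through the \1 backreference).
    return re.sub(r"\n([^>])", r" \1", text)

text = (
    ">HEADER_Text1\n"
    "Information here, yada yada yada\n"
    "Some more information here, yada yada yada\n"
    ">HEADER_Text2\n"
    "Information here, yada yada yada\n"
)

print(join_records(text))
```

The final newline is left alone because no character follows it for the group to capture.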
  • 2021-01-27 15:11

    You really don't want a regex, and for this job Python and Biopython are superfluous. If every record really is four lines long (the > header plus three data lines, as in the sample), just use GNU sed:

    sed '/^>/ { N; N; N; s/\n/ /2g }' file
    

    Results:

    >HEADER_Text1
    Information here, yada yada yada Some more information here, yada yada yada Even some more information here, yada yada yada
    >HEADER_Text2
    Information here, yada yada yada Some more information here, yada yada yada Even some more information here, yada yada yada
    >HEADER_Text3
    Information here, yada yada yada Some more information here, yada yada yada Even some more information here, yada yada yada
    
  • 2021-01-27 15:14

    This should also work:

    import re

    sampleText = """>HEADER_Text1
    Information here, yada yada yada
    Some more information here, yada yada yada
    Even some more information here, yada yada yada
    >HEADER_Text2
    Information here, yada yada yada
    Some more information here, yada yada yada
    Even some more information here, yada yada yada
    >HEADER_Text3
    Information here, yada yada yada
    Some more information here, yada yada yada
    Even some more information here, yada yada yada"""

    # Remove every newline that is not immediately followed by '>'
    cleartext = re.sub(r"\n(?!>)", "", sampleText)

    print(cleartext)

    Output (the header is joined to its data lines as well, with no separator):

    >HEADER_Text1Information here, yada yada yadaSome more information here, yada yada yadaEven some more information here, yada yada yada
    >HEADER_Text2Information here, yada yada yadaSome more information here, yada yada yadaEven some more information here, yada yada yada
    >HEADER_Text3Information here, yada yada yadaSome more information here, yada yada yadaEven some more information here, yada yada yada

  • 2021-01-27 15:18

    As noted in the comments, your best bet is to use an existing FASTA parser. Why not?

    Here's how I would join lines based on the leading greater-than:

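    def joinup(f):
        """Yield each header line, then its data lines joined by spaces."""
        buf = []
        for line in f:
            if line.startswith('>'):
                if buf:
                    yield " ".join(buf)
                yield line.rstrip()
                buf = []
            else:
                buf.append(line.rstrip())
        if buf:  # flush the last record; nothing to yield if the buffer is empty
            yield " ".join(buf)

    with open("...") as f:
        for joined_line in joinup(f):
            print(joined_line)  # blah blah...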
    