Regex to remove new lines up to a specific character

萝らか妹 submitted on 2019-12-04 06:15:17

Question


I have a series of strings in a file of the format:

>HEADER_Text1
Information here, yada yada yada
Some more information here, yada yada yada
Even some more information here, yada yada yada
>HEADER_Text2
Information here, yada yada yada
Some more information here, yada yada yada
Even some more information here, yada yada yada
>HEADER_Text3
Information here, yada yada yada
Some more information here, yada yada yada
Even some more information here, yada yada yada

I am trying to find a regex pattern that removes the newline characters between the lines that follow one > header and precede the next > header, so that the lines under each header are joined into a single line. The final result would look like:

>HEADER_Text1
Information here, yada yada yada Some more information here, yada yada yada Even some more information here, yada yada yada
>HEADER_Text2
Information here, yada yada yada Some more information here, yada yada yada Even some more information here, yada yada yada
>HEADER_Text3
Information here, yada yada yada Some more information here, yada yada yada Even some more information here, yada yada yada

Does anyone know how I can come up with a regex pattern to do this?

Side note: this is the FASTA format, which is common in bioinformatics.

Thanks!


Answer 1:


As noted in the comments, your best bet is to use an existing FASTA parser. Why not?

Here's how I would join lines based on the leading greater-than:

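def joinup(f):
    """Yield each header line as-is, then its body lines joined by spaces."""
    buf = []
    for line in f:
        if line.startswith('>'):
            # A new header starts: flush the body collected so far.
            if buf:
                yield " ".join(buf)
            yield line.rstrip()
            buf = []
        else:
            buf.append(line.rstrip())
    # Flush the body of the last record, if there is one.
    if buf:
        yield " ".join(buf)

with open("...") as f:
    for joined_line in joinup(f):
        print(joined_line)  # blah blah...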



Answer 2:


You don't have to use a regex:

[x if x.startswith('>') else x.replace('\n', '') for x in f.readlines()]

should work. It strips the newline from every line that is not a header and returns the lines as a list:

In [43]: f=open('test.txt')

In [44]: contents=[x if x.startswith('>') else x.replace('\n', '') for x in f.readlines()]

In [45]: contents
Out[45]: 
['>HEADER_Text1\n',
 'Information here, yada yada yada',
 'Some more information here, yada yada yada',
 'Even some more information here, yada yada yada',
 '>HEADER_Text2\n',
 'Information here, yada yada yada',
 'Some more information here, yada yada yada',
 'Even some more information here, yada yada yada',
 '>HEADER_Text3\n',
 'Information here, yada yada yada',
 'Some more information here, yada yada yada',
 'Even some more information here, yada yada yada']
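
The list above still holds one entry per input line, so the body lines are not yet merged into one. Here is a minimal sketch of one way to merge them without a regex. It assumes the lines are already in memory; the helper name join_records is hypothetical, and itertools.groupby is swapped in here, it is not part of the answer itself.

```python
from itertools import groupby


def join_records(lines):
    """Return header lines unchanged and each run of body lines as one string."""
    out = []
    # groupby splits the input into alternating runs of header / non-header lines.
    for is_header, group in groupby(lines, key=lambda s: s.startswith('>')):
        stripped = [s.rstrip('\n') for s in group]
        if is_header:
            out.extend(stripped)            # keep every header on its own line
        else:
            out.append(" ".join(stripped))  # merge the body lines with spaces
    return out


lines = [
    ">HEADER_Text1\n",
    "Information here\n",
    "Some more information here\n",
    ">HEADER_Text2\n",
    "Information here\n",
]
print("\n".join(join_records(lines)))
```

Joining the returned list with "\n" gives the layout the question asks for.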



Answer 3:


This should also work.

import re

sampleText = """>HEADER_Text1
Information here, yada yada yada
Some more information here, yada yada yada
Even some more information here, yada yada yada
>HEADER_Text2
Information here, yada yada yada
Some more information here, yada yada yada
Even some more information here, yada yada yada
>HEADER_Text3
Information here, yada yada yada
Some more information here, yada yada yada
Even some more information here, yada yada yada"""

# Remove every newline that is not immediately followed by '>'.
cleartext = re.sub(r"\n(?!>)", "", sampleText)

print(cleartext)

Output (note that the newline after each header is removed too, and the lines are joined without a space):

>HEADER_Text1Information here, yada yada yadaSome more information here, yada yada yadaEven some more information here, yada yada yada
>HEADER_Text2Information here, yada yada yadaSome more information here, yada yada yadaEven some more information here, yada yada yada
>HEADER_Text3Information here, yada yada yadaSome more information here, yada yada yadaEven some more information here, yada yada yada




Answer 4:


Given that the > is always expected to be the first character on a new line, replace

"\n([^>])" with " \1"
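
Applied literally, a substitution on newlines that are not followed by > also joins each header to its first body line. The sketch below is one possible variant in Python that leaves header lines alone. It assumes the whole file fits in memory as one string, that headers always start at the beginning of a line, and that line endings are plain "\n". The function name join_body_lines is hypothetical.

```python
import re


def join_body_lines(text):
    """Replace the newline after a body line with a space, unless the next
    line is a header or the text ends there."""
    # ^([^>\n].*)  a whole non-empty line that does not start with '>'
    # \n(?!>|\Z)   its newline, only if neither '>' nor end of text follows
    pattern = r"^([^>\n].*)\n(?!>|\Z)"
    return re.sub(pattern, r"\1 ", text, flags=re.MULTILINE)


sample = (
    ">HEADER_Text1\n"
    "Information here\n"
    "Some more information here\n"
    ">HEADER_Text2\n"
    "Information here\n"
    "Some more information here\n"
)
print(join_body_lines(sample), end="")
```

Anchoring the match at the start of a non-header line is what keeps the newline after each header intact.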




Answer 5:


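You really don't want a regex, and for this job Python and Biopython are superfluous. If every record is always exactly four lines long (a header plus three lines, as in FASTQ), just use sed. The command reads the three lines after each header into the pattern space and replaces the second and all later newlines with spaces; combining 2 and g this way is a GNU sed extension: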

sed '/^>/ { N; N; N; s/\n/ /2g }' file

Results:

>HEADER_Text1
Information here, yada yada yada Some more information here, yada yada yada Even some more information here, yada yada yada
>HEADER_Text2
Information here, yada yada yada Some more information here, yada yada yada Even some more information here, yada yada yada
>HEADER_Text3
Information here, yada yada yada Some more information here, yada yada yada Even some more information here, yada yada yada


Source: https://stackoverflow.com/questions/14800970/regex-to-remove-new-lines-up-to-a-specific-character
