I have a series of strings in a file of the format:
>HEADER_Text1
Information here, yada yada yada
Some more information here, yada yada yada
Even some more i
You don't have to use a regex:
[ x.startswith('>') and x or x.replace('\n','') for x in f.readlines()]
should work.
In [43]: f=open('test.txt')
In [44]: contents=[ x.startswith('>') and x or x.replace('\n','') for x in f.readlines()]
In [45]: contents
Out[45]:
['>HEADER_Text1\n',
'Information here, yada yada yada',
'Some more information here, yada yada yada',
'Even some more information here, yada yada yada',
'>HEADER_Text2\n',
'Information here, yada yada yada',
'Some more information here, yada yada yada',
'Even some more information here, yada yada yada',
'>HEADER_Text3\n',
'Information here, yada yada yada',
'Some more information here, yada yada yada',
'Even some more information here, yada yada yada']
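The `and`/`or` trick above predates Python's conditional expression. As a rough sketch, the same comprehension can be written with `x if ... else ...`. The `io.StringIO` object is only a stand-in for `test.txt`, and the shortened two-record sample is made up for illustration:

```python
import io

# Stand-in for open('test.txt'); the sample content is an assumption.
f = io.StringIO(
    ">HEADER_Text1\n"
    "Information here, yada yada yada\n"
    "Some more information here, yada yada yada\n"
    ">HEADER_Text2\n"
    "Information here, yada yada yada\n"
)

# Header lines are kept as-is (including their newline);
# every other line loses its trailing newline.
contents = [x if x.startswith('>') else x.rstrip('\n') for x in f]
print(contents)
```

Iterating over the file object directly avoids building the intermediate list that `readlines()` creates.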
Given that the > is always expected to be the first character on a new line, replace
"\n([^>])" with " \1"
You really don't want a regex. And for this job, Python and Biopython are superfluous. If that's actually FASTA format, just use sed (GNU sed here, assuming every header is followed by exactly three lines):
sed '/^>/ { N; N; N; s/\n/ /2g }' file
Results:
>HEADER_Text1
Information here, yada yada yada Some more information here, yada yada yada Even some more information here, yada yada yada
>HEADER_Text2
Information here, yada yada yada Some more information here, yada yada yada Even some more information here, yada yada yada
>HEADER_Text3
Information here, yada yada yada Some more information here, yada yada yada Even some more information here, yada yada yada
This should also work:
import re

sampleText = """>HEADER_Text1
Information here, yada yada yada
Some more information here, yada yada yada
Even some more information here, yada yada yada
>HEADER_Text2
Information here, yada yada yada
Some more information here, yada yada yada
Even some more information here, yada yada yada
>HEADER_Text3
Information here, yada yada yada
Some more information here, yada yada yada
Even some more information here, yada yada yada"""
# Remove every newline that is not immediately followed by a '>' header
cleartext = re.sub(r"\n(?!>)", "", sampleText)
print(cleartext)
>HEADER_Text1Information here, yada yada yadaSome more information here, yada yada yadaEven some more information here, yada yada yada
>HEADER_Text2Information here, yada yada yadaSome more information here, yada yada yadaEven some more information here, yada yada yada
>HEADER_Text3Information here, yada yada yadaSome more information here, yada yada yadaEven some more information here, yada yada yada
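For comparison, here is a small sketch of the same substitution with a space as the replacement, so that words from adjacent lines do not run together. It also checks that the capture-group pattern "\n([^>])" suggested earlier gives the same result, at least on input without a trailing newline. The two-record sample and the variable names are made up for illustration:

```python
import re

# Shortened sample input; an assumption for the sake of the example.
text = (
    ">HEADER_Text1\n"
    "Information here, yada yada yada\n"
    "Some more information here, yada yada yada\n"
    ">HEADER_Text2\n"
    "Information here, yada yada yada"
)

# Lookahead: replace each newline that is not followed by '>' with a space.
with_lookahead = re.sub(r"\n(?!>)", " ", text)

# Capture group: consume the following character and put it back after the space.
with_group = re.sub(r"\n([^>])", r" \1", text)

print(with_lookahead)
```

Note that both patterns also join each header to the first line that follows it. The two differ only at the very end of the input: the lookahead also replaces a final newline, while the capture group needs a character after it.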
As noted in the comments, your best bet is to use an existing FASTA parser. Why not?
Here's how I would join lines based on the leading greater-than:
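def joinup(f):
    buf = []
    for line in f:
        if line.startswith('>'):
            # A new header: flush the lines collected for the previous record
            if buf:
                yield " ".join(buf)
            yield line.rstrip()
            buf = []
        else:
            buf.append(line.rstrip())
    # Flush the last record, if there is one
    if buf:
        yield " ".join(buf)

for joined_line in joinup(open("...")):
    print(joined_line)  # blah blah...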
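If you prefer to lean on the standard library, the same grouping can be sketched with `itertools.groupby`. This is an alternative to the generator above rather than a replacement for it. The name `join_records` and the in-memory sample are hypothetical:

```python
import io
from itertools import groupby


def join_records(lines):
    """Yield each header line, then the lines that follow it joined by spaces."""
    stripped = (line.rstrip("\n") for line in lines)
    # groupby splits the stream into alternating runs of header / non-header lines
    for is_header, group in groupby(stripped, key=lambda s: s.startswith(">")):
        if is_header:
            # Pass headers through unchanged, one per line
            yield from group
        else:
            yield " ".join(group)


# io.StringIO stands in for a real file; the sample content is an assumption.
sample = io.StringIO(
    ">HEADER_Text1\n"
    "Information here, yada yada yada\n"
    "Some more information here, yada yada yada\n"
    ">HEADER_Text2\n"
    "Information here, yada yada yada\n"
)
result = list(join_records(sample))
print("\n".join(result))
```

Because `groupby` works on any iterable of lines, the function accepts an open file object as well as a list of strings.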