Question
This is my first question on Stack Overflow and so I want to apologise first if my question is not formatted correctly. I am not particularly experienced with coding, but am trying to solve a specific problem with my work.
I am trying to replace the headers of a large fasta file (used for aligning DNA sequences). I have a txt file containing the fasta alignment (alignment.txt), which has contents like this:
>418035201_b1_168_m12_gag__Assembly_8
ATGGGTGCGAGAGCGTCAGTATTAAGTGGGGGAAA......
>418035201_b1_168_m12_gag__Assembly_19
ATGGGTGCGAGAGCGTCAGTATTAAGTGGGGGAAA......
I also have a text file containing the desired names (newheaders.txt), which has contents like this:
>418035201_pM_s38_B168_m12_gag_c08_M13F_X00_consensus
>418035201_pM_s38_B168_m12_gag_c19_M13F_X00_consensus
....
I am trying to replace the headers (lines beginning '>') in the alignment.txt file with the new headers in the newheaders.txt file.
I have a python script with the following contents:
#!/usr/bin/env python
fasta= open('alignment.txt','r')
newnames= open('newheaders.txt','r')
newfasta= open('newfasta.txt', 'w')
for line in fasta:
    if line.startswith('>'):
        newname= newnames.readline()
        newfasta.write(newname)
    else:
        newfasta.write(line)
        print line
fasta.close()
newnames.close()
newfasta.close()
When I run this, I get the following output:
>418035201_pM_s38_B168_m12_gag_c08_M13F_X00_consensus
䄊䝔䝇䝔䝃䝁䝁䝃䍔䝁䅔呔䅁呇䝇䝇䅇䅁呁䅔䅇䝔䅃䝔䝇䅁䅁䅁呔....
>418035201_pM_s38_B168_m12_gag_c19_M13F_X00_consensus
䄊䝔䝇䝔䝃䝁䝁䝃䍔䝁䅔呔䅁呇䝇䝇䅇䅁呁䅔䅇䝔䅃䝔䝇䝁䅁䅁呔....
'line' is being changed from Roman characters to Chinese characters. It should NOT be in Chinese characters, and I can't work out for the life of me why this is happening!
When 'line' is printed to the console, it prints it correctly. I.e.
ATGGGTGCGAGAGCGTCAGTATTAAGTGGGGGAAAATTAGATGCGTGGGAGAA....
So I believe it must be something to do with how it is writing out.
If anybody would be able to help me with this or provide some insight I would greatly appreciate it, thank you.
[Edit: Now resolved. See below. Thanks everyone!]
Answer 1:
In Python 3, open() accepts an "encoding" parameter that overrides the default text encoding (in Python 2 the built-in open() does not take it, but io.open() does). Provided you know the correct encodings for your input and output files, you should be able to fix this by adding something like the following (substituting the encodings that apply in your case):
newnames= open('newheaders.txt','r', encoding='ascii')
newfasta= open('newfasta.txt', 'w', encoding='utf_8')
PS: Text file I/O is Unicode by default in Python 3, which is a change from Python 2.x, where open() reads and writes raw byte strings. That difference seems likely to be involved here.
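To make the suggestion above concrete, here is a minimal sketch of the whole header-replacement script with explicit encodings, assuming Python 3. The function name `replace_headers` and its default encodings are illustrative assumptions, not part of the original script; the encodings you pass must match how your files were actually saved.

```python
def replace_headers(fasta_path, names_path, out_path,
                    fasta_encoding="utf-8", names_encoding="utf-8",
                    out_encoding="utf-8"):
    """Copy a FASTA file, swapping each '>' header for the next new name.

    Hypothetical helper: the default encodings are assumptions and
    should be replaced with the real encodings of the input files.
    Returns the number of headers that were replaced.
    """
    replaced = 0
    with open(fasta_path, "r", encoding=fasta_encoding) as fasta, \
         open(names_path, "r", encoding=names_encoding) as names, \
         open(out_path, "w", encoding=out_encoding, newline="\n") as out:
        for line in fasta:
            if line.startswith(">"):
                new_name = names.readline()
                if not new_name:
                    # Fewer new headers than sequences: stop rather than
                    # silently writing a file with missing headers.
                    raise ValueError("ran out of new headers")
                # Normalise the line ending so the output uses LF throughout.
                out.write(new_name.rstrip("\r\n") + "\n")
                replaced += 1
            else:
                out.write(line)
    return replaced
```

With the file names from the question, a call might look like `replace_headers('alignment.txt', 'newheaders.txt', 'newfasta.txt', names_encoding='utf-16')`, where `'utf-16'` is only a guess at how the headers file was saved.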
Answer 2:
Thank you for your help, everybody. It's now resolved (Essentially I am an idiot)...
How I fixed it:
- Installed python3
- Re-saved both of the .txt files as Unicode UTF-8 with Unix(LF) line breaks.
- Changed "#!/usr/bin/env python" to "#!/usr/bin/env python3" at the beginning of the script.
- Ran python3 /Users/Sophie/Desktop/AttemptToRename/replacenames.py from the command line.
And it worked!
I'm not sure if all of these steps or only some of them were necessary, but it's now working as planned. Thanks again for all your help. Gonna go through and up-vote now! [Edit: apparently my up-votes don't show because I have a low reputation... :/]
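One plausible explanation for why re-saving the files helped, offered as an assumption since the original encodings were never confirmed in this thread: if one input was saved as UTF-16 and the other as plain ASCII, the output file contains a mix of both, and a viewer that decodes the whole file as UTF-16 pairs up ASCII letters into CJK characters. The sketch below (Python 3) reproduces that effect and adds a hypothetical helper, `sniff_bom`, for checking whether a file's first bytes are a Unicode byte-order mark.

```python
import codecs

def sniff_bom(data):
    """Return a codec name if the bytes start with a known BOM, else None."""
    # Hypothetical helper; only the leading bytes are inspected.
    # UTF-32 is checked first because its little-endian BOM begins with
    # the same two bytes as the UTF-16 little-endian BOM.
    if data.startswith(codecs.BOM_UTF32_LE) or data.startswith(codecs.BOM_UTF32_BE):
        return "utf-32"
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith(codecs.BOM_UTF16_LE) or data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16"
    return None

# ASCII bytes of the start of a sequence line, preceded by the newline
# byte that ended the header line before it.
ascii_bytes = b"\nATGGGTGCG"

# Reading those bytes two at a time as UTF-16 little-endian turns each
# pair of ASCII characters into a single CJK character.
garbled = ascii_bytes.decode("utf-16-le")
print(garbled)
```

To check a real file, something like `sniff_bom(open('newheaders.txt', 'rb').read(4))` would report whether it starts with a byte-order mark. A result of `None` does not prove the file is ASCII or UTF-8, only that no mark is present.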
Source: https://stackoverflow.com/questions/42807660/why-is-python-writing-out-in-chinese-characters