问题
im trying to get the xml presented here http://www.ncbi.nlm.nih.gov/sra/ERX086768?report=FullXml but its a bit tricky cause they dont give any suport for it. The purpose is to get the xml to php in order to go trought the xml.
can someone give a hint?
回答1:
It's not really true that XML presented via HTML therein wouldn't be XML as well.
What you're looking for is something called textContent in DOMDocument. That will give you only the text from that HMTL. Like it is displayed "as text" in the browser.
So all you need to do is to load the HTML document into a DOMDocument. Because it contains errors the internal error are used:
$url = 'http://www.ncbi.nlm.nih.gov/sra/ERX086768?report=FullXml';
$doc = new DOMDocument();
libxml_use_internal_errors(TRUE);
$doc->loadHTMLFile($url);
libxml_use_internal_errors(FALSE);
The next part implies specific knowledge about the page being scraped. In your case the XML is the said text-content of all div-tags with class attribute "xml-tag" *followed* after the tag with the id "ResultView".
These tags can be easily fetched with an xpath query, then their text-content is stored into an array:
$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//*[@id="ResultView"]/following-sibling::div[@class="xml-tag"]');
$buffer = array();
foreach ($nodes as $node) {
$buffer[] = $node->textContent;
}
So everything left now is to create a new DOMDocument
and load that XML buffer into it, doing some nice formattings and the output:
$new = new DOMDocument();
$new->preserveWhiteSpace = FALSE;
$new->formatOutput = TRUE;
$new->loadXML(implode('', $buffer));
$new->save('php://output');
These roughly 20 lines of code produce the following output then:
<?xml version="1.0"?>
<EXPERIMENT_PACKAGE>
<EXPERIMENT alias="SC_EXP_7229_8#56" center_name="SC" accession="ERX086768">
<IDENTIFIERS>
<PRIMARY_ID>ERX086768</PRIMARY_ID>
<SUBMITTER_ID namespace="SC">SC_EXP_7229_8#56</SUBMITTER_ID>
</IDENTIFIERS>
<TITLE/>
<STUDY_REF accession="ERP000913" refname="Genome_diversity_in_Streptococcus_dysgalactiae_subspecies_equisimilis-sc-2011-09-22T08:43:17Z-1977" refcenter="SC">
<IDENTIFIERS>
<PRIMARY_ID>ERP000913</PRIMARY_ID>
<SUBMITTER_ID namespace="SC">Genome_diversity_in_Streptococcus_dysgalactiae_subspecies_equisimilis-sc-2011-09-22T08:43:17Z-1977</SUBMITTER_ID>
</IDENTIFIERS>
</STUDY_REF>
<DESIGN>
<DESIGN_DESCRIPTION>Standard</DESIGN_DESCRIPTION>
<SAMPLE_DESCRIPTOR accession="ERS074283" refname="MR223754-sc-2011-11-18T11:31:44Z-1306470" refcenter="SC">
<IDENTIFIERS>
<PRIMARY_ID>ERS074283</PRIMARY_ID>
<SUBMITTER_ID namespace="SC">MR223754-sc-2011-11-18T11:31:44Z-1306470</SUBMITTER_ID>
</IDENTIFIERS>
</SAMPLE_DESCRIPTOR>
<LIBRARY_DESCRIPTOR>
<LIBRARY_NAME>4008297</LIBRARY_NAME>
<LIBRARY_STRATEGY>WGS</LIBRARY_STRATEGY>
<LIBRARY_SOURCE>GENOMIC</LIBRARY_SOURCE>
<LIBRARY_SELECTION>RANDOM</LIBRARY_SELECTION>
<LIBRARY_LAYOUT>
<PAIRED NOMINAL_LENGTH="250"/>
</LIBRARY_LAYOUT>
</LIBRARY_DESCRIPTOR>
<SPOT_DESCRIPTOR>
<SPOT_DECODE_SPEC>
<READ_SPEC>
<READ_INDEX>0</READ_INDEX>
<READ_CLASS>Application Read</READ_CLASS>
<READ_TYPE>Forward</READ_TYPE>
<BASE_COORD>1</BASE_COORD>
</READ_SPEC>
<READ_SPEC>
<READ_INDEX>1</READ_INDEX>
<READ_CLASS>Application Read</READ_CLASS>
<READ_TYPE>Reverse</READ_TYPE>
<RELATIVE_ORDER follows_read_index="0"/>
</READ_SPEC>
</SPOT_DECODE_SPEC>
</SPOT_DESCRIPTOR>
</DESIGN>
<PLATFORM>
<ILLUMINA>
<INSTRUMENT_MODEL>Illumina HiSeq 2000</INSTRUMENT_MODEL>
</ILLUMINA>
</PLATFORM>
<PROCESSING/>
</EXPERIMENT>
<SUBMISSION accession="ERA119046" center_name="SC" submission_date="2012-04-17T09:29:50Z" alias="ERP000913-sc-20120417-2" lab_name="">
<IDENTIFIERS>
<PRIMARY_ID>ERA119046</PRIMARY_ID>
<SUBMITTER_ID namespace="SC">ERP000913-sc-20120417-2</SUBMITTER_ID>
</IDENTIFIERS>
</SUBMISSION>
<STUDY alias="Genome_diversity_in_Streptococcus_dysgalactiae_subspecies_equisimilis-sc-2011-09-22T08:43:17Z-1977" center_name="SC" accession="ERP000913">
<IDENTIFIERS>
<PRIMARY_ID>ERP000913</PRIMARY_ID>
<SUBMITTER_ID namespace="SC">Genome_diversity_in_Streptococcus_dysgalactiae_subspecies_equisimilis-sc-2011-09-22T08:43:17Z-1977</SUBMITTER_ID>
</IDENTIFIERS>
<DESCRIPTOR>
<STUDY_TITLE>Genome_diversity_in_Streptococcus_dysgalactiae_subspecies_equisimilis</STUDY_TITLE>
<STUDY_TYPE existing_study_type="Whole Genome Sequencing"/>
<STUDY_ABSTRACT>http://www.sanger.ac.uk/resources/downloads/bacteria/</STUDY_ABSTRACT>
<CENTER_PROJECT_NAME>Genome_diversity_in_Streptococcus_dysgalactiae_subspecies_equisimilis</CENTER_PROJECT_NAME>
<STUDY_DESCRIPTION>http://www.sanger.ac.uk/resources/downloads/bacteria/
This data is part of a pre-publication release. For information on the proper use of pre-publication data shared by the Wellcome Trust Sanger Institute (including details of any publication moratoria), please see http://www.sanger.ac.uk/datasharing/</STUDY_DESCRIPTION>
</DESCRIPTOR>
</STUDY>
<SAMPLE alias="MR223754-sc-2011-11-18T11:31:44Z-1306470" center_name="SC" accession="ERS074283">
<IDENTIFIERS>
<PRIMARY_ID>ERS074283</PRIMARY_ID>
<SUBMITTER_ID namespace="SC">MR223754-sc-2011-11-18T11:31:44Z-1306470</SUBMITTER_ID>
</IDENTIFIERS>
<SAMPLE_NAME>
<COMMON_NAME>Streptococcus dysgalactiae subspecies equisimilis</COMMON_NAME>
<TAXON_ID>119602</TAXON_ID>
<SCIENTIFIC_NAME>Streptococcus dysgalactiae subsp. equisimilis</SCIENTIFIC_NAME>
</SAMPLE_NAME>
<SAMPLE_LINKS>
<SAMPLE_LINK>
<ENTREZ_LINK>
<DB>biosample</DB>
<ID>859730</ID>
</ENTREZ_LINK>
</SAMPLE_LINK>
</SAMPLE_LINKS>
<SAMPLE_ATTRIBUTES>
<SAMPLE_ATTRIBUTE>
<TAG>Strain</TAG>
<VALUE>MR223754</VALUE>
</SAMPLE_ATTRIBUTE>
<SAMPLE_ATTRIBUTE>
<TAG>Sample Description</TAG>
<VALUE/>
</SAMPLE_ATTRIBUTE>
<SAMPLE_ATTRIBUTE>
<TAG>ArrayExpress-StrainOrLine</TAG>
<VALUE>MR223754</VALUE>
</SAMPLE_ATTRIBUTE>
<SAMPLE_ATTRIBUTE>
<TAG>ArrayExpress-Sex</TAG>
<VALUE>not applicable</VALUE>
</SAMPLE_ATTRIBUTE>
<SAMPLE_ATTRIBUTE>
<TAG>ArrayExpress-Species</TAG>
<VALUE>Streptococcus dysgalactiae subspecies equisimilis</VALUE>
</SAMPLE_ATTRIBUTE>
</SAMPLE_ATTRIBUTES>
</SAMPLE>
<RUN_SET>
<RUN alias="SC_RUN_7229_8#56" center_name="SC" accession="ERR109334" total_spots="2708543" total_bases="406281450" size="334475592" load_done="true" published="2012-04-27 20:11:35" is_public="true" cluster_name="public" static_data_available="1">
<IDENTIFIERS>
<PRIMARY_ID>ERR109334</PRIMARY_ID>
<SUBMITTER_ID namespace="SC">SC_RUN_7229_8#56</SUBMITTER_ID>
</IDENTIFIERS>
<EXPERIMENT_REF refname="SC_EXP_7229_8#56" refcenter="SC" accession="ERX086768">
<IDENTIFIERS>
<PRIMARY_ID>ERX086768</PRIMARY_ID>
<SUBMITTER_ID namespace="SC">SC_EXP_7229_8#56</SUBMITTER_ID>
</IDENTIFIERS>
</EXPERIMENT_REF>
<Pool>
<Member member_name="" accession="ERS074283" sample_name="MR223754-sc-2011-11-18T11:31:44Z-1306470" spots="2708543" bases="406281450"/>
</Pool>
</RUN>
</RUN_SET>
</EXPERIMENT_PACKAGE>
So don't re-invent the wheel, just learn about the existing tools. It's sometimes more easy than it looks like on first sight.
回答2:
http://php.net/manual/en/class.simplexmlelement.php
It will give you an easy interface to use the xml as an object. You might set some attributes in order to parse cdata values and attributes I suppose. To get the xml from a web server use something like curl or file_get_contents. But curl is recommended.
回答3:
clicking on send data>get takes you to another page. Options to download in different formats. This url: http://trace.ncbi.nlm.nih.gov/Traces/sra/?cmd=dload&run_list=ERR109334&format=fasta appears to provide the data in gzip format. Perhaps you can use a GET
on this source instead of trying to parse the XML out of the HTML?
回答4:
You would have to make a list of all the valid HTMl tags and remove them from the webpage. For example:
$taglist = ['div', 'b', 'input']; // List the HTML tags here.
$xml= (read in the webpage here);
foreach ($taglist as $tag) {
$regex = '<' . $tag . '(?: [a-z]+(?:=.+))*?>';
$xml = preg_replace($regex, '', $xml);
// Repeat for the closing tag
$regex = '</' . $tag . '(?: [a-z]+(?:=.+))*?>';
$xml = preg_replace($regex, '', $xml);
}
After that finishes, $xml will contain the XML as a string, and PHP should be able to handle it.
回答5:
this class XmlRead
can do it. i put curl class for it too
curl:
function HeaderProc($response,$Run="",$String=1/*[Is 1 IF Use for String Mode ]*/){
if($String==1){
$response=explode("\r\n",$response);
}
$PartHeader=0;
$out[$PartHeader]=array();
while(list($key,$val)=each($response)){
$name='';
$value='';
$flag=false;
for($i=0;$i<strlen($val);$i++){
if($val[$i]==":"){
$flag=true;
for($j=$i+1;$j<strlen($val);$j++){
if($val[$i]=="\r" and $val[$i+1]=="\n"){
break;
}
$value.=$val[$j];
}
break;
}
$name.=$val[$i];
}
if($flag){
if($name=='' and $value==''){
$PartHeader++;
}else{
if(isset($out[$PartHeader][$name])){
if(is_array($out[$PartHeader][$name])){
$out[$PartHeader][$name][]=$value;
}else{
$T=$out[$PartHeader][$name];
$out[$PartHeader][$name]=array();
$out[$PartHeader][$name][0]=$T;
$out[$PartHeader][$name][1]=$value;
}
}else{
$out[$PartHeader][$name]=$value;
}
}
}else{
if($name==''){
$PartHeader++;
}else{
if(isset($out[$PartHeader][$name])){
if(is_array($out[$PartHeader][$name])){
$out[$PartHeader][$name][]=$value;
}else{
$T=$out[$PartHeader][$name];
$out[$PartHeader][$name]=array();
$out[$PartHeader][$name][0]=$T;
$out[$PartHeader][$name][1]=$name;
}
}else{
$out[$PartHeader][$name]=$name;
}
}
}
if($Run!=""){
$Run($name,$value);
}
}
return $out;
}
class cURL {
var $headers;
var $user_agent;
var $compression;
var $cookie_file;
var $proxy;
var $Cookie;
function CookieAnalysis($Cookie){//convert str cookie to array cookie
//echo $Cookie;
$this->Cookie=array();
preg_match("~(.*?)=(.*?);~si",' '.$Cookie.'; ',$M);
$this->Cookie[trim($M[1])]=trim($M[2]);
return $this->Cookie;
}
function cURL($cookies=false,$cookie='cookies.txt',$compression='gzip',$proxy='') {
$this->headers[] = 'Accept:text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8';
$this->headers[] = 'Accept-Charset:ISO-8859-1,utf-8;q=0.7,*;q=0.3';
$this->headers[] = 'Accept-Encoding:gzip,deflate,sdch';
$this->headers[] = 'Accept-Language:en-US,en;q=0.8';
$this->headers[] = 'Cache-Control:max-age=0';
$this->headers[] = 'Connection:keep-alive';
$this->user_agent = 'User-Agent:Mozilla/5.0 (SepidarSoft [Organic Search Engine Crawler] Linux Edition) AppleWebKit/536.5 (KHTML, like Gecko) SepidarBrowser/1.0.100.52 Safari/536.5';
$this->compression=$compression;
$this->proxy=$proxy;
$this->cookies=$cookies;
if ($this->cookies == TRUE) $this->cookie($cookie);
}
function cookie($cookie_file) {
if (file_exists($cookie_file)) {
$this->cookie_file=$cookie_file;
} else {
fopen($cookie_file,'w') or $this->error('The cookie file could not be opened. Make sure this directory has the correct permissions');
$this->cookie_file=$cookie_file;
@fclose($this->cookie_file);
}
}
function GET($url) {
$process = curl_init($url);
curl_setopt($process, CURLOPT_HTTPHEADER, $this->headers);
curl_setopt($process, CURLOPT_HEADER, 1);
curl_setopt($process, CURLOPT_USERAGENT, $this->user_agent);
if ($this->cookies == TRUE) curl_setopt($process, CURLOPT_COOKIEFILE, $this->cookie_file);
if ($this->cookies == TRUE) curl_setopt($process, CURLOPT_COOKIEJAR, $this->cookie_file);
curl_setopt($process,CURLOPT_ENCODING , $this->compression);
curl_setopt($process, CURLOPT_TIMEOUT, 30);
if ($this->proxy) curl_setopt($process, CURLOPT_PROXY, $this->proxy);
curl_setopt($process, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($process, CURLOPT_FOLLOWLOCATION, 1);
$response = curl_exec($process);
$header_size = curl_getinfo($process,CURLINFO_HEADER_SIZE);
$result['Header'] = HeaderProc(substr($response, 0, $header_size),'',1);
foreach($result['Header'] as $HeaderK=>$HeaderP){
if(!is_array($HeaderP['Set-Cookie']))continue;
foreach($HeaderP['Set-Cookie'] as $key=>$val){
$result['Header'][$HeaderK]['Set-Cookie'][$key]=$this->CookieAnalysis($val);
}
}
$result['Body'] = substr( $response, $header_size );
$result['HTTP_State'] = curl_getinfo($process,CURLINFO_HTTP_CODE);
$result['URL'] = curl_getinfo($process,CURLINFO_EFFECTIVE_URL);
curl_close($process);
return $result;
}
function POST($url,$data) {
$process = curl_init($url);
curl_setopt($process, CURLOPT_HTTPHEADER, $this->headers);
curl_setopt($process, CURLOPT_HEADER, 1);
curl_setopt($process, CURLOPT_USERAGENT, $this->user_agent);
if ($this->cookies == TRUE) curl_setopt($process, CURLOPT_COOKIEFILE, $this->cookie_file);
if ($this->cookies == TRUE) curl_setopt($process, CURLOPT_COOKIEJAR, $this->cookie_file);
curl_setopt($process, CURLOPT_ENCODING , $this->compression);
curl_setopt($process, CURLOPT_TIMEOUT, 30);
if ($this->proxy) curl_setopt($process, CURLOPT_PROXY, $this->proxy);
curl_setopt($process, CURLOPT_POSTFIELDS, $data);
curl_setopt($process, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($process, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($process, CURLOPT_POST, 1);
$response = curl_exec($process);
$header_size = curl_getinfo($process,CURLINFO_HEADER_SIZE);
$result['Header'] = HeaderProc(substr($response, 0, $header_size),'',1);
foreach($result['Header'] as $HeaderK=>$HeaderP){
if(!is_array($HeaderP['Set-Cookie']))continue;
foreach($HeaderP['Set-Cookie'] as $key=>$val){
$result['Header'][$HeaderK]['Set-Cookie'][$key]=$this->CookieAnalysis($val);
}
}
$result['Body'] = substr( $response, $header_size );
$result['HTTP_State'] = curl_getinfo($process,CURLINFO_HTTP_CODE);
$result['URL'] = curl_getinfo($process,CURLINFO_EFFECTIVE_URL);
curl_close($process);
return $result;
}
function error($error) {
echo "<center><div style='width:500px;border: 3px solid #FFEEFF; padding: 3px; background-color: #FFDDFF;font-family: verdana; font-size: 10px'><b>cURL Error</b><br>$error</div></center>";
die;
}
}
XmlRead
class XmlRead{
static function Clean($html){
$html=preg_replace_callback("~<script(.*?)>(.*?)</script>~si",function($m){
//print_r($m);
// $m[2]=preg_replace("/\/\*(.*?)\*\/|[\t\r\n]/s"," ", " ".$m[2]." ");
$m[2]=preg_replace("~//(.*?)\n~si"," ", " ".$m[2]." ");
//echo $m[2];
return "<script ".$m[1].">".$m[2]."</script>";
}, $html);
$search = array(
"/ +/" => " ",
"/<!–\{(.*?)\}–>|<!–(.*?)–>|[\t\r\n]|<!–|–>|\/\/ <!–|\/\/ –>|<!\[CDATA\[|\/\/ \]\]>|\]\]>|\/\/\]\]>|\/\/<!\[CDATA\[/" => "");
//$html = preg_replace(array_keys($search), array_values($search), $html);
$search = array(
"/\/\*(.*?)\*\/|[\t\r\n]/s" => "",
"/ +\{ +|\{ +| +\{/" => "{",
"/ +\} +|\} +| +\}/" => "}",
"/ +: +|: +| +:/" => ":",
"/ +; +|; +| +;/" => ";",
"/ +, +|, +| +,/" => ","
);
$html = preg_replace(array_keys($search), array_values($search), $html);
preg_match_all('!(<(?:code|pre|script).*>[^<]+</(?:code|pre|script)>)!',$html,$pre);
$html = preg_replace('!<(?:code|pre).*>[^<]+</(?:code|pre)>!', '#pre#', $html);
$html = preg_replace('#<!–[^\[].+–>#', '', $html);
$html = preg_replace('/[\r\n\t]+/', ' ', $html);
$html = preg_replace('/>[\s]+</', '><', $html);
$html = preg_replace('/\s+/', ' ', $html);
if (!empty($pre[0])) {
foreach ($pre[0] as $tag) {
$html = preg_replace('!#pre#!', $tag, $html,1);
}
}
return($html);
}
function loadNprepare($content,$encod='') {
$content=self::Clean($content);
//$content=html_entity_decode(html_entity_decode($content));
// $content=htmlspecialchars_decode($content,ENT_HTML5);
$this->DataPage='';
preg_match('~<body(.*?)>(.*?)</body>~si',$content,$M);
$this->DataPage=$M[2];
$HTML=$this->DataPage;
$HTML="<!doctype html><html><head><meta charset=\"utf-8\"><title>Untitled Document</title></head><body>".$HTML."</body></html>";
$dom= new DOMDocument;
$HTML = str_replace("&", "&", $HTML); // disguise &s going IN to loadXML()
// $dom->substituteEntities = true; // collapse &s going OUT to transformToXML()
$dom->recover = TRUE;
@$dom->loadHTML('<?xml encoding="UTF-8">' .$HTML);
// dirty fix
foreach ($dom->childNodes as $item)
if ($item->nodeType == XML_PI_NODE)
$dom->removeChild($item); // remove hack
$dom->encoding = 'UTF-8'; // insert proper
return $dom;
}
function GetBYClass($Doc,$ClassName){
$finder = new DomXPath($Doc);
return($finder->query("//*[contains(@class, '$ClassName')]"));
}
function extractText($node) {
if($node==NULL)return false;
if (XML_TEXT_NODE === $node->nodeType || XML_CDATA_SECTION_NODE === $node->nodeType) {
return $node->nodeValue;
} else if (XML_ELEMENT_NODE === $node->nodeType || XML_DOCUMENT_NODE === $node->nodeType || XML_DOCUMENT_FRAG_NODE === $node->nodeType) {
if ('script' === $node->nodeName) return '';
$text = '';
foreach($node->childNodes as $childNode) {
$text .= $this->extractText($childNode);
}
return $text;
}
}
function DOMRemove(DOMNode $from) {
$from->parentNode->removeChild($from);
}
}
call class and conf for your page
$cc = new cURL(); //
$XmlRead=new XmlRead();
$Data=$cc->get('http://www.ncbi.nlm.nih.gov/sra/ERX086768?report=FullXml');
//get page
$doc=$XmlRead->loadNprepare($Data['Body']);//load as html
//remove two part of page related to your page .
$productspec=$XmlRead->DOMRemove($XmlRead->GetBYClass($doc,'title')->item(0));
$productspec=$XmlRead->DOMRemove($XmlRead->GetBYClass($doc,'aux')->item(0));
//select xml part
$productspec=$XmlRead->GetBYClass($doc,'rprt');
foreach($productspec as $data)
{
$content=html_entity_decode(html_entity_decode($XmlRead->extractText($data)));//decode as entity html
print_r($content);
}
output:
<EXPERIMENT_PACKAGE><EXPERIMENT alias="SC_EXP_7229_8#56"center_name="SC"accession="ERX086768"><IDENTIFIERS><PRIMARY_ID>ERX086768</PRIMARY_ID><SUBMITTER_ID namespace="SC">SC_EXP_7229_8#56</SUBMITTER_ID></IDENTIFIERS><TITLE></TITLE><STUDY_REF accession="ERP000913"refname="Genome_diversity_in_Streptococcus_dysgalactiae_subspecies_equisimilis-sc-2011-09-22T08:43:17Z-1977"refcenter="SC"><IDENTIFIERS><PRIMARY_ID>ERP000913</PRIMARY_ID><SUBMITTER_ID namespace="SC">Genome_diversity_in_Streptococcus_dysgalactiae_subspecies_equisimilis-sc-2011-09-22T08:43:17Z-1977</SUBMITTER_ID></IDENTIFIERS></STUDY_REF><DESIGN><DESIGN_DESCRIPTION>Standard</DESIGN_DESCRIPTION><SAMPLE_DESCRIPTOR accession="ERS074283"refname="MR223754-sc-2011-11-18T11:31:44Z-1306470"refcenter="SC"><IDENTIFIERS><PRIMARY_ID>ERS074283</PRIMARY_ID><SUBMITTER_ID namespace="SC">MR223754-sc-2011-11-18T11:31:44Z-1306470</SUBMITTER_ID></IDENTIFIERS></SAMPLE_DESCRIPTOR><LIBRARY_DESCRIPTOR><LIBRARY_NAME>4008297</LIBRARY_NAME><LIBRARY_STRATEGY>WGS</LIBRARY_STRATEGY><LIBRARY_SOURCE>GENOMIC</LIBRARY_SOURCE><LIBRARY_SELECTION>RANDOM</LIBRARY_SELECTION><LIBRARY_LAYOUT><PAIRED NOMINAL_LENGTH="250"></PAIRED></LIBRARY_LAYOUT></LIBRARY_DESCRIPTOR><SPOT_DESCRIPTOR><SPOT_DECODE_SPEC><READ_SPEC><READ_INDEX>0</READ_INDEX><READ_CLASS>Application Read</READ_CLASS><READ_TYPE>Forward</READ_TYPE><BASE_COORD>1</BASE_COORD></READ_SPEC><READ_SPEC><READ_INDEX>1</READ_INDEX><READ_CLASS>Application Read</READ_CLASS><READ_TYPE>Reverse</READ_TYPE><RELATIVE_ORDER follows_read_index="0"></RELATIVE_ORDER></READ_SPEC></SPOT_DECODE_SPEC></SPOT_DESCRIPTOR></DESIGN><PLATFORM><ILLUMINA><INSTRUMENT_MODEL>Illumina HiSeq 2000</INSTRUMENT_MODEL></ILLUMINA></PLATFORM><PROCESSING></PROCESSING></EXPERIMENT><SUBMISSION accession="ERA119046"center_name="SC"submission_date="2012-04-17T09:29:50Z"alias="ERP000913-sc-20120417-2"lab_name=""><IDENTIFIERS><PRIMARY_ID>ERA119046</PRIMARY_ID><SUBMITTER_ID namespace="SC">ERP000913-sc-20120417-2</SUBMITTER_ID></IDENTIFIERS></SUBMISSION><STUDY alias="Genome_diversity_in_Streptococcus_dysgalactiae_subspecies_equisimilis-sc-2011-09-22T08:43:17Z-1977"center_name="SC"accession="ERP000913"><IDENTIFIERS><PRIMARY_ID>ERP000913</PRIMARY_ID><SUBMITTER_ID namespace="SC">Genome_diversity_in_Streptococcus_dysgalactiae_subspecies_equisimilis-sc-2011-09-22T08:43:17Z-1977</SUBMITTER_ID></IDENTIFIERS><DESCRIPTOR><STUDY_TITLE>Genome_diversity_in_Streptococcus_dysgalactiae_subspecies_equisimilis</STUDY_TITLE><STUDY_TYPE existing_study_type="Whole Genome Sequencing"></STUDY_TYPE><STUDY_ABSTRACT>http://www.sanger.ac.uk/resources/downloads/bacteria/</STUDY_ABSTRACT><CENTER_PROJECT_NAME>Genome_diversity_in_Streptococcus_dysgalactiae_subspecies_equisimilis</CENTER_PROJECT_NAME><STUDY_DESCRIPTION>http://www.sanger.ac.uk/resources/downloads/bacteria/This data is part of a pre-publication release. For information on the proper use of pre-publication data shared by the Wellcome Trust Sanger Institute (including details of any publication moratoria),please see http://www.sanger.ac.uk/datasharing/</STUDY_DESCRIPTION></DESCRIPTOR></STUDY><SAMPLE alias="MR223754-sc-2011-11-18T11:31:44Z-1306470"center_name="SC"accession="ERS074283"><IDENTIFIERS><PRIMARY_ID>ERS074283</PRIMARY_ID><SUBMITTER_ID namespace="SC">MR223754-sc-2011-11-18T11:31:44Z-1306470</SUBMITTER_ID></IDENTIFIERS><SAMPLE_NAME><COMMON_NAME>Streptococcus dysgalactiae subspecies equisimilis</COMMON_NAME><TAXON_ID>119602</TAXON_ID><SCIENTIFIC_NAME>Streptococcus dysgalactiae subsp. equisimilis</SCIENTIFIC_NAME></SAMPLE_NAME><SAMPLE_LINKS><SAMPLE_LINK><ENTREZ_LINK><DB>biosample</DB><ID>859730</ID></ENTREZ_LINK></SAMPLE_LINK></SAMPLE_LINKS><SAMPLE_ATTRIBUTES><SAMPLE_ATTRIBUTE><TAG>Strain</TAG><VALUE>MR223754</VALUE></SAMPLE_ATTRIBUTE><SAMPLE_ATTRIBUTE><TAG>Sample Description</TAG><VALUE></VALUE></SAMPLE_ATTRIBUTE><SAMPLE_ATTRIBUTE><TAG>ArrayExpress-StrainOrLine</TAG><VALUE>MR223754</VALUE></SAMPLE_ATTRIBUTE><SAMPLE_ATTRIBUTE><TAG>ArrayExpress-Sex</TAG><VALUE>not applicable</VALUE></SAMPLE_ATTRIBUTE><SAMPLE_ATTRIBUTE><TAG>ArrayExpress-Species</TAG><VALUE>Streptococcus dysgalactiae subspecies equisimilis</VALUE></SAMPLE_ATTRIBUTE></SAMPLE_ATTRIBUTES></SAMPLE><RUN_SET><RUN alias="SC_RUN_7229_8#56"center_name="SC"accession="ERR109334"total_spots="2708543"total_bases="406281450"size="334475592"load_done="true"published="2012-04-27 20:11:35"is_public="true"cluster_name="public"static_data_available="1"><IDENTIFIERS><PRIMARY_ID>ERR109334</PRIMARY_ID><SUBMITTER_ID namespace="SC">SC_RUN_7229_8#56</SUBMITTER_ID></IDENTIFIERS><EXPERIMENT_REF refname="SC_EXP_7229_8#56"refcenter="SC"accession="ERX086768"><IDENTIFIERS><PRIMARY_ID>ERX086768</PRIMARY_ID><SUBMITTER_ID namespace="SC">SC_EXP_7229_8#56</SUBMITTER_ID></IDENTIFIERS></EXPERIMENT_REF><Pool><Member member_name=""accession="ERS074283"sample_name="MR223754-sc-2011-11-18T11:31:44Z-1306470"spots="2708543"bases="406281450"></Member></Pool></RUN></RUN_SET></EXPERIMENT_PACKAGE>
来源:https://stackoverflow.com/questions/15855188/extract-xml-from-xml-embebed-in-html