I'm writing a VB.NET application to parse a large XML file containing a Japanese dictionary. I'm completely new to XML parsing and don't really know what I'm doing. The whole dictionary sits between the two XML tags <jmdict> and </jmdict>. The next level down is <entry>: each of the 1 million entries contains all the information for one word, including its form, pronunciation, meaning and so on.
A typical entry might look like this:
<gloss>fine arts</gloss>
<gloss xml:lang="dut">kunst</gloss>
<gloss xml:lang="dut">schone kunsten</gloss>
<gloss xml:lang="fre">art</gloss>
<gloss xml:lang="fre">beaux-arts</gloss>
<gloss xml:lang="ger">Kunst</gloss>
<gloss xml:lang="ger">die schönen Künste</gloss>
<gloss xml:lang="ger">bildende Kunst</gloss>
<gloss xml:lang="ger">Produktionsdesign</gloss>
<gloss xml:lang="ger">Szenographie</gloss>
<gloss xml:lang="hun">művészet</gloss>
<gloss xml:lang="hun">művészeti</gloss>
<gloss xml:lang="hun">művészi</gloss>
<gloss xml:lang="hun">rajzóra</gloss>
<gloss xml:lang="hun">szépművészet</gloss>
<gloss xml:lang="rus">изящные искусства; искусство</gloss>
<gloss xml:lang="rus">{~{的}} художественный, артистический</gloss>
<gloss xml:lang="slv">umetnost</gloss>
<gloss xml:lang="slv">likovna umetnost</gloss>
<gloss xml:lang="spa">bellas artes</gloss>
I have a class, Entry, which is used to store all of the information contained in an entry like the one above. I know what all the tags mean and have no trouble interpreting the data semantically; I'm just not sure what tools I need to actually parse all of this information.
For example, how should I extract the contents of the <ent_seq> tag at the beginning? And is the method used to extract information from an XML tag the same even if it's contained within a parent tag, as with the <keb> and <ke_pri> tags, which are contained within the <k_ele> tags? Or should I use a different method?
I know this reads like homework help - I'm not asking for someone to provide the complete solution and build the parser. I just don't know where to start and what tools to use. I'd really appreciate some guidance on what methods I need to start parsing the XML file, and then I'll work on building the solution myself once I know what I'm doing.
So I've come across this code from this website, which uses XmlReader to go through the document one node at a time:
Dim readXML As XmlReader = XmlReader.Create(New StringReader(xmlNode))
While readXML.Read()
    Select Case readXML.NodeType
        Case XmlNodeType.Element
            ListBox1.Items.Add("<" + readXML.Name & ">")
            Exit Select
        Case XmlNodeType.Text
            Exit Select
        Case XmlNodeType.EndElement
            Exit Select
    End Select
End While
But I get this error on the first line:
'XmlNode' is a class type and cannot be used as an expression
I'm not exactly sure what to do about this error - any ideas?
You can use these classes to deserialize your XML quickly:
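Imports System.IO
Imports System.Xml.Serialization

' The attributes map each member to the element name used in the file.
<XmlRoot("jmdict")>
Public Class jmdict
    ' <entry> elements sit directly under the root, with no wrapper element.
    <XmlElement("entry")>
    Public Property entries As List(Of entry)
End Class

Public Class entry
    Public Property ent_seq As Integer
    Public Property k_ele As k_ele
    Public Property r_ele As r_ele
    <XmlElement("sense")>
    Public Property senses As List(Of sense)
End Class

Public Class sense
    <XmlElement("pos")>
    Public Property posses As List(Of String)
    <XmlElement("gloss")>
    Public Property glosses As List(Of gloss)
End Class

Public Class k_ele
    Public Property keb As String
    <XmlElement("ke_pri")>
    Public Property ke_pris As List(Of String)
End Class

Public Class r_ele
    Public Property reb As String
    <XmlElement("re_pri")>
    Public Property re_pris As List(Of String)
End Class

Public Class gloss
    ' xml:lang belongs to the predefined XML namespace.
    <XmlAttribute("lang", Namespace:="http://www.w3.org/XML/1998/namespace")>
    Public Property lang As String
    ' The element's text content, e.g. "fine arts".
    <XmlText>
    Public Property Text As String
    Public Overrides Function ToString() As String
        Return Text
    End Function
End Class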
The code to deserialize is
Dim serializer As New XmlSerializer(GetType(jmdict))
Dim d As jmdict
Using sr As New StreamReader("filename.xml")
    d = CType(serializer.Deserialize(sr), jmdict)
End Using
Now you can iterate over each entry, and the entries' senses, and the senses' glosses
For Each e In d.entries
    Console.WriteLine($"seq: {e.ent_seq}")
    For Each s In e.senses
        For Each g In s.glosses
            Console.WriteLine($"Text: {g.Text}, Lang: {g.lang}")
        Next
    Next
Next
The reasons your code takes so long are
- You are parsing the XML as a string
- You are inserting lines into a ListBox one at a time as you parse them
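To illustrate the first point, here is a minimal sketch of streaming with XmlReader from a TextReader instead of holding the XML in a string variable. The module name XmlReaderSketch, the helper ReadEntrySequences and the two-entry sample document are made up for illustration; only the element names come from the question. It collects the text of each <ent_seq> and assumes there is one per entry.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.IO
Imports System.Xml

Module XmlReaderSketch
    ' Hypothetical helper: returns the text of every <ent_seq> element.
    ' Assumes one <ent_seq> per <entry>, never two directly adjacent.
    Function ReadEntrySequences(input As TextReader) As List(Of String)
        Dim result As New List(Of String)
        Using reader As XmlReader = XmlReader.Create(input)
            ' ReadToFollowing advances until the next element with this name.
            While reader.ReadToFollowing("ent_seq")
                result.Add(reader.ReadElementContentAsString())
            End While
        End Using
        Return result
    End Function

    Sub Main()
        ' Made-up sample; a real run would pass New StreamReader("filename.xml").
        Dim sample As String =
            "<jmdict><entry><ent_seq>1000</ent_seq></entry>" &
            "<entry><ent_seq>1001</ent_seq></entry></jmdict>"
        For Each seq In ReadEntrySequences(New StringReader(sample))
            Console.WriteLine(seq)
        Next
    End Sub
End Module
```

For the second point, one option is to collect the values first and then add them in one go, for example ListBox1.Items.AddRange(result.ToArray()) wrapped in ListBox1.BeginUpdate() and ListBox1.EndUpdate(), so the control is not redrawn for every item.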
What do you want to put in the ListBox? If you have deserialized as shown above, you can data-bind a specific list from the data, or the result of a query across multiple lists.
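As a rough sketch of the "queried result" idea, the LINQ query below flattens entries, senses and glosses into one list of texts for a single language. The types EntryItem, SenseItem and GlossItem are reduced stand-ins for the classes above, and GlossTextsFor is a hypothetical helper; they exist only so the block compiles on its own.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Linq

Module GlossQuerySketch
    ' Reduced stand-ins for the deserialized classes (illustration only).
    Public Class GlossItem
        Public Property Lang As String
        Public Property Text As String
    End Class

    Public Class SenseItem
        Public Property Glosses As New List(Of GlossItem)
    End Class

    Public Class EntryItem
        Public Property EntSeq As Integer
        Public Property Senses As New List(Of SenseItem)
    End Class

    ' Hypothetical helper: walks entries -> senses -> glosses and keeps one language.
    Function GlossTextsFor(entries As IEnumerable(Of EntryItem), lang As String) As List(Of String)
        Return (From e In entries
                From s In e.Senses
                From g In s.Glosses
                Where g.Lang = lang
                Select g.Text).ToList()
    End Function

    Sub Main()
        ' Made-up data using two glosses from the sample entry in the question.
        Dim sense As New SenseItem()
        sense.Glosses.Add(New GlossItem With {.Lang = "ger", .Text = "Kunst"})
        sense.Glosses.Add(New GlossItem With {.Lang = "fre", .Text = "art"})
        Dim entry As New EntryItem With {.EntSeq = 1}
        entry.Senses.Add(sense)

        For Each text In GlossTextsFor(New List(Of EntryItem) From {entry}, "ger")
            Console.WriteLine(text)
        Next
    End Sub
End Module
```

In a WinForms form, the same query shape applied to d.entries gives a list that can be assigned to ListBox1.DataSource in a single step, rather than adding items one by one.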