Parse large RDF in Python

I'd like to parse a very large (about 200MB) RDF file in python. Should I be using sax or some other library? I'd appreciate some very basic code that I can build on, say to r

6 Answers

迷失自我

2021-02-02 16:08
LightRdf is a very fast library for parsing RDF files. It can be installed via pip, and code examples can be found on the project page.

To parse triples from a gzipped RDF file, you can do something like this:
```
import gzip

import lightrdf

RDF_FILENAME = 'data.rdf.gz'

# stream triples straight from the gzipped file
with gzip.open(RDF_FILENAME, 'rb') as f:
    doc = lightrdf.RDFDocument(f, parser=lightrdf.xml.PatternParser)
    for (s, p, o) in doc.search_triples(None, None, None):
        print(s, p, o)
```

一生所求

2021-02-02 16:11
I second the suggestion that you try out rdflib. It's nice for quick prototyping, and the BerkeleyDB backend store scales pretty well into the millions of triples if you don't want to load the whole graph into memory.
```
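import rdflib

# "Sleepycat" is the store name in older rdflib releases; rdflib 6 and later
# register the same backend as "BerkeleyDB"
graph = rdflib.Graph("Sleepycat")
graph.open("store", create=True)
graph.parse("big.rdf")

# print out all the triples in the graph
for subject, predicate, obj in graph:
    print(subject, predicate, obj)

graph.close()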
```

醉酒成梦

2021-02-02 16:12
If you are looking for fast performance, I'd recommend using Raptor with the Redland Python bindings. Raptor is written in C and performs much better than RDFLib, and the Python bindings let you use it without dealing with C.

Another piece of advice for improving performance: forget about parsing RDF/XML and go with another flavor of RDF such as Turtle or N-Triples. Parsing N-Triples in particular is much faster than parsing RDF/XML, because the N-Triples syntax is simpler.

You can transform your RDF/XML into N-Triples using rapper, a tool that comes with Raptor:
```
rapper -i rdfxml -o ntriples YOUR_FILE.rdf > YOUR_FILE.ntriples
```
The ntriples file will contain triples like:
```
<s1> <p> <o> .
<s2> <p2> "literal" .
```
and parsers tend to handle this structure very efficiently. Memory-wise it is also more efficient than RDF/XML because, as you can see, this data structure is smaller.
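
To illustrate why this line-oriented structure is cheap to process, here is a minimal sketch that reads N-Triples one line at a time using only the standard library. The function name `iter_ntriples` and the regular expression are illustrative assumptions, and the pattern covers only simple single-line triples, not the full N-Triples grammar, so a real parser such as Raptor is still the better choice for production data.

```python
import io
import re

# Simplified pattern (an assumption for illustration, not the full grammar):
# IRIs, blank nodes, and plain, typed, or language-tagged literals.
TRIPLE_RE = re.compile(
    r'^\s*(<[^>]*>|_:\S+)\s+'   # subject: IRI or blank node
    r'(<[^>]*>)\s+'             # predicate: IRI
    r'(<[^>]*>|_:\S+|"(?:[^"\\]|\\.)*"(?:\^\^<[^>]*>|@[A-Za-z0-9-]+)?)'  # object
    r'\s*\.\s*$'                # terminating dot
)


def iter_ntriples(lines):
    """Yield (subject, predicate, object) strings, one input line at a time."""
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith('#'):
            continue  # skip blank lines and comments
        match = TRIPLE_RE.match(line)
        if match is None:
            raise ValueError('unsupported or malformed line: %r' % line)
        yield match.groups()


# Any iterable of lines works, including an open file object.
sample = io.StringIO('<s1> <p> <o> .\n<s2> <p2> "literal" .\n')
triples = list(iter_ntriples(sample))
for s, p, o in triples:
    print(s, p, o)
```

Because the generator only ever holds one line, memory use stays flat regardless of file size, as long as the caller does not collect every triple into a list the way the small demo does.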

The code below is a simple example using the redland python bindings:
```
import RDF

# the parser name can be "ntriples", "turtle", "rdfxml", ...
parser = RDF.Parser(name="ntriples")
model = RDF.Model()
parser.parse_into_model(model, "file://file_path", "http://your_base_uri.org")
for triple in model:
    print(triple.subject, triple.predicate, triple.object)
```
The base URI is used to resolve any relative URIs inside your RDF document. For more details, check the documentation of the Redland Python bindings API.

If you don't care much about performance, then use RDFLib; it is simple and easy to use.

别跟我提以往

2021-02-02 16:16
For RDF processing in Python, consider using an RDF library such as RDFLib. If you also need a triplestore, more heavyweight solutions are available as well, but may not be needed here (PySesame, neo4jrdf with neo4jpy).

Before writing your own SAX parser for RDF, check out rdfxml.py:
```
import rdfxml

# read the whole file into memory, then parse it
with open('data.rdf', 'r') as f:
    data = f.read()
rdfxml.parseRDF(data)
```

陌清茗

2021-02-02 16:28

Not sure if SAX is the best solution, but IBM seems to think it works for high-performance XML parsing with Python: http://www.ibm.com/developerworks/xml/library/x-hiperfparse/. Their example RDF dwarfs yours in size (200MB vs. 1.9GB), so their solution should work for you.

This article's examples start pretty basic and pick up quickly.
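
As a rough idea of what incremental XML parsing looks like, here is a minimal sketch using the standard library's `xml.etree.ElementTree.iterparse`. It is not code from the article; the function name `iter_descriptions` and the sample document are assumptions, and it assumes the file consists of flat, top-level `rdf:Description` elements.

```python
import io
import xml.etree.ElementTree as ET

RDF_NS = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'


def iter_descriptions(source):
    """Yield the rdf:about value of each rdf:Description as it is parsed."""
    context = ET.iterparse(source, events=('start', 'end'))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == 'end' and elem.tag == '{%s}Description' % RDF_NS:
            yield elem.get('{%s}about' % RDF_NS)
            # Drop the elements processed so far so the tree does not
            # grow with the size of the file.
            root.clear()


# Hypothetical sample data; pass an open file or a filename for real input.
sample = io.BytesIO(
    b'<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"'
    b' xmlns:dc="http://purl.org/dc/elements/1.1/">'
    b'<rdf:Description rdf:about="http://example.org/a">'
    b'<dc:title>A</dc:title></rdf:Description>'
    b'<rdf:Description rdf:about="http://example.org/b">'
    b'<dc:title>B</dc:title></rdf:Description>'
    b'</rdf:RDF>'
)
subjects = list(iter_descriptions(sample))
print(subjects)
```

Clearing the root after each record is what keeps memory bounded; without it, `iterparse` still builds the complete tree.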

一生所求

2021-02-02 16:33

In my experience, SAX is great for performance but it's a pain to write. Unless I am having issues, I tend to avoid programming with it.
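
For a sense of what the SAX style involves, here is a minimal sketch using the standard library's `xml.sax`. The class name `SubjectCollector`, the helper `collect_subjects`, and the sample document are assumptions for illustration; the handler only records `rdf:about` values and does not reconstruct triples, which is where the real effort in a SAX-based RDF parser goes.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_namespaces

RDF_NS = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'


class SubjectCollector(ContentHandler):
    """Collect the rdf:about value of every rdf:Description element."""

    def __init__(self):
        super().__init__()
        self.subjects = []

    def startElementNS(self, name, qname, attrs):
        # With namespace processing on, name is a (namespace URI, local name) pair
        if name == (RDF_NS, 'Description'):
            about = attrs.get((RDF_NS, 'about'))
            if about is not None:
                self.subjects.append(about)


def collect_subjects(source):
    """Stream-parse an RDF/XML file object and return the subjects seen."""
    handler = SubjectCollector()
    parser = xml.sax.make_parser()
    parser.setFeature(feature_namespaces, True)
    parser.setContentHandler(handler)
    parser.parse(source)
    return handler.subjects


# Hypothetical sample data; pass an open binary file for real input.
sample = io.BytesIO(
    b'<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">'
    b'<rdf:Description rdf:about="http://example.org/a"/>'
    b'<rdf:Description rdf:about="http://example.org/b"/>'
    b'</rdf:RDF>'
)
found = collect_subjects(sample)
print(found)
```

The parser never builds a tree, so memory use depends only on what the handler chooses to keep; for a very large file you would process each subject as it arrives instead of appending to a list.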

"Very large" depends on the RAM of the machine. Assuming that your computer has over 1GB of memory, lxml, pyxml or some other library will be fine for 200MB files.
