Question
I want to parse a resume to extract its section titles and their content, which includes bullets, paragraphs, and URLs. The resume is in .doc/.docx format. My research so far suggests:
1. Building an XML file from the .doc file, and then
2. Building an XML parser using JDOM.
Is there another or better way to do this? Is there an algorithm that would help identify the structure of a resume?
Answer 1:
It looks like you are heading in the right direction. A simple approach is: once you have identified a piece of information, move on from it by traversing a few steps forward or backward (+/-), taking the calculated spacing into account, and identify the related results.
I assume you are using an NLP method, which can help you find data by proximity; you can then remove noise based on your experience.
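To make the idea concrete, here is a minimal sketch of that kind of heuristic, assuming the resume text has already been extracted from the .doc/.docx file into a list of plain-text lines. The class name `ResumeSectionSketch`, the heading vocabulary in `KNOWN_HEADINGS`, and the bullet and URL patterns are all hypothetical choices for illustration, not part of any library, and a real resume corpus would need a far richer set of rules.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.regex.Pattern;

/** Heuristic sketch: groups resume lines under section titles and tags each line. */
class ResumeSectionSketch {

    enum LineKind { BULLET, URL, PARAGRAPH }

    // Hypothetical heading vocabulary (an assumption, not an exhaustive list).
    static final Set<String> KNOWN_HEADINGS =
            Set.of("summary", "experience", "education", "skills", "projects");

    // A line starting with -, *, a bullet character, or "1." / "1)" counts as a bullet.
    static final Pattern BULLET = Pattern.compile("^\\s*([-*\\u2022]|\\d+[.)])\\s+.*");

    // A line containing http://, https://, or www. counts as a URL line.
    static final Pattern URL = Pattern.compile(".*\\b(https?://|www\\.)\\S+.*");

    /** Normalizes a candidate heading: trims, drops a trailing colon, lower-cases. */
    static String normalize(String line) {
        return line.trim().replaceAll(":$", "").toLowerCase(Locale.ROOT);
    }

    static boolean isHeading(String line) {
        return KNOWN_HEADINGS.contains(normalize(line));
    }

    /** Bullets are checked first, so a bullet that contains a link is still a BULLET. */
    static LineKind classify(String line) {
        if (BULLET.matcher(line).matches()) return LineKind.BULLET;
        if (URL.matcher(line).matches()) return LineKind.URL;
        return LineKind.PARAGRAPH;
    }

    /** Lines before the first recognized heading are collected under "header". */
    static Map<String, List<String>> splitIntoSections(List<String> lines) {
        Map<String, List<String>> sections = new LinkedHashMap<>();
        String current = "header";
        sections.put(current, new ArrayList<>());
        for (String raw : lines) {
            if (raw.isBlank()) continue;          // skip empty lines
            if (isHeading(raw)) {
                current = normalize(raw);         // start (or resume) a section
                sections.putIfAbsent(current, new ArrayList<>());
            } else {
                sections.get(current).add(raw.trim());
            }
        }
        return sections;
    }
}
```

Calling `ResumeSectionSketch.splitIntoSections(lines)` returns the sections in document order, and `classify` can then be applied to each line within a section. Matching headings against a fixed word list is the weakest part of this sketch; style information from the document (bold text, font size, heading styles) would probably be a more reliable signal where it is available.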
Or simply use something that is already built. I recommend RChilli CV Parsing, or others such as HireAbility or Sovren; discuss your needs with them. I am sure you will get some useful information.
thanks -K
Answer 2:
Interesting. I worked on a solution where we used Solr to identify entities.
Another approach is to use Apache Solr: index the document into it and fetch the results with faceted search.
The only challenge is how to build the library. This will be much shorter and simpler than using Apache POI.
Let me know if you need any help.
Source: https://stackoverflow.com/questions/21994957/resume-parser-in-java