Question
I need to parse WHOIS raw data records into fields. There is no single consistent format for the raw data, and I need to support all the possible formats (there are about 40 unique formats that I know of). For example, here are excerpts from 3 different WHOIS raw data records:
Created on: 2007-01-04
Updated on: 2014-01-29
Expires on: 2015-01-04
Registrant Name: 0,75 DI VALENTINO ROSSI
Contact: 0,75 Di Valentino Rossi
Registrant Address: Via Garibaldi 22
Registrant City: Pradalunga
Registrant Postal Code: 24020
Registrant Country: IT
Administrative Contact Organization: Giorgio Valoti
Administrative Contact Name: Giorgio Valoti
Administrative Contact Address: Via S. Lucia 2
Administrative Contact City: Pradalunga
Administrative Contact Postal Code: 24020
Administrative Contact Country: IT
Administrative Contact Email: giorgio_v@mac.com
Administrative Contact Tel: +39 340 4050596
---------------------------------------------------------------
Registrant :
onse telecom corporation
Gangdong-gu Sangil-dong, Seoul
Administrative Contact :
onse telecom corporation ruhisashi@onsetel.co.kr
Gangdong-gu Sangil-dong, Seoul,
07079976571
Record created on 19-Jul-2004 EDT.
Record expires on 19-Jul-2015 EDT.
Record last updated on 15-Jul-2014 EDT.
---------------------------------------------------------------
Registrant:
Name: markaviva comunica??o Ltda
Organization: markaviva comunica??o Ltda
E-mail: helissonmaia@markaviva.com.br
Address: RUA FERNANDES LIMA 360 sala 03
Address: 57300070
Address: ARAPIRACA - AL
Phone: 55 11 40039011
Country: BRASIL
Created: 20130405
Updated: 20130405
Administrative Contact:
Name: markaviva comunica??o Ltda
Organization: markaviva comunica??o Ltda
E-mail: helissonmaia@markaviva.com.br
Address: RUA FERNANDES LIMA 360 sala 03
Address: 57300070
Address: ARAPIRACA - AL
Phone: 55 11 40039011
Country: BRASIL
Created: 20130405
Updated: 20130405
As you can see, there's no repeating pattern. I need to extract fields such as 'Registrant Name', 'Registrant Address', 'Admin Name', 'Admin City', etc...
I first tried a basic method of field extraction, based on splitting each line on the first colon found, but it only works when the line prefixes are distinct (no two rows share the same prefix) and, well, separated by a colon (which is not always the case).
Now, I could go over the formats one by one and try to come up with a regex for each one of them, but that would require a lot of time, which I don't have. I wonder if there's any way to automatically mine and treat blocks of text as a context-based "chunk" (with regards to their spacing and common repeating words such as 'registrant' or 'admin') and analyze them accordingly. NLP Maybe?
I'll be glad to hear any ideas, as I'm kind of stumped here. Thanks
Answer 1:
Actually, there are ways to do the job without manually analyzing each format, but they may end up being even more complicated and time-consuming, especially if you are not familiar with them. For example, I would write a parser for one format, parse a lot of data with it, then obtain the same data in each of the other 39 formats and use the knowledge already obtained to assign a label to each token (e.g. "Registrant name", "Address", ... and "Other"). After that, a sequence classifier such as a CRF can be trained on the labeled data, or another method, such as automatic generation of regular expressions, can be applied.
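The label-assignment step described above could look roughly like the following. This is only a minimal sketch: the function name `label_tokens`, whitespace tokenization, and case-insensitive exact matching are all assumptions made for illustration, not part of any particular tool.

```python
def label_tokens(tokens, known):
    """Tag tokens that match an already-known field value.

    tokens: list of whitespace-separated tokens from a record in a new format.
    known:  dict mapping a field label to its value, as obtained from a
            parser for a format that is already handled (hypothetical input).
    Everything that matches no known value is labeled "Other".
    """
    labels = ["Other"] * len(tokens)
    lowered = [t.lower() for t in tokens]
    for field, value in known.items():
        target = value.lower().split()
        n = len(target)
        if n == 0:
            continue
        for i in range(len(tokens) - n + 1):
            window_free = all(l == "Other" for l in labels[i:i + n])
            if lowered[i:i + n] == target and window_free:
                labels[i:i + n] = [field] * n
    return list(zip(tokens, labels))


# The value is known from the "Registrant Name:" line of the first excerpt;
# here it is located in a differently prefixed line of the same data.
known = {"Registrant name": "0,75 DI VALENTINO ROSSI"}
tagged = label_tokens("Contact: 0,75 Di Valentino Rossi".split(), known)
```

Real data would probably need fuzzier matching (punctuation, abbreviations, reordered name parts), so treat the exact-match rule as a starting point.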
EDIT: Additional information added by request. There is a task known in NLP as sequence labeling, i.e. assigning one or more classes to tokens. The basic idea is that you have labeled data like "Registrant/Other Name/Other :/Other DI/Name-start VALENTINO/Name-Inside Rossi/Name-end Contact/Other" and you train a classifier to automatically label further data. Once you have data labeled that way, it's trivial to extract the necessary strings.
Widely used classifiers are Conditional Random Fields (CRF) and, more recently, classifiers based on recurrent neural networks. The mathematics behind them is a bit complex, but there are ready-made tools that can be used (to some extent) as black boxes. I outlined an example use case in this answer, where you can find step-by-step instructions.
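To illustrate why extraction is easy once tokens carry labels, here is a small sketch that collects runs tagged `<Field>-start ... <Field>-end` into strings. The tag names follow the example above; the function name `extract_fields` and the decision to discard incomplete runs are assumptions.

```python
def extract_fields(tagged):
    """Join token runs tagged <Field>-start ... <Field>-end into strings.

    tagged: list of (token, tag) pairs, e.g. ("DI", "Name-start").
    Returns a dict mapping each field name to the list of strings found.
    """
    fields = {}
    current, field = [], None
    for token, tag in tagged:
        name, _, pos = tag.rpartition("-")
        pos = pos.lower()
        if pos == "start":
            current, field = [token], name
        elif pos in ("inside", "end") and field == name:
            current.append(token)
            if pos == "end":
                fields.setdefault(field, []).append(" ".join(current))
                current, field = [], None
        else:
            # "Other" or an inconsistent tag: drop any unfinished run.
            current, field = [], None
    return fields


tagged = [
    ("Registrant", "Other"), ("Name", "Other"), (":", "Other"),
    ("DI", "Name-start"), ("VALENTINO", "Name-Inside"), ("ROSSI", "Name-end"),
    ("Contact", "Other"),
]
fields = extract_fields(tagged)  # {"Name": ["DI VALENTINO ROSSI"]}
```

With only start/inside/end tags, a single-token field has no natural encoding, so a real tag set would likely need an extra tag or a flush rule for that case.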
Notes of caution:
- It is not impossible for a classifier to generalize to unknown formats given only samples in known formats, but in general you need examples in a number of different formats. You can hand-label a couple of training samples or, as I suggested, take known data in a number of formats and assign labels to tokens based on your knowledge. For example, if you know that DI VALENTINO ROSSI is the registrant name in one format, you can find it in another format and label it as the registrant name there.
- CRFs and other classifiers are not guaranteed to make 100% perfect predictions. Accuracy depends on the number of training samples and on the template features, and varies from one task to another. On the positive side, a good CRF model can be robust against format changes, small errors in format, and new formats, which are situations where a hand-written parser usually fails. I think this should work well for your task, but no one can tell until you actually try it and see. I have solved a couple of similar problems, such as free-form price list parsing and postal address parsing, with this approach.
- It takes some time to get familiar with the whole concept.
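As a rough idea of what "template features" might contain for this task, the sketch below turns each token into a dictionary of simple cues plus its neighbors. The specific feature set and the names `token_features` and `sequence_features` are assumptions chosen for illustration; which features actually help would have to be found by experiment.

```python
def token_features(tokens, i):
    """Describe token i with a few surface cues and its immediate context."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_digit": tok.isdigit(),
        "is_upper": tok.isupper(),
        "has_at": "@" in tok,             # likely part of an e-mail address
        "ends_colon": tok.endswith(":"),  # likely a field prefix
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }


def sequence_features(tokens):
    """One feature dict per token, in order."""
    return [token_features(tokens, i) for i in range(len(tokens))]


feats = sequence_features(
    ["E-mail:", "helissonmaia@markaviva.com.br", "Phone:", "55"]
)
```

A list of per-token feature dictionaries, paired with a list of per-token labels, is the kind of input that dictionary-based CRF toolkits typically expect, but check the documentation of whichever tool you pick.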
Answer 2:
There will be no way to get around analyzing each format. Also, do not use regex here.
You should continue as you started, but you have to refine the approach:
- if two lines have the same label, put the content in an array (like for 'Address')
- if the line ends with a ':' read until the following line is empty or ends in a ':'.
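The two rules above can be sketched as follows. This is a minimal illustration, not a complete WHOIS parser: the function name `parse_whois` and the `_unparsed` bucket for lines without a colon are assumptions, and lines inside a block are kept as raw text even if they contain their own `label: value` pairs.

```python
from collections import defaultdict


def parse_whois(text):
    """Split a raw record into {label: [values]} using the two rules."""
    record = defaultdict(list)
    block = None  # label of the currently open multi-line block, if any
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            block = None  # an empty line closes the open block
            continue
        if line.endswith(":"):
            block = line[:-1].strip()  # rule 2: start a new block
            continue
        if block is not None:
            record[block].append(line)
            continue
        label, sep, value = line.partition(":")
        if sep:
            # rule 1: repeated labels accumulate in a list
            record[label.strip()].append(value.strip())
        else:
            record["_unparsed"].append(line)
    return dict(record)


sample = """Created on: 2007-01-04
Address: RUA FERNANDES LIMA 360 sala 03
Address: 57300070

Registrant :
onse telecom corporation
Gangdong-gu Sangil-dong, Seoul
Administrative Contact :
onse telecom corporation ruhisashi@onsetel.co.kr
"""
parsed = parse_whois(sample)
```

Here `parsed["Address"]` holds both address lines, and `parsed["Registrant"]` holds the two lines that follow the `Registrant :` header.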
After parsing, you need to standardize the data, which will have different levels of keys and detail.
Source: https://stackoverflow.com/questions/28653098/automatic-whois-data-parsing
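As one possible reading of that standardization step, the sketch below maps variant labels onto canonical keys and converts the three date styles seen in the excerpts to ISO format. The alias table, the key names, and the assumed input shape (a dict of label to list of strings) are all hypothetical; note also that `%b` matches English month names only under an English or C locale.

```python
from datetime import datetime

# Hypothetical alias table; extend it as new formats are analyzed.
KEY_ALIASES = {
    "created on": "created",
    "updated on": "updated",
    "expires on": "expires",
    "e-mail": "email",
}
DATE_KEYS = {"created", "updated", "expires"}
# 2007-01-04, 19-Jul-2004, 20130405
DATE_FORMATS = ("%Y-%m-%d", "%d-%b-%Y", "%Y%m%d")


def normalize_date(value):
    """Return an ISO date string, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def standardize(parsed):
    """Map labels to canonical keys and normalize date values."""
    out = {}
    for key, values in parsed.items():
        lowered = key.strip().lower()
        canon = KEY_ALIASES.get(lowered, lowered)
        if canon in DATE_KEYS:
            # Keep the original string when the date cannot be parsed.
            values = [normalize_date(v) or v for v in values]
        out.setdefault(canon, []).extend(values)
    return out


clean = standardize({"Created on": ["2007-01-04"], "E-mail": ["giorgio_v@mac.com"]})
```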