gardai-plan-crackdown-on-troublemakers-at-protest-2438316.html': {'dail': 1, 'focus': 1, 'actions': 1, 'trade': 2, 'protest': 1, 'identify': 1, 'previous': 1, 'detectives': 1, 'republican': 1, 'group': 1, 'monitor': 1, 'clashes': 1, 'civil': 1, 'charge': 1, 'breaches': 1, 'travelling': 1, 'main': 1, 'disrupt': 1, 'real': 1, 'policing': 3, 'march': 6, 'finance': 1, 'drawn': 1, 'assistant': 1, 'protesters': 1, 'emphasised': 1, 'department': 1, 'traffic': 2, 'outbreak': 1, 'culprits': 1, 'proportionate': 1, 'instructions': 1, 'warned': 2, 'commanders': 1, 'michael': 2, 'exploit': 1, 'culminating': 1, 'large': 2, 'continue': 1, 'team': 1, 'hijack': 1, 'disorder': 1, 'square': 1, 'leaders': 1, 'deal': 2, 'people': 3, 'streets': 1, 'demonstrations': 2, 'observed': 1, 'street': 2, 'college': 1, 'organised': 1, 'operation': 1, 'special': 1, 'shown': 1, 'attendance': 1, 'normal': 1, 'unions': 2, 'individuals': 1, 'safety': 2, 'prosecuted': 1, 'ira': 1, 'ground': 1, 'public': 2, 'told': 1, 'body': 1, 'stewards': 2, 'obey': 1, 'business': 1, 'gathered': 1, 'assemble': 1, 'garda': 5, 'sinn': 1, 'broken': 1, 'fachtna': 1, 'management': 2, 'possibility': 1, 'groups': 3, 'put': 1, 'affiliated': 1, 'strong': 2, 'security': 1, 'stage': 1, 'behaviour': 1, 'involved': 1, 'route': 2, 'violence': 1, 'dublin': 3, 'fein': 1, 'ensure': 2, 'stand': 1, 'act': 2, 'contingency': 1, 'troublemakers': 2, 'facilitate': 2, 'road': 1, 'members': 1, 'prepared': 1, 'presence': 1, 'sullivan': 2, 'reassure': 1, 'number': 3, 'community': 1, 'strategic': 1, 'visible': 2, 'addressed': 1, 'notify': 1, 'trained': 1, 'eirigi': 1, 'city': 4, 'gpo': 1, 'from': 3, 'crowd': 1, 'visit': 1, 'wood': 1, 'editor': 1, 'peaceful': 4, 'expected': 2, 'today': 1, 'commissioner': 4, 'quay': 1, 'ictu': 1, 'advance': 1, 'murphy': 2, 'gardai': 6, 'aware': 1, 'closures': 1, 'courts': 1, 'branch': 1, 'deployed': 1, 'made': 1, 'thousands': 1, 'socialist': 1, 'work': 1, 'supt': 2, 'feehan': 1, 'mr': 1, 'briefing': 1, 'visited': 1, 'manner': 1, 'irish': 2, 'metropolitan': 1, 'spotters': 1, 'organisers': 1, 'in': 13, 'dissident': 1, 'evidence': 1, 'tom': 1, 'arrangements': 3, 'experience': 1, 'allowed': 1, 'sought': 1, 'rally': 1, 'connell': 1, 'officers': 3, 'potential': 1, 'holding': 1, 'units': 1, 'place': 2, 'events': 1, 'dignified': 1, 'planned': 1, 'independent': 1, 'added': 2, 'plans': 1, 'congress': 1, 'centre': 3, 'comprehensive': 1, 'measures': 1, 'yesterday': 2, 'alert': 1, 'important': 1, 'moving': 1, 'plan': 2, 'highly': 1, 'law': 2, 'senior': 2, 'fair': 1, 'recent': 1, 'refuse': 1, 'attempt': 1, 'brady': 1, 'liaising': 1, 'conscious': 1, 'light': 1, 'clear': 1, 'headquarters': 1, 'wing': 1, 'chief': 2, 'maintain': 1, 'harcourt': 1, 'order': 2, 'left': 1}}
I have a python script that extracts words from text files and counts the number of times they occur in the file.
I want to add them to an ".ARFF" file to use for weka classification. Above is an example output of my python script. How do I go about inserting them into an ARFF file, keeping each text file separate. Each file is differentiated by {"with their words in here!!"}
There are details on the ARFF file format here and it's very simple to generate. For example, using a cut-down version of your Python dictionary, the following script:
import re
d = { 'gardai-plan-crackdown-on-troublemakers-at-protest-2438316.html':
{'dail': 1,
'focus': 1,
'actions': 1,
'trade': 2,
'protest': 1,
'identify': 1 }}
for original_filename in d.keys():
m = re.search('^(.*)\.html$',original_filename,)
if not m:
print "Ignoring the file:", original_filename
continue
output_filename = m.group(1)+'.arff'
with open(output_filename,"w") as fp:
fp.write('''@RELATION wordcounts
@ATTRIBUTE word string
@ATTRIBUTE count numeric
@DATA
''')
for word_and_count in d[original_filename].items():
fp.write("%s,%d\n" % word_and_count)
Generates output of the form:
@RELATION wordcounts
@ATTRIBUTE word string
@ATTRIBUTE count numeric
@DATA
dail,1
focus,1
actions,1
trade,2
protest,1
identify,1
... in a file called gardai-plan-crackdown-on-troublemakers-at-protest-2438316.arff
. If that's not exactly what you want, I'm sure you can easily alter it. (For example, if the "words" might have spaces or other punctuation in them, you probably want to quote them.)
I know it's pretty easy to generate an arff file on your own, but I still wanted to make it simpler so I wrote a python package
https://github.com/ubershmekel/arff
It's also on pypi so easy_install arff
This project seems to be a bit more up to date. You can install it via
pip:
$ pip install liac-arff
or easy_install:
$ easy_install liac-arff
来源:https://stackoverflow.com/questions/5230699/creating-an-arff-file-from-python-output