difflib

How can I create an artificial key column for merging two datasets using difflib when the column of interest has missing cells?

冷眼眸甩不掉的悲伤 submitted on 2019-12-11 07:33:34
Question: Goal: If the name in row i of df2 is a substring or an exact match of a name in some row N of df1, and the state and district columns of row N in df1 match the respective state and district columns of row i in df2, combine the rows. I was advised to use difflib to create an artificial key column to merge on. This new column is called 'name'. difflib.get_close_matches looks for similar strings in df2. This works well when all rows in the 'CandidateName' column are present but I get
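One way to approach the missing-cell case is to skip any value that is not a usable string before it reaches difflib. The sketch below is only an illustration: the helper closest_name and the sample name lists are hypothetical, and it uses plain Python lists rather than the DataFrames from the question.

```python
import difflib


def closest_name(name, candidates, cutoff=0.6):
    """Return the best fuzzy match for name, or None when name is missing.

    closest_name is a hypothetical helper, not part of difflib.
    """
    # Missing cells usually show up as None or float('nan'); neither is a
    # string, so return None instead of passing them to difflib.
    if not isinstance(name, str) or not name.strip():
        return None
    matches = difflib.get_close_matches(name, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Hypothetical sample data standing in for the name columns of df1 and df2.
df1_names = ["John A. Smith", "Maria Lopez", "Chen Wei"]
df2_names = ["John Smith", None, float("nan"), "Maria Lopez"]

# Build the artificial key; rows with missing names get None as their key.
keys = [closest_name(name, df1_names) for name in df2_names]
print(keys)
```

With pandas, the same helper could be applied to the column element by element; rows whose key is None would then simply not find a partner in the merge.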

Python Difflib - How to Get SDiff Sequences with “Change” Op

假装没事ソ submitted on 2019-12-11 02:16:23
Question: I am reading the documentation for Python's difflib. According to the docs, each Differ delta line begins with a two-letter code:

Code    Meaning
'- '    line unique to sequence 1
'+ '    line unique to sequence 2
'  '    line common to both sequences
'? '    line not present in either input sequence

But what about the "Change" operation? How do I get a "c " instruction similar to the results in Perl's sdiff?

Answer 1: See this script: sdiff.py @ hungrysnake.net http://hungrysnake.net/doc/software__sdiff_py.html Perl's sdiff
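Independently of the linked script, difflib.SequenceMatcher.get_opcodes() reports a "replace" tag that can be mapped onto a "c" code. The sketch below is an assumption-laden illustration, not the linked sdiff.py: the function sdiff_codes and the one-letter codes ("u", "c", "-", "+") are choices made here, loosely modeled on sdiff-style output.

```python
import difflib
from itertools import zip_longest


def sdiff_codes(old_lines, new_lines):
    """Yield (code, old, new) tuples; 'c' marks a changed pair of lines.

    sdiff_codes is a hypothetical helper built on SequenceMatcher opcodes.
    """
    matcher = difflib.SequenceMatcher(None, old_lines, new_lines)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            for old, new in zip(old_lines[i1:i2], new_lines[j1:j2]):
                yield ("u", old, new)
        elif tag == "delete":
            for old in old_lines[i1:i2]:
                yield ("-", old, None)
        elif tag == "insert":
            for new in new_lines[j1:j2]:
                yield ("+", None, new)
        else:  # tag == "replace": pair lines up; leftovers become - or +
            pairs = zip_longest(old_lines[i1:i2], new_lines[j1:j2])
            for old, new in pairs:
                if old is None:
                    yield ("+", None, new)
                elif new is None:
                    yield ("-", old, None)
                else:
                    yield ("c", old, new)


result = list(sdiff_codes(["a", "b", "c"], ["a", "x", "c", "d"]))
for row in result:
    print(row)
```

Pairing lines inside a "replace" block by position is the simplest choice; a more careful version could pair them by similarity instead.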

How to delete invalid characters between multiple strings in python?

痞子三分冷 submitted on 2019-12-10 11:42:07
Question: I'm working on a project with OCR in Spanish. The camera captures different frames in a line of text. The line of text contains this:

    Este texto, es una prueba del dispositivo lector para no videntes.

After some operations I get strings like these:

    s1 = "Este texto, es una p!"
    s2 = "fste texto, es una |prueba u.-"
    s3 = "jo, es una prueba del dispo‘"
    s4 = "prueba del dispositivo \ec"
    s5 = "del dispositivo lector par:"
    s6 = "positivo lector para no xndev"
    s7 = "lector para no videntes"
    s8 = "¡r
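One possible building block for this kind of cleanup is SequenceMatcher.find_longest_match, which locates the longest run of characters that two fragments share. The sketch below only shows that single step on s1 and s2 from the question; it is not a full solution, and how the shared runs are stitched together is left open.

```python
import difflib

s1 = "Este texto, es una p!"
s2 = "fste texto, es una |prueba u.-"

# autojunk=False keeps difflib from discarding frequent characters,
# which only matters for long inputs but makes the intent explicit.
matcher = difflib.SequenceMatcher(None, s1, s2, autojunk=False)

# Search the full length of both strings for the longest common run.
match = matcher.find_longest_match(0, len(s1), 0, len(s2))
shared = s1[match.a:match.a + match.size]
print(repr(shared))
```

The characters on either side of the shared run (here the leading "E"/"f" and the trailing "p!"/"|prueba u.-") are the places where the two frames disagree, so they are the candidates for OCR noise.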

ignore spaces when comparing strings in python

巧了我就是萌 submitted on 2019-12-08 16:08:22
Question: I am using the difflib Python package. No matter whether I set the isjunk argument, the calculated ratios are the same. Shouldn't differences in spaces be ignored when isjunk is lambda x: x == " "?

    In [193]: difflib.SequenceMatcher(isjunk=lambda x: x == " ", a="a b c", b="a bc").ratio()
    Out[193]: 0.8888888888888888

    In [194]: difflib.SequenceMatcher(a="a b c", b="a bc").ratio()
    Out[194]: 0.8888888888888888

Answer 1: isjunk works a little differently than you might think. In general, isjunk merely identifies
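If the goal is simply to ignore spaces, one common workaround is to remove them from both strings before comparing. This is a sketch of that workaround, not necessarily what the truncated answer above goes on to recommend.

```python
import difflib

a, b = "a b c", "a bc"

# Ratio on the raw strings, as in the question.
with_spaces = difflib.SequenceMatcher(None, a, b).ratio()

# Strip spaces first so they cannot influence the score at all.
a_clean = a.replace(" ", "")
b_clean = b.replace(" ", "")
without_spaces = difflib.SequenceMatcher(None, a_clean, b_clean).ratio()

print(with_spaces, without_spaces)
```

To ignore all whitespace rather than just the space character, "".join(s.split()) would do the same job for tabs and newlines.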

Python - getting just the difference between strings

落花浮王杯 submitted on 2019-12-07 03:06:00
Question: What's the best way of getting just the difference from two multiline strings?

    a = 'testing this is working \n testing this is working 1 \n'
    b = 'testing this is working \n testing this is working 1 \n testing this is working 2'
    diff = difflib.ndiff(a,b)
    print ''.join(diff)

This produces:

    t e s t i n g t h i s i s w o r k i n g t e s t i n g t h i s i s w o r k i n g 1 + + t+ e+ s+ t+ i+ n+ g+ + t+ h+ i+ s+ + i+ s+ + w+ o+ r+ k+ i+ n+ g+ + 2

What's the best way of getting exactly: testing
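The character-by-character output above comes from passing whole strings to ndiff, which then treats every character as a separate item. A minimal sketch of the line-based alternative, written for Python 3 and assuming that only added lines are of interest:

```python
import difflib

a = 'testing this is working \n testing this is working 1 \n'
b = 'testing this is working \n testing this is working 1 \n testing this is working 2'

# ndiff expects sequences of lines, so split the strings first.
diff = difflib.ndiff(a.splitlines(), b.splitlines())

# Keep only the lines unique to b and drop the two-character '+ ' prefix.
added = [line[2:] for line in diff if line.startswith('+ ')]
print(added)
```

Filtering on '- ' instead would return the lines that exist only in a.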

Python Difflib Deltas and Compare Ndiff

心已入冬 submitted on 2019-12-05 03:04:20
Question: I want to do something like what I believe change control systems do: compare two files and save a small diff each time the file changes. I've been reading this page: http://docs.python.org/library/difflib.html and apparently it's not sinking in. I tried to recreate this in the fairly simple program shown below, but what I seem to be missing is that the deltas contain at least as much as the original file, and more. Is it not possible to get to just the

Is there an alternative to `difflib.get_close_matches()` that returns indexes (list positions) instead of a str list?

血红的双手。 submitted on 2019-12-04 13:19:18
I want to use something like difflib.get_close_matches, but instead of the most similar strings I would like to obtain their indexes (i.e. positions in the list). Indexes are more flexible because one can relate them to other data structures (related to the matched string). For example, instead of:

    >>> words = ['hello', 'Hallo', 'hi', 'house', 'key', 'screen', 'hallo', 'question', 'format']
    >>> difflib.get_close_matches('Hello', words)
    ['hello', 'hallo', 'Hallo']

I would like:

    >>> difflib.get_close_matches('Hello', words)
    [0, 1, 6]

There doesn't seem to exist a parameter to
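A small wrapper around SequenceMatcher can return positions instead of strings. The sketch below is one possible approach: close_match_indexes is a hypothetical function, not part of difflib, and its tie-breaking (lower index first) is a choice made here that may order results differently from get_close_matches.

```python
import difflib


def close_match_indexes(word, possibilities, n=3, cutoff=0.6):
    """Return up to n list positions whose entries resemble word.

    Hypothetical helper; scores each entry with SequenceMatcher.ratio().
    """
    matcher = difflib.SequenceMatcher()
    matcher.set_seq2(word)  # seq2 is cached, so set the fixed word there
    scored = []
    for index, candidate in enumerate(possibilities):
        matcher.set_seq1(candidate)
        score = matcher.ratio()
        if score >= cutoff:
            scored.append((score, index))
    # Best score first; ties are broken by the lower index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [index for _, index in scored[:n]]


words = ['hello', 'Hallo', 'hi', 'house', 'key', 'screen', 'hallo',
         'question', 'format']
indexes = close_match_indexes('Hello', words)
print(indexes)
```

The returned positions can then be used to look up related rows in any parallel list or table.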

How to highlight more than two characters per line in difflib's HTML output

孤街醉人 submitted on 2019-12-03 20:52:07
I am using difflib.HtmlDiff to compare two files. I want the differences to be highlighted in the HTML output. This already works when there are at most two different characters in one line:

    a = "2.000"
    b = "2.120"

But when more characters differ on one line, the whole line is marked red (on the left side) or green (on the right side of the table) in the output:

    a = "2.000"
    b = "2.123"

Is this behaviour configurable? Can I set the number of different characters at which the line is marked as deleted / added?

EDIT: Example:

    import difflib
    diff=difflib.HtmlDiff()
    print

Python Difflib Deltas and Compare Ndiff

筅森魡賤 submitted on 2019-12-03 16:29:48
I want to do something like what I believe change control systems do: compare two files and save a small diff each time the file changes. I've been reading this page: http://docs.python.org/library/difflib.html and apparently it's not sinking in. I tried to recreate this in the fairly simple program shown below, but what I seem to be missing is that the deltas contain at least as much as the original file, and more. Is it not possible to get to just the pure changes? The reason I ask is hopefully obvious: to save disk space. I could just save the entire
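The size difference the question describes can be seen by comparing ndiff, which echoes every line, with unified_diff called with n=0, which emits only the changed lines plus short headers. This is a minimal sketch with made-up sample lines; the file labels "old" and "new" are placeholders.

```python
import difflib

# Hypothetical file contents, one list entry per line.
old = ["alpha\n", "beta\n", "gamma\n", "delta\n"]
new = ["alpha\n", "beta\n", "GAMMA\n", "delta\n"]

# ndiff repeats unchanged lines too, so the delta is at least file-sized.
full = list(difflib.ndiff(old, new))

# n=0 asks for zero lines of context around each change.
compact = list(difflib.unified_diff(old, new, fromfile="old", tofile="new", n=0))

# Strip the header lines to see the pure changes.
changes = [line for line in compact
           if not line.startswith(("---", "+++", "@@"))]
print("".join(compact))
```

One caveat: difflib.restore only understands deltas produced by ndiff or Differ.compare, so applying a stored unified diff needs a separate patch tool or custom code.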

Getting more granular diffs from difflib (or a way to post-process a diff to achieve the same thing)

。_饼干妹妹 submitted on 2019-12-01 13:41:26
I downloaded this page and made a minor edit to it, changing the first 65 in this paragraph to 68. I then parse both sources with BeautifulSoup and diff them with difflib.

    url = 'https://secure.ssa.gov/apps10/reference.nsf/links/02092016062645AM'
    response = urllib2.urlopen(url)
    content = response.read() # get response as list of lines
    url2 = 'file:///Users/Pyderman/projects/temp/02092016062645AM-modified.html'
    response2 = urllib2.urlopen(url2)
    content2 = response2.read() # get response as list of lines
    import difflib
    d = difflib.Differ()
    diffed = d.compare(content, content)
    soup = bs4
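A more granular result usually comes from diffing in two passes: first line by line to find the changed lines, then character by character within a changed pair. The sketch below assumes Python 3 and uses two short made-up HTML strings in place of the downloaded pages; note that Differ.compare expects lists of lines, so the text is split first.

```python
import difflib

# Hypothetical stand-ins for the original and the edited page source.
original = "<p>Age 65 applies.</p>\n<p>Benefits start at 65.</p>\n"
modified = "<p>Age 68 applies.</p>\n<p>Benefits start at 65.</p>\n"

old_lines = original.splitlines(keepends=True)
new_lines = modified.splitlines(keepends=True)

# Pass 1: keep only removed and added lines, skipping '  ' and '? ' rows.
changed = [line for line in difflib.Differ().compare(old_lines, new_lines)
           if line[:2] in ("- ", "+ ")]

# Pass 2: character-level opcodes inside the first changed line pair.
old_line, new_line = old_lines[0], new_lines[0]
matcher = difflib.SequenceMatcher(None, old_line, new_line)
replacements = [(old_line[i1:i2], new_line[j1:j2])
                for tag, i1, i2, j1, j2 in matcher.get_opcodes()
                if tag == "replace"]
print(changed)
print(replacements)
```

Here the second pass is hard-wired to the first line for brevity; in practice the changed pairs would come from get_opcodes() on the two line lists.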