I have some PowerPoint documents that I keep version-controlled with git. I want to know what the differences are between versions of a file. Text is most important, images and form
I was unable to install python-pptx, as suggested by the accepted answer, so I looked for a node.js solution (that may also work for several other file formats that it can handle).
Install https://github.com/dbashford/textract (npm install --global textract).
Define how to diff "textract" in your git config (.git/config or ~/.gitconfig). For my Windows machine:
[diff "textract"]
    binary = true
    textconv = textract.cmd
Define in your .gitattributes that *.pptx files should use diff "textract":
*.pptx diff=textract
git diff happily.
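If you would rather script those two configuration steps, here is a minimal Python sketch. It assumes git is on the PATH and that you run it from the repository root. The helper names (textconv_config_commands, gitattributes_line, configure) are made up for illustration, and textract.cmd is the Windows wrapper name used above; on other systems the command is presumably plain textract.

```python
import subprocess


def textconv_config_commands(driver, textconv):
    """Return the `git config` invocations that define a textconv diff driver."""
    return [
        ["git", "config", "diff.%s.binary" % driver, "true"],
        ["git", "config", "diff.%s.textconv" % driver, textconv],
    ]


def gitattributes_line(pattern, driver):
    """Return the .gitattributes line that routes `pattern` to the driver."""
    return "%s diff=%s" % (pattern, driver)


def configure(pattern="*.pptx", driver="textract", textconv="textract.cmd"):
    """Apply the settings to the repository in the current directory.

    Hypothetical helper: it writes to .git/config via `git config` and
    appends one line to .gitattributes.
    """
    for cmd in textconv_config_commands(driver, textconv):
        subprocess.check_call(cmd)
    with open(".gitattributes", "a") as fh:
        fh.write(gitattributes_line(pattern, driver) + "\n")
```

Calling configure() once per repository should leave you with the same [diff "textract"] section and .gitattributes entry as the manual steps.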
Not really. A PowerPoint file is essentially a zip archive of a folder full of files. Git will treat it as a binary file (because it is one).
Maybe there's a third-party extension to do it, but I've never heard of one.
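Since a .pptx is a zip archive, the slide text can in principle be read with nothing but the Python standard library. The sketch below is only an illustration of that idea, not a full converter: it assumes the usual layout where slides live at ppt/slides/slideN.xml and text runs are stored in DrawingML a:t elements, and it sorts by the number in the file name, which is not guaranteed to match the presentation order. The function names are hypothetical.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# DrawingML namespace used for text runs (<a:t>) inside slide XML.
A_NS = "{http://schemas.openxmlformats.org/drawingml/2006/main}"


def slide_number(name):
    """Extract N from 'ppt/slides/slideN.xml' so slides sort numerically."""
    return int(re.search(r"slide(\d+)\.xml$", name).group(1))


def pptx_text(path):
    """Return a list of (slide part name, [text runs]) read from the zip."""
    result = []
    with zipfile.ZipFile(path) as archive:
        names = [n for n in archive.namelist()
                 if re.match(r"ppt/slides/slide\d+\.xml$", n)]
        for name in sorted(names, key=slide_number):
            root = ET.fromstring(archive.read(name))
            runs = [t.text for t in root.iter(A_NS + "t") if t.text]
            result.append((name, runs))
    return result
```

Printing the runs slide by slide gives a plain-text view that an ordinary text diff can compare.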
I can't speak directly to git as we use Visual Studio + TFS at work. However, a bit of research reveals this should work. What I do on VS is to integrate WinMerge and its plugin which supports a text comparison of MS Office and PDF files. This allows me to do diffs of pptx, docx, pdf, etc. files published to version control.
For git, the way it should work is:
1) Get WinMerge with the xdocdiff plugin: http://freemind.s57.xrea.com/xdocdiffPlugin/en/index.html
2) Integrate WinMerge with git: https://coderwall.com/p/76wmzq/winmerge-as-git-difftool-on-windows
Hopefully this will allow you to see the text-based diffs for your PowerPoint.
I wrote this for use with git on the command-line (requires Python and the python-pptx library):
r"""
Setup -- Add these lines to the following files:

--- .gitattributes
*.pptx diff=pptx

--- .gitconfig (or repo\.git\config or your_user_home\.gitconfig)
    (change the path to point to your local copy of the script)
[diff "pptx"]
    binary = true
    textconv = python C:/Python27/Scripts/git-pptx-textconv.py

usage:
    git diff your_powerpoint.pptx

Thanks to the python-pptx docs and this snippet:
http://python-pptx.readthedocs.org/en/latest/user/quickstart.html#extract-all-text-from-slides-in-presentation
"""
import sys

from pptx import Presentation

# Convert left and right-hand quotes from Unicode to ASCII
# found http://stackoverflow.com/questions/816285/where-is-pythons-best-ascii-for-this-unicode-database
# go here if more power is needed http://code.activestate.com/recipes/251871/
# or here https://pypi.python.org/pypi/Unidecode/0.04.1
PUNCTUATION = {0x2018: 0x27, 0x2019: 0x27, 0x201C: 0x22, 0x201D: 0x22}

if __name__ == '__main__':
    if len(sys.argv) != 2:
        sys.stderr.write("Usage: git-pptx-textconv file.pptx\n")
        sys.exit(1)

    # Emit UTF-8 bytes so the output does not depend on the console encoding
    # (sys.stdout.buffer exists on Python 3; Python 2 writes bytes directly).
    out = getattr(sys.stdout, 'buffer', sys.stdout)

    path_to_presentation = sys.argv[1]
    prs = Presentation(path_to_presentation)
    for slide in prs.slides:
        for shape in slide.shapes:
            if not shape.has_text_frame:
                continue
            for paragraph in shape.text_frame.paragraphs:
                par_text = u''
                for run in paragraph.runs:
                    s = run.text
                    # Escape backslashes, then flatten control characters
                    # so each paragraph stays on one output line.
                    s = s.replace(u"\\", u"\\\\")
                    s = s.replace(u"\n", u" ")
                    s = s.replace(u"\r", u" ")
                    s = s.replace(u"\t", u" ")
                    # translate() returns a new string; keep the result.
                    s = s.translate(PUNCTUATION)
                    if s:
                        par_text += s
                out.write(par_text.encode('utf-8') + b'\n')
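The quote-normalization step can be tried on its own, without a presentation file. This is a small sketch using the same code-point mapping as the script; normalize_quotes is a name I made up for the example. One thing it makes visible: translate() returns a new string rather than changing the original, so the result has to be assigned or returned.

```python
# Curly single and double quotes mapped to their ASCII equivalents
# (same code points as in the script above).
PUNCTUATION = {0x2018: 0x27, 0x2019: 0x27, 0x201C: 0x22, 0x201D: 0x22}


def normalize_quotes(text):
    """Return a copy of `text` with curly quotes replaced by straight ones."""
    # translate() does not modify `text` in place; the result must be kept.
    return text.translate(PUNCTUATION)


sample = u"\u201cIt\u2019s done\u201d"
print(normalize_quotes(sample))  # prints "It's done" with straight quotes
```

Straight quotes keep the diff output readable in consoles that cannot display the curly ones.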