Question
Edit: This post is now superseded by a new question, as the problem has to be presented slightly differently. It's here: How can I efficiently run XSLT transformations for a large number of files in parallel?
I'm stuck in my attempts to parallelize a process, and after spending a decent amount of time on it I'd like to ask for some help...
Basically, I have a lot of XML files to transform with a specific XSLT sheet. But the sheet calls a (very slow) API to fetch additional data, so processing the whole batch of XMLs in one go would take (very) long.
Therefore I split all the files from the original "input" folder into subfolders, each containing around 5000 XML files, and I copied the following Bash script into each subfolder too:
for f in *.xml
do
    java -jar ../../saxon9he.jar -xsl:../../some-xslt-sheet.xsl -s:"$f"
done
And I call each process, for each folder, from the "root" folder that contains the "input" folder, the Saxon library and the XSLT sheet:
find input -type d -exec sh {}/script.sh \;
But I get this error:
Unable to access jarfile ../../saxon9he.jar
I suppose it comes from the fact that I'm operating from the "root" folder, while the scripts being called are lower in the directory tree. I could solve the problem (if I'm right) by copying all the assets into each subfolder, but I find that solution makes my current approach even clumsier.
Thanks to anyone who might have an idea and can help me understand this!
Answer 1:
Firstly, you really don't want to initialize a new Java VM to run each transformation: this typically takes much longer than running the actual transformation. To put this in perspective, for "typical" transformations you will often see a Java initialization time of 3 seconds, a stylesheet compilation time of 300ms, and a transformation time of 10ms. So if you can find a way to initialize Java and compile the stylesheet only once, your total time for 10K documents is going to be 2 minutes rather than 10 hours.
There are various ways to achieve this, but they all involve using something other than a shell script to control the process. The simplest, in my view, is to control it from XSLT itself, by using the collection() function to access all the files in the directory. This has an added bonus if you're using Saxon-EE: the files will be processed (parsed) in parallel using all the cores on your machine, which can speed things up by another factor of 4 or so. You just need to add an entry point to the stylesheet, something like:
<xsl:template name="main">
<xsl:for-each select="collection('file:///my/dir?select=*.xml;recurse=yes')!saxon:discard-document(.)">
<xsl:result-document href="....">
<xsl:apply-templates/>
</xsl:result-document>
</xsl:for-each>
</xsl:template>
The saxon:discard-document call is optional, but because it makes documents eligible for garbage collection, it means you are less likely to run out of memory.
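To kick off such a named entry template without supplying a source document, Saxon's command line accepts an initial-template option in place of -s. A sketch, reusing the jar and stylesheet names from the question (the template name main matches the example above):

```shell
java -jar saxon9he.jar -xsl:some-xslt-sheet.xsl -it:main
```

The collection() call in the stylesheet then drives which files get read, and each xsl:result-document writes its own output file, so a single JVM and a single stylesheet compilation cover the whole batch.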
Another approach to writing the control loop is to use a specialized shell such as xmlsh.
Answer 2:
Try it this way: enter each subfolder in turn and call the script from there, so the script's relative paths resolve correctly:
for d in */script.sh
do
    (
        cd "$(dirname "$d")"
        sh ./script.sh
    )
done
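The same cd-into-the-subfolder trick also works with the find invocation from the question. A self-contained sketch you can run anywhere (the demo/ layout, empty jar file, and found.txt marker are hypothetical stand-ins for the real root folder, Saxon jar, and transformation output):

```shell
# Hypothetical layout mirroring the question: a root folder holding
# saxon9he.jar, and input/sub*/ subfolders each holding a script.sh
# that refers to ../../saxon9he.jar.
mkdir -p demo/input/sub1 demo/input/sub2
touch demo/saxon9he.jar
for d in demo/input/sub1 demo/input/sub2; do
    # Stand-in script: touch the jar via the same ../../ relative path
    # the real script uses, and record success in found.txt.
    printf 'ls ../../saxon9he.jar > found.txt\n' > "$d/script.sh"
done

# The fix: have find spawn a shell that cd's into each matched
# subfolder before running the script, so ../../ resolves against
# the subfolder rather than against the root folder.
cd demo
find input -mindepth 1 -type d -exec sh -c 'cd "$1" && sh script.sh' _ {} \;
cd ..
```

Each per-folder script now runs with its own subfolder as the working directory, which is exactly the assumption its ../../ paths were written under.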
Source: https://stackoverflow.com/questions/43128510/recursive-call-to-a-duplicated-bash-script-making-it-unable-to-access-the-asset