I want to cluster PDF documents based on their structure, not only their text content.
The main problem with the text-only approach is that it will lose the information i