Is CKAN capable of dealing with 100k+ files and TB of data?

Asked 2021-02-09 11:21 by 终归单人心

What we are wanting to do is create a local data repository for our lab members to organize, search, access, catalog, and reference our data. I feel that CKAN can do all of this.

2 Answers
  • 2021-02-09 11:49

    We're using CKAN at the Natural History Museum (data.nhm.ac.uk) for some pretty hefty research datasets - our main specimen collection has 2.8 million records - and it's handling it very well. We have had to extend CKAN with some custom plugins to make this possible, but they're open source and available on GitHub.

    Our datasolr extension moves querying of large datasets into Solr, which handles indexing and searching big datasets better than Postgres (on our infrastructure, anyway): https://github.com/NaturalHistoryMuseum/ckanext-datasolr.
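The extension itself isn't shown here, but the idea of routing a datastore-style search to Solr can be sketched with nothing but the standard library. The core URL, field names, and parameter mapping below are illustrative assumptions, not ckanext-datasolr's actual API:

```python
from urllib.parse import urlencode

def build_solr_search(base_url, q="*:*", filters=None, limit=100, offset=0):
    """Translate a CKAN datastore_search-style request into a Solr
    /select query string. Core URL and field names are assumptions."""
    params = {
        "q": q,
        "wt": "json",
        "rows": limit,
        "start": offset,
    }
    if filters:
        # Each filter becomes a Solr filter query (fq). Filter queries are
        # cached independently of the main query, which is one reason Solr
        # copes well with repeated faceted searches over large datasets.
        params["fq"] = [f'{field}:"{value}"' for field, value in filters.items()]
    return f"{base_url}/select?{urlencode(params, doseq=True)}"

url = build_solr_search(
    "http://localhost:8983/solr/specimens",
    q="genus:Archaeopteryx",
    filters={"collection": "palaeontology"},
    limit=50,
)
```

The point is that pagination (`rows`/`start`) and filtering (`fq`) map naturally onto Solr's query parameters, so CKAN's search plumbing can be redirected without changing the public API.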

    To prevent CKAN falling over when users download big files, we moved the packaging and download to a separate service and task queue.

    https://github.com/NaturalHistoryMuseum/ckanext-ckanpackager https://github.com/NaturalHistoryMuseum/ckanpackager
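A minimal sketch of the same idea: packaging runs on a worker pulling from a task queue, so the web process only enqueues a job and returns immediately. ckanpackager is a separate service; the names and queue mechanics here are illustrative, not its actual API:

```python
import os
import queue
import tempfile
import threading
import zipfile

jobs = queue.Queue()

def packager_worker():
    # Runs out-of-band: zips up the requested resource files so a big
    # download never ties up a web worker process.
    while True:
        resource_paths, dest = jobs.get()
        with zipfile.ZipFile(dest, "w", zipfile.ZIP_DEFLATED) as zf:
            for path in resource_paths:
                zf.write(path, arcname=os.path.basename(path))
        jobs.task_done()

threading.Thread(target=packager_worker, daemon=True).start()

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "records.csv")
    with open(src, "w") as f:
        f.write("id,name\n1,Archaeopteryx\n")
    dest = os.path.join(tmp, "package.zip")
    jobs.put(([src], dest))   # the request handler would stop here
    jobs.join()               # demo only: block until the archive exists
    with zipfile.ZipFile(dest) as zf:
        names = zf.namelist()
```

In production the handler would return a job ID instead of blocking, and the user would poll or be emailed a link once the archive is ready.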

    So yes, CKAN with a few contributed plugins can definitely handle larger datasets. We haven't tested it with TB+ datasets yet, but we will next year when we use CKAN to release some phylogenetic data.

  • 2021-02-09 12:00

    Yes :)

    But there are extensions to use or build.

    Take a look at the extensions built for CKAN Galleries (http://datashades.com/ckan-galleries/). We built that specifically for image and video assets that are referenced in the record level of a dataset resource.

    There is an S3 cloud connector for object storage if needed.
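One way an object-storage connector can work is to map each resource ID onto a bucket key using the same three-level fan-out CKAN's default FileStore uses on disk, so S3 stands in for local storage transparently. The layout below is an assumption modelled on CKAN's on-disk scheme; a real connector such as ckanext-s3filestore defines its own:

```python
def s3_key_for_resource(resource_id, prefix="resources"):
    """Map a CKAN resource id onto an S3 object key. The 3/3/rest split
    mirrors CKAN's FileStore directory fan-out (an assumption here) and
    keeps any one 'directory' prefix from holding millions of objects."""
    return f"{prefix}/{resource_id[:3]}/{resource_id[3:6]}/{resource_id[6:]}"

key = s3_key_for_resource("6f9b2c41-8a77-4d35-b1c0-0de7a1e2f3a4")
```

With a scheme like this, uploads and downloads become plain S3 puts and gets (or presigned URLs), which is what lets CKAN scale past what a single server's disk can hold.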

    We've started to look at various ways to extend CKAN so it can provide enterprise data storage and management for all types of data: very large, real-time, IoT-specific, Linked Data, etc.

    I think in some cases these will be addressed by adding the concept of 'resource containers' to CKAN. In some sense both file store and data store are examples of such resource container extensions.

    Using AWS's API Gateway service, we are looking at ways to expose data stored in third-party systems through the same request methods as any other CKAN resource.

    Although not everyone is there just yet, when you use infrastructure as software, which AWS enables, you can build some really neat stuff that looks like software running on a traditional web stack but is actually using S3, Lambda, temporary relational DBs and API Gateway to do some very heavy lifting.

    We aim to open source the approach taken for such work as open architecture as it matures. We've started this already by publishing scripts used to build supercomputer clusters on AWS. You can find those here: https://github.com/DataShades/awscloud-hpc
