问题
I want to create a directory structure table as described in this question where:
Directory = "Primary Key" id field, typically an integer
Directory_Parent = "Foreign Key" id field, which points to the id of another Directory in the same table
Value = string containing the directory/folder name
Given Tree/Fruit/Apples/
Directory | Directory_Parent | Value
0 null Root
1 0 Tree
2 1 Fruit
3 2 Apples
A Root folder has been created at Primary Key 0 with a null parent.
My paths are being imported from CSV and currently are in a table with 2 columns:
FileID Path
1 videos/gopro/father/mov001.mp4
2 videos/gopro/father/mov002.mp4
3 pictures/family/father/Oldman.jpg
4 pictures/family/father/Oldman2.jpg
5 documents/legal/father/estate/will.doc
6 documents/legal/father/estate/will2.doc
7 documents/legal/father/estate/newyork/albany/will.doc
8 video/gopro/father/newyork/albany/holiday/christmas/2002/mov001.mp4
9 pictures/family/father/newyork/albany/holiday/christmas/2002/july/Oldman.jpg
10 pictures/family/father/newyork/albany/holiday/christmas/2002/june/Oldman2.jpg
This table contains 1 million file entries.
What's a fast & optimized way to parse this data and move the folder structure into a new table as described above?
In this demo the folders are separated by "/" and moved into a new column, if that helps.
回答1:
SQL lacks the flexibility and tools of a programming language, which would provide a fast and optimized solution to this problem.
Moreover SQLite is the poorest of the databases when it comes to string manipulation because it does not support a function like SQL Server's STRING_SPLIT() or MySql's SUBSTRING_INDEX() which would be very helpful.
Nevertheless the problem is interesting and I gave it a shot.
I create the table dir_struct
with this statement:
CREATE TABLE dir_struct (
Directory INTEGER PRIMARY KEY,
Directory_Parent INTEGER REFERENCES dir_struct(Directory),
Value TEXT
);
and I insert the 'root'
row:
INSERT INTO dir_struct (Directory, Directory_Parent, Value) VALUES (0, null, 'root');
Also, I turn OFF
foreign key enforcement with:
PRAGMA foreign_keys = OFF;
although it is by default turned off, just in case.
First you need a recursive CTE that splits the paths to individual directories (much like the answer on your previous question).
Then in the 2nd CTE, with conditional aggregation, each directory goes in its own column (with the limitation of max 10 directories).
The 3d CTE removes duplicates and the 4th CTE assigns, with ROW_NUMBER()
window function, unique ids to the directories.
Finally with a self join of the results of the 4th CTE, the rows are inserted in the table:
WITH
split AS (
SELECT 0 idx,
FileDataID,
SUBSTR(SUBSTR(Path, 1), 1, INSTR(SUBSTR(Path, 1), '/') - 1) item,
SUBSTR(SUBSTR(Path, 1), INSTR(SUBSTR(Path, 1), '/') + 1) value
FROM listfile
UNION ALL
SELECT idx + 1,
FileDataID,
SUBSTR(value, 1, INSTR(value, '/') - 1),
SUBSTR(value, INSTR(value, '/') + 1)
FROM split
WHERE value LIKE '%_/_%'
),
cols AS (
SELECT DISTINCT
MAX(CASE WHEN idx = 0 THEN item END) path0,
MAX(CASE WHEN idx = 1 THEN item END) path1,
MAX(CASE WHEN idx = 2 THEN item END) path2,
MAX(CASE WHEN idx = 3 THEN item END) path3,
MAX(CASE WHEN idx = 4 THEN item END) path4,
MAX(CASE WHEN idx = 5 THEN item END) path5,
MAX(CASE WHEN idx = 6 THEN item END) path6,
MAX(CASE WHEN idx = 7 THEN item END) path7,
MAX(CASE WHEN idx = 8 THEN item END) path8,
MAX(CASE WHEN idx = 9 THEN item END) path9
FROM split
GROUP BY FileDataID
),
paths AS (
SELECT path0, path1, path2, path3, path4, path5, path6, path7, path8, path9 FROM cols UNION
SELECT path0, path1, path2, path3, path4, path5, path6, path7, path8, null FROM cols UNION
SELECT path0, path1, path2, path3, path4, path5, path6, path7, null, null FROM cols UNION
SELECT path0, path1, path2, path3, path4, path5, path6, null, null, null FROM cols UNION
SELECT path0, path1, path2, path3, path4, path5, null, null, null, null FROM cols UNION
SELECT path0, path1, path2, path3, path4, null, null, null, null, null FROM cols UNION
SELECT path0, path1, path2, path3, null, null, null, null, null, null FROM cols UNION
SELECT path0, path1, path2, null, null, null, null, null, null, null FROM cols UNION
SELECT path0, path1, null, null, null, null, null, null, null, null FROM cols UNION
SELECT path0, null, null, null, null, null, null, null, null, null FROM cols
),
ids AS (
SELECT *,
ROW_NUMBER() OVER (ORDER BY path0, path1, path2, path3, path4, path5, path6, path7, path8, path9) nr,
COALESCE(path9, path8, path7, path6, path5, path4, path3, path2, path1, path0) last_child,
path0 || COALESCE('/' || path1, '') ||
COALESCE('/' || path2, '') ||
COALESCE('/' || path3, '') ||
COALESCE('/' || path4, '') ||
COALESCE('/' || path5, '') ||
COALESCE('/' || path6, '') ||
COALESCE('/' || path7, '') ||
COALESCE('/' || path8, '') ||
COALESCE('/' || path9, '') full_path
FROM paths
)
INSERT INTO dir_struct(Directory, Directory_Parent, Value)
SELECT i1.nr, COALESCE(i2.nr, 0), i1.last_child
FROM ids i1 LEFT JOIN ids i2
ON i1.full_path = i2.full_path || '/' || i1.last_child
In my test dataset which consists of 187365 rows, the rows were inserted in (average) 9.5-10 minutes, which will be much longer for your larger dataset.
See the demo.
What is more interesting is that with simpler code, the performance is worse (but you can also test it):
WITH
split AS (
SELECT Path,
0 parent_len,
SUBSTR(SUBSTR(Path, 1), 1, INSTR(SUBSTR(Path, 1), '/') - 1) item,
SUBSTR(SUBSTR(Path, 1), INSTR(SUBSTR(Path, 1), '/') + 1) value
FROM listfile
UNION ALL
SELECT Path,
parent_len + LENGTH(item) + 1,
SUBSTR(value, 1, INSTR(value, '/') - 1),
SUBSTR(value, INSTR(value, '/') + 1)
FROM split
WHERE value LIKE '%_/_%'
),
row_numbers AS (
SELECT parent_path, item,
ROW_NUMBER() OVER (ORDER BY parent_path, item) rn
FROM (SELECT DISTINCT SUBSTR(Path, 1, parent_len) parent_path, item FROM split)
)
INSERT INTO dir_struct(Directory, Directory_Parent, Value)
SELECT r1.rn, COALESCE(r2.rn, 0) rn_parent, r1.item
FROM row_numbers r1 LEFT JOIN row_numbers r2
ON r1.parent_path = r2.parent_path || r2.item || '/'
The ids assigned to the directories by this query are different than the ones assigned by the first solution, but they are correct and unique.
This runs in (average) 14-15 minutes.
See the demo.
The conclusion is that if this is a one time thing, maybe you can use it, but I would not recommend it as a solution to this requirement.
来源:https://stackoverflow.com/questions/65769179/sqlite-create-directory-structure-table-from-a-list-of-paths