I want to make a tags column of type json, e.g.:
id | tags
=========================================
1 | \'["tag1"
By "extracts a scalar value", does this mean I must extract & index each item in the arrays individually [...]?
You can extract as many items as you want. They will be stored as scalars (e.g. string), rather than as compound values (which JSON is).
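To make the scalar-versus-compound distinction concrete, here is a small sketch using JSON_TYPE on literal values. The literals are made up for illustration and do not depend on any table.

```sql
-- A single path yields one element of the array: a JSON scalar.
SELECT JSON_TYPE(JSON_EXTRACT('["tag1", "tag2"]', '$[0]'));          -- STRING

-- The whole document is a compound value.
SELECT JSON_TYPE('["tag1", "tag2"]');                                -- ARRAY

-- Several paths at once are wrapped into a new array, so the result is compound again.
SELECT JSON_TYPE(JSON_EXTRACT('["tag1", "tag2"]', '$[0]', '$[1]'));  -- ARRAY
```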
CREATE TABLE mytags (
id INT NOT NULL AUTO_INCREMENT,
tags JSON,
PRIMARY KEY (id)
);
INSERT INTO mytags (tags) VALUES
('["tag1", "tag2", "tag3"]'),
('["tag1", "tag3", "tag5", "tag7"]'),
('["tag2", "tag5"]');
SELECT * FROM mytags;
+----+----------------------------------+
| id | tags |
+----+----------------------------------+
| 1 | ["tag1", "tag2", "tag3"] |
| 2 | ["tag1", "tag3", "tag5", "tag7"] |
| 3 | ["tag2", "tag5"] |
+----+----------------------------------+
Let's create an index on a single item (the first value in the JSON array):
ALTER TABLE mytags
ADD COLUMN tags_scalar VARCHAR(255) GENERATED ALWAYS AS (json_extract(tags, '$[0]')),
ADD INDEX tags_index (tags_scalar);
SELECT * FROM mytags;
+----+----------------------------------+-------------+
| id | tags | tags_scalar |
+----+----------------------------------+-------------+
| 1 | ["tag1", "tag2", "tag3"] | "tag1" |
| 2 | ["tag1", "tag3", "tag5", "tag7"] | "tag1" |
| 3 | ["tag2", "tag5"] | "tag2" |
+----+----------------------------------+-------------+
Now you have an index on the VARCHAR column tags_scalar. The value still contains the JSON double quotes, which can be stripped with json_unquote:
ALTER TABLE mytags DROP COLUMN tags_scalar, DROP INDEX tags_index;
ALTER TABLE mytags
ADD COLUMN tags_scalar VARCHAR(255) GENERATED ALWAYS AS (json_unquote(json_extract(tags, '$[0]'))),
ADD INDEX tags_index (tags_scalar);
SELECT * FROM mytags;
+----+----------------------------------+-------------+
| id | tags | tags_scalar |
+----+----------------------------------+-------------+
| 1 | ["tag1", "tag2", "tag3"] | tag1 |
| 2 | ["tag1", "tag3", "tag5", "tag7"] | tag1 |
| 3 | ["tag2", "tag5"] | tag2 |
+----+----------------------------------+-------------+
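With the unquoted values in place, a lookup on the generated column might look like the sketch below. It assumes the mytags table and the tags_index index exactly as defined above. Whether the optimizer actually picks the index on a three-row table is not guaranteed, which is why the EXPLAIN is included.

```sql
-- Equality lookup on the generated column; the literal carries no JSON quotes.
-- With the sample data this matches rows 1 and 2.
SELECT id, tags
FROM mytags
WHERE tags_scalar = 'tag1';

-- Check the "key" column of the plan to see whether tags_index is used.
EXPLAIN
SELECT id, tags
FROM mytags
WHERE tags_scalar = 'tag1';
```

Had the quotes been kept, as in the previous version of the column, the literal would need to include them: '"tag1"'.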
As you can already imagine, the generated column can include more items from the JSON:
ALTER TABLE mytags DROP COLUMN tags_scalar, DROP INDEX tags_index;
ALTER TABLE mytags
ADD COLUMN tags_scalar VARCHAR(255) GENERATED ALWAYS AS (json_extract(tags, '$[0]', '$[1]', '$[2]')),
ADD INDEX tags_index (tags_scalar);
SELECT * FROM mytags;
+----+----------------------------------+--------------------------+
| id | tags | tags_scalar |
+----+----------------------------------+--------------------------+
| 1 | ["tag1", "tag2", "tag3"] | ["tag1", "tag2", "tag3"] |
| 2 | ["tag1", "tag3", "tag5", "tag7"] | ["tag1", "tag3", "tag5"] |
| 3 | ["tag2", "tag5"] | ["tag2", "tag5"] |
+----+----------------------------------+--------------------------+
Or use any other valid expression to generate a string from the JSON structure, so that you get something that is easy to index and search, such as "tag1tag3tag5tag7".
[...](meaning I must know the maximum length of the array to index them all)?
As explained above, you don't need to know: a path that does not exist yields NULL, and NULL values can be skipped by using a suitable expression. But of course it's always better to know.
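One possible expression of that kind, shown here as a plain SELECT so it does not alter the table: CONCAT_WS ignores NULL arguments, so arrays shorter than the number of listed paths simply produce shorter strings. The alias tags_joined and the choice of four paths are assumptions made for this sketch.

```sql
-- Extract up to four elements, unquote each, and join the non-NULL ones.
-- Missing positions (e.g. '$[3]' in a three-element array) come back as NULL
-- and are skipped by CONCAT_WS.
SELECT id,
       CONCAT_WS(',',
           JSON_UNQUOTE(JSON_EXTRACT(tags, '$[0]')),
           JSON_UNQUOTE(JSON_EXTRACT(tags, '$[1]')),
           JSON_UNQUOTE(JSON_EXTRACT(tags, '$[2]')),
           JSON_UNQUOTE(JSON_EXTRACT(tags, '$[3]'))
       ) AS tags_joined
FROM mytags;
-- id 1: tag1,tag2,tag3
-- id 2: tag1,tag3,tag5,tag7
-- id 3: tag2,tag5
```

The same expression could be placed inside GENERATED ALWAYS AS (...) to back an indexed column. An empty separator would give the "tag1tag3tag5tag7" form, at the cost of no longer being able to tell where one tag ends and the next begins.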
Now comes the architecture decision: is the JSON data type the most appropriate way to solve this particular problem? Is JSON the right tool here? Will it speed up searching?
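For comparison, a conventional alternative to a JSON column is one row per tag in a separate table. This is only a sketch of that option, not something the question asked for. The table name item_tags, its column names, and the VARCHAR(64) length are hypothetical.

```sql
-- Hypothetical normalized layout: one row per (item, tag) pair.
CREATE TABLE item_tags (
    item_id INT NOT NULL,
    tag     VARCHAR(64) NOT NULL,
    PRIMARY KEY (item_id, tag),
    INDEX tag_index (tag)
);

-- The same sample data as in mytags, flattened.
INSERT INTO item_tags (item_id, tag) VALUES
    (1, 'tag1'), (1, 'tag2'), (1, 'tag3'),
    (2, 'tag1'), (2, 'tag3'), (2, 'tag5'), (2, 'tag7'),
    (3, 'tag2'), (3, 'tag5');

-- Any tag, at any position, is an ordinary indexed equality lookup.
-- With the sample data this returns item_id 2 and 3.
SELECT item_id
FROM item_tags
WHERE tag = 'tag5';
```

Here the array length never matters, because each tag is its own indexed scalar.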
How do I index a variable length array?
If you insist, cast the JSON to a string:
ALTER TABLE mytags DROP COLUMN tags_scalar, DROP INDEX tags_index;
ALTER TABLE mytags
ADD COLUMN tags_scalar VARCHAR(255) GENERATED ALWAYS AS (replace(replace(replace(cast(tags as char), '"', ''), '[', ''), ']', '')),
ADD INDEX tags_index (tags_scalar);
SELECT * FROM mytags;
+----+----------------------------------+------------------------+
| id | tags | tags_scalar |
+----+----------------------------------+------------------------+
| 1 | ["tag1", "tag2", "tag3"] | tag1, tag2, tag3 |
| 2 | ["tag1", "tag3", "tag5", "tag7"] | tag1, tag3, tag5, tag7 |
| 3 | ["tag2", "tag5"] | tag2, tag5 |
+----+----------------------------------+------------------------+
One way or another, you end up with a VARCHAR or TEXT column, to which you apply the most suitable index structure (some options).
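How much an ordinary index helps on such a flattened string depends on the search pattern. The sketch below assumes the tags_scalar column as defined in the last ALTER TABLE above, holding values like "tag1, tag2, tag3".

```sql
-- Prefix pattern: a regular (B-tree) index can be used, because the
-- leading characters are fixed. Matches rows 1 and 2 with the sample data.
SELECT id, tags_scalar
FROM mytags
WHERE tags_scalar LIKE 'tag1,%';

-- Leading wildcard: the index cannot narrow the search, so expect a scan.
-- Matches rows 2 and 3 with the sample data.
SELECT id, tags_scalar
FROM mytags
WHERE tags_scalar LIKE '%tag5%';
```

Running EXPLAIN on both statements shows the difference in the chosen access path. Note that a substring pattern such as '%tag5%' would also match a longer tag like "tag50".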