I want to make a tags column of type JSON, e.g.:

id | tags
=========================================
 1 | '["tag1", ...]'
By "extracts a scalar value", does this mean I must extract & index each item in the arrays individually [...]?
You can extract as many items as you want. They will be stored as scalars (e.g., strings) rather than as compound values (which JSON values are).
CREATE TABLE mytags (
id INT NOT NULL AUTO_INCREMENT,
tags JSON,
PRIMARY KEY (id)
);
INSERT INTO mytags (tags) VALUES
('["tag1", "tag2", "tag3"]'),
('["tag1", "tag3", "tag5", "tag7"]'),
('["tag2", "tag5"]');
SELECT * FROM mytags;
+----+----------------------------------+
| id | tags |
+----+----------------------------------+
| 1 | ["tag1", "tag2", "tag3"] |
| 2 | ["tag1", "tag3", "tag5", "tag7"] |
| 3 | ["tag2", "tag5"] |
+----+----------------------------------+
Let's create an index on one item only (the first value from the JSON array):
ALTER TABLE mytags
ADD COLUMN tags_scalar VARCHAR(255) GENERATED ALWAYS AS (json_extract(tags, '$[0]')),
ADD INDEX tags_index (tags_scalar);
SELECT * FROM mytags;
+----+----------------------------------+-------------+
| id | tags | tags_scalar |
+----+----------------------------------+-------------+
| 1 | ["tag1", "tag2", "tag3"] | "tag1" |
| 2 | ["tag1", "tag3", "tag5", "tag7"] | "tag1" |
| 3 | ["tag2", "tag5"] | "tag2" |
+----+----------------------------------+-------------+
Now you have an index on the VARCHAR column tags_scalar. The extracted value still carries the JSON quotes, which can be removed with JSON_UNQUOTE():
ALTER TABLE mytags DROP COLUMN tags_scalar, DROP INDEX tags_index;
ALTER TABLE mytags
ADD COLUMN tags_scalar VARCHAR(255) GENERATED ALWAYS AS (json_unquote(json_extract(tags, '$[0]'))),
ADD INDEX tags_index (tags_scalar);
SELECT * FROM mytags;
+----+----------------------------------+-------------+
| id | tags | tags_scalar |
+----+----------------------------------+-------------+
| 1 | ["tag1", "tag2", "tag3"] | tag1 |
| 2 | ["tag1", "tag3", "tag5", "tag7"] | tag1 |
| 3 | ["tag2", "tag5"] | tag2 |
+----+----------------------------------+-------------+
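To see the index at work (just a sketch; the actual plan depends on table size and MySQL version), filter on the generated column:

-- tags_index should show up under possible_keys; with only three rows the
-- optimizer may still choose a full scan.
EXPLAIN SELECT * FROM mytags WHERE tags_scalar = 'tag1';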
As you can already imagine, the generated column can include more items from the JSON:
ALTER TABLE mytags DROP COLUMN tags_scalar, DROP INDEX tags_index;
ALTER TABLE mytags
ADD COLUMN tags_scalar VARCHAR(255) GENERATED ALWAYS AS (json_extract(tags, '$[0]', '$[1]', '$[2]')),
ADD INDEX tags_index (tags_scalar);
SELECT * FROM mytags;
+----+----------------------------------+--------------------------+
| id | tags | tags_scalar |
+----+----------------------------------+--------------------------+
| 1 | ["tag1", "tag2", "tag3"] | ["tag1", "tag2", "tag3"] |
| 2 | ["tag1", "tag3", "tag5", "tag7"] | ["tag1", "tag3", "tag5"] |
| 3 | ["tag2", "tag5"] | ["tag2", "tag5"] |
+----+----------------------------------+--------------------------+
Or you can use any other valid expression to auto-generate a string from the JSON structure, so you end up with something that can be easily indexed and searched, like "tag1tag3tag5tag7".
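For example, one possible expression (just a sketch, one of many you could write) strips all the JSON punctuation so the generated value becomes exactly that kind of concatenated string:

ALTER TABLE mytags DROP COLUMN tags_scalar, DROP INDEX tags_index;
ALTER TABLE mytags
    -- remove quotes, brackets, commas and spaces, leaving e.g. 'tag1tag3tag5tag7'
    ADD COLUMN tags_scalar VARCHAR(255) GENERATED ALWAYS AS
        (replace(replace(replace(replace(replace(cast(tags as char), '"', ''), '[', ''), ']', ''), ',', ''), ' ', '')),
    ADD INDEX tags_index (tags_scalar);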
[...] (meaning I must know the maximum length of the array to index them all)?
As explained above, you don't need to know the maximum length; missing or NULL values can be skipped by using any valid expression. But of course it's always better to know.
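For instance (a quick sketch), you can list more paths than the longest array actually has; paths that don't match anything are simply left out of the result:

-- row 3 has only two elements, yet this still returns ["tag2", "tag5"] for it
SELECT json_extract(tags, '$[0]', '$[1]', '$[2]', '$[3]', '$[4]') FROM mytags;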
Now there's the architecture decision: is the JSON data type really the most appropriate one to achieve the goal and solve this particular problem? Is it the right tool here? Is it going to speed up searching?
How do I index a variable length array?
If you insist, cast the JSON to a string:
ALTER TABLE mytags DROP COLUMN tags_scalar, DROP INDEX tags_index;
ALTER TABLE mytags
ADD COLUMN tags_scalar VARCHAR(255) GENERATED ALWAYS AS (replace(replace(replace(cast(tags as char), '"', ''), '[', ''), ']', '')),
ADD INDEX tags_index (tags_scalar);
SELECT * FROM mytags;
+----+----------------------------------+------------------------+
| id | tags | tags_scalar |
+----+----------------------------------+------------------------+
| 1 | ["tag1", "tag2", "tag3"] | tag1, tag2, tag3 |
| 2 | ["tag1", "tag3", "tag5", "tag7"] | tag1, tag3, tag5, tag7 |
| 3 | ["tag2", "tag5"] | tag2, tag5 |
+----+----------------------------------+------------------------+
One way or another, you end up with a VARCHAR or TEXT column, to which you can apply whichever index structure suits your searches best.
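For instance (a rough sketch of the trade-off), a plain B-tree index on tags_scalar only helps left-anchored searches:

SELECT * FROM mytags WHERE tags_scalar LIKE 'tag1%';   -- can use tags_index
SELECT * FROM mytags WHERE tags_scalar LIKE '%tag5%';  -- leading wildcard forces a scan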
This is now possible in MySQL 8.0.17+, which added multi-valued indexes on JSON arrays. Something like this (not tested):
CREATE TABLE posts (
    id BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    tags JSON,
    -- multi-valued index over the array elements (CAST ... ARRAY)
    INDEX tags ( (CAST(tags AS CHAR(32) ARRAY)) )
);
Use it this way:
SELECT * FROM posts
WHERE JSON_CONTAINS(tags, CAST('["tag1", "tag2"]' AS JSON));
More details and samples here: https://dev.mysql.com/doc/refman/8.0/en/json.html
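A single-tag lookup can also take advantage of the multi-valued index via MEMBER OF(), available since 8.0.17 as well (again untested, same assumptions as the table above):

SELECT * FROM posts
WHERE 'tag1' MEMBER OF (tags);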
It's not practical to index an array in JSON in MySQL.
You can use generated columns to extract each element of the array into a separate scalar column, and index each generated column. But how many of these columns will you need? How will you know which column contains the value you're searching for?
You can use a generated column as @bostaf's answer shows, extracting multiple array values into a comma-separated string. But you can't use a plain index to search that string for a word that may occur in the middle. Perhaps you could use a fulltext index, but that only works if the array elements are each a single word.
In April 2018, I gave a presentation about this kind of weakness of using JSON in MySQL: How to Use JSON in MySQL Wrong.
The better solution for multi-valued attributes is to store them in a dependent table, in the manner prescribed by database normalization. The values then appear over multiple rows, in a single column, which you can index in a much more straightforward way.
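A minimal sketch of that design (names are only illustrative, not taken from the question):

-- one row per (post, tag) pair
CREATE TABLE post_tags (
    post_id BIGINT NOT NULL,
    tag     VARCHAR(64) NOT NULL,
    PRIMARY KEY (post_id, tag),  -- each tag at most once per post
    INDEX (tag)                  -- find all posts carrying a given tag
);

-- which posts have tag5?
SELECT post_id FROM post_tags WHERE tag = 'tag5';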
Re your comment:
I came up with a solution for enforcing uniqueness on a JSON array, but it depends on the elements of the array staying in sorted order.
mysql> create table mytags ( tags json );
mysql> insert into mytags values ('["tag1", "tag3", "tag5", "tag7"]');
The JSON_UNQUOTE() function returns the JSON as a string.
mysql> select json_unquote(tags) from mytags;
+----------------------------------+
| json_unquote(tags) |
+----------------------------------+
| ["tag1", "tag3", "tag5", "tag7"] |
+----------------------------------+
Now we can create a generated column based on that expression, and then put a UNIQUE KEY index on the generated column. This works in MySQL 5.7 and later.
mysql> alter table mytags
add column j varchar(768) as (json_unquote(tags)),
add unique index (j);
Now trying to insert the same array of values in the JSON column fails:
mysql> insert into mytags (tags) values ('["tag1", "tag3", "tag5", "tag7"]');
ERROR 1062 (23000): Duplicate entry '["tag1", "tag3", "tag5", "tag7"]' for key 'j'
Unfortunately, there's no good way to ensure that the JSON array is sorted (see Sorted json array field), so it's up to you to design your application code to always pre-sort the values in the JSON array before inserting or updating.
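To illustrate why the order matters (a sketch against the table above): the same set of tags in a different order produces a different string in the generated column, so the unique index does not reject it:

insert into mytags (tags) values ('["tag3", "tag1", "tag5", "tag7"]');
-- accepted: the unquoted string differs byte-for-byte from the stored
-- '["tag1", "tag3", "tag5", "tag7"]', so no duplicate-key error is raised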