Databases are intended to efficiently store and query data. The problem is, there are many different types of data we can store: numbers, strings, JSON, geometrical data. Databases use different methods to store different types of data - table structure, indexes. Not always the same way of storing and querying the data is efficient for all of its types, making it quite hard to use one-fits-all solution. As a result, databases try to use different approaches for different data types. For example, in MySQL or MariaDB we have generic, well performing solution like InnoDB, which works fine in majority of the cases, but we also have separate functions to work with JSON data, separate spatial indexes to speed up querying geometric data or fulltext indexes, helping with text data. In this blog, we will take a look at how MariaDB can be used to work with full text data.
Regular B+Tree indexes in InnoDB can also be used to speed up searches for the text data. The main issue is that, due to their structure and nature, they can only help with search for the leftmost prefixes. It is also expensive to index large volumes of text (which, given the limitations of the leftmost prefix, doesn’t really make sense). Why? Let’s take a look at a simple example. We have the following sentence:
“The quick brown fox jumps over the lazy dog”
Using regular indexes in InnoDB we can index the full sentence:
“The quick brown fox jumps over the lazy dog”
The point is, when looking for this data, we have to lookup full leftmost prefix. So a query like:
SELECT text FROM mytable WHERE sentence LIKE “The quick brown fox jumps”;
Will benefit from this index but a query like:
SELECT text FROM mytable WHERE sentence LIKE “quick brown fox jumps”;
Will not. There’s no entry in the index that starts from ‘quick’. There’s an entry in the index that contains ‘quick’ but starts from ‘The’, thus it cannot be used. As a result, it is virtually impossible to efficiently query text data using B+Tree indexes. Luckily, both MyISAM and InnoDB have implemented FULLTEXT indexes, which can be used to actually work with text data on MariaDB. The syntax is slightly different than with regular SELECTs, let’s take a look at what we can do with them. As for data we used random index file from the dump of Wikipedia database. The data structure is as below:
617:11539268:Arthur Hamerschlag
617:11539269:Rooster Cogburn (character)
617:11539275:Membership function
617:11539282:Secondarily Generalized Tonic-Clonic Seizures
617:11539283:Corporate Challenge
617:11539285:Perimeter Mall
617:11539286:1994 St. Louis Cardinals season
As a result, we created table with two BIG INT columns and one VARCHAR.
MariaDB [(none)]> CREATE TABLE ft_data.ft_table (c1 BIGINT, c2 BIGINT, c3 VARCHAR, PRIMARY KEY (c1, c2);
Afterwards we loaded the data:
MariaDB [ft_data]> LOAD DATA INFILE '/vagrant/enwiki-20190620-pages-articles-multistream-index17.txt-p11539268p13039268' IGNORE INTO TABLE ft_table COLUMNS TERMINATED BY ':';
MariaDB [ft_data]> ALTER TABLE ft_table ADD FULLTEXT INDEX idx_ft (c3);
Query OK, 0 rows affected (5.497 sec)
Records: 0 Duplicates: 0 Warnings: 0
We also created the FULLTEXT index. As you can see, the syntax for that is similar to regular index, we just had to pass the information about the index type as it defaults to B+Tree. Then we were ready to run some queries.
MariaDB [ft_data]> SELECT * FROM ft_data.ft_table WHERE MATCH(c3) AGAINST ('Starship');
+-----------+----------+------------------------------------+
| c1 | c2 | c3 |
+-----------+----------+------------------------------------+
| 119794610 | 12007923 | Starship Troopers 3 |
| 250627749 | 12479782 | Miranda class starship (Star Trek) |
| 250971304 | 12481409 | Starship Hospital |
| 253430758 | 12489743 | Starship Children's Hospital |
+-----------+----------+------------------------------------+
4 rows in set (0.009 sec)
As you can see, the syntax for the SELECT is slightly different than what we are used to. For fulltext search you should use MATCH() … AGAINST () syntax, where in MATCH() you pass the column or columns you want to search and in AGAINST() you pass coma-delimited list of keywords. You can see from the output that by default search is case insensitive and it searches the whole string, not just the beginning as it is with B+Tree indexes. Let’s compare how it will look like if we would add normal index on the ‘c3’ column - FULLTEXT and B+Tree indexes can coexist on the same column without any problems. Which would be used is decided based on the SELECT syntax.
MariaDB [ft_data]> ALTER TABLE ft_data.ft_table ADD INDEX idx_c3 (c3);
Query OK, 0 rows affected (1.884 sec)
Records: 0 Duplicates: 0 Warnings: 0
After the index has been created, let’s take a look at the search output:
MariaDB [ft_data]> SELECT * FROM ft_data.ft_table WHERE c3 LIKE 'Starship%';
+-----------+----------+------------------------------+
| c1 | c2 | c3 |
+-----------+----------+------------------------------+
| 253430758 | 12489743 | Starship Children's Hospital |
| 250971304 | 12481409 | Starship Hospital |
| 119794610 | 12007923 | Starship Troopers 3 |
+-----------+----------+------------------------------+
3 rows in set (0.001 sec)
As you can see, our query returned only three rows. This is expected as we are looking for rows which only start with a string ‘Starship’.
MariaDB [ft_data]> EXPLAIN SELECT * FROM ft_data.ft_table WHERE c3 LIKE 'Starship%'\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: ft_table
type: range
possible_keys: idx_c3,idx_ft
key: idx_c3
key_len: 103
ref: NULL
rows: 3
Extra: Using where; Using index
1 row in set (0.000 sec)
When we check the EXPLAIN output we can see that the index has been used to lookup for the data. But what if we want to look for all the rows which contain the string ‘Starship’, no matter if it is at the beginning or not. We have to write the following query:
MariaDB [ft_data]> SELECT * FROM ft_data.ft_table WHERE c3 LIKE '%Starship%';
+-----------+----------+------------------------------------+
| c1 | c2 | c3 |
+-----------+----------+------------------------------------+
| 250627749 | 12479782 | Miranda class starship (Star Trek) |
| 253430758 | 12489743 | Starship Children's Hospital |
| 250971304 | 12481409 | Starship Hospital |
| 119794610 | 12007923 | Starship Troopers 3 |
+-----------+----------+------------------------------------+
4 rows in set (0.084 sec)
The output matches what we got from the fulltext search.
MariaDB [ft_data]> EXPLAIN SELECT * FROM ft_data.ft_table WHERE c3 LIKE '%Starship%'\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: ft_table
type: index
possible_keys: NULL
key: idx_c3
key_len: 103
ref: NULL
rows: 473367
Extra: Using where; Using index
1 row in set (0.000 sec)
The EXPLAIN is different though - as you can see it still uses index but this time it does a full index scan. That is possible as we indexed full c3 column so all the data is available in the index. Index scan will result in random reads from the table but for such small table MariaDB decided it more efficient than reading the whole table. Please note the execution time: 0.084s for our regular SELECT. Comparing this to fulltext query, it is bad:
MariaDB [ft_data]> SELECT * FROM ft_data.ft_table WHERE MATCH(c3) AGAINST ('Starship');
+-----------+----------+------------------------------------+
| c1 | c2 | c3 |
+-----------+----------+------------------------------------+
| 119794610 | 12007923 | Starship Troopers 3 |
| 250627749 | 12479782 | Miranda class starship (Star Trek) |
| 250971304 | 12481409 | Starship Hospital |
| 253430758 | 12489743 | Starship Children's Hospital |
+-----------+----------+------------------------------------+
4 rows in set (0.001 sec)
As you can see, query which use FULLTEXT index took 0.001s to execute. We are talking here about orders of magnitude differences.
MariaDB [ft_data]> EXPLAIN SELECT * FROM ft_data.ft_table WHERE MATCH(c3) AGAINST ('Starship')\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: ft_table
type: fulltext
possible_keys: idx_ft
key: idx_ft
key_len: 0
ref:
rows: 1
Extra: Using where
1 row in set (0.000 sec)
Here’s how the EXPLAIN output look like for the query using FULLTEXT index - that fact is indicated by type: fulltext.
Fulltext queries also have some other features. It is possible, for example, to return rows which might be relevant to the search term. MariaDB looks for words located near the row that you search for and then run a search also for them.
MariaDB [(none)]> SELECT * FROM ft_data.ft_table WHERE MATCH(c3) AGAINST ('Starship');
+-----------+----------+------------------------------------+
| c1 | c2 | c3 |
+-----------+----------+------------------------------------+
| 119794610 | 12007923 | Starship Troopers 3 |
| 250627749 | 12479782 | Miranda class starship (Star Trek) |
| 250971304 | 12481409 | Starship Hospital |
| 253430758 | 12489743 | Starship Children's Hospital |
+-----------+----------+------------------------------------+
4 rows in set (0.001 sec)
In our case, word ‘Starship’ can be related to words like ‘Troopers’, ‘class’, ‘Star Trek’, ‘Hospital’ etc. To use this feature we should run the query with “WITH QUERY EXPANSION” modifyer:
MariaDB [(none)]> SELECT * FROM ft_data.ft_table WHERE MATCH(c3) AGAINST ('Starship' WITH QUERY EXPANSION) LIMIT 10;
+-----------+----------+-------------------------------------+
| c1 | c2 | c3 |
+-----------+----------+-------------------------------------+
| 250627749 | 12479782 | Miranda class starship (Star Trek) |
| 119794610 | 12007923 | Starship Troopers 3 |
| 253430758 | 12489743 | Starship Children's Hospital |
| 250971304 | 12481409 | Starship Hospital |
| 277700214 | 12573467 | Star ship troopers |
| 86748633 | 11886457 | Troopers Drum and Bugle Corps |
| 255120817 | 12495666 | Casper Troopers |
| 396408580 | 13014545 | Battle Android Troopers |
| 12453401 | 11585248 | Star trek tos |
| 21380240 | 11622781 | Who Mourns for Adonais? (Star Trek) |
+-----------+----------+-------------------------------------+
10 rows in set (0.002 sec)
The output contained large number of rows but this sample is enough to see how it works. The query returned rows like:
“Troopers Drum and Bugle Corps”
“Battle Android Troopers”
Those are based on the search for the word ‘Troopers’. It also returned rows with strings like:
“Star trek tos”
“Who Mourns for Adonais? (Star Trek)”
Which, obviously, are based on the lookup for the word ‘Start Trek’.
If you would need more control over the term you want to search for, you can use “IN BOOLEAN MODE”. It allows to use additional operators. The full list is in the documentation, we’ll show just a couple of examples.
Let’s say we want to search not just for the word ‘Star’ but also for other words which start with the string ‘Star’:
MariaDB [(none)]> SELECT * FROM ft_data.ft_table WHERE MATCH(c3) AGAINST ('Star*' IN BOOLEAN MODE) LIMIT 10;
+----------+----------+---------------------------------------------------+
| c1 | c2 | c3 |
+----------+----------+---------------------------------------------------+
| 20014704 | 11614055 | Ringo Starr and His third All-Starr Band-Volume 1 |
| 154810 | 11539775 | Rough blazing star |
| 154810 | 11539787 | Great blazing star |
| 234851 | 11540119 | Mary Star of the Sea High School |
| 325782 | 11540427 | HMS Starfish (19S) |
| 598616 | 11541589 | Dwarf (star) |
| 1951655 | 11545092 | Yellow starthistle |
| 2963775 | 11548654 | Hydrogenated starch hydrolysates |
| 3248823 | 11549445 | Starbooty |
| 3993625 | 11553042 | Harvest of Stars |
+----------+----------+---------------------------------------------------+
10 rows in set (0.001 sec)
As you can see, in the output we have rows that contain strings like ‘Stars’, ‘Starfish’ or ‘starch’.
Another use case for the BOOLEAN mode. Let’s say we want to search for rows which are relevant to the House of Representatives in Pennsylvania. If we will run regular query, we will get results somehow related to any of those strings:
MariaDB [ft_data]> SELECT COUNT(*) FROM ft_data.ft_table WHERE MATCH(c3) AGAINST ('House, Representatives, Pennsylvania');
+----------+
| COUNT(*) |
+----------+
| 1529 |
+----------+
1 row in set (0.005 sec)
MariaDB [ft_data]> SELECT * FROM ft_data.ft_table WHERE MATCH(c3) AGAINST ('House, Representatives, Pennsylvania') LIMIT 20;
+-----------+----------+--------------------------------------------------------------------------+
| c1 | c2 | c3 |
+-----------+----------+--------------------------------------------------------------------------+
| 198783294 | 12289308 | Pennsylvania House of Representatives, District 175 |
| 236302417 | 12427322 | Pennsylvania House of Representatives, District 156 |
| 236373831 | 12427423 | Pennsylvania House of Representatives, District 158 |
| 282031847 | 12588702 | Pennsylvania House of Representatives, District 47 |
| 282031847 | 12588772 | Pennsylvania House of Representatives, District 196 |
| 282031847 | 12588864 | Pennsylvania House of Representatives, District 92 |
| 282031847 | 12588900 | Pennsylvania House of Representatives, District 93 |
| 282031847 | 12588904 | Pennsylvania House of Representatives, District 94 |
| 282031847 | 12588909 | Pennsylvania House of Representatives, District 193 |
| 303827502 | 12671054 | Pennsylvania House of Representatives, District 55 |
| 303827502 | 12671089 | Pennsylvania House of Representatives, District 64 |
| 337545922 | 12797838 | Pennsylvania House of Representatives, District 95 |
| 219202000 | 12366957 | United States House of Representatives House Resolution 121 |
| 277521229 | 12572732 | United States House of Representatives proposed House Resolution 121 |
| 20923615 | 11618759 | Special elections to the United States House of Representatives |
| 20923615 | 11618772 | List of Special elections to the United States House of Representatives |
| 37794558 | 11693157 | Nebraska House of Representatives |
| 39430531 | 11699551 | Belgian House of Representatives |
| 53779065 | 11756435 | List of United States House of Representatives elections in North Dakota |
| 54048114 | 11757334 | 2008 United States House of Representatives election in North Dakota |
+-----------+----------+--------------------------------------------------------------------------+
20 rows in set (0.003 sec)
As you can see, we found some useful data but we also found data which is totally not relevant to our search. Luckily, we can refine such query:
MariaDB [ft_data]> SELECT * FROM ft_data.ft_table WHERE MATCH(c3) AGAINST ('+House, +Representatives, +Pennsylvania' IN BOOLEAN MODE);
+-----------+----------+-----------------------------------------------------+
| c1 | c2 | c3 |
+-----------+----------+-----------------------------------------------------+
| 198783294 | 12289308 | Pennsylvania House of Representatives, District 175 |
| 236302417 | 12427322 | Pennsylvania House of Representatives, District 156 |
| 236373831 | 12427423 | Pennsylvania House of Representatives, District 158 |
| 282031847 | 12588702 | Pennsylvania House of Representatives, District 47 |
| 282031847 | 12588772 | Pennsylvania House of Representatives, District 196 |
| 282031847 | 12588864 | Pennsylvania House of Representatives, District 92 |
| 282031847 | 12588900 | Pennsylvania House of Representatives, District 93 |
| 282031847 | 12588904 | Pennsylvania House of Representatives, District 94 |
| 282031847 | 12588909 | Pennsylvania House of Representatives, District 193 |
| 303827502 | 12671054 | Pennsylvania House of Representatives, District 55 |
| 303827502 | 12671089 | Pennsylvania House of Representatives, District 64 |
| 337545922 | 12797838 | Pennsylvania House of Representatives, District 95 |
+-----------+----------+-----------------------------------------------------+
12 rows in set (0.001 sec)
As you can see, by adding ‘+’ operator we made it clear we are interested only in the output where given word exists. As a result the data we got in response is exactly what we were looking for.
We can also exclude words from the search. Let’s say that we are looking for flying things but our search results are contaminated by different flying animals we are not interested in. We can easily get rid of foxes, squirrels and frogs:
MariaDB [ft_data]> SELECT * FROM ft_data.ft_table WHERE MATCH(c3) AGAINST ('+flying -fox* -squirrel* -frog*' IN BOOLEAN MODE) LIMIT 10;
+----------+----------+-----------------------------------------------------+
| c1 | c2 | c3 |
+----------+----------+-----------------------------------------------------+
| 13340153 | 11587884 | List of surviving Boeing B-17 Flying Fortresses |
| 16774061 | 11600031 | Flying Dutchman Funicular |
| 23137426 | 11631421 | 80th Flying Training Wing |
| 26477490 | 11646247 | Kites and Kite Flying |
| 28568750 | 11655638 | Fear of Flying |
| 28752660 | 11656721 | Flying Machine (song) |
| 31375047 | 11666654 | Flying Dutchman (train) |
| 32726276 | 11672784 | Flying Wazuma |
| 47115925 | 11728593 | The Flying Locked Room! Kudou Shinichi's First Case |
| 64330511 | 11796326 | The Church of the Flying Spaghetti Monster |
+----------+----------+-----------------------------------------------------+
10 rows in set (0.001 sec)
Final feature we would like to show is the ability to search for the exact quote:
MariaDB [ft_data]> SELECT * FROM ft_data.ft_table WHERE MATCH(c3) AGAINST ('"People\'s Republic of China"' IN BOOLEAN MODE) LIMIT 10;
+-----------+----------+------------------------------------------------------------------------------------------------------+
| c1 | c2 | c3 |
+-----------+----------+------------------------------------------------------------------------------------------------------+
| 12093896 | 11583713 | Religion in the People's Republic of China |
| 25280224 | 11640533 | Political rankings in the People's Republic of China |
| 43930887 | 11716084 | Cuisine of the People's Republic of China |
| 62272294 | 11789886 | Office of the Commissioner of the Ministry of Foreign Affairs of the People's Republic of China in t |
| 70970904 | 11824702 | Scouting in the People's Republic of China |
| 154301063 | 12145003 | Tibetan culture under the People's Republic of China |
| 167640800 | 12189851 | Product safety in the People's Republic of China |
| 172735782 | 12208560 | Agriculture in the people's republic of china |
| 176185516 | 12221117 | Special Economic Zone of the People's Republic of China |
| 197034766 | 12282071 | People's Republic of China and the United Nations |
+-----------+----------+------------------------------------------------------------------------------------------------------+
10 rows in set (0.001 sec)
As you can see, fulltext search in MariaDB works quite well, it is also faster and more flexible than search using B+Tree indexes. Please keep in mind though that this is by no means a way of handling large volumes of data - with the data growth, feasibility of this solution will reduce. Still, for the small data sets this solution is perfectly valid. It can definitely buy you more time to, eventually, implement dedicated full text search solutions like Sphinx or Lucene. Of course, all of the features we described are available in MariaDB clusters deployed from ClusterControl.