Search the MySQL manual:

6.8 MySQL Full-text Search

As of Version 3.23.23, MySQL has support for full-text indexing and searching. Full-text indexes in MySQL are an index of type FULLTEXT. FULLTEXT indexes are used with MyISAM tables only and can be created from CHAR, VARCHAR, or TEXT columns at CREATE TABLE time or added later with ALTER TABLE or CREATE INDEX. For large datasets, it will be much faster to load your data into a table that has no FULLTEXT index, then create the index with ALTER TABLE (or CREATE INDEX). Loading data into a table that already has a FULLTEXT index will be slower.

Full-text searching is performed with the MATCH() function.

mysql> CREATE TABLE articles (
    ->   id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
    ->   title VARCHAR(200),
    ->   body TEXT,
    ->   FULLTEXT (title,body)
    -> );
Query OK, 0 rows affected (0.00 sec)

mysql> INSERT INTO articles VALUES
    -> (NULL,'MySQL Tutorial', 'DBMS stands for DataBase ...'),
    -> (NULL,'How To Use MySQL Efficiently', 'After you went through a ...'),
    -> (NULL,'Optimising MySQL','In this tutorial we will show ...'),
    -> (NULL,'1001 MySQL Tricks','1. Never run mysqld as root. 2. ...'),
    -> (NULL,'MySQL vs. YourSQL', 'In the following database comparison ...'),
    -> (NULL,'MySQL Security', 'When configured properly, MySQL ...');
Query OK, 6 rows affected (0.00 sec)
Records: 6  Duplicates: 0  Warnings: 0

mysql> SELECT * FROM articles
    ->          WHERE MATCH (title,body) AGAINST ('database');
+----+-------------------+------------------------------------------+
| id | title             | body                                     |
+----+-------------------+------------------------------------------+
|  5 | MySQL vs. YourSQL | In the following database comparison ... |
|  1 | MySQL Tutorial    | DBMS stands for DataBase ...             |
+----+-------------------+------------------------------------------+
2 rows in set (0.00 sec)

The MATCH() function performs a natural language search for a string against a text collection (a set of one or more columns included in a FULLTEXT index). The search string is given as the argument to AGAINST(). The search is performed in case-insensitive fashion. For every row in the table, MATCH() returns a relevance value, that is, a similarity measure between the search string and the text in that row in the columns named in the MATCH() list.

When MATCH() is used in a WHERE clause (see example above) the rows returned are automatically sorted with highest relevance first. Relevance values are non-negative floating-point numbers. Zero relevance means no similarity. Relevance is computed based on the number of words in the row, the number of unique words in that row, the total number of words in the collection, and the number of documents (rows) that contain a particular word.

It is also possible to perform a boolean mode search. This is explained later in the section.

The preceding example is a basic illustration showing how to use the MATCH() function. Rows are returned in order of decreasing relevance.

The next example shows how to retrieve the relevance values explicitly. As neither WHERE nor ORDER BY clauses are present, returned rows are not ordered.

mysql> SELECT id,MATCH (title,body) AGAINST ('Tutorial') FROM articles;
+----+-----------------------------------------+
| id | MATCH (title,body) AGAINST ('Tutorial') |
+----+-----------------------------------------+
|  1 |                        0.64840710366884 |
|  2 |                                       0 |
|  3 |                        0.66266459031789 |
|  4 |                                       0 |
|  5 |                                       0 |
|  6 |                                       0 |
+----+-----------------------------------------+
6 rows in set (0.00 sec)

The following example is more complex. The query returns the relevance and still sorts the rows in order of decreasing relevance. To achieve this result, you should specify MATCH() twice. This will cause no additional overhead, because the MySQL optimiser will notice that the two MATCH() calls are identical and invoke the full-text search code only once.

mysql> SELECT id, body, MATCH (title,body) AGAINST
    -> ('Security implications of running MySQL as root') AS score
    -> FROM articles WHERE MATCH (title,body) AGAINST
    -> ('Security implications of running MySQL as root');
+----+-------------------------------------+-----------------+
| id | body                                | score           |
+----+-------------------------------------+-----------------+
|  4 | 1. Never run mysqld as root. 2. ... | 1.5055546709332 |
|  6 | When configured properly, MySQL ... |   1.31140957288 |
+----+-------------------------------------+-----------------+
2 rows in set (0.00 sec)

MySQL uses a very simple parser to split text into words. A ``word'' is any sequence of characters consisting of letters, digits, `'', and `_'. Any ``word'' that is present in the stopword list or is just too short is ignored. The default minimum length of words that will be found by full-text searches is four characters. This can be changed as described in section 6.8.2 Fine-tuning MySQL Full-text Search.

Every correct word in the collection and in the query is weighted according to its significance in the query or collection. This way, a word that is present in many documents will have lower weight (and may even have a zero weight), because it has lower semantic value in this particular collection. Otherwise, if the word is rare, it will receive a higher weight. The weights of the words are then combined to compute the relevance of the row.

Such a technique works best with large collections (in fact, it was carefully tuned this way). For very small tables, word distribution does not reflect adequately their semantic value, and this model may sometimes produce bizarre results.

mysql> SELECT * FROM articles WHERE MATCH (title,body) AGAINST ('MySQL');
Empty set (0.00 sec)

The search for the word MySQL produces no results in the above example, because that word is present in more than half the rows. As such, it is effectively treated as a stopword (that is, a word with zero semantic value). This is the most desirable behaviour -- a natural language query should not return every second row from a 1 GB table.

A word that matches half of rows in a table is less likely to locate relevant documents. In fact, it will most likely find plenty of irrelevant documents. We all know this happens far too often when we are trying to find something on the Internet with a search engine. It is with this reasoning that such rows have been assigned a low semantic value in this particular dataset.

As of Version 4.0.1, MySQL can also perform boolean full-text searches using the IN BOOLEAN MODE modifier.

mysql> SELECT * FROM articles WHERE MATCH (title,body)
    ->     AGAINST ('+MySQL -YourSQL' IN BOOLEAN MODE);
+----+------------------------------+-------------------------------------+
| id | title                        | body                                |
+----+------------------------------+-------------------------------------+
|  1 | MySQL Tutorial               | DBMS stands for DataBase ...        |
|  2 | How To Use MySQL Efficiently | After you went through a ...        |
|  3 | Optimising MySQL             | In this tutorial we will show ...   |
|  4 | 1001 MySQL Tricks            | 1. Never run mysqld as root. 2. ... |
|  6 | MySQL Security               | When configured properly, MySQL ... |
+----+------------------------------+-------------------------------------+

This query retrieved all the rows that contain the word MySQL (note: the 50% threshold is not used), but that do not contain the word YourSQL. Note that a boolean mode search does not automatically sort rows in order of decreasing relevance. You can see this from result of the preceding query, where the row with the highest relevance (the one that contains MySQL twice) is listed last, not first. A boolean full-text search can also work even without a FULLTEXT index, although it would be slow.

The boolean full-text search capability supports the following operators:

+
A leading plus sign indicates that this word must be present in every row returned.
-
A leading minus sign indicates that this word must not be present in any row returned.
By default (when neither plus nor minus is specified) the word is optional, but the rows that contain it will be rated higher. This mimicks the behaviour of MATCH() ... AGAINST() without the IN BOOLEAN MODE modifier.
< >
These two operators are used to change a word's contribution to the relevance value that is assigned to a row. The < operator decreases the contribution and the > operator increases it. See the example below.
( )
Parentheses are used to group words into subexpressions.
~
A leading tilde acts as a negation operator, causing the word's contribution to the row relevance to be negative. It's useful for marking noise words. A row that contains such a word will be rated lower than others, but will not be excluded altogether, as it would be with the - operator.
*
An asterisk is the truncation operator. Unlike the other operators, it should be appended to the word, not prepended.
"
The phrase, that is enclosed in double quotes ", matches only rows that contain this phrase literally, as it was typed.

And here are some examples:

apple banana
find rows that contain at least one of these words.
+apple +juice
... both words.
+apple macintosh
... word ``apple'', but rank it higher if it also contain ``macintosh''.
+apple -macintosh
... word ``apple'' but not ``macintosh''.
+apple +(>turnover <strudel)
... ``apple'' and ``turnover'', or ``apple'' and ``strudel'' (in any order), but rank ``apple pie'' higher than ``apple strudel''.
apple*
... ``apple'', ``apples'', ``applesauce'', and ``applet''.
"some words"
... ``some words of wisdom'', but not ``some noise words''.

Subsections

User Comments

Posted by Tim Gustafson on Friday May 17 2002, @6:24am[Delete] [Edit]

Note: you should add at least 3 rows to the table
before you try to match anything, and what you're
searching for should only be contained in one of
the three rows. This is because of the 50%
thereshold. If you insert only one row, then now
matter what you search for, it is in 50% or more
of the rows in the table, and therefore
disregarded.

Posted by Martin Glancy on Thursday August 29 2002, @12:06pm[Delete] [Edit]

A clarification for those very new to mySQL:
MATCH (body,title) will only work if you have an
index on both fields together - something created
like this:
ALTER TABLE some_tbl ADD FULLTEXT (body,title);
It will not work if you have an index on each
of 'body' and 'title' separately.

Posted by Dyfed Lloyd Evans on Monday October 21 2002, @4:39am[Delete] [Edit]

Hyphen '-' characters break literals at the moment.
A search for something like "GATA-D22S690" finds
all entries containing GATA and not the full
hyphenated text. The '-' character is treated as a
word stop even within literals. The same is true if
any of the special text search modifiers are used (eg
+, -, ~) so that hyphenated literals are not correctly
found with full text searches.

Posted by Julien Hofstede on Sunday October 27 2002, @11:49pm[Delete] [Edit]

There is no way to check if the zero results are
caused by the 50%-thing or if there really are no
results.

There is no way to find out what the maximum of
relevance is.

Posted by Olaf.Nix on Friday November 15 2002, @4:54am[Delete] [Edit]

Note: When trying to search for an exact phrase in
boolean mode you have to use a construct like
MATCH(...) AGAINST('"my phrase"' IN BOOLEAN
MODE). Note the outer single-quotes!! The
documentation is not very clear in this point and it
took my a few hours to figure this out. If you don't
enclose it in single quotes the query becomes
a 'normal' search.

Posted by Patrick O'Lone on Monday December 9 2002, @8:51am[Delete] [Edit]

It should be noted in the documentation that IN
BOOLEAN MODE will almost always return a
relevance of 1.0. In order to get a relevance that is
meaningful, you'll need to:
<BR/><BR/>
SELECT MATCH('Content') AGAINST ('keyword1
keyword2') as Relevance FROM table WHERE MATCH
('Content') AGAINST('+keyword1 +keyword2' IN
BOOLEAN MODE) HAVING Relevance > 0.2 ORDER
BY Relevance DESC
<BR/><BR/>
Notice that you are doing a regular relevance query
to obtain relevance factors combined with a WHERE
clause that uses BOOLEAN MODE. The BOOLEAN
MODE gives you the subset that fulfills the
requirements of the BOOLEAN search, the relevance
query fulfills the relevance factor, and the HAVING
clause (in this case) ensures that the document is
relevant to the search (i.e. documents that score
less than 0.2 are considered irrelevant). This also
allows you to order by relevance.
<BR/><BR/>
This may or may not be a bug in the way that IN
BOOLEAN MODE operates, although the comments
I've read on the mailing list suggest that IN
BOOLEAN MODE's relevance ranking is not very
complicated, thus lending itself poorly for actually
providing relevant documents. BTW - I didn't notice
a performance loss for doing this, since it appears
MySQL only performs the FULLTEXT search once,
even though the two MATCH clauses are different.
Use EXPLAIN to prove this.

Posted by Nathan Ostgard on Monday April 14 2003, @12:25pm[Delete] [Edit]

An easy solution to correct for spelling errors for small search items like the name of the city is to build a column that contains the SOUNDEX of each. I've found that using a 4 character SOUNDEX works the best. An example:

ALTER TABLE cities ADD city_soundex VARCHAR(4) NOT NULL;
UPDATE cities SET city_soundex=LEFT(SOUNDEX(city_name),4);

And then to query against:

SELECT * FROM citites WHERE city_soundex=LEFT(SOUNDEX('Some City Name'),4);

Posted by m m on Wednesday April 16 2003, @2:25pm[Delete] [Edit]

Does anyone know the actual formula used to find term relevance? Or, does anyone know how to find the exact formula?

We are trying to build a recommendation system with this feature.

Thanks

Posted by gogman on Monday May 5 2003, @10:42am[Delete] [Edit]

MySQL defaults to an 'OR'. Example: 'dog cat' = 'dog OR cat'. Most fulltext search engines default to an 'AND'. These include: AltaVista, Fast Search, Google, Hotbot, Lycos, Northern Light and Yahoo. Excite is an exception that defaults to an 'OR'.


New Feature: set-variable = ft_boolean_default='AND'

vs 'OR'

('OR' would be the default setting so as to not break older code)

With a 'AND' default 'OR' would have to be explicit. Example: 'dog cat' = 'dog AND cat', 'dog OR cat' - requires 'OR' to be set.

Performance tests are indicating a 5-7 times increase in search speed with "AND" vs "OR" statements.

Posted by gogman on Monday May 5 2003, @10:31am[Delete] [Edit]

Under MySQL 4.1, I'm recieving an expanded set of boolean variables.

mysql> show variables like 'ft_boolean_syntax';

ft_boolean_syntax + - > < ( ) ~ * : " " & |

Anyone have the definative definition and use for the new ": & |" terms?

Posted by [name withheld] on Monday May 12 2003, @6:43am[Delete] [Edit]

This page says that words under four characters won't be indexed for a full-text search, but "pie" is given as an example later on in a boolean mode search. Does this mean that boolean mode searches don't have this four character limit?

Posted by John Takacs on Tuesday May 20 2003, @9:41am[Delete] [Edit]

Greetings!

FYI benchmark freaks:

Currently, I'm using mysql's full text search support. I have a database of 3-5 million rows. Each row is unique, let's say a product. Each row has several columns, but the two I search on are title and description. I used the syntax above to create a full text index on title and description. Title has approximately 100 characters, and description has 255 characters.

At the moment, mysql is taking 50 seconds plus to return results on simple one word searches. My dedicated server is a P4, 2.0 Gighz, 1.5 Gig RAM RedHat Linux 7.3 platform, with nothing else running on it, i.e. another server is handling HTTP requests. It is a dedicated mysql box.

Obviously, the above performance is unacceptable for real world web applications.

Posted by [name withheld] on Saturday May 24 2003, @3:00pm[Delete] [Edit]

To John Takacs:

Thanks for your information, this is the kind of real-world information we need!
What would you use for that large fulltext searches?
Any comparisons with fulltext-search-engines are welcome!
www.searchtools.com might be a good starting point!
NOTE: Just checked, searchtools.com seems to be down (where are they gone?), you might use the google cache.

Have a nice day!

Posted by John Takacs on Saturday May 24 2003, @9:34pm[Delete] [Edit]

To "name withheld" above:

I forgot to add that the mysql server is experiencing the load of ONE user. Me. In other words, the above abysmal performance is the result of one user, me, typing in one single word query. 50 seconds was actually good. Some queries took minutes....yes....minutes.

While the above may be good enough for report generation, i.e. fire and forget about it, come back after lunch to look at the results, it won't work for a web app, where the user is waiting for the results in real-time.

Honestly, I was and still am dumbfounded. The above description of my table is accurate, and the above table is not some industrial stength behemoth that belongs on a 50,000 USD server/software package.

BTW, it took approximately 11 - 12 hours to make the index. Not sure if this is good or bad.

Also, Full-text indexing may still be applicable, if you have less than 300,000 - 500,000 rows of data. Some tests I did on another table with that amount did produce queries of less than 1 second duration. Table structure on that smaller table was the same as my above post.

Finally, as far as full-text search, check out http://www.aspseek.org, and lucene from , http://jakarta.apache.org. Both are scalable, full-text indexing search engines. Amazingly fast.

I had hoped to stick with a slightly modified mysql solution, but due to the above performance issues, I'll go back to one of the above search engines.

On a side note, attempting to index the Title column via normal indexing resulted in the server crashing each time.

Posted by [name withheld] on Sunday May 25 2003, @4:49am[Delete] [Edit]

Thanks again, John, hopefully the mysql-team will try to get into contact with you to check your results - this definitely is a MUST for them... and a lot of people are really very interested in what they will say...

BTW: swish-e.org is also VERY nice search-engine, also capable of doing fulltext search in XML-Docs and Tags... not many around, that will do this!

Posted by Emanuel Alexandre Tavares on Sunday May 25 2003, @3:43pm[Delete] [Edit]

Hi,

I've been working with MySQL since 1999 and I always by happy with the results.

But now I'm trying to use the full text index and I've been having problems with the "*" operator.

This just doesn´t work.

I've created a table with this script:

create table full
(id_full integer auto_increment primary key,
texto1 varchar(255),
texto2 longtext)

And the full text index with:

creste fulltext index i_full1 on full (texto1, texto2)

I put some records in the table with expressions:

"express"
"jornalexpress"
"expressjornal"

When I made a search like this:

select id_full, match(texto1, texto2) against('express') from full
where match(texto1, texto2) against('express');

everything works fine and just the lines with "express" has returned.

Then, I tried to do something like that:

select id_full, match(texto1, texto2) against('express*') from full
where match(texto1, texto2) against('express*');

and that:

select id_full, match(texto1, texto2) against('*express') from full
where match(texto1, texto2) against('*express');

But just the records with "express" returned. The line with expressjornal has totaly ignored.

Then, I tried this:

select id_full, match(texto1, texto2) against('expres*') from full
where match(texto1, texto2) against('expres*');

And nothing has returned.

What's going on?

I used 2 different databases:

MySQL 3.23.52 in a RedHat 7.3 and
MySQL 4.0.11Gama in a Mandrake 9.1

The table was a test with just 100 records.

Thanks.


Emanuel

Posted by John Takacs on Tuesday May 27 2003, @5:19pm[Delete] [Edit]

Emmanuel,

You need to post your question to the mailing list, per the posting guidelines here. You may wait a long time for an answer here.

Looking at your problem, I really don't know the answer, either. I suspect there are still some issues to be worked out with Full-text indexing. You really should consider an alternative, such as one of the search engines mentioned above.

John

Posted by Alexander Kochetov on Thursday May 29 2003, @1:36am[Delete] [Edit]

Emanuel,
when using special characters for boolean search always add 'IN BOOLEAN MODE' into AGAINST statement. Otherwise, those characters (*, +, -, ...) will be ignored.

Posted by Harsh Busa on Saturday June 14 2003, @1:58am[Delete] [Edit]

You can trick the more than 3 character restriction of full text search by appending * to the search word less than 4 characters.

So if you are searching for "TOP" it will not match plain "TOP" but word like "TOPOLOGY" "TOPGUNS" or whatever.

So we have decided not to use this trick :-)

Posted by Jim Nguyen on Wednesday June 18 2003, @10:33pm[Delete] [Edit]

The full-text search is slow when there are a lot of rows in the table. I have more than 2 million rows with text and multiple word searches (3 or more) take about 30 seconds to a minute or longer.

I am running a Athlon 2.2 Ghz with 512 MB DDR RAM 400 Mhz. Hard drive has seek time of 9 ms.

Posted by John Takacs on Tuesday June 24 2003, @8:32pm[Delete] [Edit]

Jim Nguyen,

Thanks for the info. I think it is time that more and more people posted real world benchmark type info. Only then will people be able to determine if Mysql will work for them.

In our case, I think 30 second to 1 minute delays for search results is a SHOW STOPPER. Nobody will use a search like that!

The second disconcerting thing is that a table with 1,000,000 to 3,000,000 records is HARDLY an enterprise class database. This is an ordinary average Joe type table.

Finally, when you do a search here at MySQL.com, have you noticed what they use? They use Mnogosearch! Not the Full text indexing of MySQL, but Mngosearch!

Regards,

John

Posted by Jim Nguyen on Wednesday June 25 2003, @6:06pm[Delete] [Edit]

Yes John, I agree with you. The full-text search is slow. I was looking at some of the future versions to see if they were ever going to optimize it. But I didn't really see anything relating to full-text search. I also emailed a mysql newsgroup but didn't get back a response. Is full-text searching going to be faster someday...if so when?

Posted by [name withheld] on Wednesday July 2 2003, @9:33am[Delete] [Edit]

How about INNODB support for FULLTEXT search? INNODB has lots of advantages over MyISAM. So I really want to move over to INNODB. The only thing holding me back is the lack of support for FULLTEXT search, which is required for my application. Put the two together, and MySQL will be truly useful.

Posted by Alan Riley on Monday July 14 2003, @3:46pm[Delete] [Edit]

We too have a database with close to 6 million rows in it. We would like to use the fulltext search, but it is painfully slow. The sad thing is that any one query we need to run only needs to be run on a subset of the rows (think of those 6 million rows as being divided between about 80 different categories, with results only needed to be returned from within a category). It is a shame that there isn't some way for us to have a fulltext index within the rows of another column which is also indexed.

Our next shot at this is going to be to create 80 different tables, one for each category (yuck!), just to try to get decent performance on the within-category fulltext search.

I would think there is enough interest in fulltext search for there to be an email list dedicated to it where those of us who need to use it on real-world web sites could interact with the developers to try to tune it for the good of the community at large.

Posted by jekend John on Thursday July 31 2003, @2:22am[Delete] [Edit]

dose the full-text search support two-byte character such as big5 or gbk ?
It seems not .
Is there any parameter to enabled it ?

Add your own comment.