Skip to content

Fuzzy search return wrong results #3355

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
4 tasks
MohammedOmar123 opened this issue May 5, 2025 · 12 comments
Open
4 tasks

Fuzzy search return wrong results #3355

MohammedOmar123 opened this issue May 5, 2025 · 12 comments
Assignees
Labels

Comments

@MohammedOmar123
Copy link

MohammedOmar123 commented May 5, 2025

Bug Description:

I tried the following query with the following options, I expect that all results to be have both searched terms, but as you see I got very different results

SELECT title FROM occupations WHERE MATCH('cybersecurity advisor') OPTION fuzzy=1, distance=1;
+------------------------------------------+
| title |
+------------------------------------------+
| actuarial advisory services manager |
| wealth management advisory manager |
| hotel guests services adviser |
| sales marketing adviser |
| aerospace engineering draughting adviser |
| hp nonstop adviser |
| hp nonstop and tandem adviser |
| sales support adviser |
| hr admin adviser |
| hr administration adviser |
| sales utility adviser |
| salesforce adviser |
| hr administrative adviser |
| hr adviser |

And when I increased the distance to 2 like this
SELECT title FROM occupations WHERE MATCH('cybersecurity advisor') OPTION fuzzy=1, distance=2;
I didn't get any results

My index config

id | bigint | |
| title | text | indexed stored |
| title_str | string | |
| occupation_id | bigint | |
| type | string | |
| language | string | |
| vector | float_vector | knn |
| title_str_attr | string | |
+----------------+--------------+-----------

Variable_name | Value |
+---------------+----------------------------------------------------------------------------+
| settings | min_infix_len = 2
index_exact_words = 1
morphology = lemmatize_en, stem_ar |
I use RT index.

Manticore Search Version:

I pulled the latest version using docker

Operating System Version:

I used Docker

Have you tried the latest development version?

Yes

Internal Checklist:

To be completed by the assignee. Check off tasks that have been completed or are not applicable.

  • Implementation completed
  • Tests developed
  • Documentation updated
  • Documentation reviewed
@sanikolaev
Copy link
Collaborator

Hi @MohammedOmar123

I can't reproduce your second issue (empty results with distance = 2) with the latest Manticore docker. Here's how I tested it:

# docker run -d --name=manticore --rm manticoresearch/manticore:latest
687ee6952605c68e117d2736a4c735c6f50e9033197ec272231534edb9dcefd4

# docker exec -it manticore mysql -P9306 -h0 -v -e "drop table if exists occupations_test; create table occupations_test(title text) min_infix_len='2' index_exact_words='1' morphology='lemmatize_en, stem_ar'; insert into occupations_test(title) values ('actuarial advisory services manager'),('wealth management advisory manager'),('hotel guests services adviser'),('sales marketing adviser'),('aerospace engineering draughting adviser'),('hp nonstop adviser'),('hp nonstop and tandem adviser'),('sales support adviser'),('hr admin adviser'),('hr administration adviser'),('sales utility adviser'),('salesforce adviser'),('hr administrative adviser'),('hr adviser'); SELECT title FROM occupations_test WHERE MATCH('cybersecurity advisor') OPTION fuzzy=1, distance=1; SELECT title FROM occupations_test WHERE MATCH('cybersecurity advisor') OPTION fuzzy=1, distance=2;"
--------------
drop table if exists occupations_test
--------------

--------------
create table occupations_test(title text) min_infix_len='2' index_exact_words='1' morphology='lemmatize_en, stem_ar'
--------------

--------------
insert into occupations_test(title) values ('actuarial advisory services manager'),('wealth management advisory manager'),('hotel guests services adviser'),('sales marketing adviser'),('aerospace engineering draughting adviser'),('hp nonstop adviser'),('hp nonstop and tandem adviser'),('sales support adviser'),('hr admin adviser'),('hr administration adviser'),('sales utility adviser'),('salesforce adviser'),('hr administrative adviser'),('hr adviser')
--------------

--------------
SELECT title FROM occupations_test WHERE MATCH('cybersecurity advisor') OPTION fuzzy=1, distance=1
--------------

+------------------------------------------+
| title                                    |
+------------------------------------------+
| actuarial advisory services manager      |
| wealth management advisory manager       |
| hotel guests services adviser            |
| sales marketing adviser                  |
| aerospace engineering draughting adviser |
| hp nonstop adviser                       |
| hp nonstop and tandem adviser            |
| sales support adviser                    |
| hr admin adviser                         |
| hr administration adviser                |
| sales utility adviser                    |
| salesforce adviser                       |
| hr administrative adviser                |
| hr adviser                               |
+------------------------------------------+
--------------
SELECT title FROM occupations_test WHERE MATCH('cybersecurity advisor') OPTION fuzzy=1, distance=2
--------------

+------------------------------------------+
| title                                    |
+------------------------------------------+
| actuarial advisory services manager      |
| wealth management advisory manager       |
| hotel guests services adviser            |
| sales marketing adviser                  |
| aerospace engineering draughting adviser |
| hp nonstop adviser                       |
| hp nonstop and tandem adviser            |
| sales support adviser                    |
| hr admin adviser                         |
| hr administration adviser                |
| sales utility adviser                    |
| salesforce adviser                       |
| hr administrative adviser                |
| hr adviser                               |
+------------------------------------------+

Could you please share your table files or adjust this example so the issue shows up?

We'll look into the first issue.

@sanikolaev
Copy link
Collaborator

@MohammedOmar123 what does this return for you?

SELECT title FROM occupations WHERE MATCH('cybersecurity')

I pulled the latest version using docker

Do you mean manticoresearch/manticore:latest or manticoresearch/manticore:dev ?

@sanikolaev
Copy link
Collaborator

@donhardman another interesting case (when cybersecurity it included in one of the docs):

docker exec -it manticore mysql -P9306 -h0 -v -e "drop table if exists occupations_test; create table occupations_test(title text) min_infix_len='2' index_exact_words='1' morphology='lemmatize_en, stem_ar'; insert into occupations_test(title) values ('cybersecurity actuarial advisory services manager'),('wealth management advisory manager'),('hotel guests services adviser'),('sales marketing adviser'),('aerospace engineering draughting adviser'),('hp nonstop adviser'),('hp nonstop and tandem adviser'),('sales support adviser'),('hr admin adviser'),('hr administration adviser'),('sales utility adviser'),('salesforce adviser'),('hr administrative adviser'),('hr adviser'); SELECT title FROM occupations_test WHERE MATCH('cybersecurity advisor') OPTION fuzzy=1, distance=1; SELECT title FROM occupations_test WHERE MATCH('cybersecurity advisor') OPTION fuzzy=1, distance=2;"
--------------
drop table if exists occupations_test
--------------

--------------
create table occupations_test(title text) min_infix_len='2' index_exact_words='1' morphology='lemmatize_en, stem_ar'
--------------

--------------
insert into occupations_test(title) values ('cybersecurity actuarial advisory services manager'),('wealth management advisory manager'),('hotel guests services adviser'),('sales marketing adviser'),('aerospace engineering draughting adviser'),('hp nonstop adviser'),('hp nonstop and tandem adviser'),('sales support adviser'),('hr admin adviser'),('hr administration adviser'),('sales utility adviser'),('salesforce adviser'),('hr administrative adviser'),('hr adviser')
--------------

--------------
SELECT title FROM occupations_test WHERE MATCH('cybersecurity advisor') OPTION fuzzy=1, distance=1
--------------

+---------------------------------------------------+
| title                                             |
+---------------------------------------------------+
| cybersecurity actuarial advisory services manager |
| wealth management advisory manager                |
| hotel guests services adviser                     |
| sales marketing adviser                           |
| aerospace engineering draughting adviser          |
| hp nonstop adviser                                |
| hp nonstop and tandem adviser                     |
| sales support adviser                             |
| hr admin adviser                                  |
| hr administration adviser                         |
| sales utility adviser                             |
| salesforce adviser                                |
| hr administrative adviser                         |
| hr adviser                                        |
+---------------------------------------------------+
--------------
SELECT title FROM occupations_test WHERE MATCH('cybersecurity advisor') OPTION fuzzy=1, distance=2
--------------

+---------------------------------------------------+
| title                                             |
+---------------------------------------------------+
| cybersecurity actuarial advisory services manager |
+---------------------------------------------------+

@MohammedOmar123
Copy link
Author

MohammedOmar123 commented May 14, 2025

Hello @sanikolaev Thanks for reviewing the issue, Yes I used manticoresearch/manticore:latest
and for the files that you request to share, do you mean the settings file of the table ?
and where I can find them?

@sanikolaev
Copy link
Collaborator

and for the files that you request to share, do you mean the settings file of the table ?

You can send us the physical tables files via our write-only S3 storage - https://manual.manticoresearch.com/Reporting_bugs#Uploading-your-data

@MohammedOmar123
Copy link
Author

and for the files that you request to share, do you mean the settings file of the table ?

You can send us the physical tables files via our write-only S3 storage - https://manual.manticoresearch.com/Reporting_bugs#Uploading-your-data

I can't provide the entire table, but I can share the table settings with you. Just let me know which files you need.

@donhardman
Copy link
Contributor

donhardman commented May 16, 2025

Hi @MohammedOmar123, and thanks for reporting the issue! We eliminated it and merged to master. Please give a try on the fresh dev version of the Docker image; for that, you can use manticoresearch/manticore:dev.

If you notice any incorrect behavior that you think is an issue, please let me know, and we can take a look into it.

@PavelShilin89, let's add tests for this case to make sure it works, you can use @sanikolaev MRE.

@donhardman
Copy link
Contributor

Also, there is a preserve option. I'm not sure if the word cybersecurity is present in your dataset. However, the logic currently by default removes it from the fuzzy matching algorithm when there is no such keyword in the dataset. If this is the case, we can use preserve=1, for example:

SELECT title FROM occupations WHERE MATCH('cybersecurity advisor') OPTION fuzzy=1, distance=1, preserve=1

You can try to play with it and see the differences on your own.

@MohammedOmar123
Copy link
Author

MohammedOmar123 commented May 16, 2025

Also, there is a preserve option. I'm not sure if the word cybersecurity is present in your dataset. However, the logic currently by default removes it from the fuzzy matching algorithm when there is no such keyword in the dataset. If this is the case, we can use preserve=1, for example:

SELECT title FROM occupations WHERE MATCH('cybersecurity advisor') OPTION fuzzy=1, distance=1, preserve=1

You can try to play with it and see the differences on your own.

Thank you @donhardman , This "preserve" option resolved the issue as I see.

For cybersecurity, yes it is presented in many rows in my dataset as you see in the this query
SELECT title FROM occupations WHERE MATCH('cybersecurity') OPTION fuzzy=1, distance=1 ;
+-------------------------------------------+
| title |
+-------------------------------------------+
| stagiaire en cybersécurité |
| cybersecurity programme director |
| cybersecurity analyst |
| cybersecurity architect |
| cybersecurity consultant |
| cybersecurity delivery manager |
| cybersecurity engineer |
| cybersecurity head |
| cybersecurity manager |
| cybersecurity specialist |
| cybersecurity developer |
| cybersecurity implementer |
| cybersecurity solutions expert |
| cybersecurity compliance director |
| cybersecurity legal officer |
| offensive cybersecurity expert |
| cybersecurity tester |
| cybersecurity risk assessor  |
| cybersecurity risk assurance consultant  |
| cybersecurity impact analyst  |
+-------------------------------------------+
as you see without the option that you provided I got results without "cybersecurity" keyword, see the following example
mysql> SELECT title FROM occupations WHERE MATCH('cybersecurity teester') OPTION fuzzy=1, distance=1 ;
+--------------------------------------+
| title |
+--------------------------------------+
| cybersecurity tester |
| harpsicord tester |
| harpsicord quality tester |
| compatibility tester |
| water quality tester |
| water safety tester |
| selenium tester |
| liquid waste tester |
| network tester |
| electrical equipment tester |
| PCB tester |
| printed circuit board tester |
| boat tester |
| chemical testers |
| tester of electronic equipment |
| electronics tester |
| software tester |
| engineered wood board tester |
| engineered wood board qaulity tester |
| prototype vehicle tester |
+--------------------------------------+
after using it, I got the right rows
mysql> SELECT title FROM occupations WHERE MATCH('cybersecurity teester') OPTION fuzzy=1, distance=1, preserve=1 ;
+----------------------+
| title |
+----------------------+
| cybersecurity tester |
+----------------------+

@MohammedOmar123
Copy link
Author

MohammedOmar123 commented May 19, 2025

Hello @donhardman , I am facing another issue, could you please check if I am doing something wrong, I tried dev version and the latest version (manticoresearch/manticore:latest or manticoresearch/manticore:dev )
STR:

CREATE TABLE new_index (
  id BIGINT,
  title TEXT,
  most_important_id UINT,
  second_important_id UINT,
  other_ids MULTI,
  serial UINT
) min_infix_len='2' index_exact_words='1' morphology='lemmatize_en,stem_ar';

INSERT INTO new_index (id, title, most_important_id, second_important_id, other_ids, serial) VALUES
(1, 'c/c++ software developer', 0, 0, (), 0),
(2, 'c and c++ software developer', 0, 0, (), 0),
(3, 'c software developer', 0, 0, (), 0),
(4, 'c++ software developer', 0, 0, (), 0),
(5, 'c# software developer', 0, 0, (), 0),
(6, 'c # software developer', 0, 0, (), 0),
(7, 'c plus plus software developer', 0, 0, (), 0),
(8, 'c sharp software developer', 0, 0, (), 0);

INSERT INTO new_index (id, title, most_important_id, second_important_id, other_ids, serial) VALUES
(9, 'windows ce software developer', 0, 0, (), 0),
(10, 'yii software developer', 0, 0, (), 0),
(11, 'yii framework software developer', 0, 0, (), 0),
(12, 'yii php software developer', 0, 0, (), 0),
(13, 'php yii software developer', 0, 0, (), 0),
(14, 'yii 2 software developer', 0, 0, (), 0),
(15, 'yii 2 framework software developer', 0, 0, (), 0),
(16, 'yii 2 php software developer', 0, 0, (), 0),
(17, 'php yii 2 software developer', 0, 0, (), 0),
(18, 'software developer web developer', 0, 0, (), 0),
(19, 'embedded software developer', 0, 0, (), 0),
(20, 'windows embedded compact software developer', 0, 0, (), 0),
(21, 'windows compact embedded software developer', 0, 0, (), 0),
(22, 'middleware software developer', 0, 0, (), 0),
(23, 'middle ware software developer', 0, 0, (), 0),
(24, 'embedded software developer', 0, 0, (), 0),
(25, 'embedded systems software developer', 0, 0, (), 0),
(26, 'embedded system software developer', 0, 0, (), 0),
(27, 'embedded systems software developers', 0, 0, (), 0),
(28, 'satellite software developer', 0, 0, (), 0);


 select title from new_index where match('C++ Software Developer'); -- Results returned correctly as expected
 title                          |
+--------------------------------+
| c/c++ software developer       |
| c and c++ software developer   |
| c software developer           |
| c++ software developer         |
| c# software developer          |
| c # software developer         |
| c plus plus software developer |
| c sharp software developer     |


 select title from new_index where match('C++ Software Developer') option fuzzy= 1, distance = 1; 
--  Unexpected Results returned and ranked higher
+---------------------------------------------+
| title                                       |
+---------------------------------------------+
| windows ce software developer               |
| yii framework software developer            |
| yii php software developer                  |
| yii 2 software developer                    |
| php yii 2 software developer                |
| yii software developer                      |
| php yii software developer                  |
| yii 2 framework software developer          |
| yii 2 php software developer                |
| c/c++ software developer                    |
| c and c++ software developer                |
| c software developer                        |
| c++ software developer                      |
| c# software developer                       |
| c # software developer                      |
| c plus plus software developer              |
| c sharp software developer                  |
| software developer web developer            |
| embedded software developer                 |
| windows embedded compact software developer |
+---------------------------------------------+

 select title from new_index where match('C++ Software Developer') option fuzzy= 1, distance = 1, preserve=1; 
-- Here the results got different from the latest and dev version .
-- in Dev I got this 
+---------------------------------------------+
| title                                       |
+---------------------------------------------+
| windows ce software developer               |
| windows embedded compact software developer |
| windows compact embedded software developer |
| yii software developer                      |
| yii framework software developer            |
| yii php software developer                  |
| php yii software developer                  |
| yii 2 software developer                    |
| yii 2 framework software developer          |
| yii 2 php software developer                |
| php yii 2 software developer                |
| c/c++ software developer                    |
| c and c++ software developer                |
| c software developer                        |
| c++ software developer                      |
| c# software developer                       |
| c # software developer                      |
| c plus plus software developer              |
| c sharp software developer                  |
| software developer web developer            |
+---------------------------------------------+

-- Now, In the latest version i got two different results 
-- For my main dataset I got empty results, 
describe my_index;
+----------------+--------------+----------------+
| Field          | Type         | Properties     |
+----------------+--------------+----------------+
| id             | bigint       |                |
| title          | text         | indexed stored |
| title_str      | text         | indexed stored |
| occupation_id  | bigint       |                |
| type           | string       |                |
| language       | string       |                |
| vector         | float_vector | knn            |
| title_str_attr | string       |                |
+----------------+--------------+----------------+

-- And for this index that I tried to reproduce the issue, I got the results correctly as I expected.
 -- The dataset in the following index is the same data that I used above.      

 select title from new_index where match('C++ Software Developer') option fuzzy= 1, distance = 1, preserve=1;
+---------------------------------------------+
| title                                       |
+---------------------------------------------+
| c/c++ software developer                    |
| windows ce software developer               |
| c and c++ software developer                |
| c software developer                        |
| c++ software developer                      |
| c# software developer                       |
| c # software developer                      |
| c sharp software developer                  |
| windows embedded compact software developer |
| windows compact embedded software developer |
| c plus plus software developer              |
+---------------------------------------------+

  | Type   | Properties     |
+---------------------+--------+----------------+
| id                  | bigint |                |
| title               | text   | indexed stored |
| most_important_id   | uint   |                |
| second_important_id | uint   |                |
| other_ids           | mva    |                |
| serial              | uint   |                |
+---------------------+--------+----------------+

@sanikolaev
Copy link
Collaborator

@MohammedOmar123 please try disabling layouts:

mysql> select title from new_index where match('C++ Software Developer') option fuzzy= 1, distance = 1, layouts='';
+---------------------------------------------+
| title                                       |
+---------------------------------------------+
| c/c++ software developer                    |
| windows ce software developer               |
| c and c++ software developer                |
| yii software developer                      |
| yii framework software developer            |
| c software developer                        |
| c++ software developer                      |
| yii php software developer                  |
| c# software developer                       |

This can alleviate the issue and you should consider control the layouts in general, but still the result seems to be not perfect. Perhaps there's a bug:

[Wed May 21 05:54:21.237 2025] [9481831] [BUDDY] Fuzzy: variations for 'C++ Software Developer': [{"original":"c","keywords":["software"]},{"original":"software","keywords":["software"]},{"original":"developer","keywords":["developer","developers"]}]

...

[Wed May 21 05:58:18.474 2025] [9481841] [BUDDY] Fuzzy: variations for 'C++ Software Developer': [{"original":"c","keywords":["software"]},{"original":"software","keywords":["software"]},{"original":"developer","keywords":["developer","developers"]}]

@donhardman please look into this.

@donhardman
Copy link
Contributor

I found one issue and already fixed it, but anyway, what I found will not fix the root core of the problem.

I still recommend for now to disable layouts and also use the preserve=1 flag that will keep the word in case we have no autocomplete matches found.

The core issue is lying deep in the core methods we use.

mysql> call suggest('c++', 'new_index');
mysql> call keywords('c++', 'new_index');
+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1    | c         | c          |
+------+-----------+------------+
mysql> call suggest('c', 'new_index');
mysql>

So when we are preparing fuzzy, the flow goes through multiple stages, and one of them is trying to use suggest to guess. As I found, it looks like it does not return anything.

I will continue to investigate further. But in the meantime, layouts='', preserve=1 should be a good choice.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants