I had a few ideas for modifying the index that could possibly improve search. Is Mastodon the best place to submit feature requests like this or should these be submitted elsewhere?
1. Index domain names. E.g. "Accidental Tech Podcast" is often referred to as ATP and the podcast URL is atp.fm. If the domain name were indexed, a search for "atp" would bring up this podcast as well.
2. Map common misspellings to the correct spellings (https://www.infoplease.com/culture-entertainment/journalism-literature/frequently-misspelled-words). Also handle words that have multiple correct spellings (e.g. anyway vs. any way).
3. Improved word stemmer (e.g. "run" and "ran" could return similar results)
4. Contractions (e.g. "can't", "cannot", and "can not" could return similar results)
5. Diacritics insensitivity (accented characters). For example "brene" vs "brené"
@snowninja Yep, discussion here is best. Those are all good ideas. Got any idea how to make that happen in Sphinx/Manticore? Search improvement is one of those things I start on every couple of months, then I get pulled away into some more urgent thing and it always gets kicked down the road. I'd love to get some traction on it.
@dave Great! I totally get it about having so much to work on. I'm not an expert in Manticore, but here are my thoughts on how each of these could possibly be done:
1. Index the `link` field with just the base domain name. Use `regexp_filter` to strip the TLD and everything after as well as the prefix (e.g. https://www) from that field.
2. Using exceptions (synonyms) to create a table of the commonly misspelled words would probably do the trick.
3. The built-in stemmer may be ok for the time being, but there are some additional morphology preprocessors that could probably be tweaked a bit (documentation here https://manual.manticoresearch.com/Creating_an_index/NLP_and_tokenization/Morphology)
4. English contraction support might have to be added manually with a wordform config https://manual.manticoresearch.com/Creating_an_index/NLP_and_tokenization/Wordforms#wordforms
5. Diacritic insensitive search can be enabled with `charset_table=non_cjk`
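To make items 1, 4, and 5 a bit more concrete, here is a rough Python sketch of the normalization each one is aiming for. This is not Manticore configuration, just an illustration of the intended result. The function names and the tiny contraction table are made up for the example, and the domain handling is naive (it would not cope with multi-part suffixes like "co.uk").

```python
import re
import unicodedata
from urllib.parse import urlparse

# Hypothetical, deliberately tiny table; a real one would be much longer.
CONTRACTIONS = {"can't": "cannot", "can not": "cannot"}


def base_domain_label(link: str) -> str:
    """Item 1: reduce a feed link to its bare domain label, e.g. 'atp'."""
    # urlparse only fills in the host when a scheme (or "//") is present.
    host = urlparse(link if "//" in link else "//" + link).hostname or ""
    host = re.sub(r"^www\.", "", host)
    # Drop the last dot-separated label (the TLD) and keep the rest.
    head, _, _tld = host.rpartition(".")
    return head or host


def normalize_contractions(text: str) -> str:
    """Item 4: map contraction variants onto one canonical form."""
    lowered = text.lower().replace("\u2019", "'")  # curly to straight apostrophe
    for variant, canonical in CONTRACTIONS.items():
        lowered = re.sub(r"\b" + re.escape(variant) + r"\b", canonical, lowered)
    return lowered


def fold_diacritics(text: str) -> str:
    """Item 5: strip accents so 'brené' and 'brene' compare equal."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch)).lower()
```

With this, `base_domain_label("https://atp.fm")` gives `"atp"`, and `fold_diacritics("Brené")` and `fold_diacritics("brene")` both give `"brene"`, which is the behaviour the index settings above would need to reproduce.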
I don't build search indexes myself, but to me these suggestions look like they should be among the default settings.
@bencomp I think these suggestions make a lot of sense for the podcast indexing space (perhaps not in every search domain though). Hopefully they won't be too time-consuming to configure! I'm happy to help in any way if I can.