Boxes Of Tat: The Joy of Search Terms

Over the many years I've been performing and supporting user searches in on-line databases, I've seen many kinds of search and index terms that cause problems for basic searching and require knowledge of how the search system processes the query and matches the resulting terms to those in the indexes.

Did they really call it that?

It shouldn't really surprise us given the number of pun-based shop names [1,2], but it seems that if you give someone a chance to name something they will often decide to have a little fun. In the biosciences this has become a bit of a running joke, and has caused issues requiring more sedate names to be used officially [3,4,5,6]. Of course such changes to names generate their own problems with identifying which names have been used in the literature and when they have been used. It also shouldn't be surprising that researchers who have chosen to use some form of systematic naming often end up using the same names as other researchers, usually for different things. Fortunately in the biosciences there are nomenclature bodies (e.g. HGNC) that try to keep track of this stuff, but that does not fix the existing literature and you can still find pockets of resistance to name changes [7].

In these days of search engine optimisation (SEO), you would hope that people would try harder to give things unique names. But that doesn't always work out either, as a name that is unique in written form might not be so easy to distinguish when spoken, as any of the people who have had to present on the EB-eye search system used at the EBI can attest.

So there is a need to consider if the nomenclature being used in a search is the appropriate nomenclature and if it is the only nomenclature that suits the parameters of the query. Typically this can be checked using domain knowledge and thesauri for identifying equivalent domain terms.
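
As a rough illustration of putting a thesaurus to work, the sketch below expands one term into an OR query over its known equivalents. The thesaurus entries are placeholders rather than real nomenclature, and the quoted-phrase and OR syntax is an assumption, so check what your search system actually accepts.

```python
def expand_query(term, thesaurus):
    """Build an OR query from a term and its known equivalents."""
    variants = [term] + thesaurus.get(term, [])
    # Quote each variant so multi-word names are kept together as phrases.
    return " OR ".join(f'"{v}"' for v in variants)


# Hypothetical thesaurus: placeholder names, not real nomenclature.
thesaurus = {"old-name": ["approved-symbol", "alternative name"]}

print(expand_query("old-name", thesaurus))
```

A term with no thesaurus entry simply comes back as a single quoted phrase.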

Acronyms and abbreviations

Having been at conferences where you would see groups of principal investigators (PIs) huddled together concocting names for their latest projects or research products, it becomes clear that they can get caught in games of "let's show how clever we are." This can lead to the use of names which are common words, for example:

Given that these are all from the biosciences it is interesting that biologically relevant names have been chosen for the software or project acronyms. When it comes to searching this would not be too bad if the acronyms were spelled out every time, but they aren't. Either they are considered to be in common usage and thus not in need of explanation, or they are seen as unambiguous in the context, or the expanded version is not included in the data in the database being searched (e.g. the abstract is in the database using the acronym, but the expansion of the name only appears in the full text, which is not available to the search).

One area where this can cause serious problems is with the abbreviations used to describe genes. These gene symbols are short and often derived from a longer gene name, and since they are symbolic the full name is often not included. Thus you may have no way of knowing what the corresponding longer name is without performing additional searches, and even if you do have the full name it may not be helpful in providing terms that are present in the database. So gene symbols like "ACID-1", "comE" and "namE" cause problems for searching since they correspond to words we expect to see in the text. Symbols that match the stop word list used for the database search become really difficult to find. For example "alsO", "THE-1" and "use-1" correspond to commonly used stop words and thus will typically become place holders or be ignored in the search.

When it comes to these, check the documentation for the search system to see how stop word processing behaves. In databases containing this type of information there are often specific fields which are excluded from stop word processing specifically to support this kind of search. Also consider if using a specialist database might provide an easier route to getting useful terms for more general searching. As a somewhat backwards example, in order to find the terms above I used the UniProtKB database of proteins, since it has clean gene symbol searching, and thus I could find symbols matching words.
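
To make the stop word problem concrete, here is a minimal sketch of the sort of query processing a general purpose search system might apply. The stop word list and the tokenising rule are my assumptions for illustration; real systems ship their own lists and rules.

```python
import re

# Assumed stop word list; real systems use their own, usually much longer.
STOP_WORDS = {"a", "also", "and", "of", "the", "use"}


def analyze(text, stop_words=STOP_WORDS):
    """Lowercase, split on anything that is not a letter or digit, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in stop_words]


for symbol in ["alsO", "THE-1", "use-1", "comE"]:
    print(symbol, "->", analyze(symbol))
```

With this processing "alsO" disappears completely, "THE-1" and "use-1" are both reduced to the bare token "1", and "comE" survives only as the ordinary word "come".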

Case gives meaning

In some cases the character case of the term has an effect on the meaning, which can be a problem since most search systems are case insensitive. In Drosophila genetics this used to cause all kinds of problems since the gene nomenclature was case sensitive (i.e. adh, Adh and ADH were not the same thing). In recent years there has been a move away from this style of nomenclature towards one that does not have this problem, but these terms are still present in the literature, especially in older papers. A similar thing happens in the protein structure domain where some components of structure identifiers can be case sensitive.

In these cases the workaround is to use the domain specific databases that understand the specific rules applied to the terms, and use the resulting information to build appropriate queries in the literature databases.

Note there are some really odd cases, like the dotted and dotless I in Turkish, where case changes that may be applied during indexing can cause problems for searching and can also change the meaning of the text. Fortunately these are rare.
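
Both effects are easy to reproduce. The sketch below assumes a search system that builds its index key by simple lowercasing, which is a common but by no means universal choice, and uses Python's default (locale independent) case mappings.

```python
def index_key(term):
    """Case-insensitive index key, as a simple search system might compute it."""
    return term.lower()


# The three case sensitive symbols collapse onto a single key.
symbols = ["adh", "Adh", "ADH"]
print({index_key(s) for s in symbols})

# Turkish dotless i (U+0131): uppercasing then lowercasing gives a different letter.
dotless = "\u0131"
round_trip = dotless.upper().lower()
print(round_trip == dotless)
```

The set contains a single entry, so the index cannot tell the three symbols apart, and the round trip turns the dotless i into an ordinary "i", so the original character is lost.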

Short Search Terms

Very short words such as 'a' or 'of' may be excluded from searching due to term length restrictions rather than via stop word lists. Depending on the implementation such terms may also be excluded from the index. The rationale is that such short terms are so common that they do not provide useful terms in the majority of cases, and thus removing them will improve performance and avoid issues with commonly appearing short terms such as numbers as well as short words. The minimum term length depends on the search system and potentially the specific field(s) being searched. Generally expect terms of two or fewer characters to be difficult to search with.

As usual, check the documentation for the search system and database to see if such processing is present, and if using specific fields may get around the problem. Otherwise consider alternative databases and search systems that may allow short terms for searching.
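
A minimal sketch of a term length filter follows. The minimum length of three characters is an assumption chosen to match the "two or fewer characters" rule of thumb above; the real value depends on the system and field.

```python
MIN_TERM_LENGTH = 3  # assumed threshold; check your search system's documentation


def searchable_terms(query, min_length=MIN_TERM_LENGTH):
    """Split a query on whitespace and separate usable terms from those too short."""
    kept, dropped = [], []
    for term in query.split():
        (kept if len(term) >= min_length else dropped).append(term)
    return kept, dropped


kept, dropped = searchable_terms("hepatitis B virus")
print("searched:", kept)
print("ignored:", dropped)
```

Here the "B", which is the part of the query that carries the distinction, is the term that gets thrown away.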

Romanized Forms

For content in multiple languages or using non-Latin characters, it helps if the search system lets you know how these have been treated. In some cases translations are also provided to support searching, for example the Chinese Biological Abstracts are provided in English. In other cases, in particular for author names, an attempt may have been made to index a romanized form of the name [8,9] so it can be found without having to figure out how to type the specific characters used.
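
One simple treatment a system might apply to accented Latin characters is accent folding: decompose each character and drop the combining marks. The sketch below uses Python's standard unicodedata module. It is only an approximation of what any given system does, and it is not a full romanization, which needs script specific transliteration tables.

```python
import unicodedata


def fold_accents(text):
    """Decompose characters (NFKD) and drop combining marks such as accents."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


for name in ["M\u00fcller", "Jos\u00e9 N\u00fa\u00f1ez", "\u0141\u00f3d\u017a"]:
    print(name, "->", fold_accents(name))
```

Note that letters such as the Polish "Ł" have no decomposition and pass through unchanged, so a folded index and a folded query can still disagree on them.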

One thing to be aware of is that while most of the modern Internet uses Unicode (typically as UTF-8), and thus characters for all languages are available and are handled in a consistent way, there are still applications, data and documents that use other character encodings that are not always compatible with what services expect to be receiving. The usual way this problem crops up is when copying and pasting search terms from an e-mail or document into a search form, and some of the characters start looking a little strange (commonly replaced by rectangles, but other characters can appear). The strangeness can manifest in the pasted text, or in the processed query terms displayed alongside the search result. In cases where the source character set is known, character conversion tools can be used to get things into the expected character set for use in searching.
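
The sketch below reproduces one common form of the problem, UTF-8 bytes being read as Latin-1 (ISO-8859-1), and then reverses it. The choice of Latin-1 as the wrong encoding is an assumption for the example; the repair only works when you know which encoding was actually misapplied.

```python
original = "M\u00fcller"  # "Müller"

# Simulate the damage: UTF-8 bytes interpreted as Latin-1.
garbled = original.encode("utf-8").decode("latin-1")
print(garbled)

# Repair by reversing the wrong step, then decoding with the right encoding.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)
```

The single accented character turns into two unrelated characters in the garbled form, which is what ends up being sent to the search system if the term is pasted as is.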

Punctuation

Some instances of punctuation are important parts of search terms, such as the use of the apostrophe in Irish names such as "O'Donnell" or in chemical compound names (e.g. "2'-O-phosphonoadenosine 5'-{3-[1-(3-carbamoyl-1,4-dihydropyridin-1-yl)-1,4-anhydro-D-ribitol-5-yl] dihydrogen diphosphate}"). In these cases the handling of punctuation by the search system can affect the results. Systems that remove punctuation or replace it with something else (typically spaces) can create ambiguity and may not be able to find the corresponding documents. In other systems the punctuation may form meaningful syntax for the search system (e.g. the use of ':' as a field delimiter in Apache Lucene based systems). In the case of chemical terms this can also interact with other issues, such as extended character handling and use of short terms, to make the search even more difficult. In some systems special rules are used to ensure such terms are handled appropriately and will find corresponding documents.
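
To see how replacing punctuation with spaces plays out, here is a minimal sketch of that rule applied to a name and to a fragment of the compound name. The rule itself is an assumption standing in for whatever a particular system really does.

```python
import re


def strip_punctuation(text):
    """Replace every run of non-alphanumeric characters with a space, then split."""
    return re.sub(r"[^0-9A-Za-z]+", " ", text).split()


print(strip_punctuation("O'Donnell"))
print(strip_punctuation("1,4-dihydropyridin-1-yl"))
```

The name becomes the two terms "O" and "Donnell", and the chemical fragment becomes a string of one and two character terms around "dihydropyridin", which is exactly the kind of short term that the length restrictions discussed earlier tend to discard.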

As an example, have a look at some search results for the full name of NADPH, a biologically active chemical compound: 2'-O-phosphonoadenosine 5'-{3-[1-(3-carbamoyl-1,4-dihydropyridin-1-yl)-1,4-anhydro-D-ribitol-5-yl] dihydrogen diphosphate}

A basic naive search (dump the term in the box and hit go):

Google web search - 10 results
Bing web search - 37 results
EBI Search - search error, syntax problem
NCBI Entrez search - no results found

A quoted search, to get handling as a phrase, protect against syntax clashes and preserve as much of the term as possible:

Google web search - no results found
Bing web search - 37 results
EBI Search - 3 results
NCBI Entrez search - 1723 results

Notice the differences in behaviour. While Bing comes out well for this search, the specialist searches provide exactly what I would expect. The behaviour of Google is a little odd though, and may indicate that as a single phrase Google thinks the term is too long or too complex.
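
If you are building quoted searches in a script rather than typing them, a small helper can do the quoting. The sketch below assumes a Lucene style syntax in which a phrase is wrapped in double quotes and a backslash escapes a quote or backslash inside it. The helper name is my own, and other systems have their own escaping rules.

```python
def as_phrase(term):
    """Wrap a term in double quotes, escaping backslashes and embedded quotes."""
    escaped = term.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


compound = (
    "2'-O-phosphonoadenosine 5'-{3-[1-(3-carbamoyl-1,4-dihydropyridin-1-yl)"
    "-1,4-anhydro-D-ribitol-5-yl] dihydrogen diphosphate}"
)
print(as_phrase(compound))
```

Whether the brackets, braces and colons inside the quotes are then treated as plain text is up to the search system, so it is still worth checking the processed query it reports back.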

[1] "Funny shop names" http://www.guy-sports.com/months/jokes_name_places.htm
[2] "20 Brilliant Funny Punning Business Names" http://www.huffingtonpost.co.uk/2012/08/31/punning-man-funny-shop-signs-photos_n_1846101.html
[3] "The naming of the genes" http://blog.wellcome.ac.uk/2012/11/26/the-naming-of-the-genes/
[4] "Gene/Protein Name Etymology" https://www.biostars.org/p/88965/
[5] "A Gene by Any Other Name" http://www.americanscientist.org/issues/pub/a-gene-by-any-other-name
[6] "Clever Drosophila gene names" http://web.archive.org/web/20080221055542/http://tinman.vetmed.helsinki.fi/eng/drosophila.html
[7] "In which I crave some nomenclatural consistency" http://web.archive.org/web/20090702000035/http://network.nature.com/people/UE19877E8/blog/2008/07/02/in-which-i-crave-some-nomenclatural-consistency
[8] "Romanization" https://en.wikipedia.org/wiki/Romanization
[9] "to handle accented characters as they used to be handled" http://support.orcid.org/forums/175591-orcid-ideas-forum/suggestions/8239947-to-handle-accented-characters-as-they-used-to-be-h

Thursday, 5 November 2015