Codifying Semantic Information in Medical Questions Using Lexical Sources
Paul E. Pancoast, MD, Arthur B. Smith, MS, Chi-Ren Shyu, PhD
University of Missouri-Columbia
Methods:
Source Questions (4083 test questions):
•University of Iowa researchers gathered questions from clinicians during observation in a clinical setting. Ely JW, Osheroff JA, Ebell MH, et al. Analysis of questions asked by family doctors regarding patient care. BMJ. Aug 7 1999;319(7206):358-361
•British researchers answer questions submitted by clinicians and post evidence-based answers on the web. http://www.attract.wales.nhs.uk/about/one.htm
•Australian researchers answer questions submitted by clinicians and post evidence-based answers on the web. http://ww.sph.uq.edu.au/CGP/red/quest/faqs.asp
Source Vocabulary:
•MRCON – a table from the Metathesaurus (2003AB)
•Lists the medical concepts by unique identifiers (CUI)
•Lists each string (word or phrase) associated with the concepts
•UNIQUE – (string => 1 concept)
•AMBIGUOUS – (string => 2+ concepts)
•COLD – 1) ambient temperature, 2) viral upper respiratory infection, 3) Chronic Obstructive Lung Disease
•2,247,454 strings associated with concepts
•1,860,680 unique strings
•900,550 unique CUI
•Non-medical Lexicon – generated from Roget’s Thesaurus
•Query objects (why, when, how), identifiers (I, you, he), modifiers (soon, frequently), action/relationship (treats, attends, reduce, lessen, can, improve)
•749 terms in this lexicon
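The UNIQUE/AMBIGUOUS distinction above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the rows, the `X000n` identifiers (made up, not real CUIs), and the function names `build_index` and `classify` are all assumptions.

```python
from collections import defaultdict

# Hypothetical (concept_id, string) rows standing in for MRCON records.
# The identifiers are invented for illustration and are NOT real CUIs.
ROWS = [
    ("X0001", "Cold"),               # ambient temperature
    ("X0002", "cold"),               # viral upper respiratory infection
    ("X0003", "COLD"),               # Chronic Obstructive Lung Disease
    ("X0004", "acute pharyngitis"),
    ("X0004", "Acute Pharyngitis"),  # same concept, different casing
]

def build_index(rows):
    """Map each lower-cased string to the set of concept ids that use it."""
    index = defaultdict(set)
    for concept_id, string in rows:
        index[string.lower()].add(concept_id)
    return index

def classify(index, string):
    """Return 'UNIQUE', 'AMBIGUOUS', or 'UNMATCHED' for a string."""
    concepts = index.get(string.lower(), set())
    if not concepts:
        return "UNMATCHED"
    return "UNIQUE" if len(concepts) == 1 else "AMBIGUOUS"

INDEX = build_index(ROWS)
```

With these made-up rows, `classify(INDEX, "cold")` is AMBIGUOUS (three concepts share the string), while `classify(INDEX, "acute pharyngitis")` is UNIQUE.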
Parsing Application:
•Source vocabulary => 37-ary tree structure
•Indexed by sequences of characters in the string (a-z0-9 and <space>) after lower-casing and removal of punctuation
•Source questions –
•Examined in 3-word, 2-word, 1-word windows for matches with the source vocabulary
•{what is the} best treatment for acute pharyngitis
•What {is the best} treatment for acute pharyngitis . . .
•What is the best treatment for {acute pharyngitis} !!Match . .
•What is the best {treatment} for xxxxxxxxx !!Match
•Generates a report of:
•Total number of words parsed
•Number of matches from UNIQUE, AMBIGUOUS, NON-MEDICAL LEXICON
•Strings that didn’t match the source vocabularies
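The parsing steps above can be sketched as follows. This is a simplified reading of the description, not the original application: the class and function names, the way punctuation is dropped, and the tiny vocabulary in the usage note are assumptions. Each trie node has 37 child slots (a-z, 0-9, space), and the question is scanned with 3-word, then 2-word, then 1-word windows, skipping words already consumed by an earlier match.

```python
import re

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789 "  # 26 + 10 + 1 = 37 symbols
SLOT = {ch: i for i, ch in enumerate(ALPHABET)}

def normalize(text):
    """Lower-case, delete punctuation, collapse whitespace to single spaces."""
    text = re.sub(r"[^a-z0-9\s]", "", text.lower())
    return " ".join(text.split())

class Node:
    def __init__(self):
        self.children = [None] * len(ALPHABET)  # one slot per symbol
        self.source = None  # vocabulary label if a string ends here

class Trie:
    def __init__(self):
        self.root = Node()

    def insert(self, string, source):
        node = self.root
        for ch in normalize(string):
            i = SLOT[ch]
            if node.children[i] is None:
                node.children[i] = Node()
            node = node.children[i]
        node.source = source

    def lookup(self, normalized):
        """Return the vocabulary label for an already-normalized string."""
        node = self.root
        for ch in normalized:
            node = node.children[SLOT[ch]]
            if node is None:
                return None
        return node.source

def match_question(trie, question):
    """Scan with 3-, 2-, then 1-word windows and report what matched."""
    words = normalize(question).split()
    used = [False] * len(words)
    matches = []
    for size in (3, 2, 1):
        i = 0
        while i + size <= len(words):
            window = range(i, i + size)
            phrase = " ".join(words[j] for j in window)
            taken = any(used[j] for j in window)
            source = None if taken else trie.lookup(phrase)
            if source is None:
                i += 1
            else:
                matches.append((phrase, source))
                for j in window:
                    used[j] = True
                i += size
    unmatched = [w for w, u in zip(words, used) if not u]
    return {"total_words": len(words), "matches": matches,
            "unmatched": unmatched}
```

A usage sketch with a three-entry vocabulary (the labels assigned here are illustrative only):

```python
trie = Trie()
trie.insert("acute pharyngitis", "UNIQUE")
trie.insert("treatment", "UNIQUE")
trie.insert("what", "NON-MEDICAL")
report = match_question(trie, "What is the best treatment for acute pharyngitis?")
```

Here the 2-word pass picks up "acute pharyngitis" before the 1-word pass sees "what" and "treatment"; "is", "the", "best", and "for" are reported as unmatched.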
Results:
(unmatched words, 2+ occurrences)
We suspect that some unmatched words will be important in determining the meaning of a medical question, particularly relationship words (verbs).
Discussion:
•MRCON – selected for its relatively low rate of ambiguous strings (11%). Other tables contain more strings, but their ambiguity rates are much higher.
•Other researchers have similar results matching biomedical text to controlled vocabularies
•Cimino et al. matched 43% of words with Meta-1 (we had 56% Metathesaurus matches) – Computers and Biomedical Research. Aug 1992;25(4):366-373.
•Hersh et al. matched 60% of words to a medical terminology and names dictionary (we had 79% combined lexicon matches) – Proceedings/AMIA Annual Fall Symposium. 1997
•Stop words (prepositions, conjunctions, pronouns, etc.) – removed by most normalization tools
•They provide valuable contextual information in medical questions
•Blood FOR an HIV-positive patient – much different than
•Blood FROM an HIV-positive patient
•Patient taking aspirin AND coumadin – more likely to bleed than
•Patient taking aspirin OR coumadin
•Integers – difficult to manage (discarded in MetaMap) but also provide valuable discriminatory information
•Patient with hyperkalemia of 5.1 mEq/L – a concern but not critical
•Patient with hyperkalemia of 8.9 mEq/L – either a lab error or dead by now…
•Verbs – Action/Relation Concepts –
•Not listed in the Metathesaurus
•Some included in our Non-medical lexicon
•Verb strings => concepts – very fluid and VERY ambiguous – how many concepts can be represented by ‘USE’?
•Relation concepts may be ‘conceptually’ related to entity/event concepts, but they are not equivalent
•Diagnose => Diagnosis
•Treat => Treatment
•Evaluate => Evaluation
•Verb tense changes the meaning of a question
•… in a patient TAKING antibiotics
•… in a patient who TOOK antibiotics
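The stop-word and integer examples above can be made concrete with a small sketch. The stop-word list and the tokenizer here are assumptions chosen for illustration (real normalization tools use much longer lists); the point is only that once prepositions, conjunctions, and numbers are stripped, each pair of phrases becomes indistinguishable.

```python
import re

# Assumed, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "the", "for", "from", "and", "or", "of", "with"}

def strip_stop_words_and_numbers(text):
    """Return the tokens left after dropping stop words and numbers."""
    kept = []
    for tok in re.findall(r"[a-z]+|\d+(?:\.\d+)?", text.lower()):
        if tok in STOP_WORDS or tok[0].isdigit():
            continue
        kept.append(tok)
    return kept

PAIRS = [
    ("Blood FOR an HIV-positive patient",
     "Blood FROM an HIV-positive patient"),
    ("Patient taking aspirin AND coumadin",
     "Patient taking aspirin OR coumadin"),
    ("Patient with hyperkalemia of 5.1 mEq/L",
     "Patient with hyperkalemia of 8.9 mEq/L"),
]

# True for every pair: the distinguishing word or value has been discarded.
COLLAPSED = [strip_stop_words_and_numbers(a) == strip_stop_words_and_numbers(b)
             for a, b in PAIRS]
```

Under these assumptions all three pairs collapse to the same token list, which is the loss of contextual information the discussion warns about.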
Research Purpose:
•To find a method for classifying medical questions that are asked by clinicians.
Hypothesis:
•Simply indexing questions by keyword isn’t sufficient to –
•Distinguish questions with different meanings but similar wording
•Group questions with different words but similar meanings
Examples:
Different words, similar meaning:
•What is the best way to treat acute pharyngitis in healthy children?
•How do you approach a normal pediatric patient with a sore throat?
Similar words, different meanings:
•How do you deal with diabetic patients who are resistant to insulin?
•How do you deal with diabetic patients who are resistant to taking insulin?
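A quick way to see why keyword indexing falls short on these examples is to measure keyword overlap directly. The sketch below uses Jaccard overlap of lower-cased word sets; this measure and the function names are our illustrative choice, not something taken from the poster.

```python
import re

def keywords(question):
    """Set of lower-cased alphanumeric tokens in a question."""
    return set(re.findall(r"[a-z0-9]+", question.lower()))

def jaccard(a, b):
    """Shared keywords divided by all keywords across the two questions."""
    ka, kb = keywords(a), keywords(b)
    union = ka | kb
    return len(ka & kb) / len(union) if union else 0.0

Q1 = "What is the best way to treat acute pharyngitis in healthy children?"
Q2 = "How do you approach a normal pediatric patient with a sore throat?"
Q3 = "How do you deal with diabetic patients who are resistant to insulin?"
Q4 = "How do you deal with diabetic patients who are resistant to taking insulin?"

same_meaning_overlap = jaccard(Q1, Q2)       # no keyword in common
different_meaning_overlap = jaccard(Q3, Q4)  # 12 of 13 keywords in common
```

The pair with the same meaning shares no keywords at all, while the pair with different meanings differs by a single word, so a keyword index ranks them the wrong way round.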
Why Bother? (to classify medical questions?)
•Clinicians often have questions when treating patients
•Researchers have gathered collections of these questions
•There is no good method to classify the questions in these collections
•How many times has a particular question been asked? Has a similar question already been definitively answered (using evidence-based methods)?
•Which questions should receive priority when evidence-based answers are written?
•How should a database of questions with evidence-based answers be indexed for retrieval?  What kind of search capabilities should it have?
19.1% of words didn’t match the source vocabulary. Among these unmatched words:
100 unique strings – 7,850 occurrences – 57.6% of the unmatched total
712 unique strings – 3+ hits – 85% of the unmatched total
Unmatched words by category:

Category       Unique words   Total number   Percent
Verb                    261           3676     31.7%
Noun                    186           2356     20.3%
Preposition               9           2544     21.9%
Adj/Adv/Con             103           1095      9.5%
Mix *                    72            810      7.0%
Pronoun                  10            614      5.3%
Integer                  70            502      4.3%
Matches by source vocabulary:

Source             Strings      Hits     Words   % match
MRCON Unique         4,534    24,844    30,186     42.3%
MRCON Ambiguous        574     9,256     9,769     13.7%
Non-medical            208    16,768    17,783     24.9%
Unmatched            2,321         -    13,624     19.1%

String – individual word or phrase that matched the source vocabulary
Hits – how often individual strings were found
Words – the total number of matching words
If ‘acute pharyngitis’ is found 12 times => 1 string, 12 hits, 24 words
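The string/hit/word bookkeeping and the % match column can be checked with a short calculation. The function name `tally` and the dictionary layout are assumptions; the word counts are the ones reported in the match table, and the percentage is taken as each row's share of all words.

```python
def tally(found):
    """found: one entry per hit. Returns (strings, hits, words)."""
    strings = len(set(found))                   # distinct matched strings
    hits = len(found)                           # every occurrence counts
    words = sum(len(s.split()) for s in found)  # words covered by the hits
    return strings, hits, words

# 'acute pharyngitis' found 12 times => 1 string, 12 hits, 24 words
example = tally(["acute pharyngitis"] * 12)

# Word counts per source, as reported in the match table.
WORDS = {
    "MRCON Unique": 30186,
    "MRCON Ambiguous": 9769,
    "Non-medical": 17783,
    "Unmatched": 13624,
}
TOTAL_WORDS = sum(WORDS.values())  # 71,362 words parsed
PERCENT = {k: round(100 * v / TOTAL_WORDS, 1) for k, v in WORDS.items()}
```

The computed shares reproduce the reported 42.3%, 13.7%, 24.9%, and 19.1%, which confirms that the % match column is based on words rather than on hits or strings.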
Figure: Basic Structure of the B Tree
Figure: Information Flow Diagram
Acknowledgements:
The authors gratefully acknowledge Jon Brassey, TRIP Database & Director, ATTRACT Wales (UK), the Medical School of the University of Queensland (Australia), the Centre for Reviews and Dissemination, University of York (UK), and Dr. John Ely for the kind donation of clinical question sets.  Dr. Pancoast acknowledges the National Library of Medicine Biomedical and Health Informatics Research Training Grant 2-T15-LM07089-11.
Summary:
•We developed an application to:
•Extract medical concepts from natural language text
•Map these medical concepts to a controlled vocabulary
•This is similar to the MetaMap application, but with a different purpose (to represent the meaning of medical questions rather than to extract search terms from medical text)
•We used these codified representations to match questions using vector calculations (G. Salton, A. Wong, and C.S. Yang. A vector space model for automatic indexing. Communications of the ACM, 18(11), 1975, pp. 613-620)
•The major hindrance to satisfactory matching and clustering is the lack of representation of relations between medical concepts
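The vector matching step can be illustrated with cosine similarity over concept counts, in the spirit of the Salton vector space model cited above. This is a sketch under stated assumptions: the concept labels are hand-assigned placeholders (not Metathesaurus identifiers and not output of the application), and plain term-count weighting is assumed since the poster does not specify a weighting scheme.

```python
import math
from collections import Counter

def cosine(concepts_a, concepts_b):
    """Cosine similarity between two bags of concept labels."""
    va, vb = Counter(concepts_a), Counter(concepts_b)
    dot = sum(va[k] * vb[k] for k in va.keys() & vb.keys())
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0

# Hypothetical, hand-assigned concept labels for three questions.
PHARYNGITIS_Q = ["TREATMENT", "PHARYNGITIS", "CHILD"]
INSULIN_RESISTANT_Q = ["DIABETES", "PATIENT", "RESISTANCE", "INSULIN"]
# "resistant to TAKING insulin": same entity concepts, different relation
INSULIN_REFUSAL_Q = ["DIABETES", "PATIENT", "RESISTANCE", "INSULIN"]

unrelated = cosine(PHARYNGITIS_Q, INSULIN_RESISTANT_Q)
indistinguishable = cosine(INSULIN_RESISTANT_Q, INSULIN_REFUSAL_Q)
```

With these placeholder labels, the two insulin questions score as identical even though their meanings differ, because nothing in the vectors encodes the relation between the concepts.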