NameSphere
Solving the Problem of Traditional Approaches to Name Matching
Romanization

The term "rewrite rule" comes from early syntactic theory, where a string of constituents would be replaced ("rewritten") by another string, e.g., S → NP VP rewrites the symbol S as the two symbols NP and VP. As applied to names, special rules are used to convert variants of transliterated names into a single canonical form. So, for example, Arabic Mohamed, Muhammad, Mahomet, Imhammad, and Mehmed can all get regularized to Muhamad. The regularized forms are then compared in matching.

For indexing, the regularized names are compressed, using one compression scheme or another: for instance, doubled letters and vowels are removed. This is so that sufficiently similar regularizations will retrieve each other. For instance, whether Mahumd is meant to be Muhamad or is a typo for Mahmud, both will be returned.

This approach is based on two assumptions: a. each Romanization can be mapped to exactly one source name, and b. very similar source names will have the same compression. Neither assumption is correct. Differences in Romanization and pronunciation often make it impossible to determine which source name was intended, and names that have variants in common may have very different variants as well.

For example, the Arabic letter qaf (ق) can be pronounced in various ways: in Iraq, it's a hard G (as in "goof"); in Saudi Arabia, it's a J (as in "judge"); in most cities outside the Gulf, it's a glottal stop (as in "uh-oh"). This means that the name Qaasim ("apportioner") could also be spelled as Ga(a)sim, Ja(a)sim, or 'A(a)sim. But Jaasim and 'Asim could also represent separate names (meaning "tremendous one" and "protector", respectively). Getting these names to match by regularizing them to the same form obscures the differences between them. This Q-G-J-' alternation is not an isolated case; such instances are common.
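
The over-merging described above can be reproduced in a few lines. The following is a minimal sketch, not the rule set of any actual product: the `regularize` rule (folding an initial Q, G, J or ' into Q) and the `compress` scheme (dropping doubled letters and non-initial vowels) are assumptions chosen only to mirror the examples in the text.

```python
import re

# Hypothetical regularization rule (illustrative only): treat an initial
# Q, G, J or apostrophe as a Romanization of the same Arabic letter, qaf.
INITIAL_QAF_VARIANT = re.compile(r"^[qgj']")


def regularize(name: str) -> str:
    """Lowercase the name and fold initial Q/G/J/' into a canonical 'q'."""
    return INITIAL_QAF_VARIANT.sub("q", name.lower())


def compress(name: str) -> str:
    """Assumed compression: collapse doubled letters, drop non-initial vowels."""
    deduped = re.sub(r"(.)\1+", r"\1", name)
    if not deduped:
        return ""
    return deduped[0] + re.sub(r"[aeiou]", "", deduped[1:])


def index_key(name: str) -> str:
    """Key under which a name would be indexed and retrieved."""
    return compress(regularize(name))


if __name__ == "__main__":
    # Variants of one name and two distinct names all share a single key.
    for name in ["Qaasim", "Gasim", "Jaasim", "'Asim"]:
        print(name, "->", index_key(name))   # each prints "qsm"
    # Muhamad, Mahmud and the ambiguous Mahumd also collide.
    for name in ["Muhamad", "Mahmud", "Mahumd"]:
        print(name, "->", index_key(name))   # each prints "mhmd"
```

Under these assumed rules, Jaasim and 'Asim become indistinguishable from Qaasim once indexed, which is exactly the loss of information the text describes.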
Consider the following overlapping Romanizations:

| Arabic letter | Spellings | Example name: Na_ir | Meaning |
|---|---|---|---|
| za' (ز) | z, dh | Nazir, Nadhir | peer |
| dhal (ذ) | dh, d | Nadhir, Nadir | herald |
| tha' (ث) | th, dh, s | Nadhir, Nathir, Nasir | spreader |
| dal (د) | d | Nadir | rare |
| sin (س) | s | Nasir | granter of victory |

So Nadhir could represent at least three different names, each of which might have other variants, none of which should match each other. Either names which don't match are grouped together, or names that do are missed.

Segmentation

In many cultures, names commonly have affixes attached: prefixes (e.g. Arabic Hajyousef), or suffixes (e.g. Russian Petrovna). Names can also appear joined together, an occurrence quite frequent with Chinese names (e.g. Xiaoyan vs Xiao Yan, Xiaomei vs Xiao Mei, Meixiao vs Mei Xiao, Linxiao vs Lin Xiao).
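
To illustrate why dividing a joined name is not a mechanical step, the sketch below enumerates every way a joined spelling can be split into known units. It is only an illustration under stated assumptions: the function name `segmentations` and the tiny syllable inventory are hypothetical, and a real system would need a far larger inventory and some way to rank the candidates.

```python
def segmentations(name, syllables):
    """Return every way to split `name` into units drawn from `syllables`."""
    name = name.lower()
    if not name:
        return [[]]  # one way to split the empty string: no units at all
    results = []
    for end in range(1, len(name) + 1):
        head = name[:end]
        if head in syllables:
            # Keep this unit and recursively split whatever remains.
            for rest in segmentations(name[end:], syllables):
                results.append([head] + rest)
    return results


# Hypothetical, deliberately tiny inventory of pinyin syllables.
SYLLABLES = {"xi", "xia", "xiao", "o", "ao", "yan"}

if __name__ == "__main__":
    for split in segmentations("Xiaoyan", SYLLABLES):
        print(" ".join(split))
    # prints:
    # xi ao yan
    # xia o yan
    # xiao yan
```

Even with six syllables, the joined form has three candidate splits, so a rule that inserts whitespace has to commit to one of them without knowing which was intended.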
By introducing whitespace into names, traditional rules can divide joined names, or remove prefixes and suffixes. Unfortunately, this process is prone to error.

Rule Ordering

A rewrite-rule engine is limited by the fact that only one rule can fire at any given position in a name. Once the rule has fired, the pointer advances. It is not possible to back up, or to stay in the same place.

For instance, a potential rule to convert Francophone CH to SH allows Charif to become Sharif. Another rule might break up initial consonant clusters by inserting a vowel, converting Shrif into Sharif. However, the first rule is trumped by the second, so that Chrif becomes Charif, not matching Sharif (not to mention the problem of Germanophone spellings where CH should become KH, not SH).

Corrective attempts to avoid manglings (like Ladin to L Al Din) could change rules to apply only after three letters. However, this would cause problems with Noureddine, a Francophone spelling of Nur Al Din ("light of the faith"). Although Nour is four letters long, a rule to convert Francophone OU to U would reduce it to three. After OU is rewritten as U, the pointer advances to R, and all the rule engine can see is "Reddine", which is too short to trigger the Al Din rule. Either general rules that extract Al Din after four letters, five letters, etc. are required (and these in turn may interfere with other rules in unexpected ways), or a special rule for Noureddine is needed.

Of course workarounds can be found for these particular cases, but there is no general solution to this problem. Although a rewrite mechanism makes it possible to write very powerful rules, there is no way to ensure that these rules will not interfere with each other, and the more of these workaround rules one writes, the harder it becomes for those who inherit the system to understand, maintain, and improve the rules.

The NameSphere Approach

NameSphere uses three main ways to improve retrieving and comparing transliterated names. Although some of these could be implemented alongside rewrite rules, others would involve more or less drastic changes to a rewrite process. Implementing all three of these solutions allows NameSphere to completely replace a traditional rewrite-rule system.

NameSphere allows improved methods of generating good name variants, as well as improved methods of finding variants of a particular spelling, by employing a suite of tools. These tools operate behind the scenes, providing the user with a seamless process.

Generating a list of quality variations of a name can be performed in a few steps, unseen by the user. NameSphere's tailored multi-compressional intersective-comparative approach to names produces lists of the good variant spellings, while simultaneously eliminating the unwanted junk variations.

Finding variants of a particular spelling involves a pre-generative tailored baseform search. By mapping the source variant to a base form before generating variants, NameSphere avoids the skewed possibilities derived from taking a non-standard variant as the prototype.

Please contact George N. Hallak for pricing and customized solutions.
AramediA
61 Adams Street, Braintree, MA 02184-1906
We Ship All Around the Globe
Copyright © 1995 - 2008 - GnhBos, Incorporated. dba AramediA. All rights reserved.
Our dictionaries, machine translation, translation memory, and multilingual dictionary software cover a wide spectrum of languages and applications. Call for more information: 1-781-849-0021