NameSphere

Solving the Problem of Traditional Approaches to Name Matching


NameSphere combines several tools and techniques to match variant spellings and transliterations of names originating in languages that use non-Roman scripts. It replaces the widely used rewrite-rule approach, which has several disadvantages, particularly for non-European names. The main problem areas are Romanization, segmentation, and rule ordering. No matter how much a rewrite-rule system is tweaked, it cannot completely overcome these disadvantages.

Romanization

The term "rewrite rule" comes from early syntactic theory, where a string of constituents would be replaced ("rewritten") by another string, e.g., S → NP VP rewrites the symbol S as the two symbols NP and VP. As applied to names, special rules convert variants of transliterated names into a single canonical form. So, for example, Arabic Mohamed, Muhammad, Mahomet, Imhammad, and Mehmed can all be regularized to Muhamad.

The regularized forms are then compared in matching. For indexing, the regularized names are compressed using one compression scheme or another: for instance, doubled letters and vowels are removed. This ensures that sufficiently similar regularizations retrieve one another. For instance, whether Mahumd is meant to be Muhamad or is a typo for Mahmud, both will be returned.
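The compress-and-index step can be sketched roughly as follows. This is only an illustration of the example compression mentioned above (collapsing doubled letters and dropping vowels); the names `compress`, `lookup`, and `index` are assumptions made for the sketch, not part of any actual product.

```python
import re
from collections import defaultdict

VOWELS = set("AEIOU")

def compress(name):
    """Hypothetical compression: collapse doubled letters, then drop vowels."""
    collapsed = re.sub(r"(.)\1+", r"\1", name.upper())
    return "".join(ch for ch in collapsed if ch not in VOWELS)

# Hypothetical index: compressed key -> regularized names sharing that key.
index = defaultdict(set)
for regularized in ("Muhamad", "Mahmud"):
    index[compress(regularized)].add(regularized)

def lookup(query):
    """Return every indexed name whose compression equals the query's."""
    return sorted(index.get(compress(query), ()))

print(compress("Muhamad"), compress("Mahmud"))  # both keys are MHMD
print(lookup("Mahumd"))                          # ['Mahmud', 'Muhamad']
```

Because both names collapse to the same key, the index returns both and cannot tell which one the query intended.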

This approach rests on two assumptions: (a) each Romanization can be mapped to exactly one source name, and (b) very similar source names will have the same compression. Neither assumption is correct. Differences in Romanization and pronunciation often make it impossible to determine which source name was intended, and names that share some variants may also have other variants that differ widely.

For example, the Arabic letter qaf ق can be pronounced in various ways: in Iraq, it's a hard G (as in "goof"); in Saudi Arabia, it's a J (as in "judge"); in most cities outside the Gulf, it's a glottal stop (as in "uh-oh"). This means that the name Qaasim ("apportioner") could also be spelled as Ga(a)sim, Ja(a)sim, or 'A(a)sim. But Jaasim and 'Asim could also represent separate names (meaning "tremendous one" and "protector", respectively). Getting these names to match by regularizing them to the same thing obscures the differences between them. This Q-G-J-' alternation is not an isolated case; such instances are common. Consider the following overlapping Romanization:

Arabic letter    Spellings     Example name: Na_ir       Meaning
za' ز            z, dh         Nazir, Nadhir             peer
dhal ذ           dh, d         Nadhir, Nadir             herald
tha' ث           th, dh, s     Nadhir, Nathir, Nasir     spreader
dal د            d             Nadir                     rare
sin س            s             Nasir                     granter of victory
 

So Nadhir could represent at least three different names, each of which might have other variants, none of which should match each other. A rewrite-rule system is forced to choose: either names that should not match are grouped together, or names that should match are missed.
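The overlap in the table can be made concrete with a small sketch. The data below is copied from the table; the dictionary layout and the helper names `invert` and `merged_group` are assumptions made for illustration. The second helper shows what happens if spellings are merged whenever they share a source name, which is in effect what regularizing to one canonical form does.

```python
from collections import defaultdict

# Source name (meaning, letter) -> Romanized spellings, taken from the table.
SOURCE_NAMES = {
    "peer (za')": ["Nazir", "Nadhir"],
    "herald (dhal)": ["Nadhir", "Nadir"],
    "spreader (tha')": ["Nadhir", "Nathir", "Nasir"],
    "rare (dal)": ["Nadir"],
    "granter of victory (sin)": ["Nasir"],
}

def invert(source_names):
    """Map each Romanized spelling to the source names it could stand for."""
    by_spelling = defaultdict(set)
    for source, spellings in source_names.items():
        for spelling in spellings:
            by_spelling[spelling].add(source)
    return by_spelling

def merged_group(start, source_names):
    """All spellings reachable from `start` by following shared source names."""
    seen = {start}
    frontier = [start]
    while frontier:
        spelling = frontier.pop()
        for spellings in source_names.values():
            if spelling in spellings:
                for other in spellings:
                    if other not in seen:
                        seen.add(other)
                        frontier.append(other)
    return seen

print(sorted(invert(SOURCE_NAMES)["Nadhir"]))      # three candidate source names
print(sorted(merged_group("Nazir", SOURCE_NAMES)))  # all five spellings end up together
```

Starting from Nazir alone, transitive merging pulls in every spelling in the table, so five unrelated names would be treated as one.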

Segmentation

In many cultures, names commonly have affixes attached: prefixes (e.g. Arabic Hajyousef), or suffixes (e.g. Russian Petrovna). Names can also appear joined together, an occurrence quite frequent with Chinese names (e.g. Xiaoyan vs Xiao Yan, Xiaomei vs Xiao Mei, Meixiao vs Mei Xiao, Linxiao vs Lin Xiao). By introducing whitespace into names, traditional rules can divide joined names, or remove prefixes and suffixes. Unfortunately, this process is prone to error.

First, it may remove affixes that aren't really affixes. The North African Arabic prefix Ow, a variant of Ould (which comes from Arabic Walad, "son of"), also has variants Aw, Wa, Wi, Oua and Oui. A rule to remove this prefix would also apply to names like Wakim, Wisam, Wizight and Oussama, where it should not. Similarly, the suffix Aldin ("the faith") often appears without the L, in names like Saifeddin ("sword of..."), Shamseddine ("sun of..."), Salahuddin ("righteousness of..."). Removing this suffix will mangle names like Ladin (changing it to L Al Din) and Ayden (becoming Ay Al Din). Similar problems will crop up with suffixed variants of Allah.
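A naive suffix rule of the kind described above might look like the following sketch. The regular expression and the function name `strip_al_din` are hypothetical, written only to reproduce the behavior in the examples; they are not taken from any real rule set.

```python
import re

# Hypothetical pattern for spelled-out variants of the suffix Al Din:
# an optional "al" or linking vowel, one or two d's, i/e, n, optional final e.
AL_DIN = re.compile(r"(?:al|[aeu])?d{1,2}[ie]ne?$", re.IGNORECASE)

def strip_al_din(name):
    """Split off anything that looks like the Al Din suffix."""
    match = AL_DIN.search(name)
    if not match or match.start() == 0:
        return name  # no suffix found, or nothing would be left of the name
    return f"{name[:match.start()]} Al Din"

for name in ("Saifeddin", "Shamseddine", "Salahuddin", "Ladin", "Ayden"):
    print(name, "->", strip_al_din(name))
```

The first three names are segmented as intended, but the same rule also turns Ladin into "L Al Din" and Ayden into "Ay Al Din", because the pattern has no way to know that those endings are not the suffix.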

Second, a rewrite mechanism is unable, in general, to split apart joined names, as there is no general mechanism to recognize two familiar elements and break them up. Because a rule engine operates in one pass, regularization and segmentation must take place at the same time. This means that each name requires a separate rule to divide it from other names. For example, if there is a rule to separate variants of Chinese Xiao from the beginnings and endings of names, there is no way to check whether the remaining segment (e.g. Yan, Mei, Lin) is, in fact, also a name. To do that requires rules to cover every possible pair of joined names -- in either possible order!
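For contrast, the check that a one-pass rule engine cannot perform is easy to express once a list of known names is available. The sketch below is an assumption-laden illustration, not a description of how any product works: the lexicon `KNOWN_NAMES` and the function `split_joined` are invented for the example.

```python
# Hypothetical lexicon of known name elements (lowercase).
KNOWN_NAMES = {"xiao", "yan", "mei", "lin"}

def split_joined(name, lexicon=KNOWN_NAMES):
    """Return every way to cut `name` into two parts that are both known names."""
    lower = name.lower()
    splits = []
    for cut in range(1, len(lower)):
        left, right = lower[:cut], lower[cut:]
        if left in lexicon and right in lexicon:
            splits.append((left.capitalize(), right.capitalize()))
    return splits

print(split_joined("Xiaoyan"))   # [('Xiao', 'Yan')]
print(split_joined("Meixiao"))   # [('Mei', 'Xiao')]
print(split_joined("Xiaoping"))  # [] because "ping" is not in this tiny lexicon
```

A single lookup table covers both orders of every pair, whereas the rule-based approach needs one rule per pair and per order.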

Rule Ordering

A rewrite-rule engine is limited by the fact that only one rule can fire at any given position in a name. Once a rule has fired, the pointer advances; it is not possible to back up or to stay in the same place. For instance, a rule that converts Francophone CH to SH allows Charif to become Sharif. Another rule might break up initial consonant clusters by inserting a vowel, converting Shrif into Sharif. However, the second rule preempts the first: Chrif becomes Charif, the pointer moves past CH, and the result fails to match Sharif. (Germanophone spellings add a further problem, since there CH should become KH, not SH.)
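The one-rule-per-position behavior can be simulated with a toy engine. Everything here is a simplified assumption for illustration: the engine `rewrite`, the rule format (pattern, replacement, initial-only flag), and the two rules themselves are invented to mirror the Charif/Chrif example, not taken from a real system.

```python
import re

# Hypothetical rules, in priority order: (pattern, replacement, initial_only).
RULES = [
    (re.compile(r"(CH|SH)R"), r"\g<1>AR", True),  # break up an initial cluster
    (re.compile(r"CH"), "SH", False),             # Francophone CH -> SH
]

def rewrite(name, rules):
    """One-pass rewriting: at each position the first matching rule fires,
    its output is emitted, and the pointer jumps past the matched input."""
    text = name.upper()
    out = []
    pos = 0
    while pos < len(text):
        for pattern, replacement, initial_only in rules:
            if initial_only and pos != 0:
                continue
            match = pattern.match(text, pos)
            if match:
                out.append(match.expand(replacement))
                pos = match.end()
                break
        else:
            out.append(text[pos])  # no rule fired: copy the letter unchanged
            pos += 1
    return "".join(out)

for name in ("Charif", "Shrif", "Chrif"):
    print(name, "->", rewrite(name, RULES))
print("Chrif (rules reversed) ->", rewrite("Chrif", list(reversed(RULES))))
```

With the cluster rule first, Chrif comes out as CHARIF; with the order reversed, it comes out as SHRIF. Neither ordering yields SHARIF, because the output of one rule is never offered to the other.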

Corrective attempts to avoid manglings (like Ladin to L Al Din) could restrict the rules so that they apply only after three letters. However, this would cause problems with Noureddine, a Francophone spelling of Nur Al Din ("light of the faith"). Although Nour is four letters long, a rule to convert Francophone OU to U would reduce it to three. After OU is rewritten as U, the pointer advances to R, and all the rule engine can see is "Reddine", which is too short to trigger the Al Din rule. The fix requires either general rules that extract Al Din after four letters, five letters, and so on (which may in turn interfere with other rules in unexpected ways) or a special rule for Noureddine.

Of course, workarounds can be found for these particular cases, but there is no general solution to this problem. Although a rewrite mechanism makes it possible to write very powerful rules, there is no way to ensure that these rules will not interfere with each other, and the more workaround rules one writes, the harder it becomes for those who inherit the system to understand, maintain, and improve them.

The NameSphere Approach

NameSphere uses three main ways to improve retrieving and comparing transliterated names:
1. Language-specific compression schemes
2. Dictionaries of names by ethnolinguistic origin
3. Fuzzy matching tailored to specific groups

Although some of these could be implemented alongside rewrite rules, others would involve more or less drastic changes to a rewrite process. Implementing all three of these solutions allows NameSphere to completely replace a traditional rewrite-rule system. NameSphere allows improved methods of generating good name variants, as well as improved methods of finding variants of a particular spelling, by employing a suite of tools. These tools operate behind the scenes, providing the user with a seamless process.

Generating a list of quality variants of a name takes a few steps, all unseen by the user. NameSphere's tailored multi-compressional intersective-comparative approach to names produces lists of good variant spellings while eliminating unwanted junk variations.

Finding variants of a particular spelling involves a pre-generative tailored baseform search. Through mapping the source variant to a base form before generating variants, NameSphere avoids the skewed possibilities derived from taking a non-standard variant as prototype.

Please contact George N. Hallak for pricing and customized solutions.
 


AramediA

61 Adams Street, Braintree, MA 02184-1906
United States of America (USA)
Tel 1-781-849-0021 Fax 1-781-849-2922


Copyright © 1995 - 2008 - GnhBos, Incorporated. dba AramediA. All rights reserved.