TESTING AND SUBJECTS FOR FUTURE RESEARCH
A total of 31 image records were generated for the test index. These 31 text files were created by recording a MACRO to generate the fields in WordPerfect 7.0. All appropriate data was then entered into the nine fields outlined in Chapter IV, and the files were saved as ASCII DOS Generic Word Processor (.txt) files. This file format was selected because the ZyIMAGE Web search engine outputs only flat text, in the form of ASCII-style text files. This is true of the ZyINDEX system no matter what the source, whether it be electronic data or scanned images that the ZySCAN software translates into .txt files in the Optical Character Recognition (OCR) process. The format was also deemed appropriate because it works around the text placement problems that can occur with software-specific indents and tabs.
The next step is to build an index in ZyBUILDi. The process of building this index is quite simple, and takes only a second or two for a database of this size. Two major considerations in the index building process are the generation of a stop word file (Zy calls it a noise word list) and a synonym list. Both of these critical lists are index-specific; that is, a custom file is built by the indexer as this process develops.
ZyBUILDi supplies a generic noise word file that can be edited at the discretion of the programmer. This list includes many short words, adjectives, conjunctions, and unfortunately for this Civil War Index, all single characters, like "a" or "c". In an index that includes many items identified by character (Company "A" or Ulysses "S" Grant), it was necessary to exclude these characters from the stop list. The field names (CREATOR, DESCRIPTION, etc.) were added to this stop word list, with the intention of limiting the size and optimizing the operating speed of the index. ZyBUILDi includes the noise list text file in the index folder, and it is a simple matter to edit it at any time. Of course, following all modifications, the index must be rebuilt.
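The noise list adjustments described above can be sketched in a few lines of Python. This is only an illustration: the function name `customize_noise_words`, the sample word list, and the assumption that noise words are compared without regard to case are all hypothetical, and the sketch works on in-memory lists rather than on the actual ZyBUILDi noise file, whose layout is not described here.

```python
import string


def customize_noise_words(generic_words, keep, add):
    """Return a sorted noise word list.

    Words listed in `keep` are removed from the noise list so that they
    remain searchable; words listed in `add` are appended to it.
    Comparison is case-insensitive (an assumption of this sketch).
    """
    keep_set = {w.lower() for w in keep}
    words = {w.lower() for w in generic_words} - keep_set
    words |= {w.lower() for w in add}
    return sorted(words)


# Hypothetical excerpt of a generic noise list.
generic = ["a", "c", "s", "and", "the", "of"]

# Single characters must stay searchable (Company "A", Ulysses "S" Grant).
single_chars = list(string.ascii_lowercase)

# Field names appear in every record, so they are treated as noise.
field_names = ["CREATOR", "DESCRIPTION"]

noise = customize_noise_words(generic, keep=single_chars, add=field_names)
print(noise)
```

As in ZyBUILDi itself, any change of this kind would still have to be followed by a rebuild of the index.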
For some inexplicable reason, the synonym list is not modified in ZyBUILDi, but rather in ZyFINDi, the index search software. After each synonym modification, the indexer must return to ZyBUILDi to rebuild the index to include the newly constructed synonyms. This list is a very important element of the index. Every military rank appears in two, three, or possibly more forms, and the same is true of all fighting units. All synonym lists have to be constructed normally and conversely to cover all possible permutations of the forms. For example, the following list addresses the synonyms for the rank of sergeant:
First Permutation | Second Permutation | Third Permutation |
---|---|---|
Coreword: sergeant | Coreword: ser | Coreword: sergt |
Synonym: ser | Synonym: sergeant | Synonym: ser |
Synonym: sergt | Synonym: sergt | Synonym: sergeant |
This treatment is necessary for all fighting units (O.V.I = OVI = Ohio Volunteer Infantry), and all other index terms that have possible synonyms that could derail the record retrieval process. Synonym identification is of primary importance in helping the researcher to successfully match the language contained within the catalogue records. After all, an electronic index actually performs just one very simple task: it matches words and characters in the index with words and characters supplied by the researcher. The building of a synonym list is a continuing process that evolves throughout the duration of the record and index creation process.
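The "normal and converse" construction lends itself to mechanical generation. The following Python sketch is an illustration only: the function name `synonym_permutations` is hypothetical, and the output is a plain dictionary rather than the synonym file format used by ZyFINDi, which is not documented here. Each known form of a term becomes a coreword in turn, with every other form listed as its synonym.

```python
def synonym_permutations(forms):
    """Map each form (as coreword) to all the other forms (as synonyms)."""
    # Remove duplicate forms while preserving their original order.
    unique = list(dict.fromkeys(forms))
    return {core: [f for f in unique if f != core] for core in unique}


rank_table = synonym_permutations(["sergeant", "ser", "sergt"])
for coreword, synonyms in rank_table.items():
    print("Coreword:", coreword, "| Synonyms:", ", ".join(synonyms))

# The same treatment applied to a fighting unit.
unit_table = synonym_permutations(["O.V.I", "OVI", "Ohio Volunteer Infantry"])
```

Generating the permutations this way keeps the list symmetric, so a new form added to one term is automatically reachable from every other form of that term.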
Four reference librarians from the Ohio Historical Society were invited to test this small index. The response from all four was a unanimous, "how soon can this index be completed?" This response is indicative of just how badly this index is needed at OHS. They often had to be reminded to be critical for the betterment of the index model. They all seemed pleased with the record layout, and proceeded with logic and determination when they got unexpected results. They were encouraged to browse and to search for known items provided by the writer. Overall impressions were uniformly positive.
Technically, several interesting factors were revealed. The use of any punctuation in the search string will completely void the search. For example, one librarian searched for a known item (a photograph of Charles K. Crumit) with the following search string:
"Charles K. Crumit"
This search yielded no hits, and we were both perplexed because we were looking at a print-out of the record, which the above string matched exactly. We then deleted the period after the initial K, and the record was retrieved as expected. ZyINDEX ignores all punctuation that is followed by a space, but it "sees" punctuation that has characters on both sides of it. Thus, it sees the period in "CWI09.txt", but not in "Charles K. Crumit".
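The punctuation behavior observed in testing can be modeled roughly as follows. This Python sketch is an approximation built only from the two cases described above, not a description of how ZyINDEX is actually implemented; the function name `visible_terms` and the regular expression are assumptions made for illustration.

```python
import re

# Punctuation followed by whitespace (or by the end of the text) is dropped.
# Punctuation with characters on both sides is left in place.
_TRAILING_PUNCT = re.compile(r"[^\w\s]+(?=\s|$)")


def visible_terms(text):
    """Return the terms the index would 'see' under the rule modeled here."""
    cleaned = _TRAILING_PUNCT.sub("", text)
    return cleaned.split()


print(visible_terms("Charles K. Crumit"))
print(visible_terms("CWI09.txt"))
```

Under this model the indexed record contains the term "K" rather than "K.", which is consistent with the search succeeding only after the period was removed from the search string.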
The order of the name entry also caused some problems. In the DESCRIPTION (field #2) the soldier's name is always entered in direct order, "Charles K Crumit", but in the SUBJECTS fields, it is entered in inverted order, in keeping with the Library of Congress Name Authority File practice. If an experienced searcher enters a name in inverted order, "Crumit Charles", and the name appears only in the DESCRIPTION field, he will get no hits. To avoid this problem, the indexer determined that all names should appear in both fields, to make name entry order as transparent as possible for the searcher.
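A small helper can produce both entry orders from a single direct-order name, so that the two forms are entered consistently. The Python sketch below is hypothetical: the function name `name_entry_forms` is invented for illustration, and it assumes that the surname is the last word of the name, which will not hold for compound surnames. The inverted form is written without a comma to avoid the punctuation problem described earlier.

```python
def name_entry_forms(direct_name):
    """Return the direct-order and inverted-order forms of a personal name.

    Assumes the surname is the final word of `direct_name`.
    """
    parts = direct_name.split()
    if len(parts) < 2:
        # A single word cannot be inverted.
        return [direct_name]
    surname, given = parts[-1], parts[:-1]
    inverted = " ".join([surname] + given)
    return [direct_name, inverted]


print(name_entry_forms("Charles K Crumit"))
```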
The testing performed on this index was limited, but it provided valuable data for index and record customization. The goal was to identify index problems and ambiguities before the generation of large quantities of records made large-scale record modification cost-prohibitive.
Several research possibilities are suggested by the initial phases of the index creation.