NAME: NetTalk Corpus

SUMMARY: This is an updated and corrected version of the data set used by Sejnowski and Rosenberg in their influential study of speech generation using a neural network [1]. The file "nettalk.data" contains a list of 20,008 English words, along with a phonetic transcription for each word. The task is to train a network to produce the proper phonemes, given a string of letters as input. This is an example of an input/output mapping task that exhibits strong global regularities, but also a large number of more specialized rules and exceptional cases.

SOURCE: The data set was contributed to the benchmark collection by Terry Sejnowski, now at the Salk Institute and the University of California at San Diego. The data set was developed in collaboration with Charles Rosenberg of Princeton. Approximately 250 person-hours went into creating and testing this database.

COPYRIGHT STATUS: The data file carries the following copyright notice:

Copyright (C) 1988 by Terrence J. Sejnowski. Permission is hereby given to use the included data for non-commercial research purposes. Contact The Johns Hopkins University, Cognitive Science Center, Baltimore MD, USA for information on commercial use.

MAINTAINER: neural-bench@cs.cmu.edu

PROBLEM DESCRIPTION: The data set is in the benchmark directory in file "nettalk.data". This data set is in the standard CMU Benchmark format.

For those who require it, a reconstructed list of the 1000 most common words is appended to the end of THIS file. These same words form the test data for "nettalk.data".

METHODOLOGY: This data set can be used in a number of different ways to test learning speed, quality of ultimate learning, ability to generalize, or combinations of these factors.

Sejnowski and Rosenberg used the following experimental setup:

The input to the network is a series of seven consecutive letters from one of the training words.
The central letter in this sequence is the "current" one for which the phonemic output is to be produced. Three letters on either side of this central letter provide context that helps to determine the pronunciation. Of course, there are a few words in English for which this local seven-letter window is not sufficient to determine the proper output.

For the study using this "dictionary" corpus, individual words are moved through the window so that each letter in the word is seen in the central position. Blanks are added before and after the word as needed. Some words appear more than once in the dictionary, with different pronunciations in each case; only the first pronunciation given for each word was used in this experiment.

A unary encoding is used. For each of the seven letter positions in the input, the network has a set of 29 input units: one for each of the 26 letters in English, and three for punctuation characters. Thus, there are 29 x 7 = 203 input units in all.

The output side of the network uses a distributed representation for the phonemes. There are 21 output units representing various articulatory features such as voicing and vowel height. Each phoneme is represented by a distinct binary vector over this set of 21 units. In addition, there are 5 output units that encode stress and syllable boundaries.

Standard back-propagation was used, with the weights updated after the error gradient had been computed by back-propagation of errors for all the letter positions in a single word. The number of hidden units in the network was varied from one experiment to another, from 0 to 120. Each layer was totally connected to the next; in the case of 0 hidden units, the input units were directly connected to the outputs. The weight-update formulas used in the Sejnowski and Rosenberg study were slightly different from the standard form; see the paper for a description of the momentum and learning rate parameters used.
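
The seven-letter window and the unary input encoding described above can be sketched in a few lines of Python. This is only an illustration, not the code used in the original study. The names ALPHABET, windows, and encode are hypothetical, and the choice of blank, comma, and period as the three non-letter symbols is an assumption; the text says only that three units are reserved for punctuation characters.

```python
import string

# Assumption: the three non-letter symbols are blank, comma, and period.
# The source states only that 3 of the 29 units are for punctuation.
ALPHABET = string.ascii_lowercase + " ,."   # 26 letters + 3 symbols = 29
WINDOW = 7                                  # letters visible at once
HALF = WINDOW // 2                          # context letters on each side


def windows(word):
    """Yield one blank-padded 7-character window per letter of `word`.

    The i-th window has the i-th letter of the word in its central position.
    """
    padded = " " * HALF + word.lower() + " " * HALF
    for i in range(len(word)):
        yield padded[i:i + WINDOW]


def encode(window):
    """Return a flat 0/1 list with one 29-unit group per window position."""
    vec = [0] * (len(ALPHABET) * len(window))
    for pos, ch in enumerate(window):
        # Exactly one unit is switched on inside each 29-unit group.
        vec[pos * len(ALPHABET) + ALPHABET.index(ch)] = 1
    return vec


if __name__ == "__main__":
    for w in windows("cat"):
        print(repr(w), len(encode(w)), sum(encode(w)))
```

Each encoded window has 29 x 7 = 203 elements, of which exactly seven are set to 1, one per letter position.
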
The network weights were initialized with random values in the range -0.3 to +0.3. A sum-of-squares error measure was used, with errors of magnitude less than 0.1 being treated as 0.0.

RESULTS: In [1], Sejnowski and Rosenberg present full error curves for a number of experiments using this corpus. Here we will present only a brief summary of these results. The results reported here all use the "best guess" criterion: an output is treated as correct if it is closer (smallest angle) to the correct output vector than to any other phoneme output vector.

A subset, consisting of the 1000 most common English words, was tested with 0, 15, 30, 60, and 120 hidden units. With 0 hidden units, the best performance achieved was about 82% correct by the "best guess" criterion. The rate of learning and final performance improved steadily with increasing numbers of hidden units, up to 98% correct with 120 hidden units. (The original "list of 1000 most common English words" is not available, but a reconstruction of this list is in file "nettalk.list" in the benchmark database. This should be very close to the original.) This 120-hidden-unit network scored about 80% after training on 5000 word-presentations (i.e., five times through the 1000-word corpus), and the 98% level was reached after 30,000 presentations.

This pre-trained network with 120 hidden units was then tried on the full corpus of 20,000 words. Without further training, the correct output was generated in 77% of the cases. After 5 passes through the larger corpus, performance improved to 90% correct.

Sejnowski reports that in later (unpublished) experiments, a better rate of generalization was achieved. A window of 11 consecutive letters was used instead of the 7 used in other experiments. The network was trained on 18,000 words from the corpus until about 94% of the output phonemes were correct by the best-guess criterion.
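
The "best guess" criterion behind these percentages can be sketched as a nearest-vector lookup by angle. This is a minimal illustration under stated assumptions, not the scoring code from the study. The function names angle and best_guess are hypothetical, and the four-feature CODEBOOK is a made-up toy; the real phoneme codes are binary vectors over 21 articulatory features, which are not reproduced here.

```python
import math


def angle(u, v):
    """Return the angle in radians between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumption: a zero vector is treated as orthogonal to everything.
        return math.pi / 2
    # Clamp to [-1, 1] to guard against floating-point rounding.
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cosine)


def best_guess(output, codebook):
    """Return the label whose code vector makes the smallest angle with `output`."""
    return min(codebook, key=lambda label: angle(output, codebook[label]))


# Toy, hypothetical codebook with 4 features instead of the real 21.
CODEBOOK = {
    "p": [1, 0, 0, 1],
    "b": [1, 1, 0, 1],
    "a": [0, 1, 1, 0],
}

if __name__ == "__main__":
    network_output = [0.9, 0.2, 0.1, 0.7]
    print(best_guess(network_output, CODEBOOK))
```

Under this criterion an output counts as correct when best_guess returns the label of the target phoneme, even if individual output units are far from 0 or 1.
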
The remaining 2,000 words were then presented without further training, and 92% of the output phonemes were correct.

REFERENCES:

1. Sejnowski, T.J., and Rosenberg, C.R. (1987). "Parallel networks that learn to pronounce English text" in Complex Systems, 1, 145-168.

-------------------------------------------------------------------------------

Note (by Scott Fahlman):

The Nettalk paper by Sejnowski and Rosenberg reports a number of experiments in which the net was trained on "a list of the 1000 most common English words". Several people have asked for this list so that they can attempt to duplicate those experiments.

Unfortunately, the original list was not saved. However, Terry Sejnowski states that this list was created by scanning the list of most common words in the Brown corpus and selecting the first 1000 of these that also appear in the nettalk dictionary. He also provided me with a portion of this list.

After a bit of hacking with editor macros and Lisp, I have prepared the following list of words that should closely approximate the 1000-word list used by Sejnowski and Rosenberg.
*************************************************************************** (THE OF AND TO IN THAT IS WAS HE FOR IT WITH AS HIS ON BE AT BY I THIS HAD NOT ARE BUT FROM OR HAVE AN THEY WHICH ONE YOU WERE HER ALL SHE THERE WOULD THEIR WE HIM BEEN HAS WHEN WHO WILL MORE NO IF OUT SO SAID WHAT UP ITS ABOUT INTO THAN THEM CAN ONLY OTHER NEW SOME TIME COULD THESE TWO MAY THEN DO FIRST ANY MY NOW SUCH LIKE OUR OVER MAN ME EVEN MOST MADE AFTER ALSO DID MANY BEFORE MUST THROUGH BACK WHERE MUCH YOUR WAY WELL DOWN SHOULD BECAUSE EACH JUST THOSE PEOPLE HOW TOO LITTLE STATE GOOD VERY MAKE WORLD STILL OWN SEE MEN WORK LONG HERE GET BOTH BETWEEN LIFE BEING UNDER NEVER SAME DAY ANOTHER KNOW WHILE LAST MIGHT US GREAT OLD YEAR OFF COME SINCE GO AGAINST CAME RIGHT TAKE THREE HIMSELF FEW HOUSE USE DURING WITHOUT AGAIN PLACE AMERICAN AROUND HOWEVER HOME SMALL FOUND THOUGHT WENT SAY PART ONCE HIGH GENERAL UPON SCHOOL EVERY DOES GOT UNITED LEFT NUMBER COURSE WAR UNTIL ALWAYS SOMETHING FACT THOUGH WATER LESS PUBLIC PUT THINK ALMOST HAND ENOUGH FAR TOOK HEAD YET GOVERNMENT SYSTEM SET BETTER TOLD NOTHING NIGHT END WHY FIND LOOK GOING POINT KNEW NEXT CITY BUSINESS GIVE GROUP YOUNG LET ROOM PRESIDENT SIDE SOCIAL SEVERAL GIVEN PRESENT ORDER NATIONAL RATHER POSSIBLE SECOND FACE PER AMONG FORM OFTEN EARLY WHITE CASE LARGE BECOME NEED BIG FOUR WITHIN FELT ALONG CHILDREN SAW BEST CHURCH EVER LEAST POWER THING LIGHT FAMILY INTEREST WANT MIND COUNTRY AREA DONE OPEN GOD SERVICE CERTAIN KIND PROBLEM THUS DOOR HELP SENSE WHOLE MATTER PERHAPS ITSELF TIMES HUMAN LAW LINE ABOVE NAME EXAMPLE ACTION COMPANY LOCAL SHOW WHETHER FIVE HISTORY GAVE EITHER TODAY FEET ACT ACROSS TAKEN PAST QUITE HAVING SEEN DEATH BODY EXPERIENCE REALLY HALF WEEK WORD FIELD CAR ALREADY THEMSELVES INFORMATION TELL TOGETHER SHALL COLLEGE PERIOD MONEY SURE HELD KEEP PROBABLY REAL FREE CANNOT MISS POLITICAL QUESTION AIR OFFICE BROUGHT WHOSE SPECIAL HEARD MAJOR AGO MOMENT STUDY FEDERAL KNOWN AVAILABLE STREET RESULT ECONOMIC BOY REASON 
POSITION CHANGE SOUTH BOARD INDIVIDUAL JOB SOCIETY WEST CLOSE TURN LOVE TRUE COMMUNITY FULL FORCE COURT SEEM COST AM WIFE FUTURE AGE VOICE CENTER WOMAN COMMON CONTROL NECESSARY POLICY FRONT SIX GIRL CLEAR FURTHER LAND ABLE FEEL PARTY MUSIC PROVIDE MOTHER UNIVERSITY EDUCATION EFFECT LEVEL CHILD SHORT RUN STOOD TOWN MILITARY MORNING TOTAL OUTSIDE FIGURE RATE ART CENTURY CLASS NORTH LEAVE THEREFORE PLAN TOP SOUND EVIDENCE MILLION BLACK HARD STRONG VARIOUS BELIEVE PLAY TYPE SURFACE VALUE SOON MEAN NEAR MODERN TABLE PEACE RED ROAD TAX SITUATION PERSONAL PROCESS ALONE GONE NOR IDEA WOMEN ENGLISH INCREASE LIVING LONGER BOOK CUT FINALLY NATURE PRIVATE SECRETARY THIRD SECTION CALL FIRE KEPT GROUND VIEW DARK PRESSURE BASIS SPACE FATHER EAST SPIRIT UNION EXCEPT COMPLETE WROTE RETURN SUPPORT ATTENTION LATE PARTICULAR RECENT HOPE LIVE ELSE BROWN BEYOND PERSON COMING DEAD INSIDE REPORT LOW STAGE MATERIAL INSTEAD READ HEART LOST DATA AMOUNT PAY SINGLE COLD MOVE HUNDRED RESEARCH BASIC INDUSTRY TRIED HOLD COMMITTEE ISLAND EQUIPMENT DEFENSE ACTUALLY SON SHOWN TEN RIVER RELIGIOUS SORT CENTRAL DOING REST INDEED CARE PICTURE DIFFICULT SIMPLE FINE SUBJECT RANGE WALL MEETING FLOOR BRING FOREIGN CENT PAPER SIMILAR FINAL NATURAL PROPERTY COUNTY MARKET POLICE GROWTH INTERNATIONAL START TALK WRITTEN STORY HEAR ANSWER NEEDS HALL ISSUE CONGRESS WORKING LIKELY EARTH SAT PURPOSE LABOR STAND MEET DIFFERENCE HAIR PRODUCTION FOOD FALL STOCK WHOM SENT LETTER PAID CLUB KNOWLEDGE HOUR YES CHRISTIAN SQUARE READY BLUE BILL TRADE INDUSTRIAL DEAL BAD MORAL DUE ADDITION METHOD NEITHER THROUGHOUT COLOR TRY ANYONE READING LAY NATION FRENCH REMEMBER SIZE PHYSICAL UNDERSTAND RECORD WESTERN MEMBER SOUTHERN NORMAL STRENGTH POPULATION VOLUME DISTRICT TEMPERATURE TROUBLE SUMMER MAYBE RAN TRIAL LIST FRIEND EVENING LITERATURE LED MET ARMY ASSOCIATION INFLUENCE CHANCE HUSBAND STEP FORMER SCIENCE STUDENT CAUSE MONTH HOT AVERAGE SERIES AID DIRECT WRONG LEAD PIECE MYSELF THEORY SOVIET ASK FREEDOM BEAUTIFUL MEANING FEAR 
NOTE LOT SPRING CONSIDER BED PRESS ORGANIZATION TRUTH HOTEL EASY WIDE DEGREE HERSELF RESPECT FARM PLANT MANNER REACTION APPROACH RUNNING LOWER GAME FEED COUPLE CHARGE EYE DAILY PERFORMANCE BLOOD RADIO STOP TECHNICAL PROGRESS ADDITIONAL MARCH MAIN CHIEF WINDOW DECISION RELIGION TEST IMAGE CHARACTER MIDDLE APPEAR BRITISH RESPONSIBILITY GUN LEARNED HORSE ACCOUNT WRITING SERIOUS LENGTH GREEN ACTIVITY FISCAL CORNER FORWARD HIT AUDIENCE SPECIFIC NUCLEAR DOUBT STRAIGHT LATTER QUALITY JUSTICE DESIGN PLANE SEVEN STAY POOR BORN CHOICE OPERATION PATTERN STAFF FUNCTION INCLUDE WHATEVER SUN SHOT FAITH POOL WISH LACK SPEAK HEAVY MASS HOSPITAL BALL STANDARD AHEAD VISIT DEEP LANGUAGE FIRM PRINCIPLE CORPS INCOME DEMOCRATIC NONE EXPECT DISTANCE IMPORTANCE PRICE ANALYSIS SERVE PRETTY ATTITUDE CONTINUE DETERMINE EXISTENCE DIVISION STRESS HARDLY WRITE SCENE REACH LIMITED APPLIED AFTERNOON DRIVE PROFESSIONAL STATION HEALTH ATTACK SEASON SPENT EIGHT ROLE CURRENT NEGRO ORIGINAL BUILT DATE MOUTH RACE UNIT TEETH MACHINE COUNCIL COMMISSION NEWS SUPPLY RISE DEMAND UNLESS BIT SUNDAY OFFICER MEANT WALK DOCTOR ACTUAL CLAY GLASS POET JAZZ CAUGHT HAPPY FIGHT POPULAR CONCERN SHARE STYLE BRIDGE GAS CLAIM FOLLOW THOUSAND SUPPOSE HEAT STATUS CHRIST CATTLE RADIATION USUAL FILM OPINION PRIMARY BEHAVIOR CONFERENCE SEA PROPER ATTEMPT MARRIAGE SIR HELL CONSTRUCTION WORTH PRACTICE SIGN SOURCE WAIT ARM PARK TRADITION REMAIN PROJECT AUTHORITY LORD ANNUAL JUNE OIL OBVIOUS THIN FELL PRINCIPAL JACK CONDITION DINNER BASE STRUCTURE MEASURE WEIGHT OBJECTIVE CIVIL COMPLEX MANAGEMENT MIKE EQUAL NOTED KITCHEN DANCE BALANCE CORPORATION PASS FAMOUS REGARD DEVELOP FAILURE CLOTHES COVER BREAK CARRY MOREOVER KEY KING ADD ACTIVE CHECK BOTTOM PAIN MANAGER ENEMY POETRY TOUCH FIXED POSSIBILITY SPOKE BRIGHT BATTLE PRODUCT BUILD SIGHT ROSE LOSS PREVIOUS FINANCIAL PHILOSOPHY REQUIRE SCIENTIFIC SHAPE MARKED MUSICAL VARIETY GERMAN CAPITAL CAPTAIN CONCEPT DISTRIBUTION IMPOSSIBLE LEARN BEGIN AWARE BROAD STRANGE SEX POST CATHOLIC 
REGULAR OPENING WINTER CAPACITY SHIP SPREAD HOUSES PREVENT MARK SPEED YESTERDAY TEAM BANK GOVERNOR INSTANCE TRAIN YOUTH PRODUCE FRESH CRISIS BAR DRINK IMMEDIATE ROUND WATCH LIVES ESSENTIAL TRIP NINE EVENT APARTMENT CAMPAIGN FILE OPPOSITE NECK INDEX TWENTY OFFER GRAY LADY FULLY INDICATE SESSION RUSSIAN PROVIDENCE STUDIED SEPARATE ATMOSPHERE PROCEDURE TERM EXPRESSION REALITY MAXIMUM ECONOMY SECRET MISSION FAST FAVOR EDGE TONE ENTER LITERARY COFFEE SOLID LAID FAIR PERMIT RESPONSE TITLE JUDGE ADDRESS MODEL ELECTION ANODE)