This database comprises recordings from 306 speakers across 600 different sessions. Speech signals were recorded in a car and simultaneously transmitted over GSM and recorded on a fixed platform connected to an ISDN line.
The SpeechDat Car Spanish Database was recorded within the scope of the SpeechDat Car project (LE4-8334) which was sponsored by the European Commission and the Spanish Government.
Collection was performed at the Department of Signal Theory and Communications of the Technical University of Catalonia (UPC) (Spain) with the collaboration of SEAT and Volkswagen. The owner of the database is UPC.
The following table shows the contents and corpus codes of the SpeechDat Car Spanish Database. All items are read, unless marked as spontaneous.
Spain has a population of 38 million people. The official language is Spanish (Castilian), and some regions also have other co-official languages, such as Catalan, Galician, and Basque. Given the limited number of speakers to be recorded, the number of regions is small. The dialectal regions were defined taking into account phonetic differences among regions, and four groups were established:
| Region | Description |
|---|---|
| NORTH WEST | Galicia, Asturias |
| CENTER | Aragón, Cantabria, Castilla-La Mancha, Castilla y León, La Rioja, Madrid, Extremadura (North), País Vasco, Navarra |
| SOUTH | Andalucía, Canarias, Extremadura (South), Murcia |
| EAST | Cataluña, Valencia, Baleares |
The distribution of recorded sessions as a function of the accent region of the speakers is shown in the next table.
| Number | Name of accent/region | Number of speakers | Number of sessions | Number of sessions (%) |
|---|---|---|---|---|
| 1 | NORTHWEST | 53 | 106 | 17.6 |
| 2 | CENTER | 78 | 154 | 26.0 |
| 3 | SOUTH | 54 | 105 | 17.1 |
| 4 | EAST | 121 | 235 | 39.3 |
| TOTAL | | 306 | 600 | 100 |
The total number of different speakers is 306: 149 female and 157 male. The next table shows the number of sessions spoken by female and male speakers, broken down by age group.
| Age group | Male: Number | Male: Sessions | Female: Number | Female: Sessions | % of speakers | % of sessions |
|---|---|---|---|---|---|---|
| 18-30 | 84 | 165 | 76 | 150 | 52.0 | 52.1 |
| 31-45 | 41 | 80 | 39 | 75 | 26.1 | 26.1 |
| 46-60 | 30 | 59 | 35 | 69 | 21.6 | 21.5 |
| over 60 | 1 | 2 | 0 | 0 | 0.3 | 0.3 |
| TOTAL | 156 | 306 | 150 | 294 | 100 | 100 |
Two types of recordings compose the database. First, wideband recordings (60-7000 Hz) were made for systems that are installed and operate in the car itself; second, narrow-band recordings (300-3400 Hz) were made for systems that operate centrally outside the car and obtain their spoken input from the driver over the cellular telephone network. Two recording platforms were used.
Multi-channel recordings were performed simultaneously in the car and through the GSM network.
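As an illustration, the narrow-band channel can be approximated offline by band-pass filtering a wide-band recording down to the telephone band. This is only a sketch under assumed parameters (a 16 kHz source signal and a simple Butterworth design), not the actual GSM transmission chain:

```python
# Sketch: band-limiting a wide-band signal to the telephone band
# (300-3400 Hz). The sampling rate and filter order are illustrative
# assumptions, not values taken from the database documentation.
import numpy as np
from scipy.signal import butter, sosfilt

def to_narrowband(signal, fs=16000, low=300.0, high=3400.0, order=6):
    """Band-pass filter a signal to the narrow (telephone) band."""
    sos = butter(order, [low, high], btype="bandpass", fs=fs, output="sos")
    return sosfilt(sos, signal)

# Usage: one second of white noise, band-limited to 300-3400 Hz.
noise = np.random.default_rng(0).standard_normal(16000)
narrow = to_narrowband(noise)
```

Frequency content below 300 Hz and above 3400 Hz is strongly attenuated, mimicking the bandwidth loss of the telephone-network recordings relative to the in-car wideband channel.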
Seven environment conditions are defined:
In addition, some information was collected during the recordings:
The transcription included in this database is an orthographic, lexical transcription with a few details that represent audible acoustic events (speech and non-speech) present in the corresponding waveform files. The extra marks contained in the transcription aid in interpreting the text form of the utterance. Transcriptions were made in two passes: one pass in which the words were transcribed, and a second pass in which the additional details were added. Transcriptions are CASE INSENSITIVE.
Non-speech acoustic events have been arranged into 5 categories and transcribed. Events are transcribed only if they are clearly distinguishable; very low-level, non-intrusive events are ignored. Each event is transcribed at its place of occurrence, using the defined symbols in square brackets. For noise events that span one or more words, the transcription marks the beginning of the noise, just before the first word it affects.
The first two categories of acoustic events originate from the speaker, and the other three categories originate from another source. The 5 categories are:
[fil]: Filled pause. These sounds can well be modeled in a filled pause model in speech recognisers. Examples of filled pauses: uh, um, er, ah, mm.
[spk]: Speaker noise. All kinds of sounds and noises made by the calling speaker that are not part of the prompted text, e.g. lip smack, cough, grunt, throat clear, tongue click, loud breath, laugh, loud sigh.
[sta]: Stationary noise. This category contains background noise that is not intermittent and has a more or less stable amplitude spectrum. Examples: voice babble (cocktail-party noise), sirens, wind, rain, cobble stones.
[int]: Intermittent noise. This category contains noises of an intermittent nature. These noises typically occur only once (like a door slam), or have pauses between them (like phone ringing), or change their color over time (like music). Examples: music, background speech, baby crying, phone ringing, door slam, door bell, paper rustle, cross talk, ticks by the direction indicator.
[dit]: DTMF and prompt tone. This is in fact a special case of [int], but since this sound can be expected in nearly every speech file, a special symbol was defined.
Only signals from microphone 0 have been transcribed. All the signals contain the prompt beep.
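For illustration, the bracketed event markers described above can be separated from the word sequence with a few lines of code. The transcription line in the usage example is invented, not taken from the database:

```python
# Sketch: splitting one orthographic transcription line into its word
# sequence and its non-speech event markers ([fil], [spk], [sta],
# [int], [dit]).
import re

EVENTS = {"fil", "spk", "sta", "int", "dit"}

def split_transcription(line):
    """Return (words, events) for one transcription line."""
    events = [m for m in re.findall(r"\[(\w+)\]", line) if m in EVENTS]
    words = re.sub(r"\[\w+\]", " ", line).split()
    return words, events

# Usage with an invented example line:
words, events = split_transcription("[fil] quiero llamar a [spk] casa [int]")
# words  -> ['quiero', 'llamar', 'a', 'casa']
# events -> ['fil', 'spk', 'int']
```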
The database includes a lexicon. The lexicon file is an alphabetically ordered list of distinct lexical items (essentially words, in our case) which occur in the corpus, with the corresponding pronunciation information. Each distinct word has a separate entry. Since the lexicon is derived from the corpus, it uses the same alphabetic encoding for special and accented characters as the transcriptions (ISO-8859). A frequency count is included for each entry, e.g. to help flag rare words whose transcriptions are perhaps less important or reliable.
The pronunciation lexicon was produced after the transcription phase; it contains, alphabetically sorted, all words found in the transcriptions (one entry per word), together with their number of occurrences and the list of their phonemic representations. The words appear in the lexicon exactly as they appear in the transcriptions. All component words were identified and alphabetically sorted; all fragments, mispronunciations, and non-speech events were removed, and only one occurrence of each word was retained.
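A minimal reader for a lexicon of this kind might look as follows. The tab-separated column layout assumed here (word, frequency, one or more SAMPA pronunciations) is an illustration, not the documented file format, so adapt it to the actual file:

```python
# Sketch: parsing lexicon lines of the assumed form
#   word<TAB>frequency<TAB>pronunciation[<TAB>pronunciation...]
# The sample entries below are invented for illustration.
def parse_lexicon(lines):
    """Map each word to its frequency and list of SAMPA pronunciations."""
    lexicon = {}
    for line in lines:
        word, freq, *prons = line.rstrip("\n").split("\t")
        lexicon[word] = {"freq": int(freq), "prons": prons}
    return lexicon

# Usage with two invented entries:
sample = ["casa\t42\tk a s a", "coche\t17\tk o tS e"]
lex = parse_lexicon(sample)
# lex["casa"]["freq"] -> 42
```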
A software tool developed at UPC (SAGA: Spanish Automatic Graphemes to Allophones Transcriber) was used to translate the transcribed words into phonemic strings in the SAMPA phonemic notation. The complete lexicon was manually supervised.
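To give a flavour of rule-based grapheme-to-phoneme conversion of the kind SAGA performs, here is a toy sketch with a handful of Spanish-to-SAMPA rules. It is not SAGA itself: the rule set is a small invented subset, and the real tool handles far more context and produces allophonic detail:

```python
# Toy illustration of longest-match, rule-based grapheme-to-phoneme
# conversion for Spanish using SAMPA symbols. The rule list is a tiny
# invented subset, ordered so multi-letter graphemes match first.
RULES = [
    ("ch", "tS"), ("ll", "L"), ("rr", "rr"), ("qu", "k"),
    ("ce", "T e"), ("ci", "T i"), ("c", "k"),
    ("ñ", "J"), ("z", "T"), ("j", "x"), ("v", "b"), ("h", ""),
]

def toy_g2p(word):
    """Convert a lowercase Spanish word to a space-separated SAMPA string."""
    word = word.lower()
    out = []
    i = 0
    while i < len(word):
        for graph, phon in RULES:
            if word.startswith(graph, i):
                if phon:                 # "h" maps to silence
                    out.append(phon)
                i += len(graph)
                break
        else:                            # no rule matched: copy the letter
            out.append(word[i])
            i += 1
    return " ".join(out)

# toy_g2p("coche") -> "k o tS e"
# toy_g2p("llave") -> "L a b e"
```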
This database is commercially available.