We are happy to announce that 2 new Written Corpora and 4 new Speech resources are now available in our catalogue.
ELRA-W0126 Training and test data for Arabizi detection and transliteration
The dataset is composed of two parts: a collection of mixed English and Arabizi text, intended to train and test a system for the automatic detection of code-switching in such texts; and a set of 3,452 Arabizi tokens manually transliterated into Arabic, intended to train and test a system that performs Arabizi-to-Arabic transliteration.
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-W0126/
ELRA-W0127 Normalized Arabic Fragments for Inestimable Stemming (NAFIS)
This is an Arabic stemming gold standard corpus composed of a collection of 37 sentences, selected to be representative of Arabic stemming tasks and manually annotated. The sentences are drawn from various sources (poems, the Holy Quran, books, and periodicals) and cover diverse genres (proverbs and dictums, article commentary, religious text, literature, historical fiction). NAFIS is represented according to the TEI standard.
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-W0127/
ELRA-S0396 Mbochi speech corpus
This corpus consists of 5131 sentences recorded in Mbochi, together with their transcription and French translation, as well as the results of the work carried out during the JSALT workshop: alignments at the phonetic level and various results of unsupervised word segmentation from audio. The audio corpus amounts to 4.5 hours, downsampled to 16 kHz, 16 bits, with Linear PCM encoding. The data is split into 2 parts: one for training, consisting of 4617 sentences, and one for development, consisting of 514 sentences.
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-S0396/
ELRA-S0397 Chinese Mandarin (South) database
This database contains the recordings of 1000 Chinese Mandarin speakers from Southern China (500 males and 500 females), aged 18 to 60, recorded in quiet studios. Recordings were made through microphone headsets and consist of 341 hours of audio data (about 30 minutes per speaker), stored in .WAV files as sequences of 48 kHz Mono, 16 bits, Linear PCM.
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-S0397/
ELRA-S0398 Chinese Mandarin (North) database
This database contains the recordings of 500 Chinese Mandarin speakers from Northern China (250 males and 250 females), aged 18 to 60, recorded in quiet studios. Recordings were made through microphone headsets and consist of 172 hours of audio data (about 30 minutes per speaker), stored in .WAV files as sequences of 48 kHz Mono, 16 bits, Linear PCM.
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-S0398/
ELRA-S0401 Persian Audio Dictionary
This dictionary consists of more than 50,000 entries (along with almost all wordforms and proper names) with corresponding audio files in MP3 format and English transliterations. All words were recorded by a single speaker with standard Persian (Farsi) pronunciation. The dictionary is provided with its software.
For more information, see: http://catalog.elra.info/en-us/repository/browse/ELRA-S0401/