Linguistic Data Consortium
CALLHOME German Lexicon Second Edition was developed by LDC and contains 318,809 German words with morphological, phonological, stress and frequency information. This second edition updates file formats, directory structure and documentation. The first edition is available as CALLHOME German Lexicon (LDC97L18). The words in the lexicon were derived from the CELEX German lexicon (CELEX2 (LDC96L14)) and from 100 transcripts representing unscripted telephone conversations between native German speakers contained in CALLHOME German Second Edition LDC2026S04. This release also includes a pronunciation dictionary derived from the lexicon in CMUdict format. https://catalog.ldc.upenn.edu/LDC2026L04
CALLHOME German Second Edition was developed by LDC and contains 48 hours of speech from 100 telephone conversations between native German speakers. It is a re-release of the original CALLHOME German collection, combining CALLHOME German Speech LDC97S43 and CALLHOME German Transcripts LDC97T15 with additional transcription and updated directory structure, file formats, and documentation. Participants spoke on topics of their choice in a single telephone call lasting up to 30 minutes. Calls were manually audited for language, recording quality, channel characteristics, dialect, and region. For this second edition, all audio was converted from SPHERE files to FLAC format, and the original training/development partitioning was removed.
In addition to the original transcripts published in CALLHOME German Transcripts, this release has updated transcripts addressing normalization of annotation formats, standardization of speaker-produced and background noises, application of foreign-language marking, whitespace cleanup, and corrections and consistency fixes. https://catalog.ldc.upenn.edu/LDC2026S06
MADCAT Phases 1-3 Composite Evaluation Set contains the evaluation data created by LDC for Phases 1-3 of the DARPA MADCAT program and the NIST OpenHaRT 2010 and 2013 evaluations. It consists of handwritten Arabic documents scanned at high resolution and annotated for the physical coordinates of each line and token, digital transcripts, and English translations with content and annotation layers integrated in a single MADCAT XML output, for a total of 1,643 images and corresponding annotation files. Source documents were web text and newswire collected by LDC. Arabic-speaking scribes copied documents by hand, following specific instructions as to the writing style, writing implement and paper. Each page was then scanned and the resulting images were annotated.
The goal of the MADCAT program was to automatically convert foreign-language text images into English transcripts for use by humans and downstream processes, including summarization and information extraction. The core evaluation task in MADCAT was the translation of handwritten Arabic documents. https://catalog.ldc.upenn.edu/LDC2026T05
MATERIAL Tagalog-English Language Pack was developed by Appen for the IARPA MATERIAL program and contains 100 hours of Tagalog conversational telephone speech, transcripts, English translations, annotations and queries. Calls were made using different telephones from a variety of environments. Transcripts cover 30% of the speech files, 2% of which were translated into English. This release also includes domain annotations, English queries and their relevance annotations. The MATERIAL program focused on underserved languages with the ultimate goal of building cross-language information retrieval systems to find speech and text content using English search queries. https://catalog.ldc.upenn.edu/LDC2026S05
Check out our April newsletter for LDC’s latest publications – DEFT Chinese and English Light and Rich ERE Parallel Annotation, MATERIAL Tagalog-English Language Pack and LORELEI Somali Representative Language Pack http://ldc-upenn.blogspot.com/
Address
3600 Market Street, Ste 810
Philadelphia, PA
19104