Linguistic Data Consortium

05/21/2026

CALLHOME German Lexicon Second Edition was developed by LDC and contains 318,809 German words with morphological, phonological, stress and frequency information. This second edition updates file formats, directory structure and documentation. The first edition is available as CALLHOME German Lexicon (LDC97L18). The words in the lexicon were derived from the CELEX German lexicon (CELEX2 (LDC96L14)) and from 100 transcripts representing unscripted telephone conversations between native German speakers contained in CALLHOME German Second Edition LDC2026S04. This release also includes a pronunciation dictionary derived from the lexicon in CMUdict format. https://catalog.ldc.upenn.edu/LDC2026L04

05/20/2026

CALLHOME German Second Edition was developed by LDC and contains 48 hours of speech from 100 telephone conversations between native German speakers. It is a re-release of the original CALLHOME German collection, combining CALLHOME German Speech LDC97S43 and CALLHOME German Transcripts LDC97T15 with additional transcription and updated directory structure, file formats, and documentation. Participants spoke on topics of their choice in a single telephone call lasting up to 30 minutes. Calls were manually audited for language, recording quality, channel characteristics, dialect, and region. For this second edition, all audio was converted from SPHERE files to FLAC format, and the original training/development partitioning was removed.

In addition to the original transcripts published in CALLHOME German Transcripts, this release has updated transcripts addressing normalization of annotation formats, standardization of speaker-produced and background noises, application of foreign-language marking, whitespace cleanup, and corrections and consistency fixes. https://catalog.ldc.upenn.edu/LDC2026S06

05/19/2026

MADCAT Phases 1-3 Composite Evaluation Set contains the evaluation data created by LDC for Phases 1-3 of the DARPA MADCAT program and the NIST OpenHaRT 2010 and 2013 evaluations. It consists of handwritten Arabic documents scanned at high resolution and annotated for the physical coordinates of each line and token, together with digital transcripts and English translations. Content and annotation layers are integrated in a single MADCAT XML output, for a total of 1,643 images and corresponding annotation files. Source documents were web text and newswire collected by LDC. Arabic-speaking scribes copied the documents by hand, following specific instructions on writing style, writing implement and paper. Each page was then scanned and the images annotated.

The goal of the MADCAT program was to automatically convert foreign language text images into English transcripts for use by humans and downstream processes, including summarization and information extraction. The core evaluation task in MADCAT was the translation of handwritten Arabic documents. https://catalog.ldc.upenn.edu/LDC2026T05

04/20/2026

MATERIAL Tagalog-English Language Pack was developed by Appen for the IARPA MATERIAL program and contains 100 hours of Tagalog conversational telephone speech, transcripts, English translations, annotations and queries. Calls were made using different telephones from a variety of environments. Transcripts cover 30% of the speech files, 2% of which were translated into English. This release also includes domain annotations, English queries and their relevance annotations. The MATERIAL program focused on underserved languages with the ultimate goal of building cross-language information retrieval systems to find speech and text content using English search queries. https://catalog.ldc.upenn.edu/LDC2026S05

04/16/2026

Check out our April newsletter for LDC’s latest publications – DEFT Chinese and English Light and Rich ERE Parallel Annotation, MATERIAL Tagalog-English Language Pack and LORELEI Somali Representative Language Pack http://ldc-upenn.blogspot.com/

Contact the organization

Telephone

+12158980464

Website

http://www.ldc.upenn.edu/

Address

3600 Market Street, Ste 810
Philadelphia, PA
19104
