One of the hurdles in learning Shanghainese, or for that matter any dialect of Wu, is the lack of easily accessible data. There are services that provide phonetic transcription based on character input, but often the results given are some proprietary form of pinyin. For more phonetically accurate results, i.e. IPA, there are books that provide that information, though often for only a limited number of characters.
In an effort to fix that, I’ve compiled a list of characters with their corresponding IPA pronunciation. It’s a tab-delimited text file in UTF-8 encoding. The original file is loosely based on a similar list of about 450 entries provided by Tatoeba.org.
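As a rough illustration of working with a tab-delimited UTF-8 file like this one, here is a minimal Python sketch. It assumes the character sits in the first column and its IPA reading in the second, which is a guess about the layout rather than a description of the actual file. The function name `load_ipa_table` and the sample rows are hypothetical placeholders, not real entries from the data set.

```python
import csv
import io


def load_ipa_table(lines):
    """Build a dict mapping each character to a list of its readings.

    ASSUMPTION: column 1 is the character, column 2 is the IPA reading.
    A character listed on several rows keeps all of its readings, in order.
    """
    table = {}
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip blank or incomplete rows instead of failing on them.
        if len(row) < 2 or not row[0].strip():
            continue
        char, ipa = row[0].strip(), row[1].strip()
        table.setdefault(char, []).append(ipa)
    return table


# Placeholder rows only: these are NOT real pronunciations.
sample = io.StringIO("甲\tIPA_A\n乙\tIPA_B\n甲\tIPA_C\n")
table = load_ipa_table(sample)
print(table)
```

For a file on disk, the same function could be fed `open(path, encoding="utf-8", newline="")`, where `path` is wherever the file has been saved.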
The data set covers the most commonly used characters for writing Wu, as well as a number of other characters to cover things like family names and Wu-specific 语气词 (modal particles). It started as a list of just over 450 entries, quickly expanded to 1400, then to just over 5300, and now stands at over 7520. More entries are continually being added.
Usage
Who uses this? For starters, this data set has been integrated into Tatoeba.org, both for its Shanghainese to IPA tool and for its general Shanghainese sentences. Sentences entered on the site using characters will be converted as below.
ɦi⁵³ ɦɑ̃⁵³ ʦɤ lɛ⁵³ gəˀ¹²
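A character-by-character conversion of this kind can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Tatoeba’s actual implementation: it takes the first listed reading for each character, ignores context-dependent readings and tone sandhi, and marks characters missing from the table with a placeholder. The name `to_ipa` and the table contents are hypothetical.

```python
def to_ipa(sentence, table, unknown="?"):
    """Convert a sentence to space-separated readings, one per character.

    ASSUMPTION: `table` maps a character to a list of readings, and the
    first reading is used. Characters not in the table become `unknown`.
    """
    out = []
    for ch in sentence:
        if ch.isspace():
            continue  # whitespace carries no reading
        readings = table.get(ch)
        out.append(readings[0] if readings else unknown)
    return " ".join(out)


# Placeholder table: these are NOT real pronunciations.
demo_table = {"甲": ["A1", "A2"], "乙": ["B1"]}
print(to_ipa("甲乙 丙", demo_table))  # prints: A1 B1 ?
```

A real converter would need some way to pick among multiple readings for the same character, which this sketch deliberately leaves out.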
It will also be included as part of the upcoming release of the Eclectus dictionary, created by Christoph Burgmer, and the related cjklib. Expect to see the data appear elsewhere in the near future.
If you’re interested in using the data for your project, send me an email at kellenparker在sinoglot.com explaining what the project is and how you plan to use the data.
The only thing I ask is that you credit me in some way for the many, many hours I’ve put into collecting the data. I’m releasing this under the Creative Commons CC-BY license.
I’m looking in a few different directions as to what else to do to improve the data. I don’t want to get too much into it just yet, but keep an eye on this space for updates.
Thanks
Thanks to Allan Simon of Tatoeba.org for providing me with an initial set of 450+ entries and for allowing me to contribute to Tatoeba’s data set. Thanks also to Christoph Burgmer for helping work out some kinks and for being willing to include the data in Eclectus.