These flashcards contain Chinese/English sentences and pinyin.
It's been said that it's best to study a language using sentence flashcards,
instead of individual words, but it can be difficult to find good sets of
entire sentences with translations and pronunciations... so I built this set.
The pinyin may have some errors. It was generated programmatically by searching
the CEDICT dictionary.
These sample sentences are broken up into two files:
zh-en_sentences.xml:
I find this one more useful. The questions in this file are Chinese
sentences. The answers are pinyin pronunciations and English
translations.
en-zh_sentences.xml:
This is more for practicing writing. The questions contain English and
pinyin, and the answer is the Chinese expression.
Each file is broken up into a number of categories labeled as described here:
HSK level:
All of the sentences came from sample sentences intended to describe a
particular word. HSK level (in the category name) signifies the HSK
level of the word this sentence describes. Note that "HSK level" is
1-4, ... I have no idea how that correlates to actual HSK scores, but
since HSK scores range from 1-11, I know they are not equivalent.
Source of words and HSK "level":
http://www.chinese-forums.com/vocabulary/
Limited to:
Sentences are then broken up further into 5 categories based on the HSK
level of the words those sentences contain.
This is a search of all characters in each level, including the
characters that longer words are composed of. This is why even HSK
level 4 sentences can appear in "limited 1."
For example, 作主 (zuo4zhu3) is an HSK level 4 word. It contains 2
characters which both appear in other HSK level 1 words, and so the
sample sentence for 作主 (assuming that sentence contains no other
difficult words) might appear in the category "HSK 4; limited 1;"
Since some characters are not found in any of the HSK level sets, there
are categories containing "limited 5."
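The "limited" rule described above can be sketched in a few lines of Python. This is not the code used to build the files, only a minimal illustration. The names CHAR_LEVEL, UNLISTED, is_chinese and limited_level are made up for this example, and the tiny character table is placeholder data rather than the real HSK lists.

```python
# Illustrative data only: lowest HSK level of any word containing each
# character. The real table would be built from the HSK word lists.
CHAR_LEVEL = {"我": 1, "作": 1, "主": 1, "鳄": 4}

UNLISTED = 5  # characters found in no HSK level set (assumed to map to "limited 5")


def is_chinese(ch):
    """True for characters in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def limited_level(sentence, char_level=CHAR_LEVEL):
    """Return the hardest (highest) character level in the sentence."""
    levels = [char_level.get(ch, UNLISTED) for ch in sentence if is_chinese(ch)]
    # A sentence with no Chinese characters is treated as level 1 here.
    return max(levels, default=1)
```

With this table, a sentence built only from 我, 作 and 主 would land in "limited 1" even though 作主 itself is an HSK level 4 word, and any character missing from the table pushes the sentence into "limited 5."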
Part number:
Within each HSK level there are many sentences. I've divided them up
into parts so that the maximum size would be somewhere around 500
sentences.
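The splitting into parts, together with the length sort described under this heading, could be reproduced roughly as follows. This is a sketch under assumptions, not the original generator: split_into_parts and max_part_size are hypothetical names, and simple fixed-size chunks are used here, which may differ from how the actual part boundaries were chosen.

```python
def split_into_parts(sentences, max_part_size=500):
    """Sort sentences by length, then cut them into consecutive parts.

    Shorter sentences end up in the earlier parts. The fixed chunk size is
    an assumption; the original split only aimed for roughly 500 per part.
    """
    ordered = sorted(sentences, key=len)  # stable sort, shortest first
    return [
        ordered[start:start + max_part_size]
        for start in range(0, len(ordered), max_part_size)
    ]
```

The first list returned corresponds to "part 1", the second to "part 2", and so on.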
Before doing so, I sorted by length of the sentence. This means that
sentences in categories labeled "part 1" will be shorter (and
presumably grammatically simpler) than sentences in categories labeled
"part 4."
The sentences in this collection are the example sentences on dict.cn. I
couldn't find specific licensing information associated with the example
sentences, so, if there's a problem, someone let me know and I'll gladly take
it down. I'm under the impression that dict.cn was built with a free
share-and-share-alike corpus... it being web-based and all anyways.
The sentences are freely available on the internet anyways, so, if I do have to
take this down, I'll gladly share the code I used to generate the lists.
contributed by:
Brian Vaughan
http://brianvaughan.net/
nairbv AT yahoo DOT com
Import speedup in 2.x
BTW, in the current 2.0 codebase, the import of large files like these is much sped up, and no longer gives the impression that nothing happens.
Trad. characters
Is there a way to convert these files to traditional characters? Or perhaps a similar list elsewhere?
Import failure on Windows
Neither of the 2 XML files will import. The program simply stops responding. I'm using Windows XP.
Import failure on Windows
The files are pretty big. It took a few minutes to load the data on my computer. I could imagine that on some computers you might run out of RAM.
You could try opening the XML file in a text editor and removing a few categories of sentences, making the XML files smaller and more manageable, or creating a new XML file by copy/pasting all of the sentences for a single category.
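If editing by hand is too tedious, the same trimming could probably be scripted. The sketch below assumes the file has a single root element whose children are category elements (each with a name child) and item elements (each with a cat child naming its category). Those element names are an assumption about the export format, so check them against your copy of the file first. keep_categories is a made-up helper name.

```python
import xml.etree.ElementTree as ET


def keep_categories(xml_text, wanted):
    """Return XML text containing only the categories named in `wanted`.

    Assumes <category><name>...</name></category> and
    <item><cat>...</cat>...</item> children under the root element.
    """
    root = ET.fromstring(xml_text)
    for child in list(root):  # copy the list, since we remove while looping
        if child.tag == "item":
            name = child.findtext("cat")
        elif child.tag == "category":
            name = child.findtext("name")
        else:
            continue  # leave anything unrecognised untouched
        if name not in wanted:
            root.remove(child)
    return ET.tostring(root, encoding="unicode")
```

Read the original file into a string, pass it in along with a set of the category names you want to keep, and write the result to a new, smaller file to import.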
import failure on OS X...
When I try to import the xml file, nothing happens when I click OK, and the following message appears in my Console:
IOError: zipimport: can not open file /Applications/Mnemosyne.app/Contents/Resources/lib/python2.4/site-packages.zip
Does anyone know how to solve this? Thanks!
import failure on OS X...
Looks like this might also have been related to the size of the file?
http://www.mail-archive.com/mnemosyne-proj-users@googlegroups.com/msg00798.html
pinyin errors
Brian has done a great job making this available, and he cautions there may be errors in the pinyin. There are some that could be problematic particularly for novice learners. It's a good idea to check anything suspicious or new. For example:
In HSK Level 1 limited 1 Part 1 the wrong pinyin variant pronunciation is given for these characters:
喝 = he1 (to drink) and he4 (to shout loudly)
吃 = chi1 (to eat etc.) and ji2 (to stammer – in MDBG Chinese-English online dictionary but absent from some other dictionaries which give 口吃 for 'to stammer')
会 = hui4 (to be able etc.) and kuai4 (to balance an account, accounting).
E.g. in 这星期他会很忙。 会 should be hui4 and not kuai1. The same goes for 你会习惯的。 In 他不喜欢吃鱼 and 我很想吃饺子。 吃 should be ‘chi1’ and not ‘ji2’ (you eat fish, not stammer it!). And in 我们喝了一些鸡汤。 you he1 (drink) soup, you don’t he4 (shout loudly) it! I’ve checked this with the MDBG and other online dictionaries, and with a Chinese colleague.
pinyin errors
yeah, it's quite annoying. The problem occurred mostly with simple characters I already knew well, and once I started I didn't want to overwrite my learning data.
It can be a tricky problem since many characters have multiple pronunciations, but most of them are common characters so it's not so difficult to manually code in the exceptions. I've written a slightly better pinyin converter, but haven't had time to regenerate the XML files yet. I'll try to get back to it soon.
I was hoping a newer version of Mnemosyne would use SQLite, then I could just write a SQL script to migrate the pinyin without affecting existing learning data.
Thanks, when I get around to fixing it I'll be sure to include the errors you pointed out here.
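For illustration, manually coded exceptions of the kind mentioned above might look like the sketch below. It is not the actual converter: ALL_READINGS stands in for whatever a plain dictionary lookup returns (with the unwanted reading listed first to mimic the bug), and PREFERRED, char_pinyin and sentence_pinyin are hypothetical names. The overrides use the readings reported in this thread.

```python
# Stand-in for a naive dictionary lookup: every reading of each character,
# in whatever order the dictionary happens to list them (illustrative).
ALL_READINGS = {
    "喝": ["he4", "he1"],
    "吃": ["ji2", "chi1"],
    "会": ["kuai4", "hui4"],
}

# Hand-coded exceptions: the reading to use for common characters.
PREFERRED = {"喝": "he1", "吃": "chi1", "会": "hui4"}


def char_pinyin(ch):
    """Return the preferred reading, else the first listed, else ch itself."""
    if ch in PREFERRED:
        return PREFERRED[ch]
    readings = ALL_READINGS.get(ch)
    return readings[0] if readings else ch


def sentence_pinyin(sentence):
    """Convert a sentence character by character, separated by spaces."""
    return " ".join(char_pinyin(ch) for ch in sentence)
```

A per-character override like this always picks the same reading, so it still gets words such as 会计 (kuai4ji4) wrong. Looking up whole words before falling back to single characters would handle those cases better.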
HSK Levels
The vocabulary levels (1-4) don't correlate to the HSK scores (1-11), but they do serve to let you know what to expect at different exams.
HSK1 is the vocab used for the Basic test (scores 1-3)
HSK1-3 is the vocab used for the Elementary-Intermediate test (scores 3-8)
HSK1-4 is the vocab used for the Advanced test (scores 9-11).
Of course, these don't cover absolutely everything, and are slightly out of date. But they can help you know what to expect in terms of vocabulary.