- First you'll have to download the Tatoeba dataset. Run `./download-data.sh`. You will find the data under `data-tatoeba/` inside this folder.
- Download any sentence pairs you are interested in and place them in `data-tatoeba/`. As these sentence pairs are always created on the fly, this step has to happen manually. Sentence pair data should be named `sentences_{SRC_LANG}_{TGT_LANG}.tsv`, where `SRC_LANG` and `TGT_LANG` are lowercased two-letter language codes, for example `sentences_uk_de.tsv`.
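The naming convention can be sketched as a small helper. This function is purely illustrative (it is not part of the repository); it builds and validates a file name under the convention described above.

```python
import re

# Matches the convention: sentences_{SRC_LANG}_{TGT_LANG}.tsv with
# lowercased two-letter language codes, e.g. sentences_uk_de.tsv.
PAIR_PATTERN = re.compile(r"^sentences_([a-z]{2})_([a-z]{2})\.tsv$")

def sentence_pair_filename(src_lang: str, tgt_lang: str) -> str:
    """Build a sentence-pair file name from two language codes."""
    name = f"sentences_{src_lang.lower()}_{tgt_lang.lower()}.tsv"
    if not PAIR_PATTERN.match(name):
        raise ValueError(f"expected two-letter codes, got {src_lang!r}, {tgt_lang!r}")
    return name

print(sentence_pair_filename("UK", "de"))  # sentences_uk_de.tsv
```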
- Run the `./generate_data.ipynb` notebook to process raw data into TSVs. Change input variables as needed.
- Run the `./import_sql.ipynb` notebook to import the generated TSVs into a local SQLite database.
- Run the `./generate_exercise_precursors.ipynb` notebook to generate exercise precursors.
- Run the `./similar_words.ipynb` notebook to generate similar words.
- Optionally do a manual quality-control check over the generated data.
- To populate the exercise table, which is the main entity in the API server, run `python3 populate_exercise_table.py`. The script expects the following two files to exist (they have to be moved to that folder manually):
  - `data-import/exercise-import.tsv`: the output of the `./generate_exercise_precursors.ipynb` notebook
  - `data-import/similar-words-import.tsv`: the output of the `./similar_words.ipynb` notebook
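Since both inputs have to be moved into place by hand, a quick preflight check can save a failed run. This is a hypothetical helper, not part of the repository; it only reports which of the two expected files are missing.

```python
from pathlib import Path

# The two inputs populate_exercise_table.py expects, per the list above.
REQUIRED = [
    Path("data-import/exercise-import.tsv"),
    Path("data-import/similar-words-import.tsv"),
]

def missing_inputs(required=REQUIRED) -> list[str]:
    """Return the paths from `required` that do not exist as files."""
    return [str(p) for p in required if not p.is_file()]

missing = missing_inputs()
if missing:
    print("move these files into place first:", missing)
```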
- Optionally run the `./generate_sentence_audio` notebook to generate audio files.
After all of the above steps, you should have a ready-to-use `taskpool.db` SQLite DB in the parent folder, which will be used by the API server.