Skip to content

Latest commit

Β 

History

History
40 lines (25 loc) Β· 1.72 KB

File metadata and controls

40 lines (25 loc) Β· 1.72 KB

How to generate exercises

  1. First you'll have to download the Tatoeba dataset. Run

    ./download-data.sh

    You will find the data under data-tatoeba/ inside this folder.

  2. Download any required sentence pairs you are interested in and place it in data-tatoeba/.
    As this sentence pairs are always created on the fly, this step has to happen manually.
    Sentence pair data should be named sentences_{SRC_LANG}_{TGT_LANG}.tsv, where SRC_LANG and TGT_LANG are lowercased two-letter language codes. For example sentences_uk_de.tsv.

  3. Run the ./generate_data.ipynb notebook to process raw data into TSVs. Change input variables as needed.

  4. Run the ./import_sql.ipynb notebook to import the generated CSVs into a local SQLite database.

  5. Run the ./generate_exercise_precursors.ipynb notebook to generate exercise precursors.

  6. Run the ./similar_words.ipynb notebook to generate similar words.

  7. Optionally do a manual quality control check over the generated data.

  8. To populate the exercise table which is the main entity in the API server, run

    python3 populate_exercise_table.py

    The script expects the following two files to exist (they have to be moved to that folder manually):

    1. data-import/exercise-import.tsv: which is the output of the ./generate_exercise_precursors.ipynb notebook
    2. data-import/similar-words-import.tsv: which is the output of the ./similar_words.ipynb notebook
  9. Optionally run the ./generate_sentence_audio notebook to generate audio files.

After all those above steps. You should have a ready-to-use tasbkpool.db SQLite DB in the parent folder which will be used by the API server.