How to Use Training Data Generator to Automate Generating Training Data
Introduction
Training Data Generator is a feature that enables you to automatically generate training data up to 100 data using sentence samples, or we call it data samples.
Before creating a data sample, you will need to define the keyword and synonyms inside the “word bank” feature. Why is it required? We do permutations from data samples and keyword variants to generate the training data. Let’s try a case:
- You create a word bank from Kenapa which consists of variant words or synonyms from Kenapa. For example: knp, ngap, ngaps, kenaps. So, you have 4 word variants.
- Next, you create a data sample. For example: @Kenapa ATM saya bermasalah ya.
- We will create a permutation or combination from a data sample with variant words. For example, expected generated data will be like this:
knp ATM saya bermasalah
ngap ATM saya bermasalah
ATM saya bermasalah ngaps
kenaps ATM saya bermasalah
- And many more
- Also, we have a feature that is bulk tagging to tag all generated training data with your available entities and labels. It enables faster tagging rather than manually annotating.
This guidance will elaborate on how to use training data generators and do bulk tagging after generated data shows up. Let’s get started.
Create Entity and Labels
Before you start, you will need to create at least one entity.
- To create an entity, go to the NLU > Models menu. Then, click the Entity menu.
- Click Create Entity to set up a new entity. It will show a form on the right screen. Fill in the entity name.
Here is the explanation of each field
- Inherit lets you copy an entity from any public NLU on the Kata Platform. To inherit an NLU, simply type the NLU ID with
[username]:[project name]
. Inherited entities will sync to the original entity, including the labels and training data. For example, in the getting started chatbot section, your NLU ID format ismuhfadhiilkata:test_simple_bot
. - Type is intended for the entity type. In Kata Platform, we have 3 entity types which are
- Trait is a text classifier that classifies a sentence into particular labels. This type is suitable for making your bot recognize nuance in a sentence.
- Dictionary is a word tagger which has keys and labels. It will form an array.
- The Phrase is a word tagger.
-
The Profile is an AI model that is available to use by users. You can choose any profiles you want depending on data training and the dataset you have.
- The “default” profile is suitable for a large number of data samples so that this profile is available in all entity types.
- “intent” profile is suitable for entity type “trait”.
- “default_v2” profile is our newest profile and is suitable if you have a large amount of data.
-
Root enables you to create an entity using other users' NLU to copy the training data and add new labels. Therefore, it is suitable if you have a trained NLU and want to add more labels in the NLU. To use this, you can type an NLU ID which consists of username and project with format
[username]:[project name]
. -
Labels is a feature to determine classes or categories from an entity. To add a new label, you can type and press “enter” on your keyboard.
-
If you choose entity type “dictionary”, it will show a dropdown name Belongs to. This feature enables you to define relationships () between entities. Belongs to feature can only be used for entities in the same NLU with entity type “phrase” or “dictionary”. It is suitable if your data sample is like this:
Saya mau beli es kopi susu 2 dan less sugar
Es kopi susu
is the main menu. 2 and less sugar
are modifiers to detailing the order. So in Belongs To concept, you will tag 2 and choose belongsTo
.
For this guide, you will create 1 entity type “trait” to extract sentences, 1 entity type “phrase” and 1 entity type “dictionary” to extract words.
- Click the Create Entity button.
- Then, fill in as follows for the first entity:
Name: intent
Type: Trait
Label: statement
Here is the screenshot:
- Click the Create Entity button to create the following entity.
- Fill in as follows for the second entity:
Name: object
Type: phrase
Label: person
Here is the screenshot:
- Fill in as follows for the last entity:
Name: things
Type: dictionary
Key: car
Labels: suv, van, lcgc
Here is the screenshot:
- Click Create to finalize.
The next step is to create a word bank when you’re done.
Create a Word Bank
In this part, you will create synonyms or variants from the keyword. Here is the guidance for adding variants:
- You can add 2 words as 1 variant by using space. Maximum 20 characters.
- You must create at least 2 variants to create a word bank with a maximum of 20 characters for each variant.
- You can remove a variant by clicking the “x” button next to the word.
Step by step to create a word bank:
- Click the Word Bank tab, then click the Create button. It will show a dialog.
- Fill in the name for your word bank. For example, you can fill in Kenapa.
- Then, create variants by typing a synonym from Kenapa word. For example:
ngaps
ngapa
kenp
knp
kenopo
- Press “Enter” on your keyboard.
- Finally, you have variants. Then, click the Create button to create a word bank.
Create a Data Sample
Once you’ve created a word bank, now you’re ready to make a data sample. Data sample is a sample sentence that consists of a word bank and several words as a base to generate training data or called a “pattern”. Here are the steps:
- Go to NLU > Training.
- Click on Bulk Training dropdown, then choose Data Generator.
- You will see a row to input a data sample. For example, to add your available word bank, type “@“ to show the word bank list.
- Input a data sample, for example:
@Kenapa kemarin Sinta tidak sekolah
- Press “Add Row” to submit.
- You can edit or delete your data sample after it has been created.
On the data sample page, there is information:
- Total data possibilities mean total generated data that can be created from available data samples. So for this example, you have 2 data samples multiplied with 1 word bank, then multiply again with 5 word variants. So finally, there will be 10 data possibilities.
- Data to Generate is several generated data that you wanted. The number must be below or equal to the total data possibilities.
Important note:
Make sure you put a space before the next word or after the previous word. Example: @Kenapa Sinta tidak sekolah?
or Aku lagi @Kenapa ya hari ini
Generate Training Data
This step will guide you to generate training data from created data samples. Generated training data might not match your expectations because we are duplicating words. However, you can prepare the training data and do bulk training instead if you want specific training data. Learn more about bulk training.
- Before you start, ensure your data samples are created. The generate button will enable if data samples are available and data to generate is equal to or below the total data possibilities.
- Let’s put 10 data to generate.
- Click the Generate button on the top right of your screen.
- It redirects you to the Generated Data page, where you can train using the bulk labeling feature.
- You will see a training data list as follows.
- You can delete generated data by clicking on the remove button in the list.Bulk labeling
You’re going to train generated data by using bulk labeling. Bulk labeling is a feature to tag entity type “trait”, “phrase” or “dictionary”, by doing it at once.
Let’s start to tag an entity type “trait” that you’ve created before.
- In the Bulk Labeling section, select the entity name intent. This entity type is “trait”.
- Next, it will show available labels in the entity. Because you only create 1 label, then select the statement label.
Important note: Bulk labeling for entity type “trait” will tag all generated data.
- Click the button Add more entity. Then, select the entity name object.
- Next, select the person label. Then, it will show a text area named Words to Tag. Words to Tag is a feature to search words you want to tag. For example, you will tag
sinta
word. - Press “Enter” on your keyboard to submit the word. It will show like this.
Important note: Bulk labeling for entity type “phrase” and “dictionary” will tag words only.
- Then, add the entity “dictionary” type. Click Add more Entity to start. Then, select entity name things. Then, it will show a text area named Words to Tag. Words to Tag is a feature to search words you want to tag. For example, you will tag a
sekolah
word.
Important note: Bulk labeling for entity type “dictionary” will not show specific labels to tag. So, it will train the entity.
- Finally, you can click the button Train. It will train all generated data into entity
intent
withstatement
label and entityobject
withperson
label. - When successfully trained, you will redirect to the Training page. On this page, you can see the generated data has been trained.
- You can click the arrow on the right of the box to open the trained data detail.
Limitations
There are several important notes when using a data generator, such as:
- You can create unlimited data samples and word banks, but the system only generates up to 100 training data.
- You can create data again if you want to add more than 100 generated training data or sentences. It recommended using new data samples when generating new training data to make similar combination training data or sentence results.
- If you’re not satisfied with generated training data, we recommend these approaches:
- Create a new data sample with a different pattern from previous data samples.
- If you want a specific sentence, but the system does not show what you want, then we recommend you to use the training feature in NLU > Training instead.
- If you have another NLU and want to use word banks or data samples from the current project, it cannot be done now.
- The bulk labeling feature is only for the Data Generator feature. You cannot use it for now independently.
This is the end of the guidance. You can contact support@kata.ai if you have any difficulties when implementing this.