How to Use Training Data Generator to Automate Generating Training Data

Introduction

Training Data Generator is a feature that automatically generates up to 100 training data entries from sample sentences, which we call data samples.

Before creating a data sample, you need to define a keyword and its synonyms in the “word bank” feature. This is required because the generator combines your data samples with the keyword variants to produce the training data. Let’s walk through an example:

You create a word bank named Kenapa that contains variants or synonyms of the word Kenapa, for example: knp, ngap, ngaps, kenaps. You now have 4 word variants.
Next, you create a data sample, for example: @Kenapa ATM saya bermasalah ya.
The generator combines the data sample with the word variants. The generated data will look like this:
- knp ATM saya bermasalah
- ngap ATM saya bermasalah
- ATM saya bermasalah ngaps
- kenaps ATM saya bermasalah
- And many more
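
The substitution described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Kata Platform's actual generator: the function name `expand_sample` and the dictionary shape used for word banks are hypothetical, and the sketch only swaps each `@Name` token for a variant without reordering or dropping words.

```python
from itertools import product


def expand_sample(sample: str, word_banks: dict[str, list[str]]) -> list[str]:
    """Return every sentence obtained by replacing @Name tokens with variants.

    Hypothetical helper: a token that starts with "@" and matches a word
    bank name is replaced by each variant of that bank; other tokens are kept.
    """
    choices = []
    for token in sample.split():
        name = token[1:] if token.startswith("@") else None
        if name in word_banks:
            choices.append(word_banks[name])  # one option per variant
        else:
            choices.append([token])  # plain word, single option
    return [" ".join(combo) for combo in product(*choices)]


# Word bank and data sample taken from the example in this guide.
word_banks = {"Kenapa": ["knp", "ngap", "ngaps", "kenaps"]}
generated = expand_sample("@Kenapa ATM saya bermasalah ya", word_banks)
for sentence in generated:
    print(sentence)
```

With one word bank reference and 4 variants, the sketch yields 4 sentences, one per variant.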
We also provide a bulk tagging feature that tags all generated training data with your available entities and labels. It is faster than annotating each sentence manually.

This guide explains how to use the Training Data Generator and how to bulk tag the generated data. Let’s get started.

Create Entity and Labels

Before you start, you will need to create at least one entity.

To create an entity, go to the NLU > Models menu. Then, click the Entity menu.

Figure 1. Models menu.

Click Create Entity to set up a new entity. A form appears on the right side of the screen. Fill in the entity name.

Here is an explanation of each field:

Inherit lets you copy an entity from any public NLU on the Kata Platform. To inherit from an NLU, type its NLU ID in the format [username]:[project name]. For example, in the getting started chatbot section, the NLU ID is muhfadhiilkata:test_simple_bot. Inherited entities stay in sync with the original entity, including its labels and training data.
Type sets the entity type. Kata Platform has 3 entity types:
- Trait is a text classifier that classifies a sentence into particular labels. This type is suitable for making your bot recognize nuance in a sentence.
- Dictionary is a word tagger that has keys and labels, which form an array.
- Phrase is a word tagger.

Figure 2. Entity type "dictionary"

Profile is the AI model used for the entity. Choose a profile based on your training data and the dataset you have.
- The “default” profile is suitable for a large number of data samples and is available for all entity types.
- The “intent” profile is suitable for the entity type “trait”.
- The “default_v2” profile is our newest profile and is suitable if you have a large amount of data.
Root lets you create an entity from another user's NLU by copying its training data and adding new labels. It is suitable if you have a trained NLU and want to add more labels to it. To use it, type the NLU ID in the format [username]:[project name].
Labels defines the classes or categories of an entity. To add a new label, type it and press “Enter” on your keyboard.
If you choose the entity type “dictionary”, a dropdown named Belongs to appears. It lets you define relationships between entities. Belongs to can only be used for entities in the same NLU with the entity type “phrase” or “dictionary”. It is suitable if your data sample looks like this:

Saya mau beli es kopi susu 2 dan less sugar

Es kopi susu is the main menu item. 2 and less sugar are modifiers that add detail to the order. With the Belongs to concept, you tag 2 and choose belongsTo.
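
The NLU ID format [username]:[project name] used by the Inherit and Root fields can be checked before you type it into the form. The following is a minimal sketch: `parse_nlu_id` is a hypothetical helper written for this guide, not part of any Kata Platform SDK, and it assumes that neither part is empty and that the ID contains exactly one colon.

```python
def parse_nlu_id(nlu_id: str) -> tuple[str, str]:
    """Split an NLU ID of the form username:project_name into its two parts.

    Assumption for this sketch: exactly one colon, with non-empty text
    on both sides. The platform's real validation rules may differ.
    """
    username, separator, project = nlu_id.partition(":")
    if not separator or not username or not project or ":" in project:
        raise ValueError(f"Expected [username]:[project name], got {nlu_id!r}")
    return username, project


# NLU ID taken from the example in this guide.
username, project = parse_nlu_id("muhfadhiilkata:test_simple_bot")
print(username, project)
```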

For this guide, you will create one entity of type “trait” to classify sentences, plus one entity of type “phrase” and one entity of type “dictionary” to extract words.

Click the Create Entity button.
Then, fill in as follows for the first entity:

Name: intent
Type: Trait
Label: statement

Here is the screenshot:

Figure 3. Create an entity.

Click the Create Entity button again to create the next entity.
Fill in as follows for the second entity:

Name: object
Type: phrase
Label: person

Here is the screenshot:

Figure 4. Create an entity.

Fill in as follows for the last entity:

Name: things
Type: dictionary
Key: car
Labels: suv, van, lcgc

Here is the screenshot:

Figure 5. Create an entity.

Click Create to finalize.

When you’re done, the next step is to create a word bank.

Create a Word Bank

In this part, you will create synonyms or variants of a keyword. Follow these rules when adding variants:

A variant can contain 2 words separated by a space, up to a maximum of 20 characters.
A word bank needs at least 2 variants, each with a maximum of 20 characters.
You can remove a variant by clicking the “x” button next to the word.
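
The rules above can be expressed as a small check that you could run on a list of variants before entering them. This is a sketch only: `validate_word_bank` is a hypothetical helper, and it assumes that the 20-character limit counts the space inside a two-word variant.

```python
MIN_VARIANTS = 2          # a word bank needs at least 2 variants
MAX_VARIANT_LENGTH = 20   # each variant has at most 20 characters


def validate_word_bank(variants: list[str]) -> list[str]:
    """Return a list of problems; an empty list means the variants pass."""
    problems = []
    if len(variants) < MIN_VARIANTS:
        problems.append(f"need at least {MIN_VARIANTS} variants, got {len(variants)}")
    for variant in variants:
        # Assumption: spaces count toward the character limit.
        if len(variant) > MAX_VARIANT_LENGTH:
            problems.append(f"variant {variant!r} is longer than {MAX_VARIANT_LENGTH} characters")
    return problems


# Variants taken from the word bank example in this guide.
print(validate_word_bank(["ngaps", "ngapa", "kenp", "knp", "kenopo"]))
```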

Steps to create a word bank:

Click the Word Bank tab, then click the Create button. A dialog appears.
Fill in the name of your word bank, for example Kenapa.
Then, create variants by typing synonyms of the word Kenapa. For example:

ngaps
ngapa
kenp
knp
kenopo

Press “Enter” on your keyboard.

Figure 6. Create a word bank.

Finally, you have variants. Then, click the Create button to create a word bank.

Create a Data Sample

Once you’ve created a word bank, you’re ready to make a data sample. A data sample, also called a “pattern”, is a sample sentence made up of a word bank reference and several other words. It serves as the base for generating training data. Here are the steps:

Go to NLU > Training.
Click on Bulk Training dropdown, then choose Data Generator.

Figure 7. Go to Data Generator feature.

You will see a row for entering a data sample. To add one of your available word banks, type “@” to show the word bank list.

Figure 8. Type “@” to show word bank list.

Input a data sample, for example:

@Kenapa kemarin Sinta tidak sekolah

Figure 9. Type a data sample.

Press “Add Row” to submit.
You can edit or delete your data sample after it has been created.

Figure 10. Update and delete data sample.

The data sample page shows the following information:

Total data possibilities is the total number of generated data entries that can be created from the available data samples. In this example, you have 2 data samples, multiplied by 1 word bank, multiplied again by 5 word variants, so there are 2 × 1 × 5 = 10 data possibilities.

Figure 11. Total data possibilities from available data samples.

Data to Generate is the number of generated data entries you want. The number must be equal to or below the total data possibilities.

Figure 12. Data to generate must be equal to or below the total data possibilities.
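
The arithmetic above, together with the rule that Data to Generate cannot exceed the total, can be sketched as follows. The function names are hypothetical and the multiplication simply mirrors the example in this guide; the platform may count possibilities differently when a data sample references several word banks. The limit of 100 comes from the introduction of this guide, and the lower bound of 1 is an assumption.

```python
GENERATION_LIMIT = 100  # this guide states that at most 100 entries are generated


def total_possibilities(data_samples: int, word_banks: int, variants: int) -> int:
    """Multiply the counts exactly as in the example: samples x banks x variants."""
    return data_samples * word_banks * variants


def can_generate(requested: int, total: int) -> bool:
    """Check the requested amount against the total and the 100-entry limit.

    Assumption: at least 1 entry must be requested.
    """
    return 1 <= requested <= min(total, GENERATION_LIMIT)


total = total_possibilities(data_samples=2, word_banks=1, variants=5)
print(total)                     # 10
print(can_generate(10, total))   # True
print(can_generate(11, total))   # False
```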

Important note:

Make sure there is a space between the word bank reference and the next or previous word. Example: @Kenapa Sinta tidak sekolah? or Aku lagi @Kenapa ya hari ini
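
The spacing rule in this note can be checked with a short helper before you add a row. This is a sketch under assumptions: `has_valid_spacing` is a hypothetical function, and it treats the start and end of the sentence as valid boundaries.

```python
def has_valid_spacing(sample: str, bank_name: str) -> bool:
    """Return True if every @bank_name reference is set off by whitespace.

    Assumption: a reference at the very start or end of the sample
    needs no space on that side.
    """
    reference = "@" + bank_name
    start = sample.find(reference)
    while start != -1:
        end = start + len(reference)
        before_ok = start == 0 or sample[start - 1].isspace()
        after_ok = end == len(sample) or sample[end].isspace()
        if not (before_ok and after_ok):
            return False
        start = sample.find(reference, end)  # look for the next reference
    return True


print(has_valid_spacing("Aku lagi @Kenapa ya hari ini", "Kenapa"))  # True
print(has_valid_spacing("Aku lagi@Kenapa ya hari ini", "Kenapa"))   # False
```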

Generate Training Data

This step guides you through generating training data from the data samples you created. Generated training data might not match your expectations because we are duplicating words. If you want specific training data, you can prepare it yourself and use bulk training instead. Learn more about bulk training.

Before you start, make sure your data samples are created. The Generate button is enabled when data samples are available and the data to generate is equal to or below the total data possibilities.
Enter 10 as the data to generate.
Click the Generate button on the top right of your screen.

Figure 13. Generate button.

It redirects you to the Generated Data page, where you can train using the bulk labeling feature.
You will see a training data list as follows.

Figure 14. Generated data from data sample and a word bank.

You can delete generated data by clicking on the remove button in the list.

Bulk labeling

You’re going to train the generated data by using bulk labeling. Bulk labeling is a feature that tags entities of type “trait”, “phrase” or “dictionary” across the generated data at once.

Let’s start by tagging the “trait” entity that you created earlier.

In the Bulk Labeling section, select the entity name intent. This entity type is “trait”.
Next, the available labels in the entity are shown. Because you created only 1 label, select the statement label.

Important note: Bulk labeling for entity type “trait” will tag all generated data.

Figure 15. Tag entity type “trait”.

Click the Add more entity button. Then, select the entity name object.
Next, select the person label. A text area named Words to Tag appears. Words to Tag lets you search for the words you want to tag. For example, you will tag the word sinta.
Press “Enter” on your keyboard to submit the word. It will look like this.

Figure 16. Tag `sinta` word in bulk labeling.

Important note: Bulk labeling for entity type “phrase” and “dictionary” will tag words only.

Then, add the “dictionary” entity. Click Add more Entity to start, then select the entity name things. A text area named Words to Tag appears, where you can search for the words you want to tag. For example, you will tag the word sekolah.

Important note: Bulk labeling for entity type “dictionary” does not show specific labels to tag, so the words are trained on the entity itself.

Figure 17. Tag `sekolah` word in bulk labeling.

Finally, click the Train button. It trains all generated data into the entity intent with the statement label, the entity object with the person label, and the entity things.
When training succeeds, you are redirected to the Training page. On this page, you can see that the generated data has been trained.

Figure 18. Data has been trained as `intent:statement`, `object:person` and `things`.

You can click the arrow on the right of the box to open the trained data detail.

Figure 19. How to show trained data detail.

Figure 20. Trained data detail.
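
The bulk labeling behavior described in this section can be modeled with a short sketch: a “trait” label is applied to every generated sentence, while “phrase” and “dictionary” entities tag only the matching words. All names here (`LabeledSentence`, `bulk_label`) are hypothetical, and the case-insensitive whole-word matching is an assumption, not a description of how Kata Platform searches for Words to Tag.

```python
import re
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class LabeledSentence:
    text: str
    traits: dict = field(default_factory=dict)  # entity name -> label
    spans: list = field(default_factory=list)   # (entity, label, start, end)


def bulk_label(sentences, trait_rules, word_rules):
    """Apply trait labels to all sentences and word tags where the word occurs.

    trait_rules: {entity: label}, applied to every sentence.
    word_rules: list of (entity, label_or_None, word); a dictionary entity
    has no specific label in bulk labeling, so its label is None here.
    """
    results = []
    for text in sentences:
        item = LabeledSentence(text=text, traits=dict(trait_rules))
        for entity, label, word in word_rules:
            # Assumption: match whole words, ignoring case.
            pattern = r"\b" + re.escape(word) + r"\b"
            for match in re.finditer(pattern, text, flags=re.IGNORECASE):
                item.spans.append((entity, label, match.start(), match.end()))
        results.append(item)
    return results


sentences = ["knp kemarin Sinta tidak sekolah", "ngaps kemarin Sinta tidak sekolah"]
word_rules = [("object", "person", "sinta"), ("things", None, "sekolah")]
labeled = bulk_label(sentences, {"intent": "statement"}, word_rules)
for item in labeled:
    print(item.text, item.traits, item.spans)
```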

Limitations

Keep these notes in mind when using the data generator:

You can create unlimited data samples and word banks, but the system generates at most 100 training data entries at a time.
If you need more than 100 generated sentences, you can run the generator again. We recommend using new data samples for each new run so that the results are not similar combinations of the same sentences.
If you’re not satisfied with the generated training data, we recommend these approaches:
- Create a new data sample with a different pattern from previous data samples.
- If you want a specific sentence that the system does not generate, use the training feature in NLU > Training instead.
Word banks and data samples from the current project cannot currently be reused in another NLU.
The bulk labeling feature is currently available only within the Data Generator feature. You cannot use it independently.

This is the end of the guide. Contact support@kata.ai if you run into any difficulties while following it.
