The Samsung SDS R&D Center won first place in the machine translation shared task of the Vietnamese Natural Language Processing Competition

What is VLSP?

VLSP is an annual conference on Vietnamese Language and Speech Processing organized by the VLSP Community. The VLSP Consortium brings together the academic and industrial research teams involved in Vietnamese language and speech processing.

Since 2012, the VLSP Consortium has organized a series of workshops in conjunction with major international conferences held in Vietnam. The 9th International Workshop, held on 26 November 2020, attracted the attention of the community.

The Shared Task on Machine Translation in the VLSP evaluation campaign marked a successful comeback of machine translation in VLSP activities.

The machine translation shared task at VLSP

PARTICIPANTS

25 teams in total:
✓ Academic: Stanford University, JAIST, HUST, UIT-VNUHCM, UET-VNU, ...
✓ Industry: Samsung SDS, VinBigData, VCCorp, Ftech, ...
Final 5 teams:
✓ Samsung SDS (SDS), VinBigData (VBD-MT), JAIST (JNLP), VCCorp (VC - Datamining), HUST (S-NLP)

TASK DESCRIPTION

→ Chinese → Vietnamese machine translation
→ Vietnamese → Chinese machine translation

EVALUATION

→ Automatic evaluation by BLEU score
→ Human evaluation (the final evaluation)

Training and test data

Flow of data processing and model training

Technical solution

Used the mBART-50 pre-trained model (https://arxiv.org/abs/2001.08210)

Trained our model on the bilingual dataset to obtain a baseline model

Selected 200K sentences from the monolingual dataset whose domain is similar to that of the bilingual training data, and used the baseline model to generate an extra bilingual dataset (200K) from them

Trained the model with the extra bilingual dataset
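The data-augmentation steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the team's actual code: the article does not say how domain similarity was measured, so `domain_score` uses a simple vocabulary-overlap heuristic as a stand-in, and `translate` is a hypothetical callable standing in for the trained baseline model. The article also does not say whether the monolingual text was on the source or the target side; the sketch simply feeds it to the baseline model and pairs each sentence with the model output. Sentences are assumed to be pre-tokenized with spaces.

```python
from collections import Counter
from typing import Callable, List, Tuple


def domain_score(sentence: str, domain_vocab: Counter) -> float:
    """Fraction of the sentence's tokens that also occur in the in-domain data.

    Assumption: vocabulary overlap stands in for the (unspecified) similarity measure.
    """
    tokens = sentence.split()
    if not tokens:
        return 0.0
    return sum(1 for tok in tokens if tok in domain_vocab) / len(tokens)


def select_in_domain(monolingual: List[str], in_domain: List[str], k: int) -> List[str]:
    """Keep the k monolingual sentences that look most like the bilingual training data."""
    vocab = Counter(tok for sent in in_domain for tok in sent.split())
    ranked = sorted(monolingual, key=lambda s: domain_score(s, vocab), reverse=True)
    return ranked[:k]


def build_extra_pairs(
    selected: List[str], translate: Callable[[str], str]
) -> List[Tuple[str, str]]:
    """Pair each selected sentence with the baseline model's output for it."""
    return [(sent, translate(sent)) for sent in selected]


if __name__ == "__main__":
    # Toy data; in the article the selection size is 200K sentences.
    bilingual_side = ["the cat ran", "the dog sat"]
    monolingual = ["the cat sat", "stock prices fell", "the dog sat down"]

    def toy_translate(sentence: str) -> str:
        # Hypothetical placeholder for the baseline model.
        return sentence.upper()

    selected = select_in_domain(monolingual, bilingual_side, k=2)
    for pair in build_extra_pairs(selected, toy_translate):
        print(pair)
```

The extra pairs would then be combined with, or used alongside, the original bilingual data for the final training step; how the two sets were mixed is not stated in the article.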

The final evaluation

→ Chinese → Vietnamese Translation Task

  • ✓ Rank 1: SDS, BLEU = 34.19, Human = 74.73
  • ✓ Rank 2: VBD-MT, BLEU = 34.21, Human = 71.42
  • ✓ Rank 3: S-NLP, BLEU = 29.91, Human = 68.74

→ Vietnamese → Chinese Translation Task

  • ✓ Rank 1: VBD-MT, BLEU = 17.95, Human = 69.19
  • ✓ Rank 2: vc-datamining, BLEU = 17.1, Human = 67.80
  • ✓ Rank 3: SDS, BLEU = 21.87, Human = 67.68

→ Overall ranking: average of the two tasks' human evaluation scores, (Task 1 + Task 2)/2

  • ✓ Rank 1: SDS, 71.21
  • ✓ Rank 2: VBD-MT, 70.30
  • ✓ Rank 3: JNLP, 66.18
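As a quick check of the overall formula, the short sketch below averages the two per-task scores for the teams whose scores appear in both result lists. It assumes that the averaged quantity is the human evaluation score, since that reproduces the reported overall figures to within rounding; the variable and function names are illustrative only. JNLP is left out because its per-task scores are not listed.

```python
# Human evaluation scores copied from the two result lists: (Task 1, Task 2).
human_scores = {
    "SDS": (74.73, 67.68),
    "VBD-MT": (71.42, 69.19),
}


def overall_score(task1: float, task2: float) -> float:
    """Overall score as defined in the article: (Task 1 + Task 2) / 2."""
    return (task1 + task2) / 2


overall = {team: overall_score(*scores) for team, scores in human_scores.items()}

# Sort teams from highest to lowest overall score.
ranking = sorted(overall, key=overall.get, reverse=True)

for team in ranking:
    print(team, overall[team])
```

SDS places third on the second task but still leads overall because its margin on the first task is larger than its deficit on the second.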

SDSRV - SAMSUNG SDS R&D CENTER IN VIETNAM
