An Investigation Into Supervision for Seq2Seq Techniques for Natural Language to Code Translation
Date
Authors
Yeditha, Meheresh Sai
Abstract
This thesis examines the role of supervised data in small-scale settings for the natural language to code (NL2C) task. The primary lines of inquiry are the balance between unsupervised and supervised learning, and experimentation with several training techniques. Two publicly available datasets were used: CodeSearchNet, which pairs English documentation with Python code, and the Mostly Basic Python Problems (MBPP) dataset, with the mBART seq2seq framework used to run the experiments. The best-performing models were pretrained on the full CodeSearchNet dataset and finetuned on the MBPP dataset. Several avenues for future inquiry and effective experimentation were identified and solidified, including Lample masking, the creation of more datasets fitting the NL2C paradigm, and the size and division of datasets. Finetuning proved significantly more important than the pretraining phase, although both are crucial when using the seq2seq framework. Overall, this thesis solidifies the utility of seq2seq frameworks for the NL2C task, and the promise of transfer learning and further inquiry for this task going forward.
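The finetuning setup the abstract describes, natural-language prompts paired with Python solutions as in MBPP, can be sketched in a few lines. This is an illustrative sketch, not code from the thesis: the record fields (`text`, `code`), the 80/20 split, and the fixed seed are all assumptions for demonstration.

```python
import random

def make_seq2seq_pairs(examples):
    """Format NL2C examples as (source, target) strings for a seq2seq model.

    Each example is a dict with a natural-language 'text' prompt and a
    Python 'code' solution, mirroring the shape of an MBPP record
    (field names assumed for illustration).
    """
    return [(ex["text"], ex["code"]) for ex in examples]

def split_dataset(pairs, train_frac=0.8, seed=0):
    """Shuffle and divide pairs into train/validation splits."""
    rng = random.Random(seed)
    shuffled = pairs[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical MBPP-style records (illustrative only).
examples = [
    {"text": "Write a function to add two numbers.",
     "code": "def add(a, b):\n    return a + b"},
    {"text": "Write a function to square a number.",
     "code": "def square(x):\n    return x * x"},
    {"text": "Write a function to negate a number.",
     "code": "def negate(x):\n    return -x"},
    {"text": "Write a function to double a number.",
     "code": "def double(x):\n    return 2 * x"},
    {"text": "Write a function to halve a number.",
     "code": "def halve(x):\n    return x / 2"},
]

pairs = make_seq2seq_pairs(examples)
train, val = split_dataset(pairs, train_frac=0.8)
print(len(train), len(val))  # 4 1
```

Each (source, target) pair would then be tokenized and fed to the seq2seq model, with the division between splits being one of the dataset-size questions the thesis raises.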
Description
Thesis (Master's)--University of Washington, 2022
