From top math student to US computer science professor: Researcher builds largest Vietnamese language dataset – VnExpress International

Nguyen Huu Thien built on his passion for mathematics to become an associate professor of computer science in the U.S. and develop an AI model trained on the largest Vietnamese dataset to date.

Thien, 37, who is at the University of Oregon’s department of computer science, focuses on multi-domain and multilingual natural language processing.

In November his team released a new Vietnamese language model called SaoLa-3B-Instruct, trained on the largest Vietnamese dataset to date, with more than 230 billion tokens. Thien received the prestigious CAREER Award from the U.S. National Science Foundation in 2023, which honors promising early-career professors. His research papers have been cited more than 10,000 times on Google Scholar.

“I am happy to work on what I am passionate about, to create useful products, and to be supported in doing so,” he says.

His passion for mathematics emerged even while he was in middle school, when he developed a fascination with solving complex problems. He went on to study at Hung Yen High School for the Gifted, winning second prize in the national mathematics competition in his final year. The achievement earned him direct admission in 2006 to the elite talent program at the Hanoi University of Science and Technology, where he majored in computer science.

His career in research began when a professor, impressed by one of his assignments on knowledge systems, invited him to assist in a project on Vietnamese information extraction. At a time when machine learning was gaining traction, Thien experimented with general programming models to automate data processing, helping accelerate the project.

Nguyen Huu Thien, 37, an associate professor at the University of Oregon’s department of computer science. Photo courtesy of Thien

Encouraged by his supervisor, he compiled his findings and wrote his first scientific paper. He later expanded that research into his graduation thesis, earning second prize in the Ministry of Education and Training’s national scientific research competition for students. From having no prior research experience, he had learned every step of the process, from defining problems and experimentation to writing and presentation, and found himself deeply suited to academic inquiry.

After graduation, he decided to pursue further education abroad to deepen his research. He contacted renowned professors specializing in NLP and information extraction, one of whom was Ralph Grishman, a pioneer in the field. Grishman responded warmly and invited him to join his lab at New York University as a PhD student in 2012.

However, the early years were challenging. “I tried all the research directions suggested by my advisor but could not find one that truly inspired me,” he recalls.

That changed when he took a course taught by Yann LeCun, the “godfather of AI,” and became intrigued by deep learning. At the time deep learning was primarily used in computer vision rather than language processing. “I thought, why not try applying it to information extraction?” he says.

The results exceeded expectations: his initial experiments outperformed traditional approaches, proving faster, more generalizable and more adaptable across text types. With his advisor’s approval, he pursued this new direction, which led to a series of innovative papers.

In 2016 he received the Harold Grad Award from NYU’s Courant Institute of Mathematical Sciences for outstanding PhD research potential. After earning his doctorate, Thien worked as a postdoctoral researcher at the University of Montreal with AI pioneer Yoshua Bengio, before joining the University of Oregon as a faculty member in 2018.

“This work gives me the freedom to choose topics, methods and collaborators,” he says, adding that what he enjoys most is working with students, as they bring a “spark of positive energy” that keeps him moving forward.

Reza Rejaie, chair of the department of computer science at the University of Oregon, describes Thien as a “star,” praising his leadership in developing deep learning methods for information extraction and multilingual NLP. “Thien has played a key role in major AI projects since joining the university. He has led cutting-edge, high-efficiency deep learning research for large-scale data applications.”

Among Thien’s proudest projects is CulturaX, a multilingual dataset covering 167 languages. Launched in 2022 as large language models like ChatGPT gained popularity, CulturaX was built in response to the lack of transparency around training data used by major tech firms. He describes it as a multi-stage project involving tasks such as filtering out poor-quality, biased or duplicate data, and it reached tens of terabytes in size.

Upon its release, CulturaX received positive feedback and has since been used by companies and research labs, including Stability AI and Eleuther AI, to train their language models. Building on that success, Thien and his team later developed Vistral, an open-source Vietnamese language model, before launching SaoLa-3B-Instruct.

The two-year project involved collecting and verifying a massive dataset while refining tools tailored for Vietnamese language processing. “The name SaoLa reflects our pride in a rare and iconic Vietnamese animal,” Thien says. “We hope our models embody the same spirit: unique, high-quality and authentically Vietnamese.”

For him, creating something useful and contributing to the community is the natural goal of any research project. “The best part of doing research is discovering and learning new things along the way. When that happens, even failure does not matter.”

He emphasizes the importance of balancing resources and working methodically. “Working carefully rather than taking shortcuts” is one of the most important lessons he has drawn from nearly two decades in the field.

“The outcomes of fundamental research and knowledge discovery do not come quickly. Instead of feeling pressured by stories of early success on social media, young people should patiently build their own foundations.”

He believes there are still many major AI challenges awaiting creative solutions. Looking ahead, he hopes to mentor more Vietnamese students in advanced research environments while continuing to enhance Vietnamese language models and datasets for global use.

colind88
