Empowering Language Diversity: Sarvam AI Unveils OpenHathi-Hi-v0.1, a Groundbreaking Hindi Language Model

Explore the future of language processing with Sarvam AI’s groundbreaking OpenHathi-Hi-v0.1, the first Hindi Large Language Model (LLM) in the OpenHathi series. Uncover its superior performance, leveraging Meta AI’s Llama2-7B architecture, and its potential to revolutionize Hindi language tasks. Discover the innovative training methods, real-world applications, and strategic collaborations propelling Sarvam AI to the forefront of linguistic diversity in artificial intelligence.


Sarvam AI, a promising India-based AI startup, has recently unveiled OpenHathi-Hi-v0.1, marking a significant milestone as the inaugural Hindi Large Language Model (LLM) in the OpenHathi series. Leveraging Meta AI’s advanced Llama2-7B architecture, the model boasts performance on par with GPT-3.5, specifically tailored for Indic languages.

Translation plays a pivotal role in advancing Indic language AI. Sarvam AI’s approach involves fine-tuning the base model using a human-verified subset of the BPCC dataset. To gauge its efficacy, Sarvam AI has conducted comprehensive comparisons against generative models (GPT-3.5/GPT-4) and leading translation models (IndicTrans2/Google Translate) using the FLORES-200 benchmark.
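To make such a comparison concrete, the sketch below scores a model’s Hindi → English output against FLORES-200-style references with the sacrebleu library. The file names and the model_translate function are hypothetical placeholders for illustration, not Sarvam AI’s actual evaluation code.

```python
# Minimal BLEU-scoring sketch with sacrebleu. File names and the
# model_translate() function are hypothetical placeholders, not
# Sarvam AI's actual evaluation pipeline.
import sacrebleu


def model_translate(sentences):
    """Placeholder: run the model under evaluation and return one translation per input."""
    raise NotImplementedError


# FLORES-200-style parallel test files: one sentence per line, aligned by line number.
with open("flores200.hin_Deva.devtest", encoding="utf-8") as f:
    hindi_sources = [line.strip() for line in f]
with open("flores200.eng_Latn.devtest", encoding="utf-8") as f:
    english_references = [line.strip() for line in f]

hypotheses = model_translate(hindi_sources)

# corpus_bleu takes a list of hypotheses and a list of reference lists.
bleu = sacrebleu.corpus_bleu(hypotheses, [english_references])
print(f"BLEU: {bleu.score:.2f}")
```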

Devanagari Hindi → English Translation:
In this direction, Sarvam AI’s model showcases superior performance, outperforming both GPT-3.5 and GPT-4 in terms of BLEU score. While it trails slightly behind IndicTrans2 and Google Translate, which are specifically fine-tuned for translation tasks, the model establishes itself as a formidable contender. (All Google Translate scores are taken from the IndicTrans2 report.)

English → Devanagari Hindi Translation:
When translating from English to Devanagari Hindi, Sarvam AI’s model stands on par with the specialized IndicTrans2 and Google Translate models. Notably, it outshines GPT-3.5 and GPT-4 by a significant margin, affirming its efficacy in this language direction.

While the predominant focus in translation efforts has been on Devanagari Hindi, Sarvam AI’s exploration extends to translation between Romanized Hindi and English. Acknowledging the popularity of Romanized input for Hindi, this task opens up new avenues for AI models to enhance content accessibility. Notably, existing translation systems do not support Romanized Hindi, necessitating benchmarking solely against GPT models.

Romanized Hindi ↔ English Translation:
Sarvam AI’s model excels significantly in both directions when compared to GPT models. The enhanced performance underscores its potential to bridge gaps in Romanized Hindi ↔ English translation, offering improved accessibility for a broader audience. Neither IndicTrans2 nor the Google Translate API supports Romanized Hindi.

As Sarvam AI continues to refine and evolve its approach to translation tasks, these findings highlight the model’s effectiveness across language directions. The commitment to addressing linguistic diversity and making content accessible remains at the forefront of the company’s work in Indic language AI.

This innovative AI model, meticulously developed by Sarvam AI, extends Llama2-7B’s tokenizer to a 48,000-token vocabulary and undergoes a two-phase training process. The initial phase focuses on embedding alignment, aligning the randomly initialized Hindi embeddings with the rest of the model. The subsequent phase, bilingual language modeling, trains the model to attend cross-lingually across tokens.
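A minimal sketch of what that first phase could look like with Hugging Face transformers is shown below: the vocabulary is grown to the extended tokenizer’s size and only the embedding matrices are left trainable. The tokenizer path and the freezing strategy are assumptions for illustration, not Sarvam AI’s published training code.

```python
# Sketch of a phase-1 "embedding alignment" setup: grow the vocabulary and
# train only the embedding matrices while the rest of Llama2-7B stays frozen.
# Assumptions for illustration only -- not Sarvam AI's published training code.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/extended-48k-tokenizer")  # hypothetical path
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Grow the embedding matrix to cover the newly added Hindi tokens.
model.resize_token_embeddings(len(tokenizer))

# Freeze everything, then unfreeze only the input and output embeddings.
for param in model.parameters():
    param.requires_grad = False
for param in model.get_input_embeddings().parameters():
    param.requires_grad = True
for param in model.get_output_embeddings().parameters():
    param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Phase-1 trainable parameters: {trainable:,}")
```

In the subsequent bilingual language-modeling phase, the remaining parameters would then be unfrozen for full training.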

Sarvam AI proudly asserts, “We show that our model works as well as, if not better than GPT-3.5 on various Hindi tasks while maintaining its English performance.” This achievement, documented in a post on X (formerly Twitter), highlights the company’s commitment to advancing language-specific AI capabilities.

Beyond conventional Natural Language Generation (NLG) tasks, Sarvam AI conducted evaluations of the model’s performance on real-world applications. The company’s emphasis on practical usability showcases the versatility and potential impact of OpenHathi-Hi-v0.1.


In a strategic collaboration, the five-month-old AI startup partnered with KissanAI to fine-tune its base model using conversational data gathered from interactions with farmers. This dataset, originating from a GPT-powered bot engaging farmers in various languages, contributed to enhancing the Hindi skills of Llama-2.

To optimize the tokenization of Hindi text, Sarvam AI first reduced the tokenizer’s fertility score, i.e., the average number of tokens a word is split into. This adjustment accelerates both training and inference and boosts overall efficiency.
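As a rough illustration of what the fertility score measures, the snippet below counts tokens per whitespace-separated word for a tokenizer over a small Hindi sample. The word-splitting rule and the sample sentence are simplifying assumptions, not Sarvam AI’s measurement methodology.

```python
# Rough fertility-score illustration: average tokens per whitespace-separated
# word. Simplified for illustration; not Sarvam AI's exact measurement.
from transformers import AutoTokenizer


def fertility(tokenizer, texts):
    total_words = sum(len(text.split()) for text in texts)
    total_tokens = sum(len(tokenizer.tokenize(text)) for text in texts)
    return total_tokens / total_words


sample = ["भारत एक विशाल और विविधतापूर्ण देश है।"]  # tiny illustrative sample
llama_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
print("Llama2 fertility on sample:", round(fertility(llama_tok, sample), 2))
# A Hindi-extended tokenizer, loaded the same way, should report a lower
# value, i.e., fewer tokens per word.
```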

In a blog post, the company elaborated on its methodology, stating, “We train a sentence-piece tokenizer from a subsample of 100K documents from the Sangraha corpus, created at AI4Bharat, with a vocabulary size of 16K. We then merge this with the Llama2 tokenizer and create a new tokenizer with a 48K vocabulary (32K original vocabulary plus our added 16K).”
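A hedged sketch of that workflow is given below: a 16K SentencePiece model is trained on a text subsample and its pieces are folded into the Llama2 tokenizer via add_tokens. The file paths are hypothetical, and Sarvam AI’s actual merge (which produces a single 48K tokenizer) may use a different mechanism.

```python
# Sketch: train a 16K SentencePiece model on a Hindi subsample and fold its
# pieces into the Llama2 tokenizer. Paths are hypothetical; Sarvam AI's actual
# merge into a single 48K tokenizer may use a different mechanism.
import sentencepiece as spm
from transformers import AutoTokenizer

# 1. Train a 16K-vocabulary SentencePiece model (model type assumed).
spm.SentencePieceTrainer.train(
    input="sangraha_subsample.txt",   # hypothetical dump of the 100K-document subsample
    model_prefix="hindi_sp",
    vocab_size=16000,
    model_type="bpe",
)

# 2. Load the new pieces and the original 32K Llama2 tokenizer.
sp = spm.SentencePieceProcessor(model_file="hindi_sp.model")
new_pieces = [sp.id_to_piece(i) for i in range(sp.get_piece_size())]
llama_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# 3. Add pieces not already in the Llama2 vocabulary, approaching ~48K total.
added = llama_tok.add_tokens([p for p in new_pieces if p not in llama_tok.get_vocab()])
print(f"Added {added} tokens; new vocabulary size: {len(llama_tok)}")
llama_tok.save_pretrained("llama2-hindi-extended-tokenizer")
```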

Founded in July 2023 by Vivek Raghavan and Pratyush Kumar, Sarvam AI has rapidly gained recognition and secured a substantial $41 million in a recent funding round. The investment, led by Lightspeed Ventures and featuring participation from Peak XV Partners and Khosla Ventures, attests to the industry’s confidence in Sarvam AI’s innovative approach to language models and its potential impact on advancing AI capabilities tailored to linguistic diversity. As Sarvam AI continues to push boundaries, OpenHathi-Hi-v0.1 emerges as a pioneering force in bridging the gap for Hindi language processing in the evolving landscape of artificial intelligence.

Anika V
