If you want to fine-tune Llama 2 or a similar model, embed each pair (both together and separately) and store the embeddings. Then use the unlabeled data (the source text without a translation) to query the embeddings for similar matches. You then send the matches, along with the necessary prompt text, plus the text to translate. You'll want to do this with a foundation model, like GPT-x.
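Here's a minimal sketch of that retrieval step, assuming sentence-transformers for embeddings and a plain numpy similarity search (the pair data, model name, and prompt wording are all illustrative placeholders, not part of any specific pipeline):

```python
# Embed source/translation pairs, retrieve similar examples for a new
# source text, and assemble a few-shot translation prompt for the
# foundation model.
import numpy as np
from sentence_transformers import SentenceTransformer

pairs = [
    ("Guten Morgen", "Good morning"),
    ("Wie geht es dir?", "How are you?"),
]  # (source, translation) -- replace with your real corpus

model = SentenceTransformer("all-MiniLM-L6-v2")
source_vecs = model.encode([s for s, _ in pairs], normalize_embeddings=True)

def top_k_matches(query: str, k: int = 2):
    """Return the k stored pairs whose source text is most similar."""
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = source_vecs @ q  # cosine similarity on normalized vectors
    return [pairs[i] for i in np.argsort(scores)[::-1][:k]]

def build_prompt(text_to_translate: str) -> str:
    """Prepend retrieved pairs as few-shot examples before the query."""
    examples = "\n".join(
        f"Source: {s}\nTranslation: {t}"
        for s, t in top_k_matches(text_to_translate)
    )
    return (
        "Translate the source text, following the style of these examples:\n"
        f"{examples}\n\nSource: {text_to_translate}\nTranslation:"
    )

print(build_prompt("Guten Abend"))
```

At scale you'd swap the numpy scan for a vector store, but the shape of the flow is the same: embed, retrieve, prompt.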
As noted below, extracting words or key terms might be a good idea, as they could be included in the training set.
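For the key-term extraction itself, something like KeyBERT works (that library choice is my assumption; any keyword extractor would do):

```python
# Pull a handful of key terms from a source text so they can be
# attached to the corresponding training example.
from keybert import KeyBERT

kw_model = KeyBERT()
text = "The hydraulic pump regulates pressure in the braking system."
keyterms = [term for term, _score in kw_model.extract_keywords(text, top_n=5)]
print(keyterms)
```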
The training set would then consist of the prompt, the translation, and the key terms. Since you'll want to vet the generated texts anyway, you can then decide whether the foundation model is performing well enough. You could also try running the largest "open" model you can find on the prompts, to see whether it needs fine-tuning as well. There are many Llama models on Hugging Face trained for specific language pairs, so check whether your languages are already covered and test those first.
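A sketch of what one training record might look like, written as JSONL (the field names and file layout are assumptions; match them to whatever your fine-tuning harness expects):

```python
# Append one fine-tuning record combining the prompt, extracted key
# terms, and the vetted reference translation.
import json

record = {
    "prompt": (
        "Translate the source text, following the style of these examples:\n"
        "Source: Guten Morgen\nTranslation: Good morning\n\n"
        "Source: Guten Abend\nTranslation:"
    ),
    "keyterms": ["Abend"],
    "completion": "Good evening",
}

with open("train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```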
I'm building a simple, Open Source ML pipeline manager at https://ai.featurebase.com/. I'd be down to help you with this!