I’m trying to write a multi lingual model, where I need to use multiple embeddings based on a SequenceLabelField, for example: “The garçon is going to maison” with language tags of E,F,E,E,E,F (English and French). How do I assign pre trained vectors from different files to this combination?
The closest I’ve come to is writing a custom TextFieldEmbedder, having a similar functionality to BasicTextFieldEmbedder. But that means modifying the model code, which currently I’ve been using the defaults present in the library, and (possibly) having to pass the SequenceLabelField in the embedder call. This approach just seems ugly tbh.
Do you guys have a better idea to do the same, or just any suggestions to improve the above approach?
I’m pretty sure that something like the approach I suggested here should work: Fine-tuning only part of an embedding matrix, given a vocabulary. That assumes that you have language labels at test time, though, which may or may not be a good assumption for what you’re doing. (I think you fundamentally have to make that assumption if you’re going to get different base embeddings based on the language, though.)
I’m going light on details for now, in case what I said in that other topic is good enough to get you started. Feel free to ask more questions if you get stuck.