Plant foundation DNA large language models
The plant DNA large language models (LLMs) are a series of foundation models built on different model architectures and pre-trained on various plant reference genomes.
All the models are of comparable size, between 90 MB and 150 MB. A BPE tokenizer is used for tokenization, with a vocabulary of 8000 tokens.
Developed by: zhangtaolab
Model Sources
- Repository: Plant DNA LLMs
- Manuscript: Versatile applications of foundation DNA language models in plant genomes
Architecture
The model is based on the InstaDeepAI/agro-nucleotide-transformer-1b model.
It is fine-tuned to predict the H3K4me3 histone modification.
How to use
Install the runtime library first:
pip install transformers
Here is a simple inference example:
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline
model_name = 'agront-1b-H3K4me3'
# load model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained(f'zhangtaolab/{model_name}', trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(f'zhangtaolab/{model_name}', trust_remote_code=True)
# inference
sequences = ['TTCATCTCGTCCGACGCTTCAACCCGCACCGATCCTGCGCCACCCCTTCGCCGGCGGCTTCTCCCCTCCTCTTCCTCCGCCGCTGCATCGCCGTCCCAGGAACTTGGACACGTCGCCTCTCGCCGGCGACCATGTACCGCGCCCTCCGCTCTCTCAAGGTTTCCCCGTCTGCACCCCCCCAACCTTCTACGACGTGTGGCGTTGCGTGTCTCGATCCATTTGGGATGAATGCGCTGGAGTGTTAGA',
'ATCAATATTCCCAACAGGTTTTGAAGCAATGGATGAAACATCATCCTTCACGGAACTGGATTATGGGATTCGCCGGCTGGACCACGCTGTTGGGAATGTGCCGGAGCTGGGTCCTGTAGTGGATTACATCAAGGCGTTTACGGGGTTTCATGAATTTGCGGAGTTTACAGCT']
pipe = pipeline('text-classification', model=model, tokenizer=tokenizer,
                trust_remote_code=True, top_k=None)
results = pipe(sequences)
print(results)
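With top_k=None, the text-classification pipeline returns, for each input sequence, a list of label/score dictionaries covering every class. The following is a minimal sketch of reducing that output to the top-scoring label per sequence. The helper name best_labels is hypothetical, and the label names and scores in example_results are made-up placeholders, not real predictions from this model:

```python
from typing import List


def best_labels(results: List[List[dict]]) -> List[dict]:
    """Return the highest-scoring {'label', 'score'} entry for each input sequence."""
    return [max(per_sequence, key=lambda entry: entry["score"])
            for per_sequence in results]


# Hypothetical pipeline output for two sequences.
# Label names and scores are placeholders, not real model predictions.
example_results = [
    [{"label": "LABEL_1", "score": 0.91}, {"label": "LABEL_0", "score": 0.09}],
    [{"label": "LABEL_0", "score": 0.73}, {"label": "LABEL_1", "score": 0.27}],
]

for top in best_labels(example_results):
    print(f"{top['label']}\t{top['score']:.2f}")
```

In practice, the results variable from the inference example would be passed in place of example_results. The actual label names depend on the id2label mapping stored in the model configuration.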
Training data
We use EsmForSequenceClassification to fine-tune the model.
The detailed training procedure can be found in our manuscript.
Hardware
The model was trained on an NVIDIA RTX 4090 GPU (24 GB).