Named entity recognition with Huggingface's transformers library

Using v3.4.0 of Huggingface's transformers library

This post runs named entity recognition with the transformers library. To use Tohoku University's BERT model, you need v3.4.0 of Huggingface's transformers library.

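- For models other than the Tohoku University one (NICT, Kyoto University, etc.), the latest transformers can be used (examples/pytorch/token-classification/run_ner.py). Using JSON for the input file is probably the easiest; the JSON file carries one sentence per line, each consisting of a word sequence and a label sequence.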

Please let me know if I have overlooked anything.
— Tomohide Shibata (@stomohide) September 14, 2021
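
The JSON format described in the tweet (one sentence per line, with a word sequence and a label sequence) could be produced roughly as follows. This is only a minimal sketch: the key names "tokens" and "ner_tags" and the sample sentences are assumptions made for illustration, not something stated in the tweet.

```python
import json

# Hypothetical sample data: each sentence is a word list plus a parallel label list.
sentences = [
    (["東北", "大学", "で", "研究", "する"], ["B-ORG", "I-ORG", "O", "O", "O"]),
    (["京都", "に", "行く"], ["B-LOC", "O", "O"]),
]


def to_json_lines(sentences):
    """Serialize each (words, labels) pair as one JSON object per line."""
    lines = []
    for words, labels in sentences:
        if len(words) != len(labels):
            raise ValueError("words and labels must have the same length")
        # "tokens" and "ner_tags" are assumed key names (not taken from the tweet).
        record = {"tokens": words, "ner_tags": labels}
        lines.append(json.dumps(record, ensure_ascii=False))
    return lines


json_lines = to_json_lines(sentences)
for line in json_lines:
    print(line)
```

Writing these lines to a file, one per line, would give a file in the shape the tweet describes.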

As the tweet above says, BertJapaneseTokenizer has no fast version, so it cannot be used with run_ner.py in the latest transformers: https://github.com/huggingface/transformers/issues/12381

Since v4.0.0, transformers tokenizers use FastTokenizer by default

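Until then, the tokenizers in transformers were Python implementations bundled with the library itself. From v4.0.0 onward, the implementation known as FastTokenizer is used by default. This tokenizer has also been split out into a separate library, tokenizers.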

Running it

Clone the repository

git clone git@github.com:huggingface/transformers.git -b v3.4.0
cd transformers/examples/token-classification

Preparing the data and setting environment variables

curl -L 'https://drive.google.com/uc?export=download&id=1Jjhbal535VVz2ap4v4r_rN1UEHTdLK5P' \
| grep -v "^#" | cut -f 2,3 | tr '\t' ' ' > train.txt.tmp
curl -L 'https://drive.google.com/uc?export=download&id=1ZfRcQThdtAR5PPRjIDtrVP7BtXSCUBbm' \
| grep -v "^#" | cut -f 2,3 | tr '\t' ' ' > dev.txt.tmp
curl -L 'https://drive.google.com/uc?export=download&id=1u9mb7kNJHWQCWyweMDRMuTFoOHOfeBTH' \
| grep -v "^#" | cut -f 2,3 | tr '\t' ' ' > test.txt.tmp


export MAX_LENGTH=128
export BERT_MODEL=bert-base-multilingual-cased

python3 scripts/preprocess.py train.txt.tmp $BERT_MODEL $MAX_LENGTH > train.txt
python3 scripts/preprocess.py dev.txt.tmp $BERT_MODEL $MAX_LENGTH > dev.txt
python3 scripts/preprocess.py test.txt.tmp $BERT_MODEL $MAX_LENGTH > test.txt

cat train.txt dev.txt test.txt | cut -d " " -f 2 | grep -v "^$"| sort | uniq > labels.txt


export OUTPUT_DIR=germeval-model
export BATCH_SIZE=32
export NUM_EPOCHS=3
export SAVE_STEPS=750
export SEED=1
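
For reference, the step above that builds labels.txt (take the second space-separated field, drop blank lines, sort, deduplicate) can be sketched in Python as below. This is a rough equivalent under assumptions: the function name collect_labels and the sample lines are made up for illustration, and the ordering may differ from the shell sort depending on locale.

```python
def collect_labels(lines):
    """Return the sorted unique labels from 'token label' lines."""
    labels = set()
    for line in lines:
        fields = line.rstrip("\n").split(" ")
        # Blank lines separate sentences; lines without a second field carry no label.
        if len(fields) < 2 or not fields[1]:
            continue
        labels.add(fields[1])
    return sorted(labels)


# Hypothetical sample in the "token label" format, one token per line.
sample = ["東北 B-ORG\n", "大学 I-ORG\n", "で O\n", "\n", "京都 B-LOC\n"]
unique_labels = collect_labels(sample)
print(unique_labels)
```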

Run

python3 run_ner.py --data_dir ./ --labels ./labels.txt --model_name_or_path $BERT_MODEL --output_dir $OUTPUT_DIR --max_seq_length $MAX_LENGTH --num_train_epochs $NUM_EPOCHS --per_device_train_batch_size $BATCH_SIZE --save_steps $SAVE_STEPS --seed $SEED --do_train --do_eval --do_predict

The cached files are built from train.txt and the other files used on the very first run

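After running the above, I changed train.txt and test.txt partway through, but the changes did not seem to be picked up. The conclusion is that the train.txt from the very first run is the one being used.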

TokenClassificationDataset class: https://github.com/huggingface/transformers/blob/eb0e0ce2adf66d2b1106b9e0852f6e063ab9ae7c/examples/token-classification/run_ner.py#L179

The cached file is created in the code linked here: https://github.com/huggingface/transformers/blob/91ff480e2693f36b11aaebc4e9cc79e4e3c049da/examples/legacy/token-classification/utils_ner.py#L236 Deleting this cached file and then running run_ner.py stopped with the following error.

Traceback (most recent call last):
  File "run_ner.py", line 317, in <module>
    main()
  File "run_ner.py", line 183, in main
    TokenClassificationDataset(
  File "/Users/xxxxx/Documents/GitHub/transformers-v3.4.0/examples/token-classification/utils_ner.py", line 247, in __init__
    examples = token_classification_task.read_examples_from_file(data_dir, mode)
  File "/Users/xxxxx/Documents/GitHub/transformers-v3.4.0/examples/token-classification/tasks.py", line 24, in read_examples_from_file
    with open(file_path, encoding="utf-8") as f:
FileNotFoundError: No such file or directory: './train.txt'

After deleting the cached files and recreating a smaller train.txt, the run used the recreated train.txt and the other files.
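
Deleting the cached files before a rerun can be scripted along these lines. This is a sketch under assumptions: it assumes the cache files sit directly in the data directory and have names starting with "cached_"; the helper name remove_cached_features and the file name used in the demo are hypothetical.

```python
import glob
import os
import tempfile


def remove_cached_features(data_dir):
    """Delete files named cached_* directly under data_dir and return their paths."""
    removed = []
    # Assumption: cache files live in data_dir and start with "cached_".
    for path in sorted(glob.glob(os.path.join(data_dir, "cached_*"))):
        if os.path.isfile(path):
            os.remove(path)
            removed.append(path)
    return removed


# Demo in a throwaway directory so that nothing real is deleted.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("train.txt", "cached_train_BertTokenizer_128"):  # hypothetical names
        with open(os.path.join(demo_dir, name), "w", encoding="utf-8") as f:
            f.write("")
    removed = [os.path.basename(p) for p in remove_cached_features(demo_dir)]
    remaining = sorted(os.listdir(demo_dir))

print(removed, remaining)
```

In the setup above, the directory to pass would be the one given to --data_dir (./ in this post).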

Change is_world_master to is_world_process_zero in run_ner.py

The following error also appeared, but the model was still saved to germeval-model, the directory specified by OUTPUT_DIR.

Traceback (most recent call last):
  File "run_ner.py", line 317, in <module>
    main()
  File "run_ner.py", line 258, in main
    if trainer.is_world_master():
AttributeError: 'Trainer' object has no attribute 'is_world_master'

Changing is_world_master to is_world_process_zero resolved this error.
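
If the same script has to run against installations that expose either method name, one option is to look the method up at runtime instead of hard-coding it. The following is only a sketch: is_main_process is a hypothetical helper, and the two classes are stand-ins for illustration, not the real transformers Trainer.

```python
def is_main_process(trainer):
    """Call is_world_process_zero() if present, else fall back to is_world_master()."""
    for name in ("is_world_process_zero", "is_world_master"):
        method = getattr(trainer, name, None)
        if callable(method):
            return method()
    raise AttributeError("trainer has neither is_world_process_zero nor is_world_master")


# Stand-in classes for illustration only (not the real Trainer).
class NewStyleTrainer:
    def is_world_process_zero(self):
        return True


class OldStyleTrainer:
    def is_world_master(self):
        return True


print(is_main_process(NewStyleTrainer()), is_main_process(OldStyleTrainer()))
```

In run_ner.py, the check `if trainer.is_world_master():` would then become `if is_main_process(trainer):`, although the direct rename described above is the simpler fix.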

tossy diary