A deep learning based approach for early detection of abused domain names

Hu Anlei* ** ****; Tian Yu* ****; Chen Yong**; Li Zhenyu* ****; Xie Gaogang*** ****

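Article abstract

Hu Anlei, Tian Yu, Chen Yong, Li Zhenyu, Xie Gaogang. A deep learning based approach for early detection of abused domain names [J]. Chinese High Technology Letters (高技术通讯, Chinese edition), 2024, 34(2): 151-161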

A deep learning based approach for early detection of abused domain names

DOI: 10.3772/j.issn.1002-0470.2024.02.005

Keywords: domain name system (DNS); domain name classification; deep learning; pre-trained language model

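Author	Affiliation
Hu Anlei* ** ****	(*Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190) (**China Internet Network Information Center, Beijing 100190) (***Computer Network Information Center, Chinese Academy of Sciences, Beijing 100083) (****University of Chinese Academy of Sciences, Beijing 100049)
Tian Yu* ****
Chen Yong**
Li Zhenyu* ****
Xie Gaogang*** ****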

Abstract:

Harmful websites rely on the domain name system (DNS) to spread harmful content, which seriously undermines the healthy development of the Internet. Identifying the domain names behind such websites (i.e., abused domain names) as early as possible, and handling them accordingly, is therefore critical to DNS management and operation. From the perspective of managing the national top-level domain (.CN), this paper focuses on detecting abused domain names at the registration stage. Our analysis finds that abused domain names differ significantly from normal domain names in two dimensions: registration characteristics and text structure. Based on this observation, we propose a deep learning based approach for the early detection of abused domain names. The method first extracts the registration information features of each domain name and uses the pre-trained language model BERT (bidirectional encoder representations from Transformers) to extract the text semantic features of the domain name itself. It then fuses the two types of features with an attention mechanism. Finally, a fully connected neural network serves as the domain name classifier, whose output indicates whether a given domain name is abused. Experiments on real-world network data show that the F1 score of the proposed method can reach 0.99. Ablation experiments also confirm the effectiveness and necessity of the selected features.
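The pipeline described in the abstract (two feature types, attention-based fusion, a fully connected classifier) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify layer sizes, the attention formulation, or the training procedure, so the class name `AttentionFusionClassifier`, the hidden size, the additive-style attention scoring, and the random stand-in inputs are all assumptions made for illustration. The weights are random and untrained, so the output probabilities carry no meaning beyond showing the data flow.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class AttentionFusionClassifier:
    """Hypothetical sketch: fuse two feature vectors with attention, then classify."""

    def __init__(self, reg_dim, text_dim, hidden_dim=32, seed=0):
        rng = np.random.default_rng(seed)
        # Project each feature type into a shared hidden space (assumed design).
        self.W_reg = rng.normal(0, 0.1, (reg_dim, hidden_dim))
        self.W_text = rng.normal(0, 0.1, (text_dim, hidden_dim))
        # Scoring vector that rates how much each feature type should count.
        self.v = rng.normal(0, 0.1, hidden_dim)
        # Fully connected head: one hidden layer, one output unit.
        self.W1 = rng.normal(0, 0.1, (hidden_dim, hidden_dim))
        self.b1 = np.zeros(hidden_dim)
        self.w2 = rng.normal(0, 0.1, hidden_dim)
        self.b2 = 0.0

    def attention_weights(self, reg_feats, text_feats):
        h_reg = np.tanh(reg_feats @ self.W_reg)       # (batch, hidden)
        h_text = np.tanh(text_feats @ self.W_text)    # (batch, hidden)
        stacked = np.stack([h_reg, h_text], axis=1)   # (batch, 2, hidden)
        scores = stacked @ self.v                     # (batch, 2)
        return softmax(scores, axis=1), stacked

    def predict_proba(self, reg_feats, text_feats):
        weights, stacked = self.attention_weights(reg_feats, text_feats)
        # Attention-weighted sum of the two projected feature vectors.
        fused = (weights[:, :, None] * stacked).sum(axis=1)   # (batch, hidden)
        hidden = np.maximum(0.0, fused @ self.W1 + self.b1)   # ReLU layer
        return sigmoid(hidden @ self.w2 + self.b2)            # (batch,)


# Random stand-ins: 6 registration features and a 768-dimensional text
# embedding (the BERT-base hidden size) for a batch of 4 domain names.
rng = np.random.default_rng(1)
reg = rng.normal(size=(4, 6))
text = rng.normal(size=(4, 768))
model = AttentionFusionClassifier(reg_dim=6, text_dim=768)
probs = model.predict_proba(reg, text)  # one probability per domain name
```

In a real system the text embedding would come from a pre-trained BERT encoder applied to the domain name string, and all weights would be learned from labeled registration data; the attention weights returned by `attention_weights` would then show how strongly each feature type contributes to a given prediction.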
