Phishing attacks remain one of the most prevalent cybersecurity threats, with malicious actors constantly adapting their strategies to deceive users. This paper presents a phishing detection system that uses an ensemble approach to combine a character-level Convolutional Neural Network (CNN) with a LightGBM model trained on engineered features. The system extracts 36 lexical, structural, and domain-based features from each URL for LightGBM and uses the character-level CNN to extract sequential features from the URLs. On a test set of 19,873 URLs, the ensemble achieves an accuracy of 99.819%, precision of 100%, recall of 99.635%, and ROC-AUC of 99.947%. The system has been deployed as a FastAPI-based service with an intuitive user interface to provide real-time detection. The results show that the ensemble outperforms the individual models, with LightGBM contributing 40% and the character-level CNN contributing 60% to the final prediction. The method identifies contemporary phishing techniques effectively while maintaining an extremely low false positive rate.

Index Terms - Phishing detection, machine learning, deep learning, CNN, ensemble methods, cybersecurity, URL analysis
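As an illustration of the 40/60 weighting described in the abstract, the following is a minimal sketch that assumes the ensemble is a weighted average of the two models' phishing probabilities. The function names, the 0.5 decision threshold, and the idea that each model outputs a probability in [0, 1] are assumptions made for this example; the abstract does not specify how the contributions are combined.

```python
def ensemble_score(p_lgbm, p_cnn, w_lgbm=0.4, w_cnn=0.6):
    """Weighted average of two phishing probabilities.

    p_lgbm: probability from the LightGBM model (assumed to lie in [0, 1])
    p_cnn:  probability from the character-level CNN (assumed to lie in [0, 1])
    The default weights follow the 40/60 split stated in the abstract.
    """
    if abs(w_lgbm + w_cnn - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    for p in (p_lgbm, p_cnn):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return w_lgbm * p_lgbm + w_cnn * p_cnn


def is_phishing(p_lgbm, p_cnn, threshold=0.5):
    """Label a URL as phishing when the combined score reaches the threshold.

    The threshold of 0.5 is a hypothetical choice, not taken from the paper.
    """
    return ensemble_score(p_lgbm, p_cnn) >= threshold


if __name__ == "__main__":
    # Hypothetical model outputs for a single URL.
    print(is_phishing(p_lgbm=0.9, p_cnn=0.8))
```

Under this assumption the CNN has the larger say when the two models disagree, which matches the stated 60% contribution.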