Converting Chinese to Pinyin and English with R

September 18, 2017 · 数萃大数据 · 周世荣

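Today I'd like to share a fun problem: converting Chinese to pinyin and to English in R. The approach works for single words and for simple sentences. For the Chinese-to-English part I show how to call the 词霸 (iCIBA) dictionary, but it can only translate words. To translate whole sentences I call the Baidu Translate API instead. Unfortunately Baidu only grants a quota of 2 million characters, though that should be enough for a small personal project. For convenience, the Chinese-to-pinyin part returns its result in three forms: lowercase, uppercase, and title case.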

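Loading the R packages

     library(rvest)     # read_html(), html_nodes(), html_text()
     library(stringr)   # string helpers (str_*)
     library(rjson)     # fromJSON()
     library(digest)    # MD5 hashing for the API signature
     library(jiebaR)    # Chinese word segmentation
     library(jiebaRD)   # dictionaries used by jiebaR
     library(rlist)     # list.map()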

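Scraping and saving the character dictionary

Most approaches found online use overly complicated algorithms. If we have a dictionary that maps every Chinese character to its pinyin, we can simply look up the pinyin for each character (there are only a little over 10,000 commonly used characters, so speed is not a concern). While searching I came across such a dictionary and scraped it directly. Taking 李 as an example of how the dictionary represents a character and its pinyin: '李li3' denotes the character 李 with pinyin 'li' and tone 3 (only the pinyin is needed here).

     # GitHub page that displays the character dictionary file
     url <- 'https://github.com/haiwen/seahub/blob/master/seahub/convert-utf-8.txt'
     web <- read_html(url)
     # The selector assumes each line of the file is rendered in the second cell of a table row
     text <- web %>% html_nodes('td:nth-child(2)') %>% html_text() %>% str_trim()
     # Cache the result locally so the page only has to be scraped once
     saveRDS(text, 'D://汉字.rds')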
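To make the "character + pinyin + tone" layout concrete, here is a minimal base-R sketch that splits a few entries of that form into their parts and builds a lookup table. The three sample entries and the names `entries` and `lookup` are made up for illustration, and the real file may lay out its lines differently.

```r
# Hypothetical sample entries in the "character + pinyin + tone" form
# described above; \u escapes keep the script independent of file encoding.
entries <- c("\u674eli3",     # 李li3
             "\u738bwang2",   # 王wang2
             "\u5f20zhang1")  # 张zhang1

chars  <- substr(entries, 1, 1)                      # first character of each entry
pinyin <- gsub("[^a-z]", "", entries)                # keep only the lowercase letters
tones  <- as.integer(gsub("[^0-9]", "", entries))    # keep only the tone digit

# Named vector: character -> pinyin
lookup <- setNames(pinyin, chars)

lookup[["\u674e"]]   # pinyin for 李
tones[1]             # tone for 李
```

A named vector is enough here because each sample character has a single reading. A character with several readings would need a list or a rule for picking one.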

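Extracting the characters

     text <- readRDS('D://汉字.rds')
     # Split every entry into single characters and keep only the Chinese ones
     onlytext <- str_split(text, '') %>% unlist %>% str_match_all('[\u4e00-\u9fa5]') %>% unlist()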
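The character class `[\u4e00-\u9fa5]` used above keeps code points from U+4E00 to U+9FA5. As a rough illustration of what that filter does, the base-R sketch below applies the same range check to code points directly. The helper name `keep_han` and the sample string are hypothetical.

```r
# Keep only characters whose code point lies in U+4E00..U+9FA5,
# the same range as the regex class used in the article.
keep_han <- function(s) {
  cp <- utf8ToInt(enc2utf8(s))              # code points of every character
  in_range <- cp >= 0x4E00 & cp <= 0x9FA5   # TRUE for characters in the range
  intToUtf8(cp[in_range], multiple = TRUE)  # back to one string per character
}

# "\u674e" is 李 and "\u738b" is 王; letters, digits and spaces are dropped
han <- keep_han("\u674eli3 \u738bwang2")
length(han)   # 2
```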

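Building the Chinese-to-pinyin and Chinese-to-English functions

The code block below defines four wrapper functions. fun converts Chinese to pinyin. funx calls the Baidu API for Chinese-to-English translation (you need to apply for a Baidu Translate API account). ciba does Chinese-to-English lookup through an online dictionary (named after 词霸, although the code as written queries dict.youdao.com). funa is the final whole-sentence function and takes two arguments: x is the sentence and y is the separator used to join the per-character results (for example "/").

     # Set up the segmentation engine with jiebaR
     engine1 <- worker()

     # Chinese -> pinyin
     fun <- function(x){
       if(x %in% onlytext){
         # Dictionary entry containing the character; keep the first one if several match
         re <- text[which(str_detect(text, fixed(x)))][1]
         # Keep only the letters, dropping the tone digit
         lower <- str_extract(re, '[a-z]+')
         title <- str_to_title(lower)
         upper <- str_to_upper(lower)
       }else{
         # Not in the dictionary (letter, digit, ...): only change the case
         lower <- str_to_lower(x)
         title <- str_to_title(x)
         upper <- str_to_upper(x)
       }
       return(list(lower=lower, title=title, upper=upper))
     }

     # Chinese -> English (Baidu API version)
     funx <- function(x, fromLang='zh', toLang='en'){
       appid <- 'your APP ID'
       secretKey <- 'your API secret key'
       salt <- sample(32768:65536, 1)
       myurl <- 'http://api.fanyi.baidu.com/api/trans/vip/translate'
       # Segment with jiebaR and join the pieces again without a separator
       q <- segment(x, engine1) %>% str_c(collapse = '')
       # Signature: MD5 of appid + q + salt + secretKey, using the raw (not URL-encoded) text
       sign <- enc2utf8(paste0(appid, q, salt, secretKey))
       hash <- digest(sign, algo='md5', serialize=FALSE)
       # URL-encode q only when building the request URL
       url <- paste0(myurl, '?appid=', appid, '&q=', URLencode(enc2utf8(q), reserved=TRUE),
                     '&from=', fromLang, '&to=', toLang, '&salt=', salt, '&sign=', hash)
       re <- read_html(url) %>% html_text() %>% fromJSON()
       lang_from <- re$trans_result[[1]]$src
       lang_to <- re$trans_result[[1]]$dst
       return(list(lang_from=lang_from, lang_to=lang_to))
     }

     # Chinese -> English (dictionary version)
     # Note: the URL below points at the Youdao mobile dictionary
     ciba <- function(x){
       link <- url(paste0('http://dict.youdao.com/m/search?keyfrom=dict.mindex&vendor=&q=',
                          iconv(x, to='UTF-8')), encoding='UTF-8')
       on.exit(close(link))
       a <- readLines(link)
       # Strip HTML tags, leading spaces and tabs, then runs of spaces
       a <- gsub('(<[^<>]*>)|(^ )|(\t)', '', a)
       a <- gsub(' {2,}', '', a)
       # Drop the last 11 and the first 35 lines; these offsets depend on the page layout
       a <- tail(head(a, -11), -35)
       a <- a[a != '']
       # Collapse into one string and remove empty lines
       a <- paste(a, collapse='\n')
       a <- gsub('(\n *){2,}', '\n', a)
       a <- gsub(' *\n *', '\n', a)
       print(str_split(a, pattern='\n'))
     }

     # Whole-sentence conversion
     funa <- function(x, y){
       # Convert each character (or letter/digit) separately
       xt <- lapply(x %>% str_extract_all('[\u4e00-\u9fa5]|\\w') %>% unlist, fun)
       lower <- list.map(xt, lower) %>% unlist %>% str_c(collapse = y)
       upper <- list.map(xt, upper) %>% unlist %>% str_c(collapse = y)
       title <- list.map(xt, title) %>% unlist %>% str_c(collapse = y)
       re <- funx(x)
       return(list(lang_from=x, Eng=re$lang_to, lower=lower, title=title, upper=upper))
     }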
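The request signing inside funx follows a simple recipe: concatenate the app ID, the query text, the salt and the secret key, then take the MD5 hash of that string. The sketch below reproduces just that step with base R and makes no network call. It swaps `digest()` for `tools::md5sum()` on a temporary file so that it runs without extra packages. The helper names `md5_string` and `build_url` and the credentials are placeholders, not real values, and URL-encoding the query is an assumption about what the endpoint expects for non-ASCII text.

```r
# MD5 of a string using base R only: write the UTF-8 bytes to a
# temporary file and hash the file (stand-in for digest(..., algo = "md5")).
md5_string <- function(s) {
  tmp <- tempfile()
  on.exit(unlink(tmp))
  writeBin(charToRaw(enc2utf8(s)), tmp)
  unname(tools::md5sum(tmp))
}

# Build the signed request URL; nothing is sent over the network.
build_url <- function(q, appid, secret_key, salt, from = "zh", to = "en") {
  # The signature is computed on the raw query text, before URL encoding
  sign <- md5_string(paste0(appid, q, salt, secret_key))
  paste0("http://api.fanyi.baidu.com/api/trans/vip/translate",
         "?appid=", appid,
         "&q=", utils::URLencode(enc2utf8(q), reserved = TRUE),
         "&from=", from, "&to=", to,
         "&salt=", salt, "&sign=", sign)
}

# Placeholder credentials and a fixed salt, for illustration only
request_url <- build_url("\u4e2d\u56fd", appid = "my_app_id",
                         secret_key = "my_secret", salt = 12345)
request_url
```

Keeping the signing step in its own function makes it easy to check the hash against a known value before spending any of the API quota.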

Testing

     test1 <- '我正在学习R语言'          # "I am learning the R language"
     test2 <- '南京市长江大桥欢迎你'      # "The Nanjing Yangtze River Bridge welcomes you"

     funa(test1, ' ')
     funa(test2, ' + ')

     ciba('中国')
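To show the shape of the pinyin part of funa's output without the dictionary file or any network access, here is a small base-R sketch. The two-entry lookup table and the function name `to_pinyin` are hypothetical stand-ins for the scraped dictionary and for fun/funa. Characters missing from the table are passed through with only their case changed, mirroring the else branch of fun.

```r
# Tiny stand-in dictionary: character -> pinyin (hypothetical sample data)
lookup <- setNames(c("li", "wang"), c("\u674e", "\u738b"))  # 李, 王

to_pinyin <- function(sentence, sep) {
  # One element per character
  chars <- intToUtf8(utf8ToInt(enc2utf8(sentence)), multiple = TRUE)
  known <- chars %in% names(lookup)
  py <- chars
  py[known] <- unname(lookup[chars[known]])   # replace known characters by pinyin
  lower <- tolower(py)
  upper <- toupper(py)
  # Title case: first letter upper, the rest lower
  title <- paste0(toupper(substring(lower, 1, 1)), substring(lower, 2))
  list(lower = paste(lower, collapse = sep),
       title = paste(title, collapse = sep),
       upper = paste(upper, collapse = sep))
}

res <- to_pinyin("\u674e\u738bR", " + ")   # 李王R
res$lower   # "li + wang + r"
res$title   # "Li + Wang + R"
res$upper   # "LI + WANG + R"
```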

