Journal of Rural Studies: A Translation Database of the Journal's Entire Literature

(3,007 articles in total)

The layout will be tidied up later to make it easier to read.

① Preview titles, keywords, etc. → ② Find the paper you need → ③ Click the link → ④ Read the article closely

Inspiration

Picking literature quickly: Chinese vs. English

To find literature relevant to your own research, you should first screen broadly for articles that fit your topic, then read them closely to study their arguments, methods, and so on. Compared with English, we read Chinese faster and can use key information more effectively to filter out the articles we need, avoiding the situation where you spend ages finishing a paper only to find it is barely relevant to your research.

Hence the idea: if English literature could be automatically translated in bulk into a database, it would not only be easier to read, but you could also take in more information in the same amount of time, broadening the scope of literature screening, and with no network latency your train of thought stays fluent. I tried it out at home over the winter break and the results looked good; the implementation is described below.

Logical design

Approach & framework

To make this work, the overall workflow has two main parts: one is scraping the relevant data with Python, and the other is calling the Baidu Translate API for automatic translation. The detailed flow is organized in the figure below:

Physical design

Source code & implementation

1. Scraping the literature data

This test uses the Journal of Rural Studies. The URL is below, and the task is to scrape the information for every article the journal has published since its founding.

https://www.journals.elsevier.com/journal-of-rural-studies/

# Import libraries (requests is aliased to re and used as re.get throughout)
import requests as re
from lxml import etree
import pandas as pd
import time

# Build the request headers
headers = {'user-agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'}

First, test on a single page to see what the XPath query returns.

url = 'https://www.sciencedirect.com/journal/journal-of-rural-studies/issues'
res = re.get(url,headers = headers).text
res = etree.HTML(res)
testdata = res.xpath("//a[@class='anchor js-issue-item-link text-m']/@href")
testdata

It turns out that parsing the first-level page yields only a single second-level link, when in principle it should yield all of them. After some trial and error I found that, in the page's design, the first second-level link is loaded with a GET request while the rest are loaded with POST requests, and the listing spans two pages. The simplest workaround is to expand all the links in the browser, save the pages as HTML files, and then load those files instead.

html1 = etree.parse('G:\\Pythontest\\practice\\test1.html1', etree.HTMLParser())
html2 = etree.parse('G:\\Pythontest\\practice\\test2.html1', etree.HTMLParser())
data1 = html1.xpath("//a[@class='anchor js-issue-item-link text-m']/@href")
data2 = html2.xpath("//a[@class='anchor js-issue-item-link text-m']/@href")
LINKS = []
LINKS.extend(data1)   # extend (not append) so LINKS holds individual hrefs rather than two nested lists
LINKS.extend(data2)
TLINKS = []
for i in LINKS:
    link = 'https://www.sciencedirect.com' + i
    TLINKS.append(link)

TLINKS now holds the links to all first-level pages; its length is 158, so the data were retrieved correctly. Next, fetch all the second-level links. This is a good time to watch a livestream or something, since access to the overseas site is a bit slow. When it finishes there are 3,007 second-level links, i.e. 3,007 articles.

SUBLINKS = []
for link in TLINKS:
    subres = re.get(link, headers=headers).text
    subres = etree.HTML(subres)
    sublinks = subres.xpath("//a[@class='anchor article-content-title u-margin-xs-top u-margin-s-bottom']/@href")
    SUBLINKS.extend(sublinks)   # extend keeps the list flat, one entry per article
    print("Issue", TLINKS.index(link), "OK")
    time.sleep(0.2)
print('ALL IS OK')

LINKS = []
for i in SUBLINKS:
    link = 'https://www.sciencedirect.com' + i
    LINKS.append(link)
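
Since fetching several thousand article pages takes a while, it can be worth dumping the collected links to disk first so an interrupted run can pick up where it left off. A small sketch (the file name is arbitrary):

# Persist the collected article links so the long scraping run does not have to rebuild them
with open('article_links.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(LINKS))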

With the second-level links in hand, the next step is to analyze the structure of the third-level (article) pages, pick out the information we need, and organize it into dictionaries for storage.

allinfo = []
for LINK in LINKS:
    info = {}
    res = re.get(LINK, headers=headers).text
    res = etree.HTML(res)
    vol = res.xpath("//a[@title='Go to table of contents for this volume/issue']/text()")
    datainfo = res.xpath("//div[@class='text-xs']/text()")
    timu = res.xpath("//span[@class='title-text']/text()")
    givenname = res.xpath("//span[@class='text given-name']/text()")
    surname = res.xpath("//span[@class='text surname']/text()")
    web = res.xpath("//a[@class='doi']/@href")
    abstract = res.xpath("//p[@id='abspara0010']/text()")
    keywords = res.xpath("//div[@class='keyword']/span/text()")
    highlights = res.xpath("//dd[@class='list-description']/p/text()")

    info['vol'] = vol
    info['datainfo'] = datainfo
    info['timu'] = timu
    info['givenname'] = givenname
    info['surname'] = surname
    info['web'] = web
    info['abstract'] = abstract
    info['keywords'] = keywords
    info['highlights'] = highlights
    allinfo.append(info)
    print("Article", LINKS.index(LINK), "IS FINISHED, overall progress:", (LINKS.index(LINK) + 1) / len(LINKS))

df = pd.DataFrame(allinfo)
df
df.to_excel(r'G:\PythonStudy\practice1\test.xls',sheet_name='sheet1')

With that, the scraping is done and we have a DataFrame containing the information for every article.
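
Before cleaning, a quick sanity check of the scraped table can catch obvious problems; a sketch using the column names defined above:

# Quick sanity checks on the scraped DataFrame
print(df.shape)                                   # expect roughly (3007, 9)
print(df['timu'].head())                          # the first few titles
print((df['abstract'].str.len() == 0).sum())      # articles that came back without an abstract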

2. Data cleaning

Strip the extraneous characters from the data and split the fields that were merged during scraping, producing a DataFrame ready for translation.

# Remove extraneous characters (regex=False treats the brackets as literal characters;
# the scraped cells are lists, so cast to str first)
data = df.copy()
text_cols = ['abstract', 'datainfo', 'givenname', 'highlights', 'keywords', 'surname', 'timu', 'vol', 'web']
for col in text_cols:
    data[col] = (data[col].astype(str)
                          .str.replace('[', '', regex=False)
                          .str.replace(']', '', regex=False)
                          .str.replace('\'', '', regex=False))

# Split the merged field into its components
data['date'] = data['datainfo'].str.split(',').str.get(1)
data['page'] = data['datainfo'].str.split(',').str.get(2)
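
Since the scraped cells are actually Python lists, an arguably cleaner alternative (just a sketch, not what was done above) is to join the list elements directly instead of stripping bracket characters from their string form:

# Alternative sketch: join the list cells directly rather than cleaning up their string form
for col in text_cols:
    data[col] = df[col].apply(lambda cell: '; '.join(cell) if isinstance(cell, list) else str(cell))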

3. Batch-translating the key fields

Once we have the DataFrame with the full literature information, we call the Baidu Translate API for batch translation. It is worth reading the official technical documentation; the required request parameters are described there in detail.

https://api.fanyi.baidu.com/doc/21

Field   Type   Required   Description                  Notes
q       TEXT   Y          Text to translate (query)    UTF-8 encoded
from    TEXT   Y          Source language              zh (Chinese), en (English)
to      TEXT   Y          Target language              zh (Chinese), en (English)
salt    TEXT   Y          Random number
appid   TEXT   Y          APP ID                       Apply for your own
sign    TEXT   Y          Signature                    MD5 of appid + q + salt + secret key
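
As a quick illustration of the sign rule in the last row, a minimal sketch with placeholder credentials (not real ones):

import hashlib
import random

appid = '1234567890123456'        # placeholder APP ID
secret_key = 'your_secret_key'    # placeholder secret key
q = 'rural development'
salt = str(random.randint(32768, 65536))
# sign = MD5 of appid + q + salt + secret key, computed over the raw UTF-8 text of q
sign = hashlib.md5((appid + q + salt + secret_key).encode('utf-8')).hexdigest()
print(salt, sign)
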
# Import the relevant libraries
import http.client
import hashlib
import urllib
import random
import json
import requests as re

# Build the automatic translation function translateBaidu
def translateBaidu(content):
    appid = '20200119000376***'
    secretKey = 'd7SAX0xhIHEEYQ7qp***'
    url = 'http://api.fanyi.baidu.com/api/trans/vip/translate'
    fromLang = 'en'
    toLang = 'zh'
    salt = str(random.randint(32555, 65333))
    sign = appid + content + salt + secretKey
    sign = hashlib.md5(sign.encode('utf-8')).hexdigest()
    try:
        params = {
            'appid': appid,
            'q': content,
            'from': fromLang,
            'to': toLang,
            'salt': salt,
            'sign': sign
        }
        res = re.get(url, params)
        jres = res.json()
        # After converting to JSON, inspect its structure and pull out the returned translation
        dst = str(jres['trans_result'][0]['dst'])
        return dst

    except Exception as e:
        print(e)

After building the function I ran a quick test and it returned correct results; when the input parameter is empty it just prints 'trans_result', the KeyError raised because no translation comes back.
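
For reference, a quick test call might look like the sketch below; the response shape in the comment follows the general format of the Baidu Translate API:

# Quick test of the helper (assumes the appid / secretKey above are valid credentials)
print(translateBaidu('Rural revitalization and land use change'))

# A successful response is JSON roughly of the form
#   {"from": "en", "to": "zh", "trans_result": [{"src": "...", "dst": "..."}]}
# which is why the function reads jres['trans_result'][0]['dst']; an error response
# carries error_code / error_msg instead, so that lookup raises the KeyError mentioned above.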

Everything is now in place; all that remains is to run the scraped literature data through translateBaidu and build a new DataFrame.

# Add the corresponding new columns to the DataFrame
data['trans-timu'] = 'NULL'
data['trans-keywords'] = 'NULL'
data['trans-abstract'] = 'NULL'
data['trans-highlights'] = 'NULL'

# Translate and fill in the new columns
for i in range(len(data)):
    data.loc[i, 'trans-timu'] = translateBaidu(data['timu'][i])
    data.loc[i, 'trans-keywords'] = translateBaidu(data['keywords'][i])
    data.loc[i, 'trans-abstract'] = translateBaidu(data['abstract'][i])
    data.loc[i, 'trans-highlights'] = translateBaidu(data['highlights'][i])
    # Per the API docs, no more than 10 requests per second
    time.sleep(0.5)
print('ALL FINISHED')

Let's take a look at the translation results.

Finally, the data are written to a database through an ODBC connection. Once that is saved, re-running the script every so often before bed keeps the literature database up to date. Write a similar scraper for each journal you read regularly, and keeping an eye on new literature becomes easy...
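
The database step is not shown here, so below is a minimal sketch of one way to do it with SQLAlchemy over an ODBC driver; the connection string, driver, and table name are assumptions to adapt to your own environment:

from sqlalchemy import create_engine

# Hypothetical SQL Server connection over ODBC; replace user, password, host, database and driver with your own
engine = create_engine(
    'mssql+pyodbc://user:password@localhost/literature?driver=ODBC+Driver+17+for+SQL+Server'
)

# Append the translated records to a (hypothetical) journal_articles table
data.to_sql('journal_articles', engine, if_exists='append', index=False)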

Quality check

Machine translation vs. human translation

After the translation finished I was still a bit worried about the quality of Baidu's machine translation (the Google API is a bit of a hassle to set up), so I randomly sampled a few records to check. Hmm, at a glance it honestly translates better than I do...

[Keyword translation accuracy > title translation accuracy > abstract > highlights]

A rough read-through shows no real problems: the gist comes across and comprehension is not affected.

Consolidated code

# Import the required libraries (requests is aliased to re and used as re.get below)
import requests as re
from lxml import etree
import pandas as pd
import time
import http.client
import hashlib
import urllib
import random
import json

# Build the request headers
headers = {'user-agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'}

# Get the first-level page links (the saved issue-listing pages are parsed locally)
html1 = etree.parse('G:\\Pythontest\\practice\\test1.html1', etree.HTMLParser())
html2 = etree.parse('G:\\Pythontest\\practice\\test2.html1', etree.HTMLParser())
data1 = html1.xpath("//a[@class='anchor js-issue-item-link text-m']/@href")
data2 = html2.xpath("//a[@class='anchor js-issue-item-link text-m']/@href")
LINKS = []
LINKS.extend(data1)   # extend (not append) so LINKS holds individual hrefs rather than two nested lists
LINKS.extend(data2)
TLINKS = []
for i in LINKS:
    link = 'https://www.sciencedirect.com' + i
    TLINKS.append(link)

# Get the second-level page links (one per article)
SUBLINKS = []
for link in TLINKS:
    subres = re.get(link, headers=headers).text
    subres = etree.HTML(subres)
    sublinks = subres.xpath("//a[@class='anchor article-content-title u-margin-xs-top u-margin-s-bottom']/@href")
    SUBLINKS.extend(sublinks)   # extend keeps the list flat, one entry per article
    print("Issue", TLINKS.index(link), "OK")
    time.sleep(0.2)
print('ALL IS OK')

LINKS = []
for i in SUBLINKS:
    link = 'https://www.sciencedirect.com' + i
    LINKS.append(link)

# Scrape the data from the third-level (article) pages
allinfo = []
for LINK in LINKS:
    info = {}
    res = re.get(LINK, headers=headers).text
    res = etree.HTML(res)
    vol = res.xpath("//a[@title='Go to table of contents for this volume/issue']/text()")
    datainfo = res.xpath("//div[@class='text-xs']/text()")
    timu = res.xpath("//span[@class='title-text']/text()")
    givenname = res.xpath("//span[@class='text given-name']/text()")
    surname = res.xpath("//span[@class='text surname']/text()")
    web = res.xpath("//a[@class='doi']/@href")
    abstract = res.xpath("//p[@id='abspara0010']/text()")
    keywords = res.xpath("//div[@class='keyword']/span/text()")
    highlights = res.xpath("//dd[@class='list-description']/p/text()")

    # Organize the fields inside the dictionary
    info['vol'] = vol
    info['datainfo'] = datainfo
    info['timu'] = timu
    info['givenname'] = givenname
    info['surname'] = surname
    info['web'] = web
    info['abstract'] = abstract
    info['keywords'] = keywords
    info['highlights'] = highlights
    allinfo.append(info)
    print("Article", LINKS.index(LINK), "IS FINISHED, overall progress:", (LINKS.index(LINK) + 1) / len(LINKS))

# Save the data to an Excel file
df = pd.DataFrame(allinfo)
df
df.to_excel(r'G:\PythonStudy\practice1\test.xls',sheet_name='sheet1')

# Initial data cleaning (regex=False treats the brackets as literal characters; the scraped cells are lists, so cast to str first)
data = df.copy()
text_cols = ['abstract', 'datainfo', 'givenname', 'highlights', 'keywords', 'surname', 'timu', 'vol', 'web']
for col in text_cols:
    data[col] = (data[col].astype(str)
                          .str.replace('[', '', regex=False)
                          .str.replace(']', '', regex=False)
                          .str.replace('\'', '', regex=False))

data['date'] = data['datainfo'].str.split(',').str.get(1)
data['page'] = data['datainfo'].str.split(',').str.get(2)

# Build the automatic translation function translateBaidu
def translateBaidu(content):
    appid = '20200119000376***'
    secretKey = 'd7SAX0xhIHEEYQ7qp***'
    url = 'http://api.fanyi.baidu.com/api/trans/vip/translate'
    fromLang = 'en'
    toLang = 'zh'
    salt = str(random.randint(32555, 65333))
    sign = appid + content + salt + secretKey
    sign = hashlib.md5(sign.encode('utf-8')).hexdigest()

    try:
        params = {
            'appid': appid,
            'q': content,
            'from': fromLang,
            'to': toLang,
            'salt': salt,
            'sign': sign
        }
        res = re.get(url, params)
        jres = res.json()
        # After converting to JSON, inspect its structure and pull out the returned translation
        dst = str(jres['trans_result'][0]['dst'])
        return dst

    except Exception as e:
        print(e)

# Add the corresponding new columns to the DataFrame
data['trans-timu'] = 'NULL'
data['trans-keywords'] = 'NULL'
data['trans-abstract'] = 'NULL'
data['trans-highlights'] = 'NULL'

# Translate and fill in the new columns
for i in range(len(data)):
    data.loc[i, 'trans-timu'] = translateBaidu(data['timu'][i])
    data.loc[i, 'trans-keywords'] = translateBaidu(data['keywords'][i])
    data.loc[i, 'trans-abstract'] = translateBaidu(data['abstract'][i])
    data.loc[i, 'trans-highlights'] = translateBaidu(data['highlights'][i])
    # Per the API docs, no more than 10 requests per second
    time.sleep(0.5)
print('ALL FINISHED')

# Save the file
data.to_excel(r'G:\PythonStudy\practice1\test.xls',sheet_name='sheet1')

This article is reposted from the WeChat public account @OCD Planners.
