Journal of Rural Studies: A Translation Database of the Journal's Entire Literature

(3,007 articles in total)

The layout will be tidied up later to make it easier to read.

① Preview titles, keywords, etc. → ② Find the paper you need → ③ Click the link → ④ Read the article closely

Inspiration

Picking literature quickly: Chinese vs. English

To find literature relevant to your own research, you should first screen broadly for articles that fit your topic, then read them closely to study their arguments, methods, and so on. Compared with English, we read Chinese faster and can use key information more effectively to filter out the articles we need, avoiding the situation where you spend ages finishing a paper only to find it is barely relevant to your research.

Hence the idea: if English literature could be automatically translated in bulk into a database, it would not only be easier to read, but you could also take in more information in the same amount of time, broadening the scope of literature screening, and with no network latency your train of thought stays fluent. I tried it out at home over the winter break and the results looked good; the implementation is described below.

Logical design

Approach & framework

To make this work, the overall workflow has two main parts: one is scraping the relevant data with Python, and the other is calling the Baidu Translate API for automatic translation. The detailed flow is organized in the figure below:

Physical design

Source code & implementation

1. Scraping the literature data

This test uses the Journal of Rural Studies. The URL is below, and the task is to scrape the information for every article the journal has published since its founding.

https://www.journals.elsevier.com/journal-of-rural-studies/

# Import libraries (requests is aliased to re and used as re.get throughout)
import requests as re
from lxml import etree
import pandas as pd
import time

# Build the request headers
headers = {'user-agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'}

First, test on a single page to see what the XPath query returns.

url = 'https://www.sciencedirect.com/journal/journal-of-rural-studies/issues'
res = re.get(url,headers = headers).text
res = etree.HTML(res)
testdata = res.xpath("//a[@class='anchor js-issue-item-link text-m']/@href")
testdata

It turns out that parsing the first-level page yields only a single second-level link, when in principle it should yield all of them. After some trial and error I found that, in the page's design, the first second-level link is loaded with a GET request while the rest are loaded with POST requests, and the listing spans two pages. The simplest workaround is to expand all the links in the browser, save the pages as HTML files, and then load those files instead.

html1 = etree.parse('G:\\Pythontest\\practice\\test1.html1', etree.HTMLParser())
html2 = etree.parse('G:\\Pythontest\\practice\\test2.html1', etree.HTMLParser())
data1 = html1.xpath("//a[@class='anchor js-issue-item-link text-m']/@href")
data2 = html2.xpath("//a[@class='anchor js-issue-item-link text-m']/@href")
LINKS = []
LINKS.extend(data1)   # extend (not append) so LINKS holds individual hrefs rather than two nested lists
LINKS.extend(data2)
TLINKS = []
for i in LINKS:
    link = 'https://www.sciencedirect.com' + i
    TLINKS.append(link)

TLINKS now holds the links to all first-level pages; its length is 158, so the data were retrieved correctly. Next, fetch all the second-level links. This is a good time to watch a livestream or something, since access to the overseas site is a bit slow. When it finishes there are 3,007 second-level links, i.e. 3,007 articles.

SUBLINKS = []
for link in TLINKS:
    subres = re.get(link, headers=headers).text
    subres = etree.HTML(subres)
    sublinks = subres.xpath("//a[@class='anchor article-content-title u-margin-xs-top u-margin-s-bottom']/@href")
    SUBLINKS.extend(sublinks)   # extend keeps the list flat, one entry per article
    print("Issue", TLINKS.index(link), "OK")
    time.sleep(0.2)
print('ALL IS OK')

LINKS = []
for i in SUBLINKS:
    link = 'https://www.sciencedirect.com' + i
    LINKS.append(link)
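
Since fetching several thousand article pages takes a while, it can be worth dumping the collected links to disk first so an interrupted run can pick up where it left off. A small sketch (the file name is arbitrary):

# Persist the collected article links so the long scraping run does not have to rebuild them
with open('article_links.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(LINKS))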

With the second-level links in hand, the next step is to analyze the structure of the third-level (article) pages, pick out the information we need, and organize it into dictionaries for storage.

allinfo = []
for LINK in LINKS:
    info = {}
    res = re.get(LINK, headers=headers).text
    res = etree.HTML(res)
    vol = res.xpath("//a[@title='Go to table of contents for this volume/issue']/text()")
    datainfo = res.xpath("//div[@class='text-xs']/text()")
    timu = res.xpath("//span[@class='title-text']/text()")
    givenname = res.xpath("//span[@class='text given-name']/text()")
    surname = res.xpath("//span[@class='text surname']/text()")
    web = res.xpath("//a[@class='doi']/@href")
    abstract = res.xpath("//p[@id='abspara0010']/text()")
    keywords = res.xpath("//div[@class='keyword']/span/text()")
    highlights = res.xpath("//dd[@class='list-description']/p/text()")

    info['vol'] = vol
    info['datainfo'] = datainfo
    info['timu'] = timu
    info['givenname'] = givenname
    info['surname'] = surname
    info['web'] = web
    info['abstract'] = abstract
    info['keywords'] = keywords
    info['highlights'] = highlights
    allinfo.append(info)
    print("Article", LINKS.index(LINK), "IS FINISHED, overall progress:", (LINKS.index(LINK) + 1) / len(LINKS))

df = pd.DataFrame(allinfo)
df
df.to_excel(r'G:\PythonStudy\practice1\test.xls',sheet_name='sheet1')

With that, the scraping is done and we have a DataFrame containing the information for every article.
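
Before cleaning, a quick sanity check of the scraped table can catch obvious problems; a sketch using the column names defined above:

# Quick sanity checks on the scraped DataFrame
print(df.shape)                                   # expect roughly (3007, 9)
print(df['timu'].head())                          # the first few titles
print((df['abstract'].str.len() == 0).sum())      # articles that came back without an abstract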

2. Data cleaning

Strip the extraneous characters from the data and split the fields that were merged during scraping, producing a DataFrame ready for translation.

# Remove extraneous characters (regex=False treats the brackets as literal characters;
# the scraped cells are lists, so cast to str first)
data = df.copy()
text_cols = ['abstract', 'datainfo', 'givenname', 'highlights', 'keywords', 'surname', 'timu', 'vol', 'web']
for col in text_cols:
    data[col] = (data[col].astype(str)
                          .str.replace('[', '', regex=False)
                          .str.replace(']', '', regex=False)
                          .str.replace('\'', '', regex=False))

# Split the merged field into its components
data['date'] = data['datainfo'].str.split(',').str.get(1)
data['page'] = data['datainfo'].str.split(',').str.get(2)
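
Since the scraped cells are actually Python lists, an arguably cleaner alternative (just a sketch, not what was done above) is to join the list elements directly instead of stripping bracket characters from their string form:

# Alternative sketch: join the list cells directly rather than cleaning up their string form
for col in text_cols:
    data[col] = df[col].apply(lambda cell: '; '.join(cell) if isinstance(cell, list) else str(cell))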

3. Batch-translating the key fields

Once we have the DataFrame with the full literature information, we call the Baidu Translate API for batch translation. It is worth reading the official technical documentation; the required request parameters are described there in detail.

https://api.fanyi.baidu.com/doc/21

Field   Type   Required   Description                  Notes
q       TEXT   Y          Text to translate (query)    UTF-8 encoded
from    TEXT   Y          Source language              zh (Chinese), en (English)
to      TEXT   Y          Target language              zh (Chinese), en (English)
salt    TEXT   Y          Random number
appid   TEXT   Y          APP ID                       Apply for your own
sign    TEXT   Y          Signature                    MD5 of appid + q + salt + secret key
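
As a quick illustration of the sign rule in the last row, a minimal sketch with placeholder credentials (not real ones):

import hashlib
import random

appid = '1234567890123456'        # placeholder APP ID
secret_key = 'your_secret_key'    # placeholder secret key
q = 'rural development'
salt = str(random.randint(32768, 65536))
# sign = MD5 of appid + q + salt + secret key, computed over the raw UTF-8 text of q
sign = hashlib.md5((appid + q + salt + secret_key).encode('utf-8')).hexdigest()
print(salt, sign)
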
# Import the relevant libraries
import http.client
import hashlib
import urllib
import random
import json
import requests as re

# Build the automatic translation function translateBaidu
def translateBaidu(content):
    appid = '20200119000376***'
    secretKey = 'd7SAX0xhIHEEYQ7qp***'
    url = 'http://api.fanyi.baidu.com/api/trans/vip/translate'
    fromLang = 'en'
    toLang = 'zh'
    salt = str(random.randint(32555, 65333))
    sign = appid + content + salt + secretKey
    sign = hashlib.md5(sign.encode('utf-8')).hexdigest()
    try:
        params = {
            'appid': appid,
            'q': content,
            'from': fromLang,
            'to': toLang,
            'salt': salt,
            'sign': sign
        }
        res = re.get(url, params)
        jres = res.json()
        # After converting to JSON, inspect its structure and pull out the returned translation
        dst = str(jres['trans_result'][0]['dst'])
        return dst

    except Exception as e:
        print(e)

After building the function I ran a quick test and it returned correct results; when the input parameter is empty it just prints 'trans_result', the KeyError raised because no translation comes back.
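
For reference, a quick test call might look like the sketch below; the response shape in the comment follows the general format of the Baidu Translate API:

# Quick test of the helper (assumes the appid / secretKey above are valid credentials)
print(translateBaidu('Rural revitalization and land use change'))

# A successful response is JSON roughly of the form
#   {"from": "en", "to": "zh", "trans_result": [{"src": "...", "dst": "..."}]}
# which is why the function reads jres['trans_result'][0]['dst']; an error response
# carries error_code / error_msg instead, so that lookup raises the KeyError mentioned above.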

Everything is now in place; all that remains is to run the scraped literature data through translateBaidu and build a new DataFrame.

# Add the corresponding new columns to the DataFrame
data['trans-timu'] = 'NULL'
data['trans-keywords'] = 'NULL'
data['trans-abstract'] = 'NULL'
data['trans-highlights'] = 'NULL'

# Translate and fill in the new columns
for i in range(len(data)):
    data.loc[i, 'trans-timu'] = translateBaidu(data['timu'][i])
    data.loc[i, 'trans-keywords'] = translateBaidu(data['keywords'][i])
    data.loc[i, 'trans-abstract'] = translateBaidu(data['abstract'][i])
    data.loc[i, 'trans-highlights'] = translateBaidu(data['highlights'][i])
    # Per the API docs, no more than 10 requests per second
    time.sleep(0.5)
print('ALL FINISHED')

Let's take a look at the translation results.

Finally, the data are written to a database through an ODBC connection. Once that is saved, re-running the script every so often before bed keeps the literature database up to date. Write a similar scraper for each journal you read regularly, and keeping an eye on new literature becomes easy...
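
The database step is not shown here, so below is a minimal sketch of one way to do it with SQLAlchemy over an ODBC driver; the connection string, driver, and table name are assumptions to adapt to your own environment:

from sqlalchemy import create_engine

# Hypothetical SQL Server connection over ODBC; replace user, password, host, database and driver with your own
engine = create_engine(
    'mssql+pyodbc://user:password@localhost/literature?driver=ODBC+Driver+17+for+SQL+Server'
)

# Append the translated records to a (hypothetical) journal_articles table
data.to_sql('journal_articles', engine, if_exists='append', index=False)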

Quality check

Machine translation vs. human translation

After the translation finished I was still a bit worried about the quality of Baidu's machine translation (the Google API is a bit of a hassle to set up), so I randomly sampled a few records to check. Hmm, at a glance it honestly translates better than I do...

[Keyword translation accuracy > title translation accuracy > abstract > highlights]

A rough read-through shows no real problems: the gist comes across and comprehension is not affected.

Consolidated code

# Import the required libraries (requests is aliased to re and used as re.get below)
import requests as re
from lxml import etree
import pandas as pd
import time
import http.client
import hashlib
import urllib
import random
import json

# Build the request headers
headers = {'user-agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'}

# Get the first-level page links (the saved issue-listing pages are parsed locally)
html1 = etree.parse('G:\\Pythontest\\practice\\test1.html1', etree.HTMLParser())
html2 = etree.parse('G:\\Pythontest\\practice\\test2.html1', etree.HTMLParser())
data1 = html1.xpath("//a[@class='anchor js-issue-item-link text-m']/@href")
data2 = html2.xpath("//a[@class='anchor js-issue-item-link text-m']/@href")
LINKS = []
LINKS.extend(data1)   # extend (not append) so LINKS holds individual hrefs rather than two nested lists
LINKS.extend(data2)
TLINKS = []
for i in LINKS:
    link = 'https://www.sciencedirect.com' + i
    TLINKS.append(link)

# Get the second-level page links (one per article)
SUBLINKS = []
for link in TLINKS:
    subres = re.get(link, headers=headers).text
    subres = etree.HTML(subres)
    sublinks = subres.xpath("//a[@class='anchor article-content-title u-margin-xs-top u-margin-s-bottom']/@href")
    SUBLINKS.extend(sublinks)   # extend keeps the list flat, one entry per article
    print("Issue", TLINKS.index(link), "OK")
    time.sleep(0.2)
print('ALL IS OK')

LINKS = []
for i in SUBLINKS:
    link = 'https://www.sciencedirect.com' + i
    LINKS.append(link)

# Scrape the data from the third-level (article) pages
allinfo = []
for LINK in LINKS:
    info = {}
    res = re.get(LINK, headers=headers).text
    res = etree.HTML(res)
    vol = res.xpath("//a[@title='Go to table of contents for this volume/issue']/text()")
    datainfo = res.xpath("//div[@class='text-xs']/text()")
    timu = res.xpath("//span[@class='title-text']/text()")
    givenname = res.xpath("//span[@class='text given-name']/text()")
    surname = res.xpath("//span[@class='text surname']/text()")
    web = res.xpath("//a[@class='doi']/@href")
    abstract = res.xpath("//p[@id='abspara0010']/text()")
    keywords = res.xpath("//div[@class='keyword']/span/text()")
    highlights = res.xpath("//dd[@class='list-description']/p/text()")

    # Organize the fields inside the dictionary
    info['vol'] = vol
    info['datainfo'] = datainfo
    info['timu'] = timu
    info['givenname'] = givenname
    info['surname'] = surname
    info['web'] = web
    info['abstract'] = abstract
    info['keywords'] = keywords
    info['highlights'] = highlights
    allinfo.append(info)
    print("Article", LINKS.index(LINK), "IS FINISHED, overall progress:", (LINKS.index(LINK) + 1) / len(LINKS))

# Save the data to an Excel file
df = pd.DataFrame(allinfo)
df
df.to_excel(r'G:\PythonStudy\practice1\test.xls',sheet_name='sheet1')

# Initial data cleaning (regex=False treats the brackets as literal characters; the scraped cells are lists, so cast to str first)
data = df.copy()
text_cols = ['abstract', 'datainfo', 'givenname', 'highlights', 'keywords', 'surname', 'timu', 'vol', 'web']
for col in text_cols:
    data[col] = (data[col].astype(str)
                          .str.replace('[', '', regex=False)
                          .str.replace(']', '', regex=False)
                          .str.replace('\'', '', regex=False))

data['date'] = data['datainfo'].str.split(',').str.get(1)
data['page'] = data['datainfo'].str.split(',').str.get(2)

# Build the automatic translation function translateBaidu
def translateBaidu(content):
    appid = '20200119000376***'
    secretKey = 'd7SAX0xhIHEEYQ7qp***'
    url = 'http://api.fanyi.baidu.com/api/trans/vip/translate'
    fromLang = 'en'
    toLang = 'zh'
    salt = str(random.randint(32555, 65333))
    sign = appid + content + salt + secretKey
    sign = hashlib.md5(sign.encode('utf-8')).hexdigest()

    try:
        params = {
            'appid': appid,
            'q': content,
            'from': fromLang,
            'to': toLang,
            'salt': salt,
            'sign': sign
        }
        res = re.get(url, params)
        jres = res.json()
        # After converting to JSON, inspect its structure and pull out the returned translation
        dst = str(jres['trans_result'][0]['dst'])
        return dst

    except Exception as e:
        print(e)

# Add the corresponding new columns to the DataFrame
data['trans-timu'] = 'NULL'
data['trans-keywords'] = 'NULL'
data['trans-abstract'] = 'NULL'
data['trans-highlights'] = 'NULL'

# Translate and fill in the new columns
for i in range(len(data)):
    data.loc[i, 'trans-timu'] = translateBaidu(data['timu'][i])
    data.loc[i, 'trans-keywords'] = translateBaidu(data['keywords'][i])
    data.loc[i, 'trans-abstract'] = translateBaidu(data['abstract'][i])
    data.loc[i, 'trans-highlights'] = translateBaidu(data['highlights'][i])
    # Per the API docs, no more than 10 requests per second
    time.sleep(0.5)
print('ALL FINISHED')

# Save the file
data.to_excel(r'G:\PythonStudy\practice1\test.xls',sheet_name='sheet1')

This article is reposted from the WeChat public account @OCD Planners.
