Subtitle download scripts for the Stanford online Machine Learning course
scturtle
posted @ October 11, 2011 10:40 in python, 9121 views
Updated Oct. 2012: all the subtitles packed into one download: http://dl.vmall.com/c00blsic2a
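The Stanford online Machine Learning course finally started yesterday (October 10). The site is http://www.ml-class.org
Unlike the AI course, whose videos are hosted on YouTube, the ML course offers its videos for download, but the downloadable versions come without English subtitles. So here are the subtitle download scripts I wrote for my own use. Like my earlier Douban FM song info script, the approach is to take the cookie from Firefox, crawl the XML subtitles, and convert them to SRT, so you can watch offline or on an iPad! ^_^
The first script borrows Firefox's cookies to fetch the course page and download the XML subtitles. Remember to change the path of the Firefox cookie file in it. It requires the pysqlite package: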
# coding: utf-8
import cookielib,urllib2
from cStringIO import StringIO
from pysqlite2 import dbapi2 as sqlite
import re,os

# a useful function from others
def sqlite2cookie(filename,host):
    con = sqlite.connect(filename)
    con.text_factory = str
    cur = con.cursor()
    cur.execute("select host, path, isSecure, expiry, name, value from moz_cookies where host like ?"
            ,['%%%s%%' % host])
    ftstr = ["FALSE","TRUE"]
    s = StringIO()
    s.write("""\
# Netscape HTTP Cookie File
# http://www.netscape.com/newsref/std/cookie_spec.html
# This is a generated file! Do not edit.
""")
    for item in cur.fetchall():
        s.write("%s\t%s\t%s\t%s\t%s\t%s\t%s\n" % (
            item[0], ftstr[item[0].startswith('.')], item[1],
            ftstr[item[2]], item[3], item[4], item[5]))
    s.seek(0)
    #print "cookie:",s.read() ;s.seek(0)
    cookie_jar = cookielib.MozillaCookieJar()
    cookie_jar._really_load(s, '', True, True)
    return cookie_jar

# get cookie
cookiejar = sqlite2cookie(r'C:\Users\lenovo\AppData\Roaming\Mozilla\Firefox\Profiles\osfuexqh.default\cookies.sqlite','ml-class.org')
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookiejar))
urllib2.install_opener(opener)

# post cookie
print 'Getting page, if it takes too long time, kill me!'
content=urllib2.urlopen('http://www.ml-class.org/course/video/list?mode=view').read()
#print content
content=''.join(content.split('\n'))
print "Page got!"

# get file name
namefinder=re.compile(r"file: ([^,]*),")
found=namefinder.findall(content)
print 'Number of files:',len(found)
found=map(lambda s: s[1:-1],found)
#print found

# down xml
baseurl='http://s3.amazonaws.com/stanford_videos/cs229/subtitles/%s-subtitles.xml'
for i,fn in enumerate(found):
    fn=fn.replace(r'\'','\'')
    outfile=fn+'.xml'
    if os.path.exists(outfile):
        continue
    print 'Getting:',fn
    fo=file(outfile,'w')
    fo.write(urllib2.urlopen(baseurl % fn).read())
    fo.close()
    print 'Done:',i,outfile
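For anyone on Python 3, where cookielib, urllib2 and pysqlite2 no longer exist under those names, the cookie-loading step might look roughly like the sketch below. It is not a tested port of the script above. The function name firefox_cookies is made up for illustration, and the sketch assumes cookies.sqlite still has the moz_cookies columns that the original query reads. It skips the Netscape-format detour and builds the cookie objects directly.

```python
import sqlite3
from http.cookiejar import Cookie, CookieJar


def firefox_cookies(db_path, host):
    """Return a CookieJar with the moz_cookies rows whose host contains `host`.

    Hypothetical helper; assumes the same columns as the original query.
    """
    con = sqlite3.connect(db_path)
    try:
        rows = con.execute(
            "select host, path, isSecure, expiry, name, value "
            "from moz_cookies where host like ?",
            ("%" + host + "%",),
        ).fetchall()
    finally:
        con.close()

    jar = CookieJar()
    for domain, path, secure, expiry, name, value in rows:
        leading_dot = domain.startswith(".")
        jar.set_cookie(Cookie(
            version=0, name=name, value=value,
            port=None, port_specified=False,
            # Same rule as the Netscape-format column in the original script.
            domain=domain, domain_specified=leading_dot,
            domain_initial_dot=leading_dot,
            path=path, path_specified=True,
            secure=bool(secure), expires=expiry, discard=False,
            comment=None, comment_url=None, rest={},
        ))
    return jar
```

The resulting jar can be handed to urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar)), which is the Python 3 counterpart of the urllib2 opener used above. If Firefox is running, reading a copy of cookies.sqlite instead of the live file may be the safer choice.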
The second script batch-converts the XML files into SRT subtitle files. The string handling is rather crude, so bear with me:
# coding: utf-8
import os,sys,re,codecs
limit=[60,60,60,1000]

def xml2srt(fi,fo):
    # drop the XML header and footer lines, then split into <p> entries
    data=''.join((fi.read().split('\n')[9:-4])).strip().split('</p>')
    for i in range(0,len(data)-1):
        #print i,data[i]
        if data[i]:
            # the first quoted attribute is the start time
            st_st=data[i].index('"')
            st_ed=data[i].index('"',st_st+1)
            if i+1<len(data)-1:
                nx_st=data[i+1].index('"')
                nx_ed=data[i+1].index('"',nx_st+1)
            fo.write(str(i+1)+' \n')
            # the end time is the start time of the next entry
            stamps=[data[i][st_st+1:st_ed],
                    data[i+1][nx_st+1:nx_ed] if i+1<len(data)-1 else "99:59:59.999"]
            word=data[i][data[i].index('>')+1:].replace('\n',' ')+' \n\n\n'
            # j, not i, so the outer loop index is not shadowed
            for j,stamp in enumerate(stamps):
                stamp=stamp.split('.')
                stamps[j]=map(int,stamp[0].split(':'))
                stamps[j].append(int(stamp[1]))
            stamps=map(lambda s:"%02d:%02d:%02d,%03d" % tuple(s),stamps)
            fo.write("%s --> %s \n" % tuple(stamps))
            fo.write(word)
    #print 'OK!'

if __name__=='__main__':
    for fn in os.listdir('.'):
        if fn[-4:]!='.xml':
            continue
        # skip files that are already converted, before opening anything
        if os.path.exists(fn[:-4]+'.srt'):
            continue
        print 'Converting:',fn[:-4]
        fi=file(fn,'r')
        fo=file(fn[:-4]+'.srt','w')
        xml2srt(fi,fo)
        fo.close()
        fi.close()
    print 'Done'
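The core idea of the conversion is easier to see without the string slicing: each cue ends where the next one starts, and the last cue gets the same 99:59:59.999 sentinel as in the script. Here is a minimal Python 3 sketch of just that step. The names to_srt_time and cues_to_srt are made up for illustration, the input is assumed to be (start time, text) pairs that have already been pulled out of the XML, and unlike the script it treats the part after the dot as a decimal fraction and writes plain SRT blocks without trailing spaces.

```python
LAST_END = "99:59:59.999"  # same sentinel the script uses for the final cue


def to_srt_time(stamp):
    """Convert 'H:MM:SS.mmm' into SRT's 'HH:MM:SS,mmm'."""
    clock, _, frac = stamp.partition(".")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    # Assumption: the fraction is decimal, so pad or cut it to milliseconds.
    millis = int((frac + "000")[:3])
    return "%02d:%02d:%02d,%03d" % (hours, minutes, seconds, millis)


def cues_to_srt(cues):
    """Render (start, text) pairs as SRT; each cue ends at the next start."""
    blocks = []
    for n, (start, text) in enumerate(cues):
        end = cues[n + 1][0] if n + 1 < len(cues) else LAST_END
        blocks.append("%d\n%s --> %s\n%s\n" % (
            n + 1, to_srt_time(start), to_srt_time(end), text.strip()))
    return "\n".join(blocks)


if __name__ == "__main__":
    print(cues_to_srt([("0:00:01.000", "Hello"), ("0:00:03.500", "World")]))
```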
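Thanks to wr (@isnowfy) for the support. Attached is wr's simplified version for downloading a single subtitle file, and there is an exe executable too!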
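October 22, 2011 23:18
There is subtitle data too? I always thought there wasn't any...
Could you explain in more detail?
October 23, 2011 08:39
The second button from the right at the bottom of the online video player turns on subtitles. Sometimes the subtitles come out later than the video, and they load slowly. With Firebug you can see that the subtitles are stored on Amazon S3 and that their names follow a regular pattern.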
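To make the naming pattern concrete, here is a small Python 3 sketch that turns the file entries found in the video list page into subtitle URLs, using the same regular expression and base URL as the download script. The function name subtitle_urls and the sample page fragment are made up for illustration; the real page markup may differ.

```python
import re

BASE_URL = "http://s3.amazonaws.com/stanford_videos/cs229/subtitles/%s-subtitles.xml"
NAME_RE = re.compile(r"file: ([^,]*),")  # same pattern as the download script


def subtitle_urls(page):
    """Return the subtitle URL for every `file: '...'` entry in the page text."""
    urls = []
    for raw in NAME_RE.findall(page):
        # Strip the surrounding quotes and unescape \' as the script does.
        name = raw[1:-1].replace("\\'", "'")
        urls.append(BASE_URL % name)
    return urls


if __name__ == "__main__":
    # Hypothetical page fragment, only to show the shape of the input.
    sample = "{file: '01.1-Intro', title: 'a'}, {file: '01.2-Model', title: 'b'}"
    for url in subtitle_urls(sample):
        print(url)
```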