Python 替换行符导致文字丢失:
从网页爬取一段文字:
Consider the half-eigenvalue problem [公式] a.e.
[公式], where [公式], [公式], [公式] for [公式], and [公式] and [公式] are indefinite integrable weights in the Lebesgue
space [公式]. We characterize the spectra structure under periodic, antiperiodic, Dirichlet, and Neumann boundary conditions, respectively. Furthermore, all these half-eigenvalues are continuous in [公式], where [公式] denotes the weak topology in [公式] space. The Dirichlet and the Neumann half-eigenvalues are continuously Fréchet differentiable in [公式], where [公式] is the [公式] norm of [公式].
我使用 python 字符串 replace()方法将 \n 替换为空,出现了这样的情况:
space [公式]. We characterize the spectra structure under periodic, antiperiodic, Dirichlet, and Neumann boundary conditions, respectively. Furthermore, all these half-eigenvalues are continuous in [公式], where [公式] denotes the weak topology in [公式] space. The Dirichlet and the Neumann half-eigenvalues are continuously Fréchet differentiable in [公式], where [公式] is the [公式] norm of [公式].
对,没错,上面两行数据莫名消失了!!
我仅仅是这么做了而已:
content.replace('\n', '')
到底出了什么问题呢?
答案超乎我的意料——
我们加一个方括号,这样就能查看编码了,于是:
print([content])
它的打印结果是:
['Consider the half-eigenvalue problem [公式] a.e. \r\n[公式], where [公式], [公式], [公式] for [公式], and [公式] and [公式] are indefinite integrable weights in the Lebesgue\r\nspace [公式]. We characterize the spectra structure under periodic, antiperiodic, Dirichlet, and Neumann boundary conditions, respectively. Furthermore, all these half-eigenvalues are continuous in [公式], where [公式] denotes the weak topology in [公式] space. The Dirichlet and the Neumann half-eigenvalues are continuously Fréchet differentiable in [公式], where [公式] is the [公式] norm of [公式].']
居然含有\r(极有可能是 windows 操作系统特有的问题)我们如何还原呢?当然是把\r 也替换掉
print(content.replace('\r','').replace('\n',''))
这样就没有问题了,至于为什么不直接写为 replace(‘\r\n’,’’),你猜有没有可能有些字符串中没有\r 但有些有,所以分开写不会报错