===============
 Parsing a URI
===============

There are two ways to parse a URI with |rfc3986|:

#. :func:`rfc3986.api.uri_reference`

   This is best when you're **not** replacing existing usage of
   :mod:`urllib.parse`. It also provides convenience methods for safely
   normalizing the URIs passed to it.

#. :func:`rfc3986.api.urlparse`

   This is best suited to completely replacing :func:`urllib.parse.urlparse`.
   It returns a class that should be indistinguishable from
   :class:`urllib.parse.ParseResult`.

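For reference, the standard-library interface that the second option is meant
to stand in for looks like this. It is a minimal sketch that uses only
:mod:`urllib.parse`; the variable name ``parsed`` is purely illustrative.

```python
from urllib.parse import urlparse

# Parse with the standard library; the result is a ParseResult named tuple.
parsed = urlparse('https://github.com/sigmavirus24/rfc3986')

print(parsed.scheme)    # https
print(parsed.netloc)    # github.com
print(parsed.path)      # /sigmavirus24/rfc3986
print(parsed.geturl())  # https://github.com/sigmavirus24/rfc3986
```
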
Let's look at some code samples.


Some Examples
=============

First we'll parse the URL that points to the repository for this project.

.. testsetup:: *

    import rfc3986
    url = rfc3986.urlparse('https://github.com/sigmavirus24/rfc3986')
    uri = rfc3986.uri_reference('https://github.com/sigmavirus24/rfc3986')

.. code-block:: python

    url = rfc3986.urlparse('https://github.com/sigmavirus24/rfc3986')


Then we'll replace parts of that URL with new values:

.. testcode:: ex0

    print(url.copy_with(
        userinfo='username:password',
        port='443',
    ).unsplit())

.. testoutput:: ex0

    https://username:password@github.com:443/sigmavirus24/rfc3986

This, however, does not change the current ``url`` instance of
:class:`~rfc3986.parseresult.ParseResult`. As the method name suggests,
we're copying that instance and then overriding certain attributes.
In fact, we can make as many copies as we like and the original ``url`` will
stay the same.

.. testcode:: ex1

    print(url.copy_with(
        scheme='ssh',
        userinfo='git',
    ).unsplit())

.. testoutput:: ex1

    ssh://git@github.com/sigmavirus24/rfc3986

.. testcode:: ex1

    print(url.scheme)

.. testoutput:: ex1

    https

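The same copy-then-override pattern can be seen in the standard library, whose
``ParseResult`` is a named tuple. The sketch below is only an analogy built on
:mod:`urllib.parse`, not on |rfc3986| itself; ``_replace`` is the named-tuple
method that returns a modified copy, and the variable names are illustrative.

```python
from urllib.parse import urlparse

original = urlparse('https://github.com/sigmavirus24/rfc3986')

# _replace() builds a new tuple; the original is left untouched.
ssh_copy = original._replace(scheme='ssh', netloc='git@github.com')

print(ssh_copy.geturl())  # ssh://git@github.com/sigmavirus24/rfc3986
print(original.scheme)    # https
```
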
We can do similar things with URI References as well.

.. code-block:: python

    uri = rfc3986.uri_reference('https://github.com/sigmavirus24/rfc3986')

.. testcode:: ex2

    print(uri.copy_with(
        authority='username:password@github.com:443',
        path='/sigmavirus24/github3.py',
    ).unsplit())

.. testoutput:: ex2

    https://username:password@github.com:443/sigmavirus24/github3.py

However, URI References may behave in unexpected ways in some cases because
they follow the RFC strictly.

Finally, to remove a component from a URI, pass ``None`` for that component.
For example:

.. testcode:: ex3

    print(uri.copy_with(path=None).unsplit())

.. testoutput:: ex3

    https://github.com

This works on both URI References and Parse Results.


And Now For Something Slightly Unusual
======================================

If you are familiar with GitHub, GitLab, or a similar service, you may have
interacted with the "SSH URL" for some projects. For this project,
the SSH URL is:

.. code::

    git@github.com:sigmavirus24/rfc3986


Let's see what happens when we parse this.

.. code-block:: pycon

    >>> rfc3986.uri_reference('git@github.com:sigmavirus24/rfc3986')
    URIReference(scheme=None, authority=None,
    path=u'git@github.com:sigmavirus24/rfc3986', query=None, fragment=None)

There's no scheme present, but it is apparent to our (human) eyes that
``git@github.com`` should not be part of the path. This is one of the areas
where :mod:`rfc3986` suffers slightly due to its strict conformance to
:rfc:`3986`. In the RFC, an authority must be preceded by ``//``. Let's see
what happens when we add that to our URI:

.. code-block:: pycon

    >>> rfc3986.uri_reference('//git@github.com:sigmavirus24/rfc3986')
    URIReference(scheme=None, authority=u'git@github.com:sigmavirus24',
    path=u'/rfc3986', query=None, fragment=None)

Somewhat better, but not much.

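To see where that split comes from, consider the generic regular expression
given in Appendix B of :rfc:`3986`. The sketch below applies that expression
with Python's :mod:`re` module. It illustrates the RFC's component split and is
not necessarily how |rfc3986| matches URIs internally; the names
``URI_PATTERN`` and ``match`` are illustrative.

```python
import re

# Component-splitting expression from RFC 3986, Appendix B.
URI_PATTERN = re.compile(
    r'^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?'
)

match = URI_PATTERN.match('//git@github.com:sigmavirus24/rfc3986')

# Groups 2, 4, 5, 7 and 9 are scheme, authority, path, query and fragment.
scheme, authority, path, query, fragment = match.group(2, 4, 5, 7, 9)

print(authority)  # git@github.com:sigmavirus24
print(path)       # /rfc3986
```

Everything between ``//`` and the next ``/`` is taken as the authority, which
is why ``:sigmavirus24`` ends up there rather than in the path.
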
.. note::

    The maintainers of :mod:`rfc3986` are working to discern better ways to
    parse these less common URIs in a reasonable and sensible way without
    losing conformance to the RFC.
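
In the meantime, one possible workaround is to rewrite the SCP-style string
into an ``ssh://`` URI before parsing it. The helper below is a hypothetical
sketch and not part of |rfc3986|: the name ``scp_to_ssh_uri`` and its regular
expression are assumptions, and it only handles the simple ``user@host:path``
form.

```python
import re

# Hypothetical helper, not part of rfc3986: matches "user@host:path".
SCP_LIKE = re.compile(r'^(?P<user>[^@/:]+)@(?P<host>[^:/]+):(?P<path>[^/].*)$')


def scp_to_ssh_uri(value):
    """Rewrite 'user@host:path' as 'ssh://user@host/path'.

    Strings that do not look SCP-style are returned unchanged.
    """
    match = SCP_LIKE.match(value)
    if match is None:
        return value
    return 'ssh://{user}@{host}/{path}'.format(**match.groupdict())


print(scp_to_ssh_uri('git@github.com:sigmavirus24/rfc3986'))
# ssh://git@github.com/sigmavirus24/rfc3986
```

The rewritten string has the same ``ssh://`` form that ``copy_with`` produced
earlier, so it should parse with a proper authority.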